CN108647322A - Method for identifying the similarity of massive Web text information based on a word net
- Publication number: CN108647322A
- Application number: CN201810445807.4A
- Authority: CN (China)
- Legal status: Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Abstract
The invention discloses a method for identifying the similarity of massive Web text information based on a word net, comprising the following steps: (1) constructing the word net; (2) identifying the similarity of the text information of a new Web page, which comprises: extracting the text information from the new Web page to form a new document, and extracting feature words f_1, f_2, …, f_m from the new document; solving the set of similar words of each feature word f; solving the similar-document set of each feature word f; determining the similar documents of the new document and calculating the similarity value of each document in the similar-document set; filtering the documents in the similar-document set to obtain the final similar-document set; (3) updating the word net with the new Web page by the method of step (1). The method can be used to discover information plagiarism, imitation and tampering, to mine implicit correlations between different fields, and to eliminate duplicate Web pages, thereby reducing the burden on search engines and optimizing storage and index structures.
Description
Technical Field
The invention relates to a method for identifying the similarity of Internet text information, and in particular to a method for identifying the similarity of massive Web text information based on a word net.
Background
The evolution of Internet technology, while transmitting information and knowledge, has provided a platform on which offline and online users can publish, exchange and communicate information. By drawing ordinary users into participation it has driven the rapid growth of online information and helped turn the Internet into one of the most important components of the information resource library.
To cope with the rapid growth of Internet information, many research projects have targeted how to organize this vast amount of information efficiently, enable end users to obtain the desired information quickly and accurately, and reduce the cost of organizing it. Web information on the Internet is formatted through HTML tags and displayed to the end user in text form, so many Web document processing systems treat Web documents as ordinary text and process them with general text data processing technology. Web document processing comprises many stages, in general: crawling Web pages, removing HTML tags, eliminating redundant blank lines, removing interference words, extracting word stems, mining the text data, and displaying the information. If related Web pages with link relations require special processing, the link relations between the pages must also be analyzed; the core of the whole process is text data mining. Text data mining has much in common with conventional data mining, including analyzing the latent structure of the data and clustering similar data. Applied to general text data, clustering methods try to identify the group to which each text document belongs and then form clusters according to the degree of similarity between documents, such that documents within a cluster have high similarity and documents in different clusters have low similarity.
Therefore, organizing Internet information in an orderly and standardized way, and improving the transparency and orderliness of public Internet information, is one of the first problems to solve if end users are not to be submerged by massive information while being served. Especially in the mobile Internet era, quickly extracting valuable information from the ocean of data and presenting it to end users, while protecting the property and privacy of the individuals or organizations involved in generating the information, and eliminating useless, duplicated and sensitive information from the Internet, is essential to improving the experience of Internet users.
A common means adopted by conventional methods for comparing similar texts is comparison based on the literal content of the text; representative methods include hashing based on text content, the document vector space model, and the edit distance method. For example: (1) the query keywords provided by the end user are compared with the subject terms of each document in a pre-built document index library, and if the similarity between the query keywords and a document's subject terms falls within a preset threshold, the document containing those subject terms is considered a result to return to the user. (2) A more efficient and simple method for comparing the similarity of large numbers of Web text documents is simhash, which Google uses in practice to remove large numbers of duplicate Web pages while crawling; experience shows that it has good similar-text recognition ability, suits Google's need to process huge numbers of Web pages quickly, and is stable in that small changes in a document do not cause drastic changes in its fingerprint. Considering that a text document is composed of a series of words, a k-gram method selects k consecutive words to form a subsequence, each subsequence is converted into a hash value to form a shingle, and a document is finally represented by the set of its shingles, which serves as the feature set uniquely identifying the document; the shingle method based on text content thus identifies similar documents by comparing the shingle hash values of different documents. (3) In contrast to hashing, another method preprocesses each document so that it contains only a number of feature words, where a feature word occurs frequently in certain text documents and rarely in others, giving it good discriminating power for different texts. The document vector space model extracts all feature words of a document, calculates the TF-IDF value of each, converts the document into a text feature vector composed of the TF-IDF values of its feature words, and computes the similarity of two documents by comparing the difference of their feature vectors. (4) The idea of the edit distance method is to transform one text string into another through editing operations such as insertion, deletion and replacement; the similarity of the two strings is measured by counting the total number of editing operations. Compared with hashing and the vector space model, this method compares the similarity of text strings more directly from their content and gives more accurate results, but it is unsuitable for long strings, since the memory and CPU time required by the computation grow multiplicatively with string length.
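To make the shingling scheme above concrete, the following sketch illustrates k-gram shingling and a Jaccard resemblance comparison. It is a minimal illustration in Python, not Google's production simhash; the window size k = 4, the MD5-based 64-bit fingerprint and the sample sentences are all arbitrary choices for demonstration.

```python
import hashlib

def shingles(text, k=4):
    """Split a document into k-grams of consecutive words and hash each
    k-gram to a 64-bit fingerprint (a "shingle")."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return {int(hashlib.md5(g.encode()).hexdigest()[:16], 16) for g in grams}

def resemblance(a, b):
    """Jaccard overlap of two shingle sets as a document similarity."""
    return len(a & b) / len(a | b) if a | b else 0.0

d1 = shingles("the quick brown fox jumps over the lazy dog")
d2 = shingles("the quick brown fox leaps over the lazy dog")
print(resemblance(d1, d2))  # high, but below 1.0: one word changed
```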
Methods that determine text similarity from the literal content of the text can return the basic results required by a query, or compare similarity at the level of literal content, but have the following defects: (1) the ambiguity of the end user's query is not considered, i.e. the intended target of the query may be unclear, so the content of the query lacks focus and the returned results may not be what the user expects; (2) two documents may differ greatly in literal form yet convey essentially the same information or meaning, describing the same issue from different angles with different words or even with synonyms; such documents cannot be recognized as similar by literal comparison.
Besides the above methods, similarity identification for Web pages also includes methods based on Web page link relations, anchor text and anchor windows. For example, to identify related Web pages from the existing link relations between them, two pages are considered related if they share the same inbound links (links pointing to them) or the same outbound links (links from them to other pages). For another example, among different objects with reference relations, two objects are considered similar if they are referenced by similar objects; this way of identifying object similarity has been applied both to link relations between Web pages and to citation relations between scientific papers. Haveliwala et al. point out that identifying related Web pages from link relations alone does not work well when links are few, and propose combining anchor text and anchor window methods to compensate; such methods are easily affected by the number of links between pages and by the type or quality of the pages.
Disclosure of Invention
The invention aims to solve the above problems by providing a method for identifying the similarity of massive Web text information based on a word net.
The invention achieves this purpose through the following technical scheme:
a method for identifying the similarity of massive Web text information based on a word net comprises the following steps:
(1) constructing a word network, comprising the following steps:
1.1, extracting text information from Web pages to form a document set D consisting of a number of documents d; extracting feature words from one document d in the document set D, and for any two feature words f_i, f_j among them calculating the pairwise normalized mutual information values norm_I_ij and norm_I_ji; from the calculated norm_I_ij and norm_I_ji values constructing the mutual-information word pairs <f_i, f_j> and <f_j, f_i> between the feature words f_i, f_j, with norm_I_ij as the weight of the word pair <f_i, f_j> and norm_I_ji as the weight of the word pair <f_j, f_i>, where norm_I_ij = norm_I_ji; and adding the mutual-information word pairs <f_i, f_j> and <f_j, f_i> to the word net;
1.2, performing the operation of step 1.1 on all documents d in the document set D until every document d in D has been processed; in this process, when a new document d' is introduced, extracting its feature words f'_i, f'_j, calculating for any two feature words f'_i, f'_j the two equal normalized mutual information values norm_I'_ij and norm_I'_ji, and building the mutual-information word pairs <f'_i, f'_j> and <f'_j, f'_i> between them; if the word pairs <f'_i, f'_j> and <f'_j, f'_i> already exist in the word net, updating the weight of the mutual-information relation in the word net with the value norm_I'_ij; if they do not exist in the word net, adding them to it; finally the whole word net is formed and stored in a database system;
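A minimal in-memory sketch of the construction loop of steps 1.1 and 1.2 follows. It assumes a helper norm_mi(f_i, f_j) that returns the normalized mutual information value defined by formulas (I) to (III) below; the patent stores the word net in a database system such as HBase, while a plain dictionary stands in here.

```python
from itertools import combinations

# word net: (f_i, f_j) -> norm_I_ij; both directed pairs carry the same weight
word_net = {}

def add_document(doc_features, norm_mi):
    """Steps 1.1 / 1.2: for every pair of feature words in one document,
    insert or update the mutual-information word pairs in the word net.
    doc_features is the set of feature words extracted from the document;
    norm_mi(fi, fj) is an assumed helper returning norm_I_ij."""
    for fi, fj in combinations(sorted(doc_features), 2):
        w = norm_mi(fi, fj)
        # norm_I_ij == norm_I_ji, so an existing pair is simply overwritten
        # with the newly computed weight (the "update" of step 1.2).
        word_net[(fi, fj)] = w
        word_net[(fj, fi)] = w
```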
(2) the identification of the similarity of the text information of the new Web page comprises the following steps:
2.1, extracting text information from the new Web page to form a new document, and extracting feature words f from the new document: segmenting the new document into words, calculating the weight metric TF-IDF value of each word, and selecting the feature words f_1, f_2, …, f_m according to the TF-IDF values;
2.2, solving the set of similar words of each feature word f: for each feature word f, searching the word net in the database system for the words having a direct mutual-information relation with f and recording the mutual information value of each such word, forming the similar-word set corresponding to each feature word, i.e. f_1 → {t_11:I_11, t_12:I_12, ...}, f_2 → {t_21:I_21, t_22:I_22, ...}, …, f_m → {t_m1:I_m1, t_m2:I_m2, ...}, where all the words in the similar-word set {t_m1, t_m2, ...} of one feature word f_m are different; different feature words f may share similar words, i.e. for any two feature words f_l and f_k (1 ≤ l, k ≤ m) the intersection of their similar-word sets T_l and T_k may be non-empty, T_l ∩ T_k ⊇ ∅, where ∅ denotes the empty set;
2.3, solving the similar-document set of each feature word f: for the similar-word set {t_1:I_1, t_2:I_2, ..., t_n:I_n} corresponding to a feature word f, solving the document set corresponding to each word in the similar-word set, forming the document sets corresponding to the similar-word set, and accumulating the mutual information values of the documents in those sets. That is, for every word t_i in the similar-word set {t_1:I_1, t_2:I_2, ..., t_n:I_n}, solving the set of all documents containing t_i, {I_i:(d_i1, d_i2, ...)}, where I_i is the corresponding mutual information value and d_i1, d_i2, ... are the different documents containing t_i; after this is done for every t, taking the union over all the document sets corresponding to the t, i.e. {I_1:(d_11, d_12, ...)} ∪ {I_2:(d_21, d_22, ...)} ∪ ... ∪ {I_n:(d_n1, d_n2, ...)}, to obtain a new set {d_1:Id_1, d_2:Id_2, ...}, where for each item d_i:Id_i all the d are different documents and Id_i accumulates, over the union, the mutual information value I corresponding to d_i and the tf-idf value of the corresponding t in d_i; then {d_1:Id_1, d_2:Id_2, ...} is the document set having a mutual-information relation with the feature word f, i.e. f → {d_1:Id_1, d_2:Id_2, ...}. Suppose f_1 → {d_11:I_11, d_12:I_12, ...}, f_2 → {d_21:I_21, d_22:I_22, ...}, …, f_m → {d_m1:I_m1, d_m2:I_m2, ...}, where d_i1, d_i2, ..., d_ij are different documents in the document library; the document sets of any two feature words may contain the same documents, i.e. for any two feature words f_l and f_k (1 ≤ l, k ≤ m) the intersection of their mutual-information document sets D_l and D_k may be non-empty, D_l ∩ D_k ⊇ ∅;
2.4, determining the similar documents of the new document: applying an intersection operation to the document sets obtained in step 2.3 that have a mutual-information relation with the feature words f, i.e. obtaining the similar-document set Ω = {d_11:I_11, d_12:I_12, ...} ∧ {d_21:I_21, d_22:I_22, ...} ∧ ... ∧ {d_m1:I_m1, d_m2:I_m2, ...}; the result of computing Ω is {d_1:I_1, d_2:I_2, ...}, where each d_i is a document present in all the sets and I_i, the similarity value of the document d_i, is the sum of the mutual information values of that document in all the sets when the intersection is taken; Ω is then the set of documents similar to the new document containing the feature words f_1, f_2, …, f_m;
2.5, filtering the documents in the similar-document set to obtain the final similar-document set: for each document d_i in the similar-document set Ω, comparing its similarity value I_i with a threshold δ; if I_i is smaller than δ the document is filtered out and discarded, otherwise it is kept; the filtered similar-document set Ω' is the final similar-document set;
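The identification pipeline of steps 2.2 to 2.5 can be sketched as follows. The word_net mapping from the previous sketch and an inverted index docs_containing (word to the set of documents containing it) are assumed inputs; the default threshold 0.55 is the value trained later in the experiments, and the linear scan of the word net is for clarity only, since a real deployment would query the database.

```python
from collections import defaultdict

def similar_documents(features, word_net, docs_containing, delta=0.55):
    """Steps 2.2-2.5: look up the similar words of each feature word,
    gather the documents containing them, intersect across feature words,
    and filter the result by the threshold delta (0.5-0.7 per the patent)."""
    per_feature = []
    for f in features:
        # Step 2.2: words with a direct mutual-information relation to f.
        similar = {t: w for (a, t), w in word_net.items() if a == f}
        # Step 2.3: union of the documents containing each similar word,
        # accumulating the mutual-information weights per document.
        doc_scores = defaultdict(float)
        for t, w in similar.items():
            for d in docs_containing.get(t, ()):
                doc_scores[d] += w
        per_feature.append(doc_scores)
    if not per_feature:
        return {}
    # Step 2.4: documents present in every per-feature set form the
    # similar-document set Omega; their scores are summed across sets.
    common = set(per_feature[0])
    for scores in per_feature[1:]:
        common &= set(scores)
    omega = {d: sum(scores[d] for scores in per_feature) for d in common}
    # Step 2.5: discard documents whose similarity value falls below delta.
    return {d: v for d, v in omega.items() if v >= delta}
```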
(3) updating the word net with the new Web page according to the method of step (1), in preparation for identifying the text information similarity of the next updated Web page.
Note: the initial weight of a feature word in a document is measured by the weight metric commonly used in traditional information retrieval, i.e. the TF-IDF measure. The correlation between feature words is quantified by mutual information: the occurrences of two different feature words in text are treated as two random events, and mutual information is the amount of information one event provides towards eliminating the uncertainty of the other; the magnitude of the mutual information between two feature words is defined as the measure of their degree of correlation or similarity.
Preferably, in step 1.1 and step 2.1, extracting the feature word f includes the following steps:
A. firstly, extracting text information;
B. filtering symbols and segmenting words;
C. producing the list of segmented words;
D. converting each word to lower case;
E. stemming words using the Porter stemming algorithm;
F. filtering out digits and stop words to obtain the feature words f.
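A sketch of steps A to F follows, assuming NLTK is installed for its Porter stemmer; the stop-word list here is abbreviated to a few entries for illustration.

```python
import re
from nltk.stem import PorterStemmer  # assumes NLTK is available

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}  # abbreviated
stemmer = PorterStemmer()

def extract_feature_words(text):
    """Steps A-F: extract text, filter symbols, segment, lowercase,
    stem with the Porter algorithm, drop digits and stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)     # B/C: symbol filter + word list
    tokens = [t.lower() for t in tokens]           # D: lowercase
    tokens = [stemmer.stem(t) for t in tokens]     # E: Porter stemming
    return [t for t in tokens                      # F: drop digits and stop words
            if not t.isdigit() and t not in STOP_WORDS]
```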
In step 1.1, calculating the pairwise normalized mutual information value norm_I_ij of any two feature words f_i, f_j comprises the following steps:
① constructing the weight metric TF-IDF vectors of the two feature words f_i and f_j over all documents d in the document set D: according to the TF-IDF values of the two feature words f_i and f_j in each document d of the specific document set D, constructing a TF-IDF vector of the same dimension for each of the two feature words; if the feature word occurs in document d_i, the value at the i-th position of its TF-IDF vector is the TF-IDF value of the word in the corresponding document d_i; if the feature word does not occur in document d_i, the value at the i-th position of the TF-IDF vector is 0;
② calculating the TF-IDF vector distance of the two feature words f_i and f_j: the cosine of the two TF-IDF vectors is taken as the TF-IDF vector distance of the two words, computed as in formula (I); the vector distance quantifies the similarity of the two TF-IDF vectors and reflects the degree of similarity of the information expressed by the two feature words f_i and f_j within the document set D:

dis(f_i, f_j) = cos(V_i, V_j) = (V_i · V_j) / (|V_i| × |V_j|)    (I)

where V_i denotes the TF-IDF vector of f_i within the document set D, and V_j denotes the TF-IDF vector of f_j within the document set D;
③ calculating the normalized mutual information value norm_I_ij of the two feature words f_i and f_j: the mutual information value of the two feature words, computed according to formula (II), is normalized using their TF-IDF vector distance, finally giving the normalized mutual information value norm_I_ij of f_i and f_j according to formula (III):

I(X; Y) = Σ_{x∈{0,1}} Σ_{y∈{0,1}} p(x, y) log( p(x, y) / (p(x) p(y)) )    (II)

where X and Y respectively denote the two random events of the words f_i and f_j occurring; "0" means the word f_i or f_j does not occur in a particular document of the document set, and "1" means the word f_i or f_j does occur in a particular document of the document set; p(x, y) denotes the joint probability of f_i and f_j occurring together in particular documents of the document set, and p(x) and p(y) denote the marginal probabilities of f_i and f_j occurring in particular documents of the document set;
in step 2.1, taking the word w as an example, calculating the weight metric TF-IDF value of each word comprises the following steps:
a. calculating the frequency TF of the word w in the document d according to the following formula, namely the ratio of the number of times the word w appears in the document d to the total number of words in the document d:
TF(w,d)=count(w,d)/size(d)
wherein, TF (w, d) represents the frequency of the word w appearing in the document d, count (w, d) represents the number of times the word w appears in the document d, and size (d) represents the total number of words contained in the document d;
b. calculating the inverse text frequency IDF of the word w over the whole document set D according to the following formula, namely taking the logarithm of the ratio of the total number of documents in the set to the number of documents containing the word w:

IDF(w, d; D) = log( sum(D) / count(w, d; D) )

where IDF(w, d; D) denotes the inverse text frequency of the word w in the document set D, sum(D) denotes the total number of documents in D, and count(w, d; D) denotes the number of documents in D containing the word w;
c. the TF-IDF value of word w in document d, i.e., the product of the TF value and the IDF value of word w, is calculated as follows:
TF-IDF(w, d) = TF × IDF.
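A direct transcription of steps a to c; the three-document corpus at the bottom is invented purely to exercise the functions.

```python
import math

def tf(word, doc):
    """TF(w, d) = count(w, d) / size(d)."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """IDF(w, d; D) = log(sum(D) / number of documents containing w)."""
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing) if containing else 0.0

def tf_idf(word, doc, docs):
    """TF-IDF(w, d) = TF x IDF."""
    return tf(word, doc) * idf(word, docs)

# Documents as lists of already-preprocessed feature words (invented sample).
docs = [["grain", "corn", "export"], ["grain", "wheat"], ["trade", "deal"]]
print(tf_idf("corn", docs[0], docs))  # only doc 0 contains "corn"
```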
in step 2.5, the threshold δ takes values in the range 0.5 to 0.7.
The database system in step 1.2 is a distributed database HBase.
The invention has the beneficial effects that:
starting from a statistical method of word correlation, the invention builds a word net model from the mutual-information relations between words, and then compares the similarity of two different text documents based on a number of mutual-information-related words located in the two documents, i.e. it establishes a "document - mutual-information word - document" relation model. This can supplement methods that compare text document similarity at the literal level of content; applying the "mutual-information word" relation model achieves text similarity comparison with real semantic meaning, offers more candidate results for ambiguous query requests, solves the problem that traditional methods cannot recognize synonymous descriptions of the same information, eliminates content plagiarism and imitation, expands the extraction range of effective information, and returns more relevant results to the querying end user.
The method can be used to discover information plagiarism, imitation and tampering, and also to discover implicit correlations between different fields; by studying duplicated Web text information, duplicate Web pages can be eliminated, the load on search engines reduced, storage and index structures optimized, and the retrieval efficiency of search engine systems and the quality of retrieval results improved.
Drawings
FIG. 1 is a graph of time taken to construct a word net as a function of text content size in an embodiment of the present invention;
FIG. 2 is a graph of accuracy, recall, and F1 metrics as a function of a similarity threshold δ in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram showing the comparison between the experiment effects of the Naive Bayes method in Mahout and the method of the present invention in the embodiment of the present invention;
FIG. 4 is a graph of the inter-cluster density, intra-cluster density, and F1 metrics as a function of the similarity threshold δ in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram showing the comparison between the k-means method in Mahout and the experimental effect of the method of the present invention in the embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings in which:
the invention discloses a method for identifying the similarity of massive Web text information based on a word net, which comprises the following steps:
(1) constructing a word network, comprising the following steps:
1.1, extracting text information from Web pages to form a document set D consisting of a number of documents d; extracting feature words from one document d in the document set D, and for any two feature words f_i, f_j among them calculating the pairwise normalized mutual information values norm_I_ij and norm_I_ji; from the calculated norm_I_ij and norm_I_ji values constructing the mutual-information word pairs <f_i, f_j> and <f_j, f_i> between the feature words f_i, f_j, with norm_I_ij as the weight of the word pair <f_i, f_j> and norm_I_ji as the weight of the word pair <f_j, f_i>, where norm_I_ij = norm_I_ji; and adding the mutual-information word pairs <f_i, f_j> and <f_j, f_i> to the word net;
calculating the pairwise normalized mutual information value norm_I_ij of any two feature words f_i, f_j comprises the following steps:
① constructing the weight metric TF-IDF vectors of the two feature words f_i and f_j over all documents d in the document set D: according to the TF-IDF values of the two feature words f_i and f_j in each document d of the specific document set D, constructing a TF-IDF vector of the same dimension for each of the two feature words; if the feature word occurs in document d_i, the value at the i-th position of its TF-IDF vector is the TF-IDF value of the word in the corresponding document d_i; if the feature word does not occur in document d_i, the value at the i-th position of the TF-IDF vector is 0;
② calculating the TF-IDF vector distance of the two feature words f_i and f_j: the cosine of the two TF-IDF vectors is taken as the TF-IDF vector distance of the two words, computed as in formula (I); the vector distance quantifies the similarity of the two TF-IDF vectors and reflects the degree of similarity of the information expressed by the two feature words f_i and f_j within the document set D:

dis(f_i, f_j) = cos(V_i, V_j) = (V_i · V_j) / (|V_i| × |V_j|)    (I)

where V_i denotes the TF-IDF vector of f_i within the document set D, and V_j denotes the TF-IDF vector of f_j within the document set D;
③ calculating the normalized mutual information value norm_I_ij of the two feature words f_i and f_j: the mutual information value of the two feature words, computed according to formula (II), is normalized using their TF-IDF vector distance, finally giving the normalized mutual information value norm_I_ij of f_i and f_j according to formula (III):

I(X; Y) = Σ_{x∈{0,1}} Σ_{y∈{0,1}} p(x, y) log( p(x, y) / (p(x) p(y)) )    (II)

where X and Y respectively denote the two random events of the words f_i and f_j occurring; "0" means the word f_i or f_j does not occur in a particular document of the document set, and "1" means the word f_i or f_j does occur in a particular document of the document set; p(x, y) denotes the joint probability of f_i and f_j occurring together in particular documents of the document set, and p(x) and p(y) denote the marginal probabilities of f_i and f_j occurring in particular documents of the document set;
1.2, performing the operation of step 1.1 on all documents d in the document set D until every document d in D has been processed; in this process, when a new document d' is introduced, extracting its feature words f'_i, f'_j, calculating for any two feature words f'_i, f'_j the two equal normalized mutual information values norm_I'_ij and norm_I'_ji, and building the mutual-information word pairs <f'_i, f'_j> and <f'_j, f'_i> between them; if the word pairs <f'_i, f'_j> and <f'_j, f'_i> already exist in the word net, updating the weight of the mutual-information relation in the word net with the value norm_I'_ij; if they do not exist in the word net, adding them to it; finally the whole word net is formed and stored in a database system, preferably the distributed database HBase;
extracting the feature words f comprises the following steps:
A. firstly, extracting text information;
B. filtering symbols and segmenting words;
C. producing the list of segmented words;
D. converting each word to lower case;
E. stemming words using the Porter stemming algorithm;
F. filtering out digits and stop words to obtain the feature words f;
(2) the identification of the similarity of the text information of the new Web page comprises the following steps:
2.1, extracting text information from the new Web page to form a new document, and extracting feature words f from the new document: segmenting the new document into words, calculating the weight metric TF-IDF value of each word, and selecting the feature words f_1, f_2, …, f_m according to the TF-IDF values;
Taking the word w as an example, calculating the weight metric TF-IDF value of each word comprises the following steps:
a. calculating the frequency TF of the word w in the document d according to the following formula, namely the ratio of the number of times the word w appears in the document d to the total number of words in the document d:
TF(w,d)=count(w,d)/size(d)
wherein, TF (w, d) represents the frequency of the word w appearing in the document d, count (w, d) represents the number of times the word w appears in the document d, and size (d) represents the total number of words contained in the document d;
b. calculating the inverse text frequency IDF of the word w over the whole document set D according to the following formula, namely taking the logarithm of the ratio of the total number of documents in the set to the number of documents containing the word w:

IDF(w, d; D) = log( sum(D) / count(w, d; D) )

where IDF(w, d; D) denotes the inverse text frequency of the word w in the document set D, sum(D) denotes the total number of documents in D, and count(w, d; D) denotes the number of documents in D containing the word w;
c. the TF-IDF value of word w in document d, i.e., the product of the TF value and the IDF value of word w, is calculated as follows:
TF-IDF(w,d)=TF×IDF;
2.2, solving the set of similar words of each feature word f: for each feature word f, searching the word net in the database system for the words having a direct mutual-information relation with f and recording the mutual information value of each such word, forming the similar-word set corresponding to each feature word, i.e. f_1 → {t_11:I_11, t_12:I_12, ...}, f_2 → {t_21:I_21, t_22:I_22, ...}, …, f_m → {t_m1:I_m1, t_m2:I_m2, ...}, where all the words in the similar-word set {t_m1, t_m2, ...} of one feature word f_m are different; different feature words f may share similar words, i.e. for any two feature words f_l and f_k (1 ≤ l, k ≤ m) the intersection of their similar-word sets T_l and T_k may be non-empty, T_l ∩ T_k ⊇ ∅, where ∅ denotes the empty set;
2.3, solving the similar-document set of each feature word f: for the similar-word set {t_1:I_1, t_2:I_2, ..., t_n:I_n} corresponding to a feature word f, solving the document set corresponding to each word in the similar-word set, forming the document sets corresponding to the similar-word set, and accumulating the mutual information values of the documents in those sets. That is, for every word t_i in the similar-word set {t_1:I_1, t_2:I_2, ..., t_n:I_n}, solving the set of all documents containing t_i, {I_i:(d_i1, d_i2, ...)}, where I_i is the corresponding mutual information value and d_i1, d_i2, ... are the different documents containing t_i; after this is done for every t, taking the union over all the document sets corresponding to the t, i.e. {I_1:(d_11, d_12, ...)} ∪ {I_2:(d_21, d_22, ...)} ∪ ... ∪ {I_n:(d_n1, d_n2, ...)}, to obtain a new set {d_1:Id_1, d_2:Id_2, ...}, where for each item d_i:Id_i all the d are different documents and Id_i accumulates, over the union, the mutual information value I corresponding to d_i and the tf-idf value of the corresponding t in d_i; then {d_1:Id_1, d_2:Id_2, ...} is the document set having a mutual-information relation with the feature word f, i.e. f → {d_1:Id_1, d_2:Id_2, ...}. Suppose f_1 → {d_11:I_11, d_12:I_12, ...}, f_2 → {d_21:I_21, d_22:I_22, ...}, …, f_m → {d_m1:I_m1, d_m2:I_m2, ...}, where d_i1, d_i2, ..., d_ij are different documents in the document library; the document sets of any two feature words may contain the same documents, i.e. for any two feature words f_l and f_k (1 ≤ l, k ≤ m) the intersection of their mutual-information document sets D_l and D_k may be non-empty, D_l ∩ D_k ⊇ ∅;
2.4, determining the similar documents of the new document: applying an intersection operation to the document sets obtained in step 2.3 that have a mutual-information relation with the feature words f, i.e. obtaining the similar-document set Ω = {d_11:I_11, d_12:I_12, ...} ∧ {d_21:I_21, d_22:I_22, ...} ∧ ... ∧ {d_m1:I_m1, d_m2:I_m2, ...}; the result of computing Ω is {d_1:I_1, d_2:I_2, ...}, where each d_i is a document present in all the sets and I_i, the similarity value of the document d_i, is the sum of the mutual information values of that document in all the sets when the intersection is taken; Ω is then the set of documents similar to the new document containing the feature words f_1, f_2, …, f_m;
2.5, filtering the documents in the similar-document set to obtain the final similar-document set: for each document d_i in the similar-document set Ω, comparing its similarity value I_i with a threshold δ; if I_i is smaller than δ the document is filtered out and discarded, otherwise it is kept; the filtered similar-document set Ω' is the final similar-document set, and the threshold δ takes values in the range 0.5 to 0.7;
(3) updating the word net with the new Web page according to the method of step (1), in preparation for identifying the text information similarity of the next updated Web page.
The effectiveness of the method was verified experimentally as follows:
experiments were carried out on the data sets 20-NewsGroups and Reuters-21578 respectively, compared with the Naive Bayes text classification method provided by Mahout on 20-NewsGroups and with the k-means text clustering method provided by Mahout on Reuters-21578. The experimental procedure is divided into two stages: the first stage constructs a word net with topic classifications from all existing documents, corresponding to the word net model generation stage; the second stage retrieves the similar documents of a given document according to the word net model generated in the first stage.
Experimental setup:
the experimental environment is a Hadoop cluster of 19 machine nodes with a total configured capacity of 6.42 TB; one node is the NameNode, one is the SecondaryNameNode, and the rest are DataNodes. The distributed database HBase cluster used in the experiments has 13 machine nodes, of which one is the HMaster node and the others are HRegionServer nodes. The cluster runs Hadoop version 2.2.0 and HBase version 0.98.6. The structure of the whole Hadoop cluster environment and the machine node performance are shown in Table 1; those of the HBase cluster are shown in Table 2.
TABLE 1 Hadoop Cluster architecture and machine node Performance
TABLE 2 HBase Cluster Structure and machine node Performance
Currently, the most widely used Web data sets in text clustering and classification research are 20-NewsGroups and Reuters-21578.
The 20-NewsGroups data set consists of 20 predefined categories; for example, the category directory soc.religion.christian contains 997 files. Different categories contain different degrees of similar information: the information in comp.sys.ibm.pc.hardware and comp.sys.mac.hardware is very close, while the information in misc.forsale and soc.religion.christian differs greatly. The 20-NewsGroups data set is commonly used as a corpus for text classification algorithms.
The Reuters-21578 data set was manually collected and organized from the Reuters newswire and is distributed over 22 data files: from reut2-000.sgm to reut2-020.sgm each data file contains 1000 documents, and reut2-021.sgm contains 578 documents, hence the name Reuters-21578. Each data file begins with the declaration <!DOCTYPE lewis SYSTEM "lewis.dtd"> defining the document type, and each document's content is delimited by an opening <REUTERS> tag and a closing </REUTERS> tag. The Reuters-21578 files are in SGML format, so before their text content can be used, a preprocessing step must remove the SGML markup and extract the text, after which subsequent analysis can proceed. The Reuters-21578 data set is commonly used as a corpus for text clustering algorithms.
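The extraction step can be sketched with a regular expression over the <REUTERS> records; this is a rough illustration, not a full SGML parser, and the latin-1 encoding is an assumption about the raw files.

```python
import re

def extract_reuters_bodies(sgml_text):
    """Pull the content of each <REUTERS>...</REUTERS> record and strip
    the remaining SGML tags."""
    records = re.findall(r"<REUTERS.*?>(.*?)</REUTERS>", sgml_text, re.S)
    return [re.sub(r"<[^>]+>", " ", r).strip() for r in records]

with open("reut2-000.sgm", encoding="latin-1") as fh:
    documents = extract_reuters_bodies(fh.read())
print(len(documents))  # 1000 records are expected in this file
```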
Experimental evaluation:
Usually, the performance of a computer system or program is evaluated from the resources it consumes, such as space or time. A text mining system, however, is compared not only on the time and space consumption of the overall system but also on its ability to mine relevant documents. Recall and Precision are two measures of a text processing system's ability to discover relevant documents; each has its strengths and weaknesses, and they complement each other. Recall is defined as the proportion of the relevant documents in the system that are retrieved, emphasizing the system's completeness, and is calculated as in formula (4-1); Precision is defined as the proportion of the retrieved documents that are relevant, emphasizing the system's exactness, and is calculated as in formula (4-2):

Recall = |relevant ∩ retrieved| / |relevant|    (4-1)

Precision = |relevant ∩ retrieved| / |retrieved|    (4-2)

Currently, a common measure takes both recall and precision into consideration, namely the F1 measure:

F1 = 2 × Precision × Recall / (Precision + Recall)
the method of the invention is adopted to calculate the similarity among all documents, the documents with the similarity within a certain threshold value range are classified together to form a new document set theta, and the proportion of the documents in the formed new document set theta, which are correctly classified into the same category pre-established with the original data set, is used as the standard of the performance correctness of the metric calculation method. For example, the recall rate of a classification directory of a data set is the ratio of the number of documents with the similarity within a certain threshold value delta range, which belong to the classification directory, to the total number of documents in the classification directory of an original document set; the accuracy is the ratio of the number of documents with the similarity within a certain threshold value delta range belonging to the classification catalogue to the total number of retrieved documents with the similarity within the threshold value delta range.
Experimental results and analysis of processing 20-Newsgroups:
most of the data in 20-NewsGroups is the text content of Web retrieval results; it contains many empty lines and spaces, and text attributes describing the content. The original data set therefore needs preprocessing, such as removing empty lines, punctuation marks, single letters, digits and useless words, followed by word segmentation, stem extraction, and calculating the weight of each word in its document.
The experimental operation process is divided into two stages, wherein the first stage is to construct a word network from all preprocessed documents, and the word network is equivalent to a word network model generation stage; the second stage is to search similar documents of a certain document according to the word network model generated in the first stage.
In the first stage of constructing the word network, the relationship between the time used by the system and the length of the text content is shown in fig. 1.
The word net formed from the data set 20-NewsGroups has 77,053,480 edges, i.e. more than 77 million word pairs with a mutual-information relation.
Before similar documents are identified with the word net model, the documents under each classification directory of the 20-NewsGroups data set are split into a training set and a test set in the ratio 3:2; the training set is used to train the document similarity threshold δ, and the test set finally verifies the accuracy of the model for similarities above the threshold δ.
In the training stage, δ is varied from 0.1 to 1, the recall and precision of the files of the 20 different classification directories in the training set are solved for each classification, and finally the averages of recall and precision over all classifications are taken as the overall recall and precision of the data set. The average recall and average precision, and the corresponding F1 measure, as functions of δ are shown in Fig. 2.
As can be seen from Fig. 2, the F1 measure combining precision and recall reaches its maximum for 0.5 ≤ δ ≤ 0.6; at δ = 0.55 the precision, recall and F1 curves intersect at one point and the F1 measure attains its maximum. Therefore, when classifying the data set 20-NewsGroups by identifying similar documents with the method of the invention, the document similarity threshold δ is taken as 0.55, i.e. when the similarity between documents is greater than 0.55 the text information they contain is considered similar.
The test data of the data set were then compared experimentally using the Naive Bayes text classification method provided by Mahout and the method of the invention with a similarity threshold δ of 0.55; the average precision, average recall and F1-Measure are shown in Fig. 3.
As can be seen from Fig. 3, when the method of the present invention classifies similar texts its precision is higher than that of the Naive Bayes text classification method in Mahout and its recall is slightly lower, but its F1 value, the comprehensive index of precision and recall, is higher, indicating that the method of the invention is better suited to similarity-based text classification.
In terms of time efficiency, the Naive Bayes text classification method in Mahout took 95,642 seconds, while the word correlation method took 128,397 seconds; all experimental operations were performed in the Hadoop distributed cluster environment. The word correlation method must access the distributed database every time it fetches related information from the word net, which consumes a large amount of time.
Experimental results and analysis of processing Reuters-21578:
all text content of the Reuters-21578 data set is stored in data files in SGML format, distributed over the 22 data files reut2-000.sgm to reut2-021.sgm in order of generation time; except for reut2-021.sgm, which contains 578 documents, each data file contains 1000 documents on average. The different types of information are not evenly distributed among the data files, so before any operation on the Reuters-21578 data set the document contents in all the data files must be extracted into separate files.
At present the partitioning regarded as best divides the Reuters-21578 data set into 10 topic types, but it has a problem: some documents contain much cross-topic information, so the topic classification they belong to is hard to determine; for example, the degree of information overlap between the two topic classifications corn and wheat and the grain topic classification is hard to define. Ana adopted a simpler and more intuitive method: documents containing more than one unrelated topic are discarded, and documents labeled with the three related topics corn, wheat and grain are classified into a single grain topic classification, so that the Reuters-21578 data set is finally divided into 8 topic types; the document distribution under each topic type is shown in Table 3. In the experiment, on the training and test sets obtained with this partitioning, the k-means clustering algorithm in Mahout is used on the training set to obtain the document clusters under each topic type and to calculate the inter-cluster density and intra-cluster density of each type, while the word correlation method proposed herein is used to train the threshold δ of text similarity within each cluster on the training data set. Finally, experimental comparisons were performed on the test data set using the k-means clustering algorithm in Mahout and the method of the invention.
TABLE 3 distribution of documents for Reuters-21578 dataset partitioned into 8 topic types
The experimental procedure for the data set Reuters-21578 is again divided into two stages: the first stage constructs a word net from all preprocessed documents, corresponding to the word net model generation stage; the second stage retrieves the similar documents of a given document according to the word net model generated in the first stage. The word net construction of the first stage is the same as for 20-NewsGroups; the constructed word net has 27,526,742 edges, i.e. more than 27 million word pairs with a mutual-information relation.
After text clustering of the training data set with the k-means method, the minimum similarity between documents within each cluster averages 0.527, the inter-cluster density is 0.5969, and the intra-cluster density is 0.7038. Experiments with the method of the invention on the training set give the inter-cluster density, intra-cluster density and F1 measure as functions of the similarity threshold δ shown in Fig. 4, where the values are averages over all cluster categories.
When the method processes the Reuters-21578 data set, the F1 measure attains its maximum within the range 0.5 ≤ δ ≤ 0.7. To ensure that the F1 measure of the algorithm attains a maximum while the inter-cluster density is smaller and the intra-cluster density larger, δ is taken as 0.7 in the experiment; i.e. when the similarity between documents is greater than 0.7 they are considered to belong to the same cluster, and if a document's similarity to several clusters exceeds δ it is assigned to the cluster with the greatest similarity.
On the test data, the k-means text clustering method provided by Mahout and the method of the invention with similarity threshold δ = 0.7 were compared experimentally; the average inter-cluster density, average intra-cluster density and F1-Measure of the two are shown in Fig. 5.
As can be seen from Fig. 5, in text clustering the inter-cluster density of the method is lower than that of the k-means algorithm and the intra-cluster density slightly higher, meaning the clusters generated by the word correlation method are more compact. Judged by the F1 measure, however, the k-means method retains an advantage over the word correlation method in text clustering applications. The experiment then further processed the documents under the corresponding topics of the data set partitioning shown in Table 3 with the Naive Bayes algorithm and the C4.5 algorithm; the distribution of their F1 values is shown in Table 4.
TABLE 4 F1 results (%) of the Naive Bayes and C4.5 algorithms on the data set partitioning shown in Table 3
It can be seen from Table 4 that the information in some documents of the Reuters-21578 data set is strongly skewed; for example, for documents under the topic trade the F1 measures obtained with different methods differ greatly.
In terms of time efficiency, the k-means text clustering method in Mahout took 2,342 seconds, while the word correlation method took 3,971 seconds; all experimental operations were performed in the Hadoop distributed cluster environment. The word correlation method must access the distributed database every time it fetches related information from the word net, which consumes a large amount of time.
Addressing the characteristics of information propagation on the Internet, the invention proposes a fuzzy word-correlation recognition algorithm based on the contextual relations of text documents to recognize similar information, overcoming the limitation of traditional methods that recognize similar information only from the literal content of text fragments.
The Internet contains a large amount of information of free form and irregular content, which greatly increases the difficulty of obtaining effective information. Conventional methods, however, either design very complex algorithms to improve the precision of the solution or neglect the precision of the result to improve efficiency, making it hard to balance simplicity, efficiency and precision. Based on the excellent open-source distributed processing platform Hadoop, the invention proposes a fuzzy similar-document recognition method based on word correlation; starting from a statistical language processing model, it builds a word net within a given information topic field and recognizes documents with similar information in a broad sense, i.e. the synonymous reformulations commonly found in documents, thereby expanding the recognition range of similar information.
In future research, broader data corpora can be used to study the word correlation model proposed herein in greater depth, the parameters of the model can be further optimized, and the mutual-information relations between words, as well as the attenuation of the strength of mutual-information relations established through intermediate words, can be investigated. In addition, because the model needs sufficient correlation training among the words representing each type of information topic while the word net is being constructed, the early word net construction consumes a large amount of time, which is also a direction for future research.
The above embodiments are only preferred embodiments of the present invention and are not intended to limit its technical solutions; any technical solution that can be realized on the basis of the above embodiments without creative effort shall be considered to fall within the protection scope of this patent.
Claims (6)
1. A method for identifying the similarity of a large amount of Web text information based on a word network, characterized in that the method comprises the following steps:
(1) constructing a word network (an illustrative sketch follows step 1.2), comprising the following steps:
1.1, extracting text information from Web pages to form a document set D consisting of a number of documents d; extracting the feature words from one document d in the document set D; calculating, for any two feature words f_i and f_j, the pairwise normalized mutual information values norm_I_ij and norm_I_ji; constructing from the calculated norm_I_ij and norm_I_ji values the mutual-information word pairs <f_i, f_j> and <f_j, f_i> of the feature words f_i and f_j, where norm_I_ij is the weight of the word pair <f_i, f_j>, norm_I_ji is the weight of the word pair <f_j, f_i>, and norm_I_ij = norm_I_ji; and adding the mutual-information word pairs <f_i, f_j> and <f_j, f_i> to the word net;
1.2, executing the operation of step 1.1 on all documents d in the document set D until every document d in D has been processed; in this process, when a new document d' is introduced, extracting its feature words f'_i and f'_j, calculating for any two feature words the two equal normalized mutual information values norm_I'_ij and norm_I'_ji, and establishing the mutual-information word pairs <f'_i, f'_j> and <f'_j, f'_i> between them; if the mutual-information word pairs <f'_i, f'_j> and <f'_j, f'_i> already exist in the word network, updating the weight of that relation in the word network with the norm_I'_ij value; if they do not exist in the word network, adding them to the word network; finally forming the complete word network and storing it in a database system;
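As an illustration of steps 1.1 and 1.2, the following minimal Python sketch builds the word net with an in-memory dictionary standing in for the database system of step 1.2; `extract_feature_words` and `norm_mi` are assumed helpers corresponding to claims 2 and 3, and all names are illustrative rather than part of the claimed method.

```python
from itertools import combinations

def build_word_net(documents, extract_feature_words, norm_mi):
    """Sketch of steps 1.1-1.2: build the word net over a document set D.

    documents             -- list of raw document texts (the document set D)
    extract_feature_words -- assumed helper per claim 2: text -> feature words
    norm_mi               -- assumed helper per claim 3: (f_i, f_j, D) -> norm_I_ij
    Returns a dict mapping directed word pairs <f_i, f_j> to weights norm_I_ij.
    """
    documents = list(documents)
    word_net = {}  # in-memory stand-in for the database system of step 1.2
    for doc in documents:
        features = set(extract_feature_words(doc))
        for f_i, f_j in combinations(sorted(features), 2):
            weight = norm_mi(f_i, f_j, documents)  # norm_I_ij == norm_I_ji
            # add both directed pairs; if a pair already exists, its weight
            # is simply updated with the newly computed value (step 1.2)
            word_net[(f_i, f_j)] = weight
            word_net[(f_j, f_i)] = weight
    return word_net
```

Storing both directed pairs with the same weight mirrors the claim's symmetric relation while keeping lookups by either word simple.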
(2) identifying the similarity of the text information of a new Web page (an illustrative sketch follows step 2.5), comprising the following steps:
2.1, extracting text information from the new Web page to form a new document, and extracting the feature words f from the new document: segmenting the new document into words, calculating the TF-IDF weight of each word, and selecting the feature words f_1, f_2, …, f_m according to their TF-IDF values;
2.2, finding the similar-word set of each feature word f: for each feature word f, looking up the words that have a direct mutual-information relation with it in the word network stored in the database system, while recording the mutual-information value of each such word, to form the similar-word set of each feature word, i.e. f_1 → {t_11: I_11, t_12: I_12, ...}, f_2 → {t_21: I_21, t_22: I_22, ...}, …, f_m → {t_m1: I_m1, t_m2: I_m2, ...}, where all words in the similar-word set {t_m1, t_m2, ...} of the same feature word f_m are different; different feature words f may share common similar words, i.e. for any two feature words f_l and f_k (1 ≤ l, k ≤ m) the intersection of the corresponding similar-word sets may satisfy {t_l1, t_l2, ...} ∩ {t_k1, t_k2, ...} ≠ ∅, where ∅ denotes the empty set;
2.3, finding the similar-document set of each feature word f: for the similar-word set {t_1: I_1, t_2: I_2, ..., t_n: I_n} corresponding to a feature word f, finding the document set corresponding to each word in the similar-word set, thereby forming the document sets corresponding to the similar-word set, and accumulating the mutual-information values of the documents in these sets; i.e., for every word t_i in the similar-word set {t_1: I_1, t_2: I_2, ..., t_n: I_n}, finding all documents that contain t_i, {I_i: (d_i1, d_i2, ...)}, where I_i is the corresponding mutual-information value and d_i1, d_i2, ... are the different documents containing t_i; after this process has been completed for all t, taking the union of the document sets corresponding to all t, i.e. {I_1: (d_11, d_12, ...)} ∪ {I_2: (d_21, d_22, ...)} ∪ ... ∪ {I_n: (d_n1, d_n2, ...)}, to obtain a new set {d_1: Id_1, d_2: Id_2, ...}; for an item d_i: Id_i of this set, all d are different documents, and Id_i combines, over the union, the mutual-information value I corresponding to d_i and the tf-idf value in d_i of the corresponding t; then {d_1: Id_1, d_2: Id_2, ...} is the set of documents having a mutual-information relation with the feature word f, i.e. f → {d_1: Id_1, d_2: Id_2, ...}; suppose f_1 → {d_11: I_11, d_12: I_12, ...}, f_2 → {d_21: I_21, d_22: I_22, ...}, …, f_m → {d_m1: I_m1, d_m2: I_m2, ...}, where d_i1, d_i2, ..., d_ij are different documents in the document library; the document sets of any two feature words may contain the same documents, i.e. for any two feature words f_l and f_k (1 ≤ l, k ≤ m) the intersection of the corresponding mutual-information document sets may be non-empty;
2.4, determining the documents similar to the new document: applying the intersection operation to the document sets obtained in step 2.3 that have a mutual-information relation with the feature words f, i.e. obtaining the similar-document set Ω = {d_11: I_11, d_12: I_12, ...} ∧ {d_21: I_21, d_22: I_22, ...} ∧ ... ∧ {d_m1: I_m1, d_m2: I_m2, ...}; suppose the result of the calculation of Ω is {d_1: I_1, d_2: I_2, ...}, where each d_i is a document that exists in all the sets and I_i is the similarity value of the document d_i, equal to the sum of the mutual-information values of the corresponding document in all the sets when the intersection is taken; then the documents similar to the document containing the feature words f_1, f_2, …, f_m are the documents of Ω;
2.5, filtering the documents in the similar-document set to obtain the final similarity document set: for each document d_i in the similarity document set Ω, comparing its similarity value I_i with a threshold δ; if I_i is smaller than δ, filtering the document out and discarding it, otherwise keeping it; the filtered similar-document set Ω' so obtained is the final similarity document set;
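A minimal Python sketch of steps 2.1–2.5 follows; it assumes the word net of step (1) is available as a dictionary, that `docs_containing` is a hypothetical lookup returning the documents containing a given word, and that the number of feature words m and the threshold handling are illustrative only.

```python
from collections import defaultdict

def find_similar_documents(new_doc_tfidf, word_net, docs_containing, delta=0.6, m=10):
    """Sketch of steps 2.1-2.5: documents similar to a new document.

    new_doc_tfidf   -- {word: tf-idf value} for the new document (step 2.1)
    word_net        -- {(f, t): norm_I} word pairs built in step (1)
    docs_containing -- assumed lookup: word -> set of document ids
    delta           -- filtering threshold of step 2.5 (claim 5: 0.5-0.7)
    """
    # 2.1: select the m words with the highest TF-IDF values as feature words
    feature_words = sorted(new_doc_tfidf, key=new_doc_tfidf.get, reverse=True)[:m]

    per_feature = []
    for f in feature_words:
        # 2.2: words with a direct mutual-information relation to f
        similar = {t: mi for (w, t), mi in word_net.items() if w == f}
        # 2.3: union of the documents containing each similar word,
        # accumulating the mutual-information values per document
        doc_scores = defaultdict(float)
        for t, mi in similar.items():
            for d in docs_containing(t):
                doc_scores[d] += mi
        per_feature.append(doc_scores)

    # 2.4: intersection over all feature words, summing similarity values
    if not per_feature:
        return {}
    common = set.intersection(*(set(s) for s in per_feature))
    omega = {d: sum(s[d] for s in per_feature) for d in common}

    # 2.5: filter out documents whose similarity value falls below delta
    return {d: v for d, v in omega.items() if v >= delta}
```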
(3) updating the word net with the new Web page according to the method of step (1), in preparation for identifying the text-information similarity of the next updated Web page.
2. The method for identifying the similarity of a large amount of Web text information based on a word network according to claim 1, characterized in that: in step 1.1 and step 2.1, extracting the feature words f comprises the following steps (an illustrative sketch follows step F):
A. first, extracting the text information;
B. filtering out symbols and segmenting the text into words;
C. obtaining the word list;
D. converting each word to lower case;
E. reducing each word to its stem using the Porter stemming algorithm;
F. filtering out numbers and stop words to obtain the feature words f.
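The pipeline A–F can be sketched in Python as follows; the NLTK Porter stemmer and the stop-word list are assumptions of this sketch, not prescribed by the claim.

```python
import re
from nltk.stem import PorterStemmer  # Porter stemming algorithm of step E

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}  # illustrative

def extract_feature_words(text):
    """Sketch of steps A-F: raw text -> feature words f."""
    words = re.findall(r"[A-Za-z0-9]+", text)   # B: filter symbols, segment words
    words = [w.lower() for w in words]          # C, D: word list, lower-cased
    stemmer = PorterStemmer()
    words = [stemmer.stem(w) for w in words]    # E: reduce each word to its stem
    return [w for w in words                    # F: drop numbers and stop words
            if not w.isdigit() and w not in STOP_WORDS]
```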
3. The method for identifying the similarity of a large amount of Web text information based on a word network according to claim 1, characterized in that: in step 1.1, calculating the pairwise normalized mutual information value norm_I_ij of any two feature words f_i and f_j comprises the following steps (an illustrative sketch follows step ③):
① constructing the TF-IDF weight vectors of the two feature words f_i and f_j over all documents d in the document set D: according to the TF-IDF values of the two feature words f_i and f_j in each document d of the specific document set D, forming TF-IDF vectors of the same dimension for the two feature words; if a feature word occurs in document d_i, the value at the i-th position of its TF-IDF vector is the TF-IDF value of the word in the corresponding document d_i; if the feature word does not occur in document d_i, the value at the i-th position of the TF-IDF vector is 0;
② calculating the TF-IDF vector distance of the two feature words f_i and f_j: calculating the cosine of the two TF-IDF vectors as the TF-IDF vector distance measuring the two words, as shown in formula (I); the vector distance quantifies the similarity of the two TF-IDF vectors and reflects the degree of similarity of the information expressed by the two feature words f_i and f_j within the document set D:

cos(V_fi, V_fj) = (V_fi · V_fj) / (|V_fi| × |V_fj|)    (I)

wherein V_fi denotes the TF-IDF vector of f_i within the document set D, and V_fj denotes the TF-IDF vector of f_j within the document set D;
③ calculating the normalized mutual information value norm_I_ij of the two feature words f_i and f_j: calculating the mutual information value of the two feature words according to formula (II), normalizing it with the TF-IDF vector distance of the two feature words f_i and f_j, and finally obtaining the normalized mutual information value norm_I_ij of f_i and f_j according to formula (III):

I(X; Y) = Σ_{x ∈ {0,1}} Σ_{y ∈ {0,1}} p(x, y) log( p(x, y) / (p(x) p(y)) )    (II)

wherein X and Y respectively denote the random events that the words f_i and f_j occur, "0" means that the word f_i or f_j does not occur in a particular document of the document set, "1" means that the word f_i or f_j occurs in a particular document of the document set, p(x, y) denotes the joint probability that f_i and f_j occur simultaneously in particular documents of the document set, and p(x) and p(y) denote the marginal probabilities that f_i and f_j occur in particular documents of the document set;
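A Python sketch of steps ①–③ follows. Since formula (III) appears only as an image in the original, the way it combines the mutual information of formula (II) with the cosine distance of formula (I) is not recoverable; multiplying the two, as done below, is an assumption of this sketch.

```python
import math

def norm_mutual_information(f_i, f_j, doc_tfidf):
    """Sketch of claim 3: normalized mutual information of f_i and f_j.

    doc_tfidf -- list of {word: tf-idf value} dicts, one per document in D.
    """
    n = len(doc_tfidf)
    # Step 1: TF-IDF vectors over all documents (0 where the word is absent)
    v_i = [d.get(f_i, 0.0) for d in doc_tfidf]
    v_j = [d.get(f_j, 0.0) for d in doc_tfidf]
    # Step 2: cosine of the two vectors -- formula (I)
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm = math.sqrt(sum(a * a for a in v_i)) * math.sqrt(sum(b * b for b in v_j))
    cos = dot / norm if norm else 0.0
    # Step 3: mutual information over the binary occurrence events X, Y -- formula (II)
    def p(x, y):
        return sum(1 for d in doc_tfidf
                   if (f_i in d) == x and (f_j in d) == y) / n
    mi = 0.0
    for x in (False, True):
        for y in (False, True):
            pxy = p(x, y)
            px = p(x, False) + p(x, True)
            py = p(False, y) + p(True, y)
            if pxy > 0:
                mi += pxy * math.log(pxy / (px * py))
    # ASSUMPTION: formula (III) is not reproduced in the text; here the mutual
    # information is normalized by multiplying with the cosine distance of (I)
    return cos * mi
```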
4. The method for identifying the similarity of a large amount of Web text information based on a word network according to claim 1, characterized in that: in step 2.1, taking the word w as an example, calculating the TF-IDF weight of each word comprises the following steps (an illustrative sketch follows step c):
a. calculating the frequency TF of the word w in the document d according to the following formula, namely the ratio of the number of times the word w appears in the document d to the total number of words in the document d:
TF(w,d)=count(w,d)/size(d)
wherein, TF (w, d) represents the frequency of the word w appearing in the document d, count (w, d) represents the number of times the word w appears in the document d, and size (d) represents the total number of words contained in the document d;
b. calculating the inverse document frequency IDF of the word w over the whole document set D according to the following formula, i.e. taking the logarithm of the ratio of the total number of documents in the document set to the number of documents containing the word w:

IDF(w; D) = log( sum(D) / count(w, d; D) )

wherein IDF(w; D) represents the inverse document frequency of the word w in the document set D, sum(D) represents the total number of documents in the document set D, and count(w, d; D) represents the number of documents in the document set D that contain the word w;
c. calculating the TF-IDF value of the word w in the document d, i.e. the product of the TF value and the IDF value of the word w, according to the following formula:

TF-IDF(w, d) = TF(w, d) × IDF(w; D).
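The three formulas of claim 4 translate directly into Python; the sketch below assumes each document is given as a list of its words.

```python
import math

def tf(word, doc_words):
    """TF(w, d) = count(w, d) / size(d)."""
    return doc_words.count(word) / len(doc_words)

def idf(word, all_docs):
    """IDF(w; D) = log(sum(D) / count(w, d; D))."""
    containing = sum(1 for d in all_docs if word in d)
    return math.log(len(all_docs) / containing) if containing else 0.0

def tf_idf(word, doc_words, all_docs):
    """TF-IDF(w, d) = TF(w, d) x IDF(w; D)."""
    return tf(word, doc_words) * idf(word, all_docs)
```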
5. The method for identifying the similarity of a large amount of Web text information based on a word network according to claim 1, characterized in that: in step 2.5, the threshold δ takes values in the range 0.5 to 0.7.
6. The method for identifying the similarity of a large amount of Web text information based on a word network according to claim 1, characterized in that: the database system in step 1.2 is the distributed database HBase.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810445807.4A CN108647322B (en) | 2018-05-11 | 2018-05-11 | Method for identifying similarity of mass Web text information based on word network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108647322A true CN108647322A (en) | 2018-10-12 |
CN108647322B CN108647322B (en) | 2021-12-17 |
Family
ID=63754348
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810445807.4A (Expired - Fee Related) | Method for identifying similarity of mass Web text information based on word network | 2018-05-11 | 2018-05-11
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108647322B (en) |
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130060785A1 (en) * | 2005-03-30 | 2013-03-07 | Primal Fusion Inc. | Knowledge representation systems and methods incorporating customization |
US7689531B1 (en) * | 2005-09-28 | 2010-03-30 | Trend Micro Incorporated | Automatic charset detection using support vector machines with charset grouping |
US20080275870A1 (en) * | 2005-12-12 | 2008-11-06 | Shanahan James G | Method and apparatus for constructing a compact similarity structure and for using the same in analyzing document relevance |
CN101582080A (en) * | 2009-06-22 | 2009-11-18 | 浙江大学 | Web image clustering method based on image and text relevant mining |
CN102033867A (en) * | 2010-12-14 | 2011-04-27 | 西北工业大学 | Semantic-similarity measuring method for XML (Extensible Markup Language) document classification |
CN102169495A (en) * | 2011-04-11 | 2011-08-31 | 趣拿开曼群岛有限公司 | Industry dictionary generating method and device |
CN102332012A (en) * | 2011-09-13 | 2012-01-25 | 南方报业传媒集团 | Chinese text sorting method based on correlation study between sorts |
CN104063502A (en) * | 2014-07-08 | 2014-09-24 | 中南大学 | WSDL semi-structured document similarity analyzing and classifying method based on semantic model |
CN104239436A (en) * | 2014-08-27 | 2014-12-24 | 南京邮电大学 | Network hot event detection method based on text classification and clustering analysis |
CN104615714A (en) * | 2015-02-05 | 2015-05-13 | 北京中搜网络技术股份有限公司 | Blog duplicate removal method based on text similarities and microblog channel features |
CN105183813A (en) * | 2015-08-26 | 2015-12-23 | 山东省计算中心(国家超级计算济南中心) | Mutual information based parallel feature selection method for document classification |
CN105701167A (en) * | 2015-12-31 | 2016-06-22 | 北京工业大学 | Topic relevance judgement method based on coal mine safety event |
CN106547739A (en) * | 2016-11-03 | 2017-03-29 | 同济大学 | A kind of text semantic similarity analysis method |
Non-Patent Citations (3)
Title |
---|
QINGLIN GUO: "The similarity computing of documents based on VSM", 2008 32nd Annual IEEE International Computer Software and Applications Conference *
GONGYE Xiaoyan et al.: "Topic term extraction algorithm based on an improved TF-IDF algorithm and co-occurring words", Journal of Nanjing University (Natural Science) *
CHENG Pengsen: "Recognition algorithm for duplicate and near-duplicate news web pages based on feature word groups", Journal of Chengdu University of Information Technology *
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220058690A1 (en) * | 2017-03-29 | 2022-02-24 | Ebay Inc. | Generating keywords by associative context with input words |
US11769173B2 (en) * | 2017-03-29 | 2023-09-26 | Ebay Inc. | Generating keywords by associative context with input words |
CN110175234A (en) * | 2019-04-08 | 2019-08-27 | 北京百度网讯科技有限公司 | Unknown word identification method, apparatus, computer equipment and storage medium |
CN110175234B (en) * | 2019-04-08 | 2022-02-25 | 北京百度网讯科技有限公司 | Unknown word recognition method and device, computer equipment and storage medium |
CN110134760A (en) * | 2019-05-17 | 2019-08-16 | 北京思维造物信息科技股份有限公司 | A kind of searching method, device, equipment and medium |
CN110276390B (en) * | 2019-06-14 | 2022-09-16 | 六盘水市食品药品检验检测所 | Comprehensive information processing system and method for third-party food detection mechanism |
CN110276390A (en) * | 2019-06-14 | 2019-09-24 | 六盘水市食品药品检验检测所 | A kind of third party's food inspection synthesis of mechanism information processing system and method |
CN110852090A (en) * | 2019-11-07 | 2020-02-28 | 中科天玑数据科技股份有限公司 | Public opinion crawling mechanism characteristic vocabulary extension system and method |
CN110852090B (en) * | 2019-11-07 | 2024-03-19 | 中科天玑数据科技股份有限公司 | Mechanism characteristic vocabulary expansion system and method for public opinion crawling |
CN111539028A (en) * | 2020-04-23 | 2020-08-14 | 周婷 | File storage method and device, storage medium and electronic equipment |
CN111539028B (en) * | 2020-04-23 | 2023-05-12 | 国网浙江省电力有限公司物资分公司 | File storage method and device, storage medium and electronic equipment |
CN111881256A (en) * | 2020-07-17 | 2020-11-03 | 中国人民解放军战略支援部队信息工程大学 | Text entity relation extraction method and device and computer readable storage medium equipment |
CN111881256B (en) * | 2020-07-17 | 2022-11-08 | 中国人民解放军战略支援部队信息工程大学 | Text entity relation extraction method and device and computer readable storage medium equipment |
CN114090421A (en) * | 2021-09-17 | 2022-02-25 | 秒针信息技术有限公司 | Test set generation method and device, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN108647322B (en) | 2021-12-17 |
Similar Documents
Publication | Title |
---|---|
CN108647322B (en) | Method for identifying similarity of mass Web text information based on word network | |
Rousseau et al. | Main core retention on graph-of-words for single-document keyword extraction | |
CN103514183B (en) | Information search method and system based on interactive document clustering | |
CN108681557B (en) | Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint | |
CN108763348B (en) | Classification improvement method for feature vectors of extended short text words | |
Xie et al. | Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb | |
CN110888991B (en) | Sectional type semantic annotation method under weak annotation environment | |
CN113962293B (en) | LightGBM classification and representation learning-based name disambiguation method and system | |
CN107180045A (en) | A kind of internet text contains the abstracting method of geographical entity relation | |
CN108132927A (en) | A kind of fusion graph structure and the associated keyword extracting method of node | |
CN114706972B (en) | Automatic generation method of unsupervised scientific and technological information abstract based on multi-sentence compression | |
CN108090178B (en) | Text data analysis method, text data analysis device, server and storage medium | |
CN111221968B (en) | Author disambiguation method and device based on subject tree clustering | |
CN112559684A (en) | Keyword extraction and information retrieval method | |
CN112836029A (en) | Graph-based document retrieval method, system and related components thereof | |
CN115563313A (en) | Knowledge graph-based document book semantic retrieval system | |
KR20180129001A (en) | Method and System for Entity summarization based on multilingual projected entity space | |
CN103761286B (en) | A kind of Service Source search method based on user interest | |
Chow et al. | A new document representation using term frequency and vectorized graph connectionists with application to document retrieval | |
Pang et al. | A text similarity measurement based on semantic fingerprint of characteristic phrases | |
Chen et al. | Research on clustering analysis of Internet public opinion | |
Triwijoyo et al. | Analysis of Document Clustering based on Cosine Similarity and K-Main Algorithms | |
CN113705217B (en) | Literature recommendation method and device for knowledge learning in electric power field | |
CN113157857B (en) | Hot topic detection method, device and equipment for news | |
CN114943285A (en) | Intelligent auditing system for internet news content data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20211217 |