CN106095737A

CN106095737A - Document similarity calculation method and whole-network retrieval and tracking method for similar documents

Info

Publication number: CN106095737A
Application number: CN201610398902.4A
Authority: CN
Inventors: 姚洲鹏
Original assignee: Hangzhou Fan Wen Science And Technology Ltd
Current assignee: Hangzhou Fan Wen Science And Technology Ltd
Priority date: 2016-06-07
Filing date: 2016-06-07
Publication date: 2016-11-09

Abstract

The present invention relates to a document similarity calculation method and a whole-network retrieval and tracking method for similar documents, and its object is to provide such methods. The technical solution is a document similarity calculation method, characterized in that: S01, document decomposition: the original document and the target document are each segmented into words to obtain their respective word sets; S02, preprocessing and feature weighting: TF-IDF is used to calculate a weight for each word and extract the core keywords, and Word2vec is used to mine the degree of association between different words in a document so that each document is analyzed semantically; S03, vector space model and cosine similarity algorithm: the cosine of the angle between two vectors in the vector space measures the similarity of the two documents; the cosine value lies between 0 and 1, and a larger value indicates that the two documents are more similar. The present invention is applicable to reprint tracking and dissemination statistics for news information.

Description

Document similarity calculation method and whole-network retrieval and tracking method for similar documents

Technical field

The present invention relates to a document similarity calculation method and a whole-network retrieval and tracking method for similar documents. It is applicable to reprint tracking and dissemination statistics for news information.

Background art

Traditional media, as the main producers of news information, contribute more than 80% of original news, but they are constrained by the limits of their own distribution platforms. Their original documents are reprinted in large numbers by portals and new media. In reprinting these documents, the new media multiply their traffic and influence and also earn good economic returns, while the authors of the original documents do not benefit from it. Yet when copyright problems are to be resolved by legal means, finding the reprinted documents is like looking for a needle in a haystack: it consumes a great deal of manpower, and collecting evidence is very difficult.

Meanwhile, media organizations also want to analyze how far their content spreads by examining all the other media that reprint it. At present they have no good way to count all of the propagation paths; the counting can only be done manually, and the workload is enormous.

At present, China is the country with the highest proportion of social media use in the world, with each person spending an average of 5.8 hours online every day. In the past, the public obtained information from television, newspapers, magazines and radio; today the public obtains information mostly through social software such as Weibo, WeChat, QQ and forums. As of the end of the first quarter of this year, Sina Weibo had 260 million monthly active users and WeChat had 549 million monthly active users. Weibo and WeChat have become the preferred tools for making use of fragmented time.

Seen from today, the mobile Internet era combines content, form and social interaction, and in particular strong-tie social interaction. The influence of the mass media is slowly declining while the influence of new media keeps deepening. This is the era of the mobile Internet.

Once every individual has the ability to publish, the traditional media structure begins to disintegrate, the channels through which consumers learn of news no longer depend heavily on the mass media, and the age of "we-media" (self-media) is born. This is therefore an era in which ordinary people can create miracles and in which consumers gain sovereignty, and so it is also the era with the most opportunities for everyone, especially for media professionals.

Now that we-media is developing rapidly, copyright protection for individual we-media authors is all the more important. Because individual we-media authors have little power, they have no effective way to protect the copyright of their own documents.

Summary of the invention

The technical problem to be solved by the present invention is: in view of the problems described above, to provide a document similarity calculation method and a whole-network retrieval and tracking method for similar documents, so that the similarity of two documents is judged more accurately and the publication of a document is tracked more completely across the whole network, laying a solid foundation for copyright protection.

The technical solution adopted by the present invention is: a document similarity calculation method, characterized in that:

S01, document decomposition: the original document and the target document are each segmented into words to obtain their respective word sets;

S02, preprocessing and feature weighting:

TF-IDF is used to calculate a weight for each word and to extract the core keywords;

Word2vec is used to mine the degree of association between different words in a document, and each document is analyzed semantically;

S03, vector space model and cosine similarity algorithm:

The original document and the target document are reduced to two N-dimensional vectors whose components are keyword weights;

The document cosine similarity algorithm is based on the vector model: the cosine of the angle between the two vectors in the vector space measures the similarity of the two documents. The cosine value lies between 0 and 1, and a larger value indicates that the two documents are more similar.
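
For reference, the angle cosine described in step S03 is the standard cosine similarity. Writing the two keyword-weight vectors as A = (a_1, ..., a_N) and B = (b_1, ..., b_N) (symbols introduced here only for illustration; the patent text names none), it can be expressed as:

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{N} a_i b_i}
         {\sqrt{\sum_{i=1}^{N} a_i^{2}} \; \sqrt{\sum_{i=1}^{N} b_i^{2}}}
```

Assuming the keyword weights are non-negative, as TF-IDF weights are, every product a_i b_i is non-negative, which is why the value stays between 0 and 1 instead of ranging from -1 to 1.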

Step S01 includes:

Data preparation: the interference information in the documents is cleaned by an ETL data cleaning system, and the documents are processed into a structured form and decomposed into minimum-unit structures;

Infrastructure: a full-text index is built on the ElasticSearch search engine, and the index is created using the fine-grained segmentation mode of a Chinese word segmentation library.

In step S02, TF-IDF is used together with the words in an inverse document dictionary to delete from the documents those words that contribute little to identifying the text content but occur very frequently.
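
The deletion of frequent but uninformative words can be sketched as follows. This is only one possible reading: the patent does not say how the inverse document dictionary is built, so the document-frequency cutoff `max_doc_ratio`, the function names and the toy token lists are all assumptions made for illustration.

```python
from collections import Counter


def build_low_value_terms(corpus_tokens, max_doc_ratio=0.8):
    """Collect terms that occur in more than max_doc_ratio of all documents.

    Assumption: a term found in almost every document says little about
    any single document, so it is treated as low-value.
    """
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {t for t, c in doc_freq.items() if c / n_docs > max_doc_ratio}


def filter_tokens(tokens, low_value_terms):
    """Drop low-value terms and tokens made only of symbols or punctuation."""
    return [
        t for t in tokens
        if t not in low_value_terms and any(ch.isalnum() for ch in t)
    ]


# Toy, already-segmented documents (hypothetical data).
toy_corpus = [
    ["the", "copyright", ",", "reprint"],
    ["the", "weather", ",", "report"],
    ["the", "sports", "news", "."],
]
low_value = build_low_value_terms(toy_corpus)
cleaned = filter_tokens(toy_corpus[0], low_value)
print(cleaned)
```

Here "the" is dropped because it appears in every toy document, and the comma is dropped because it contains no letter or digit.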

A whole-network retrieval and tracking method for similar documents, characterized in that:

a, set the retrieval scope;

b, set the search conditions: extract the N core keywords with the highest TF-IDF weights in the original document and, with a certain matching rate, search the full library using the ES full-text search engine;

c, sort in descending order: the retrieved documents are sorted in descending order by the weighted relevance between keywords and document;

d, compare the document with the highest weight value, one by one, against every document obtained by the retrieval, applying the document similarity calculation method to calculate the similarity of the two documents;

e, check whether the similarity comparison result is higher than N%: if it is higher than N%, the two documents are judged to be the same; otherwise they are judged to be different documents.

Step a includes setting the time range in which the retrieved documents were published, the carriers on which they were published, and the word count and type of the retrieved documents.

The beneficial effects of the present invention are as follows. The present invention uses TF-IDF together with word2vec so that document similarity processing yields more accurate results, which makes copyright tracking and the statistical analysis of dissemination more precise and closer to the actual situation. The present invention reduces the original document and the target document to two N-dimensional vectors whose components are keyword weights, and uses the cosine of the angle between the two vectors in the vector space to measure the similarity of the two documents, so that the similarity of two documents is judged more accurately. The present invention sets the retrieval scope conditionally and cleans interference information through the ETL data cleaning system, which improves retrieval efficiency.

Brief description of the drawings

Fig. 1 is the system architecture diagram of the document similarity calculation method in the embodiment.

Fig. 2 is the flow chart of preprocessing and feature weighting in the embodiment.

Fig. 3 is the diagram of the relationship between the vector space model and the cosine similarity algorithm in the embodiment.

Fig. 4 is the flow chart of the whole-network retrieval and tracking method for similar documents in the embodiment.

Detailed description of the invention

Fig. 1 is the system architecture diagram of the document similarity calculation method in this embodiment. The document similarity calculation method in this embodiment includes:

(1) Data preparation - ETL

Media data from the whole network is collected in real time, and interference information is cleaned out by the "ETL data cleaning system". While the data is being purified, the news articles are processed into a structured form and decomposed into minimum-unit structures to obtain the word sets. This is referred to as the data atomization process.

(2) Infrastructure - ElasticSearch full-text index + Chinese word segmentation

The ElasticSearch search engine serves as the basic component of the whole system, and all later algorithms are built on top of ES. ElasticSearch is a distributed, multi-user full-text search engine based on Lucene. The scalability of its distributed storage effectively solves the problem of storing the large volume of data that converges every day. ElasticSearch is also a near-real-time search platform: in practice, a manuscript becomes searchable only about 1 second after it is indexed, so it can be applied efficiently to the later propagation path analysis. Its distributed computing can also be combined with additional hardware to raise computation speed and retrieval performance.

When the full-text index is built, the index is created using the fine-grained segmentation mode of the Chinese word segmentation library, to ensure that the keywords of each document are decomposed completely.
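
As an illustration of what such an index definition might look like, the sketch below builds an index body as a plain Python dictionary. The patent does not name the word segmentation library; the sketch assumes the IK analysis plugin is installed and uses its fine-grained `ik_max_word` analyzer. The index name, the field names, the shard counts and the typeless mapping layout of recent ElasticSearch versions are likewise assumptions.

```python
import json

# Hypothetical index name; not taken from the patent.
INDEX_NAME = "media_documents"

# Index body for a create-index request (PUT /media_documents).
# Assumption: the IK plugin supplies the "ik_max_word" analyzer, which
# segments Chinese text at the finest granularity.
index_body = {
    "settings": {"number_of_shards": 3, "number_of_replicas": 1},
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "ik_max_word"},
            "content": {"type": "text", "analyzer": "ik_max_word"},
            "published": {"type": "date"},      # used for the time range
            "carrier": {"type": "keyword"},     # newspaper, website, ...
            "doc_type": {"type": "keyword"},    # news, forum, ...
            "word_count": {"type": "integer"},
        }
    },
}

print(json.dumps(index_body, indent=2))
```

The non-text fields are included because the retrieval scope described later filters on publication time, carrier, document type and word count.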

(3) Preprocessing and feature weighting - TF-IDF + word2vec

Fig. 2 is the flow chart of preprocessing and feature weighting in this embodiment. TF-IDF is a weighting technique used in information retrieval and data mining. It evaluates how important a word is to one document within a document set: the weight of a word increases in proportion to the number of times it occurs in the document, but decreases in inverse proportion to how frequently it occurs across the document set. Based on TF-IDF and the words in the inverse document dictionary, words, symbols, punctuation, garbled characters and the like that contribute little to identifying the text content but occur very frequently are deleted from the documents.

The keywords of each document are decomposed and the frequency of each word is counted; TF-IDF is then used to calculate a weight for each word and to extract the core keywords.
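
The weighting and keyword extraction can be sketched in a few lines of Python. The patent does not give the exact TF-IDF formula, so this sketch assumes relative term frequency and a smoothed inverse document frequency, log((1 + N) / (1 + df)) + 1; the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(doc_tokens, corpus_tokens):
    """Return {term: TF-IDF weight} for one already-segmented document."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # documents containing each term
    term_counts = Counter(doc_tokens)
    total = len(doc_tokens)
    weights = {}
    for term, count in term_counts.items():
        tf = count / total
        # Smoothed IDF (an assumption): always positive, never divides by 0.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        weights[term] = tf * idf
    return weights


def top_keywords(weights, n):
    """Return the n terms with the highest weight (ties broken by name)."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Toy corpus of already-segmented documents (hypothetical data).
corpus = [
    ["copyright", "news", "reprint", "tracking", "copyright"],
    ["news", "weather", "report"],
    ["news", "sports", "report"],
]
weights = tf_idf_weights(corpus[0], corpus)
core = top_keywords(weights, 2)
print(core)
```

In the toy corpus "news" appears in every document, so it ends up with the lowest weight in the first document even though it occurs there as often as "reprint" and "tracking".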

TF-IDF is a method for calculating the degree of association between words and documents. Here it is mainly used to narrow down, from massive data, the range of similar documents that need statistical analysis, in preparation for the subsequent reprint tracking analysis.

Because the cosine similarity algorithm cannot handle synonyms and near-synonyms, this embodiment introduces the Word2vec algorithm in the preprocessing stage to analyze each document semantically, so as to remove semantic interference from the later statistical analysis. Word2vec is an efficient algorithm that represents words as numerical vectors. Drawing on ideas from deep learning, it is trained so that the processing of document keywords reduces to vector operations in a vector space, and by mining the degree of association between different keywords in a document it improves semantic accuracy.
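
The patent does not spell out how the Word2vec output is fed into the later comparison. One possible reading, shown below purely as an assumption, is to fold the weights of near-synonymous keywords together before the cosine comparison, so that two synonyms land on the same vector component. The hand-written `toy_vectors` merely stand in for embeddings that a trained Word2vec model would supply, and the threshold of 0.9 is arbitrary.

```python
import math


def vector_cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_synonyms(weights, vectors, threshold=0.9):
    """Fold each keyword into an earlier keyword with a close embedding."""
    merged = {}
    for term, weight in weights.items():
        target = term
        if term in vectors:
            for kept in merged:
                close = kept in vectors and (
                    vector_cosine(vectors[term], vectors[kept]) >= threshold
                )
                if close:
                    target = kept  # treat term as a synonym of kept
                    break
        merged[target] = merged.get(target, 0.0) + weight
    return merged


# Hand-made stand-ins for trained word embeddings (hypothetical values).
toy_vectors = {
    "reprint": [0.90, 0.10, 0.00],
    "repost": [0.88, 0.12, 0.02],
    "weather": [0.00, 0.20, 0.90],
}
keyword_weights = {"reprint": 0.5, "repost": 0.3, "weather": 0.2}
merged_weights = merge_synonyms(keyword_weights, toy_vectors)
print(merged_weights)
```

With these toy values "repost" is merged into "reprint", while "weather" stays separate because its vector points in a different direction.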

(4) Vector space model and cosine similarity algorithm

Fig. 3 is the diagram of the relationship between the vector space model and the cosine similarity algorithm in this embodiment. The original document and the target document are reduced to two N-dimensional vectors whose components are keyword weights, and the cosine similarity is then calculated using the vector model. The document cosine similarity algorithm is based on vectors: the cosine of the angle between the two vectors in the vector space measures the similarity of the two documents, with the emphasis on the difference in direction between the two vectors. The cosine value lies between 0 and 1, and a larger value indicates that the two documents are more similar.
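
A minimal sketch of this calculation follows, assuming each document has already been reduced to a dictionary that maps keywords to non-negative weights; a keyword missing from one document counts as a zero component. The function name and the sample weights are illustrative only.

```python
import math


def cosine_similarity(weights_a, weights_b):
    """Cosine of the angle between two {keyword: weight} vectors."""
    # Only keywords present in both documents contribute to the dot product.
    dot = sum(w * weights_b.get(term, 0.0) for term, w in weights_a.items())
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


original = {"copyright": 0.6, "reprint": 0.5, "tracking": 0.4}
reprinted = {"copyright": 0.6, "reprint": 0.5, "tracking": 0.3, "portal": 0.1}
unrelated = {"weather": 0.9, "forecast": 0.4}

sim_reprinted = cosine_similarity(original, reprinted)
sim_unrelated = cosine_similarity(original, unrelated)
print(round(sim_reprinted, 3), round(sim_unrelated, 3))
```

Because only the direction of the vectors matters, a reprint that is shortened or lengthened but keeps the same keyword proportions still scores close to 1.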

As shown in Fig. 4, this embodiment provides a whole-network retrieval and tracking method for similar documents. The specific implementation steps of the method are as follows:

a, set the retrieval scope;

a01, set the time range: for example, documents published within 3 days (72 hours) of the current time;

a02, set the document scope: select the carriers to be searched, such as newspapers, websites, WeChat, etc.;

a03, set the document selection conditions: set the word count and type requirements for the retrieved documents, for example article word count >= 200; excluded article types: forum, special topic.

b, set the search conditions: extract the N core keywords with the highest TF-IDF weights in the original document and, with a certain matching rate, search the full library using the ES full-text search engine;

c, sort in descending order: the retrieved documents are sorted in descending order by the weighted relevance between keywords and document;

d, compare the document with the highest weight value, one by one, against every document obtained by the retrieval: the document similarity calculation method of this embodiment is applied to calculate the similarity between the document with the highest weight value and each other document;

e, check whether the similarity comparison result is higher than N%: if it is higher than N%, the two documents are judged to be the same; otherwise they are judged to be different documents.
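
Steps a to e can be sketched end to end on an in-memory list of documents. This is an illustration under stated assumptions, not the patented implementation: the list stands in for the ES library, each document is assumed to carry precomputed keyword weights, the relevance score is approximated by summing the matched keyword weights, and every parameter value, field name and sample document is hypothetical.

```python
import math
from datetime import datetime, timedelta


def cosine_similarity(a, b):
    """Cosine of the angle between two {keyword: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def track_reprints(original, library, now, n_keywords=3, match_rate=0.6,
                   threshold=0.8, max_age=timedelta(hours=72),
                   carriers=("newspaper", "website", "wechat"),
                   min_words=200, excluded_types=("forum", "special")):
    # Step a: restrict the scope by time, carrier, word count and type.
    scoped = [d for d in library
              if now - d["published"] <= max_age
              and d["carrier"] in carriers
              and d["word_count"] >= min_words
              and d["type"] not in excluded_types]
    # Step b: take the N heaviest keywords of the original and keep the
    # documents that contain at least match_rate of them.
    weights = original["weights"]
    keywords = sorted(weights, key=lambda t: (-weights[t], t))[:n_keywords]
    if not keywords:
        return []
    hits = []
    for doc in scoped:
        matched = [k for k in keywords if k in doc["weights"]]
        if len(matched) / len(keywords) >= match_rate:
            relevance = sum(doc["weights"][k] for k in matched)
            hits.append((relevance, doc))
    if not hits:
        return []
    # Step c: sort by relevance, highest first.
    hits.sort(key=lambda h: -h[0])
    # Step d: compare the top-ranked document with every other hit.
    reference = hits[0][1]
    results = []
    for _, doc in hits[1:]:
        sim = cosine_similarity(reference["weights"], doc["weights"])
        # Step e: above the threshold means "same document".
        results.append((doc["id"], sim, sim > threshold))
    return results


def make_doc(doc_id, now, hours_ago, carrier, words, doc_type, weights):
    return {"id": doc_id, "published": now - timedelta(hours=hours_ago),
            "carrier": carrier, "word_count": words, "type": doc_type,
            "weights": weights}


now = datetime(2016, 6, 7, 12, 0)
base = {"copyright": 0.6, "reprint": 0.5, "tracking": 0.4, "media": 0.1}
library = [
    make_doc("orig", now, 1, "newspaper", 800, "news", base),
    make_doc("copy", now, 10, "website", 780, "news",
             {"copyright": 0.6, "reprint": 0.5, "tracking": 0.3, "portal": 0.1}),
    make_doc("other", now, 5, "website", 500, "news",
             {"copyright": 0.1, "reprint": 0.1, "weather": 0.9}),
    make_doc("old", now, 100, "website", 800, "news", base),
    make_doc("post", now, 2, "website", 800, "forum", base),
]
results = track_reprints(library[0], library, now)
for doc_id, sim, same in results:
    print(doc_id, round(sim, 3), "same" if same else "different")
```

In this toy run the documents "old" and "post" are excluded by the scope in step a, "copy" is judged to be the same document as the original, and "other" is judged to be different. With a real ElasticSearch deployment, the matching rate in step b might correspond to the `minimum_should_match` parameter of a match query, though the patent does not say so.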

Claims

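1. A document similarity calculation method, characterized in that:

S01, document decomposition: the original document and the target document are each segmented into words to obtain their respective word sets;

S02, preprocessing and feature weighting:

TF-IDF is used to calculate a weight for each word and to extract the core keywords;

Word2vec is used to mine the degree of association between different words in a document, and each document is analyzed semantically;

S03, vector space model and cosine similarity algorithm:

The original document and the target document are reduced to two N-dimensional vectors whose components are keyword weights;

The document cosine similarity algorithm is based on the vector model: the cosine of the angle between the two vectors in the vector space measures the similarity of the two documents. The cosine value lies between 0 and 1, and a larger value indicates that the two documents are more similar.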

2. The document similarity calculation method according to claim 1, characterized in that step S01 includes:

Data preparation: the interference information in the documents is cleaned by an ETL data cleaning system, and the documents are processed into a structured form and decomposed into minimum-unit structures;

Infrastructure: a full-text index is built on the ElasticSearch search engine, and the index is created using the fine-grained segmentation mode of a Chinese word segmentation library.

3. The document similarity calculation method according to claim 1, characterized in that: in step S02, TF-IDF is used together with the words in an inverse document dictionary to delete from the documents those words that contribute little to identifying the text content but occur very frequently.

4. A whole-network retrieval and tracking method for similar documents, characterized in that:

a, set the retrieval scope;

b, set the search conditions: extract the N core keywords with the highest TF-IDF weights in the original document and, with a certain matching rate, search the full library using the ES full-text search engine;

c, sort in descending order: the retrieved documents are sorted in descending order by the weighted relevance between keywords and document;

d, compare the document with the highest weight value, one by one, against every document obtained by the retrieval, applying the document similarity calculation method according to any one of claims 1 to 3 to calculate the similarity of the two documents;

e, check whether the similarity comparison result is higher than N%: if it is higher than N%, the two documents are judged to be the same; otherwise they are judged to be different documents.

5. The whole-network retrieval and tracking method for similar documents according to claim 4, characterized in that: step S01 includes setting the time range in which the retrieved documents were published, the carriers on which they were published, and the word count and type of the retrieved documents.