CN106055614A

CN106055614A - Similarity analysis method of content similarities based on multiple semantic abstracts

Info

Publication number: CN106055614A
Application number: CN201610356867.XA
Authority: CN
Inventors: 李红全
Original assignee: Tianjin Mass Information Technology Ltd By Share Ltd
Current assignee: Tianjin Mass Information Technology Ltd By Share Ltd
Priority date: 2016-05-26
Filing date: 2016-05-26
Publication date: 2016-10-26

Abstract

The invention discloses a content similarity analysis method based on multiple semantic abstracts. The method comprises the following steps: 1) cutting input information into multiple fragments; 2) selecting multiple key fragments from the fragments of the input information; 3) obtaining multiple semantic abstracts from each key fragment and converting them into abstract vectors; 4) recalling candidate information according to the abstract vectors of the input information; 5) comparing the input information with the candidate information and determining whether the two are similar. With this technical scheme, the semantics of information content can be determined accurately, and by gathering information content according to its multiple semantic abstracts, identical or similar information content is gathered into a single cluster of search results, which facilitates storage and reprocessing of the information content.

Description

Content similarity analysis method based on multiple semantic summaries

Technical field

The present invention relates to a content similarity analysis method based on multiple semantic summaries, and belongs to the technical field of Internet information acquisition.

Background technology

Information content propagated on the Internet is usually modified, re-edited, or otherwise processed during propagation, so that the original content and the modified content differ somewhat while their main content remains close or similar. In the prior art, identification of such similar content still relies mostly on title similarity. For example, the headline search function commonly used in search engines usually gathers information content with identical titles into one cluster of search results. During propagation, however, the same content is often given several different titles by different media or platforms. Such title changes can cause identical or similar information content to be identified as different and therefore scattered across multiple clusters of search results. Identifying content by title alone, as in the prior art, on the one hand occupies excessive storage resources and on the other hand makes it hard for the search results for the same information content to be fully used.

The many methods for judging content similarity disclosed in the prior art do not solve the above technical problem either. In addition, they analyze all the words of the full text during similarity analysis, which consumes more resources. For example, Chinese patent document CN1470047A discloses a method of vector analysis for documents, used to extract important sentences from a given document and to determine the similarity of two documents. Specifically, the words occurring in each input document are detected, each input document is divided into document segments, and document segment vectors are generated, each vector containing as its element values the frequencies with which the words occur in the respective document segment. For the two input documents, the squared inner products of all pairwise combinations of the document segment vectors contained in each input document are calculated, and the similarity between the two input documents is determined from the sum of the squared inner products.

As another example, Chinese patent document CN1959671A discloses a document similarity measurement method based on document structure. For two documents to be compared, a document structure analysis method is used to obtain the sub-topic sequence of each document. For each sub-topic in the sub-topic sequence of one document, a similarity measure is used to calculate a similarity value against each sub-topic in the sub-topic sequence of the other document. A weighted bipartite graph is then built and its optimal matching is solved, and the total weight of the optimal matching is normalized to obtain the similarity value of the two documents. As a further example, Chinese patent document CN103389987A discloses a text similarity comparison method that extracts the feature vectors of each file to be analyzed together with their values, and compares the value of each feature vector with the value of the respective feature vector to be compared against, obtaining the similarity between the files to be analyzed. There are many similar similarity analysis methods, but none of them solves the above technical problem.

Summary of the invention

Therefore, the object of the present invention is to provide a content similarity analysis method based on multiple semantic summaries that overcomes both the defect of title-based identification in the prior art, which keeps the search results for the same information content from being fully used, and the defect of full-text identification, which consumes more resources.

To achieve the above object, the content similarity analysis method based on multiple semantic summaries of the present invention comprises the following steps:

1) cutting the input information into several fragments;

2) selecting several key fragments from the fragments of the input information;

3) obtaining several semantic summaries from each key fragment and converting them into summary vectors;

4) recalling candidate information according to the group of summary vectors of the input information;

5) comparing the input information with the candidate information, and judging whether the input information is similar to the candidate information.
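The text does not say how step 4 recalls candidates. One plausible sketch, assuming an inverted index from summary-vector elements to the IDs of stored information, is shown below. The index structure, the function names `build_index` and `recall_candidates`, and the `min_shared` parameter are all illustrative assumptions, not part of the described method.

```python
from collections import defaultdict


def build_index(stored):
    """Map each summary-vector element to the IDs of stored items containing it.

    `stored` maps an item ID to its list of summary vectors
    (one vector per key fragment). Hypothetical data layout.
    """
    index = defaultdict(set)
    for item_id, vectors in stored.items():
        for vector in vectors:
            for element in vector:
                index[element].add(item_id)
    return index


def recall_candidates(index, input_vectors, min_shared=2):
    """Return IDs of stored items sharing at least `min_shared` distinct
    elements with the input's summary vectors (assumed recall criterion)."""
    hits = defaultdict(int)
    elements = {e for vector in input_vectors for e in vector}
    for element in elements:
        for item_id in index.get(element, ()):
            hits[item_id] += 1
    return {item_id for item_id, count in hits.items() if count >= min_shared}


# Toy data: integers stand in for crc32 values of phrases and entity words.
stored = {
    "doc1": [[11, 12, 13], [21, 22]],
    "doc2": [[31, 32], [41, 42, 43]],
}
index = build_index(stored)
candidates = recall_candidates(index, [[11, 12, 99], [22, 77]])
print(candidates)
```

Only the recalled candidates then go through the fragment-by-fragment comparison of step 5, which keeps the expensive comparison off the bulk of the stored information.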

Step 5 comprises the following steps:

51) cutting the candidate information into several fragments; selecting several key fragments from the fragments of the candidate information; obtaining several semantic summaries from each key fragment of the candidate information and converting them into summary vectors;

52) comparing the summary vector of each key fragment of the input information with the summary vector of the corresponding key fragment of the candidate information to obtain the similarity of the compared key fragments; when the similarity is greater than a specified threshold, the compared key fragments are judged to be similar;

53) obtaining the similarity between the input information and the candidate information; when the similarity is greater than a specified threshold, the input information is judged to be similar to the candidate information.

In step 52, obtaining the similarity of the compared key fragments comprises the following steps: the two summary vectors, one from the key fragment of the input information and one from the corresponding key fragment of the candidate information, are converted into element sets A and B; the similarity of the compared key fragments is the ratio of the number of elements in the intersection of A and B to the number of elements in their union.
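The intersection-over-union ratio of step 52 is the Jaccard coefficient of the two element sets. A minimal sketch follows. The function name `fragment_similarity` and the choice to return 0.0 for two empty vectors are assumptions made for illustration.

```python
def fragment_similarity(vector_a, vector_b):
    """Ratio of shared elements to all distinct elements of two summary vectors."""
    a, b = set(vector_a), set(vector_b)
    union = a | b
    if not union:
        # Assumption: two empty vectors are treated as not similar.
        return 0.0
    return len(a & b) / len(union)


# Two fragments with 4 elements each, sharing the elements 3 and 4:
# the intersection has 2 elements and the union has 6.
score = fragment_similarity([1, 2, 3, 4], [3, 4, 5, 6])
print(round(score, 2))
```

Because the vectors are converted to sets, the order of elements and any repeated elements have no effect on the result.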

In step 53, obtaining the similarity between the input information and the candidate information comprises the following steps:

531) obtaining the total number of key fragments in the input information and the candidate information and the number of similar key fragments, and calculating the number of key fragments remaining after deduplication;

532) calculating the ratio of the number of similar key fragments to the number of key fragments remaining after deduplication, which gives the similarity between the input information and the candidate information.
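Steps 531 and 532 can be sketched as follows, under stated assumptions. The "number of key fragments remaining after deduplication" is read here as the combined key-fragment count of both items minus the number of similar fragments, so each similar pair is counted once. Each candidate fragment is matched at most once. The names `jaccard`, `count_similar`, and `info_similarity` are illustrative, not taken from the method.

```python
def jaccard(a, b):
    """Intersection-over-union of two summary vectors treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def count_similar(input_vectors, candidate_vectors, fragment_threshold):
    """Count input key fragments that have a similar, not yet matched,
    candidate key fragment (greedy one-to-one matching, an assumption)."""
    unmatched = list(candidate_vectors)
    similar = 0
    for vector in input_vectors:
        for i, candidate in enumerate(unmatched):
            if jaccard(vector, candidate) > fragment_threshold:
                similar += 1
                del unmatched[i]
                break
    return similar


def info_similarity(input_vectors, candidate_vectors, fragment_threshold=0.65):
    """Similar fragments divided by fragments remaining after deduplication.

    0.65 is the example fragment threshold given in the embodiment for
    10-element fragments; other sizes may need a different value.
    """
    similar = count_similar(input_vectors, candidate_vectors, fragment_threshold)
    remaining = len(input_vectors) + len(candidate_vectors) - similar
    return similar / remaining if remaining else 0.0


input_info = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
candidate_info = [[1, 2, 3], [4, 5, 6], [10, 11, 12]]
# 2 similar fragments, 3 + 3 - 2 = 4 fragments remain after deduplication.
print(info_similarity(input_info, candidate_info))
```

Under this reading the information-level similarity is again an intersection-over-union ratio, taken over key fragments instead of over vector elements.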

In step 1 or step 51, the input information or the candidate information is cut into complete Chinese sentences based on grammatical rules, each Chinese sentence being one fragment.

In step 2 or step 51, key fragments are selected with reference to the position at which a fragment appears in its paragraph or in the article, the length of the fragment's content, and the result of syntactic analysis; these factors are assigned different weights, and the weighted sum is calculated for each fragment.

Step 3 comprises the following steps: after word segmentation of a key fragment, the semantic summary composed of the high-weight phrases and entity words is converted into the summary vector of that content fragment, each phrase or entity word being represented by its crc32 value.

With the above technical scheme, the content similarity analysis method based on multiple semantic summaries of the present invention can judge the semantics of information content accurately. Information content is gathered by multiple semantic summaries rather than by title, so that identical or similar information content is gathered into a single cluster of search results, which facilitates storage and reprocessing of the information content.

Detailed description of the invention

The present invention is described in further detail below by way of a specific embodiment.

This embodiment provides a content similarity analysis method based on multiple semantic summaries, comprising the following steps:

1) cutting the input information into several fragments;

Information content, such as the text of a content page published on a website, generally conforms to Chinese grammatical rules. In this step, the input information or candidate information can therefore be cut into complete Chinese sentences based on grammatical rules, each Chinese sentence being one fragment. During cutting, the text should as far as possible be split into complete Chinese sentences, for example by splitting on punctuation marks such as question marks and full stops; both the full-width and half-width forms of the punctuation marks must be considered.
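Punctuation-based cutting can be sketched with a regular expression that accepts both full-width and half-width sentence-ending marks. The exact set of marks below is an assumption: the text names question marks and full stops, and the exclamation mark is added here for illustration. `split_into_fragments` is a hypothetical name.

```python
import re

# Sentence-ending marks in full-width and half-width form (assumed set).
# Each match is a run of non-terminator characters followed by any
# terminators that close it.
_SENTENCE = re.compile(r"[^。？！.?!]+[。？！.?!]*")


def split_into_fragments(text):
    """Cut text into sentence fragments, keeping the closing punctuation."""
    fragments = []
    for match in _SENTENCE.finditer(text):
        sentence = match.group().strip()
        if sentence:
            fragments.append(sentence)
    return fragments


# The second sentence ends with a half-width question mark.
fragments = split_into_fragments("今天下雨了。你带伞了吗?太好了！")
print(fragments)
```

A half-width full stop also occurs inside decimals and URLs, so a production splitter would need extra rules for those cases; this sketch ignores them.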

Key fragments are selected with reference to the position at which a fragment appears in its paragraph or in the article, the length of the fragment's content, and the result of syntactic analysis; these factors are assigned different weights, and the weighted sum is calculated for each fragment. The position weight is adjusted according to the rule "head or tail of the article > head or tail of a paragraph > middle of a paragraph". Sentence constituents are filled by words or phrases, and a phrase has a higher weight than a word. Among the various types of words, entity words such as place names, personal names, and nouns have the highest weight. The length of a text fragment affects its word count and therefore the weight calculation. The weight of each text fragment is calculated, and key fragments are selected according to the length of the content; the number of key fragments is usually 1/5 to 1/3 of the total number of fragments.
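The weighting scheme can be sketched as below. The text gives only the ordering of the weights (article edge > paragraph edge > paragraph middle; entity word > phrase > word), so every numeric weight, the length cap of 50 characters, the default ratio of 0.25, and the names `Fragment`, `fragment_weight`, and `select_key_fragments` are hypothetical choices that merely respect that ordering.

```python
from dataclasses import dataclass

# Hypothetical position weights that follow the stated ordering.
POSITION_WEIGHT = {
    "article_edge": 3.0,      # head or tail of the article
    "paragraph_edge": 2.0,    # head or tail of a paragraph
    "paragraph_middle": 1.0,  # middle of a paragraph
}


@dataclass
class Fragment:
    text: str
    position: str      # one of the POSITION_WEIGHT keys
    word_count: int    # plain words found by syntactic analysis
    phrase_count: int  # phrases (weighted above words)
    entity_count: int  # entity words (weighted highest)


def fragment_weight(fragment):
    """Sum of the position, syntax, and length factors (assumed scales)."""
    position = POSITION_WEIGHT[fragment.position]
    syntax = (1.0 * fragment.word_count
              + 2.0 * fragment.phrase_count
              + 3.0 * fragment.entity_count)
    length = min(len(fragment.text), 50) / 50
    return position + syntax + length


def select_key_fragments(fragments, ratio=0.25):
    """Keep the highest-weighted fragments, preserving document order.

    The default ratio lies inside the stated 1/5 to 1/3 range.
    """
    if not fragments:
        return []
    count = max(1, round(len(fragments) * ratio))
    ranked = sorted(fragments, key=fragment_weight, reverse=True)
    chosen = {id(f) for f in ranked[:count]}
    return [f for f in fragments if id(f) in chosen]


frags = [
    Fragment("a" * 10, "paragraph_middle", 3, 0, 0),
    Fragment("b" * 10, "article_edge", 3, 1, 2),
    Fragment("c" * 10, "paragraph_edge", 3, 0, 0),
    Fragment("d" * 10, "paragraph_middle", 2, 0, 0),
]
print([f.text[0] for f in select_key_fragments(frags, ratio=0.5)])
```

In practice the weights would be tuned on real data; the sketch only shows how the separate factors combine into one score per fragment.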

After word segmentation of a key fragment, the semantic summary composed of the high-weight phrases and entity words is converted into the summary vector of that content fragment, each phrase or entity word being represented by its crc32 value. A content fragment is thus represented by one vector (a1, a2, a3...), and a single piece of information can be represented by the group of vectors of its key content fragments, for example:

Key fragment a: (a1, a2, a3...);

Key fragment b: (b1, b2, b3...);

Key fragment c: (c1, c2, c3...).
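The conversion of phrases and entity words to crc32 values can be sketched with Python's standard `zlib.crc32`. The UTF-8 encoding, the function name `summary_vector`, and the sample terms are assumptions; word segmentation and weighting are taken as already done, so the function receives the selected terms directly.

```python
import zlib


def summary_vector(terms):
    """Represent each high-weight phrase or entity word by its CRC32 value.

    Assumption: terms are encoded as UTF-8 before hashing.
    """
    return [zlib.crc32(term.encode("utf-8")) for term in terms]


# One summary vector per key fragment; together they form the vector
# group that represents a single piece of information.
vector_group = {
    "a": summary_vector(["天津", "信息技术"]),
    "b": summary_vector(["搜索引擎", "标题"]),
}
for name, vector in vector_group.items():
    print(name, vector)
```

Each element is a 32-bit unsigned integer, so the vectors are compact to store and cheap to compare as sets.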

Step 5 comprises the following steps:

531) obtaining the total number of key fragments in the input information and the candidate information and the number of similar key fragments, and calculating the number of key fragments remaining after deduplication;

The similarity threshold for key fragments is adjusted mainly according to the number of elements in the union. For example, for two key fragments each containing 10 elements, the threshold is usually set to 0.65, so at least 8 elements must be identical: the intersection then has 8 elements and the union 12, a ratio of 0.67.
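The arithmetic behind this example can be written out. Assuming the 10 elements within each fragment are distinct, two fragments A and B that share k elements have a union of 20 - k elements, so 7 shared elements fall below the 0.65 threshold and 8 exceed it:

```latex
\[
\mathrm{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{k}{20 - k},
\qquad
\frac{7}{13} \approx 0.54 < 0.65,
\qquad
\frac{8}{12} \approx 0.67 > 0.65 .
\]
```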

The similarity threshold for information is adjusted mainly according to the number of key fragments. Information with fewer key fragments has a higher threshold; for example, when the number of key fragments is 6, the threshold is usually set to 0.7, so at least 5 key fragments must be similar. Information with more fragments has a lower threshold; for example, when the number of key fragments is 10, the threshold is usually set to 0.4, so at least 6 key fragments must be similar.
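The minimum counts quoted here can be checked with a short calculation. The sketch assumes that both pieces of information have the same number of key fragments and that the number remaining after deduplication is the combined count minus the number of similar fragments. Under that reading the stated minimums of 5 and 6 are reproduced. `min_similar_fragments` is a hypothetical helper.

```python
def min_similar_fragments(fragments_per_item, threshold):
    """Smallest number of similar key fragments whose ratio to the
    fragments remaining after deduplication exceeds the threshold.

    Assumes both items have `fragments_per_item` key fragments each.
    Returns None if no count exceeds the threshold.
    """
    if fragments_per_item <= 0:
        return None
    total = 2 * fragments_per_item
    for similar in range(fragments_per_item + 1):
        remaining = total - similar
        if similar / remaining > threshold:
            return similar
    return None


for count, threshold in [(6, 0.7), (10, 0.4)]:
    print(count, threshold, min_similar_fragments(count, threshold))
```

With 6 key fragments per item, 4 similar fragments give 4/8 = 0.5 and 5 give 5/7, about 0.71; with 10 per item, 5 give 5/15, about 0.33, and 6 give 6/14, about 0.43.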

The correspondence between the number of fragments and the information similarity threshold needs to be adjusted on the basis of a large number of comparison results.

Obviously, the above embodiment is merely an example given for clarity of description and does not limit the embodiments. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Obvious changes or variations derived therefrom remain within the protection scope of the invention.

Claims

1. A content similarity analysis method based on multiple semantic summaries, characterized by comprising the following steps:

1) cutting the input information into several fragments;

2. The content similarity analysis method based on multiple semantic summaries as claimed in claim 1, characterized in that step 5 comprises the following steps:

51) cutting the candidate information into several fragments; selecting several key fragments from the fragments of the candidate information; obtaining several semantic summaries from each key fragment of the candidate information and converting them into summary vectors;

52) comparing the summary vector of each key fragment of the input information with the summary vector of the corresponding key fragment of the candidate information to obtain the similarity of the compared key fragments; when the similarity is greater than a specified threshold, the compared key fragments are judged to be similar;

53) obtaining the similarity between the input information and the candidate information; when the similarity is greater than a specified threshold, the input information is judged to be similar to the candidate information.

3. The content similarity analysis method based on multiple semantic summaries as claimed in claim 2, characterized in that, in step 52, obtaining the similarity of the compared key fragments comprises the following steps: the two summary vectors, one from the key fragment of the input information and one from the corresponding key fragment of the candidate information, are converted into element sets A and B; the similarity of the compared key fragments is the ratio of the number of elements in the intersection of A and B to the number of elements in their union.

4. The content similarity analysis method based on multiple semantic summaries as claimed in claim 2 or claim 3, characterized in that, in step 53, obtaining the similarity between the input information and the candidate information comprises the following steps:

531) obtaining the total number of key fragments in the input information and the candidate information and the number of similar key fragments, and calculating the number of key fragments remaining after deduplication;

532) calculating the ratio of the number of similar key fragments to the number of key fragments remaining after deduplication, which gives the similarity between the input information and the candidate information.

5. The content similarity analysis method based on multiple semantic summaries as claimed in claim 2 or claim 3, characterized in that, in step 1 or step 51, the input information or the candidate information is cut into complete Chinese sentences based on grammatical rules, each Chinese sentence being one fragment.

6. The content similarity analysis method based on multiple semantic summaries as claimed in claim 2 or claim 3, characterized in that, in step 2 or step 51, key fragments are selected with reference to the position at which a fragment appears in its paragraph or in the article, the length of the fragment's content, and the result of syntactic analysis; these factors are assigned different weights, and the weighted sum is calculated for each fragment.

7. The content similarity analysis method based on multiple semantic summaries as claimed in claim 1, characterized in that step 3 comprises the following steps: after word segmentation of a key fragment, the semantic summary composed of the high-weight phrases and entity words is converted into the summary vector of that content fragment, each phrase or entity word being represented by its crc32 value.