CN106055614A - Similarity analysis method of content similarities based on multiple semantic abstracts - Google Patents

Similarity analysis method of content similarities based on multiple semantic abstracts

Info

Publication number
CN106055614A
Authority
CN
China
Prior art keywords
information
candidate
segment
input information
critical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201610356867.XA
Other languages
Chinese (zh)
Inventor
李红全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Mass Information Technology Ltd By Share Ltd
Original Assignee
Tianjin Mass Information Technology Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Mass Information Technology Ltd By Share Ltd
Priority to CN201610356867.XA
Publication of CN106055614A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a content similarity analysis method based on multiple semantic abstracts. The method comprises the following steps: 1) cutting input information into multiple fragments; 2) selecting multiple key fragments from the fragments of the input information; 3) obtaining multiple semantic abstracts from each key fragment and converting them into abstract vectors; 4) recalling candidate information according to the abstract vectors of the input information; 5) comparing the input information with the candidate information and determining whether the input information is similar to the candidate information. With this technical scheme, the semantics of information content can be determined accurately, and because information content is gathered by multiple semantic abstracts rather than by title, identical or similar information content is gathered into one cluster of search results, which facilitates storage and reprocessing of the information content.

Description

Content similarity analysis method based on multiple semantic summaries
Technical field
The present invention relates to a content similarity analysis method based on multiple semantic summaries, and belongs to the technical field of Internet information acquisition.
Background art
Information content propagated on the Internet is usually modified, re-edited, or otherwise processed for business purposes in the course of propagation, so that the original content and the modified content differ in some respects while their main content remains the same or similar. In the prior art, identification of such similar content still relies mostly on title similarity. For example, the news-title search function commonly used in search engines usually gathers information content with identical titles into one cluster of search results. In practice, however, the same content is often given several different titles as it is edited by different media or platforms. Such title changes may cause identical or similar information content to be identified as different and therefore scattered across multiple clusters of search results. Identification by title alone, on the one hand, consumes excessive storage resources and, on the other hand, makes it difficult to make full use of the search results relating to the same information content. The many methods for judging content similarity disclosed in the prior art do not solve this technical problem either; moreover, they analyze all the words of the full text during similarity analysis and therefore consume more resources.
For example, Chinese patent document CN1470047A discloses a vector analysis method for documents, used to extract important sentences from a given document and to determine the similarity of two documents. Specifically, the words occurring in each input document are detected, each input document is divided into document segments, and document segment vectors are generated, each vector having as its element values the frequencies with which the words occur in the respective document segment. For the two input documents, the squares of the inner products of all pairwise combinations of the document segment vectors contained in the input documents are calculated, and the similarity between the two input documents is determined from the sum of the squared inner products.
As another example, Chinese patent document CN1959671A discloses a document similarity measurement method based on document structure. For two documents to be compared, a document structure analysis method is used to obtain the subtopic sequence of each document; a similarity measure is used to calculate a similarity value between each subtopic in the subtopic sequence of one document and each subtopic in the subtopic sequence of the other; a weighted bipartite graph is then built and its optimal matching solved; and the total weight of the optimal matching is normalized to give the similarity value of the two documents.
As a further example, Chinese patent document CN103389987A discloses a text similarity comparison method in which the feature vectors of each file to be analyzed and their values are extracted, and the values of these feature vectors are compared with the values of the feature vectors to be compared, to obtain the similarity between the files to be analyzed. There are many other similarity analysis methods of this kind, but none of them solves the above technical problem.
Summary of the invention
Accordingly, the object of the present invention is to provide a content similarity analysis method based on multiple semantic summaries that overcomes both the defect of prior-art title-based identification, which tends to prevent the search results relating to the same information content from being fully used, and the defect of full-text identification, which consumes more resources.
To achieve the above object, the content similarity analysis method based on multiple semantic summaries of the present invention comprises the following steps:
1) cutting the input information into several fragments;
2) selecting several key fragments from the fragments of the input information;
3) obtaining several semantic summaries from each key fragment and converting them into a summary vector;
4) recalling candidate information according to the summary vector group of the input information;
5) comparing the input information with the candidate information, and judging whether the input information is similar to the candidate information.
Step 5 comprises the following steps:
51) cutting the candidate information into several fragments; selecting several key fragments from the fragments of the candidate information; and obtaining several semantic summaries from each key fragment of the candidate information and converting them into a summary vector;
52) comparing the summary vector of each key fragment of the input information with the summary vector of the corresponding key fragment of the candidate information to obtain the similarity of the key fragments being compared, the key fragments being judged similar when the similarity is greater than a specified threshold;
53) obtaining the similarity between the input information and the candidate information, the input information being judged similar to the candidate information when the similarity is greater than a specified threshold.
In step 52, obtaining the similarity of the key fragments being compared comprises the following step: the two vectors, one from the key fragment of the input information and one from the corresponding key fragment of the candidate information, are converted into element sets A and B; the similarity of the key fragments being compared is then the ratio of the number of elements in the intersection of A and B to the number of elements in their union.
In step 53, obtaining the similarity between the input information and the candidate information comprises the following steps:
531) obtaining the total number of key fragments in the input information and the candidate information and the number of similar key fragments, and calculating the number of key fragments remaining after deduplication;
532) calculating the ratio of the number of similar key fragments to the number of key fragments remaining after deduplication, to obtain the similarity between the input information and the candidate information.
In step 1 or step 51, the input information or the candidate information is cut into complete Chinese sentences based on grammatical rules, each Chinese sentence being one of said fragments.
In step 2 or step 51, the position at which a fragment occurs in its paragraph or in the article, the length of the fragment content, and the result of syntactic analysis are taken into account; different weights are assigned to these factors, and the weight sum of each fragment is calculated, whereby the key fragments are selected.
Step 3 comprises the following step: after word segmentation of a key fragment, the semantic summary composed of the high-weight phrases and entity words is converted into the summary vector of that fragment, each phrase or entity word being represented by its CRC32 value.
With the above technical solution, the content similarity analysis method based on multiple semantic summaries of the present invention can accurately judge the semantics of information content. Information content is gathered by multiple semantic summaries rather than by title, so that identical or similar information content is gathered into one cluster of search results, which facilitates the storage and reprocessing of the information content.
Detailed description of the embodiments
The present invention is described in further detail below by way of specific embodiments.
This embodiment provides a content similarity analysis method based on multiple semantic summaries, comprising the following steps:
1) cutting the input information into several fragments;
Information content, such as the text of content pages published by websites, generally conforms to Chinese grammatical rules. In this step, the input information or the candidate information can therefore be cut into complete Chinese sentences based on grammatical rules, each Chinese sentence being one of said fragments. During cutting, the text should as far as possible be divided into complete Chinese sentences, for example by cutting at punctuation marks such as question marks and full stops; both the full-width and half-width forms of punctuation marks need to be considered in the cutting process.
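The cutting in step 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the function name `split_into_fragments` and the exact set of sentence-ending punctuation marks are hypothetical, since the description only names question marks and full stops and requires that full-width and half-width forms both be handled.

```python
import re
from typing import List

# Sentence-ending punctuation in full-width (Chinese) and half-width (ASCII) forms.
# ASSUMPTION: the description only names question marks and full stops;
# the remaining marks are added here for illustration.
_SENTENCE_END = "。？！；.?!;"
_END_CLASS = re.escape(_SENTENCE_END)

# One fragment = a run of non-terminator characters plus any trailing terminators.
_FRAGMENT_RE = re.compile("[^" + _END_CLASS + "]+[" + _END_CLASS + "]*")


def split_into_fragments(text: str) -> List[str]:
    """Cut text into sentence-like fragments at sentence-ending punctuation."""
    fragments = []
    for paragraph in text.splitlines():
        for match in _FRAGMENT_RE.finditer(paragraph):
            sentence = match.group().strip()
            if sentence:
                fragments.append(sentence)
    return fragments
```

Splitting at the half-width full stop also cuts decimal numbers such as "3.14"; a production splitter would need an extra rule for that case.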
2) selecting several key fragments from the fragments of the input information;
The position at which a fragment occurs in its paragraph or in the article, the length of the fragment content, and the result of syntactic analysis are taken into account; different weights are assigned to these factors, and the weight sum of each fragment is calculated, whereby the key fragments are selected. The position weight is adjusted according to the rule "head or tail of the article > head or tail of a paragraph > middle of a paragraph". Sentence constituents are filled by words or phrases, and the weight of a phrase is higher than that of a word; among the various types of words, entity words such as place names, person names, and nouns have the highest weight. The length of a text fragment affects its word count and thereby the weight calculation. The weight of each fragment of the text content is calculated, and the key fragments are selected according to the length of the content; the number of key fragments is usually 1/5 to 1/3 of the total number of fragments.
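The weighting described above can be illustrated with the following Python sketch. All numeric weights, the `Fragment` type, and the function names are assumptions introduced for illustration; the description fixes only the ordering of the weights (article head or tail above paragraph head or tail above paragraph middle, entity words above phrases above other words) and the 1/5 to 1/3 selection ratio. Word segmentation and syntactic analysis are assumed to have been done upstream.

```python
import math
from typing import List, NamedTuple


class Fragment(NamedTuple):
    text: str
    position: str    # "article_edge", "paragraph_edge", or "middle"
    n_entities: int  # entity words (place names, person names, nouns, ...)
    n_phrases: int   # phrases
    n_words: int     # remaining single words


# ASSUMPTION: illustrative values that only respect the ordering in the text.
POSITION_WEIGHT = {"article_edge": 3.0, "paragraph_edge": 2.0, "middle": 1.0}
ENTITY_WEIGHT, PHRASE_WEIGHT, WORD_WEIGHT = 3.0, 2.0, 1.0


def fragment_weight(f: Fragment) -> float:
    """Weight sum of one fragment: position weight plus term weights.

    Longer fragments contain more terms, so length influences the sum.
    """
    content = (ENTITY_WEIGHT * f.n_entities
               + PHRASE_WEIGHT * f.n_phrases
               + WORD_WEIGHT * f.n_words)
    return POSITION_WEIGHT[f.position] + content


def select_key_fragments(fragments: List[Fragment], ratio: float = 0.25) -> List[Fragment]:
    """Keep the highest-weighted share of fragments, returned in document order."""
    if not fragments:
        return []
    k = max(1, math.ceil(len(fragments) * ratio))
    top = sorted(range(len(fragments)),
                 key=lambda i: fragment_weight(fragments[i]),
                 reverse=True)[:k]
    return [fragments[i] for i in sorted(top)]
```

The default `ratio` of 0.25 is simply a value inside the 1/5 to 1/3 range given in the description.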
3) obtaining several semantic summaries from each key fragment and converting them into a summary vector;
After word segmentation of a key fragment, the semantic summary composed of the high-weight phrases and entity words is converted into the summary vector of that fragment, each phrase or entity word being represented by its CRC32 value. A content fragment is thus represented by one vector (a1, a2, a3, ...), and a single piece of information can be represented by the vector group of its key fragments, for example:
Key fragment a: (a1, a2, a3, ...);
Key fragment b: (b1, b2, b3, ...);
Key fragment c: (c1, c2, c3, ...).
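A minimal Python sketch of the conversion in step 3 is given below. It assumes that word segmentation and term weighting have already produced a mapping from each phrase or entity word to its weight; the function name `summary_vector` and the `top_n` cut-off are hypothetical. The CRC32 values come from the standard-library function `zlib.crc32`, applied here to the UTF-8 encoding of each term (the encoding is also an assumption).

```python
import zlib
from typing import Dict, List


def summary_vector(weighted_terms: Dict[str, float], top_n: int = 10) -> List[int]:
    """Represent the highest-weighted terms of a key fragment by their CRC32 values."""
    # Sort by descending weight; ties are broken alphabetically for determinism.
    top_terms = sorted(weighted_terms, key=lambda t: (-weighted_terms[t], t))[:top_n]
    return [zlib.crc32(term.encode("utf-8")) for term in top_terms]


# A piece of information is then a group of such vectors, one per key fragment.
vector_group = [
    summary_vector({"天津": 3.0, "信息技术": 2.0, "公司": 1.0}),
    summary_vector({"语义": 2.0, "摘要": 2.0}),
]
```

Each element is an unsigned 32-bit integer, so comparing two fragments reduces to comparing small sets of integers rather than strings.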
4) recalling candidate information according to the summary vector group of the input information;
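The description does not state how the recall in step 4 is carried out. One possible realization, shown here purely as an assumption, is an inverted index from CRC32 values to the identifiers of stored information, so that any stored item sharing enough summary elements with the input becomes a candidate. The class `SummaryIndex`, its methods, and the `min_shared` parameter are hypothetical names.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set


class SummaryIndex:
    """Inverted index from summary elements to document ids (assumed design)."""

    def __init__(self) -> None:
        self._postings: Dict[int, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, vector_group: Iterable[Iterable[int]]) -> None:
        """Index every element of every summary vector of a stored document."""
        for vector in vector_group:
            for value in vector:
                self._postings[value].add(doc_id)

    def recall(self, vector_group: Iterable[Iterable[int]], min_shared: int = 1) -> List[str]:
        """Return ids of documents sharing at least min_shared distinct elements."""
        hits: Counter = Counter()
        for value in {v for vector in vector_group for v in vector}:
            for doc_id in self._postings.get(value, ()):
                hits[doc_id] += 1
        candidates = [d for d, n in hits.items() if n >= min_shared]
        # Most shared elements first; ties broken by id for determinism.
        return sorted(candidates, key=lambda d: (-hits[d], d))
```

The recalled candidates are only a coarse filter; the actual similarity judgment is made in step 5.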
5) comparing the input information with the candidate information, and judging whether the input information is similar to the candidate information.
Step 5 comprises the following steps:
51) cutting the candidate information into several fragments; selecting several key fragments from the fragments of the candidate information; and obtaining several semantic summaries from each key fragment of the candidate information and converting them into a summary vector;
52) comparing the summary vector of each key fragment of the input information with the summary vector of the corresponding key fragment of the candidate information to obtain the similarity of the key fragments being compared, the key fragments being judged similar when the similarity is greater than a specified threshold;
53) obtaining the similarity between the input information and the candidate information, the input information being judged similar to the candidate information when the similarity is greater than a specified threshold.
In step 52, obtaining the similarity of the key fragments being compared comprises the following step: the two vectors, one from the key fragment of the input information and one from the corresponding key fragment of the candidate information, are converted into element sets A and B; the similarity of the key fragments being compared is then the ratio of the number of elements in the intersection of A and B to the number of elements in their union.
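The ratio of intersection size to union size described in step 52 is the Jaccard index of the two element sets. A minimal Python sketch follows; the function name `fragment_similarity` is hypothetical, and returning 0.0 for two empty vectors is an assumption, since the description does not cover that case.

```python
from typing import Iterable


def fragment_similarity(vec_a: Iterable[int], vec_b: Iterable[int]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two summary vectors."""
    a, b = set(vec_a), set(vec_b)
    union = a | b
    if not union:
        # ASSUMPTION: two empty summaries are treated as not similar.
        return 0.0
    return len(a & b) / len(union)
```

Converting the vectors to sets means that element order and repeated elements do not affect the result.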
In step 53, obtaining the similarity between the input information and the candidate information comprises the following steps:
531) obtaining the total number of key fragments in the input information and the candidate information and the number of similar key fragments, and calculating the number of key fragments remaining after deduplication;
532) calculating the ratio of the number of similar key fragments to the number of key fragments remaining after deduplication, to obtain the similarity between the input information and the candidate information.
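Steps 531 and 532 can be sketched as follows. Two points are interpretations rather than statements of the description: the number of key fragments remaining after deduplication is taken to be the total number of key fragments in both pieces of information minus the number of similar pairs, and each candidate key fragment is matched at most once using a simple greedy pairing. All function and parameter names are hypothetical.

```python
from typing import Iterable, Sequence


def _jaccard(vec_a: Iterable[int], vec_b: Iterable[int]) -> float:
    a, b = set(vec_a), set(vec_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def information_similarity(input_group: Sequence[Sequence[int]],
                           candidate_group: Sequence[Sequence[int]],
                           fragment_threshold: float = 0.65) -> float:
    """Ratio of similar key fragments to key fragments remaining after deduplication."""
    matched = set()  # indices of candidate fragments already paired
    similar = 0
    for vec in input_group:
        for j, cand in enumerate(candidate_group):
            # ASSUMPTION: greedy one-to-one pairing of similar fragments.
            if j not in matched and _jaccard(vec, cand) > fragment_threshold:
                matched.add(j)
                similar += 1
                break
    # Each similar pair counts once after deduplication.
    remaining = len(input_group) + len(candidate_group) - similar
    return similar / remaining if remaining else 0.0
```

Under this reading the information-level similarity is again a Jaccard-style ratio, computed over key fragments instead of over summary elements.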
The similarity threshold for key fragments is adjusted mainly according to the number of elements in the union. For example, for two key fragments each containing 10 elements, the similarity threshold is usually set to 0.65, so that at least 8 elements must be identical: the ratio of the 8 intersection elements to the 12 union elements is about 0.67.
The similarity threshold for information is adjusted mainly according to the number of key fragments. Information with fewer key fragments has a higher threshold; for example, when the number of key fragments is 6, the threshold is usually set to 0.7, so that at least 5 key fragments must be similar. Information with more key fragments has a lower threshold; for example, when the number of key fragments is 10, the threshold is usually set to 0.4, so that at least 6 key fragments must be similar.
The correspondence between the number of key fragments and the information similarity threshold needs to be adjusted on the basis of large batches of comparison results.
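The minimum counts quoted in the threshold examples can be checked with a short Python sketch. It assumes that both sides contain the same number of items (elements per key fragment, or key fragments per piece of information) and that the denominator is the union size, that is, the total of both sides minus the number of matches. The function name `min_matches` is hypothetical.

```python
def min_matches(n_a: int, n_b: int, threshold: float) -> int:
    """Smallest k with k / (n_a + n_b - k) > threshold, or -1 if none exists."""
    if n_a + n_b == 0:
        return -1
    for k in range(min(n_a, n_b) + 1):
        if k / (n_a + n_b - k) > threshold:
            return k
    return -1


# Key fragments with 10 elements each, threshold 0.65: 8 / 12 is about 0.67.
print(min_matches(10, 10, 0.65))  # 8
# Information with 6 key fragments each, threshold 0.7: 5 / 7 is about 0.71.
print(min_matches(6, 6, 0.7))     # 5
# Information with 10 key fragments each, threshold 0.4: 6 / 14 is about 0.43.
print(min_matches(10, 10, 0.4))   # 6
```

Under these assumptions the computed minimums agree with the figures of 8, 5, and 6 given in the description.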
With the above technical solution, the content similarity analysis method based on multiple semantic summaries of the present invention can accurately judge the semantics of information content. Information content is gathered by multiple semantic summaries rather than by title, so that identical or similar information content is gathered into one cluster of search results, which facilitates the storage and reprocessing of the information content.
Obviously, the above embodiment is merely an example given for clarity of description and does not limit the embodiments. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Obvious changes or variations derived therefrom remain within the scope of protection of the invention.

Claims (7)

1. A content similarity analysis method based on multiple semantic summaries, characterized by comprising the following steps:
1) cutting the input information into several fragments;
2) selecting several key fragments from the fragments of the input information;
3) obtaining several semantic summaries from each key fragment and converting them into a summary vector;
4) recalling candidate information according to the summary vector group of the input information;
5) comparing the input information with the candidate information, and judging whether the input information is similar to the candidate information.
2. The content similarity analysis method based on multiple semantic summaries according to claim 1, characterized in that step 5 comprises the following steps:
51) cutting the candidate information into several fragments; selecting several key fragments from the fragments of the candidate information; and obtaining several semantic summaries from each key fragment of the candidate information and converting them into a summary vector;
52) comparing the summary vector of each key fragment of the input information with the summary vector of the corresponding key fragment of the candidate information to obtain the similarity of the key fragments being compared, the key fragments being judged similar when the similarity is greater than a specified threshold;
53) obtaining the similarity between the input information and the candidate information, the input information being judged similar to the candidate information when the similarity is greater than a specified threshold.
3. The content similarity analysis method based on multiple semantic summaries according to claim 2, characterized in that, in step 52, obtaining the similarity of the key fragments being compared comprises the following step: the two vectors, one from the key fragment of the input information and one from the corresponding key fragment of the candidate information, are converted into element sets A and B, and the similarity of the key fragments being compared is the ratio of the number of elements in the intersection of A and B to the number of elements in their union.
4. The content similarity analysis method based on multiple semantic summaries according to claim 2 or 3, characterized in that, in step 53, obtaining the similarity between the input information and the candidate information comprises the following steps:
531) obtaining the total number of key fragments in the input information and the candidate information and the number of similar key fragments, and calculating the number of key fragments remaining after deduplication;
532) calculating the ratio of the number of similar key fragments to the number of key fragments remaining after deduplication, to obtain the similarity between the input information and the candidate information.
5. The content similarity analysis method based on multiple semantic summaries according to claim 2 or 3, characterized in that, in step 1 or step 51, the input information or the candidate information is cut into complete Chinese sentences based on grammatical rules, each Chinese sentence being one of said fragments.
6. The content similarity analysis method based on multiple semantic summaries according to claim 2 or 3, characterized in that, in step 2 or step 51, the position at which a fragment occurs in its paragraph or in the article, the length of the fragment content, and the result of syntactic analysis are taken into account; different weights are assigned to these factors, and the weight sum of each fragment is calculated, whereby the key fragments are selected.
7. The content similarity analysis method based on multiple semantic summaries according to claim 1, characterized in that step 3 comprises the following step: after word segmentation of a key fragment, the semantic summary composed of the high-weight phrases and entity words is converted into the summary vector of that fragment, each phrase or entity word being represented by its CRC32 value.
CN201610356867.XA 2016-05-26 2016-05-26 Similarity analysis method of content similarities based on multiple semantic abstracts Withdrawn CN106055614A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610356867.XA CN106055614A (en) 2016-05-26 2016-05-26 Similarity analysis method of content similarities based on multiple semantic abstracts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610356867.XA CN106055614A (en) 2016-05-26 2016-05-26 Similarity analysis method of content similarities based on multiple semantic abstracts

Publications (1)

Publication Number Publication Date
CN106055614A true CN106055614A (en) 2016-10-26

Family

ID=57174787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610356867.XA Withdrawn CN106055614A (en) 2016-05-26 2016-05-26 Similarity analysis method of content similarities based on multiple semantic abstracts

Country Status (1)

Country Link
CN (1) CN106055614A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1341899A (en) * 2000-09-07 2002-03-27 国际商业机器公司 Method for automatic generating abstract from word or file
CN101398814A (en) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously abstracting document summarization and key words
CN104615705A (en) * 2015-01-30 2015-05-13 百度在线网络技术(北京)有限公司 Web page quality detection method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106708807A (en) * 2017-02-10 2017-05-24 深圳市空谷幽兰人工智能科技有限公司 Non-supervision word segmentation mode training method and device
CN106708807B (en) * 2017-02-10 2019-11-15 广东惠禾科技发展有限公司 Unsupervised participle model training method and device
CN110020421A (en) * 2018-01-10 2019-07-16 北京京东尚科信息技术有限公司 The session information method of abstracting and system of communication software, equipment and storage medium
CN109102167A (en) * 2018-07-23 2018-12-28 长沙知了信息科技有限公司 Information processing method and device
CN110321466A (en) * 2019-06-14 2019-10-11 广发证券股份有限公司 A kind of security information duplicate checking method and system based on semantic analysis
CN110321466B (en) * 2019-06-14 2023-09-15 广发证券股份有限公司 Securities information duplicate checking method and system based on semantic analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20161026