CN106055614A - Similarity analysis method of content similarities based on multiple semantic abstracts - Google Patents
Similarity analysis method of content similarities based on multiple semantic abstracts
- Publication number
- CN106055614A
- Authority
- CN
- China
- Prior art keywords
- information
- candidate
- segment
- input information
- critical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a content similarity analysis method based on multiple semantic summaries. The method comprises the following steps: 1) splitting the input information into a number of segments; 2) selecting a number of critical segments from those segments; 3) obtaining several semantic summaries from each critical segment and converting them into summary vectors; 4) recalling candidate information according to the summary vector group of the input information; 5) comparing the input information with the candidate information and determining whether the input information is similar to the candidate information. With this technical scheme, the semantics of information content can be judged accurately. Because information content is grouped by multiple semantic summaries rather than by title, identical or similar information content is gathered into a single cluster of search results, which simplifies storage and reprocessing of the content.
Description
Technical field
The present invention relates to a content similarity analysis method based on multiple semantic summaries, and belongs to the technical field of internet information acquisition.
Background technology
Information content spread over the Internet is usually modified, re-edited or otherwise processed in the course of propagation, so that the original content and the modified content differ somewhat even though their main content remains the same or similar. In the prior art, such similar content is still identified mostly by title similarity. The news-headline search function commonly offered by search engines, for example, usually gathers items with identical titles into one cluster of search results. During propagation, however, the same content is often given several different titles by different media or platforms. A title change can therefore cause identical or similar information content to be identified as different and scattered across several clusters of search results. Identifying content by title alone thus occupies excessive storage resources on the one hand and, on the other, tends to prevent the search results for the same information content from being fully used. The prior art also discloses many methods for judging content similarity, but they do not solve this technical problem, and because they analyse every word of the full text they also consume considerable resources.
For example, Chinese patent document CN1470047A discloses a vector analysis method for documents, used to extract important sentences from a given document and to determine the similarity of two documents. Specifically, the method detects the words that occur in each input document, divides each input document into document segments, and generates document segment vectors, each of which takes as its element values the frequencies with which those words occur in the corresponding document segment. For the two input documents it then computes the squared inner product of every pairwise combination of the document segment vectors contained in each input document, and determines the similarity between the two input documents from the sum of those squared inner products.
As another example, Chinese patent document CN1959671A discloses a document similarity measurement method based on document structure. For two documents to be compared, it obtains the subtopic sequence of each document by document structure analysis, calculates a similarity value between each subtopic of one document and each subtopic of the other using a similarity measure, builds a weighted bipartite graph and solves for the optimal matching, and normalises the total weight of the optimal matching to obtain the similarity value of the two documents.
As a further example, Chinese patent document CN103389987A discloses a text similarity comparison method that extracts the feature vectors of each file to be analysed together with their values, and compares those values with the values of the feature vectors they are to be compared against, thereby obtaining the similarity between the files to be analysed.
Many other similarity analysis methods of this kind exist, but none of them solves the technical problem described above.
Summary of the invention
Therefore, the object of the present invention is to provide a content similarity analysis method based on multiple semantic summaries that overcomes both the defect that prior-art identification by title tends to prevent the search results for the same information content from being fully used, and the defect that full-text identification consumes considerable resources.
To achieve the above object, the content similarity analysis method based on multiple semantic summaries of the present invention comprises the following steps:
1) splitting the input information into a number of segments;
2) selecting a number of critical segments from the segments of the input information;
3) obtaining several semantic summaries from each critical segment and converting them into summary vectors;
4) recalling candidate information according to the summary vector group of the input information;
5) comparing the input information with the candidate information and determining whether the input information is similar to the candidate information.
Step 5 comprises the following steps:
51) splitting the candidate information into a number of segments; selecting a number of critical segments from the segments of the candidate information; and obtaining several semantic summaries from each critical segment of the candidate information and converting them into summary vectors;
52) comparing the summary vectors of the critical segments of the input information with the summary vectors of the corresponding critical segments of the candidate information to obtain the similarity of the compared critical segments, the compared critical segments being judged similar when the similarity exceeds a specified threshold;
53) obtaining the similarity between the input information and the candidate information, the input information being judged similar to the candidate information when the similarity exceeds a specified threshold.
In step 52, obtaining the similarity of the compared critical segments comprises the following step: the two vectors, one from the critical segment of the input information and one from the corresponding critical segment of the candidate information, are converted into element sets A and B, and the similarity of the compared critical segments is the ratio of the number of elements in the intersection of A and B to the number of elements in their union.
In step 53, obtaining the similarity between the input information and the candidate information comprises the following steps:
531) obtaining the total number of critical segments in the input information and the candidate information and the number of similar critical segments, and calculating the number of critical segments remaining after deduplication;
532) calculating the ratio of the number of similar critical segments to the number of critical segments remaining after deduplication, which gives the similarity between the input information and the candidate information.
In step 1 or step 51, the input information or the candidate information is split into complete Chinese sentences based on grammatical rules, each Chinese sentence being one segment.
In step 2 or step 51, critical segments are selected by considering the position at which a segment occurs in its paragraph or in the article, the length of the segment, and the result of syntactic analysis; these factors are assigned different weights, and the weighted sum is calculated for each segment.
Step 3 comprises the following steps: after word segmentation of a critical segment, a semantic summary composed of the high-weight phrases and entity words is converted into the summary vector of that segment, each phrase or entity word being represented by its CRC32 value.
With the above technical scheme, the content similarity analysis method based on multiple semantic summaries of the present invention can judge the semantics of information content accurately. Information content is grouped by multiple semantic summaries rather than by title, so that identical or similar information content is gathered into a single cluster of search results, which simplifies storage and reprocessing of the content.
Detailed description of the invention
The present invention is described in further detail below by way of a specific embodiment.
This embodiment provides a content similarity analysis method based on multiple semantic summaries, comprising the following steps:
1) splitting the input information into a number of segments;
Information content, such as the text of a content page published on a website, generally conforms to Chinese grammatical rules. In this step the input information or the candidate information can therefore be split into complete Chinese sentences based on grammatical rules, each Chinese sentence being one segment. The splitting should produce complete Chinese sentences as far as possible, for example by splitting on punctuation marks such as question marks and full stops. Both the full-width and the half-width forms of the punctuation marks need to be considered during splitting.
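By way of illustration only, the punctuation-based splitting of step 1 might be sketched as follows. The function name `split_segments` and the exact set of sentence-ending marks are assumptions; the description names only question marks and full stops and requires that full-width and half-width forms both be handled.

```python
import re

# Sentence-ending marks in full-width and half-width forms.
# Assumption: the description names only question marks and full stops;
# exclamation marks and semicolons are added here for illustration.
_SENTENCE_END = "。？！；.?!;"
_CLASS = re.escape(_SENTENCE_END)
# A segment is a run of non-terminator characters plus any terminators that follow it.
_SEGMENT_RE = re.compile(f"[^{_CLASS}]+[{_CLASS}]*")


def split_segments(text: str) -> list[str]:
    """Split text into sentence-like segments, keeping the closing punctuation."""
    segments = (m.group().strip() for m in _SEGMENT_RE.finditer(text))
    return [s for s in segments if s]
```

Note that treating the half-width full stop as a terminator also splits decimal numbers such as 3.14, so a practical splitter would likely need additional rules.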
2) selecting a number of critical segments from the segments of the input information;
Critical segments are selected by considering the position at which a segment occurs in its paragraph or in the article, the length of the segment, and the result of syntactic analysis; these factors are assigned different weights, and the weighted sum is calculated for each segment. The position weight is adjusted according to the rule "head or tail of the article > head or tail of a paragraph > middle of a paragraph". Sentence constituents are words or phrases, and a phrase is weighted higher than a word; among the various types of words, entity words such as place names, personal names and nouns are weighted highest. The length of a segment affects the number of words it contains and therefore the weight calculation. The weight of each segment is calculated and critical segments are selected according to the length of the content; the number of critical segments is usually 1/5 to 1/3 of the total number of segments.
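The selection of step 2 can be illustrated with a minimal sketch. All numeric weights, the `Segment` structure and the default fraction of 0.25 are assumptions chosen for illustration; the description fixes only the orderings (article edge > paragraph edge > paragraph middle; entity word > phrase > word) and the 1/5 to 1/3 range. The terms and their kinds are assumed to come from a separate word segmentation and syntactic analysis step that is not shown.

```python
from dataclasses import dataclass

# Illustrative weights only; the description specifies orderings, not values.
POSITION_WEIGHT = {"article_edge": 3.0, "paragraph_edge": 2.0, "paragraph_middle": 1.0}
TERM_WEIGHT = {"entity": 3.0, "phrase": 2.0, "word": 1.0}


@dataclass
class Segment:
    text: str
    position: str                  # a key of POSITION_WEIGHT
    terms: list[tuple[str, str]]   # (term, kind) pairs from a hypothetical analyser


def segment_weight(seg: Segment) -> float:
    """Sum the position weight and the weights of all terms in the segment.

    Longer segments contain more terms, so length influences the total.
    """
    return POSITION_WEIGHT[seg.position] + sum(TERM_WEIGHT[kind] for _, kind in seg.terms)


def select_critical(segments: list[Segment], fraction: float = 0.25) -> list[Segment]:
    """Keep the highest-weighted share of segments, returned in document order."""
    if not segments:
        return []
    k = max(1, round(len(segments) * fraction))
    top = sorted(segments, key=segment_weight, reverse=True)[:k]
    chosen = {id(s) for s in top}
    return [s for s in segments if id(s) in chosen]
```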
3) obtaining several semantic summaries from each critical segment and converting them into summary vectors;
After word segmentation of a critical segment, a semantic summary composed of the high-weight phrases and entity words is converted into the summary vector of that segment, each phrase or entity word being represented by its CRC32 value. A segment is thus represented by a vector (a1, a2, a3, ...), and a single piece of information can be represented by the vector group of its critical segments, for example:
Critical segment a: (a1, a2, a3, ...);
Critical segment b: (b1, b2, b3, ...);
Critical segment c: (c1, c2, c3, ...).
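As a minimal sketch of the conversion in step 3, each high-weight phrase or entity word can be mapped to its CRC-32 checksum with Python's standard `zlib.crc32`. The function name `summary_vector` and the choice of UTF-8 as the byte encoding are assumptions; the description states only that CRC32 values are used.

```python
import zlib


def summary_vector(terms: list[str]) -> tuple[int, ...]:
    """Represent each phrase or entity word by its CRC-32 checksum.

    Assumption: terms are encoded as UTF-8 before hashing; the same encoding
    must be used for input and candidate information so that values match.
    """
    return tuple(zlib.crc32(term.encode("utf-8")) for term in terms)


# A piece of information is then a group of such vectors, one per critical segment.
vector_group = [
    summary_vector(["北京", "人工智能"]),
    summary_vector(["搜索引擎", "相似度"]),
]
```

Because CRC-32 yields unsigned 32-bit integers, identical terms always map to the same element, while distinct terms can occasionally collide.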
4) recalling candidate information according to the summary vector group of the input information;
5) comparing the input information with the candidate information and determining whether the input information is similar to the candidate information.
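The description does not say how the recall of step 4 is carried out. One possible approach, shown here purely as an assumption, is an inverted index from summary vector elements to the identifiers of stored information, so that any stored item sharing at least a chosen number of elements with the input becomes a candidate. The class `SummaryIndex` and the parameter `min_shared` are hypothetical names.

```python
from collections import Counter, defaultdict


class SummaryIndex:
    """Hypothetical inverted index: summary vector element -> ids of stored information."""

    def __init__(self) -> None:
        self._postings: dict[int, set[str]] = defaultdict(set)

    def add(self, doc_id: str, vector_group) -> None:
        """Index every element of every summary vector of one piece of information."""
        for vector in vector_group:
            for value in vector:
                self._postings[value].add(doc_id)

    def recall(self, vector_group, min_shared: int = 1) -> list[str]:
        """Return ids sharing at least min_shared distinct elements, most shared first."""
        hits: Counter = Counter()
        for value in {v for vector in vector_group for v in vector}:
            for doc_id in self._postings.get(value, ()):
                hits[doc_id] += 1
        return [doc_id for doc_id, n in hits.most_common() if n >= min_shared]
```

The recalled candidates would then be passed to the detailed comparison of step 5.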
Step 5 comprises the following steps:
51) splitting the candidate information into a number of segments; selecting a number of critical segments from the segments of the candidate information; and obtaining several semantic summaries from each critical segment of the candidate information and converting them into summary vectors;
52) comparing the summary vectors of the critical segments of the input information with the summary vectors of the corresponding critical segments of the candidate information to obtain the similarity of the compared critical segments, the compared critical segments being judged similar when the similarity exceeds a specified threshold;
53) obtaining the similarity between the input information and the candidate information, the input information being judged similar to the candidate information when the similarity exceeds a specified threshold.
In step 52, obtaining the similarity of the compared critical segments comprises the following step: the two vectors, one from the critical segment of the input information and one from the corresponding critical segment of the candidate information, are converted into element sets A and B, and the similarity of the compared critical segments is the ratio of the number of elements in the intersection of A and B to the number of elements in their union.
In step 53, obtaining the similarity between the input information and the candidate information comprises the following steps:
531) obtaining the total number of critical segments in the input information and the candidate information and the number of similar critical segments, and calculating the number of critical segments remaining after deduplication;
532) calculating the ratio of the number of similar critical segments to the number of critical segments remaining after deduplication, which gives the similarity between the input information and the candidate information.
The similarity threshold for critical segments is adjusted mainly according to the number of elements in the union. For two critical segments of 10 elements each, for example, the threshold is usually set to 0.65, which requires at least 8 identical elements: the ratio of 8 intersection elements to 12 union elements is approximately 0.67.
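The ratio of intersection size to union size used in step 52 is the Jaccard index, and the worked figures above can be checked with a short sketch. The function names are assumptions; the default threshold of 0.65 is the value quoted for two segments of 10 elements each.

```python
def segment_similarity(vec_a, vec_b) -> float:
    """Ratio of shared elements to all distinct elements of two summary vectors."""
    a, b = set(vec_a), set(vec_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def segments_similar(vec_a, vec_b, threshold: float = 0.65) -> bool:
    """Two critical segments are judged similar when the ratio exceeds the threshold."""
    return segment_similarity(vec_a, vec_b) > threshold


# Two 10-element segments sharing 8 elements: 8 / 12, above the 0.65 threshold.
eight_shared = segment_similarity(range(10), range(2, 12))
# Sharing only 7 elements gives 7 / 13, below the threshold.
seven_shared = segment_similarity(range(10), range(3, 13))
```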
The similarity threshold for information is adjusted mainly according to the number of critical segments. Information with few critical segments has a high threshold: with 6 critical segments, for example, the threshold is usually set to 0.7, which requires at least 5 similar critical segments. Information with more critical segments has a lower threshold: with 10 critical segments, for example, the threshold is usually set to 0.4, which requires at least 6 similar critical segments.
The correspondence between the number of segments and the information similarity threshold needs to be tuned against a large volume of comparison results.
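Steps 531 and 532 might be sketched as follows. Two points are assumptions: the number of critical segments remaining after deduplication is read as the combined total minus the similar segments counted once, and each input segment is paired greedily with at most one candidate segment. Under the first reading, 6 segments per item with a threshold of 0.7 requires 5 similar segments (5/7 is about 0.71) and 10 segments per item with a threshold of 0.4 requires 6 (6/14 is about 0.43), which agrees with the figures in the description.

```python
def jaccard(vec_a, vec_b) -> float:
    """Ratio of shared elements to all distinct elements of two summary vectors."""
    a, b = set(vec_a), set(vec_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def count_similar_segments(input_vecs, cand_vecs, seg_threshold: float = 0.65) -> int:
    """Greedily pair each input segment with at most one similar candidate segment."""
    unmatched = list(cand_vecs)
    similar = 0
    for vec in input_vecs:
        for i, other in enumerate(unmatched):
            if jaccard(vec, other) > seg_threshold:
                similar += 1
                del unmatched[i]   # each candidate segment is used only once
                break
    return similar


def information_similarity(input_vecs, cand_vecs, seg_threshold: float = 0.65) -> float:
    """Similar segments divided by the segments remaining after deduplication."""
    similar = count_similar_segments(input_vecs, cand_vecs, seg_threshold)
    remaining = len(input_vecs) + len(cand_vecs) - similar   # assumed reading of step 531
    return similar / remaining if remaining else 0.0


def information_similar(input_vecs, cand_vecs, info_threshold: float) -> bool:
    """Judge two pieces of information similar when the ratio exceeds the threshold."""
    return information_similarity(input_vecs, cand_vecs) > info_threshold
```

In use, `info_threshold` would be chosen from the number of critical segments, for example 0.7 for about 6 segments and 0.4 for about 10, and tuned against comparison results.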
With the above technical scheme, the content similarity analysis method based on multiple semantic summaries of the present invention can judge the semantics of information content accurately. Information content is grouped by multiple semantic summaries rather than by title, so that identical or similar information content is gathered into a single cluster of search results, which simplifies storage and reprocessing of the content.
Obviously, the above embodiment is merely an example given for clarity of description and does not limit the possible embodiments. A person of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Obvious changes or variations derived from the above remain within the scope of protection of the invention.
Claims (7)
1. A content similarity analysis method based on multiple semantic summaries, characterized by comprising the following steps:
1) splitting the input information into a number of segments;
2) selecting a number of critical segments from the segments of the input information;
3) obtaining several semantic summaries from each critical segment and converting them into summary vectors;
4) recalling candidate information according to the summary vector group of the input information;
5) comparing the input information with the candidate information and determining whether the input information is similar to the candidate information.
2. The content similarity analysis method based on multiple semantic summaries according to claim 1, characterized in that step 5 comprises the following steps:
51) splitting the candidate information into a number of segments; selecting a number of critical segments from the segments of the candidate information; and obtaining several semantic summaries from each critical segment of the candidate information and converting them into summary vectors;
52) comparing the summary vectors of the critical segments of the input information with the summary vectors of the corresponding critical segments of the candidate information to obtain the similarity of the compared critical segments, the compared critical segments being judged similar when the similarity exceeds a specified threshold;
53) obtaining the similarity between the input information and the candidate information, the input information being judged similar to the candidate information when the similarity exceeds a specified threshold.
3. The content similarity analysis method based on multiple semantic summaries according to claim 2, characterized in that, in step 52, obtaining the similarity of the compared critical segments comprises the following step: the two vectors, one from the critical segment of the input information and one from the corresponding critical segment of the candidate information, are converted into element sets A and B, and the similarity of the compared critical segments is the ratio of the number of elements in the intersection of A and B to the number of elements in their union.
4. The content similarity analysis method based on multiple semantic summaries according to claim 2 or 3, characterized in that, in step 53, obtaining the similarity between the input information and the candidate information comprises the following steps:
531) obtaining the total number of critical segments in the input information and the candidate information and the number of similar critical segments, and calculating the number of critical segments remaining after deduplication;
532) calculating the ratio of the number of similar critical segments to the number of critical segments remaining after deduplication, which gives the similarity between the input information and the candidate information.
5. The content similarity analysis method based on multiple semantic summaries according to claim 2 or 3, characterized in that, in step 1 or step 51, the input information or the candidate information is split into complete Chinese sentences based on grammatical rules, each Chinese sentence being one segment.
6. The content similarity analysis method based on multiple semantic summaries according to claim 2 or 3, characterized in that, in step 2 or step 51, critical segments are selected by considering the position at which a segment occurs in its paragraph or in the article, the length of the segment, and the result of syntactic analysis; these factors are assigned different weights, and the weighted sum is calculated for each segment.
7. The content similarity analysis method based on multiple semantic summaries according to claim 1, characterized in that step 3 comprises the following steps: after word segmentation of a critical segment, a semantic summary composed of the high-weight phrases and entity words is converted into the summary vector of that segment, each phrase or entity word being represented by its CRC32 value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610356867.XA CN106055614A (en) | 2016-05-26 | 2016-05-26 | Similarity analysis method of content similarities based on multiple semantic abstracts |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610356867.XA CN106055614A (en) | 2016-05-26 | 2016-05-26 | Similarity analysis method of content similarities based on multiple semantic abstracts |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106055614A true CN106055614A (en) | 2016-10-26 |
Family
ID=57174787
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610356867.XA (Withdrawn; published as CN106055614A) | Similarity analysis method of content similarities based on multiple semantic abstracts | 2016-05-26 | 2016-05-26 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106055614A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1341899A (en) * | 2000-09-07 | 2002-03-27 | 国际商业机器公司 | Method for automatic generating abstract from word or file |
CN101398814A (en) * | 2007-09-26 | 2009-04-01 | 北京大学 | Method and system for simultaneously abstracting document summarization and key words |
CN104615705A (en) * | 2015-01-30 | 2015-05-13 | 百度在线网络技术(北京)有限公司 | Web page quality detection method and device |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106708807A (en) * | 2017-02-10 | 2017-05-24 | 深圳市空谷幽兰人工智能科技有限公司 | Non-supervision word segmentation mode training method and device |
CN106708807B (en) * | 2017-02-10 | 2019-11-15 | 广东惠禾科技发展有限公司 | Unsupervised participle model training method and device |
CN110020421A (en) * | 2018-01-10 | 2019-07-16 | 北京京东尚科信息技术有限公司 | The session information method of abstracting and system of communication software, equipment and storage medium |
CN109102167A (en) * | 2018-07-23 | 2018-12-28 | 长沙知了信息科技有限公司 | Information processing method and device |
CN110321466A (en) * | 2019-06-14 | 2019-10-11 | 广发证券股份有限公司 | A kind of security information duplicate checking method and system based on semantic analysis |
CN110321466B (en) * | 2019-06-14 | 2023-09-15 | 广发证券股份有限公司 | Securities information duplicate checking method and system based on semantic analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20161026 |