CN106095737A - Document similarity calculation method and whole-network retrieval and tracking method for similar documents - Google Patents
- Publication number
- CN106095737A (application CN201610398902.4A)
- Authority
- CN
- China
- Prior art keywords
- document
- documents
- similarity
- computational methods
- cosine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a document similarity calculation method and a whole-network retrieval and tracking method for similar documents, and its object is to provide such methods. The technical solution is a document similarity calculation method characterized by the following steps. S01, document decomposition: the original document and the target document are each segmented into words to obtain their respective token sets. S02, preprocessing and feature weighting: TF-IDF is used to compute a weight for each token and to extract the core keywords; Word2vec is used to mine the degree of association between different tokens in a document, so that each document undergoes semantic analysis. S03, vector space model and cosine similarity algorithm: the cosine of the angle between two vectors in the vector space is used to measure how similar two documents are; the cosine value lies between 0 and 1, and the larger it is, the more similar the two documents. The invention is suitable for tracking reprints of news articles and for compiling statistics on their dissemination.
Description
Technical field
The present invention relates to a document similarity calculation method and a whole-network retrieval and tracking method for similar documents. It is suitable for tracking reprints of news articles and for compiling statistics on their dissemination.
Background art
Traditional media, as the main producers of news, contribute more than 80% of original news. Because their own distribution platforms are limited, however, their original documents are reprinted by large numbers of portals and new-media outlets. In reprinting these documents, new media multiply their traffic and influence and also earn considerable economic returns, while the authors of the original documents do not benefit from it. When authors turn to legal means to resolve copyright issues, finding the reprinted documents that should be taken down is like looking for a needle in a haystack: it consumes a great deal of manpower, and evidence is hard to collect.
Meanwhile, media outlets also want to analyze their dissemination reach across all the other media that reprint them. At present they have no good way to count all of their propagation paths; the counting can only be done manually, and the workload is enormous.
At present, China has the highest proportion of social media use in the world, with people spending on average 5.8 hours online per day. Formerly the public learned the news from television, newspapers, magazines and radio; today the public obtains information mostly through social software such as Weibo, WeChat, QQ and forums. As of the end of the first quarter of this year, Sina Weibo had 260 million monthly active users and WeChat had 549 million. Weibo and WeChat have become the best tools for making use of fragmented time.
Today, in the mobile Internet era, content, form and social interaction come together, and the social ties involved are strong ties. The influence of mass media is slowly declining while the influence of new media keeps deepening.
When every individual has the capacity to disseminate, the traditional media structure begins to disintegrate, the channels through which consumers learn the news no longer depend heavily on mass media, and the age of "self-media" is born. This is an era in which ordinary people can create miracles and in which consumers gain sovereignty, and so it is also the era with the most opportunities for everyone, especially media professionals.
With self-media developing rapidly, copyright protection for individual self-media creators is all the more important. Because self-media creators have little power, they have no good way to protect the copyright of their own documents.
Summary of the invention
The technical problem to be solved by the invention is: in view of the problems described above, to provide a document similarity calculation method and a whole-network retrieval and tracking method for similar documents, so as to judge the similarity of two documents more accurately, track where a document has been published across the whole network, and lay a solid foundation for copyright protection.
The technical solution adopted by the invention is a document similarity calculation method, characterized by:
S01, document decomposition: the original document and the target document are each segmented into words to obtain their respective token sets;
S02, preprocessing and feature weighting:
TF-IDF is used to compute a weight for each token and to extract the core keywords;
Word2vec is used to mine the degree of association between different tokens in a document, so that each document undergoes semantic analysis;
S03, vector space model and cosine similarity algorithm:
the original document and the target document are reduced to two N-dimensional vectors whose components are keyword weights;
the document cosine similarity algorithm is based on the vector model and uses the cosine of the angle between two vectors in the vector space to measure how similar two documents are; the cosine value lies between 0 and 1, and the larger it is, the more similar the two documents.
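For reference, the standard cosine measure that step S03 relies on can be written as follows. The symbols a_i and b_i are notation assumed here, not taken from the patent: they stand for the weight of keyword i in the original document and in the target document respectively.

```latex
\cos\theta = \frac{\sum_{i=1}^{N} a_i b_i}{\sqrt{\sum_{i=1}^{N} a_i^{2}}\;\sqrt{\sum_{i=1}^{N} b_i^{2}}}
```

Because TF-IDF weights are non-negative, every product a_i b_i is non-negative, which is why the value stays between 0 and 1 rather than ranging over -1 to 1.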
Step S01 includes:
data preparation: interference information in the document is cleaned out by the ETL data cleaning system, and the document is structured and decomposed into smallest-unit structures;
infrastructure: a full-text index is built with the ElasticSearch search engine as the basic component, and the index is created using the fine-grained segmentation mode of a Chinese word segmentation library.
Step S02 uses TF-IDF to delete from the document, according to the words in the inverse-document dictionary, tokens that occur very frequently but contribute little to identifying the text content.
A whole-network retrieval and tracking method for similar documents, characterized by:
a, setting the search range;
b, setting the search conditions: the N core keywords with the highest TF-IDF weights in the original document are extracted, and a search of the full library is performed at a given match rate using the ES full-text search engine;
c, sorting in descending order: the retrieved documents are sorted in descending order of the relevance weight between keywords and document;
d, comparing the document with the highest weight one by one against every retrieved document, applying the document similarity calculation method to compute the similarity of the two documents;
e, checking whether the similarity result is higher than N%: if it is, the two documents are judged to be the same; otherwise they are judged to be different documents.
Step a includes setting the time range in which the documents to be retrieved were published, the carrier on which they were published, and the word count and type of the documents to be retrieved.
The beneficial effects of the invention are as follows. The invention combines TF-IDF with word2vec so that document similarity processing yields more accurate results, which makes copyright tracking and the statistical analysis of dissemination more precise and closer to the actual situation. The invention reduces the original document and the target document to two N-dimensional vectors whose components are keyword weights, and uses the cosine of the angle between the two vectors in the vector space to measure how similar the two documents are, so that their similarity is judged more accurately. The invention sets the search range by conditions and cleans out interference information with the ETL data cleaning system, which improves retrieval efficiency.
Brief description of the drawings
Fig. 1 is the system architecture diagram of the document similarity calculation method in the embodiment.
Fig. 2 is the flow chart of preprocessing and feature weighting in the embodiment.
Fig. 3 is the relationship diagram of the vector space model and the cosine similarity algorithm in the embodiment.
Fig. 4 is the flow chart of the whole-network retrieval and tracking method for similar documents in the embodiment.
Detailed description of the embodiments
Fig. 1 is the system architecture diagram of the document similarity calculation method in this embodiment. The document similarity calculation method in this embodiment includes:
(1) Data preparation - ETL
Media data from the whole network is collected in real time, and interference information is cleaned out by the "ETL data cleaning system". While the data is being purified, each news article is also structured and decomposed into smallest-unit structures to obtain a token set. This is referred to as data atomization.
(2) Infrastructure - ElasticSearch full-text index + Chinese word segmentation
The ElasticSearch search engine serves as the basic component of the whole system, and all later algorithms are built on ES. ElasticSearch is a distributed, multi-user full-text search engine based on Lucene. The scalability of its distributed storage effectively solves the problem of storing the massive amount of data that flows in every day. ElasticSearch is also a near-real-time search platform: in practice, only about 1 second elapses between indexing an article and its becoming searchable, so it can be applied efficiently in the later propagation-path analysis. Its distributed computing can also be exploited, together with additional hardware, to raise computation speed and retrieval performance.
When the full-text index is built, the fine-grained segmentation mode of the Chinese word segmentation library is used to create the index, to ensure that the keywords of each document are decomposed completely.
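An index definition along these lines might look like the sketch below. Several points are assumptions rather than statements from the patent: the patent does not name the segmentation library, so the `ik_max_word` analyzer of the third-party IK analysis plugin is used here as one fine-grained option; the field names are invented; and the body is written in the typeless mapping format of recent Elasticsearch versions, which differs from the versions available when the patent was filed.

```python
import json

# Hypothetical index definition; every field name is illustrative only.
# "ik_max_word" is the fine-grained analyzer of the IK plugin, chosen as
# an assumption because the patent does not name a segmenter.
ARTICLE_INDEX = {
    "settings": {"number_of_shards": 3, "number_of_replicas": 1},
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "ik_max_word"},
            "content": {"type": "text", "analyzer": "ik_max_word"},
            "carrier": {"type": "keyword"},       # newspaper, website, ...
            "doc_type": {"type": "keyword"},      # article type
            "word_count": {"type": "integer"},
            "published_at": {"type": "date"},
        }
    },
}

# Serialized request body that would be sent when creating the index.
body = json.dumps(ARTICLE_INDEX, ensure_ascii=False, indent=2)
```

Keeping carrier, type, word count and publication date as structured fields is what would later allow the search range to be restricted without touching the full text.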
(3) Preprocessing and feature weighting - TF-IDF + word2vec
Fig. 2 is the flow chart of preprocessing and feature weighting in this embodiment. TF-IDF is a weighting technique used in information retrieval and data mining. It assesses how important a word is to one document within a document set: the weight of a word increases in proportion to the number of times it appears in the document, but decreases as the word appears more frequently across the document set. Based on TF-IDF, words, symbols, punctuation marks, garbled characters and the like that occur very frequently but contribute little to identifying the text content are deleted from the document according to the words in the inverse-document dictionary.
The keywords of each document are decomposed and the frequency of each word is counted; TF-IDF is then used to compute a weight for each token and to extract the core keywords.
TF-IDF is a method for computing the degree of association between a word and a document. Here it is mainly used to pick out, from massive data, the set of similar documents that needs statistical analysis, in preparation for the subsequent reprint analysis and tracking.
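The weighting described above can be sketched as follows. The patent gives no formula, so this uses one common textbook variant as an assumption: term frequency is the count divided by the document length, and inverse document frequency is log(N / df). The function names `tf_idf` and `core_keywords` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {token: weight} map per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # tf * idf; a token present in every document gets weight 0.
            tok: (cnt / total) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        })
    return weights


def core_keywords(weights: dict[str, float], n: int) -> list[str]:
    """Pick the n tokens with the highest weight."""
    return sorted(weights, key=weights.get, reverse=True)[:n]
```

With this variant, a function word that appears in every document receives a weight of exactly zero, which matches the stated goal of discarding frequent but uninformative tokens. Smoothed variants of the IDF term are also in wide use and would give such tokens a small positive weight instead.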
Because the cosine similarity algorithm cannot handle synonyms and near-synonyms, this embodiment applies the Word2vec algorithm in advance, in the preprocessing stage, to perform semantic analysis on each document and thereby remove semantic interference from the later statistical analysis. Word2vec is an efficient algorithm that represents words as numeric vectors. Drawing on ideas from deep learning, it reduces, through training, the processing of document keywords to vector operations in a vector space, and improves semantic accuracy by mining the degree of association between different keywords in a document.
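The patent does not say exactly how the Word2vec result feeds into the later comparison. One plausible reading, shown here purely as an assumption, is to fold near-synonymous keywords together before the weight vectors are compared. The embedding values below are invented toy numbers, and `merge_synonyms` and its threshold are hypothetical; a real system would load vectors trained by Word2vec on a large corpus.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only.
EMBEDDINGS = {
    "转载": (0.90, 0.10, 0.05),
    "转发": (0.88, 0.15, 0.02),
    "天气": (0.05, 0.20, 0.95),
}


def cos(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_synonyms(weights, vocab, threshold=0.95):
    """Fold each keyword's weight into the first vocab term it is close to."""
    merged = {}
    for word, w in weights.items():
        target = word
        vec = EMBEDDINGS.get(word)
        if vec is not None:
            for canon in vocab:
                if canon in EMBEDDINGS and cos(vec, EMBEDDINGS[canon]) >= threshold:
                    target = canon  # treat `word` as a synonym of `canon`
                    break
        merged[target] = merged.get(target, 0.0) + w
    return merged
```

In use, `vocab` would be the keyword list of the other document, so that two articles using different words for the same idea end up with weight on the same vector component.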
(4) Vector space model and cosine similarity algorithm
Fig. 3 is the relationship diagram of the vector space model and the cosine similarity algorithm in this embodiment. The original document and the target document are reduced to two N-dimensional vectors whose components are keyword weights, and the vector model is then used to compute the cosine similarity. The document cosine similarity algorithm is vector-based: it uses the cosine of the angle between two vectors in the vector space to measure how similar two documents are, focusing on the difference in direction between the two vectors. The cosine value lies between 0 and 1, and the larger the value, the more similar the two documents.
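A minimal sketch of this comparison, assuming each document is held as a sparse map from keyword to weight (the representation and the function name `cosine_similarity` are choices made here, not taken from the patent):

```python
import math


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse keyword-weight vectors."""
    # Only keywords present in both documents contribute to the dot product.
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty or all-zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)
```

Because only the direction of the vectors matters, a document and a copy of it repeated twice score the same against any third document, which suits reprint detection where length can vary.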
As shown in Fig. 4, this embodiment provides a whole-network retrieval and tracking method for similar documents. The method is carried out in the following steps:
a, setting the search range;
a01, setting the time range: for example, documents published within 3 days (72 hours) of the current time;
a02, setting the document scope: the carriers to be searched are selected, such as newspapers, websites and WeChat;
a03, setting the document selection conditions: requirements are set on the word count and type of the documents to be retrieved, for example article word count >= 200, and excluded article types: forum, special topic.
b, setting the search conditions: the N core keywords with the highest TF-IDF weights in the original document are extracted, and a search of the full library is performed at a given match rate using the ES full-text search engine;
c, sorting in descending order: the retrieved documents are sorted in descending order of the relevance weight between keywords and document;
d, comparing the document with the highest weight one by one against every retrieved document: the document similarity calculation method of this embodiment is applied to compute the similarity between the highest-weight document and each other document;
e, checking whether the similarity result is higher than N%: if it is, the two documents are judged to be the same; otherwise they are judged to be different documents.
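Steps a to e can be sketched end to end as below. This is an in-memory illustration under several assumptions, not the patented implementation: the dictionary keys, the default parameter values and the function name `track_reprints` are hypothetical; step b is approximated by counting shared keywords instead of querying ES; step c ranks by the sum of matched weights; and, as a simplification, the original document itself serves as the reference in step d.

```python
import math
from datetime import datetime, timedelta


def cosine(a, b):
    """Cosine similarity of two sparse {keyword: weight} vectors."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def track_reprints(original, corpus, now, *, days=3,
                   carriers=("newspaper", "website", "wechat"),
                   min_words=200, top_n=5, match_rate=0.6, threshold=0.8):
    """Return (doc id, similarity) pairs judged identical to `original`.

    `original` needs a "weights" map; each corpus entry is a dict with the
    hypothetical keys id, published, carrier, word_count and weights.
    """
    # Step a: restrict the search range by time, carrier and word count.
    since = now - timedelta(days=days)
    in_range = [d for d in corpus
                if d["published"] >= since
                and d["carrier"] in carriers
                and d["word_count"] >= min_words]

    # Step b: the N highest-weighted keywords must match at a given rate.
    w = original["weights"]
    keys = sorted(w, key=w.get, reverse=True)[:top_n]
    needed = math.ceil(match_rate * len(keys))
    hits = [d for d in in_range
            if sum(k in d["weights"] for k in keys) >= needed]

    # Step c: descending order of keyword relevance.
    hits.sort(key=lambda d: sum(d["weights"].get(k, 0.0) for k in keys),
              reverse=True)

    # Steps d and e: compare one by one and apply the threshold.
    results = []
    for d in hits:
        sim = cosine(w, d["weights"])
        if sim > threshold:
            results.append((d["id"], sim))
    return results
```

Against a real Elasticsearch cluster, step b would more naturally be expressed as a `bool` query whose `should` clauses hold the keywords, with `minimum_should_match` playing the role of the match rate and range and term filters implementing step a.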
Claims (5)
1. A document similarity calculation method, characterized by:
S01, document decomposition: the original document and the target document are each segmented into words to obtain their respective token sets;
S02, preprocessing and feature weighting:
TF-IDF is used to compute a weight for each token and to extract the core keywords;
Word2vec is used to mine the degree of association between different tokens in a document, so that each document undergoes semantic analysis;
S03, vector space model and cosine similarity algorithm:
the original document and the target document are reduced to two N-dimensional vectors whose components are keyword weights;
the document cosine similarity algorithm is based on the vector model and uses the cosine of the angle between two vectors in the vector space to measure how similar two documents are; the cosine value lies between 0 and 1, and the larger it is, the more similar the two documents.
2. The document similarity calculation method according to claim 1, characterized in that step S01 includes:
data preparation: interference information in the document is cleaned out by the ETL data cleaning system, and the document is structured and decomposed into smallest-unit structures;
infrastructure: a full-text index is built with the ElasticSearch search engine as the basic component, and the index is created using the fine-grained segmentation mode of a Chinese word segmentation library.
3. The document similarity calculation method according to claim 1, characterized in that step S02 uses TF-IDF to delete from the document, according to the words in the inverse-document dictionary, tokens that occur very frequently but contribute little to identifying the text content.
4. A whole-network retrieval and tracking method for similar documents, characterized by:
a, setting the search range;
b, setting the search conditions: the N core keywords with the highest TF-IDF weights in the original document are extracted, and a search of the full library is performed at a given match rate using the ES full-text search engine;
c, sorting in descending order: the retrieved documents are sorted in descending order of the relevance weight between keywords and document;
d, comparing the document with the highest weight one by one against every retrieved document, applying the document similarity calculation method according to any one of claims 1 to 3 to compute the similarity of the two documents;
e, checking whether the similarity result is higher than N%: if it is, the two documents are judged to be the same; otherwise they are judged to be different documents.
5. The whole-network retrieval and tracking method for similar documents according to claim 4, characterized in that step S01 includes setting the time range in which the documents to be retrieved were published, the carrier on which they were published, and the word count and type of the documents to be retrieved.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610398902.4A CN106095737A (en) | 2016-06-07 | 2016-06-07 | Document similarity calculation method and whole-network retrieval and tracking method for similar documents |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106095737A true CN106095737A (en) | 2016-11-09 |
Family
ID=57227368
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610398902.4A Pending CN106095737A (en) | Document similarity calculation method and whole-network retrieval and tracking method for similar documents | 2016-06-07 | 2016-06-07 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106095737A (en) |
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106933787A (en) * | 2017-03-20 | 2017-07-07 | 上海智臻智能网络科技股份有限公司 | Adjudicate the computational methods of document similarity, search device and computer equipment |
CN107330820A (en) * | 2017-08-28 | 2017-11-07 | 北京智诚律法科技有限公司 | A kind of forecasting system and method for lawsuit result |
CN107506204A (en) * | 2017-09-30 | 2017-12-22 | 福建星瑞格软件有限公司 | A kind of function reconstructing method of the code similarity-rough set based on the cosine law |
CN107577774A (en) * | 2017-09-08 | 2018-01-12 | 北京智诚律法科技有限公司 | A kind of intelligent selection draws up a contract the system of lawyer |
CN107749034A (en) * | 2017-11-17 | 2018-03-02 | 浙江工业大学 | A kind of safe friend recommendation method in social networks |
CN107943762A (en) * | 2017-11-24 | 2018-04-20 | 四川长虹电器股份有限公司 | A kind of text similarity sort method based on ES search |
CN108009599A (en) * | 2017-12-27 | 2018-05-08 | 福建中金在线信息科技有限公司 | A kind of original document determination methods, device, electronic equipment and storage medium |
CN108206020A (en) * | 2016-12-16 | 2018-06-26 | 北京智能管家科技有限公司 | A kind of audio recognition method, device and terminal device |
CN108241699A (en) * | 2016-12-26 | 2018-07-03 | 百度在线网络技术(北京)有限公司 | For the method and apparatus of pushed information |
CN108932228A (en) * | 2018-06-06 | 2018-12-04 | 武汉斗鱼网络科技有限公司 | INDUSTRY OVERVIEW and subregion matching process, device, server and storage medium is broadcast live |
CN109117474A (en) * | 2018-06-25 | 2019-01-01 | 广州多益网络股份有限公司 | Calculation method, device and the storage medium of statement similarity |
CN109255018A (en) * | 2018-08-31 | 2019-01-22 | 沈文策 | A kind of method and apparatus identifying similar article |
CN109271626A (en) * | 2018-08-31 | 2019-01-25 | 北京工业大学 | Text semantic analysis method |
CN109376231A (en) * | 2018-09-29 | 2019-02-22 | 杭州凡闻科技有限公司 | A kind of media hotspot tracking and system |
CN109460415A (en) * | 2018-11-26 | 2019-03-12 | 江苏科技大学 | A kind of similar fixture search method based on N-dimensional vector included angle cosine |
CN109508373A (en) * | 2018-11-13 | 2019-03-22 | 深圳前海微众银行股份有限公司 | Calculation method, equipment and the computer readable storage medium of enterprise's public opinion index |
CN109582964A (en) * | 2018-11-29 | 2019-04-05 | 天津工业大学 | Intelligent legal advice auxiliary system based on marriage law judicial decision document big data |
CN109614478A (en) * | 2018-12-18 | 2019-04-12 | 北京中科闻歌科技股份有限公司 | Construction method, key word matching method and the device of term vector model |
CN109948121A (en) * | 2017-12-20 | 2019-06-28 | 北京京东尚科信息技术有限公司 | Article similarity method for digging, system, equipment and storage medium |
CN109977196A (en) * | 2019-03-29 | 2019-07-05 | 云南电网有限责任公司电力科学研究院 | A kind of detection method and device of magnanimity document similarity |
CN110532569A (en) * | 2019-09-05 | 2019-12-03 | 浪潮软件股份有限公司 | A kind of data collision method and system based on Chinese word segmentation |
CN110674388A (en) * | 2018-07-03 | 2020-01-10 | 百度在线网络技术(北京)有限公司 | Mapping method and device for push item, storage medium and terminal equipment |
CN110737839A (en) * | 2019-10-22 | 2020-01-31 | 京东数字科技控股有限公司 | Short text recommendation method, device, medium and electronic equipment |
CN111104794A (en) * | 2019-12-25 | 2020-05-05 | 同方知网(北京)技术有限公司 | Text similarity matching method based on subject words |
CN111104790A (en) * | 2018-10-10 | 2020-05-05 | 百度在线网络技术(北京)有限公司 | Method, device and equipment for extracting key relation and computer readable medium |
CN111144068A (en) * | 2019-11-26 | 2020-05-12 | 方正璞华软件(武汉)股份有限公司 | Similar arbitration case recommendation method and device |
CN111444450A (en) * | 2019-01-16 | 2020-07-24 | 北大方正集团有限公司 | Method and device for determining reprinted data |
CN111666428A (en) * | 2020-06-04 | 2020-09-15 | 杭州凡闻科技有限公司 | Network media propagation evaluation method |
CN111767365A (en) * | 2019-03-12 | 2020-10-13 | 株式会社理光 | Document retrieval apparatus and method |
CN111859896A (en) * | 2019-04-01 | 2020-10-30 | 长鑫存储技术有限公司 | Formula document detection method and device, computer readable medium and electronic equipment |
CN112270183A (en) * | 2020-10-21 | 2021-01-26 | 北京钛氪新媒体科技有限公司 | News spreading effect monitoring system based on text |
CN112949304A (en) * | 2021-03-24 | 2021-06-11 | 中新国际联合研究院 | Construction case knowledge reuse query method and device |
CN113254634A (en) * | 2021-02-04 | 2021-08-13 | 天津德尔塔科技有限公司 | File classification method and system based on phase space |
WO2021253873A1 (en) * | 2020-06-15 | 2021-12-23 | 语联网(武汉)信息技术有限公司 | Method and apparatus for retrieving similar document |
CN115686432A (en) * | 2022-12-30 | 2023-02-03 | 药融云数字科技(成都)有限公司 | Document evaluation method for retrieval sorting, storage medium and terminal |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1790321A (en) * | 2005-10-28 | 2006-06-21 | 北大方正集团有限公司 | Fast similarity-based retrieval method for mass text |
CN101055580A (en) * | 2006-04-13 | 2007-10-17 | Lg电子株式会社 | System, method and user interface for retrieving documents |
US7440947B2 (en) * | 2004-11-12 | 2008-10-21 | Fuji Xerox Co., Ltd. | System and method for identifying query-relevant keywords in documents with latent semantic analysis |
CN101980196A (en) * | 2010-10-25 | 2011-02-23 | 中国农业大学 | Article comparison method and device |
CN102567364A (en) * | 2010-12-24 | 2012-07-11 | 鸿富锦精密工业(深圳)有限公司 | File search system and method |
CN103294693A (en) * | 2012-02-27 | 2013-09-11 | 华为技术有限公司 | Searching method, server and system |
CN104102626A (en) * | 2014-07-07 | 2014-10-15 | 厦门推特信息科技有限公司 | Method for computing semantic similarities among short texts |
CN104679728A (en) * | 2015-02-06 | 2015-06-03 | 中国农业大学 | Text similarity detection device |
CN105095430A (en) * | 2015-07-22 | 2015-11-25 | 深圳证券信息有限公司 | Method and device for setting up word network and extracting keywords |
CN105488151A (en) * | 2015-11-27 | 2016-04-13 | 小米科技有限责任公司 | Reference document recommendation method and apparatus |
Non-Patent Citations (4)
Title |
---|
于天恩: "《Lucene搜索引擎开发权威经典》", 31 October 2008, 中国铁道出版社 * |
吉志薇: "改进的TF-IDF算法在作品抄袭判定中的应用", 《文教资料》 * |
庄毅: "《面向互联网的多媒体大数据信息高效查询处理》", 1 June 2015 * |
潘华,项同德: "《数据仓库与数据挖掘原理、工具及应用》", 31 December 2007, 中国电力出版社 * |
Cited By (48)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108206020A (en) * | 2016-12-16 | 2018-06-26 | 北京智能管家科技有限公司 | A kind of audio recognition method, device and terminal device |
CN108241699B (en) * | 2016-12-26 | 2022-03-11 | 百度在线网络技术(北京)有限公司 | Method and device for pushing information |
CN108241699A (en) * | 2016-12-26 | 2018-07-03 | 百度在线网络技术(北京)有限公司 | For the method and apparatus of pushed information |
CN106933787A (en) * | 2017-03-20 | 2017-07-07 | 上海智臻智能网络科技股份有限公司 | Adjudicate the computational methods of document similarity, search device and computer equipment |
CN107330820A (en) * | 2017-08-28 | 2017-11-07 | 北京智诚律法科技有限公司 | A kind of forecasting system and method for lawsuit result |
CN107577774A (en) * | 2017-09-08 | 2018-01-12 | 北京智诚律法科技有限公司 | A kind of intelligent selection draws up a contract the system of lawyer |
CN107506204A (en) * | 2017-09-30 | 2017-12-22 | 福建星瑞格软件有限公司 | A kind of function reconstructing method of the code similarity-rough set based on the cosine law |
CN107506204B (en) * | 2017-09-30 | 2020-08-25 | 福建星瑞格软件有限公司 | Code similarity comparison function reconstruction method based on cosine theorem |
CN107749034A (en) * | 2017-11-17 | 2018-03-02 | 浙江工业大学 | A kind of safe friend recommendation method in social networks |
CN107943762A (en) * | 2017-11-24 | 2018-04-20 | 四川长虹电器股份有限公司 | A kind of text similarity sort method based on ES search |
CN109948121A (en) * | 2017-12-20 | 2019-06-28 | 北京京东尚科信息技术有限公司 | Article similarity method for digging, system, equipment and storage medium |
CN108009599A (en) * | 2017-12-27 | 2018-05-08 | 福建中金在线信息科技有限公司 | A kind of original document determination methods, device, electronic equipment and storage medium |
CN108932228B (en) * | 2018-06-06 | 2023-08-08 | 广东南方报业移动媒体有限公司 | Live broadcast industry news and partition matching method and device, server and storage medium |
CN108932228A (en) * | 2018-06-06 | 2018-12-04 | 武汉斗鱼网络科技有限公司 | INDUSTRY OVERVIEW and subregion matching process, device, server and storage medium is broadcast live |
CN109117474A (en) * | 2018-06-25 | 2019-01-01 | 广州多益网络股份有限公司 | Calculation method, device and the storage medium of statement similarity |
CN110674388A (en) * | 2018-07-03 | 2020-01-10 | 百度在线网络技术(北京)有限公司 | Mapping method and device for push item, storage medium and terminal equipment |
CN109271626B (en) * | 2018-08-31 | 2023-09-26 | 北京工业大学 | Text semantic analysis method |
CN109271626A (en) * | 2018-08-31 | 2019-01-25 | 北京工业大学 | Text semantic analysis method |
CN109255018A (en) * | 2018-08-31 | 2019-01-22 | 沈文策 | A kind of method and apparatus identifying similar article |
CN109376231A (en) * | 2018-09-29 | 2019-02-22 | 杭州凡闻科技有限公司 | A kind of media hotspot tracking and system |
CN111104790A (en) * | 2018-10-10 | 2020-05-05 | 百度在线网络技术(北京)有限公司 | Method, device and equipment for extracting key relation and computer readable medium |
CN111104790B (en) * | 2018-10-10 | 2024-03-22 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and computer readable medium for extracting key relation |
CN109508373B (en) * | 2018-11-13 | 2021-08-06 | 深圳前海微众银行股份有限公司 | Method and device for calculating enterprise public opinion index and computer readable storage medium |
CN109508373A (en) * | 2018-11-13 | 2019-03-22 | 深圳前海微众银行股份有限公司 | Method and device for calculating enterprise public opinion index and computer readable storage medium |
CN109460415A (en) * | 2018-11-26 | 2019-03-12 | 江苏科技大学 | Similar fixture retrieval method based on N-dimensional vector included angle cosine |
CN109460415B (en) * | 2018-11-26 | 2021-09-21 | 江苏科技大学 | Similar fixture retrieval method based on N-dimensional vector included angle cosine |
CN109582964A (en) * | 2018-11-29 | 2019-04-05 | 天津工业大学 | Intelligent legal advice auxiliary system based on marriage law judicial decision document big data |
CN109614478A (en) * | 2018-12-18 | 2019-04-12 | 北京中科闻歌科技股份有限公司 | Word vector model construction method, keyword matching method and device |
CN109614478B (en) * | 2018-12-18 | 2020-12-08 | 北京中科闻歌科技股份有限公司 | Word vector model construction method, keyword matching method and device |
CN111444450A (en) * | 2019-01-16 | 2020-07-24 | 北大方正集团有限公司 | Method and device for determining reprinted data |
CN111767365A (en) * | 2019-03-12 | 2020-10-13 | 株式会社理光 | Document retrieval apparatus and method |
CN109977196A (en) * | 2019-03-29 | 2019-07-05 | 云南电网有限责任公司电力科学研究院 | Method and device for detecting similarity of massive documents |
CN111859896B (en) * | 2019-04-01 | 2022-11-25 | 长鑫存储技术有限公司 | Formula document detection method and device, computer readable medium and electronic equipment |
CN111859896A (en) * | 2019-04-01 | 2020-10-30 | 长鑫存储技术有限公司 | Formula document detection method and device, computer readable medium and electronic equipment |
CN110532569A (en) * | 2019-09-05 | 2019-12-03 | 浪潮软件股份有限公司 | Data collision method and system based on Chinese word segmentation |
CN110532569B (en) * | 2019-09-05 | 2023-03-28 | 浪潮软件股份有限公司 | Data collision method and system based on Chinese word segmentation |
CN110737839A (en) * | 2019-10-22 | 2020-01-31 | 京东数字科技控股有限公司 | Short text recommendation method, device, medium and electronic equipment |
CN111144068A (en) * | 2019-11-26 | 2020-05-12 | 方正璞华软件(武汉)股份有限公司 | Similar arbitration case recommendation method and device |
CN111104794A (en) * | 2019-12-25 | 2020-05-05 | 同方知网(北京)技术有限公司 | Text similarity matching method based on subject words |
CN111104794B (en) * | 2019-12-25 | 2023-07-04 | 同方知网数字出版技术股份有限公司 | Text similarity matching method based on subject term |
CN111666428A (en) * | 2020-06-04 | 2020-09-15 | 杭州凡闻科技有限公司 | Network media propagation evaluation method |
CN111666428B (en) * | 2020-06-04 | 2023-08-08 | 杭州凡闻科技有限公司 | Network media propagation force evaluation method |
WO2021253873A1 (en) * | 2020-06-15 | 2021-12-23 | 语联网(武汉)信息技术有限公司 | Method and apparatus for retrieving similar document |
CN112270183B (en) * | 2020-10-21 | 2024-03-19 | 北京钛氪新媒体科技有限公司 | News propagation effect monitoring system based on text |
CN112270183A (en) * | 2020-10-21 | 2021-01-26 | 北京钛氪新媒体科技有限公司 | News spreading effect monitoring system based on text |
CN113254634A (en) * | 2021-02-04 | 2021-08-13 | 天津德尔塔科技有限公司 | File classification method and system based on phase space |
CN112949304A (en) * | 2021-03-24 | 2021-06-11 | 中新国际联合研究院 | Construction case knowledge reuse query method and device |
CN115686432A (en) * | 2022-12-30 | 2023-02-03 | 药融云数字科技(成都)有限公司 | Document evaluation method for retrieval sorting, storage medium and terminal |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106095737A (en) | Documents Similarity computational methods and similar document the whole network retrieval tracking | |
CN105488024B (en) | Method and device for extracting web page topic sentences | |
CN103279478B (en) | Document feature extraction method based on distributed mutual information | |
Pereira et al. | Using web information for author name disambiguation | |
CN106599054B (en) | Method and system for classifying and pushing questions | |
CN107153658A (en) | Public opinion hot word discovery method based on weighted keyword algorithm | |
CN109388743B (en) | Language model determining method and device | |
CN103617157A (en) | Text similarity calculation method based on semantics | |
CN110543595B (en) | In-station searching system and method | |
CN101944099A (en) | Method for automatically classifying text documents by utilizing ontology | |
Han et al. | HIT at TREC 2012 Microblog Track. | |
CN105320646A (en) | Incremental clustering based news topic mining method and apparatus thereof | |
CN104484380A (en) | Personalized search method and personalized search device | |
CN110309251B (en) | Text data processing method, device and computer readable storage medium | |
CN103207864A (en) | Online novel content similarity comparison method | |
CN107526819A (en) | Big data public opinion analysis method oriented to short text topic models | |
Wu et al. | Extracting topics based on Word2Vec and improved Jaccard similarity coefficient | |
Sha et al. | Matching user accounts across social networks based on users message | |
Cui et al. | Personalized microblog recommendation using sentimental features | |
CN102063497A (en) | Open type knowledge sharing platform and entry processing method thereof | |
CN108509449B (en) | Information processing method and server | |
Juan | An effective similarity measurement for FAQ question answering system | |
Hong et al. | Project Rank: An internet topic evaluation model based on latent dirichlet allocation | |
CN102033961A (en) | Open-type knowledge sharing platform and polysemous word showing method thereof | |
Huang et al. | Study on multimedia network Weibo situational awareness model and emotional algorithm | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161109 |