CN107656916A - Simhash-based anti-cheating method for massive document collections - Google Patents

Info

Publication number
CN107656916A
Authority
CN
China
Prior art keywords
simhash
document
cheating
massive
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610588570.6A
Other languages
Chinese (zh)
Inventor
余漫游
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha Dry Network Technology Co Ltd
Original Assignee
Changsha Dry Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha Dry Network Technology Co Ltd
Priority to CN201610588570.6A
Publication of CN107656916A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Against the background of the need to combat cheating through duplicated documents on the internet, the invention designs and develops a Simhash-based anti-cheating technique for massive document collections. Simhash is the core algorithm for document deduplication; the invention improves the algorithm's process of obtaining document features by using part of speech as one factor in weighing word weight. Based on 64-bit document Simhash signatures, it provides document deduplication services along the user dimension, the full-text dimension, and the blacklist-library dimension, and can compare document similarity at two granularities: full text and paragraph.

Description

Simhash-based anti-cheating method for massive document collections
Technical field
The present invention relates to the field of Internet technology, and specifically to a technique based on the Simhash algorithm.
Background art
In this era of information explosion, duplicate documents on the network are increasingly common; according to statistics, duplicate web pages account for 30%-45% of the pages on the internet. Judging the similarity of documents on the network and handling them according to the result, for example by not indexing them or by deleting them, has become an important branch of Internet technology. Large numbers of similar documents are a very common phenomenon on the internet. They not only reduce product quality but are also unfriendly to users. How to prevent large numbers of duplicate or near-duplicate documents from appearing is a problem we face.
Summary of the invention
The Simhash algorithm was proposed by Charikar of Google. It converts a document into an n-bit signature and estimates the similarity of the original documents by comparing the similarity of their signatures: the closer the signatures, the more similar the documents. The whole process therefore does not involve pairwise comparison of the original documents' text, and the contents of the massive document set need not be stored, so the algorithm can scale to comparison over on the order of 10 billion documents. The algorithm is also simple and easy to understand, but achieving good results requires handling tailored to specific requirements. Simhash is currently a mainstream near-duplicate text detection algorithm.
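Comparing signatures comes down to counting the bit positions at which they differ (the Hamming distance). A minimal sketch follows; the function name `hamming_distance` is an illustrative choice, not something named in the patent.

```python
def hamming_distance(sig_a: int, sig_b: int) -> int:
    """Count the bit positions at which two integer signatures differ."""
    # XOR leaves a 1 exactly where the two signatures disagree.
    return bin(sig_a ^ sig_b).count("1")
```

A smaller distance means closer signatures and, by the reasoning above, more similar documents.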
Design of the Simhash-based anti-cheating technique for massive documents: high-speed retrieval. Each 64-bit signature is divided into four parts. If the Hamming distance between two signatures is less than 3, then by the pigeonhole principle at least one of the four parts must be identical, because fewer than four differing bits cannot fall in all four parts. A 64-bit signature can therefore be split evenly into four parts of 16 bits each. Each 16-bit binary value is used as a key, and the signatures containing that 16-bit key are stored in redis as the value. A signature to be compared is likewise split into four parts; each part is used as a key to fetch the corresponding value from redis, and Hamming distances are then computed only against the signatures fetched. This greatly narrows the range over which Hamming distances must be computed.
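The block-keyed lookup can be sketched as follows. This is only an illustration under stated assumptions: an in-memory dictionary stands in for redis, the class and function names are hypothetical, and the key includes the block position as well as the 16-bit value (a design choice of this sketch, so that equal values in different positions do not collide).

```python
from collections import defaultdict

BLOCK_BITS = 16
NUM_BLOCKS = 4
BLOCK_MASK = (1 << BLOCK_BITS) - 1


def split_blocks(sig):
    """Split a 64-bit signature into four (position, 16-bit value) keys."""
    return [(i, (sig >> (i * BLOCK_BITS)) & BLOCK_MASK) for i in range(NUM_BLOCKS)]


class SignatureIndex:
    """In-memory stand-in for the redis key/value store described above."""

    def __init__(self):
        self._buckets = defaultdict(set)

    def add(self, sig):
        # Store the signature under each of its four block keys.
        for key in split_blocks(sig):
            self._buckets[key].add(sig)

    def query(self, sig, max_distance=3):
        # Candidates share at least one identical block with the query.
        candidates = set()
        for key in split_blocks(sig):
            candidates |= self._buckets.get(key, set())
        # Hamming distance is computed only for the candidates.
        return sorted(c for c in candidates
                      if bin(c ^ sig).count("1") <= max_distance)
```

With four blocks, any stored signature within distance 3 of the query shares at least one block with it and is therefore always among the candidates; a larger threshold would need more blocks to keep that guarantee.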
Design of the Simhash-based anti-cheating technique for massive documents: document feature weight calculation. The Simhash algorithm uses the words that appear in a document as the document's features and the frequency of each word as the weight of that feature. Word frequency is an important indicator of document features, but using frequency alone as the weight still loses a certain amount of information. For example, segmenting the sentence "summer is hot" yields the word "summer" with frequency 1 and the word "hot" with frequency 1. Although both words have frequency 1, the core of the sentence is "summer": it represents more of the sentence's features and should therefore be given a greater weight. In terms of part of speech, nouns characterize more of a document's features, so part of speech can serve as one factor in weighing word weight. The rule adopted for part of speech is that nouns have the highest weight, verbs come second, adjectives third, and all other words have the lowest weight. Using part of speech as a factor in word weight represents the document's features more completely, so the resulting Simhash signature is more reasonable, which in turn improves the accuracy of judging whether documents are similar.
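The text does not state how frequency and part of speech are combined into a single weight. One simple possibility, given here purely as an assumption, is a multiplicative form in which f(t) is the frequency of word t and α is a hypothetical per-part-of-speech factor ordered as the text prescribes:

```latex
w(t) = f(t)\cdot\alpha_{\mathrm{pos}(t)}, \qquad
\alpha_{\mathrm{noun}} > \alpha_{\mathrm{verb}} > \alpha_{\mathrm{adj}} > \alpha_{\mathrm{other}} > 0
```

With illustrative values such as α_noun = 4 and α_adj = 2, a noun and an adjective that both occur once would receive weights 4 and 2, so the noun contributes more to the signature even though the frequencies are equal.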
Design of the Simhash-based anti-cheating technique for massive documents: Simhash signature calculation. Computing a document's Simhash signature is the core process of the document anti-cheating technique. This section describes the signature calculation process:
1. Overall flow. Computing a Simhash signature is divided into the following main steps:
1) If the request parameter directly passes discretized document features, go directly to step 3; if the request parameter is document content, perform step 2;
2) Obtain the discretized document features;
3) Compute the document signature according to the Simhash algorithm;
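Step 3 can be sketched as the usual Simhash accumulation over weighted features. The text does not say which hash function maps a word to 64 bits; the use of the first 8 bytes of MD5 below is an assumption of this sketch, as is the function name `simhash64`.

```python
import hashlib

SIGNATURE_BITS = 64


def simhash64(features):
    """Compute a 64-bit Simhash signature from a {word: weight} mapping."""
    totals = [0.0] * SIGNATURE_BITS
    for word, weight in features.items():
        # Assumption: hash each word to 64 bits with the first 8 bytes of MD5.
        digest = hashlib.md5(word.encode("utf-8")).digest()
        word_hash = int.from_bytes(digest[:8], "big")
        for i in range(SIGNATURE_BITS):
            # A 1 bit adds the weight to this position, a 0 bit subtracts it.
            if (word_hash >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    signature = 0
    for i, total in enumerate(totals):
        # Positions with a positive total become 1 in the signature.
        if total > 0:
            signature |= 1 << i
    return signature
```

Because each feature votes on every bit in proportion to its weight, heavily weighted words dominate the signature, which is why the choice of weights matters.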
2. Obtaining document features: The Simhash algorithm computes a document's signature from the discretized document features. The better the extracted features characterize the meaning of the original document's content, the more meaningful the generated signature. Traditional Simhash uses the words appearing in the document and their frequencies as document features, which loses part of the information; this system also uses each word's part of speech as a factor in characterizing document features. To improve accuracy, the system can also apply some basic processing during the calculation, such as document preprocessing. The main steps for obtaining document features in this system are as follows:
1) Preprocess the document (optional);
2) Segment the preprocessed document into words;
3) Remove stop words (optional);
4) Count word frequencies and obtain each word's part of speech;
5) Compute word weights according to the Simhash algorithm;
Through the five steps above, the discretized document features can be obtained from the given document content, providing the basis for computing the document's Simhash value;
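Steps 2 to 5 can be sketched as below. Everything concrete here is an assumption for illustration: whitespace splitting stands in for a real word segmenter, the part-of-speech lookup is a plain dictionary supplied by the caller, and the stop-word set, the factor values, and the multiplicative combination of frequency and part-of-speech factor are hypothetical.

```python
from collections import Counter

# Hypothetical factors: noun highest, verb second, adjective third, others lowest.
POS_FACTOR = {"noun": 4.0, "verb": 3.0, "adjective": 2.0}
DEFAULT_FACTOR = 1.0
STOP_WORDS = {"the", "is", "a", "of"}


def extract_features(text, pos_of, stop_words=STOP_WORDS):
    """Return {word: weight} for a document that is already preprocessed."""
    words = text.split()                               # step 2: segmentation (stand-in)
    words = [w for w in words if w not in stop_words]  # step 3: remove stop words
    counts = Counter(words)                            # step 4: word frequencies
    return {
        # step 5 (assumed form): frequency times part-of-speech factor
        word: count * POS_FACTOR.get(pos_of.get(word, "other"), DEFAULT_FACTOR)
        for word, count in counts.items()
    }
```

The resulting mapping is the kind of discretized feature set that a request could also pass in directly, skipping these steps.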
3. The first of these steps, preprocessing the document, is optional: when an instance sends a deduplication request, it can decide according to its own needs whether the document should be preprocessed. Preprocessing applies the following operations to the document content, in order:
1) Remove HTML tags;
2) Convert full-width characters to half-width;
3) Convert uppercase English letters to lowercase;
4) Convert Traditional Chinese to Simplified Chinese;
5) Remove spaces;
The goal of preprocessing is to eliminate, as far as possible, the influence of irrelevant or meaningless features on the document. If an instance using the system's service considers that preprocessing would lose important document features, it can set the textnorm parameter in the request to indicate that this request does not need preprocessing. An instance can also preprocess the document itself before sending the request to the system.
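The preprocessing steps listed above can be sketched with the standard library alone. This is a rough illustration, not the system's implementation: the regular expression removes anything that looks like a tag rather than parsing HTML, and the Traditional-to-Simplified conversion is left out because it needs a character mapping table or an external converter. The function names are hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def to_half_width(text):
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    chars = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:               # ideographic space -> ASCII space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:   # full-width forms -> ASCII
            code -= 0xFEE0
        chars.append(chr(code))
    return "".join(chars)


def normalize(text):
    text = TAG_RE.sub("", text)          # 1) remove HTML tags (rough)
    text = to_half_width(text)           # 2) full-width -> half-width
    text = text.lower()                  # 3) uppercase -> lowercase
    # 4) Traditional -> Simplified Chinese is omitted in this sketch.
    return WHITESPACE_RE.sub("", text)   # 5) remove spaces
```

Removing all spaces suits Chinese text, which is segmented by a word segmenter afterwards; for space-delimited languages an instance would presumably skip that step or preprocess on its own side.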
Implementation of full-text Simhash deduplication: In full-text Simhash deduplication, the granularity of deduplication is the whole document: a Simhash signature is generated from the content of the whole document, and document similarity is then judged by the Hamming distance between Simhash signatures. In this mode, the system first computes the document's Simhash value from its full text, then finds the documents whose Hamming distance to the document under test is within 4, and finally decides, according to the request parameters, whether to reset the expiry time of the matched documents.
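The match-then-optionally-refresh flow can be sketched as a small in-memory store. The class name, the `refresh` flag, and the explicit `now` argument are hypothetical; the linear scan is for brevity only and would be replaced by an indexed lookup at scale.

```python
import time


class FullTextStore:
    """Toy store of full-text signatures, each with an expiry time."""

    def __init__(self, ttl_seconds, max_distance=4):
        self.ttl = ttl_seconds
        self.max_distance = max_distance
        self._expiry = {}  # signature -> expiry timestamp

    def add(self, sig, now=None):
        now = time.time() if now is None else now
        self._expiry[sig] = now + self.ttl

    def find_similar(self, sig, refresh=False, now=None):
        now = time.time() if now is None else now
        matches = []
        for stored, expires in list(self._expiry.items()):
            if expires <= now:
                # Expired entries are dropped instead of being matched.
                del self._expiry[stored]
                continue
            if bin(stored ^ sig).count("1") <= self.max_distance:
                matches.append(stored)
                if refresh:
                    # The request asked to reset the expiry of matched documents.
                    self._expiry[stored] = now + self.ttl
        return matches
```

Refreshing on match keeps frequently duplicated documents in the store, while documents that are never matched age out.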
Implementation of paragraph-based Simhash deduplication: Full-text Simhash deduplication has a coarse granularity and is easily bypassed by cheaters: adding a paragraph before or after the original text, or inserting a passage in the middle, makes the Hamming distance larger. Where higher precision is required, finer-grained signature calculation is needed, such as paragraph-based signatures. Paragraph-based Simhash deduplication differs from the full-text variant in that the document to be processed must first be split into paragraphs, and a Simhash signature is then computed for each paragraph.
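A minimal sketch of the paragraph-level variant follows. The text does not say how paragraphs are delimited or how per-paragraph matches are aggregated; splitting on line breaks and reporting the fraction of matched paragraphs are both assumptions of this sketch, and `sign` is any caller-supplied function that maps text to an integer signature.

```python
def paragraph_signatures(text, sign):
    """Split a document on line breaks and sign each non-empty paragraph."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    return [sign(p) for p in paragraphs]


def shared_paragraph_ratio(sigs_a, sigs_b, max_distance=4):
    """Fraction of paragraphs in A with a near-duplicate paragraph in B."""
    if not sigs_a:
        return 0.0
    matched = sum(
        1 for a in sigs_a
        if any(bin(a ^ b).count("1") <= max_distance for b in sigs_b)
    )
    return matched / len(sigs_a)
```

Because each paragraph is signed on its own, text added before or after the original leaves the original paragraphs' signatures unchanged, which is what makes this granularity harder to evade.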
Test results and analysis of the Simhash-based anti-cheating technique for massive documents:
1. The accuracy of Simhash deduplication was tested first. To verify the effect of Simhash document similarity calculation, 103 pairs of texts were sampled, the 64-bit signature of each text was computed, and the Hamming distance of each pair of signatures was compared;
2. In the figure, the abscissa is the document id (206 documents, 103 pairs) and the ordinate is the Hamming distance of each pair of documents. The Hamming distances fall into three layers: the first layer is 0-4, the second layer is 5-19, and the third layer is greater than 19. Analysis of the document contents shows:
1) The texts are identical, or are long texts that differ by only a few bytes; the Hamming distance of such texts is generally below 4;
2) Some texts in the sample differ greatly in content; the Hamming distance of such texts is relatively large;
3. Two kinds of text in the sample generally have Hamming distances in the 5-19 layer. One kind is partially identical text, for example text consisting of two paragraphs in which the first is the same and the second differs. The other kind is short text of about a dozen Chinese characters in which one or two characters differ.
Next, the relationship between the system's request rate and latency was tested. With the data stored in the system held constant at 10 million (1000w) documents, the number of requests per second was increased continuously and the time taken by each request was compared. The relationship between request rate and latency shows the following trend:
1) At a request rate of 2000 requests/s, latency is stable within 10 ms; the request count is not the bottleneck of the system service;
2) At a request rate of 6000 requests/s, latency is stable at about 20 ms; the request count is still not the bottleneck of the service;
3) Once the request rate exceeds 6000 requests/s, latency increases markedly, indicating that the request count has become the bottleneck affecting the service;
In theory, the longer the document expiry time is set, the longer documents remain in the storage system and the more storage space they occupy, which affects query efficiency and therefore request latency. This is also why the system provides handling for cold and hot data.
In response to current needs in document anti-cheating, the Simhash-based anti-cheating technique for massive documents was developed; with the improved Simhash algorithm it can respond to external requests in real time. The work covers new instance registration, instance data import, and similar-document search. Document deduplication supports strategies along the user, full-text, and blacklist-library dimensions. In terms of granularity, Simhash deduplication at full-text and paragraph granularity is supported. Handling of cold and hot data is supported. The technique is built for massive data; currently each instance can support 200 million documents. In addition, through the cold and hot data handling strategy, an instance's data can be kept within a relatively stable range and will not grow too quickly as the instance's own data grows.

Claims (3)

  1. A Simhash-based anti-cheating method for massive documents, characterized in that: in response to current needs in document anti-cheating, a Simhash-based anti-cheating technique for massive documents is designed and developed; testing shows that the program is stable, processes large-scale data with high efficiency, can serve multiple instances, and can meet the demand of processing massive documents.
  2. The method according to claim 1, characterized in that the key techniques of the Simhash-based anti-cheating technique for massive documents are: high-speed retrieval, document feature weight calculation, Simhash signature calculation, the implementation of full-text Simhash deduplication, and the implementation of paragraph-based Simhash deduplication.
  3. The method according to claim 1, characterized in that the accuracy of Simhash deduplication is tested in order to verify the effect of Simhash document similarity calculation and to analyze the results of the technique.
CN201610588570.6A 2016-07-25 2016-07-25 Simhash-based anti-cheating method for massive document collections Pending CN107656916A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610588570.6A CN107656916A (en) 2016-07-25 2016-07-25 Simhash-based anti-cheating method for massive document collections

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610588570.6A CN107656916A (en) 2016-07-25 2016-07-25 Simhash-based anti-cheating method for massive document collections

Publications (1)

Publication Number Publication Date
CN107656916A true CN107656916A (en) 2018-02-02

Family

ID=61126905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610588570.6A Pending CN107656916A (en) 2016-07-25 2016-07-25 Simhash-based anti-cheating method for massive document collections

Country Status (1)

Country Link
CN (1) CN107656916A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710729A (en) * 2018-12-14 2019-05-03 麒麟合盛网络技术股份有限公司 A kind of acquisition method and device of text data
CN110162752A (en) * 2019-05-13 2019-08-23 百度在线网络技术(北京)有限公司 Article sentences weight processing method, device and electronic equipment
CN110162752B (en) * 2019-05-13 2023-06-27 百度在线网络技术(北京)有限公司 Article judging and re-processing method and device and electronic equipment
CN113197428A (en) * 2021-04-27 2021-08-03 日照职业技术学院 Anti-cheating desk for higher education
CN113377927A (en) * 2021-06-28 2021-09-10 成都卫士通信息产业股份有限公司 Similar document detection method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11182435B2 (en) Model generation device, text search device, model generation method, text search method, data structure, and program
Stamatatos Author identification using imbalanced and limited training texts
CN101685448A (en) Method and device for establishing association between query operation of user and search result
CN107656916A (en) A kind of anti-technical method of practising fraud of the magnanimity document of Simhash algorithms
CN102411564A (en) Electronic operation plagiarism detection method
CN102081602A (en) Method and equipment for determining category of unlisted word
KR101541306B1 (en) Computer enabled method of important keyword extraction, server performing the same and storage media storing the same
CN110166847A (en) Barrage treating method and apparatus
Jiang et al. A unified neural network approach to e-commerce relevance learning
Chifu et al. Word sense disambiguation to improve precision for ambiguous queries
Rahman et al. An efficient deep learning technique for bangla fake news detection
CN102622378A (en) Method and device for detecting events from text flow
Yong et al. A neural-based text summarization system
CN105808602A (en) Detection method and device of junk information
Gao et al. Few-shot fake news detection via prompt-based tuning
Agarwal et al. Intelligent plagiarism detection mechanism using semantic technology: A different approach
Rofiq Indonesian news extractive text summarization using latent semantic analysis
Gardner et al. Automatic link detection: a sequence labeling approach
Li et al. Sentiment classification of financial microblogs through automatic text summarization
Zheng et al. Research on domain term extraction based on conditional random fields
Salahuddin et al. Automatic identification of Urdu fake news using logistic regression model
Akbari et al. Sentiment Analysis Using Learning Vector Quantization Method
Liang et al. Extracting keyphrases from chinese news articles using textrank and query log knowledge
CN108287851A (en) The anti-scheme of practising fraud of document based on Simhash technologies
Sardar et al. FakeDTML at CheckThat!-2023: Identifying Check-Worthiness of Tweets and Debate Snippets.

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180202