CN107656916A - An anti-cheating method for massive documents based on the Simhash algorithm - Google Patents
- Publication number
- CN107656916A (application number CN201610588570.6A)
- Authority
- CN
- China
- Prior art keywords
- simhash
- document
- cheating
- massive
- weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention addresses the need to counter cheating through duplicate documents on the Internet by designing and developing a Simhash-based anti-cheating technique for massive documents. The Simhash algorithm serves as the core duplicate-detection algorithm, and the process of obtaining document features is improved by using part of speech as one factor in weighing word weight. Based on 64-bit document Simhash signatures, the system provides duplicate-detection services along the user, full-text, and blacklist dimensions, and can compare document similarity at two granularities: full text and paragraph.
Description
Technical field
The present invention relates to the field of Internet technology and concerns a technique based on the Simhash algorithm.
Background
In this era of information explosion, duplicate documents on the network are increasingly common. According to statistics, duplicate web pages account for 30%-45% of the pages on the Internet. Judging the similarity of documents on the network and handling them according to the result, for example by not indexing them or by deleting them, has become an important branch of Internet technology. Large numbers of similar documents are a very common phenomenon on the Internet. They not only lower product quality but are also unfriendly to users, so how to avoid large numbers of duplicate or near-duplicate documents is a problem we must face.
Summary of the invention
The Simhash algorithm was proposed by Charikar of Google. It converts a document into an n-bit signature and estimates the similarity of the original documents by comparing the similarity of their signatures: the closer the signatures, the more similar the documents. The whole process therefore does not involve pairwise comparison of the original text content, and the content of the massive document collection need not be stored, so the algorithm can scale to comparisons on the order of 10 billion documents. In addition, the algorithm is simple and easy to understand, although achieving good results still requires handling tailored to specific requirements. Simhash is currently the mainstream near-duplicate text detection algorithm.
Design of the Simhash-based anti-cheating technique for massive documents: high-speed retrieval. Each 64-bit signature is divided into four parts. If the Hamming distance between two signatures is less than 3, then by the pigeonhole principle at least one of the four parts must be identical in both signatures. The 64-bit signature can therefore be split evenly into 4 parts of 16 bits each; each 16-bit binary value is used as a key, and the signatures containing that 16-bit key are stored in Redis as its value. A signature to be compared is likewise divided into 4 parts, each part is used as a key to fetch values from Redis, and Hamming distances are then computed only against the fetched signatures. This greatly narrows the range over which Hamming distances must be computed.
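The retrieval scheme above can be sketched as follows. This is a minimal illustration, not the patent's implementation: an in-memory dictionary stands in for Redis, the names `SignatureIndex`, `split_blocks`, and `hamming` are hypothetical, and the key includes the block position as well as the 16-bit value, which is an assumption the text does not spell out. With four blocks, the pigeonhole argument guarantees a shared block for any pair whose distance is at most 3.

```python
from collections import defaultdict

BLOCKS = 4        # number of equal parts per signature
BLOCK_BITS = 16   # 64 bits / 4 parts
MASK = (1 << BLOCK_BITS) - 1


def split_blocks(signature: int) -> list[int]:
    """Split a 64-bit signature into four 16-bit blocks, highest block first."""
    return [(signature >> (BLOCK_BITS * (BLOCKS - 1 - i))) & MASK for i in range(BLOCKS)]


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


class SignatureIndex:
    """In-memory stand-in for the key/value store described in the text."""

    def __init__(self) -> None:
        # Key: (block position, 16-bit block value); value: signatures with that block.
        self.buckets: dict[tuple[int, int], set[int]] = defaultdict(set)

    def add(self, signature: int) -> None:
        for pos, block in enumerate(split_blocks(signature)):
            self.buckets[(pos, block)].add(signature)

    def query(self, signature: int, max_distance: int = 3) -> set[int]:
        # Collect only signatures sharing at least one block, then verify each.
        candidates: set[int] = set()
        for pos, block in enumerate(split_blocks(signature)):
            candidates |= self.buckets.get((pos, block), set())
        return {c for c in candidates if hamming(c, signature) <= max_distance}


index = SignatureIndex()
base = 0x0123456789ABCDEF
near = base ^ 0b101                 # differs from base in 2 bits
far = base ^ 0xFFFF0000FFFF0000     # differs from base in 32 bits
index.add(near)
index.add(far)
matches = index.query(base)
```

Here `far` still lands in the candidate set because two of its blocks match `base`, but the final Hamming check discards it, so only `near` is returned.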
Design of the Simhash-based anti-cheating technique for massive documents: document feature weight calculation. The Simhash algorithm uses the words occurring in a document as the document's features and the frequency of each word as the weight of that feature. Word frequency is an important indicator of document features, but using frequency alone as the weight still loses some information. For example, segmenting the sentence "summer is hot" yields the word "summer" with frequency 1 and the word "hot" with frequency 1. Although both words have frequency 1, the core of the sentence is "summer": it carries more of the sentence's features and should therefore be given a larger weight. In terms of part of speech, nouns characterize more of a document's features, so part of speech can serve as one factor in weighing word weight. The rule adopted is that nouns receive the highest weight, verbs the second highest, adjectives the third, and all other parts of speech the lowest. Using part of speech as a factor in word weight represents the document's features more completely, so the resulting Simhash signature is more reasonable, which in turn improves the accuracy of similar-document judgments.
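A small sketch of this weighting rule, applied to the two words of the example sentence. The text gives only the ranking of parts of speech, not concrete numbers, so the multipliers in `POS_FACTOR`, the multiplication of frequency by factor, and the function names are all assumptions made for illustration.

```python
# Hypothetical multipliers: the text specifies only the order
# noun > verb > adjective > everything else, not the values.
POS_FACTOR = {"noun": 4.0, "verb": 3.0, "adjective": 2.0}
DEFAULT_FACTOR = 1.0


def word_weight(frequency: int, pos: str) -> float:
    """Combine a word's frequency with a factor based on its part of speech."""
    return frequency * POS_FACTOR.get(pos, DEFAULT_FACTOR)


def weigh_features(words: list[tuple[str, int, str]]) -> dict[str, float]:
    """Map (word, frequency, part of speech) triples to feature weights."""
    return {word: word_weight(freq, pos) for word, freq, pos in words}


# Both words occur once, but the noun ends up with the larger weight.
weights = weigh_features([("summer", 1, "noun"), ("hot", 1, "adjective")])
```

Any monotone scheme that preserves the stated ranking would serve equally well here; the multiplicative form is simply the shortest to write down.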
Design of the Simhash-based anti-cheating technique for massive documents: Simhash signature calculation. Computing a document's Simhash signature is the core process of the document anti-cheating technique. This section describes that process:
1. Overall flow. Simhash signature calculation is divided into the following steps:
1) If the request parameters pass the discretized document features directly, go straight to step 3; if the request parameter is document content, perform step 2;
2) Obtain the discretized document features;
3) Compute the document signature according to the Simhash algorithm.
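Step 3 can be sketched with the standard Simhash construction: hash each feature to 64 bits, add the feature's weight to a per-bit total where the hash bit is 1 and subtract it where the bit is 0, then keep the sign of each total. The patent does not name a hash function, so the use of MD5 truncated to 64 bits, and the names `feature_hash` and `simhash`, are assumptions for illustration.

```python
import hashlib

BITS = 64


def feature_hash(feature: str) -> int:
    """64-bit hash of a feature; truncated MD5 is an illustrative choice only."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(features: dict[str, float]) -> int:
    """Compute a 64-bit Simhash signature from weighted, discretized features."""
    totals = [0.0] * BITS
    for feature, weight in features.items():
        h = feature_hash(feature)
        for bit in range(BITS):
            # Add the weight when the hash bit is set, subtract it otherwise.
            totals[bit] += weight if (h >> bit) & 1 else -weight
    signature = 0
    for bit, total in enumerate(totals):
        if total > 0:
            signature |= 1 << bit
    return signature
```

Because each bit is decided by a weighted vote, a feature with a dominant weight pulls the signature toward its own hash, which is why the choice of word weights matters for the quality of the signature.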
2. Obtaining document features. The Simhash algorithm computes a document's signature from its discretized features. The better the extracted features characterize the meaning of the original document's content, the more meaningful the generated signature. Traditional Simhash uses the words occurring in a document and their frequencies as document features, which loses some information; this system also uses each word's part of speech as a factor in characterizing document features. To improve the accuracy of the calculation, the system can also apply some basic processing, such as document preprocessing. The main steps for obtaining document features in this system are as follows:
1) Preprocess the document (optional);
2) Segment the preprocessed document into words;
3) Remove stop words (optional);
4) Count word frequencies and obtain each word's part of speech;
5) Compute word weights according to the Simhash algorithm.
With the 5 steps above, we can obtain the discretized document features from the given document content, which provide the basis for computing the document's Simhash value.
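Steps 2 to 4 above might look like the following sketch. The patent does not name a word segmenter, stop-word list, or part-of-speech tagger, so everything here is a stand-in: `segment` simply extracts lower-cased alphanumeric runs (Chinese text would need a real word segmenter), and `STOP_WORDS` and `POS_LEXICON` are tiny hypothetical tables.

```python
import re
from collections import Counter

# Hypothetical stop-word list and part-of-speech lexicon, for illustration only.
STOP_WORDS = {"the", "is", "a", "of"}
POS_LEXICON = {"summer": "noun", "hot": "adjective"}


def segment(text: str) -> list[str]:
    """Stand-in segmenter: lower-cased runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_features(text: str, remove_stop_words: bool = True) -> list[tuple[str, int, str]]:
    """Return (word, frequency, part of speech) triples for a document."""
    words = segment(text)                                      # step 2: segmentation
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]      # step 3: optional
    counts = Counter(words)                                    # step 4: frequencies
    return [(word, freq, POS_LEXICON.get(word, "other")) for word, freq in counts.items()]


features = extract_features("The summer is hot, the summer is long")
```

The resulting triples are the input that step 5 would turn into word weights.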
3. Document preprocessing. The first of these steps, preprocessing the document, is optional: when sending a request, each instance can decide according to its own needs whether the document should be preprocessed. Preprocessing applies the following operations to the document content in order:
1) Remove HTML tags;
2) Convert full-width characters to half-width;
3) Convert uppercase English letters to lowercase;
4) Convert Traditional Chinese to Simplified Chinese;
5) Remove spaces.
The goal of preprocessing is to eliminate, as far as possible, the influence of irrelevant or meaningless features on the document. If an instance using the system service believes that preprocessing would lose key document features, it can set the textnorm parameter in the request to indicate that this request does not need preprocessing. Each instance may also preprocess documents itself before sending the request to the system.
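The preprocessing sequence can be sketched as below, under a few stated simplifications: the tag-stripping regular expression is naive and not a full HTML parser, the full-width conversion covers only the full-width ASCII variants (U+FF01 to U+FF5E) and the ideographic space, and Traditional-to-Simplified conversion is left out because it needs a character mapping table or an external library. The function names are hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # naive tag matcher, adequate only for simple markup


def to_half_width(text: str) -> str:
    """Map full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:     # full-width '!' through '~'
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def preprocess(text: str) -> str:
    text = TAG_RE.sub("", text)        # 1) remove HTML tags
    text = to_half_width(text)         # 2) full-width to half-width
    text = text.lower()                # 3) uppercase to lowercase
    # 4) Traditional-to-Simplified conversion is omitted in this sketch.
    return re.sub(r"\s+", "", text)    # 5) remove whitespace


cleaned = preprocess("<p>ＡＢＣ　１２３ Hello</p>")
```

A caller that wants to skip normalization, as the textnorm parameter allows, would simply bypass `preprocess` and pass the raw text on to segmentation.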
Implementation of full-text Simhash duplicate detection: Full-text Simhash duplicate detection means that the granularity of detection is the whole document: a Simhash signature is generated from the content of the entire document, and document similarity is then judged by the Hamming distance between the computed signatures. In this mode, the system first computes the document's Simhash value from its full text, then finds the documents whose Hamming distance from the document under test is within 4, and finally decides, according to the request parameters, whether to reset the expiry time of the matched documents.
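The matching step can be illustrated with a plain linear scan over stored signatures. This is a sketch only: `find_duplicates` and the dictionary of stored signatures are hypothetical, "within 4" is read as a distance of at most 4, and a production system would narrow the candidates through the block-keyed lookup described in the retrieval design rather than scanning everything.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def find_duplicates(signature: int, stored: dict[str, int], max_distance: int = 4) -> list[str]:
    """Return ids of stored documents whose signature is within max_distance bits."""
    return [
        doc_id
        for doc_id, stored_signature in stored.items()
        if hamming_distance(signature, stored_signature) <= max_distance
    ]


stored = {"doc-a": 0b00000, "doc-b": 0b01111, "doc-c": 0b11111}
matched = find_duplicates(0b00000, stored)
```

The ids in `matched` are the documents whose expiry time the request parameters may ask the system to reset.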
Implementation of paragraph-based Simhash duplicate detection: Full-text Simhash duplicate detection is coarse-grained and easily bypassed by cheaters; for example, adding a paragraph before or after the original text, or inserting a passage in the middle, increases the Hamming distance. Scenarios that demand higher precision need finer-grained signatures, such as paragraph-based signatures. Paragraph-based Simhash duplicate detection differs from full-text detection in that the document to be processed must first be split into paragraphs, and a Simhash signature is then computed for each paragraph.
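A rough sketch of paragraph-level signing follows. The patent does not say how paragraphs are delimited or how per-paragraph results are combined, so splitting on blank lines, counting matched paragraphs, the whitespace-token features, and the truncated-MD5 hash are all assumptions chosen to keep the example short.

```python
import hashlib
import re
from collections import Counter


def simhash(text: str, bits: int = 64) -> int:
    """Simhash over whitespace tokens weighted by frequency (illustrative features)."""
    totals = [0] * bits
    for token, freq in Counter(text.lower().split()).items():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for bit in range(bits):
            totals[bit] += freq if (h >> bit) & 1 else -freq
    return sum(1 << bit for bit, total in enumerate(totals) if total > 0)


def paragraph_signatures(document: str) -> list[int]:
    """Split on blank lines and sign each non-empty paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    return [simhash(p) for p in paragraphs]


def shared_paragraphs(doc_a: str, doc_b: str, max_distance: int = 4) -> int:
    """Count paragraphs of doc_a that have a near-duplicate paragraph in doc_b."""
    signatures_b = paragraph_signatures(doc_b)
    return sum(
        any(bin(sig_a ^ sig_b).count("1") <= max_distance for sig_b in signatures_b)
        for sig_a in paragraph_signatures(doc_a)
    )


original = "alpha beta gamma delta\n\nepsilon zeta eta theta"
padded = "unrelated opening words here\n\n" + original
```

Prepending a paragraph changes the full-text signature of `padded`, yet both paragraphs of `original` still find an exact match inside it, which is the evasion case the paragraph-level approach is meant to catch.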
Test results and analysis of the Simhash-based anti-cheating technique for massive documents:
1. The accuracy of Simhash duplicate detection was tested first. To verify the effectiveness of Simhash document similarity calculation, 103 pairs of texts were sampled, the 64-bit signature of each text was computed, and the Hamming distance of each pair of signatures was compared.
2. In the figure, the abscissa is the document ID (206 documents, 103 pairs) and the ordinate is the Hamming distance of each pair of documents. The Hamming distances in the figure fall into three layers: the first layer is 0-4, the second is 5-19, and the third is above 19. Analysis of the document content shows:
1) When the texts are identical, or long texts differ by only a few bytes, the Hamming distance is generally below 4;
2) For some text pairs in the sample the content differs greatly, and their Hamming distance is large.
3. Two kinds of texts in the sample generally fall in the 5-19 layer. One kind is partly identical, for example a text made up of two paragraphs in which the first paragraph is the same and the second differs. The other kind is short text of a dozen or so Chinese characters in which one or two characters differ.
Next, the relationship between request volume and latency was tested. With the data stored in the system held constant at 10 million documents, the number of requests per second was increased continuously and the latency of each request was compared. The relationship between request volume and latency shows the following trend:
1) At a request rate of 2000 requests/s, latency is stable within 10 ms, and the request rate is not the bottleneck of the service;
2) At a request rate of 6000 requests/s, latency is stable at about 20 ms, and the request rate is still not the bottleneck of the service;
3) Once the request rate exceeds 6000 requests/s, latency increases markedly, indicating that the request rate has become the bottleneck of the service.
In theory, the longer the document expiry time is set, the longer documents remain in the storage system; the storage footprint grows and query efficiency suffers, which in turn affects request latency. This is also the reason the system includes handling for hot and cold data.
Based on current needs in document anti-cheating, a Simhash-based anti-cheating technique for massive documents was developed. The improved Simhash algorithm can respond to external requests in real time. The work covers new instance registration, instance data import, and similar-document search. Duplicate detection supports strategies along the user, full-text, and blacklist dimensions; in terms of granularity, Simhash duplicate detection at full-text and paragraph level is supported, as is the handling of hot and cold data. The technique is built on massive data: each instance can currently support 200 million documents. In addition, the hot and cold data strategy keeps instance data within a relatively stable range, so that it does not grow too quickly as the instance's own data grows.
Claims (3)
- 1. An anti-cheating method for massive documents based on the Simhash algorithm, characterized in that: based on current needs in document anti-cheating, a Simhash-based anti-cheating technique for massive documents is designed and developed; testing shows that the program is stable, processes large-scale data with very high efficiency, can serve multiple instances, and can meet the demand of processing massive documents.
- 2. The method according to claim 1, characterized in that the key techniques of the Simhash-based anti-cheating technique for massive documents are: high-speed retrieval, document feature weight calculation, Simhash signature calculation, full-text Simhash duplicate detection, and paragraph-based Simhash duplicate detection.
- 3. The method according to claim 1, characterized in that the results of the technique are analyzed by testing the accuracy of Simhash duplicate detection in order to verify the effectiveness of Simhash document similarity calculation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610588570.6A CN107656916A (en) | 2016-07-25 | 2016-07-25 | An anti-cheating method for massive documents based on the Simhash algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610588570.6A CN107656916A (en) | 2016-07-25 | 2016-07-25 | An anti-cheating method for massive documents based on the Simhash algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107656916A true CN107656916A (en) | 2018-02-02 |
Family
ID=61126905
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610588570.6A Pending CN107656916A (en) | 2016-07-25 | 2016-07-25 | An anti-cheating method for massive documents based on the Simhash algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107656916A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109710729A (en) * | 2018-12-14 | 2019-05-03 | 麒麟合盛网络技术股份有限公司 | A kind of acquisition method and device of text data |
CN110162752A (en) * | 2019-05-13 | 2019-08-23 | 百度在线网络技术(北京)有限公司 | Article sentences weight processing method, device and electronic equipment |
CN113197428A (en) * | 2021-04-27 | 2021-08-03 | 日照职业技术学院 | Anti-cheating desk for higher education |
CN113377927A (en) * | 2021-06-28 | 2021-09-10 | 成都卫士通信息产业股份有限公司 | Similar document detection method and device, electronic equipment and storage medium |
-
2016
- 2016-07-25 CN CN201610588570.6A patent/CN107656916A/en active Pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109710729A (en) * | 2018-12-14 | 2019-05-03 | 麒麟合盛网络技术股份有限公司 | A kind of acquisition method and device of text data |
CN110162752A (en) * | 2019-05-13 | 2019-08-23 | 百度在线网络技术(北京)有限公司 | Article sentences weight processing method, device and electronic equipment |
CN110162752B (en) * | 2019-05-13 | 2023-06-27 | 百度在线网络技术(北京)有限公司 | Article judging and re-processing method and device and electronic equipment |
CN113197428A (en) * | 2021-04-27 | 2021-08-03 | 日照职业技术学院 | Anti-cheating desk for higher education |
CN113377927A (en) * | 2021-06-28 | 2021-09-10 | 成都卫士通信息产业股份有限公司 | Similar document detection method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11182435B2 (en) | Model generation device, text search device, model generation method, text search method, data structure, and program | |
Stamatatos | Author identification using imbalanced and limited training texts | |
CN101685448A (en) | Method and device for establishing association between query operation of user and search result | |
CN107656916A (en) | An anti-cheating method for massive documents based on the Simhash algorithm | |
CN102411564A (en) | Electronic operation plagiarism detection method | |
CN102081602A (en) | Method and equipment for determining category of unlisted word | |
KR101541306B1 (en) | Computer enabled method of important keyword extraction, server performing the same and storage media storing the same | |
CN110166847A (en) | Barrage treating method and apparatus | |
Jiang et al. | A unified neural network approach to e-commerce relevance learning | |
Chifu et al. | Word sense disambiguation to improve precision for ambiguous queries | |
Rahman et al. | An efficient deep learning technique for bangla fake news detection | |
CN102622378A (en) | Method and device for detecting events from text flow | |
Yong et al. | A neural-based text summarization system | |
CN105808602A (en) | Detection method and device of junk information | |
Gao et al. | Few-shot fake news detection via prompt-based tuning | |
Agarwal et al. | Intelligent plagiarism detection mechanism using semantic technology: A different approach | |
Rofiq | Indonesian news extractive text summarization using latent semantic analysis | |
Gardner et al. | Automatic link detection: a sequence labeling approach | |
Li et al. | Sentiment classification of financial microblogs through automatic text summarization | |
Zheng et al. | Research on domain term extraction based on conditional random fields | |
Salahuddin et al. | Automatic identification of Urdu fake news using logistic regression model | |
Akbari et al. | Sentiment Analysis Using Learning Vector Quantization Method | |
Liang et al. | Extracting keyphrases from chinese news articles using textrank and query log knowledge | |
CN108287851A (en) | The anti-scheme of practising fraud of document based on Simhash technologies | |
Sardar et al. | FakeDTML at CheckThat!-2023: Identifying Check-Worthiness of Tweets and Debate Snippets. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20180202 |