CN109710834A - Similar web page detection method, device, storage medium and electronic equipment - Google Patents

Publication number
CN109710834A
Authority
CN
China
Prior art keywords
target
sentences
webpage
web page
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811369272.3A
Other languages
Chinese (zh)
Other versions
CN109710834B (en)
Inventor
邹启波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN201811369272.3A
Publication of CN109710834A
Application granted
Publication of CN109710834B
Legal status: Active
Anticipated expiration

Abstract

The present disclosure relates to a similar web page detection method, device, storage medium, and electronic equipment. A first predetermined number of target sentences are selected from a target text; each target sentence is searched using a second predetermined number of search engines, and a third predetermined number of target web pages are selected from the search results according to a second preset rule; the web page text information in all of the target web pages is obtained; a matching rate between the target text and the web page text information is calculated, and web pages whose matching rate is greater than a first preset threshold are determined to be web pages similar to the target text. In this way, the target text to be checked is split into sentences, search engines are used to find target web pages whose content is similar to the target text, and the text information in the target web pages is matched against the target text, so that web pages similar to the target text are detected. This makes it easy to detect whether the target text plagiarizes content from other web pages.

Description

Similar web page detection method, device, storage medium and electronic equipment
Technical field
The present disclosure relates to the field of text recognition, and in particular to a similar web page detection method, device, storage medium, and electronic equipment.
Background
Reposting and plagiarism of content submitted to websites is widespread, and it has become the norm in today's online communities for similar content to appear on multiple websites. This not only harms the interests of the original authors but also affects websites that are unable to identify plagiarized content. A method is therefore needed for checking the similarity of a text against the whole web, so that submissions can be checked for plagiarism and cases in which a submission plagiarizes content from other websites without being detected can be avoided.
Summary of the invention
The purpose of the present disclosure is to provide a similar web page detection method, device, storage medium, and electronic equipment that can detect web pages similar to a target text, making it easy to detect whether the target text plagiarizes content from other web pages.
To achieve the above purpose, the present disclosure provides a similar web page detection method, the method comprising:
selecting a first predetermined number of target sentences from a target text;
searching for each target sentence using a second predetermined number of search engines, to obtain a third predetermined number of target web pages;
obtaining the web page text information in all of the target web pages;
calculating a matching rate between the target text and the web page text information, and determining web pages whose matching rate is greater than a first preset threshold to be web pages similar to the target text, wherein a higher matching rate indicates that the target text is more similar to the web page text information.
Optionally, selecting the first predetermined number of target sentences from the target text comprises:
splitting the target text into sentences according to a first preset rule, and sampling from all of the sentences obtained by the splitting, to obtain the first predetermined number of target sentences.
Optionally, searching for each target sentence using the second predetermined number of search engines to obtain the third predetermined number of target web pages comprises:
searching for each target sentence using the second predetermined number of search engines, and selecting the top fourth predetermined number of web pages in the search results returned by each search engine for each target sentence as base recalled web pages;
calculating a similarity rate between each target sentence and the summary text of the corresponding base recalled web pages, wherein a higher similarity rate indicates that the summary text is more similar to the target sentence;
calculating the similarity between the corresponding target sentence and each base recalled web page whose similarity rate is higher than a second preset threshold;
according to the similarity, determining the top third predetermined number of web pages most similar to the target sentences to be the target web pages.
Optionally, calculating the similarity rate between each target sentence and the summary text of the corresponding base recalled web pages comprises:
segmenting each target sentence and the summary text into words;
matching the words in each target sentence against the words in the corresponding summary text, and determining as the similarity rate a first ratio of the number of matched words in the target sentence to the total number of words in the target sentence;
calculating the similarity between the corresponding target sentence and each base recalled web page whose similarity rate is higher than the second preset threshold comprises:
calculating the similarity according to the following formula:
score = hit_rate * 10 + return_counts,
where score is the similarity, hit_rate is the similarity rate, and return_counts is the number of search engines that returned the base recalled web page.
Optionally, calculating the matching rate between the target text and the web page text information comprises:
splitting the target text into sentences according to a third preset rule, and calculating a matching score for each of the resulting sentences in the web page text information; and, for each piece of web page text information, taking as the matching rate of the target web page the proportion, among all sentences obtained by splitting the target text according to the third preset rule, of sentences whose matching score with that web page text information is greater than a third preset threshold, wherein the matching score indicates the degree of similarity between a sentence in the target text and the web page text information, and a higher matching score indicates a higher degree of similarity.
Optionally, the matching score is calculated as follows:
segmenting into words all sentences obtained by splitting the target text according to the third preset rule, setting a sliding window for each sentence according to the number of words it contains, and matching each sentence against the web page text information by moving the sliding window one word at a time;
in each match, if the proportion of the words of the web page text information shown in the sliding window that match words of the target text corresponding to the sliding window is not less than a fifth preset threshold, calculating a second ratio of the distance between words that are matched and adjacent to a word total, and taking the largest of the second ratios as the matching score of the sentence corresponding to the sliding window, wherein the word total is the total number of words in the sliding window minus one.
Optionally, the web page text information includes one or more of the web page body text, the publication time, and the author name.
The present disclosure also provides a similar web page detection device, the device comprising:
a first processing module, configured to select a first predetermined number of target sentences from a target text;
a second processing module, configured to search for each target sentence using a second predetermined number of search engines, to obtain a third predetermined number of target web pages;
a third processing module, configured to obtain the web page text information in all of the target web pages;
a fourth processing module, configured to calculate a matching rate between the target text and the web page text information, and to determine web pages whose matching rate is greater than a first preset threshold to be web pages similar to the target text, wherein a higher matching rate indicates that the target text is more similar to the web page text information.
Optionally, the first processing module is further configured to:
split the target text into sentences according to a first preset rule, and randomly sample from all of the sentences obtained by the splitting, to obtain the first predetermined number of target sentences.
Optionally, the second processing module comprises:
a search submodule, configured to search for each target sentence using the second predetermined number of search engines, and to select the top fourth predetermined number of web pages in the search results returned by each search engine for each target sentence as base recalled web pages;
a similarity rate calculation submodule, configured to calculate a similarity rate between each target sentence and the summary text of the corresponding base recalled web pages, wherein a higher similarity rate indicates that the summary text is more similar to the target sentence;
a similarity calculation submodule, configured to calculate the similarity between the corresponding target sentence and each base recalled web page whose similarity rate is higher than a second preset threshold;
a target web page determination submodule, configured to determine, according to the similarity, the top third predetermined number of web pages most similar to the target sentences to be the target web pages.
Optionally, the similarity rate calculation submodule is further configured to:
segment each target sentence and the summary text into words;
match the words in each target sentence against the words in the corresponding summary text, and determine as the similarity rate a first ratio of the number of matched words in the target sentence to the total number of words in the target sentence;
the similarity calculation submodule is further configured to calculate the similarity according to the following formula:
score = hit_rate * 10 + return_counts,
where score is the similarity, hit_rate is the similarity rate, and return_counts is the number of search engines that returned the base recalled web page.
Optionally, the fourth processing module is further configured to:
split the target text into sentences according to a third preset rule, and calculate a matching score for each of the resulting sentences in the web page text information; and, for each piece of web page text information, take as the matching rate of the target web page the proportion, among all sentences obtained by splitting the target text according to the third preset rule, of sentences whose matching score with that web page text information is greater than a third preset threshold, wherein the matching score indicates the degree of similarity between a sentence in the target text and the web page text information, and a higher matching score indicates a higher degree of similarity.
Optionally, the fourth processing module is further configured to:
segment into words all sentences obtained by splitting the target text according to the third preset rule, set a sliding window for each sentence according to the number of words it contains, and match each sentence against the web page text information by moving the sliding window one word at a time;
in each match, if the proportion of the words of the web page text information shown in the sliding window that match words of the target text corresponding to the sliding window is not less than a fifth preset threshold, calculate a second ratio of the distance between words that are matched and adjacent to a word total, and take the largest of the second ratios as the matching score of the sentence corresponding to the sliding window, wherein the word total is the total number of words in the sliding window minus one.
Optionally, the web page text information includes one or more of the web page body text, the publication time, and the author name.
The present disclosure also provides a computer-readable storage medium on which a computer program is stored, wherein the steps of the method described above are implemented when the program is executed by a processor.
The present disclosure also provides an electronic device, comprising:
a memory on which a computer program is stored;
a processor, configured to execute the computer program in the memory to implement the steps of the method described above.
With the above technical solution, the target text to be checked is split into sentences, search engines are used to find target web pages whose content is similar to the target text, and the text information in the target web pages is matched against the target text, so that web pages similar to the target text are detected. This makes it easy to detect whether the target text plagiarizes content from other web pages.
Other features and advantages of the present disclosure are described in detail in the detailed description that follows.
Brief description of the drawings
The accompanying drawings are provided for a further understanding of the present disclosure and constitute part of the specification. Together with the following detailed description, they serve to explain the present disclosure but do not limit it. In the drawings:
Fig. 1 is a flowchart of a similar web page detection method according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flowchart of a method for determining target web pages in a similar web page detection method according to an exemplary embodiment of the present disclosure.
Fig. 3 is a structural block diagram of a similar web page detection device according to an exemplary embodiment of the present disclosure.
Fig. 4 is a structural block diagram of another similar web page detection device according to an exemplary embodiment of the present disclosure.
Fig. 5 is a block diagram of an electronic device according to an exemplary embodiment.
Fig. 6 is a block diagram of an electronic device according to an exemplary embodiment.
Detailed description
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to describe and explain the present disclosure and do not limit it.
Fig. 1 is a flowchart of a similar web page detection method according to an exemplary embodiment of the present disclosure. As shown in Fig. 1, the method includes steps 101 to 104.
In step 101, a first predetermined number of target sentences are selected from the target text. When detecting web pages similar to the target text, a subset of its sentences can first be selected and searched. This takes far less time than searching for the entire target text and therefore improves the efficiency of similar web page detection. The first predetermined number should preferably be less than the total number of sentences in the target text and greater than 1.
In one possible implementation, selecting the first predetermined number of target sentences from the target text comprises: splitting the target text into sentences according to a first preset rule, and randomly sampling from all of the sentences obtained by the splitting, to obtain the first predetermined number of target sentences. To use a subset of sentences in place of the entire target text when searching for web pages, the target text is first split into sentences. Splitting a passage of text into sentences is a common technique for those skilled in the art, so the first preset rule may be any sentence-splitting method, and the splitting method is not described further here. Once the target text has been split, the first predetermined number of target sentences may be selected from all of its sentences by random sampling. The first predetermined number may be, for example, 10.
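As an illustration of step 101, the following Python sketch splits a text on sentence-ending punctuation and randomly samples target sentences. The punctuation-based splitting rule, the function names, and the `seed` parameter are assumptions made for this example only; the disclosure leaves the first preset rule open.

```python
import random
import re

# Assumed first preset rule: a sentence is a run of characters that ends at
# Chinese or Western sentence-ending punctuation. This is deliberately naive
# (it would also split at the dot in "3.14").
_SENTENCE_RE = re.compile(r"[^。！？.!?]+[。！？.!?]*")


def split_sentences(text):
    """Return the non-empty sentences found in text."""
    return [s.strip() for s in _SENTENCE_RE.findall(text) if s.strip()]


def sample_target_sentences(text, first_predetermined_number=10, seed=None):
    """Randomly pick up to first_predetermined_number sentences from text."""
    sentences = split_sentences(text)
    k = min(first_predetermined_number, len(sentences))
    # A private Random instance keeps the sampling reproducible when seeded.
    return random.Random(seed).sample(sentences, k)
```

If the text has fewer sentences than requested, the sketch simply returns all of them in random order, which is one possible way to handle short texts.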
In step 102, each target sentence is searched using a second predetermined number of search engines, and a third predetermined number of target web pages are selected from the search results according to a second preset rule. After the first predetermined number of target sentences have been selected in step 101 from all sentences obtained by splitting the target text, search engines can be used to search for each target sentence separately and obtain the third predetermined number of target web pages. One or more search engines may be used; when multiple search engines are used, each target sentence is searched once in each search engine, and the third predetermined number of target web pages are selected from all of the search results according to the second preset rule. Because searching for the target sentences usually returns a very large number of web pages, the third predetermined number of target web pages must be selected from them before the subsequent steps are performed, which limits the number of target web pages handled in those steps. The second preset rule is not limited here, as long as it can determine the third predetermined number of target web pages from the search results. The third predetermined number may be, for example, 50.
In step 103, the web page text information in all of the target web pages is obtained. After the third predetermined number of target web pages have been determined in step 102, the web page text information in those target web pages is obtained for use in the subsequent step 104.
In step 104, the matching rate between the target text and the web page text information is calculated, and web pages whose matching rate is greater than the first preset threshold are determined to be web pages similar to the target text, wherein a higher matching rate indicates that the target text is more similar to the web page text information. After the target web pages have been selected and the web page text information in all of them has been obtained, the target text is compared with the web page text information of every target web page, and the matching rate between the target text and each target web page is calculated. The method of calculating the matching rate is not limited here, as long as the matching rate can indicate the degree of similarity between the target text and a target web page, with a higher matching rate indicating a higher degree of similarity. The first preset threshold may be, for example, 70%.
With the above technical solution, the target text to be checked is split into sentences, search engines are used to find target web pages whose content is similar to the target text, and the text information in the target web pages is matched against the target text, so that web pages similar to the target text are detected. This makes it easy to detect whether the target text plagiarizes content from other web pages.
In one possible implementation, the second preset rule in step 102 shown in Fig. 1 may include steps 201 to 204 shown in Fig. 2.
In step 201, the top fourth predetermined number of web pages in the search results returned by each search engine for each target sentence are selected as base recalled web pages. The fourth predetermined number may be, for example, 10; that is, after each search engine has searched for any one of the target sentences, the 10 highest-ranked web pages among all of the search results obtained are taken as base recalled web pages. If there are 10 target sentences and the second predetermined number is 2, the number of base recalled web pages finally returned is 200 (10*2*10).
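The recall arithmetic of step 201 can be sketched in Python as follows. The search engines are modeled as plain callables that return a ranked list of `(url, summary)` pairs; this interface, the record layout, and all names are hypothetical, since the disclosure does not specify how a search engine is queried.

```python
def collect_base_recalls(target_sentences, engines, fourth_predetermined_number=10):
    """Gather the top results of every engine for every target sentence.

    engines is a hypothetical mapping of engine name -> callable(query) that
    returns a ranked list of (url, summary) pairs.
    """
    recalls = []
    for sentence in target_sentences:
        for engine_name, search in engines.items():
            # Keep only the highest-ranked results of this engine.
            for url, summary in search(sentence)[:fourth_predetermined_number]:
                recalls.append({
                    "sentence": sentence,
                    "engine": engine_name,
                    "url": url,
                    "summary": summary,
                })
    return recalls
```

With 10 target sentences, 2 engines, and 10 results kept per query, the function returns 200 records, matching the count given in the text. Duplicate URLs are kept on purpose, because a later step needs to know how many engines returned the same page.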
In step 202, the similarity rate between each target sentence and the summary text of the corresponding base recalled web pages is calculated; a higher similarity rate indicates that the summary text is more similar to the target sentence. The similarity rate between each target sentence and the base recalled web pages selected in step 201 is calculated by comparing the summary text of each base recalled web page with the target sentence. The length of a summary text generally stays within a certain number of characters, so comparing the summary text with the target sentence greatly improves the efficiency of calculating the similarity rate compared with directly comparing the entire web page text information of the base recalled web page with the target sentence.
In step 203, the similarity between the corresponding target sentence and each base recalled web page whose similarity rate is higher than the second preset threshold is calculated. The second preset threshold may be, for example, 80%. In step 202, the similarity rate between each base recalled web page and the target sentences was obtained by comparing the summaries of the base recalled web pages with the target sentences; based on this similarity rate, the base recalled web pages whose similarity rate is greater than the second preset threshold are selected. This process picks out, from the base recalled web pages, the subset that is more similar to the target text, for a further similarity calculation over the entire web page text information of those pages. Because the base recalled web pages that need a similarity calculation in step 203 have already been screened once by the similarity rate calculated in step 202, far fewer base recalled web pages remain at this point than in the raw search results from the second predetermined number of search engines. The entire web page text information of the remaining base recalled web pages can therefore be examined directly to calculate its similarity to the target sentences.
In step 204, according to the similarity, the top third predetermined number of web pages most similar to the target sentences are determined to be the target web pages. The third predetermined number may be, for example, 50. After the similarity between the target sentences and the entire web page text information of the remaining base recalled web pages has been calculated, the top third predetermined number of web pages most similar to the target sentences are selected according to the similarity and determined to be the target web pages. The similarity may be defined so that a larger value indicates greater similarity or, put the other way, so that a smaller value indicates lower similarity.
In this way, the number of web pages similar to the target text can be reduced to the third predetermined number, ordered from most to least similar.
In one possible implementation, step 202 shown in Fig. 2 further includes: segmenting each target sentence and the summary text into words; matching the words in each target sentence against the words in the corresponding summary text; and determining as the similarity rate a first ratio of the number of matched words in the target sentence to the total number of words in the target sentence. For example, suppose a target sentence has 10 words after segmentation, the summary text of a base recalled web page corresponding to that target sentence has 12 words after segmentation, and 9 of the words in the target sentence have a match in the summary text; the similarity rate is then 90% (9/10).
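The first ratio described above can be sketched in Python as follows. The sketch assumes both texts have already been segmented into word lists (Chinese text would need a word segmenter, which is outside this example), and the function name is hypothetical.

```python
def similarity_rate(sentence_words, summary_words):
    """Fraction of the target sentence's words that also occur in the summary."""
    if not sentence_words:
        return 0.0
    summary_vocab = set(summary_words)  # set lookup keeps matching fast
    matched = sum(1 for word in sentence_words if word in summary_vocab)
    return matched / len(sentence_words)
```

For a 10-word sentence of which 9 words appear in a 12-word summary, the function returns 0.9, in line with the worked example in the text.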
In one possible implementation, step 203 shown in Fig. 2 further includes calculating the similarity according to the following formula:
score = hit_rate * 10 + return_counts,
where score is the similarity, hit_rate is the similarity rate, and return_counts is the number of search engines that returned the base recalled web page.
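A minimal Python sketch of the formula and of the selection in steps 203 and 204 is given below. Representing each candidate as a `(url, hit_rate, return_counts)` tuple is an assumption made for this example, as are the function names; the default thresholds follow the example values in the text (80% and 50).

```python
def similarity_score(hit_rate, return_counts):
    """score = hit_rate * 10 + return_counts, as given in the formula."""
    return hit_rate * 10 + return_counts


def select_target_pages(candidates, second_threshold=0.8, third_predetermined_number=50):
    """Filter by similarity rate, rank by score, and keep the top pages.

    candidates is an assumed list of (url, hit_rate, return_counts) tuples,
    one per base recalled web page.
    """
    kept = [c for c in candidates if c[1] > second_threshold]
    kept.sort(key=lambda c: similarity_score(c[1], c[2]), reverse=True)
    return [url for url, _, _ in kept[:third_predetermined_number]]
```

Because hit_rate lies between 0 and 1, the factor of 10 puts it on a scale comparable to return_counts, so a page returned by one additional search engine gains as much as a 10-point rise in similarity rate.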
In one possible implementation, the method of calculating the matching rate between the target text and the web page text information in step 104 shown in Fig. 1 may be as follows: the target text is split into sentences according to a third preset rule, and a matching score is calculated for each of the resulting sentences in the web page text information; then, for each piece of web page text information, the proportion of sentences whose matching score with that web page text information is greater than a third preset threshold, among all sentences obtained by splitting the target text according to the third preset rule, is taken as the matching rate of the target web page. The third preset rule may be the same as or different from the first preset rule, as long as it can split the target text into sentences. After the target text has been split, every resulting sentence is compared with the web page text information of each target web page to obtain the matching score between each sentence of the target text and that target web page. The matching score indicates the degree of similarity between a sentence in the target text and the target web page; a higher matching score indicates a higher degree of similarity. The matching rate of each target web page is then calculated from the matching scores. For example, if the target text is split into 10 sentences, the third preset threshold is 80%, and for one of the target web pages, web page A, a total of 9 sentences in the target text have a matching score with web page A greater than 80%, then the matching rate between the target text and web page A is 90%.
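The matching-rate calculation can be sketched in Python as below. The sketch takes the per-sentence matching scores as given, since the disclosure allows any scoring method; the function names are hypothetical, and the default thresholds are the example values from the text (80% and 70%).

```python
def matching_rate(sentence_scores, third_threshold=0.8):
    """Share of target-text sentences whose matching score exceeds the threshold."""
    if not sentence_scores:
        return 0.0
    hits = sum(1 for score in sentence_scores if score > third_threshold)
    return hits / len(sentence_scores)


def is_similar_page(sentence_scores, first_threshold=0.7, third_threshold=0.8):
    """Decide whether a target web page counts as similar to the target text."""
    return matching_rate(sentence_scores, third_threshold) > first_threshold
```

For 10 sentences of which 9 score above 80% against web page A, `matching_rate` returns 0.9, so web page A would be reported as similar under a 70% first preset threshold.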
The matching score may be calculated by any method of calculating text similarity. In one possible implementation, it may be calculated as follows: all sentences obtained by splitting the target text according to the third preset rule are segmented into words, a sliding window is set for each sentence according to the number of words it contains, and each sentence is matched against the web page text information by moving the sliding window one word at a time. In each match, if the proportion of the words of the web page text information shown in the sliding window that match words of the target text corresponding to the sliding window is not less than a fifth preset threshold, a second ratio of the distance between words that are matched and adjacent to a word total is calculated, and the largest of the second ratios is taken as the matching score of the sentence corresponding to the sliding window, wherein the word total is the total number of words in the sliding window minus one. The fifth preset threshold may be set differently according to the actual situation and is preferably 80%.
For example, suppose that a sentence in the target text consists of the four words A, B, C, and D after segmentation. The web page text information of one of the target web pages is also segmented into words, and the positions at which each of the four words A, B, C, and D occurs in the web page text are determined, for example A=(5,10,12,20,24), B=(1,3,11,55,75,98), C=(7,13,45,56,85,97,101), D=(8,14,44,57,86,88). Merging the positions of the four words gives their positions in the web page text information: (1, B), (3, B), (5, A), (7, C), (8, D), (10, A), (11, B), (12, A), (13, C), (14, D), (20, A), (24, A), (44, D), (45, C), (55, B), (56, C), (57, D), (75, B), (85, C), (86, D), (88, D), (97, C), (98, B), (101, C). A window whose length equals the number of words in the sentence is then created and moved one word at a time through the web page text information, and a score is computed after every move. For each score, it is first determined whether the proportion of words of the web page text information in the window that match words of the sentence is not less than 60% (the fifth preset threshold in this example). If it is, the second ratio of the distance between the words in the window that are matched and adjacent to the word total is calculated. For example, when the window is moved onto the four positions 54, 55, 56, and 57 of the web page text, three of the four positions hold words of the sentence, so the proportion is 3/4 (75%), which is greater than the fifth preset threshold (60%), and the score is computed. In this window the matched words B, C, and D are adjacent to one another, so the distance is the distance between B and C, which is 1, plus the distance between C and D, which is 1. The word total is the total number of words in the sliding window minus one, that is, four minus one equals three. The matching score obtained at this window position is therefore 2/3. After a score has been computed for every position of the sliding window in the web page text information, the highest score is taken as the matching score between the web page and the sentence.
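One possible reading of the sliding-window scoring is sketched in Python below. Two points are interpretations rather than statements from the disclosure: the matched proportion is taken as the fraction of window positions holding a word of the sentence, and the "distance between matched and adjacent words" is taken as the number of neighboring position pairs in which both words are matched. The function name is hypothetical, and the default threshold is the preferred value of 80%.

```python
def window_matching_score(sentence_words, page_words, fifth_threshold=0.8):
    """Best sliding-window score of one sentence against one page's word list."""
    n = len(sentence_words)
    if n < 2 or len(page_words) < n:
        return 0.0
    vocab = set(sentence_words)
    best = 0.0
    for start in range(len(page_words) - n + 1):
        window = page_words[start:start + n]
        flags = [word in vocab for word in window]
        # Skip windows in which too few positions hold a word of the sentence.
        if sum(flags) / n < fifth_threshold:
            continue
        # Count neighboring pairs in which both words are matched.
        adjacent = sum(1 for a, b in zip(flags, flags[1:]) if a and b)
        best = max(best, adjacent / (n - 1))
    return best
```

With the sentence A, B, C, D and a window covering one unrelated word followed by B, C, D, the sketch yields 2/3 when the threshold is 60%, as in the worked example. The sketch ignores word order inside the window, which a stricter implementation might also check.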
In one possible embodiment, while the movable window is moved word by word through the web page text information for scoring, if the ratio of the words shown in the movable window that match the words of the target-text sentence corresponding to the movable window is 1, the sentence can be matched exactly in the web page text information. In that case no further scoring is performed, and 1 is returned directly as the matching score of the webpage for the sentence.
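The window scoring described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `window_score` and `sentence_match_score` are hypothetical, the texts are assumed to be already segmented into word lists, the hit ratio is assumed to be the share of distinct sentence words found in the window, and the "distance" is assumed to be one unit for each pair of neighbouring matched tokens, which reproduces the 2/3 of the worked example.

```python
from fractions import Fraction


def window_score(window, sentence, min_hit_ratio=Fraction(3, 5)):
    """Score one window position; return None if too few sentence words are hit."""
    wanted = set(sentence)
    # Assumption: hit ratio = distinct sentence words present in the window
    # divided by the number of distinct sentence words.
    hit_ratio = Fraction(len(wanted & set(window)), len(wanted))
    if hit_ratio < min_hit_ratio:  # below the fifth preset threshold (60% in the example)
        return None
    if hit_ratio == 1:  # every sentence word is present: treated as an exact match
        return Fraction(1)
    hits = [word in wanted for word in window]
    # Assumption: each pair of neighbouring matched tokens contributes a distance of 1.
    adjacent = sum(1 for a, b in zip(hits, hits[1:]) if a and b)
    return Fraction(adjacent, max(len(window) - 1, 1))


def sentence_match_score(page_tokens, sentence):
    """Slide a window of len(sentence) words over the page and keep the best score."""
    if not sentence:
        return Fraction(0)
    size = len(sentence)
    best = Fraction(0)
    for start in range(len(page_tokens) - size + 1):
        score = window_score(page_tokens[start:start + size], sentence)
        if score is None:
            continue
        if score == 1:  # stop early once the maximum possible score is reached
            return score
        best = max(best, score)
    return best
```

For the window covering positions 54 to 57 in the example (one unmatched word followed by B, C, D), `window_score(["x", "B", "C", "D"], ["A", "B", "C", "D"])` evaluates to `Fraction(2, 3)`. Exact fractions are used only to keep the arithmetic easy to compare with the text.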
In one possible embodiment, the web page text information includes one or more of the webpage body text, the publication time and the author name.
In one possible embodiment, after step 103 shown in Fig. 1, the first author of the target text and the second author in the target webpage are compared; if the first author is the same as the second author, step 104 shown in Fig. 1 is not executed.
Fig. 3 is a structural block diagram of a similar webpage detection device according to an exemplary embodiment of the present disclosure. As shown in Fig. 3, the device includes: a first processing module 10, configured to select a first predetermined number of target sentences from a target text; a second processing module 20, configured to search for each target sentence using a second predetermined number of search engines, so as to obtain a third predetermined number of target webpages; a third processing module 30, configured to obtain the web page text information in all the target webpages; and a fourth processing module 40, configured to calculate the matching rate between the target text and the web page text information, and to determine a webpage whose matching rate is greater than a first preset threshold as a webpage similar to the target text.
With the above technical solution, the target text to be examined is split into sentences, search engines are used to retrieve target webpages whose content is similar to the target text, and the text information in the target webpages is matched against the target text. Webpages similar to the target text are thereby detected, which makes it easy to determine whether the target text has been plagiarized from other web page content.
In one possible embodiment, the first processing module 10 is further configured to: split the target text into sentences according to a first preset rule, and randomly sample from all the sentences so obtained to obtain the first predetermined number of target sentences.
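The sentence splitting and random sampling performed by the first processing module can be sketched as below. The disclosure does not specify the first preset rule, so splitting on common Chinese and English sentence-ending punctuation is purely an assumption here, and the names `split_sentences` and `sample_target_sentences` are hypothetical.

```python
import random
import re


def split_sentences(text):
    """Split text into sentences.

    Assumption: the "first preset rule" is a split on sentence-ending
    punctuation and line breaks; the source leaves the rule unspecified.
    """
    parts = re.split(r"[。！？!?.;；\n]+", text)
    return [part.strip() for part in parts if part.strip()]


def sample_target_sentences(text, count, rng=None):
    """Randomly pick up to `count` sentences (the first predetermined number)."""
    rng = rng or random.Random()
    sentences = split_sentences(text)
    # random.sample raises ValueError if asked for more items than exist,
    # so the requested count is capped at the number of sentences.
    return rng.sample(sentences, min(count, len(sentences)))
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which can be convenient when comparing runs.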
Fig. 4 is a structural block diagram of the second processing module 20 in a similar webpage detection device according to an exemplary embodiment of the present disclosure. As shown in Fig. 4, the second processing module 20 includes: a search submodule 201, configured to search for each target sentence using the second predetermined number of search engines, and to select the top fourth predetermined number of webpages among the search result webpages returned by each search engine for each target sentence as base recall webpages; a likelihood calculation submodule 202, configured to calculate the likelihood between each target sentence and the summary text of the corresponding base recall webpage, where a higher likelihood indicates that the summary text and the target sentence are more similar; a similarity calculation submodule 203, configured to calculate the similarity between each base recall webpage whose likelihood is higher than a second preset threshold and the corresponding target sentence; and a target webpage determination submodule 204, configured to determine, according to the similarity, the top third predetermined number of webpages most similar to the target sentence as the target webpages.
In one possible embodiment, the likelihood calculation submodule 202 is further configured to: segment each target sentence and the summary text into words; match the words in each target sentence against the words in the corresponding summary text; and determine the first ratio, of the number of matched words in the target sentence to the total number of words in the target sentence, as the likelihood. The similarity calculation submodule 203 is further configured to calculate the similarity according to the formula score = hit_rate * 10 + return_counts, where score is the similarity, hit_rate is the likelihood, and return_counts is the number of search engines that returned the base recall webpage.
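A minimal Python sketch of the likelihood and similarity computation follows. Only the formula score = hit_rate * 10 + return_counts comes from the text; the function names, the dictionary layout of `recalls`, and the assumption that sentences and summary texts arrive as word lists are illustrative choices.

```python
def hit_rate(sentence_words, summary_words):
    """First ratio: matched words of the target sentence / all words of the sentence."""
    if not sentence_words:
        return 0.0
    summary = set(summary_words)
    matched = sum(1 for word in sentence_words if word in summary)
    return matched / len(sentence_words)


def similarity(rate, return_counts):
    """score = hit_rate * 10 + return_counts, as given in the text."""
    return rate * 10 + return_counts


def pick_target_pages(sentence_words, recalls, min_hit_rate, top_n):
    """Rank base recall webpages for one target sentence.

    Assumption: `recalls` maps a URL to (summary_words, return_counts), where
    return_counts is the number of search engines that returned that page.
    """
    scored = []
    for url, (summary_words, return_counts) in recalls.items():
        rate = hit_rate(sentence_words, summary_words)
        if rate > min_hit_rate:  # second preset threshold
            scored.append((similarity(rate, return_counts), url))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored[:top_n]]  # top third predetermined number
```

Because the hit rate lies between 0 and 1, multiplying it by 10 puts it on a scale comparable to the engine count, so a page returned by several engines can outrank a page with a slightly better summary match.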
In one possible embodiment, the fourth processing module 40 is further configured to: split the target text into sentences according to a third preset rule; calculate the matching score of each resulting sentence in the web page text information; and, for each piece of web page text information, take as the matching rate of the target webpage the proportion of sentences whose matching score against that web page text information is greater than a third preset threshold, among all the sentences obtained by splitting the target text according to the third preset rule. The matching score characterizes the degree of similarity between a sentence in the target text and the web page text information: the higher the matching score, the higher the degree of similarity.
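The matching rate defined above is a simple proportion, sketched below. The function names and the convention of returning 0.0 for an empty score list are assumptions made for illustration.

```python
def matching_rate(sentence_scores, score_threshold):
    """Share of target-text sentences whose matching score exceeds the third preset threshold."""
    if not sentence_scores:
        return 0.0  # assumption: a text with no sentences matches nothing
    above = sum(1 for score in sentence_scores if score > score_threshold)
    return above / len(sentence_scores)


def is_similar(sentence_scores, score_threshold, rate_threshold):
    """A webpage is similar when its matching rate exceeds the first preset threshold."""
    return matching_rate(sentence_scores, score_threshold) > rate_threshold
```

For example, with per-sentence scores of 1.0, 2/3, 0.0 and 0.5 against one webpage and a score threshold of 0.6, two of the four sentences count as matched, giving a matching rate of 0.5.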
In one possible embodiment, the fourth processing module 40 is further configured to: segment into words all the sentences obtained by splitting the target text according to the third preset rule; set a movable window for each sentence according to the number of words it contains; and match each sentence against the web page text information by moving the movable window word by word. In each matching, if the ratio of the words of the web page text information shown in the movable window that match the words of the target-text sentence corresponding to the movable window is not less than a fifth preset threshold, a second ratio is calculated, namely the ratio of the distance between words that are matched in the target text and adjacent to each other to the word total, and the maximum of the second ratios is taken as the matching score of the sentence corresponding to the movable window, where the word total is the total number of words in the movable window minus one.
In one possible embodiment, the web page text information includes one or more of the webpage body text, the publication time and the author name.
As for the device in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not elaborated here.
Fig. 5 is a block diagram of an electronic device 500 according to an exemplary embodiment. As shown in Fig. 5, the electronic device 500 may include a processor 501 and a memory 502. The electronic device 500 may also include one or more of a multimedia component 503, an input/output (I/O) interface 504 and a communication component 505.
The processor 501 controls the overall operation of the electronic device 500 so as to complete all or part of the steps of the above similar webpage detection method. The memory 502 stores various types of data to support operation of the electronic device 500. Such data may include, for example, instructions of any application or method operated on the electronic device 500 and application-related data such as contact data, sent and received messages, pictures, audio and video. The memory 502 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc. The multimedia component 503 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used to output and/or input audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may be further stored in the memory 502 or sent through the communication component 505. The audio component also includes at least one loudspeaker for outputting audio signals. The I/O interface 504 provides an interface between the processor 501 and other interface modules, which may be a keyboard, a mouse, buttons and the like. The buttons may be virtual buttons or physical buttons. The communication component 505 is used for wired or wireless communication between the electronic device 500 and other devices. Wireless communication may be, for example, Wi-Fi, Bluetooth, near field communication (NFC), 2G, 3G or 4G, or a combination of one or more of them; the communication component 505 may accordingly include a Wi-Fi module, a Bluetooth module and an NFC module.
In an exemplary embodiment, the electronic device 500 may be implemented by one or more application specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors or other electronic components, for executing the above similar webpage detection method.
In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided; when the program instructions are executed by a processor, the steps of the above similar webpage detection method are implemented. For example, the computer-readable storage medium may be the above memory 502 including program instructions, which can be executed by the processor 501 of the electronic device 500 to complete the above similar webpage detection method.
Fig. 6 is a block diagram of an electronic device 600 according to an exemplary embodiment. For example, the electronic device 600 may be provided as a server. Referring to Fig. 6, the electronic device 600 includes one or more processors 622 and a memory 632 for storing computer programs executable by the processor 622. The computer program stored in the memory 632 may include one or more modules, each corresponding to a set of instructions. In addition, the processor 622 may be configured to execute the computer program so as to perform the above similar webpage detection method.
In addition, the electronic device 600 may also include a power supply component 626 and a communication component 650. The power supply component 626 may be configured to perform power management of the electronic device 600, and the communication component 650 may be configured to implement communication of the electronic device 600, for example wired or wireless communication. The electronic device 600 may also include an input/output (I/O) interface 658. The electronic device 600 may operate based on an operating system stored in the memory 632, such as Windows Server™, Mac OS X™, Unix™ or Linux™.
In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided; when the program instructions are executed by a processor, the steps of the above similar webpage detection method are implemented. For example, the computer-readable storage medium may be the above memory 632 including program instructions, which can be executed by the processor 622 of the electronic device 600 to complete the above similar webpage detection method.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings. However, the present disclosure is not limited to the specific details of the above embodiments. Within the scope of the technical concept of the present disclosure, a variety of simple modifications can be made to its technical solutions, and these simple modifications all fall within the protection scope of the present disclosure.
It should further be noted that the specific technical features described in the above embodiments may be combined in any suitable manner, provided that they do not contradict one another. To avoid unnecessary repetition, the present disclosure does not separately describe the various possible combinations.
In addition, the various embodiments of the present disclosure may also be combined arbitrarily; as long as a combination does not depart from the idea of the present disclosure, it should likewise be regarded as content disclosed herein.

Claims (16)

1. A similar webpage detection method, characterized in that the method comprises:
selecting a first predetermined number of target sentences from a target text;
searching for each target sentence using a second predetermined number of search engines, and selecting a third predetermined number of target webpages from the search results according to a second preset rule;
obtaining the web page text information in all the target webpages;
calculating the matching rate between the target text and the web page text information, and determining a webpage whose matching rate is greater than a first preset threshold as a webpage similar to the target text, wherein a higher matching rate indicates that the target text and the web page text information are more similar.
2. The method according to claim 1, characterized in that selecting the first predetermined number of target sentences from the target text comprises:
splitting the target text into sentences according to a first preset rule, and randomly sampling from all the sentences obtained by splitting the target text according to the first preset rule, to obtain the first predetermined number of target sentences.
3. The method according to claim 1, characterized in that the second preset rule comprises:
selecting the top fourth predetermined number of webpages among the search result webpages returned by each search engine for each target sentence as base recall webpages;
calculating the likelihood between each target sentence and the summary text of the corresponding base recall webpage, wherein a higher likelihood indicates that the summary text and the target sentence are more similar;
calculating the similarity between each base recall webpage whose likelihood is higher than a second preset threshold and the corresponding target sentence;
determining, according to the similarity, the top third predetermined number of webpages most similar to the target sentence as the target webpages.
4. The method according to claim 3, characterized in that calculating the likelihood between each target sentence and the summary text of the corresponding base recall webpage comprises:
segmenting each target sentence and the summary text into words;
matching the words in each target sentence against the words in the corresponding summary text, and determining the first ratio, of the number of matched words in the target sentence to the total number of words in the target sentence, as the likelihood;
and calculating the similarity between each base recall webpage whose likelihood is higher than the second preset threshold and the corresponding target sentence comprises:
calculating the similarity according to the following formula:
score = hit_rate * 10 + return_counts,
wherein score is the similarity, hit_rate is the likelihood, and return_counts is the number of search engines that returned the base recall webpage.
5. The method according to claim 1, characterized in that calculating the matching rate between the target text and the web page text information comprises:
splitting the target text into sentences according to a third preset rule; calculating the matching score of each resulting sentence in the web page text information; and, for each piece of web page text information, taking as the matching rate of the target webpage the proportion of sentences whose matching score against that web page text information is greater than a third preset threshold, among all the sentences obtained by splitting the target text according to the third preset rule, wherein the matching score characterizes the degree of similarity between a sentence in the target text and the web page text information, and a higher matching score indicates a higher degree of similarity.
6. The method according to claim 5, characterized in that the matching score is calculated as follows:
segmenting into words all the sentences obtained by splitting the target text according to the third preset rule, setting a movable window for each sentence according to the number of words it contains, and matching each sentence against the web page text information by moving the movable window word by word;
in each matching, if the ratio of the words of the web page text information shown in the movable window that match the words of the target-text sentence corresponding to the movable window is not less than a fifth preset threshold, calculating a second ratio of the distance between words that are matched in the target text and adjacent to each other to the word total, and taking the maximum of the second ratios as the matching score of the sentence corresponding to the movable window, wherein the word total is the total number of words in the movable window minus one.
7. The method according to claim 1, characterized in that the web page text information includes one or more of the webpage body text, the publication time and the author name.
8. A similar webpage detection device, characterized in that the device comprises:
a first processing module, configured to select a first predetermined number of target sentences from a target text;
a second processing module, configured to search for each target sentence using a second predetermined number of search engines, and to select a third predetermined number of target webpages from the search results according to a second preset rule;
a third processing module, configured to obtain the web page text information in all the target webpages;
a fourth processing module, configured to calculate the matching rate between the target text and the web page text information, and to determine a webpage whose matching rate is greater than a first preset threshold as a webpage similar to the target text, wherein a higher matching rate indicates that the target text and the web page text information are more similar.
9. The device according to claim 8, characterized in that the first processing module is further configured to:
split the target text into sentences according to a first preset rule, and randomly sample from all the sentences obtained by splitting the target text according to the first preset rule, to obtain the first predetermined number of target sentences.
10. The device according to claim 8, characterized in that the second processing module comprises:
a search submodule, configured to search for each target sentence using the second predetermined number of search engines, and to select the top fourth predetermined number of webpages among the search result webpages returned by each search engine for each target sentence as base recall webpages;
a likelihood calculation submodule, configured to calculate the likelihood between each target sentence and the summary text of the corresponding base recall webpage, wherein a higher likelihood indicates that the summary text and the target sentence are more similar;
a similarity calculation submodule, configured to calculate the similarity between each base recall webpage whose likelihood is higher than a second preset threshold and the corresponding target sentence;
a target webpage determination submodule, configured to determine, according to the similarity, the top third predetermined number of webpages most similar to the target sentence as the target webpages.
11. The device according to claim 10, characterized in that the likelihood calculation submodule is further configured to:
segment each target sentence and the summary text into words;
match the words in each target sentence against the words in the corresponding summary text, and determine the first ratio, of the number of matched words in the target sentence to the total number of words in the target sentence, as the likelihood;
and the similarity calculation submodule is further configured to calculate the similarity according to the following formula:
score = hit_rate * 10 + return_counts,
wherein score is the similarity, hit_rate is the likelihood, and return_counts is the number of search engines that returned the base recall webpage.
12. The device according to claim 8, characterized in that the fourth processing module is further configured to:
split the target text into sentences according to a third preset rule; calculate the matching score of each resulting sentence in the web page text information; and, for each piece of web page text information, take as the matching rate of the target webpage the proportion of sentences whose matching score against that web page text information is greater than a third preset threshold, among all the sentences obtained by splitting the target text according to the third preset rule, wherein the matching score characterizes the degree of similarity between a sentence in the target text and the web page text information, and a higher matching score indicates a higher degree of similarity.
13. The device according to claim 12, characterized in that the fourth processing module is further configured to:
segment into words all the sentences obtained by splitting the target text according to the third preset rule, set a movable window for each sentence according to the number of words it contains, and match each sentence against the web page text information by moving the movable window word by word;
in each matching, if the ratio of the words of the web page text information shown in the movable window that match the words of the target-text sentence corresponding to the movable window is not less than a fifth preset threshold, calculate a second ratio of the distance between words that are matched in the target text and adjacent to each other to the word total, and take the maximum of the second ratios as the matching score of the sentence corresponding to the movable window, wherein the word total is the total number of words in the movable window minus one.
14. The device according to claim 8, characterized in that the web page text information includes one or more of the webpage body text, the publication time and the author name.
15. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1-7.
16. An electronic device, characterized by comprising:
a memory on which a computer program is stored;
a processor, configured to execute the computer program in the memory so as to implement the steps of the method according to any one of claims 1-7.
CN201811369272.3A 2018-11-16 2018-11-16 Similar webpage detection method and device, storage medium and electronic equipment Active CN109710834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811369272.3A CN109710834B (en) 2018-11-16 2018-11-16 Similar webpage detection method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811369272.3A CN109710834B (en) 2018-11-16 2018-11-16 Similar webpage detection method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN109710834A true CN109710834A (en) 2019-05-03
CN109710834B CN109710834B (en) 2020-01-10

Family

ID=66254955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811369272.3A Active CN109710834B (en) 2018-11-16 2018-11-16 Similar webpage detection method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN109710834B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1952947A (en) * 2005-10-17 2007-04-25 左其其 A system and method for web site against clone
CN101350032A (en) * 2008-09-23 2009-01-21 胡辉 Method for judging whether web page content is identical or not
CN103049467A (en) * 2011-10-12 2013-04-17 杨纯青 Chinese digital anti-plagiarism detection and comparison system and method
US8782082B1 (en) * 2011-11-07 2014-07-15 Trend Micro Incorporated Methods and apparatus for multiple-keyword matching
CN103345466A (en) * 2013-07-12 2013-10-09 唐煜舟 Academic paper information detection method based on free internet information
CN103678528A (en) * 2013-12-03 2014-03-26 北京建筑大学 Electronic homework plagiarism preventing system and method based on paragraph plagiarism detection
CN105808552A (en) * 2014-12-30 2016-07-27 北京奇虎科技有限公司 Method and device for extracting abstract from webpage based on slide window
CN106874299A (en) * 2015-12-14 2017-06-20 北京国双科技有限公司 Page detection method and device
CN106909628A (en) * 2017-01-24 2017-06-30 南京大学 A kind of text similarity method based on interval

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
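廖兴伟 (Liao Xingwei): 文档复制检测方法研究与系统实现 [Research and System Implementation of Document Copy Detection Methods], 《中国优秀硕士学位论文全文数据库 信息科技辑》 [China Master's Theses Full-text Database, Information Science and Technology] *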

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111552877A (en) * 2020-04-29 2020-08-18 百度在线网络技术(北京)有限公司 Data processing method and device
CN111552877B (en) * 2020-04-29 2023-11-07 百度在线网络技术(北京)有限公司 Data processing method and device
CN112699657A (en) * 2020-12-30 2021-04-23 广东德诚大数据科技有限公司 Abnormal text detection method and device, electronic equipment and storage medium
CN113887192A (en) * 2021-12-06 2022-01-04 阿里巴巴达摩院(杭州)科技有限公司 Text matching method and device and storage medium
CN114417812A (en) * 2022-03-15 2022-04-29 太平金融科技服务(上海)有限公司深圳分公司 Text checking method, device, equipment and storage medium
CN115687736A (en) * 2022-12-30 2023-02-03 北京长亭未来科技有限公司 Web application searching method and device and electronic equipment

Also Published As

Publication number Publication date
CN109710834B (en) 2020-01-10

Similar Documents

Publication Publication Date Title
CN109710834A (en) Similar web page detection method, device, storage medium and electronic equipment
WO2017045443A1 (en) Image retrieval method and system
CN104572717B (en) Information searching method and device
TWI505139B (en) A method for realizing intelligent association in the input method, device and terminal device
US20110302654A1 (en) Method and apparatus for analyzing and detecting malicious software
CN103942189B (en) A kind of method and apparatus for determining works keyword
CN104516887B (en) A kind of web data searching method, device and system
JP2018504727A (en) Reference document recommendation method and apparatus
US20160092421A1 (en) Text Editing Method and Apparatus, and Server
EP3559930A1 (en) Conversion of static images into interactive maps
US20120265767A1 (en) Method for searching related documents based on and guided by meaningful entities
CN103324674B (en) Web page contents choosing method and device
CN107291772B (en) Search access method and device and electronic equipment
CN108073292B (en) Intelligent word forming method and device for intelligent word forming
CN109241437A (en) A kind of generation method, advertisement recognition method and the system of advertisement identification model
CN106886294B (en) Input method error correction method and device
CN111984749A (en) Method and device for ordering interest points
KR101130206B1 (en) Method, apparatus and computer program product for providing an input order independent character input mechanism
CN104281275A (en) Method and device for inputting English
CN106919593B (en) Searching method and device
CN107665218B (en) Searching method and device and electronic equipment
KR20150032141A (en) Semantic searching system and method for smart device
CN111222316B (en) Text detection method, device and storage medium
CN109657840A (en) Decision tree generation method, device, computer readable storage medium and electronic equipment
CN111274428B (en) Keyword extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant