CN109710834A - Similar web page detection method, device, storage medium and electronic equipment
Publication number: CN109710834A
Application number: CN201811369272.3A
Authority: CN (China)
Prior art keywords: target, sentences, webpage, web page, text
Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The disclosure relates to a similar web page detection method, device, storage medium and electronic equipment. A first predetermined number of target sentences are chosen from a target text; each target sentence is searched for using a second predetermined number of search engines, and a third predetermined number of target webpages are chosen from the search results according to a second preset rule; the web page text information in all target webpages is obtained; the matching rate between the target text and the web page text information is calculated, and any webpage whose matching rate is greater than a first preset threshold is determined to be a webpage similar to the target text. In this way, the target text to be checked is split into sentences, search engines are used to find target webpages with content similar to the target text, and the text information in those webpages is matched against the target text, so that webpages similar to the target text are detected and it becomes easy to determine whether the target text plagiarizes other web page content.
Description
Technical field
The disclosure relates to the field of text recognition, and in particular to a similar web page detection method, device, storage medium and electronic equipment.
Background art
Reposting and plagiarism of content submitted to websites is commonplace, and the appearance of near-identical content on multiple websites has become the norm in today's online communities. This not only harms the interests of authors but also affects websites that cannot identify plagiarized content. A method is therefore needed for checking the similarity of a text against the whole web, so that submissions can be checked for plagiarism and cases where a submission plagiarizes other website content yet goes undetected are avoided.
Summary of the invention
The purpose of the disclosure is to provide a similar web page detection method, device, storage medium and electronic equipment that can detect webpages similar to a target text, making it easy to determine whether the target text plagiarizes other web page content.
To achieve the above purpose, the disclosure provides a similar web page detection method, the method comprising:
choosing a first predetermined number of target sentences from a target text;
searching for each target sentence using a second predetermined number of search engines, to obtain a third predetermined number of target webpages;
obtaining the web page text information in all of the target webpages;
calculating the matching rate between the target text and the web page text information, and determining any webpage whose matching rate is greater than a first preset threshold to be a webpage similar to the target text, wherein a higher matching rate indicates that the target text is more similar to the web page text information.
Optionally, choosing the first predetermined number of target sentences from the target text comprises:
splitting the target text into sentences according to a first preset rule, and sampling from all of the resulting sentences to obtain the first predetermined number of target sentences.
Optionally, searching for each target sentence using the second predetermined number of search engines to obtain the third predetermined number of target webpages comprises:
searching for each target sentence using the second predetermined number of search engines, and taking the top fourth predetermined number of webpages in the search results returned by each search engine for each target sentence as base recall webpages;
calculating the likelihood between each target sentence and the summary text of the corresponding base recall webpage, wherein a higher likelihood indicates that the summary text and the target sentence are more similar;
calculating the similarity between each base recall webpage whose likelihood is higher than a second preset threshold and the corresponding target sentence;
determining, according to the similarity, the top third predetermined number of webpages most similar to the target sentences as the target webpages.
Optionally, calculating the likelihood between each target sentence and the summary text of the corresponding base recall webpage comprises:
segmenting each target sentence and the summary text into words;
matching the words in each target sentence against the words in the corresponding summary text, and determining, as the likelihood, a first ratio of the number of matched words in the target sentence to the total number of words in the target sentence;
and calculating the similarity between each base recall webpage whose likelihood is higher than the second preset threshold and the corresponding target sentence comprises:
calculating the similarity according to the following formula:
score = hit_rate * 10 + return_counts,
where score is the similarity, hit_rate is the likelihood, and return_counts is the number of search engines that returned the base recall webpage.
Optionally, calculating the matching rate between the target text and the web page text information comprises:
splitting the target text into sentences according to a third preset rule, and calculating the matching score of each resulting sentence in the web page text information; and, for each piece of web page text information, taking the proportion of sentences whose matching score against that web page text information is greater than a third preset threshold, among all sentences obtained by splitting the target text according to the third preset rule, as the matching rate of the target webpage, wherein the matching score characterizes the degree of similarity between a sentence in the target text and the web page text information, and a higher matching score indicates a higher degree of similarity.
Optionally, the matching score is calculated as follows:
segmenting into words all sentences obtained by splitting the target text according to the third preset rule, setting a movable window for each sentence according to the number of words it contains, and matching each sentence against the web page text information by moving the movable window one word at a time;
in each match, if the ratio of words of the web page text information shown in the movable window that match words of the target-text sentence corresponding to the movable window is not less than a fifth preset threshold, calculating a second ratio of the distance between matched, adjacent words to a total word count, and taking the largest of the second ratios as the matching score of the sentence corresponding to the movable window, wherein the total word count is the total number of words in the movable window minus one.
Optionally, the web page text information includes one or more of the web page body text, the publication time, and the author name.
The disclosure also provides a similar web page detection device, the device comprising:
a first processing module, configured to choose a first predetermined number of target sentences from a target text;
a second processing module, configured to search for each target sentence using a second predetermined number of search engines, to obtain a third predetermined number of target webpages;
a third processing module, configured to obtain the web page text information in all of the target webpages;
a fourth processing module, configured to calculate the matching rate between the target text and the web page text information, and to determine any webpage whose matching rate is greater than a first preset threshold to be a webpage similar to the target text, wherein a higher matching rate indicates that the target text is more similar to the web page text information.
Optionally, the first processing module is further configured to:
split the target text into sentences according to a first preset rule, and randomly sample from all of the resulting sentences to obtain the first predetermined number of target sentences.
Optionally, the second processing module comprises:
a search submodule, configured to search for each target sentence using the second predetermined number of search engines, and to take the top fourth predetermined number of webpages in the search results returned by each search engine for each target sentence as base recall webpages;
a likelihood calculation submodule, configured to calculate the likelihood between each target sentence and the summary text of the corresponding base recall webpage, wherein a higher likelihood indicates that the summary text and the target sentence are more similar;
a similarity calculation submodule, configured to calculate the similarity between each base recall webpage whose likelihood is higher than a second preset threshold and the corresponding target sentence;
a target webpage determination submodule, configured to determine, according to the similarity, the top third predetermined number of webpages most similar to the target sentences as the target webpages.
Optionally, the likelihood calculation submodule is further configured to:
segment each target sentence and the summary text into words;
match the words in each target sentence against the words in the corresponding summary text, and determine, as the likelihood, a first ratio of the number of matched words in the target sentence to the total number of words in the target sentence;
and the similarity calculation submodule is further configured to calculate the similarity according to the following formula:
score = hit_rate * 10 + return_counts,
where score is the similarity, hit_rate is the likelihood, and return_counts is the number of search engines that returned the base recall webpage.
Optionally, the fourth processing module is further configured to:
split the target text into sentences according to a third preset rule, and calculate the matching score of each resulting sentence in the web page text information; and, for each piece of web page text information, take the proportion of sentences whose matching score against that web page text information is greater than a third preset threshold, among all sentences obtained by splitting the target text according to the third preset rule, as the matching rate of the target webpage, wherein the matching score characterizes the degree of similarity between a sentence in the target text and the web page text information, and a higher matching score indicates a higher degree of similarity.
Optionally, the fourth processing module is further configured to:
segment into words all sentences obtained by splitting the target text according to the third preset rule, set a movable window for each sentence according to the number of words it contains, and match each sentence against the web page text information by moving the movable window one word at a time;
in each match, if the ratio of words of the web page text information shown in the movable window that match words of the target-text sentence corresponding to the movable window is not less than a fifth preset threshold, calculate a second ratio of the distance between matched, adjacent words to a total word count, and take the largest of the second ratios as the matching score of the sentence corresponding to the movable window, wherein the total word count is the total number of words in the movable window minus one.
Optionally, the web page text information includes one or more of the web page body text, the publication time, and the author name.
The disclosure also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described above.
The disclosure also provides an electronic device, comprising:
a memory storing a computer program;
a processor configured to execute the computer program in the memory, to implement the steps of the method described above.
With the above technical solution, the target text to be checked is split into sentences, search engines are used to find target webpages with content similar to the target text, and the text information in the target webpages is matched against the target text, so that webpages similar to the target text are detected and it becomes easy to determine whether the target text plagiarizes other web page content.
Other features and advantages of the disclosure are described in detail in the detailed description that follows.
Brief description of the drawings
The accompanying drawings provide a further understanding of the disclosure and constitute part of the specification. Together with the detailed description below they serve to explain the disclosure, but they do not limit it. In the drawings:
Fig. 1 is a flowchart of a similar web page detection method according to an exemplary embodiment of the disclosure.
Fig. 2 is a flowchart of the method for determining target webpages in a similar web page detection method according to an exemplary embodiment of the disclosure.
Fig. 3 is a structural block diagram of a similar web page detection device according to an exemplary embodiment of the disclosure.
Fig. 4 is a structural block diagram of another similar web page detection device according to an exemplary embodiment of the disclosure.
Fig. 5 is a block diagram of an electronic device according to an exemplary embodiment.
Fig. 6 is a block diagram of an electronic device according to an exemplary embodiment.
Detailed description
Specific embodiments of the disclosure are described in detail below with reference to the drawings. It should be understood that the specific embodiments described here only illustrate and explain the disclosure and do not limit it.
Fig. 1 is a flowchart of a similar web page detection method according to an exemplary embodiment of the disclosure. As shown in Fig. 1, the method includes steps 101 to 104.
In step 101, a first predetermined number of target sentences are chosen from the target text. When detecting webpages similar to the target text, a subset of sentences can first be selected from the target text and searched for. This takes much less time than searching for the entire target text, which improves the efficiency of similar web page detection. The first predetermined number should preferably be greater than 1 and less than the total number of sentences in the target text.
In one possible embodiment, choosing the first predetermined number of target sentences from the target text comprises: splitting the target text into sentences according to a first preset rule, and randomly sampling from all of the resulting sentences to obtain the first predetermined number of target sentences. To search with selected sentences in place of the entire target text, the target text is first split into sentences. Splitting a passage of text into sentences is a common technique for those skilled in the art, so the first preset rule can be any sentence-splitting method, and the details are not repeated here. After the target text has been split, the first predetermined number of target sentences can be picked from all of its sentences by random sampling. The first predetermined number can be, for example, 10.
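The sentence selection in step 101 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the punctuation-based splitting rule, the function names `split_sentences` and `sample_target_sentences`, and the `seed` parameter are all assumptions introduced here. The source only requires some sentence-splitting rule followed by random sampling.

```python
import random
import re

# Assumed first preset rule: a sentence ends at common Chinese or Western
# terminal punctuation. The source allows any sentence-splitting method.
_SENTENCE = re.compile(r"[^。！？!?.]+[。！？!?.]?")


def split_sentences(text):
    """Split text into sentences, dropping empty fragments."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def sample_target_sentences(text, first_preset_number=10, seed=None):
    """Randomly pick up to first_preset_number sentences from the text."""
    sentences = split_sentences(text)
    # Never ask for more sentences than the text actually contains.
    k = min(first_preset_number, len(sentences))
    return random.Random(seed).sample(sentences, k)
```

Passing a fixed `seed` makes the sample reproducible, which may help when re-checking the same submission; leaving it as `None` gives a fresh sample on each run.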
In step 102, each target sentence is searched for using a second predetermined number of search engines, and a third predetermined number of target webpages are chosen from the search results according to a second preset rule. After the first predetermined number of target sentences have been selected in step 101 from the sentences obtained by splitting the target text, each target sentence is searched for separately with the search engines, and the third predetermined number of target webpages are obtained. One or more search engines can be used; when several are used, each target sentence is searched for once in each search engine, and the third predetermined number of target webpages are selected from all of the search results according to the second preset rule. Because searching for the target sentences generally returns a very large number of webpages, the third predetermined number of target webpages must be chosen from those results before the subsequent steps are performed, which bounds the number of target webpages handled later. The second preset rule is not restricted here, as long as it can determine the third predetermined number of target webpages from the search results. The third predetermined number can be, for example, 50.
In step 103, the web page text information in all of the target webpages is obtained. After the third predetermined number of target webpages have been determined in step 102, their web page text information is obtained for use in the subsequent step 104.
In step 104, the matching rate between the target text and the web page text information is calculated, and any webpage whose matching rate is greater than the first preset threshold is determined to be a webpage similar to the target text, where a higher matching rate indicates that the target text is more similar to the web page text information. Once the target webpages have been selected and their web page text information obtained, the target text is compared with the web page text information of every target webpage, and the matching rate between the target text and each target webpage is calculated. The method of calculating the matching rate is not restricted here, as long as the matching rate characterizes the degree of similarity between the target text and the target webpage: the higher the matching rate, the higher the similarity. The first preset threshold can be, for example, 70%.
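The final decision in step 104 reduces to a threshold filter. The sketch below assumes the matching rates have already been computed and are held in a dictionary keyed by URL; the function name `find_similar_pages` and that data layout are hypothetical, and the default of 0.70 is simply the example value given in the text.

```python
def find_similar_pages(match_rates, first_preset_threshold=0.70):
    """Return the URLs whose matching rate exceeds the first preset threshold.

    match_rates: dict mapping a page URL to its matching rate in [0, 1]
    (an assumed representation; the source does not fix one).
    """
    # The source says "greater than", so a rate equal to the threshold
    # is not reported as similar.
    return [url for url, rate in match_rates.items()
            if rate > first_preset_threshold]
```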
With the above technical solution, the target text to be checked is split into sentences, search engines are used to find target webpages with content similar to the target text, and the text information in the target webpages is matched against the target text, so that webpages similar to the target text are detected and it becomes easy to determine whether the target text plagiarizes other web page content.
In one possible embodiment, the second preset rule in step 102 of Fig. 1 may include steps 201 to 204 shown in Fig. 2.
In step 201, the top fourth predetermined number of webpages in the search results returned by each search engine for each target sentence are taken as base recall webpages. The fourth predetermined number can be, for example, 10; that is, after each search engine has searched for any one of the target sentences, the 10 highest-ranked webpages among all results obtained are taken as base recall webpages. If there are 10 target sentences and the second predetermined number is 2, the number of base recall webpages finally returned is 200 (10*2*10).
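The 10*2*10 count above follows directly from looping over sentences and engines and keeping the top results of each query. In the sketch below the search engines are stand-in callables supplied by the caller; the name `collect_base_recall`, the result fields, and the `engines` mapping are assumptions made for illustration, since the source does not describe any search engine interface.

```python
def collect_base_recall(target_sentences, engines, fourth_preset_number=10):
    """Gather the top-ranked results of every (sentence, engine) query.

    engines: dict mapping an engine name to a callable that takes a
    sentence and returns a ranked list of result dicts (hypothetical
    interface, e.g. {"url": ..., "summary": ...}).
    """
    recalled = []
    for sentence in target_sentences:
        for name, search in engines.items():
            # Keep only the first fourth_preset_number ranked results.
            for result in search(sentence)[:fourth_preset_number]:
                recalled.append({"sentence": sentence, "engine": name, **result})
    return recalled
```

With 10 sentences, 2 engines and 10 results kept per query, the returned list has 200 entries, assuming every query yields at least 10 results.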
In step 202, the likelihood between each target sentence and the summary text of the corresponding base recall webpage is calculated; a higher likelihood indicates that the summary text and the target sentence are more similar. The likelihood between the base recall webpages selected in step 201 and each target sentence is computed here by comparing the summary text of the base recall webpage with the target sentence. Summary texts generally stay within a limited number of characters, so comparing the summary text with the target sentence computes the likelihood much more efficiently than comparing the full web page text information of the base recall webpage with the target sentence.
In step 203, the similarity between each base recall webpage whose likelihood is higher than the second preset threshold and the corresponding target sentence is calculated. The second preset threshold can be, for example, 80%. Step 202 compares the summary of each base recall webpage with the target sentence to obtain the likelihood between them; based on that likelihood, the base recall webpages whose likelihood is greater than the second preset threshold are selected. This picks out from the base recall webpages the subset that is more similar to the target text, for a further similarity calculation over their full web page text information. Because the base recall webpages that reach step 203 have already been screened once using the likelihood from step 202, far fewer remain than in the raw search results of the second predetermined number of search engines, so the full web page text information of the remaining base recall webpages can be examined directly to calculate its similarity to the target sentence.
In step 204, according to the similarity, the top third predetermined number of webpages most similar to the target sentences are determined as the target webpages. The third predetermined number can be, for example, 50. After the similarity between the target sentences and the full web page text information of the remaining base recall webpages has been calculated, the top third predetermined number of webpages most similar to the target sentences are selected according to that similarity and determined as target webpages. The similarity can be such that a larger value indicates greater similarity and a smaller value indicates less similarity.
In this way, the number of webpages similar to the target text can be reduced to the third predetermined number, ranked by degree of similarity from high to low.
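Selecting the top third predetermined number of pages by similarity is an ordinary top-N selection. The sketch below uses the standard library's `heapq.nlargest` and assumes that a larger score means greater similarity and that scores are held in a dictionary keyed by URL; the function name `select_target_pages` and that layout are illustrative assumptions.

```python
import heapq


def select_target_pages(scored_pages, third_preset_number=50):
    """Return up to third_preset_number URLs, most similar first.

    scored_pages: dict mapping a page URL to its similarity score
    (assumed layout; larger score = more similar).
    """
    top = heapq.nlargest(third_preset_number, scored_pages.items(),
                         key=lambda item: item[1])
    return [url for url, _score in top]
```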
In one possible embodiment, step 202 shown in Fig. 2 further includes: segmenting each target sentence and the summary text into words; matching the words in each target sentence against the words in the corresponding summary text, and determining, as the likelihood, a first ratio of the number of matched words in the target sentence to the total number of words in the target sentence. For example, if a target sentence has 10 words after segmentation, the summary text of a base recall webpage corresponding to that target sentence has 12 words after segmentation, and 9 words of the target sentence can be matched in the summary text, then the likelihood is 90% (9/10).
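The 9/10 example can be reproduced with a few lines of code. The sketch assumes both texts have already been segmented into word lists (the source does not name a segmenter) and treats a sentence word as matched if it appears anywhere in the summary; the function name `hit_rate` is borrowed from the symbol used in the similarity formula.

```python
def hit_rate(sentence_words, summary_words):
    """Fraction of the sentence's words that also occur in the summary."""
    if not sentence_words:
        return 0.0
    summary_set = set(summary_words)
    matched = sum(1 for word in sentence_words if word in summary_set)
    # The denominator is the sentence length, not the summary length.
    return matched / len(sentence_words)
```

Note that the summary's own length (12 words in the example) does not enter the ratio; only the target sentence's word count does.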
In one possible embodiment, step 203 shown in Fig. 2 further includes calculating the similarity according to the following formula:
score = hit_rate * 10 + return_counts,
where score is the similarity, hit_rate is the likelihood, and return_counts is the number of search engines that returned the base recall webpage.
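A small sketch of the formula follows. The formula itself comes from the text; the helper that derives `return_counts` by counting distinct engines per URL, and the record layout it reads (`"url"` and `"engine"` keys), are assumptions added here for illustration.

```python
from collections import defaultdict


def count_engines_per_url(recalled):
    """Count how many distinct engines returned each URL.

    recalled: iterable of dicts with "url" and "engine" keys
    (a hypothetical record layout).
    """
    engines = defaultdict(set)
    for item in recalled:
        engines[item["url"]].add(item["engine"])
    return {url: len(names) for url, names in engines.items()}


def similarity_score(hit_rate, return_counts):
    """score = hit_rate * 10 + return_counts, as given in the text."""
    return hit_rate * 10 + return_counts
```

Since the likelihood lies between 0 and 1, the first term contributes at most 10 points, and each additional engine that returned the page adds one more point.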
In one possible embodiment, the matching rate between the target text and the web page text information in step 104 of Fig. 1 can be calculated as follows: the target text is split into sentences according to a third preset rule, and the matching score of each resulting sentence in the web page text information is calculated; for each piece of web page text information, the proportion of sentences whose matching score against that web page text information is greater than the third preset threshold, among all sentences obtained by splitting the target text according to the third preset rule, is taken as the matching rate of the target webpage. The third preset rule may be the same as or different from the first preset rule, as long as it splits the target text into sentences. After the target text has been split, every resulting sentence is compared with the web page text information of a target webpage to obtain the matching score of each sentence against that target webpage. The matching score characterizes the degree of similarity between a sentence in the target text and the target webpage; the higher the matching score, the higher the similarity. The matching rate of each target webpage is then calculated from the matching scores. For example, if the target text is split into 10 sentences and the third preset threshold is 80%, and for one of the target webpages, webpage A, 9 sentences of the target text have a matching score against webpage A greater than 80%, then the matching rate between the target text and webpage A is 90%.
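The webpage A example can be checked with the following sketch. It assumes the per-sentence matching scores against one page are already available as a list; the function name `matching_rate` and the list representation are illustrative choices, not part of the source.

```python
def matching_rate(sentence_scores, third_preset_threshold=0.80):
    """Share of target-text sentences whose score exceeds the threshold.

    sentence_scores: matching scores of every sentence of the target
    text against a single web page (assumed to be precomputed).
    """
    if not sentence_scores:
        return 0.0
    hits = sum(1 for score in sentence_scores if score > third_preset_threshold)
    return hits / len(sentence_scores)
```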
The matching score can be calculated by any method for measuring text similarity. In one possible embodiment it is calculated as follows: all sentences obtained by splitting the target text according to the third preset rule are segmented into words, a movable window is set for each sentence according to the number of words it contains, and each sentence is matched against the web page text information by moving the window one word at a time. In each match, if the ratio of words of the web page text information shown in the window that match words of the corresponding target-text sentence is not less than the fifth preset threshold, a second ratio of the distance between matched, adjacent words to a total word count is calculated, and the largest of the second ratios is taken as the matching score of the sentence corresponding to the window, where the total word count is the number of words in the window minus one. The fifth preset threshold can be set differently according to the actual situation and is preferably 80%.
For example, suppose a sentence in the target text, after word segmentation, consists of four words A, B, C and D. The web page text information of one of the target webpages is also segmented, and the positions of A, B, C and D in that webpage are determined, for example A = (5, 10, 12, 20, 24), B = (1, 3, 11, 55, 75, 98), C = (7, 13, 45, 56, 85, 97, 101), D = (8, 14, 44, 57, 86, 88). Merging the positions of the four words gives their positions in the web page text information: (1, B), (3, B), (5, A), (7, C), (8, D), (10, A), (11, B), (12, A), (13, C), (14, D), (20, A), (24, A), (44, D), (45, C), (55, B), (56, C), (57, D), (75, B), (85, C), (86, D), (88, D), (97, C), (98, B), (101, C). A window whose length equals the number of words in the sentence is then created and moved one word at a time through the web page text information, and a score is computed at every position. For each score, it is first checked whether the ratio of words in the window that match words of the sentence is not less than 60% (the fifth preset threshold in this example); if so, the second ratio of the distance between matched, adjacent words in the window to the total word count is calculated. For example, when the window covers the four positions 54, 55, 56 and 57 of the web page text, the ratio of window words matching the sentence is 3/4 (75%), which is greater than the fifth preset threshold (60%), so scoring proceeds. In this window the matched words B, C and D are adjacent to one another, so the distance is the distance 1 between B and C plus the distance 1 between C and D, that is, 2. The total word count is the number of words in the window minus one, that is, four minus one equals three. The matching score at this window position is therefore 2/3. Once every position of the window in the web page text information has been scored, the highest score is taken as the matching score of the webpage for that sentence.
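The window scoring walked through above can be sketched as follows. This is one reading of the description, not a confirmed implementation: it assumes a page word counts as matched if it occurs anywhere in the sentence, and it interprets "the distance between matched, adjacent words" as the number of neighbouring window positions that are both matched, which reproduces the 2/3 result for the window over positions 54 to 57. The function name and parameters are hypothetical.

```python
def sentence_match_score(sentence_words, page_words, fifth_preset_threshold=0.8):
    """Best sliding-window score of one sentence against one page's words."""
    n = len(sentence_words)  # window length = number of words in the sentence
    if n < 2 or len(page_words) < n:
        return 0.0
    vocabulary = set(sentence_words)
    # hits[i] is True when the page word at index i occurs in the sentence.
    hits = [word in vocabulary for word in page_words]
    best = 0.0
    for start in range(len(page_words) - n + 1):
        window = hits[start:start + n]
        # Skip windows in which too few words match the sentence.
        if sum(window) / n < fifth_preset_threshold:
            continue
        # Assumed distance: one unit per pair of neighbouring matched words.
        adjacent = sum(1 for a, b in zip(window, window[1:]) if a and b)
        best = max(best, adjacent / (n - 1))
    return best
```

The default threshold of 0.8 follows the stated preference; the worked example uses 0.6, which has to be passed explicitly. Under this reading, a window in which every word matches scores (n - 1) / (n - 1) = 1.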
In one possible embodiment, while the movable window is moved word by word through the web page text information and scored, if the ratio of words shown in the window that match words of the corresponding target-text sentence is 1, the sentence can be matched exactly in the web page text information; no further scoring is performed, and 1 is returned directly as the matching score of the webpage for that sentence.
In one possible embodiment, the web page text information includes one or more of the web page body text, the publication time, and the author name.
In one possible embodiment, after step 103 shown in Fig. 1, the first author of the target text is compared with the second author in the target webpage; if the first author is the same as the second author, step 104 shown in Fig. 1 is not executed.
Fig. 3 is a structural block diagram of a similar web page detection device according to an exemplary embodiment of the present disclosure. As shown in Fig. 3, the device includes: a first processing module 10, configured to select a first preset number of target sentences from a target text; a second processing module 20, configured to search for each target sentence using a second preset number of search engines, so as to obtain a third preset number of target web pages; a third processing module 30, configured to obtain the web page text information in all the target web pages; and a fourth processing module 40, configured to calculate the matching rate between the target text and the web page text information, and to determine a web page whose matching rate is greater than a first preset threshold as a web page similar to the target text.
With the above technical solution, the target text to be checked is split into sentences, search engines are used to find target web pages whose content is similar to the target text, and the text information in those target web pages is matched against the target text. Web pages similar to the target text are thereby detected, which makes it easy to determine whether the target text has been plagiarized in other web page content.
In one possible embodiment, the first processing module 10 is further configured to: split the target text into sentences according to a first preset rule, and randomly sample from all the sentences obtained by that split, so as to obtain the first preset number of target sentences.
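The sentence splitting and random sampling step can be sketched as below. The text does not state the first preset rule, so splitting on sentence-ending punctuation is an assumption made only for illustration, and the names `split_sentences` and `sample_target_sentences` are hypothetical.

```python
import random
import re


def split_sentences(text):
    """Split text on sentence-ending punctuation (an assumed 'first preset rule')."""
    parts = re.split(r"[。！？!?.;；\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def sample_target_sentences(text, first_preset_number, seed=None):
    """Randomly pick up to `first_preset_number` sentences from the text."""
    sentences = split_sentences(text)
    rng = random.Random(seed)  # a seed makes the sample reproducible
    k = min(first_preset_number, len(sentences))
    return rng.sample(sentences, k)
```

Splitting on every full stop is a simplification that would also break decimal numbers and abbreviations; a production rule would need to handle those cases.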
Fig. 4 is a structural block diagram of the second processing module 20 in a similar web page detection device according to an exemplary embodiment of the present disclosure. As shown in Fig. 4, the second processing module 20 includes: a search submodule 201, configured to search for each target sentence using the second preset number of search engines, and to select, from the search results returned by each search engine for each target sentence, the top web pages up to a fourth preset number as basic recalled web pages; a likelihood calculation submodule 202, configured to calculate the likelihood between each target sentence and the summary text of the corresponding basic recalled web page, a higher likelihood indicating that the summary text and the target sentence are more similar; a similarity calculation submodule 203, configured to calculate the similarity between each basic recalled web page whose likelihood is higher than a second preset threshold and the corresponding target sentence; and a target web page determination submodule 204, configured to determine, according to the similarity, the top web pages most similar to the target sentences, up to the third preset number, as the target web pages.
In one possible embodiment, the likelihood calculation submodule 202 is further configured to: segment each target sentence and the summary text into words; match the words in each target sentence against the words in the corresponding summary text; and determine the first ratio, of the number of matched words in the target sentence to the total number of words in the target sentence, as the likelihood. The similarity calculation submodule 203 is further configured to calculate the similarity according to the following formula: score = hit_rate * 10 + return_counts, where score is the similarity, hit_rate is the likelihood, and return_counts is the number of search engines that returned the basic recalled web page.
In one possible embodiment, the fourth processing module 40 is further configured to: split the target text into sentences according to a third preset rule, and calculate the matching score of each resulting sentence in the web page text information; and, for each piece of web page text information, take as the matching rate of the target web page the proportion of sentences whose matching score with that web page text information is greater than a third preset threshold among all the sentences obtained by splitting the target text according to the third preset rule. The matching score characterizes the degree of similarity between a sentence in the target text and the web page text information: the higher the matching score, the higher the degree of similarity.
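The matching rate defined above reduces to a simple proportion, sketched here. The function names and the way thresholds are passed in are assumptions; the per-sentence scores are taken as already computed.

```python
def matching_rate(sentence_scores, third_threshold):
    """Share of target-text sentences whose score exceeds the third preset threshold."""
    if not sentence_scores:
        return 0.0
    above = sum(1 for s in sentence_scores if s > third_threshold)
    return above / len(sentence_scores)


def is_similar_page(sentence_scores, third_threshold, first_threshold):
    """A page counts as similar when its matching rate exceeds the first preset threshold."""
    return matching_rate(sentence_scores, third_threshold) > first_threshold
```

For example, with scores of 1.0, 2/3, 0.0 and 0.5 and a third threshold of 0.6, two of four sentences qualify, so the matching rate is 0.5.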
In one possible embodiment, the fourth processing module 40 is further configured to: segment into words all the sentences obtained by splitting the target text according to the third preset rule, set a movable window for each sentence according to the number of words it contains, and match each sentence against the web page text information by moving the movable window word by word. In each match, if the ratio of words of the web page text information shown in the movable window that match words of the target text corresponding to the movable window is not less than the fifth preset threshold, the second ratio of the distances between words that are matched in the target text and adjacent to one another to the word total is calculated, and the largest of the second ratios is taken as the matching score of the sentence corresponding to the movable window, where the word total is the total number of words in the movable window minus one.
In one possible embodiment, the web page text information includes one or more of the web page body text, the publication time and the author name.
As for the device in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not explained again here.
Fig. 5 is a block diagram of an electronic device 500 according to an exemplary embodiment. As shown in Fig. 5, the electronic device 500 may include a processor 501 and a memory 502. The electronic device 500 may also include one or more of a multimedia component 503, an input/output (I/O) interface 504 and a communication component 505.
The processor 501 controls the overall operation of the electronic device 500 so as to complete all or part of the steps of the similar web page detection method described above. The memory 502 stores various types of data to support operation on the electronic device 500; such data may include, for example, instructions for any application or method operating on the electronic device 500, as well as application-related data such as contact data, sent and received messages, pictures, audio and video. The memory 502 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc. The multimedia component 503 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used to output and/or input audio signals. For example, the audio component may include a microphone for receiving external audio signals. A received audio signal may be further stored in the memory 502 or sent through the communication component 505. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 504 provides an interface between the processor 501 and other interface modules, which may be a keyboard, a mouse, buttons and the like. The buttons may be virtual buttons or physical buttons. The communication component 505 is used for wired or wireless communication between the electronic device 500 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, near field communication (NFC), 2G, 3G or 4G, or a combination of one or more of them; accordingly, the communication component 505 may include a Wi-Fi module, a Bluetooth module and an NFC module.
In an exemplary embodiment, the electronic device 500 may be implemented by one or more application specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors or other electronic components, for executing the similar web page detection method described above.
In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided; when executed by a processor, the program instructions implement the steps of the similar web page detection method described above. For example, the computer-readable storage medium may be the above memory 502 including program instructions, and the program instructions may be executed by the processor 501 of the electronic device 500 to complete the similar web page detection method described above.
Fig. 6 is a block diagram of an electronic device 600 according to an exemplary embodiment. For example, the electronic device 600 may be provided as a server. Referring to Fig. 6, the electronic device 600 includes one or more processors 622 and a memory 632 for storing computer programs executable by the processor 622. The computer program stored in the memory 632 may include one or more modules, each corresponding to a set of instructions. In addition, the processor 622 may be configured to execute the computer program so as to perform the similar web page detection method described above.
In addition, the electronic device 600 may also include a power supply component 626 and a communication component 650. The power supply component 626 may be configured to perform power management of the electronic device 600, and the communication component 650 may be configured to implement communication of the electronic device 600, for example wired or wireless communication. The electronic device 600 may also include an input/output (I/O) interface 658. The electronic device 600 may run an operating system stored in the memory 632, such as Windows Server™, Mac OS X™, Unix™ or Linux™.
In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided; when executed by a processor, the program instructions implement the steps of the similar web page detection method described above. For example, the computer-readable storage medium may be the above memory 632 including program instructions, and the program instructions may be executed by the processor 622 of the electronic device 600 to complete the similar web page detection method described above.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings. However, the present disclosure is not limited to the specific details of the above embodiments. Within the scope of the technical concept of the present disclosure, various simple modifications may be made to its technical solution, and these simple modifications all fall within the protection scope of the present disclosure.
It should further be noted that the specific technical features described in the above embodiments may be combined in any suitable manner provided there is no contradiction. To avoid unnecessary repetition, the present disclosure does not separately describe the various possible combinations.
In addition, the various embodiments of the present disclosure may also be combined with one another in any manner; as long as such a combination does not depart from the idea of the present disclosure, it should likewise be regarded as content disclosed by the present disclosure.
Claims (16)
1. A similar web page detection method, characterized in that the method comprises:
selecting a first preset number of target sentences from a target text;
searching for each target sentence using a second preset number of search engines, and selecting a third preset number of target web pages from the search results according to a second preset rule;
obtaining the web page text information in all the target web pages; and
calculating the matching rate between the target text and the web page text information, and determining a web page whose matching rate is greater than a first preset threshold as a web page similar to the target text, wherein a higher matching rate indicates that the target text and the web page text information are more similar.
2. The method according to claim 1, characterized in that selecting the first preset number of target sentences from the target text comprises:
splitting the target text into sentences according to a first preset rule, and randomly sampling from all the sentences obtained by splitting the target text according to the first preset rule, so as to obtain the first preset number of target sentences.
3. The method according to claim 1, characterized in that the second preset rule comprises:
selecting, from the search results returned by each search engine for each target sentence, the top web pages up to a fourth preset number as basic recalled web pages;
calculating the likelihood between each target sentence and the summary text of the corresponding basic recalled web page, wherein a higher likelihood indicates that the summary text and the target sentence are more similar;
calculating the similarity between each basic recalled web page whose likelihood is higher than a second preset threshold and the corresponding target sentence; and
determining, according to the similarity, the top web pages most similar to the target sentences, up to the third preset number, as the target web pages.
4. The method according to claim 3, characterized in that calculating the likelihood between each target sentence and the summary text of the corresponding basic recalled web page comprises:
segmenting each target sentence and the summary text into words; and
matching the words in each target sentence against the words in the corresponding summary text, and determining the first ratio, of the number of matched words in the target sentence to the total number of words in the target sentence, as the likelihood;
and in that calculating the similarity between each basic recalled web page whose likelihood is higher than the second preset threshold and the corresponding target sentence comprises:
calculating the similarity according to the following formula:
score = hit_rate * 10 + return_counts,
wherein score is the similarity, hit_rate is the likelihood, and return_counts is the number of search engines that returned the basic recalled web page.
5. The method according to claim 1, characterized in that calculating the matching rate between the target text and the web page text information comprises:
splitting the target text into sentences according to a third preset rule, and calculating the matching score of each resulting sentence in the web page text information; and, for each piece of web page text information, taking as the matching rate of the target web page the proportion of sentences whose matching score with that web page text information is greater than a third preset threshold among all the sentences obtained by splitting the target text according to the third preset rule, wherein the matching score characterizes the degree of similarity between a sentence in the target text and the web page text information, and a higher matching score indicates a higher degree of similarity.
6. The method according to claim 5, characterized in that the matching score is calculated as follows:
segmenting into words all the sentences obtained by splitting the target text according to the third preset rule, setting a movable window for each sentence according to the number of words it contains, and matching each sentence against the web page text information by moving the movable window word by word; and
in each match, if the ratio of words of the web page text information shown in the movable window that match words of the target text corresponding to the movable window is not less than a fifth preset threshold, calculating the second ratio of the distances between words that are matched in the target text and adjacent to one another to the word total, and taking the largest of the second ratios as the matching score of the sentence corresponding to the movable window, wherein the word total is the total number of words in the movable window minus one.
7. The method according to claim 1, characterized in that the web page text information includes one or more of the web page body text, the publication time and the author name.
8. A similar web page detection device, characterized in that the device comprises:
a first processing module, configured to select a first preset number of target sentences from a target text;
a second processing module, configured to search for each target sentence using a second preset number of search engines, and to select a third preset number of target web pages from the search results according to a second preset rule;
a third processing module, configured to obtain the web page text information in all the target web pages; and
a fourth processing module, configured to calculate the matching rate between the target text and the web page text information, and to determine a web page whose matching rate is greater than a first preset threshold as a web page similar to the target text, wherein a higher matching rate indicates that the target text and the web page text information are more similar.
9. The device according to claim 8, characterized in that the first processing module is further configured to:
split the target text into sentences according to a first preset rule, and randomly sample from all the sentences obtained by splitting the target text according to the first preset rule, so as to obtain the first preset number of target sentences.
10. The device according to claim 8, characterized in that the second processing module comprises:
a search submodule, configured to search for each target sentence using the second preset number of search engines, and to select, from the search results returned by each search engine for each target sentence, the top web pages up to a fourth preset number as basic recalled web pages;
a likelihood calculation submodule, configured to calculate the likelihood between each target sentence and the summary text of the corresponding basic recalled web page, wherein a higher likelihood indicates that the summary text and the target sentence are more similar;
a similarity calculation submodule, configured to calculate the similarity between each basic recalled web page whose likelihood is higher than a second preset threshold and the corresponding target sentence; and
a target web page determination submodule, configured to determine, according to the similarity, the top web pages most similar to the target sentences, up to the third preset number, as the target web pages.
11. The device according to claim 10, characterized in that the likelihood calculation submodule is further configured to:
segment each target sentence and the summary text into words; and
match the words in each target sentence against the words in the corresponding summary text, and determine the first ratio, of the number of matched words in the target sentence to the total number of words in the target sentence, as the likelihood;
and in that the similarity calculation submodule is further configured to calculate the similarity according to the following formula:
score = hit_rate * 10 + return_counts,
wherein score is the similarity, hit_rate is the likelihood, and return_counts is the number of search engines that returned the basic recalled web page.
12. The device according to claim 8, characterized in that the fourth processing module is further configured to:
split the target text into sentences according to a third preset rule, and calculate the matching score of each resulting sentence in the web page text information; and, for each piece of web page text information, take as the matching rate of the target web page the proportion of sentences whose matching score with that web page text information is greater than a third preset threshold among all the sentences obtained by splitting the target text according to the third preset rule, wherein the matching score characterizes the degree of similarity between a sentence in the target text and the web page text information, and a higher matching score indicates a higher degree of similarity.
13. The device according to claim 12, characterized in that the fourth processing module is further configured to:
segment into words all the sentences obtained by splitting the target text according to the third preset rule, set a movable window for each sentence according to the number of words it contains, and match each sentence against the web page text information by moving the movable window word by word; and
in each match, if the ratio of words of the web page text information shown in the movable window that match words of the target text corresponding to the movable window is not less than a fifth preset threshold, calculate the second ratio of the distances between words that are matched in the target text and adjacent to one another to the word total, and take the largest of the second ratios as the matching score of the sentence corresponding to the movable window, wherein the word total is the total number of words in the movable window minus one.
14. The device according to claim 8, characterized in that the web page text information includes one or more of the web page body text, the publication time and the author name.
15. A computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1-7.
16. An electronic device, characterized by comprising:
a memory having a computer program stored thereon; and
a processor, configured to execute the computer program in the memory so as to implement the steps of the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811369272.3A CN109710834B (en) | 2018-11-16 | 2018-11-16 | Similar webpage detection method and device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109710834A true CN109710834A (en) | 2019-05-03 |
CN109710834B CN109710834B (en) | 2020-01-10 |
Family
ID=66254955
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811369272.3A Active CN109710834B (en) | 2018-11-16 | 2018-11-16 | Similar webpage detection method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109710834B (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1952947A (en) * | 2005-10-17 | 2007-04-25 | 左其其 | A system and method for web site against clone |
CN101350032A (en) * | 2008-09-23 | 2009-01-21 | 胡辉 | Method for judging whether web page content is identical or not |
CN103049467A (en) * | 2011-10-12 | 2013-04-17 | 杨纯青 | Chinese digital anti-plagiarism detection and comparison system and method |
US8782082B1 (en) * | 2011-11-07 | 2014-07-15 | Trend Micro Incorporated | Methods and apparatus for multiple-keyword matching |
CN103345466A (en) * | 2013-07-12 | 2013-10-09 | 唐煜舟 | Academic paper information detection method based on free internet information |
CN103678528A (en) * | 2013-12-03 | 2014-03-26 | 北京建筑大学 | Electronic homework plagiarism preventing system and method based on paragraph plagiarism detection |
CN105808552A (en) * | 2014-12-30 | 2016-07-27 | 北京奇虎科技有限公司 | Method and device for extracting abstract from webpage based on slide window |
CN106874299A (en) * | 2015-12-14 | 2017-06-20 | 北京国双科技有限公司 | Page detection method and device |
CN106909628A (en) * | 2017-01-24 | 2017-06-30 | 南京大学 | A kind of text similarity method based on interval |
Non-Patent Citations (1)
Title |
---|
廖兴伟: "文档复制检测方法研究与系统实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111552877A (en) * | 2020-04-29 | 2020-08-18 | 百度在线网络技术(北京)有限公司 | Data processing method and device |
CN111552877B (en) * | 2020-04-29 | 2023-11-07 | 百度在线网络技术(北京)有限公司 | Data processing method and device |
CN112699657A (en) * | 2020-12-30 | 2021-04-23 | 广东德诚大数据科技有限公司 | Abnormal text detection method and device, electronic equipment and storage medium |
CN113887192A (en) * | 2021-12-06 | 2022-01-04 | 阿里巴巴达摩院(杭州)科技有限公司 | Text matching method and device and storage medium |
CN114417812A (en) * | 2022-03-15 | 2022-04-29 | 太平金融科技服务(上海)有限公司深圳分公司 | Text checking method, device, equipment and storage medium |
CN115687736A (en) * | 2022-12-30 | 2023-02-03 | 北京长亭未来科技有限公司 | Web application searching method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN109710834B (en) | 2020-01-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||