CN105243062A - Webpage feature region detection method and apparatus - Google Patents

Info

Publication number
CN105243062A
Authority
CN
China
Prior art keywords
page
webpage
characteristic area
area
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410245946.4A
Other languages
Chinese (zh)
Other versions
CN105243062B (en
Inventor
梁捷
周超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangzhou Dongjing Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Dongjing Computer Technology Co Ltd filed Critical Guangzhou Dongjing Computer Technology Co Ltd
Priority to CN201410245946.4A priority Critical patent/CN105243062B/en
Publication of CN105243062A publication Critical patent/CN105243062A/en
Application granted granted Critical
Publication of CN105243062B publication Critical patent/CN105243062B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and apparatus for detecting webpage feature regions. The method first generates a first page result of a page with filtering working normally and, after a set threshold time, generates a second page result of the page. It then compares the second page result with the first page result; if differing regions are found, those regions are determined to be problematic feature regions. In the scenario of advertisement filtering on a webpage, the problematic feature regions are advertisement regions, which may appear because an advertisement filtering rule has failed and an advertisement that should have been filtered is shown, or because a new advertisement is not covered by the filtering rules. By comparing the webpage with a reference webpage obtained with filtering working normally, the method and apparatus can quickly detect feature regions (advertisement regions) in the webpage, quickly discover problems, and provide a reference for subsequent webpage filtering, so that the filtering rules can be adjusted to obtain better filtering results.

Description

A method and apparatus for detecting webpage feature regions
Technical Field
The present invention relates to the field of mobile communication technology and, more specifically, to a method and apparatus for detecting webpage feature regions.
Background
Current webpages contain all kinds of advertisements. These advertisements degrade the user experience and may also consume additional data traffic during access. A browser or browser plug-in that can intelligently filter advertisements in webpages can therefore greatly improve the user experience.
Existing browsers are generally provided with advertisement filtering rules. These rules are formulated by checking whether webpages on the Internet have produced new forms of advertisement, using two approaches: user feedback and manual inspection. User feedback is not timely enough, and manual inspection is not efficient enough.
Existing systems for automated advertisement detection in webpages also detect advertisements by comparing differences in the DOM tree and Render tree generated during webpage parsing and layout. Specifically, this method obtains the DOM tree and Render tree of an advertisement-free webpage after advertisement filtering, and then compares the DOM tree and Render tree of the webpage under test with those of the advertisement-free webpage, thereby detecting advertisements.
However, this approach is usually suited to test pages whose content does not change. For Internet pages whose content changes, it cannot distinguish changes caused by advertisements from changes caused by the page's own content, so it may fail to detect advertisements. Moreover, in the prior art, advertisement filtering itself works through the DOM structure of the webpage; if the automated detection system uses the same mechanism, it is also difficult to achieve the goal of detecting advertisements.
Summary of the Invention
In view of the above problems, the object of the present invention is to provide a method and apparatus for detecting webpage feature regions that can quickly detect feature regions in a webpage, make it easier to quickly discover problems in webpage advertisement filtering, and provide a reference for subsequent advertisement filtering, so that the filtering rules can be adjusted and better filtering results obtained.
According to one aspect of the present invention, a method for detecting webpage feature regions is provided, comprising:
generating a first page result of a page with filtering working normally;
after a set threshold time, obtaining a second page result of the page;
comparing the second page result with the first page result and, if differing regions are found, determining that the differing regions are problematic feature regions.
Wherein generating the first page result of the page with filtering working normally comprises: generating, with filtering working normally, a first page result in which content logic regions are marked off, wherein the content logic regions are generated by performing multiple page loads, comparing the differences between the loaded webpages, and merging the differences;
comparing the second page result with the first page result comprises:
comparing the regions of the second page result and the first page result other than the content logic regions.
Wherein performing multiple page loads and comparing the differences between the loaded webpages to generate the content logic regions by merging comprises:
taking a screenshot of each loaded page, comparing the differences between the screenshots, and recording the differing pixels;
generating, from the differing pixels, multiple rectangular regions enclosing the differing pixels;
merging adjacent rectangular regions into content logic regions.
Wherein comparing the second page result with the first page result comprises:
determining whether the page is offset;
if the page is offset, calculating a page offset value;
performing page alignment according to the page offset value and then comparing again.
Wherein determining whether the page is offset comprises:
iterating from the first row of the page and checking whether any other row has the same red, green and blue color feature values as the current row; if such a row exists, continuing to compare, one by one, whether the color feature values of each subsequent row within a set threshold range are also equal; if they are all equal, determining that the currently compared page is offset; in all other cases, determining that no page offset has occurred.
Wherein calculating the page offset value comprises:
calculating the position difference between the two offset rows, the position difference being the page offset value.
Wherein the content logic regions are configured to be displayed in a first color, and the determined problematic feature regions are configured to be displayed in a second color.
In another aspect, the present invention also provides an apparatus for detecting webpage feature regions, comprising:
a benchmark result generation unit, configured to generate a first page result of a page with filtering working normally;
a comparison page generation unit, configured to obtain a second page result of the page after a set threshold time;
a feature region determination unit, configured to compare the second page result with the first page result and, if differing regions are found, to determine that the differing regions are problematic feature regions.
Wherein the benchmark result generation unit comprises:
a loading module, configured to perform multiple page loads;
a difference search module, configured to compare the differences between the webpages obtained from the multiple page loads;
a content region generation module, configured to generate content logic regions from the differences between the webpages.
Wherein the benchmark result generation unit further comprises:
a screenshot module, configured to take a screenshot of each loaded page;
a rectangular region generation module, configured to generate multiple rectangular regions from the differing pixels, so that the content region generation module can merge the multiple rectangular regions into content logic regions.
Wherein the feature region determination unit comprises:
a comparison module, configured to compare the second page result with the first page result;
an offset judgment module, configured to determine whether the page is offset when, during comparison of the first page screenshot and the second page screenshot, the currently compared rows differ;
an offset value calculation module, configured to calculate the page offset value when the offset judgment module determines that the page is offset;
an alignment module, configured to perform page alignment according to the page offset value;
a feature region determination module, configured to determine the differing regions finally identified after page alignment as the feature regions of the webpage.
The method and apparatus for detecting webpage feature regions of the present invention first generate a first page result of a page with filtering working normally and, after a set threshold time, generate a second page result of the page. The second page result is then compared with the first page result; if differing regions are found, those regions are determined to be problematic feature regions. In the scenario of advertisement filtering on a webpage, these problematic feature regions are advertisement regions; the cause may be that an advertisement filtering rule has failed so that an advertisement that should have been filtered appears, or that a new advertisement is not covered by the filtering rules. Therefore, by comparing the webpage with a reference webpage obtained with filtering working normally, the present invention can quickly detect feature regions (advertisement regions) in the webpage, quickly discover problems, and provide a reference for subsequent webpage filtering, so that the filtering rules can be adjusted and better filtering results obtained.
To achieve the above and related objects, one or more aspects of the present invention comprise the features described in detail below and particularly pointed out in the claims. The following description and the accompanying drawings set out certain illustrative aspects of the present invention in detail. These aspects, however, indicate only some of the various ways in which the principles of the present invention may be employed. Furthermore, the present invention is intended to include all such aspects and their equivalents.
Brief Description of the Drawings
Other objects and results of the present invention will become clearer and easier to understand by reference to the following description taken in conjunction with the accompanying drawings and the claims, and as the invention is more fully understood. In the drawings:
Fig. 1 is a flowchart of the method for detecting webpage feature regions provided by an embodiment of the present invention;
Fig. 2 is a detailed flowchart of one embodiment of the method for detecting webpage feature regions provided by an embodiment of the present invention;
Fig. 3 is a block diagram of an apparatus for detecting webpage feature regions according to the present invention;
Fig. 4 is a block diagram of the benchmark result generation unit in one embodiment of the apparatus for detecting webpage feature regions according to the present invention;
Fig. 5 is a block diagram of the feature region determination unit in one embodiment of the apparatus for detecting webpage feature regions according to the present invention.
The same reference numerals indicate similar or corresponding features or functions throughout the drawings.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
The method and apparatus for detecting webpage feature regions of the present invention can quickly detect feature regions in a webpage, make it easier to quickly discover problems in webpage advertisement filtering, and provide a reference for subsequent advertisement filtering, so that the filtering rules can be adjusted and better filtering results obtained.
Fig. 1 shows a flowchart of the method for detecting webpage feature regions according to an embodiment of the present invention.
As shown in Fig. 1, the method for detecting webpage feature regions according to the present invention comprises:
S110: generating a first page result of a page with filtering working normally.
"Filtering working normally" means that advertisement filtering is performed when the page is loaded, yielding a page without advertisements; this page is the first page result.
In this step, generating the first page result of the page with filtering working normally comprises: generating, with filtering working normally, a first page result in which content logic regions are marked off, where the content logic regions are generated by performing multiple page loads, comparing the differences between the loaded webpages, and merging the differences. Specifically, the page is loaded multiple times with advertisement filtering performed on each load; a screenshot is then taken of each loaded page, the screenshots are compared, and the differing pixels are recorded; multiple rectangular regions are generated from the differing pixels; and adjacent rectangular regions are merged into content logic regions. The page screenshot with the content logic regions marked is then taken as the first page result.
The first page result and the information on the content logic regions need to be saved.
For example, after the page has been loaded twice, the differences between the two loaded pages are compared, and the page on which the differing points are recorded becomes the first page result. The two page loads must be separated by a time interval, for example about 3 to 7 days. The page at the same URL is loaded twice in succession and a screenshot is taken each time, giving two screenshots; the two screenshots are then compared and the differing pixels recorded. Multiple rectangular regions are generated from the recorded differing pixels. Neighboring rectangular regions are then merged into content logic regions, and the page screenshot with the content logic regions marked is determined to be the first page result. The information on the content logic regions also needs to be saved.
S120: after a set threshold time, obtaining a second page result of the page.
This page load must be performed some time after the first page result is formed; that is, the second page result of the page is generated after the set threshold time, for example 3 to 7 days. The page at the same URL is loaded, advertisement filtering is again performed during loading, and the resulting page result is the second page result.
S130: comparing the second page result with the first page result and, if differing regions are found, determining that the differing regions are problematic feature regions.
In this step the entire page may be compared, for example by comparing full-page screenshots, while differences within the webpage body content region are ignored.
Preferably, however, only the regions outside the content logic regions are compared. The content logic regions formed in S110 are taken to be the webpage body content region, and S130 compares the regions outside the body content region. If these regions differ, new content has appeared in them, and they are determined to be feature regions. These feature regions may be new advertisement regions. Because this embodiment compares only the regions outside the content logic regions (that is, outside the webpage body content region), it reduces the image comparison workload relative to comparing full-page screenshots, which saves time and speeds up the comparison.
The method for detecting webpage feature regions of the present invention first generates a first page result of a page with filtering working normally and, after a set threshold time, generates a second page result of the page. The second page result is then compared with the first page result; if differing regions are found, those regions are determined to be problematic feature regions. In the scenario of advertisement filtering on a webpage, these problematic feature regions are advertisement regions; the cause may be that an advertisement filtering rule has failed so that an advertisement that should have been filtered appears, or that a new advertisement is not covered by the filtering rules. Therefore, by comparing the webpage with a reference webpage obtained with filtering working normally, the present invention can quickly detect feature regions (advertisement regions) in the webpage, quickly discover problems, and provide a reference for subsequent webpage filtering, so that the filtering rules can be adjusted and better filtering results obtained.
Fig. 2 shows a detailed flowchart of one embodiment of the method for detecting webpage feature regions according to an embodiment of the present invention.
As shown in Fig. 2, the method for detecting webpage feature regions of this embodiment comprises:
First, S200 is performed: the page is loaded and a screenshot is taken to form the base page screenshot. Advertisement filtering must be performed during page loading, and the screenshot is taken of the filtered webpage. The page is then loaded again and a screenshot is taken to generate the secondary page screenshot (S210). Advertisement filtering is also performed during this page load. The two load-and-screenshot operations must be separated by a time interval; that is, S210 is performed after the set threshold time, for example 3 to 7 days. The threshold time is set to ensure that the webpage body content changes during this period. Over a shorter interval the webpage content may well not change, in which case the webpage body content cannot be identified.
S220 is then performed: the base page screenshot and the secondary page screenshot are compared, and the differing pixels are recorded. The screenshots are compared using a conventional image comparison method, which is not described further here.
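As a non-limiting illustration, the comparison in S220 can be sketched in Python as follows. The sketch assumes that each screenshot is an equally sized list of rows of (R, G, B) tuples; the function name diff_pixels is hypothetical and does not appear in the original disclosure, which leaves the image comparison to conventional methods.

```python
def diff_pixels(base, second):
    """Return the (x, y) coordinates at which two screenshots differ.

    Assumption: both screenshots are lists of rows of (R, G, B) tuples
    with identical dimensions.
    """
    if len(base) != len(second) or any(
        len(a) != len(b) for a, b in zip(base, second)
    ):
        raise ValueError("screenshots must have the same dimensions")
    return [
        (x, y)
        for y, (row_a, row_b) in enumerate(zip(base, second))
        for x, (pa, pb) in enumerate(zip(row_a, row_b))
        if pa != pb
    ]
```

The coordinates are returned in line-by-line scan order, which matches the order in which S230 processes them.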
After the page screenshot with the recorded differing pixels is obtained, S230 is performed: multiple rectangular regions are generated from the differing pixels. In S230 the differing pixels are scanned line by line, and two differing pixels are assigned to the same rectangular region when the distance between them is within a certain range. In this embodiment, "within a certain range" means that dx^2 + dy^2 < 1000 is satisfied, in which case the two pixels are considered adjacent and are assigned to the same rectangular region, where dx denotes the horizontal distance and dy the vertical distance.
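As a non-limiting illustration, the grouping in S230 can be sketched in Python as follows. The sketch makes a simplifying assumption that is not stated in the original: a pixel is tested against the nearest edge of each existing rectangle rather than against every pixel already in it. The names gap and group_pixels are hypothetical; only the threshold dx^2 + dy^2 < 1000 comes from the embodiment.

```python
THRESHOLD = 1000  # from the embodiment: dx^2 + dy^2 < 1000


def gap(lo, hi, v):
    """Distance from coordinate v to the interval [lo, hi]; 0 if inside."""
    return max(lo - v, 0, v - hi)


def group_pixels(pixels, threshold=THRESHOLD):
    """Group differing pixels into bounding rectangles.

    Each rectangle is (left, top, right, bottom), inclusive.
    """
    rects = []
    # Process pixels in line-by-line scan order: by row, then by column.
    for x, y in sorted(pixels, key=lambda p: (p[1], p[0])):
        for r in rects:
            dx = gap(r[0], r[2], x)
            dy = gap(r[1], r[3], y)
            if dx * dx + dy * dy < threshold:
                # Close enough: grow the rectangle to enclose the pixel.
                r[0], r[1] = min(r[0], x), min(r[1], y)
                r[2], r[3] = max(r[2], x), max(r[3], y)
                break
        else:
            rects.append([x, y, x, y])
    return [tuple(r) for r in rects]
```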
After the rectangular regions have been formed, S240 is performed: adjacent rectangular regions are merged into content logic regions. A content logic region can be recorded as a content region of the page. Merging adjacent rectangular regions in S240 means that two rectangular regions are merged into one content logic region when the distance between them is within a certain range; that is, when the distance between two rectangular regions satisfies dx^2 + dy^2 < 1000, they are considered adjacent and are assigned to the same content logic region, where dx denotes the horizontal distance and dy the vertical distance. After S240 is completed, S250 is performed: the page screenshot containing the content logic regions is determined to be the first page screenshot, and the first page screenshot and the information on the content logic regions are saved. The first page screenshot corresponds to the first page result of S110 in the preceding embodiment. The information on the content logic regions includes at least the position of each content logic region in the page screenshot.
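As a non-limiting illustration, the merge in S240 can be sketched in Python as follows. The sketch assumes that the distance between two rectangles is measured between their nearest edges and that merging is repeated until no adjacent pair remains; the original does not specify either point. The names rect_gap and merge_rects are hypothetical.

```python
def rect_gap(a, b):
    """Horizontal and vertical gap between two (left, top, right, bottom) rects."""
    dx = max(0, a[0] - b[2], b[0] - a[2])
    dy = max(0, a[1] - b[3], b[1] - a[3])
    return dx, dy


def merge_rects(rects, threshold=1000):
    """Merge rectangles whose gap satisfies dx^2 + dy^2 < threshold."""
    regions = [tuple(r) for r in rects]
    merged = True
    while merged:  # repeat until a full pass finds nothing to merge
        merged = False
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                dx, dy = rect_gap(regions[i], regions[j])
                if dx * dx + dy * dy < threshold:
                    a, b = regions[i], regions[j]
                    regions[i] = (
                        min(a[0], b[0]), min(a[1], b[1]),
                        max(a[2], b[2]), max(a[3], b[3]),
                    )
                    del regions[j]
                    merged = True
                    break
            if merged:
                break
    return regions
```

The returned tuples are the positions of the content logic regions, which is the information S250 requires to be saved.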
S260: the page is loaded and a screenshot is taken to generate the second page screenshot. The second page screenshot corresponds to the second page result of S120 in the preceding embodiment. As in S200 and S210, advertisement filtering must be performed during loading, and S260 is performed only after S250 has been completed and the time threshold has elapsed, for example after 3 to 7 days.
S270: the first page screenshot is compared with the second page screenshot. That is, the page screenshot containing the content logic regions is compared, as an image, with the second page screenshot. In this embodiment, the parts of the screenshots outside the content logic regions determined in S250 are compared, using the position information of the content logic regions in the page screenshot saved in S250. The comparison is carried out by line-by-line scanning: when the color feature values of two corresponding rows are identical, the content of the two rows is considered identical. It is worth noting that the full-page screenshots may instead be compared in this step, with changes in the content logic regions simply ignored. Comparing only the regions outside the content logic regions (that is, outside the webpage body content region) reduces the image comparison workload relative to comparing full-page screenshots, which saves time and speeds up the comparison.
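As a non-limiting illustration, the comparison in S270 can be sketched in Python as follows. For simplicity the sketch compares pixels directly instead of per-row color feature values, which is a substitution and not the comparison described in the embodiment. The names in_any_region and differing_rows are hypothetical, and the content regions are assumed to be (left, top, right, bottom) tuples with inclusive bounds.

```python
def in_any_region(x, y, regions):
    """True if pixel (x, y) lies inside any (left, top, right, bottom) region."""
    return any(l <= x <= r and t <= y <= b for l, t, r, b in regions)


def differing_rows(first, second, content_regions):
    """Return indices of rows that differ outside the content logic regions.

    Assumption: screenshots are equally sized lists of rows of (R, G, B)
    tuples. Differences inside content_regions are ignored.
    """
    rows = []
    for y, (row_a, row_b) in enumerate(zip(first, second)):
        for x, (pa, pb) in enumerate(zip(row_a, row_b)):
            if pa != pb and not in_any_region(x, y, content_regions):
                rows.append(y)
                break  # one difference is enough to flag the row
    return rows
```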
Because the network is unstable and the timing of script execution in the page is uncertain, the screenshot of a page, or a region of it, may be shifted downward or upward as a whole relative to the first page result. In that case a direct comparison reports a difference even though the layout of the webpage has not actually changed. Therefore, in a preferred embodiment, when a difference is found while comparing the first page screenshot with the second page screenshot, specifically when the currently compared rows of the screenshots differ, S280 is entered first to determine whether the page is offset.
In this embodiment, whether the page is offset is determined as follows:
First, a color feature value is calculated for each row of the image regions to be compared. The color feature value of row j is calculated as:
$$\mathrm{jrowColor} = \sum_{i=0}^{\mathrm{width}} \mathrm{color}(i, j) \cdot i$$
Here i denotes the column index and j the row index; jrowColor denotes the color feature value of the whole row j; color(i, j) denotes the R, G and B color values of the pixel at column i of row j; and width denotes the maximum width of the current row.
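As a non-limiting illustration, the row color feature value can be sketched in Python as follows. The sketch assumes that the value is computed separately for each of the R, G and B channels, which is consistent with the later comparison of three color feature values but is not spelled out by the formula, and that columns are indexed from 0 so that the sum covers every pixel in the row. The name row_color_feature is hypothetical.

```python
def row_color_feature(row):
    """Return the (R, G, B) color feature values of one screenshot row.

    Each channel value is weighted by its column index i, so two rows
    with the same colors in different positions generally get
    different feature values.
    """
    r = sum(pixel[0] * i for i, pixel in enumerate(row))
    g = sum(pixel[1] * i for i, pixel in enumerate(row))
    b = sum(pixel[2] * i for i, pixel in enumerate(row))
    return (r, g, b)
```

Because each pixel is multiplied by its column index, the pixel in column 0 does not contribute to the value.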
Then, iterating from the first row of the page, it is checked whether any other row has the same red, green and blue color feature values as the current row. If such a row exists, the color feature values of each subsequent row within a set threshold range are compared one by one to see whether they are also equal. If they are all equal, it is determined that the currently compared page is offset; in all other cases it is determined that no page offset has occurred.
If the page is offset, S281 is performed to calculate the page offset value: the position difference between the two offset rows is calculated, and this position difference is the page offset value.
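As a non-limiting illustration, S280 and S281 can be sketched together in Python as follows. This is one possible reading of the procedure: the row at which a difference was found, together with a window of following rows, is searched for in the other screenshot, and the position difference of the match is returned as the offset. The name find_offset and the parameter window (standing in for the "set threshold range", whose value the original does not give) are hypothetical.

```python
def find_offset(first_feats, second_feats, row, window=5):
    """Return the page offset value in rows, or None if no offset is found.

    first_feats and second_feats are lists holding one color feature
    value per row of the first and second page screenshots.
    """
    target = first_feats[row:row + window]
    if len(target) < window:
        return None  # not enough rows left to confirm a match
    for cand in range(len(second_feats) - window + 1):
        # A match at a different position means the content has shifted.
        if cand != row and second_feats[cand:cand + window] == target:
            return cand - row
    return None
```

A positive return value means the content of the second screenshot is shifted downward relative to the first; S282 can then align the two screenshots by that many rows before the comparison resumes.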
After the page offset value has been calculated, page alignment is performed according to the page offset value (S282). After page alignment is completed, the method returns to S270, which now continues the comparison from the aligned region onward.
If S280 determines that the page is not offset, it is finally determined that the currently compared page regions differ.
S290: the differing regions finally identified are determined to be the feature regions of the webpage.
In a preferred embodiment, the content logic regions can be configured to be displayed in a first color, and the determined problematic feature regions can be configured to be displayed in a second color, so that the feature regions are easy to identify.
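As a non-limiting illustration, the two-color display can be sketched in Python as follows. The specific colors, the names paint and annotate, and the representation of regions as (left, top, right, bottom) tuples with inclusive bounds are all assumptions; the embodiment only requires that the two kinds of region be shown in different colors.

```python
FIRST_COLOR = (0, 0, 255)   # assumed color for content logic regions
SECOND_COLOR = (255, 0, 0)  # assumed color for problematic feature regions


def paint(image, regions, color):
    """Return a copy of image with every pixel inside regions set to color."""
    out = [list(row) for row in image]
    for left, top, right, bottom in regions:
        for y in range(top, bottom + 1):
            for x in range(left, right + 1):
                out[y][x] = color
    return out


def annotate(image, content_regions, feature_regions):
    """Mark content logic regions and feature regions in two colors."""
    marked = paint(image, content_regions, FIRST_COLOR)
    return paint(marked, feature_regions, SECOND_COLOR)
```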
Compared with comparing full-page screenshots, this reduces the comparison workload, saves time, and speeds up the comparison.
The method for detecting webpage feature regions of the present invention ignores changes in the webpage body content and compares the regions other than the body content. If these regions change, they are judged to be feature regions, which are very likely new advertisement regions; the cause may be that an advertisement filtering rule has failed so that an advertisement that should have been filtered appears, or that a new advertisement is not covered by the filtering rules. Therefore, by comparing the webpage with a reference webpage obtained with filtering working normally, the present invention can quickly detect feature regions (advertisement regions) in the webpage, quickly discover problems, and provide a reference for subsequent webpage filtering, so that the filtering rules can be adjusted and better filtering results obtained.
The present invention also provides an apparatus for detecting webpage feature regions.
Fig. 3 shows a block diagram of an apparatus for detecting webpage feature regions according to the present invention.
As shown in Fig. 3, the apparatus for detecting webpage feature regions of the present invention comprises a benchmark result generation unit 300, a comparison page generation unit 310 and a feature region determination unit 320.
The benchmark result generation unit 300 is configured to generate a first page result of a page with filtering working normally.
"Filtering working normally" means that advertisement filtering is performed when the page is loaded, yielding a page without advertisements; this page is the first page result.
Fig. 4 shows a block diagram of the benchmark result generation unit in a preferred embodiment of the apparatus for detecting webpage feature regions according to the present invention. As shown in Fig. 4, the benchmark result generation unit 300 comprises:
A loading module 301, configured to perform multiple page loads. The loading module 301 performs advertisement filtering on every page load.
A difference search module 302, configured to compare the differences between the webpages obtained from the multiple page loads.
A content region generation module 303, configured to generate content logic regions from the differences between the webpages.
In a preferred embodiment, the benchmark result generation unit 300 further comprises:
A screenshot module 304, configured to take a screenshot of each loaded page. Specifically, the loading module 301 performs multiple page loads with advertisement filtering performed during loading, and the screenshot module 304 then takes a screenshot of each loaded page.
In this case the difference search module 302 is configured to obtain the differing pixels by line-by-line scanning. A rectangular region generation module 305 is configured to generate multiple rectangular regions from the differing pixels, so that the content region generation module 303 can merge them into content logic regions. Two differing pixels are assigned to the same rectangular region when the distance between them is within a certain range. In this embodiment, "within a certain range" means that dx^2 + dy^2 < 1000 is satisfied, in which case the two pixels are considered adjacent and the rectangular region generation module 305 assigns them to the same rectangular region, where dx denotes the horizontal distance and dy the vertical distance.
The content region generation module 303 is then configured to merge the adjacent rectangular regions generated by the rectangular region generation module 305, merging two rectangular regions into one content logic region when the distance between them is within a certain range. For example, when the distance between two rectangular regions satisfies dx^2 + dy^2 < 1000, they are considered adjacent and the content region generation module 303 assigns them to the same content logic region, where dx denotes the horizontal distance and dy the vertical distance.
A first page result determination module 306, configured to determine the page screenshot containing the content logic regions as the first page result.
A saving module (not shown) is also included, configured to save the first page result and the information on the content logic regions.
At an interval of about 3 to 7 days, the loading module 301 loads the page at the same URL twice in succession, and the screenshot module 304 takes a screenshot of each loaded page, giving two screenshots. The difference search module 302 then compares the two screenshots and records the differing pixels. The rectangular region generation module 305 generates multiple rectangular regions from the recorded differing pixels. The content region generation module 303 then merges adjacent rectangular regions into content logic regions. The first page result determination module 306 determines the page screenshot containing the content logic regions as the first page result. Finally, the saving module saves the first page result and the information on the content logic regions.
Of the present invention shown in Fig. 3 compares page generating unit 310, for, obtain the second page results of the page.
After reaching setting threshold time, perform the loading of the same URL page, in loading procedure, perform advertisement filter, the page results of formation, this time perform after page loads a period of time that needs to be separated by after benchmark result generation unit 300 generates first page result and perform again, such as interval 3-7 days.
The feature region determination unit 320 shown in Fig. 3 is configured to compare the second page result with the first page result and, if differing regions are found, to determine that those differing regions are the feature regions causing problems.
In a preferred embodiment, the feature region determination unit 320 compares the regions outside the content logic regions. The content logic regions generated by the content region generation module 303 are defined as the webpage body content regions, so the feature region determination unit 320 compares the regions outside the body content. If these regions differ at all, new content has appeared in them. These feature regions may be new advertising regions.
In a preferred embodiment, the content logic regions can be configured to display a first color, and the determined feature regions causing problems can be configured to display a second color, so that the feature regions are easy to identify.
It is worth noting that the feature region determination unit 320 may also compare the full page and simply ignore changes inside the content logic regions. Comparing only the regions outside the content logic regions (that is, outside the webpage body content), rather than the full-page screenshots, reduces the comparison workload, saves time, and makes the comparison faster.
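Restricting the comparison to the regions outside the content logic regions can be sketched as below. The data layout is an assumption made for illustration: screenshots are equal-sized nested lists of (R, G, B) tuples, and each content logic region is an inclusive (left, top, right, bottom) rectangle. The function name is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Rect = Tuple[int, int, int, int]   # (left, top, right, bottom), inclusive


def changed_outside_content(first: Image, second: Image,
                            content_regions: List[Rect]) -> List[Tuple[int, int]]:
    """Return (x, y) pixels that differ and lie outside every content logic region."""
    def in_content(x: int, y: int) -> bool:
        return any(left <= x <= right and top <= y <= bottom
                   for left, top, right, bottom in content_regions)

    return [(x, y)
            for y, (row_a, row_b) in enumerate(zip(first, second))
            for x, (pa, pb) in enumerate(zip(row_a, row_b))
            if pa != pb and not in_content(x, y)]
```

Any pixel returned here is a candidate for a feature region, since changes inside the body content have already been masked out.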
Fig. 5 shows a block diagram of the feature region determination unit in a preferred embodiment of the apparatus for detecting a webpage feature region of the present invention.
As shown in Fig. 5, the feature region determination unit 320 comprises a comparison module 321, an offset judgment module 322, an offset value calculation module 323, an alignment module 324, and a feature region determination module 325.
The comparison module 321 is configured to compare the second page result with the first page result.
The comparison module 321 compares the page regions outside the content logic regions generated by the content region generation module 303. The content logic regions generated by the content region generation module 303 are defined as the webpage body content regions, so the comparison module 321 compares the regions outside the body content. The comparison module 321 compares the screenshots by scanning them row by row. When the color feature values of a pair of rows are identical, the content of the two rows (for example, in an advertising region) is considered identical. If these regions differ at all, new content has appeared in them, and they are determined to be feature regions. These feature regions may be new advertising regions.
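A row-by-row scan of this kind can be sketched as follows. The sketch assumes screenshots are nested lists of (R, G, B) tuples and uses a position-weighted per-channel sum as the row's color feature value; grouping consecutive differing rows into bands is an illustrative choice, and the function names are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def row_feature(row: List[Pixel]) -> Tuple[int, int, int]:
    """Position-weighted sum of each channel (R, G, B) across one row."""
    return tuple(sum(p[c] * i for i, p in enumerate(row)) for c in range(3))


def differing_row_bands(first: Image, second: Image) -> List[Tuple[int, int]]:
    """Scan row by row and return (top, bottom) bands of consecutive differing rows."""
    bands = []
    start = None
    for j, (row_a, row_b) in enumerate(zip(first, second)):
        if row_feature(row_a) != row_feature(row_b):
            if start is None:
                start = j              # a new differing band begins here
        elif start is not None:
            bands.append((start, j - 1))
            start = None
    if start is not None:              # band runs to the last compared row
        bands.append((start, min(len(first), len(second)) - 1))
    return bands
```

Comparing one feature value per row instead of every pixel keeps the scan cheap, at the cost that two different rows can occasionally share a feature value.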
Because the network is unstable and the moment at which scripts in the page run is uncertain, a region of a page screenshot, or the whole screenshot, may be shifted downward or upward relative to the first page result. In that case a direct comparison will certainly report a difference, even though the layout of the whole webpage has not actually changed. Therefore, in a preferred embodiment, when the comparison module 321 finds a difference in the regions outside the content logic regions, specifically when the rows currently being compared differ, it is judged whether the page is offset. An offset judgment module 322 is provided to judge whether the page is offset.
In the present embodiment, the offset judgment module 322 judges whether the page is offset as follows:
First, a color feature value is calculated for each row of the region image being compared. The color feature value of row j is calculated as:
$$\mathrm{jrowColor} = \sum_{i=0}^{\mathrm{width}} \mathrm{color}(i, j) \cdot i$$
Here i is the column index and j is the row index; jrowColor is the color feature value of the entire row; color(i, j) gives the values of the three colors R, G, and B of the current pixel; and width is the maximum width of the current row.
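The row color feature value can be computed directly from the formula. The sketch below assumes a row is a list of (R, G, B) tuples and evaluates the weighted sum once per channel, which yields three feature values per row; the function name is hypothetical, and the sum runs over the pixels actually present in the row.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def row_color_feature(row: List[Pixel]) -> Tuple[int, int, int]:
    """Return (R, G, B) feature values: sum over columns i of color(i, j) * i."""
    red = sum(pixel[0] * i for i, pixel in enumerate(row))
    green = sum(pixel[1] * i for i, pixel in enumerate(row))
    blue = sum(pixel[2] * i for i, pixel in enumerate(row))
    return (red, green, blue)


# Worked example: three pixels at columns 0, 1, 2.
example_row = [(10, 0, 0), (20, 1, 0), (30, 2, 5)]
print(row_color_feature(example_row))  # (80, 5, 10)
```

Because each pixel is weighted by its column index, the same colors in a different horizontal order generally give a different feature value. The pixel in column 0 has weight 0, so it does not contribute to the sum.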
Then, looping from the first row of the page, check whether any other row has the same red, green, and blue color feature values as the current row. If such a row exists, continue comparing, one by one, the color feature values of each subsequent row within a set threshold range. If they are all equal, the page currently being compared is determined to be offset; in all other cases, no page offset is determined to have occurred.
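One possible reading of this offset check is sketched below. It is an interpretation under stated assumptions, not the definitive procedure: `row` is the index in the second screenshot at which the comparison is being made, `window` stands in for the "set threshold range" of following rows, and the function searches the first screenshot for a row whose feature value, and those of the next `window` rows, all match. The names `find_page_offset`, `row_feature`, and `window` are hypothetical.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def row_feature(row: List[Pixel]) -> Tuple[int, int, int]:
    """Position-weighted sum of each channel (R, G, B) across one row."""
    return tuple(sum(p[c] * i for i, p in enumerate(row)) for c in range(3))


def find_page_offset(first: Image, second: Image,
                     row: int, window: int = 3) -> Optional[int]:
    """Return how many rows `second` is shifted relative to `first`, or None.

    A positive result means the content sits lower in `second`.
    """
    ref = [row_feature(r) for r in first]
    cur = [row_feature(r) for r in second]
    if row + window >= len(cur):
        return None                      # not enough following rows to confirm
    for k in range(len(ref) - window):
        if k == row:
            continue                     # same position is not an offset
        # The candidate row and the next `window` rows must all match.
        if all(ref[k + t] == cur[row + t] for t in range(window + 1)):
            return row - k
    return None
```

Requiring a run of matching rows, rather than a single match, reduces false positives from rows that look alike, such as blank background rows, although long uniform areas can still match at several positions.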
The offset value calculation module 323 is configured to calculate the page offset value when the offset judgment module 322 judges that the page is offset. That is, it calculates the difference between the positions of the two offset rows, and this position difference is the page offset value.
The alignment module 324 is configured to align the pages according to the page offset value.
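Once the page offset value is known as the position difference between two matching rows, alignment amounts to pairing each row of the first screenshot with the row `offset` positions away in the second screenshot and comparing only those pairs. The sketch below assumes screenshots are lists of rows and that a positive offset means the content sits lower in the second screenshot; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def align_rows(first: Sequence, second: Sequence,
               offset: int) -> List[Tuple[int, int]]:
    """Pair row j of `first` with row j + offset of `second` where both exist."""
    return [(j, j + offset)
            for j in range(len(first))
            if 0 <= j + offset < len(second)]


def diff_after_alignment(first: Sequence, second: Sequence,
                         offset: int) -> List[int]:
    """Return indices of rows in `first` that still differ after alignment."""
    return [j for j, k in align_rows(first, second, offset)
            if first[j] != second[k]]
```

Rows that have no partner after the shift are left out of the comparison; whether they should instead count as differences is a design choice that this sketch does not settle.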
The feature region determination module 325 is configured to determine the differing regions that remain after page alignment as the feature regions of the webpage. The apparatus for detecting a webpage feature region of the present invention ignores changes in the body content and compares the regions other than the body content. If these regions change, they are judged to be feature regions, which are very likely new advertising regions. The cause may be that a failed advertisement filtering rule lets an advertisement that should have been filtered appear, or that a new advertisement not covered by the advertisement filtering rules has appeared. Therefore, by comparing the webpage with a reference webpage obtained while filtering was working normally, the present invention can quickly detect the feature regions (advertising regions) in the webpage and quickly discover problems, providing a reference basis for subsequent webpage filtering so that the filtering rules can be adjusted to achieve a better filtering effect.
Those of ordinary skill in the art will recognize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered to go beyond the scope of the present invention.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through certain interfaces, and the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a portable hard drive, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for detecting a webpage feature region, comprising:
generating a first page result of a page under the condition that filtering is working normally;
after a set threshold time, obtaining a second page result of the page;
comparing the second page result with the first page result and, if differing regions are found, determining that the differing regions are the feature regions causing problems.
2. The method for detecting a webpage feature region according to claim 1, wherein:
generating the first page result of the page under the condition that filtering is working normally comprises: generating a first page result of the page, under the condition that filtering is working normally, in which content logic regions are marked off, wherein the content logic regions are generated by performing multiple page loads, comparing the differences between the loaded webpages, and merging them;
comparing the second page result with the first page result comprises:
comparing the regions of the second page result and the first page result other than the content logic regions.
3. The method for detecting a webpage feature region according to claim 2, wherein performing multiple page loads, comparing the differences between the loaded webpages, and merging them to generate the content logic regions comprises:
taking a screenshot of each loaded page, comparing the differences between the screenshots, and recording the differing pixels;
generating, according to the differing pixels, multiple rectangular regions enclosing the differing pixels;
merging adjacent rectangular regions into content logic regions.
4. The method for detecting a webpage feature region according to any one of claims 1 to 3, wherein comparing the second page result with the first page result comprises:
judging whether the page is offset;
if the page is offset, calculating a page offset value;
aligning the pages according to the page offset value and then comparing them again.
5. The method for detecting a webpage feature region according to claim 4, wherein judging whether the page is offset comprises:
looping from the first row of the page, checking whether any other row has the same red, green, and blue color feature values as the current row; if such a row exists, continuing to compare, one by one, whether the color feature values of each subsequent row within a set threshold range are equal; if they are all equal, determining that the page currently being compared is offset; in all other cases, determining that no page offset has occurred;
wherein calculating the page offset value comprises:
calculating the difference between the positions of the two offset rows, the position difference being the page offset value.
6. The method for detecting a webpage feature region according to claim 2 or 3, wherein the content logic regions are configured to display a first color, and the determined feature regions causing problems are configured to display a second color.
7. An apparatus for detecting a webpage feature region, comprising:
a benchmark page generation unit, configured to generate a first page result of a page under the condition that filtering is working normally;
a comparison page generation unit, configured to obtain a second page result of the page after a set threshold time;
a feature region determination unit, configured to compare the second page result with the first page result and, if differing regions are found, to determine that the differing regions are the feature regions causing problems.
8. The apparatus for detecting a webpage feature region according to claim 7, wherein the benchmark result generation unit comprises:
a loading module, configured to perform multiple page loads;
a difference search module, configured to compare the differences between the webpages obtained by the multiple page loads;
a content region generation module, configured to generate content logic regions from the differences between the webpages.
9. The apparatus for detecting a webpage feature region according to claim 8, wherein the benchmark result generation unit further comprises:
a screenshot module, configured to take a screenshot of each loaded page;
a rectangular region generation module, configured to generate multiple rectangular regions according to the differing pixels, so that the content region generation module merges the multiple rectangular regions into content logic regions.
10. The apparatus for detecting a webpage feature region according to claim 9, wherein the feature region determination unit comprises:
a comparison module, configured to compare the second page result with the first page result;
an offset judgment module, configured to judge, when the first page screenshot and the second page screenshot are compared and the rows currently being compared differ, whether the page is offset;
an offset value calculation module, configured to calculate a page offset value when the offset judgment module judges that the page is offset;
an alignment module, configured to align the pages according to the page offset value;
a feature region determination module, configured to determine the differing regions that remain after page alignment as the feature regions of the webpage.
CN201410245946.4A 2014-06-04 2014-06-04 Method and device for detecting webpage feature area Active CN105243062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410245946.4A CN105243062B (en) 2014-06-04 2014-06-04 Method and device for detecting webpage feature area

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410245946.4A CN105243062B (en) 2014-06-04 2014-06-04 Method and device for detecting webpage feature area

Publications (2)

Publication Number Publication Date
CN105243062A true CN105243062A (en) 2016-01-13
CN105243062B CN105243062B (en) 2020-10-30

Family

ID=55040714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410245946.4A Active CN105243062B (en) 2014-06-04 2014-06-04 Method and device for detecting webpage feature area

Country Status (1)

Country Link
CN (1) CN105243062B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193956A (en) * 2017-05-23 2017-09-22 深圳天珑无线科技有限公司 Page processing method and device
CN110134904A (en) * 2019-05-21 2019-08-16 腾讯科技(上海)有限公司 A kind of page check method, apparatus, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060235960A1 (en) * 2004-11-23 2006-10-19 Inventec Appliances Corporation Method for blocking network advertising
CN102999636A (en) * 2012-12-19 2013-03-27 北京奇虎科技有限公司 Method and browser for carrying out interception treatment on popup window in webpage
CN103092999A (en) * 2013-02-22 2013-05-08 人民搜索网络股份公司 Webpage crawling cycle adjusting method and device
CN103530560A (en) * 2013-09-29 2014-01-22 北京金山网络科技有限公司 Method, device and client side for advertisement blocking
CN103699665A (en) * 2013-12-27 2014-04-02 贝壳网际(北京)安全技术有限公司 Method and device for filtering web page advertisements

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193956A (en) * 2017-05-23 2017-09-22 深圳天珑无线科技有限公司 Page processing method and device
CN110134904A (en) * 2019-05-21 2019-08-16 腾讯科技(上海)有限公司 A kind of page check method, apparatus, equipment and medium
CN110134904B (en) * 2019-05-21 2022-11-29 腾讯科技(上海)有限公司 Page checking method, device, equipment and medium

Also Published As

Publication number Publication date
CN105243062B (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN102750535B (en) Method and system for automatically extracting image foreground
US9424479B2 (en) Systems and methods for resizing an image
US10748023B2 (en) Region-of-interest detection apparatus, region-of-interest detection method, and recording medium
CN111124888B (en) Method and device for generating recording script and electronic device
US20190238884A1 (en) Determining variance of a block of an image based on a motion vector for the block
US9477885B2 (en) Image processing apparatus, image processing method and image processing program
CN104978733B (en) Smog detection method and device
CN104835134B (en) A kind of method and apparatus for calculating commodity image psoriasis score value
CN106775747A (en) A kind of method and apparatus of color configuration
CN110119675B (en) Product identification method and device
US9131097B2 (en) Method and system for black bar identification
US11728914B2 (en) Detection device, detection method, and program
US9286660B2 (en) Filtering method and device in image processing
Hashim et al. Development of tomato inspection and grading system using image processing
CN105243062A (en) Webpage feature region detection method and apparatus
CN103955330A (en) Information displaying method and device
CN115240197A (en) Image quality evaluation method, image quality evaluation device, electronic apparatus, scanning pen, and storage medium
CN111757182B (en) Image splash screen detection method, device, computer device and readable storage medium
EP2735997B1 (en) Image processing apparatus
CN105446968B (en) A kind of method and apparatus detecting web page characteristics region
CN109685079B (en) Method and device for generating characteristic image category information
KR102413043B1 (en) Method and apparatus for seperating shot of moving picture content
EP2883205B1 (en) Method and apparatus to detect artificial edges in images
KR102339342B1 (en) Method and system for detecting wave overtopping
JP2013178732A (en) Image processing device and image processing method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200526

Address after: 310051 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Alibaba (China) Co.,Ltd.

Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping B radio 14 floor tower square

Applicant before: GUANGZHOU UCWEB COMPUTER TECHNOLOGY Co.,Ltd.

GR01 Patent grant