CN105653550B - Webpage filtering method and device - Google Patents
- Publication number: CN105653550B (application CN201410648193.1A)
- Authority: CN (China)
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a webpage filtering method and device, belonging to the field of Internet technology. The method includes: obtaining a webpage collection to be analyzed, where the webpage collection includes multiple webpages and each webpage includes multiple nodes; for each node in each webpage, calculating a possibility characteristic value of the node, where the possibility characteristic value indicates how likely the node is to be a specified-type node; determining each node whose possibility characteristic value is greater than a specified threshold as a specified-type node; and filtering a webpage to be displayed based on the determined specified-type nodes. By calculating the possibility characteristic value of each node in each webpage of the webpage collection and treating each node whose value exceeds the specified threshold as a specified-type node, the invention can filter the webpage to be displayed directly from the determined specified-type nodes, without manually configuring a filtering template. The operation is simple and fast, and saves time and labor costs.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a webpage filtering method and device.
Background art
With the spread of the Internet, many manufacturers place advertisements in webpages to promote their products. As a result, webpages contain all kinds of advertisements, which seriously interfere with normal browsing.
To filter out the advertisements in webpages, website operators can manually configure a filtering template according to the advertisements in each webpage and upload it to the website server, which then filters webpages according to the filtering template. The filtering template can be a blacklist or a whitelist. When the filtering template is a blacklist, the website server extracts the webpage content that matches the filtering template and filters out the extracted content. When the filtering template is a whitelist, the website server extracts the webpage content that matches the filtering template and filters out all other content in the webpage.
In the course of implementing the present invention, the inventor found that the prior art has at least the following deficiency: configuring filtering templates for a massive number of webpages consumes excessive labor.
Summary of the invention
To solve the problems in the prior art, embodiments of the present invention provide a webpage filtering method and device. The technical solution is as follows:
In a first aspect, a webpage filtering method is provided, the method comprising:
obtaining a webpage collection to be analyzed, the webpage collection including multiple webpages and each webpage including multiple nodes;
for each node in each webpage, calculating a possibility characteristic value of the node, the possibility characteristic value indicating how likely the node is to be a specified-type node;
determining each node whose possibility characteristic value is greater than a specified threshold as a specified-type node; and
filtering a webpage to be displayed based on the determined specified-type nodes.
In a second aspect, a webpage filtering device is provided, the device comprising:
a webpage collection obtaining module, configured to obtain a webpage collection to be analyzed, the webpage collection including multiple webpages and each webpage including multiple nodes;
a computing module, configured to calculate, for each node in each webpage, a possibility characteristic value of the node, the possibility characteristic value indicating how likely the node is to be a specified-type node;
a specified-type node determining module, configured to determine each node whose possibility characteristic value is greater than a specified threshold as a specified-type node; and
a filtering module, configured to filter a webpage to be displayed based on the determined specified-type nodes.
The technical solution provided by the embodiments of the present invention has the following beneficial effects:
The method and device provided by the embodiments of the present invention calculate the possibility characteristic value of each node in each webpage of the webpage collection and treat each node whose possibility characteristic value is greater than the specified threshold as a specified-type node. The webpage to be displayed can then be filtered directly based on the determined specified-type nodes, with no need to configure a filtering template manually. The operation is simple and fast, and saves time and labor costs.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a webpage filtering method provided by an embodiment of the present invention;
Fig. 2 is a flow chart of a webpage filtering method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of webpages provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of a specified tree structure provided by an embodiment of the present invention;
Fig. 5 is a flow chart of calculating a possibility characteristic value provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a webpage filtering device provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a server provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Fig. 1 is a flow chart of a webpage filtering method provided by an embodiment of the present invention. The method of this embodiment is executed by a server. Referring to Fig. 1, the method includes:
101. Obtain a webpage collection to be analyzed, where the webpage collection includes multiple webpages and each webpage includes multiple nodes.
102. For each node in each webpage, calculate a possibility characteristic value of the node, where the possibility characteristic value indicates how likely the node is to be a specified-type node.
103. Determine each node whose possibility characteristic value is greater than a specified threshold as a specified-type node.
104. Filter a webpage to be displayed based on the determined specified-type nodes.
The method provided by this embodiment calculates the possibility characteristic value of each node in each webpage of the webpage collection and treats each node whose possibility characteristic value is greater than the specified threshold as a specified-type node. The webpage to be displayed can then be filtered directly based on the determined specified-type nodes, with no need to configure a filtering template manually. The operation is simple and fast, and saves time and labor costs.
Optionally, calculating the possibility characteristic value of the node for each node in each webpage includes:
calculating, according to the content of each node, the similarity between the node and each node in the other webpages of the webpage collection, that is, the webpages other than the one containing the node; and
aggregating statistics over the similarities between the node and each node in the other webpages to obtain the possibility characteristic value of the node.
Optionally, the method further includes:
grouping the nodes in the multiple webpages according to the position of each node in its webpage to obtain multiple node sets, where the nodes in each node set are located at the same position in different webpages.
Optionally, calculating the possibility characteristic value of the node for each node in each webpage includes:
for each node in each node set, calculating, according to the content of each node, the similarity between the node and the other nodes in the node set; and
aggregating statistics over the similarities between the node and the other nodes in the node set to obtain the possibility characteristic value of the node.
Optionally, obtaining the webpage collection to be analyzed includes:
obtaining multiple webpages generated within a specified duration before the current point in time; and
grouping the multiple webpages to obtain multiple webpage collections.
Optionally, grouping the multiple webpages to obtain multiple webpage collections includes:
grouping the multiple webpages according to the publishing account of each webpage to obtain multiple webpage collections; or
grouping the multiple webpages according to the storage directory of each webpage to obtain multiple webpage collections; or
grouping the multiple webpages according to the subdomain name of each webpage to obtain multiple webpage collections.
Optionally, filtering the webpage to be displayed based on the determined specified-type nodes includes:
outputting the determined specified-type nodes to a blacklist template configuration file;
when a filtered-webpage display request is received, obtaining the original webpage corresponding to the request; and
filtering the original webpage based on the blacklist template configuration file, so as to filter out the specified-type nodes included in the original webpage.
Optionally, filtering the webpage to be displayed based on the determined specified-type nodes includes:
outputting the nodes in the multiple webpages other than the specified-type nodes to a whitelist template configuration file;
when a filtered-webpage display request is received, obtaining the original webpage corresponding to the request; and
filtering the original webpage based on the whitelist template configuration file, so as to filter out the specified-type nodes included in the original webpage.
All of the optional technical solutions above can be combined in any way to form optional embodiments of the present invention, which are not described one by one here.
Fig. 2 is a flow chart of a webpage filtering method provided by an embodiment of the present invention. The method of this embodiment is executed by a server. Referring to Fig. 2, the method includes:
201. The server groups multiple webpages to be analyzed to obtain multiple webpage collections.
In this embodiment, the server provides webpages to a terminal, which may be a fixed terminal or a mobile terminal, such as a computer or a mobile phone. When a user wishes to browse a webpage, the user can trigger a webpage access operation on the terminal. When the terminal detects the operation, it sends a webpage display request carrying the webpage address to the server. On receiving the request, the server obtains the original webpage corresponding to the request according to the webpage address. If the terminal is a fixed terminal, the server sends the original webpage to it, and the fixed terminal displays the original webpage. If the terminal is a mobile terminal, the server transcodes the original webpage and sends the transcoded webpage to the mobile terminal, which displays it.
In practice, the original webpage may include content such as advertisements, operating instructions, recommended information, and junk information. This content is unrelated to the content of the webpage itself yet easily interferes with browsing, and many users want it filtered out when they browse. To meet this need, the server can determine the content to be filtered in a webpage to be displayed each time before sending that webpage to the terminal, and then filter the webpage. To make that determination easier, the server can train on multiple webpages and identify the content to be filtered in each of them.
Further, to improve training accuracy, the server can group the webpages into multiple webpage collections and train on each collection separately. Specifically, the server can group all webpages; or select multiple sample webpages from all webpages and group those; or obtain a snapshot of each webpage and group the snapshots. The embodiments of the present invention do not limit this.
Optionally, the server groups the webpages according to a specified rule to obtain multiple webpage collections. The specified rule can be based on the publishing account, the storage directory, or the subdomain name of a webpage, among others; the embodiments of the present invention do not limit this. The server holds webpages published by multiple accounts. When the specified rule is the publishing account, the server groups the webpages by the publishing account of each webpage: webpages in the same collection share a publishing account, and webpages in different collections have different publishing accounts. The server stores webpages in different storage directories. When the specified rule is the storage directory, the server groups the webpages by the storage directory of each webpage: webpages in the same collection share a storage directory, and webpages in different collections have different storage directories. The server generates a webpage address for each webpage, and the address includes a subdomain name. When the specified rule is the subdomain name, the server groups the webpages by the subdomain name of each webpage: webpages in the same collection share a subdomain name, and webpages in different collections have different subdomain names. In practice, the server can also group the webpages by other specified rules; the embodiments of the present invention do not limit this.
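The grouping rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page records, their `url` and `account` fields, and the helper names `group_pages` and `subdomain_of` are hypothetical and do not come from the patent, and the full host name stands in for the subdomain name.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def subdomain_of(url):
    """Return the host part of a URL, used here as the subdomain grouping key."""
    return urlsplit(url).hostname or ""


def group_pages(pages, key):
    """Group page records into collections whose members share one key value."""
    collections = defaultdict(list)
    for page in pages:
        collections[key(page)].append(page)
    return dict(collections)


# Hypothetical page records; the field names are assumptions for illustration.
pages = [
    {"url": "http://a.example.com/1.html", "account": "alice"},
    {"url": "http://a.example.com/2.html", "account": "bob"},
    {"url": "http://b.example.com/3.html", "account": "alice"},
]

by_account = group_pages(pages, key=lambda p: p["account"])
by_subdomain = group_pages(pages, key=lambda p: subdomain_of(p["url"]))
```

Passing the rule as a `key` function keeps the grouping step the same for any rule; a storage-directory rule would only need a different key function.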
In this embodiment, different webpage collections belong to different groups. Later, when the server obtains a webpage to be displayed, it can classify that webpage according to the specified rule, determine the webpage collection that belongs to the same group as the webpage, and then determine the content to be filtered in the webpage according to the training result for that collection. For example, when the server obtains a webpage to be displayed, it obtains the publishing account of the webpage and determines the webpage collection corresponding to that account, which is the collection in the same group as the webpage to be displayed.
202. For each node in each webpage of each webpage collection, the server calculates a possibility characteristic value of the node, where the possibility characteristic value indicates how likely the node is to be a specified-type node.
The server can divide a webpage into multiple nodes, which may include nodes of several formats such as text nodes, picture nodes, video nodes, and webpage link address nodes. Specifically, the server can divide the text content of the webpage into multiple text nodes by paragraph, treat each picture in the webpage as a picture node, treat each video as a video node, and treat each webpage link address as a webpage link address node. The embodiments of the present invention do not limit how the server divides nodes.
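One possible way to divide a page into such nodes is sketched below with Python's standard `html.parser` module. The mapping of `<p>`, `<img>`, `<video>`, and `<a>` tags to node kinds is an assumption made for illustration; the patent does not prescribe particular tags, and `NodeSplitter` and `split_into_nodes` are hypothetical names.

```python
from html.parser import HTMLParser


class NodeSplitter(HTMLParser):
    """Collect (kind, content) nodes: one text node per <p>, plus media and links."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._in_paragraph = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self._in_paragraph, self._text = True, []
        elif tag == "img" and attrs.get("src"):
            self.nodes.append(("picture", attrs["src"]))
        elif tag == "video" and attrs.get("src"):
            self.nodes.append(("video", attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self.nodes.append(("link", attrs["href"]))

    def handle_data(self, data):
        # Only text inside a paragraph contributes to a text node.
        if self._in_paragraph:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            text = "".join(self._text).strip()
            if text:
                self.nodes.append(("text", text))
            self._in_paragraph = False


def split_into_nodes(html):
    """Return the list of (kind, content) nodes found in an HTML string."""
    parser = NodeSplitter()
    parser.feed(html)
    parser.close()
    return parser.nodes
```

For example, `split_into_nodes('<p>Hello</p><img src="ad.png">')` yields one text node followed by one picture node.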
The content of some nodes is the content of the webpage itself, while the content of other nodes is unrelated to the content of the webpage. A node whose content is unrelated to the content of the webpage it appears in is treated as a specified-type node, so the specified-type nodes are the nodes to be filtered out of the webpage.
For each webpage collection, in order to filter out the specified-type nodes in its webpages, the server analyzes each webpage in the collection and finds the nodes most likely to be specified-type nodes. Specifically, for each node in each webpage of the collection, the server calculates a possibility characteristic value of the node, which indicates how likely the node is to be a specified-type node. That is, the larger the possibility characteristic value of a node, the more likely the node is a specified-type node; the smaller the value, the less likely.
In practice, the specified-type nodes included in different webpages of the same webpage collection often have identical or similar content. For example, Fig. 3 is a schematic diagram of webpages provided by an embodiment of the present invention. It shows two webpages published by the same account that contain two different articles, "article 1" and "article 2", while the top and bottom of both webpages contain nodes with identical content. Those nodes with identical content are likely to be specified-type nodes.
Based on this characteristic, for each node, the more nodes in the webpage collection are similar to it, the more likely it is considered to be a specified-type node; the fewer nodes in the webpage collection are similar to it, the less likely it is considered to be a specified-type node.
To this end, for each node in each webpage, the server can calculate, according to the content of each node, the similarity between the node and each node in the other webpages of the collection, that is, the webpages other than the one containing the node. This yields multiple similarities between the node and multiple other nodes. The server aggregates statistics over the calculated similarities to obtain the possibility characteristic value of the node, which can indicate how likely the node is to be a specified-type node. When aggregating, the server can compute, for example, the sum or the average of the similarities and use it as the possibility characteristic value of the node; the embodiments of the present invention do not limit this.
Referring to Table 1, the webpage collection includes webpage A and webpage B, where webpage A includes node 1 and webpage B includes node 2 and node 3. For node 1, the server calculates the similarity between node 1 and node 2 and the similarity between node 1 and node 3, and uses the average of the two similarities as the possibility characteristic value of node 1.
Table 1
Webpage | Node |
Webpage A | Node 1 |
Webpage B | Node 2 and node 3 |
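The Table 1 calculation can be sketched as follows. The patent does not fix a similarity measure at this point, so the word-set Jaccard similarity used here is an assumption chosen only because it is easy to check by hand; `word_jaccard`, `possibility_value`, and the sample node texts are hypothetical.

```python
def word_jaccard(a, b):
    """Similarity of two text nodes as the Jaccard index of their word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def possibility_value(node, page_index, collection, use_sum=False):
    """Aggregate the similarities between `node` and every node in the other pages.

    `collection` is a list of pages and each page is a list of node texts.
    The aggregate is the average by default, or the sum when `use_sum` is True.
    """
    similarities = [
        word_jaccard(node, other)
        for index, page in enumerate(collection)
        if index != page_index
        for other in page
    ]
    if not similarities:
        return 0.0
    total = sum(similarities)
    return total if use_sum else total / len(similarities)


# Table 1 layout: webpage A holds node 1, webpage B holds node 2 and node 3.
webpage_a = ["buy cheap watches"]                              # node 1
webpage_b = ["buy cheap watches", "rockets burn liquid fuel"]  # node 2, node 3
collection = [webpage_a, webpage_b]

value_node_1 = possibility_value(webpage_a[0], 0, collection)
```

With these sample texts, node 1 matches node 2 exactly and shares no words with node 3, so the two similarities are 1 and 0 and their average is 0.5.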
Further, for text nodes, the server can preset a correspondence between node content and characteristic values, for example a characteristic value for each word in a text node. According to this correspondence, the server determines the multiple characteristic values corresponding to each text node and assembles them into a feature vector, obtaining the feature vector of each text node. For picture nodes or webpage link address nodes, the server can preset a correspondence between URLs (Uniform Resource Locators) and feature vectors; the server then obtains the URL of each picture node or webpage link address node and determines its feature vector according to that correspondence. For each node in each webpage, the server can calculate the similarity between the feature vector of the node and the feature vector of each node in the other webpages, obtaining multiple similarities. The similarity can be, for example, the cosine similarity or a Euclidean-distance similarity between the feature vectors; the embodiments of the present invention do not limit this.
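A minimal sketch of the feature-vector comparison for text nodes follows. It assumes one particular correspondence, a bag-of-words count vector over a preset vocabulary, which is an illustrative choice and not necessarily the correspondence the patent intends; the vocabulary and function names are hypothetical.

```python
import math


def text_vector(text, vocabulary):
    """Build a word-count feature vector using a preset word -> index mapping."""
    vector = [0.0] * len(vocabulary)
    for word in text.lower().split():
        index = vocabulary.get(word)
        if index is not None:  # words outside the preset vocabulary are ignored
            vector[index] += 1.0
    return vector


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical preset vocabulary for illustration.
vocabulary = {"buy": 0, "cheap": 1, "watches": 2, "now": 3}

u = text_vector("Buy cheap watches", vocabulary)  # [1, 1, 1, 0]
v = text_vector("buy watches now", vocabulary)    # [1, 0, 1, 1]
score = cosine_similarity(u, v)                   # 2 / (sqrt(3) * sqrt(3))
```

A picture or link node could be handled the same way once its URL has been mapped to a vector through whatever preset correspondence the server maintains.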
In practice, the specified-type nodes included in different webpages of the same webpage collection often occupy the same or similar positions in their webpages. For example, a website server may add an advertisement node in the lower right corner of each webpage of the website. Based on this characteristic, to reduce the amount of calculation, the server can calculate for each node only its similarity to the nodes at the same position in the other webpages. Specifically, the server groups the nodes in the multiple webpages according to the position of each node in its webpage, obtaining multiple node sets in which the nodes of each set are located at the same position in different webpages. Then, for each node in each node set, the server calculates, according to the content of each node, the similarity between the node and the other nodes in the node set, and aggregates statistics over those similarities to obtain the possibility characteristic value of the node.
Continuing the example of Table 1, assume that the position of node 3 in webpage B is the same as the position of node 1 in webpage A. The server then calculates the similarity between node 1 and node 3 and uses it as the possibility characteristic value of node 1.
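The position-based variant can be sketched as below. It assumes each page is represented as a mapping from a position label to node content; the labels, the `exact_match` similarity, and the function names are hypothetical stand-ins, since the patent does not specify how a position is encoded.

```python
from collections import defaultdict


def exact_match(a, b):
    """Simplest possible similarity: 1.0 for identical content, else 0.0."""
    return 1.0 if a == b else 0.0


def group_by_position(collection):
    """Build node sets: position -> list of (page_id, content) across pages."""
    node_sets = defaultdict(list)
    for page_id, page in enumerate(collection):
        for position, content in page.items():
            node_sets[position].append((page_id, content))
    return dict(node_sets)


def possibility_values(collection, similarity=exact_match):
    """Average each node's similarity to same-position nodes in other pages."""
    values = {}
    for position, members in group_by_position(collection).items():
        for page_id, content in members:
            sims = [similarity(content, other)
                    for other_id, other in members if other_id != page_id]
            values[(page_id, position)] = sum(sims) / len(sims) if sims else 0.0
    return values


# Two hypothetical pages: the same banner on top, different article bodies.
collection = [
    {"top": "banner ad", "body": "article about cats"},
    {"top": "banner ad", "body": "article about dogs"},
]
values = possibility_values(collection)
```

Compared with comparing every node against every node in the other pages, each node here is compared only with the members of its own node set, which is the reduction in calculation the paragraph above describes.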
Optionally, the server can analyze each webpage in the webpage collection and build a specified tree structure for each webpage. The specified tree structure includes multiple nodes, and the server can calculate the possibility characteristic value of each node based on it. The specified tree structure can be a DOM (Document Object Model) tree structure or another tree structure; the embodiments of the present invention do not limit this.
In the specified tree structure, the nodes have a hierarchical relationship: each node has one upper-layer (parent) node and may have multiple lower-layer (child) nodes. For example, a paragraph text node in a webpage may include multiple single-line text nodes.
Take calculating the possibility characteristic value of a first node of a first webpage as an example, where a second webpage is any webpage in the webpage collection other than the first webpage. For a node in the second webpage, when the first node is similar to that node, it may also be similar to that node's parent. In that case, to improve the accuracy of the possibility characteristic value, the server can select the largest node that is similar to the first node and use the similarity between the first node and that largest node when calculating the possibility characteristic value.
Fig. 4 is a schematic diagram of the specified tree structure of the second webpage provided by an embodiment of the present invention, and Fig. 5 is a flow chart of calculating a possibility characteristic value provided by an embodiment of the present invention. Referring to Fig. 4 and Fig. 5, when the server calculates the possibility characteristic value of the first node, it can execute the following steps (1)-(9):
(1) The server selects a lowest-layer node 111 in the specified tree structure of the second webpage.
(2) The server calculates a first similarity between the first node and node 111 and judges whether the first similarity is greater than a first threshold. If so, it executes step (4); if not, it executes step (3).
In this embodiment, a first similarity greater than the first threshold indicates that the first node is similar to node 111, and a first similarity not greater than the first threshold indicates that the first node is not similar to node 111. The first threshold can be predefined by a technician, or determined by the server from statistics over the similarities between the first node and each lowest-layer node; the embodiments of the present invention do not limit this.
(3) The server selects another lowest-layer node 112 and continues with step (2), until every lowest-layer node has been selected.
(4) The server selects node 11, which is one layer above node 111.
(5) The server calculates a second similarity between the first node and node 11 and judges whether the second similarity is greater than the first threshold. If so, it executes step (8); if not, it executes step (6).
(6) The server takes the first similarity as a similarity to be counted.
When the first similarity is greater than the first threshold and the second similarity is not, the first node is similar to node 111 but not to node 11. The server therefore selects the first similarity as a similarity to be used later when aggregating the possibility characteristic value of the first node.
(7) From the lowest-layer nodes of the specified tree structure, the server selects a node 121 that is on a different branch from node 11 and continues with step (2).
(8) The server selects node 1, which is one layer above node 11, and continues with step (5), until the top-layer node has been selected.
(9) For each webpage in the webpage collection other than the first webpage, the server repeats the steps above. Once it has obtained the similarity to be counted for each webpage, it aggregates statistics over the obtained similarities to obtain the possibility characteristic value of the first node.
Steps (1)-(9) above are only an illustrative example of how the server calculates the possibility characteristic value. In practice, the server can also use other ways to determine the largest node in each webpage that is similar to the first node and obtain the similarity to be counted for each webpage, so as to calculate the possibility characteristic value; the embodiments of the present invention do not limit this.
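The climb from a lowest-layer node to the largest similar node in steps (1)-(8) can be sketched as below for a single second webpage. This is one reading of the steps under stated assumptions: each tree node carries its text as `content`, similarity is a word-set Jaccard index, and leaves under an already matched node are skipped. `TreeNode`, `similarities_to_count`, and the sample tree are hypothetical.

```python
class TreeNode:
    """A node of the specified tree structure; `content` is its text."""

    def __init__(self, content, children=()):
        self.content = content
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def leaves(node):
    """Yield the lowest-layer nodes below `node`, left to right."""
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


def word_jaccard(a, b):
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def is_covered(node, covered):
    """True if `node` or one of its ancestors was already matched."""
    while node is not None:
        if id(node) in covered:
            return True
        node = node.parent
    return False


def similarities_to_count(first_content, root, threshold, similarity=word_jaccard):
    """Return the similarity of each largest node that is similar to the first node."""
    counted, covered = [], set()
    for leaf in leaves(root):
        if is_covered(leaf, covered):
            continue  # step (7): move on to a different branch
        best, best_sim = leaf, similarity(first_content, leaf.content)
        if best_sim <= threshold:
            continue  # step (3): not similar, try the next lowest-layer node
        while best.parent is not None:  # steps (4), (5), (8): climb one layer
            parent_sim = similarity(first_content, best.parent.content)
            if parent_sim <= threshold:
                break  # step (6): keep the similarity of the current node
            best, best_sim = best.parent, parent_sim
        counted.append(best_sim)
        covered.add(id(best))
    return counted


# A small tree shaped like the example: node 1 -> node 11 -> nodes 111 and 112.
node_11 = TreeNode("buy cheap watches", [TreeNode("buy cheap"), TreeNode("watches")])
node_12 = TreeNode("cats sleep", [TreeNode("cats sleep")])
node_1 = TreeNode("buy cheap watches cats sleep", [node_11, node_12])
```

Step (9) would then run this function against every other webpage in the collection and aggregate the returned similarities, for example by averaging them.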
203. The server determines each node in the webpage collection whose possibility characteristic value is greater than a specified threshold as a specified-type node.
The specified threshold can be obtained by the server by analyzing the possibility characteristic value of each node and the number of nodes in the webpage collection. The specified thresholds of different webpage collections may be the same or different; the embodiments of the present invention do not limit this.
In this embodiment, a node whose possibility characteristic value is greater than the specified threshold is considered similar to many nodes of the other webpages in the collection, that is, the node appears "frequently" in the collection, so it is treated as a specified-type node. A node whose possibility characteristic value is not greater than the specified threshold is similar to few nodes of the other webpages in the collection, that is, it does not appear "frequently" in the collection, so it is not a specified-type node.
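The thresholding in step 203 reduces to a strict comparison, sketched below. The node identifiers and values are made up for illustration, and the threshold is passed in as a plain parameter because the patent leaves open how the server derives it.

```python
def specified_type_nodes(possibility_values, threshold):
    """Return the nodes whose possibility characteristic value exceeds the threshold."""
    return {node for node, value in possibility_values.items() if value > threshold}


# Hypothetical values for three nodes of one webpage collection.
values = {"top banner": 0.9, "article body": 0.1, "sidebar": 0.5}
selected = specified_type_nodes(values, threshold=0.5)
```

Note the strict inequality: a node whose value equals the threshold, like the sidebar here, is not selected.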
204. The server filters a webpage to be displayed based on the determined specified-type nodes.
Once the server has determined the specified-type nodes in a webpage collection, it can filter any webpage to be displayed that belongs to the same group as the collection and filter out the specified-type nodes in that webpage. Specifically, the server generates a template configuration file according to the determined specified-type nodes and later filters webpages to be displayed based on that template configuration file.
In this embodiment, when a user wishes to browse a webpage with the specified-type nodes filtered out, the user can trigger an operation on the terminal to access the filtered webpage. When the terminal detects the operation, it sends a filtered-webpage display request carrying the webpage address to the server. On receiving the request, the server obtains the corresponding original webpage according to the webpage address, determines according to the specified rule the webpage collection that belongs to the same group as the original webpage, and obtains the template configuration file corresponding to that collection. Based on the template configuration file, it filters the original webpage, removing the specified-type nodes included in it, and sends the filtered webpage to the terminal, which displays it on receipt. The filtered webpage includes the content of the webpage itself and excludes the specified-type nodes unrelated to that content, so the user is not disturbed by specified-type nodes while browsing and gets a cleaner browsing experience.
The template configuration file can be a whitelist or a blacklist. Accordingly, step 204 may include either of the following steps 204a and 204b:
204a. The server outputs the determined specified-type nodes to a blacklist template configuration file. When it receives a filtered-webpage display request, it obtains the original webpage corresponding to the request and filters the original webpage based on the blacklist template configuration file, so as to filter out the specified-type nodes included in the original webpage.
The server can generate a blacklist template profile for the collection of web pages, output the determined specified type nodes to it, and save it. The nodes in the blacklist template profile are then the specified type nodes that should be filtered out. When the server receives a webpage filtering display request sent by a terminal, it obtains the corresponding original web page and, based on the blacklist template profile, filters out of the original web page the nodes that the blacklist template profile includes, thereby filtering out the specified type nodes included in the original web page.
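As an illustration of the blacklist case, the following is a minimal sketch and not the patented implementation. It assumes a node can be represented as a (position, content) pair and that the blacklist template profile is simply a set of such pairs; the names `Node` and `filter_with_blacklist` are hypothetical.

```python
from typing import NamedTuple


class Node(NamedTuple):
    # Hypothetical node representation, assumed for illustration only.
    position: str  # e.g. a path locating the node inside the web page
    content: str   # text, picture URL, video URL, or link address


def filter_with_blacklist(page, blacklist):
    """Keep only the nodes that the blacklist template profile does not list."""
    return [node for node in page if node not in blacklist]


# Blacklist template profile: specified type nodes learned from the collection.
blacklist = {Node("/body/div[3]", "Follow us for more")}

original_page = [
    Node("/body/div[1]", "Article title"),
    Node("/body/div[2]", "Article body"),
    Node("/body/div[3]", "Follow us for more"),
]

filtered_page = filter_with_blacklist(original_page, blacklist)
```

Storing the profile as a set keeps the per-node lookup cheap, which matters because the filter runs on every display request.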
204b. The server outputs the nodes in the multiple web pages other than the specified type nodes to a whitelist template profile. When a webpage filtering display request is received, the server obtains the original web page corresponding to the request and filters the original web page based on the whitelist template profile, so as to filter out the specified type nodes included in the original web page.
The server can generate a whitelist template profile for the collection of web pages, output the nodes in the multiple web pages other than the specified type nodes to it, and save it. The nodes in the whitelist template profile are then the web page nodes that should be retained. When the server receives a webpage filtering display request sent by a terminal, it obtains the corresponding original web page and, based on the whitelist template profile, filters out of the original web page the nodes that the whitelist template profile does not include, thereby filtering out the specified type nodes included in the original web page.
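The whitelist case can be sketched in the same spirit. This is a hedged example under an assumption the text does not state: nodes are matched against the whitelist by position only, since the retained content of a newly requested page normally differs from the pages used for training. The function name and the position strings are hypothetical.

```python
def filter_with_whitelist(page, whitelist_positions):
    """Keep only nodes whose position appears in the whitelist template profile.

    `page` is a list of (position, content) pairs; matching by position
    alone is an assumption made for this sketch.
    """
    return [(pos, content) for pos, content in page if pos in whitelist_positions]


# Whitelist template profile: positions of nodes that should be retained.
whitelist_positions = {"/body/h1", "/body/article"}

original_page = [
    ("/body/h1", "Today's headline"),
    ("/body/article", "Full text of the article"),
    ("/body/aside", "Scan the QR code to subscribe"),
]

filtered_page = filter_with_whitelist(original_page, whitelist_positions)
```

Compared with a blacklist, a whitelist drops anything it has not seen, so a node type that appears for the first time is removed by default rather than kept.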
When the user uses a mobile terminal, the method can be applied in the transcoding process of the server: when the server obtains the original web page, it transcodes the original web page based on the template profile, so that the transcoded web page does not include specified type nodes.
It should be noted that the embodiment of the present invention is described only by taking as an example the case in which the server uses the currently generated web pages as the web pages to be analyzed. In practical applications, a website server is likely to update its web pages for reasons such as service upgrades or anti-crawling measures. Once a web page is updated, its content, or the position of content within it, may change, and the specified type nodes in the web page may change as well. To guarantee the timeliness of the template profile, the server therefore also updates the template profile.
Optionally, the server obtains the multiple web pages generated within a specified duration before the current point in time. That is, once every specified duration, the server obtains the multiple web pages generated within the specified duration before the current point in time, executes the above steps 201-204 on them to obtain an updated template profile, and filters the web pages to be displayed based on the updated template profile. The specified duration can be determined by the server according to the interval between the points in time at which web pages are updated, and can be one day, several days, and so on; the embodiment of the present invention does not limit it.
To keep the update process from affecting the current business of the server, the server can execute the above steps 201-204 offline when it obtains the multiple web pages. During this process, the server can continue to filter the web pages to be displayed based on the old template profile. When the server obtains the updated template profile, it reloads the updated template profile and filters the web pages to be displayed based on it.
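The offline update and reload described above can be sketched as follows. This is an illustrative sketch only: `TemplateStore`, `retrain_offline`, and the `train` callback (standing in for steps 201-204) are hypothetical names, and the lock-protected swap is one possible way to let request handling keep using the old profile until the new one is ready.

```python
import threading


class TemplateStore:
    """Holds the template profile currently used for filtering."""

    def __init__(self, profile):
        self._lock = threading.Lock()
        self._profile = frozenset(profile)

    def current(self):
        # Request handlers read whichever profile is loaded right now.
        with self._lock:
            return self._profile

    def reload(self, new_profile):
        # Swap in the updated profile in a single step.
        with self._lock:
            self._profile = frozenset(new_profile)


def retrain_offline(store, recent_pages, train):
    """Run training away from the serving path, then reload the result.

    `train` is a placeholder for steps 201-204; while it runs,
    `store.current()` still returns the old template profile.
    """
    new_profile = train(recent_pages)
    store.reload(new_profile)


store = TemplateStore({"old footer"})
profile_during_training = store.current()
retrain_offline(store, ["page 1", "page 2"], lambda pages: {"new footer"})
profile_after_reload = store.current()
```

In a deployment, `retrain_offline` would be triggered once every specified duration, for example by a scheduler; that scheduling is outside the scope of this sketch.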
In the current related art, the filtering profile is configured manually. When a website server updates its web pages, the originally configured filtering profile becomes invalid; operation personnel have to monitor the update status of each web page to find the invalid templates and then configure new templates, which consumes excessive human cost. Moreover, in practical applications, operation personnel find it difficult to spot an invalid template in time, so timeliness is poor. In the embodiment of the present invention, by contrast, the server automatically obtains the newly generated web pages once every specified duration and repeats the training steps in a rolling fashion, updating the template profile in time. The whole training process is unsupervised, automated, and repeated, which greatly reduces human cost and guarantees the timeliness of the template profile, and the use of offline training avoids any impact on current business.
In the method provided in the embodiment of the present invention, the possibility characteristic value of each node in each web page of a collection of web pages is calculated, and the nodes whose possibility characteristic value is greater than a specified threshold are taken as specified type nodes. The web pages to be displayed can then be filtered directly based on the determined specified type nodes, without manually configuring a filtering profile, which is simple and efficient and saves time cost and human cost. Further, newly generated web pages are obtained automatically and the training steps are repeated, so that the template profile is updated in time; this greatly reduces human cost and guarantees the timeliness of the template profile, and the use of offline training avoids any impact on current business.
Fig. 6 is a schematic structural diagram of a webpage filtering device provided in an embodiment of the present invention. Referring to Fig. 6, the device includes:
A collections of web pages obtaining module 601, configured to obtain a collection of web pages to be analyzed, where the collection includes multiple web pages and each web page includes multiple nodes;
A computing module 602, configured to calculate, for each node in each web page, the possibility characteristic value of the node, where the possibility characteristic value indicates how likely the node is to be a specified type node;
A specified type node determining module 603, configured to determine the nodes whose possibility characteristic value is greater than a specified threshold as specified type nodes;
A filtering module 604, configured to filter the web pages to be displayed based on the determined specified type nodes.
In the device provided in the embodiment of the present invention, the possibility characteristic value of each node in each web page of a collection of web pages is calculated, and the nodes whose possibility characteristic value is greater than a specified threshold are taken as specified type nodes. The web pages to be displayed can then be filtered directly based on the determined specified type nodes, without manually configuring a filtering profile, which is simple and efficient and saves time cost and human cost.
Optionally, the computing module 602 is configured to calculate, according to the content of each node, the similarity between the node and each node in the web pages of the collection other than the web page containing the node, and to compile statistics on the similarities between the node and each node in the other web pages to obtain the possibility characteristic value of the node.
Optionally, the device further includes:
A node grouping module, configured to group the nodes in the multiple web pages according to the position of each node in its web page to obtain multiple node sets, where the nodes in each node set are located at the same position in different web pages.
Optionally, the computing module 602 is configured to calculate, for each node in each node set and according to the content of each node, the similarity between the node and the other nodes in the node set, and to compile statistics on the similarities between the node and the other nodes in the node set to obtain the possibility characteristic value of the node.
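The grouping, similarity, and threshold steps performed by these modules can be sketched as below. This is a minimal illustration under stated assumptions, not the method as claimed: each web page is modeled as a dict from a node position to the node content, `difflib.SequenceMatcher` stands in for whatever similarity measure is actually used, and the mean stands in for the statistic compiled over the similarities. The function names, the sample pages, and the threshold of 0.8 are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from statistics import mean


def possibility_values(pages):
    """Return {(page_index, position): possibility characteristic value}."""
    # Group nodes by position: one node set per position across web pages.
    node_sets = defaultdict(list)
    for page_index, page in enumerate(pages):
        for position, content in page.items():
            node_sets[position].append((page_index, content))

    values = {}
    for position, members in node_sets.items():
        for page_index, content in members:
            # Similarity to every other node in the same node set.
            sims = [
                SequenceMatcher(None, content, other).ratio()
                for other_index, other in members
                if other_index != page_index
            ]
            # Assumed statistic: the mean similarity (0.0 if the set has one node).
            values[(page_index, position)] = mean(sims) if sims else 0.0
    return values


def specified_type_nodes(pages, threshold):
    """Nodes whose possibility characteristic value exceeds the threshold."""
    return {key for key, value in possibility_values(pages).items() if value > threshold}


pages = [
    {"/header": "Site A", "/body": "Apples are red", "/footer": "Follow us on WeChat"},
    {"/header": "Site A", "/body": "Trains run on rails", "/footer": "Follow us on WeChat"},
    {"/header": "Site A", "/body": "Rivers flow to the sea", "/footer": "Follow us on WeChat"},
]
flagged = specified_type_nodes(pages, threshold=0.8)
flagged_positions = {position for _, position in flagged}
```

The intuition is that content repeated at the same position across many pages of one group (headers, footers, promotional blocks) scores high, while the article body, which differs from page to page, scores low.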
Optionally, the collections of web pages obtaining module 601 is configured to obtain the multiple web pages generated within a specified duration before the current point in time, and to group the multiple web pages to obtain multiple collections of web pages.
Optionally, the collections of web pages obtaining module 601 is specifically configured to group the multiple web pages according to the publication account of each web page to obtain multiple collections of web pages; or to group the multiple web pages according to the storage directory of each web page to obtain multiple collections of web pages; or to group the multiple web pages according to the subdomain name of each web page to obtain multiple collections of web pages.
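The three grouping rules can be illustrated with a short sketch. It assumes, purely for illustration, that each web page record carries a `url` and an `account` field, that the storage directory can be read from the URL path, and that the subdomain name is the URL host; the helper names and the example.com addresses are hypothetical.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlparse


def group_pages(pages, key):
    """Group web pages into collections using the given key function."""
    groups = defaultdict(list)
    for page in pages:
        groups[key(page)].append(page)
    return dict(groups)


def by_account(page):
    return page["account"]


def by_directory(page):
    # Directory part of the URL path, e.g. "/a" for "/a/1.html".
    return posixpath.dirname(urlparse(page["url"]).path)


def by_subdomain(page):
    return urlparse(page["url"]).hostname


pages = [
    {"url": "http://news.example.com/a/1.html", "account": "alice"},
    {"url": "http://news.example.com/a/2.html", "account": "bob"},
    {"url": "http://blog.example.com/b/1.html", "account": "alice"},
]

collections_by_account = group_pages(pages, by_account)
collections_by_directory = group_pages(pages, by_directory)
collections_by_subdomain = group_pages(pages, by_subdomain)
```

Whichever rule is chosen, the same rule has to be applied at request time so that an incoming original web page maps to the collection whose template profile was trained on similar pages.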
Optionally, the filtering module 604 is configured to output the determined specified type nodes to a blacklist template profile; when a webpage filtering display request is received, to obtain the original web page corresponding to the request; and to filter the original web page based on the blacklist template profile, so as to filter out the specified type nodes included in the original web page.
Optionally, the filtering module 604 is configured to output the nodes in the multiple web pages other than the specified type nodes to a whitelist template profile; when a webpage filtering display request is received, to obtain the original web page corresponding to the request; and to filter the original web page based on the whitelist template profile, so as to filter out the specified type nodes included in the original web page.
All of the above optional technical solutions can be combined in any manner to form optional embodiments of the present invention, which are not described here one by one.
It should be understood that, when the webpage filtering device provided by the above embodiment filters web pages, the division into the functional modules described above is only an example. In practical applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the server can be divided into different functional modules to complete all or part of the functions described above. In addition, the webpage filtering device provided by the above embodiment and the Webpage filtering method embodiment belong to the same concept; for the specific implementation process, see the method embodiment, which is not repeated here.
Fig. 7 is a schematic structural diagram of a server provided in an embodiment of the present invention. The server can be used to implement the functions performed by the server in the Webpage filtering method shown in the above embodiments. Specifically, referring to Fig. 7, the server 700 may vary considerably depending on configuration or performance, and may include one or more central processing units (Central Processing Unit, CPU) 722 (for example, one or more processors), a memory 732, and one or more storage media 730 (for example, one or more mass storage devices) storing application programs 742 or data 744. The memory 732 and the storage medium 730 can provide transient storage or persistent storage. The program stored in the storage medium 730 may include one or more modules (not marked in the figure).
The server 700 can also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input/output interfaces 758, and/or one or more operating systems 741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
One or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing the following operations:
Obtaining a collection of web pages to be analyzed, where the collection includes multiple web pages and each web page includes multiple nodes;
For each node in each web page, calculating the possibility characteristic value of the node, where the possibility characteristic value indicates how likely the node is to be a specified type node;
Determining the nodes whose possibility characteristic value is greater than a specified threshold as specified type nodes;
Filtering the web pages to be displayed based on the determined specified type nodes.
Optionally, the programs also include instructions for performing the following operations:
Calculating, according to the content of each node, the similarity between the node and each node in the web pages of the collection other than the web page containing the node;
Compiling statistics on the similarities between the node and each node in the other web pages to obtain the possibility characteristic value of the node.
Optionally, the programs also include instructions for performing the following operation:
Grouping the nodes in the multiple web pages according to the position of each node in its web page to obtain multiple node sets, where the nodes in each node set are located at the same position in different web pages.
Optionally, the programs also include instructions for performing the following operations:
For each node in each node set, calculating, according to the content of each node, the similarity between the node and the other nodes in the node set;
Compiling statistics on the similarities between the node and the other nodes in the node set to obtain the possibility characteristic value of the node.
Optionally, the programs also include instructions for performing the following operations:
Obtaining the multiple web pages generated within a specified duration before the current point in time;
Grouping the multiple web pages to obtain multiple collections of web pages.
Optionally, the programs also include instructions for performing the following operations:
Grouping the multiple web pages according to the publication account of each web page to obtain multiple collections of web pages; or,
Grouping the multiple web pages according to the storage directory of each web page to obtain multiple collections of web pages; or,
Grouping the multiple web pages according to the subdomain name of each web page to obtain multiple collections of web pages.
Optionally, the programs also include instructions for performing the following operations:
Outputting the determined specified type nodes to a blacklist template profile;
When a webpage filtering display request is received, obtaining the original web page corresponding to the webpage filtering display request;
Filtering the original web page based on the blacklist template profile, so as to filter out the specified type nodes included in the original web page.
Optionally, the programs also include instructions for performing the following operations:
Outputting the nodes in the multiple web pages other than the specified type nodes to a whitelist template profile;
When a webpage filtering display request is received, obtaining the original web page corresponding to the webpage filtering display request;
Filtering the original web page based on the whitelist template profile, so as to filter out the specified type nodes included in the original web page.
All of the above optional technical solutions can be combined in any manner to form optional embodiments of the present invention, which are not described here one by one.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments can be completed by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A Webpage filtering method, characterized in that the method comprises:
Obtaining multiple web pages generated within a specified duration before the current point in time;
Grouping the multiple web pages according to the publication account of each web page to obtain multiple collections of web pages; or grouping the multiple web pages according to the storage directory of each web page to obtain multiple collections of web pages; or grouping the multiple web pages according to the subdomain name of each web page to obtain multiple collections of web pages, each collection of web pages including multiple web pages;
For each web page, dividing the web page into multiple nodes according to the formats of text nodes, picture nodes, video nodes, and web page link address nodes;
For each node, calculating the possibility characteristic value of the node, the possibility characteristic value indicating how likely the node is to be a specified type node;
Determining the nodes whose possibility characteristic value is greater than a specified threshold as the specified type nodes;
Filtering the web pages to be displayed based on the determined specified type nodes;
The method further comprising:
Grouping the nodes in the multiple web pages according to the position of each node in its web page to obtain multiple node sets, the nodes in each node set being located at the same position in different web pages;
Wherein calculating, for each node, the possibility characteristic value of the node comprises:
For each node in each node set, calculating, according to the content of each node, the similarity between the node and the other nodes in the node set;
Compiling statistics on the similarities between the node and the other nodes in the node set to obtain the possibility characteristic value of the node.
2. The method according to claim 1, characterized in that calculating, for each node, the possibility characteristic value of the node comprises:
Calculating, according to the content of each node, the similarity between the node and each node in the web pages of the collection of web pages other than the web page containing the node;
Compiling statistics on the similarities between the node and each node in the other web pages to obtain the possibility characteristic value of the node.
3. The method according to claim 1, characterized in that filtering the web pages to be displayed based on the determined specified type nodes comprises:
Outputting the determined specified type nodes to a blacklist template profile;
When a webpage filtering display request is received, obtaining the original web page corresponding to the webpage filtering display request;
Filtering the original web page based on the blacklist template profile, so as to filter out the specified type nodes included in the original web page.
4. The method according to claim 1, characterized in that filtering the web pages to be displayed based on the determined specified type nodes comprises:
Outputting the nodes in the multiple web pages other than the specified type nodes to a whitelist template profile;
When a webpage filtering display request is received, obtaining the original web page corresponding to the webpage filtering display request;
Filtering the original web page based on the whitelist template profile, so as to filter out the specified type nodes included in the original web page.
5. A webpage filtering device, characterized in that the device comprises:
A collections of web pages obtaining module, configured to obtain multiple web pages generated within a specified duration before the current point in time;
The collections of web pages obtaining module being further configured to group the multiple web pages according to the publication account of each web page to obtain multiple collections of web pages; or to group the multiple web pages according to the storage directory of each web page to obtain multiple collections of web pages; or to group the multiple web pages according to the subdomain name of each web page to obtain multiple collections of web pages, each collection of web pages including multiple web pages;
The collections of web pages obtaining module being further configured to divide each web page into multiple nodes according to the formats of text nodes, picture nodes, video nodes, and web page link address nodes;
A computing module, configured to calculate, for each node, the possibility characteristic value of the node, the possibility characteristic value indicating how likely the node is to be a specified type node;
A specified type node determining module, configured to determine the nodes whose possibility characteristic value is greater than a specified threshold as the specified type nodes;
A filtering module, configured to filter the web pages to be displayed based on the determined specified type nodes;
The device further comprising:
A node grouping module, configured to group the nodes in the multiple web pages according to the position of each node in its web page to obtain multiple node sets, the nodes in each node set being located at the same position in different web pages;
The computing module being further configured to calculate, for each node in each node set and according to the content of each node, the similarity between the node and the other nodes in the node set, and to compile statistics on the similarities between the node and the other nodes in the node set to obtain the possibility characteristic value of the node.
6. The device according to claim 5, characterized in that the computing module is configured to calculate, according to the content of each node, the similarity between the node and each node in the web pages of the collection of web pages other than the web page containing the node, and to compile statistics on the similarities between the node and each node in the other web pages to obtain the possibility characteristic value of the node.
7. The device according to claim 5, characterized in that the filtering module is configured to output the determined specified type nodes to a blacklist template profile; when a webpage filtering display request is received, to obtain the original web page corresponding to the webpage filtering display request; and to filter the original web page based on the blacklist template profile, so as to filter out the specified type nodes included in the original web page.
8. The device according to claim 5, characterized in that the filtering module is configured to output the nodes in the multiple web pages other than the specified type nodes to a whitelist template profile; when a webpage filtering display request is received, to obtain the original web page corresponding to the webpage filtering display request; and to filter the original web page based on the whitelist template profile, so as to filter out the specified type nodes included in the original web page.
9. A server for filtering web pages, characterized in that the server comprises a memory and one or more processors, wherein one or more programs are stored in the memory and configured to be executed by the one or more processors to implement the Webpage filtering method according to any one of claims 1 to 4.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs configured to be executed by one or more processors to implement the Webpage filtering method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410648193.1A CN105653550B (en) | 2014-11-14 | 2014-11-14 | Webpage filtering method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105653550A CN105653550A (en) | 2016-06-08 |
CN105653550B true CN105653550B (en) | 2019-11-05 |
Family
ID=56479084
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410648193.1A Active CN105653550B (en) | 2014-11-14 | 2014-11-14 | Webpage filtering method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105653550B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106326455A (en) * | 2016-08-26 | 2017-01-11 | 乐视控股(北京)有限公司 | Web page browsing filtering processing method and system, terminal and cloud acceleration server |
CN106599246B (en) * | 2016-12-20 | 2020-02-11 | 维沃移动通信有限公司 | Display content interception method, mobile terminal and control server |
CN108628888A (en) * | 2017-03-21 | 2018-10-09 | 中兴通讯股份有限公司 | A kind of browser Ad blocking method, apparatus and terminal |
CN107423059A (en) * | 2017-07-07 | 2017-12-01 | 北京小米移动软件有限公司 | Display methods, device and the terminal of the page |
CN109756393B (en) * | 2018-12-27 | 2021-04-30 | 阿里巴巴(中国)有限公司 | Information processing method, system, medium, and computing device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101399818A (en) * | 2007-09-25 | 2009-04-01 | 日电(中国)有限公司 | Theme related webpage filtering method and system based on navigation route information |
CN103678313A (en) * | 2012-08-31 | 2014-03-26 | 北京百度网讯科技有限公司 | Method and device for assessing authority of web pages |
CN103870590A (en) * | 2014-03-28 | 2014-06-18 | 北京奇虎科技有限公司 | Webpage identification method and device with error-reported characteristic |
Non-Patent Citations (1)
Title |
---|
网页噪声识别与消除方法研究;秦超;《中国优秀硕士学位论文全文数据库信息科技辑》;20120515;第I139-278页 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||