CN105653550B - Webpage filtering method and device - Google Patents


Info

Publication number
CN105653550B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410648193.1A
Other languages
Chinese (zh)
Other versions
CN105653550A (en
Inventor
朱龙军
孙钟前
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd

Abstract

The invention discloses a webpage filtering method and device, belonging to the field of Internet technology. The method comprises: obtaining a webpage collection to be analyzed, the webpage collection including multiple webpages and each webpage including multiple nodes; for each node in each webpage, calculating a possibility characteristic value of the node, the possibility characteristic value indicating how likely the node is to be a specified-type node; determining nodes whose possibility characteristic value is greater than a specified threshold to be specified-type nodes; and filtering a webpage to be displayed based on the determined specified-type nodes. By calculating the possibility characteristic value of each node in each webpage of the collection and treating nodes whose value exceeds the specified threshold as specified-type nodes, the invention can filter a webpage to be displayed directly from the determined specified-type nodes, without manually configuring a filtering template. The operation is simple and fast, and saves time and labor costs.

Description

Webpage filtering method and device
Technical field
The present invention relates to the field of Internet technology, and in particular to a webpage filtering method and device.
Background
With the spread of the Internet, many vendors place advertisements in webpages to promote their products. As a result, webpages contain all kinds of advertisements, which seriously interfere with normal browsing.
To filter out the advertisements in webpages, website operators can manually configure a filtering template according to the advertisements in each webpage and upload it to the website server, which then filters webpages according to the template. The filtering template can be a blacklist or a whitelist. When it is a blacklist, the website server extracts the webpage content that matches the template and filters out the extracted content. When it is a whitelist, the website server extracts the webpage content that matches the template and filters out all other content in the webpage.
In the course of implementing the present invention, the inventors found that the prior art has at least the following deficiency: configuring filtering templates for a massive number of webpages consumes excessive labor.
Summary of the invention
To solve the problem in the prior art, embodiments of the present invention provide a webpage filtering method and device. The technical solution is as follows:
In a first aspect, a webpage filtering method is provided, the method comprising:
Obtaining a webpage collection to be analyzed, the webpage collection including multiple webpages and each webpage including multiple nodes;
For each node in each webpage, calculating a possibility characteristic value of the node, the possibility characteristic value indicating how likely the node is to be a specified-type node;
Determining nodes whose possibility characteristic value is greater than a specified threshold to be specified-type nodes;
Filtering a webpage to be displayed based on the determined specified-type nodes.
In a second aspect, a webpage filtering device is provided, the device comprising:
A webpage collection obtaining module, configured to obtain a webpage collection to be analyzed, the webpage collection including multiple webpages and each webpage including multiple nodes;
A computing module, configured to calculate, for each node in each webpage, a possibility characteristic value of the node, the possibility characteristic value indicating how likely the node is to be a specified-type node;
A specified-type node determining module, configured to determine nodes whose possibility characteristic value is greater than a specified threshold to be specified-type nodes;
A filtering module, configured to filter a webpage to be displayed based on the determined specified-type nodes.
The technical solution provided by the embodiments of the present invention has the following beneficial effects:
The method and device provided by the embodiments of the present invention calculate the possibility characteristic value of each node in each webpage of a webpage collection and treat nodes whose value is greater than a specified threshold as specified-type nodes. A webpage to be displayed can then be filtered directly from the determined specified-type nodes, without manually configuring a filtering template. The operation is simple and fast, and saves time and labor costs.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a webpage filtering method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a webpage filtering method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of webpages provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of a specified tree structure provided by an embodiment of the present invention;
Fig. 5 is a flowchart of possibility characteristic value calculation provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a webpage filtering device provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a server provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a webpage filtering method provided by an embodiment of the present invention. The method is executed by a server. Referring to Fig. 1, the method comprises:
101. Obtain a webpage collection to be analyzed, the webpage collection including multiple webpages and each webpage including multiple nodes.
102. For each node in each webpage, calculate a possibility characteristic value of the node, the possibility characteristic value indicating how likely the node is to be a specified-type node.
103. Determine nodes whose possibility characteristic value is greater than a specified threshold to be specified-type nodes.
104. Filter a webpage to be displayed based on the determined specified-type nodes.
The method provided by this embodiment calculates the possibility characteristic value of each node in each webpage of a webpage collection and treats nodes whose value is greater than a specified threshold as specified-type nodes. A webpage to be displayed can then be filtered directly from the determined specified-type nodes, without manually configuring a filtering template. The operation is simple and fast, and saves time and labor costs.
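Steps 101 to 104 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each webpage is modelled as a list of node-content strings, `jaccard` (word-set overlap) stands in for the similarity measure, the mean is used as the statistic, and the function names, sample content, and threshold value are all hypothetical.

```python
def jaccard(a, b):
    """Word-set overlap; an assumed stand-in for the similarity measure."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def possibility_value(node, page_index, collection):
    """Step 102: mean similarity of `node` to every node of the other pages."""
    sims = [
        jaccard(node, other)
        for i, page in enumerate(collection)
        if i != page_index
        for other in page
    ]
    return sum(sims) / len(sims) if sims else 0.0


def find_specified_nodes(collection, threshold):
    """Step 103: nodes whose possibility value exceeds the threshold."""
    return {
        node
        for i, page in enumerate(collection)
        for node in page
        if possibility_value(node, i, collection) > threshold
    }


def filter_page(page, specified):
    """Step 104: drop specified-type nodes from a page to be displayed."""
    return [node for node in page if node not in specified]


# Step 101: a toy collection of two pages (hypothetical content).
collection = [
    ["Buy cheap widgets now", "Article one about cats", "Follow our account"],
    ["Buy cheap widgets now", "Article two about dogs", "Follow our account"],
]
specified = find_specified_nodes(collection, threshold=0.2)
print(filter_page(collection[0], specified))  # ['Article one about cats']
```

In this toy collection the two repeated nodes each score 1/3, while the article nodes score 1/9, so a threshold of 0.2 separates them.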
Optionally, calculating the possibility characteristic value of the node for each node in each webpage comprises:
According to the content of each node, calculating the similarity between the node and each node in the other webpages of the webpage collection, that is, the webpages other than the one containing the node;
Performing statistics on the similarities between the node and the nodes in the other webpages to obtain the possibility characteristic value of the node.
Optionally, the method further comprises:
Grouping the nodes of the multiple webpages according to the position of each node in its webpage to obtain multiple node sets, the nodes in each node set being located at the same position in different webpages.
Optionally, calculating the possibility characteristic value of the node for each node in each webpage comprises:
For each node in each node set, calculating, according to the content of each node, the similarity between the node and the other nodes in the node set;
Performing statistics on the similarities between the node and the other nodes in the node set to obtain the possibility characteristic value of the node.
Optionally, obtaining the webpage collection to be analyzed comprises:
Obtaining multiple webpages generated within a specified duration before the current point in time;
Grouping the multiple webpages to obtain multiple webpage collections.
Optionally, grouping the multiple webpages to obtain multiple webpage collections comprises:
Grouping the multiple webpages according to the publishing account of each webpage to obtain multiple webpage collections; or,
Grouping the multiple webpages according to the storage directory of each webpage to obtain multiple webpage collections; or,
Grouping the multiple webpages according to the subdomain name of each webpage to obtain multiple webpage collections.
Optionally, filtering the webpage to be displayed based on the determined specified-type nodes comprises:
Exporting the determined specified-type nodes to a blacklist template configuration file;
When a filtered-webpage display request is received, obtaining the original webpage corresponding to the request;
Filtering the original webpage based on the blacklist template configuration file, so as to filter out the specified-type nodes included in the original webpage.
Optionally, filtering the webpage to be displayed based on the determined specified-type nodes comprises:
Exporting the nodes of the multiple webpages other than the specified-type nodes to a whitelist template configuration file;
When a filtered-webpage display request is received, obtaining the original webpage corresponding to the request;
Filtering the original webpage based on the whitelist template configuration file, so as to filter out the specified-type nodes included in the original webpage.
All of the optional technical solutions above can be combined in any way to form optional embodiments of the present invention, which are not described one by one here.
Fig. 2 is a flowchart of a webpage filtering method provided by an embodiment of the present invention. The method is executed by a server. Referring to Fig. 2, the method comprises:
201. The server groups multiple webpages to be analyzed to obtain multiple webpage collections.
In this embodiment, the server provides webpages to a terminal, which can be a fixed terminal or a mobile terminal, such as a computer or a mobile phone. When a user wants to browse a webpage, the user triggers a webpage access operation on the terminal. On detecting the operation, the terminal sends the server a webpage display request that carries a webpage address. On receiving the request, the server obtains the corresponding original webpage according to the webpage address. If the terminal is a fixed terminal, the server sends it the original webpage, which the fixed terminal displays. If the terminal is a mobile terminal, the server transcodes the original webpage and sends the transcoded webpage, which the mobile terminal displays.
In practice, the original webpage may contain advertisements, operating instructions, recommendations, spam, and similar content. This content is unrelated to the content of the webpage itself but easily disrupts browsing, and many users want it filtered out. To meet this need, each time before the server sends a webpage to be displayed to the terminal, it can determine the content to be filtered in that webpage and filter the webpage accordingly. To make that determination easier, the server can train on multiple webpages in advance and identify the content to be filtered in each of them.
Further, to improve training accuracy, the server can group the webpages into multiple webpage collections and train on each collection separately. Specifically, the server can group all webpages, or select multiple sample webpages from all webpages and group the samples, or obtain a snapshot of each webpage and group the snapshots. The embodiments of the present invention do not limit this.
Optionally, the server groups the webpages according to a specified rule to obtain multiple webpage collections. The specified rule can be based on the publishing account, the storage directory, or the subdomain name of a webpage, among others, and the embodiments of the present invention do not limit this. The server holds webpages published by multiple accounts. When the specified rule is the publishing account, the server groups the webpages by the publishing account of each webpage, so that webpages in the same collection share a publishing account and webpages in different collections have different publishing accounts. The server stores webpages in different storage directories. When the specified rule is the storage directory, the server groups the webpages by the storage directory of each webpage, so that webpages in the same collection share a storage directory and webpages in different collections do not. The server generates a webpage address for each webpage, and the address includes a subdomain name. When the specified rule is the subdomain name, the server groups the webpages by the subdomain name of each webpage, so that webpages in the same collection share a subdomain name and webpages in different collections do not. In practice, the server can also group the webpages by other specified rules, and the embodiments of the present invention do not limit this.
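The three grouping rules can be illustrated with a short Python sketch. The record layout (`url`, `account`) is hypothetical, the host name of the address is taken as the subdomain name, and the URL path stands in for the server-side storage directory, which the text does not tie to the address; all of these are assumptions made for illustration.

```python
from collections import defaultdict
from posixpath import dirname
from urllib.parse import urlsplit


def group_pages(pages, rule):
    """Group page records by 'account', 'directory', or 'subdomain'."""
    key_functions = {
        "account": lambda p: p["account"],
        # Assumption: the URL path mirrors the storage directory.
        "directory": lambda p: dirname(urlsplit(p["url"]).path),
        # Assumption: the host name serves as the subdomain name.
        "subdomain": lambda p: urlsplit(p["url"]).hostname,
    }
    key_of = key_functions[rule]
    groups = defaultdict(list)
    for page in pages:
        groups[key_of(page)].append(page["url"])
    return dict(groups)


pages = [
    {"url": "http://news.example.com/2014/a.html", "account": "alice"},
    {"url": "http://news.example.com/2014/b.html", "account": "bob"},
    {"url": "http://blog.example.com/posts/c.html", "account": "alice"},
]
print(group_pages(pages, "subdomain"))
```

The same key function can later be applied to a webpage to be displayed in order to find the collection, and hence the training result, that applies to it.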
In this embodiment, different webpage collections belong to different groups. When the server later obtains a webpage to be displayed, it can classify that webpage by the same specified rule, determine the webpage collection that belongs to the same group as the webpage, and use the training result of that collection to determine the content to be filtered in the webpage. For example, when the server obtains a webpage to be displayed, it obtains the publishing account of the webpage and determines the webpage collection corresponding to that account, which is the collection belonging to the same group as the webpage.
202. For each node in each webpage of each webpage collection, the server calculates a possibility characteristic value of the node, the possibility characteristic value indicating how likely the node is to be a specified-type node.
The server can divide a webpage into multiple nodes, which may include text nodes, picture nodes, video nodes, webpage link address nodes, and nodes of other formats. Specifically, the server can divide the text content of the webpage into multiple text nodes by paragraph, and treat each picture as a picture node, each video as a video node, and each webpage link address as a webpage link address node. The embodiments of the present invention do not limit how the server divides nodes.
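One possible way to divide a page into such nodes is sketched below with Python's standard `html.parser` module. The mapping of `<p>`, `<img>`, `<video>`, and `<a>` tags to the four node kinds, the class name `NodeSplitter`, and the sample markup are assumptions for illustration; the text does not prescribe a parsing method.

```python
from html.parser import HTMLParser


class NodeSplitter(HTMLParser):
    """Collect (kind, content) nodes: text paragraphs, pictures, videos, links."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._in_paragraph = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self._in_paragraph, self._text = True, []
        elif tag == "img" and attrs.get("src"):
            self.nodes.append(("picture", attrs["src"]))
        elif tag == "video" and attrs.get("src"):
            self.nodes.append(("video", attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self.nodes.append(("link", attrs["href"]))

    def handle_data(self, data):
        if self._in_paragraph:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            text = "".join(self._text).strip()
            if text:  # one text node per non-empty paragraph
                self.nodes.append(("text", text))
            self._in_paragraph = False


def split_into_nodes(html):
    splitter = NodeSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.nodes


html = ('<p>First paragraph.</p><img src="ad.png">'
        '<p>See <a href="http://example.com/x">this</a>.</p>')
print(split_into_nodes(html))
```

A link inside a paragraph is emitted as its own node before the enclosing paragraph's text node, because the paragraph is only recorded when its closing tag is reached.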
The content of some nodes is the content of the webpage itself, while the content of other nodes is unrelated to it. A node whose content is unrelated to the content of the webpage it appears in is treated as a specified-type node, and specified-type nodes are the nodes to be filtered out of the webpage.
For each webpage collection, in order to filter out the specified-type nodes, the server analyzes each webpage in the collection and finds the nodes most likely to be specified-type nodes. Specifically, for each node in each webpage of the collection, the server calculates a possibility characteristic value that indicates how likely the node is to be a specified-type node. The larger the value, the more likely the node is a specified-type node; the smaller the value, the less likely.
In practice, the specified-type nodes in different webpages of the same collection often have identical or similar content. For example, Fig. 3 is a schematic diagram of webpages provided by an embodiment of the present invention, showing two webpages published by the same account. The two webpages contain different articles, "article 1" and "article 2", but both contain nodes with identical content at the top and bottom. These identical nodes are likely to be specified-type nodes.
Based on this characteristic, the more nodes in the webpage collection that are similar to a given node, the more likely that node is a specified-type node; the fewer similar nodes the collection contains, the less likely.
Therefore, for each node in each webpage, the server can calculate, according to the content of each node, the similarity between the node and each node in the other webpages of the collection, that is, the webpages other than the one containing the node. This yields multiple similarities between the node and multiple other nodes. The server performs statistics on these similarities to obtain the possibility characteristic value of the node. For the statistics, the server can use the sum of the similarities, their average, or another measure as the possibility characteristic value, and the embodiments of the present invention do not limit this.
Referring to Table 1, the webpage collection includes webpage A and webpage B. Webpage A includes node 1, and webpage B includes node 2 and node 3. For node 1, the server calculates the similarity between node 1 and node 2 and the similarity between node 1 and node 3, and uses the average of the two similarities as the possibility characteristic value of node 1.
Table 1
Webpage      Node
Webpage A    Node 1
Webpage B    Node 2 and node 3
Further, for text nodes, the server can preset a correspondence between node content and characteristic values, for example a characteristic value for each word in a text node. Using this correspondence, the server determines the multiple characteristic values of each text node and combines them into the feature vector of that text node. For picture nodes or webpage link address nodes, the server can preset a correspondence between URLs (Uniform Resource Locators) and feature vectors, obtain the URL of each picture node or webpage link address node, and determine its feature vector from the correspondence. For each node in each webpage, the server can then calculate the similarity between the feature vector of the node and the feature vector of each node in the other webpages, obtaining multiple similarities. The similarity can be the cosine similarity, a Euclidean-distance-based similarity, or another measure, and the embodiments of the present invention do not limit this.
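As a hedged illustration of the text-node case, the sketch below builds a feature vector from word counts and compares two vectors by cosine similarity. Using word counts as the characteristic values is an assumption (the text only says a correspondence is preset), and the function names are hypothetical. The URL-to-vector correspondence for picture and link nodes is not shown.

```python
import math
from collections import Counter


def text_feature_vector(text):
    """Map each word to a characteristic value (here, its count); an assumed mapping."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dict-like objects."""
    dot = sum(value * v.get(key, 0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


a = text_feature_vector("follow our official account")
b = text_feature_vector("follow our account today")
print(round(cosine_similarity(a, b), 2))  # 0.75
```

The two sample nodes share three of their four words, so the dot product is 3 and both norms are 2, giving 0.75.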
In practice, the specified-type nodes in different webpages of the same collection often occupy the same or similar positions in their webpages. For example, a website server may add an advertising node in the lower right corner of every webpage of the website. Based on this characteristic, and in order to reduce the amount of calculation, the server can calculate for each node only its similarity to the nodes at the same position in the other webpages. Specifically, the server groups the nodes of the multiple webpages according to the position of each node in its webpage to obtain multiple node sets, the nodes in each node set being located at the same position in different webpages. Then, for each node in each node set, the server calculates, according to the content of each node, the similarity between the node and the other nodes in the set, and performs statistics on these similarities to obtain the possibility characteristic value of the node.
Continuing the example of Table 1, assume that the position of node 3 in webpage B is the same as the position of node 1 in webpage A. The server then calculates only the similarity between node 1 and node 3 and uses it as the possibility characteristic value of node 1.
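The position-based grouping can be sketched as follows. Positions are represented by hypothetical labels such as "bottom-right" (the text does not say how a position is encoded), the mean is used as the statistic, and `difflib.SequenceMatcher` from the Python standard library stands in for the similarity measure; all three are assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def group_by_position(pages):
    """{page_id: [(position, content), ...]} -> {position: [(page_id, content), ...]}"""
    node_sets = defaultdict(list)
    for page_id, nodes in pages.items():
        for position, content in nodes:
            node_sets[position].append((page_id, content))
    return dict(node_sets)


def possibility_values(node_set):
    """Mean similarity of each node to the other nodes of its node set."""
    values = {}
    for page_id, content in node_set:
        sims = [
            SequenceMatcher(None, content, other).ratio()
            for other_id, other in node_set
            if other_id != page_id
        ]
        values[page_id] = sum(sims) / len(sims) if sims else 0.0
    return values


pages = {
    "A": [("top", "Article one"), ("bottom-right", "Buy widgets")],
    "B": [("top", "Article two"), ("bottom-right", "Buy widgets")],
}
node_sets = group_by_position(pages)
print(possibility_values(node_sets["bottom-right"]))  # {'A': 1.0, 'B': 1.0}
```

With n pages of m nodes each, a node is compared with n - 1 nodes instead of (n - 1) * m, which is the saving in calculation that the grouping is meant to provide.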
Optionally, the server can analyze each webpage in the webpage collection and build a specified tree structure for each webpage. The specified tree structure includes multiple nodes, and the server can calculate the possibility characteristic value of each node based on it. The specified tree structure can be a DOM (Document Object Model) tree or another tree structure, and the embodiments of the present invention do not limit this.
In the specified tree structure, the nodes are hierarchical: each node has one upper-layer (parent) node and may have multiple lower-layer (child) nodes. For example, a paragraph text node in a webpage may contain multiple single-line text nodes.
Take calculating the possibility characteristic value of a first node in a first webpage as an example, where a second webpage is any webpage in the collection other than the first webpage. For a node in the second webpage, when the first node is similar to that node, it may also be similar to the node's upper-layer node. In that case, to improve the accuracy of the possibility characteristic value, the server can select the largest node that is similar to the first node and use the similarity between the first node and that largest node when calculating the possibility characteristic value.
Fig. 4 is a schematic diagram of the specified tree structure of the second webpage provided by an embodiment of the present invention, and Fig. 5 is a flowchart of possibility characteristic value calculation provided by an embodiment of the present invention. Referring to Fig. 4 and Fig. 5, when calculating the possibility characteristic value of the first node, the server can perform the following steps (1) to (9):
(1) The server selects a bottom-layer node 111 in the specified tree structure of the second webpage.
(2) The server calculates a first similarity between the first node and node 111 and judges whether the first similarity is greater than a first threshold. If so, it performs step (4); if not, it performs step (3).
In this embodiment, a first similarity greater than the first threshold indicates that the first node is similar to node 111, and a first similarity not greater than the first threshold indicates that they are dissimilar. The first threshold can be predetermined by a technician, or determined by the server through statistics on the similarities between the first node and each bottom-layer node. The embodiments of the present invention do not limit this.
(3) The server selects another bottom-layer node 112 and continues with step (2), until every bottom-layer node has been selected.
(4) The server selects node 11, which is one layer above node 111.
(5) The server calculates a second similarity between the first node and node 11 and judges whether the second similarity is greater than the first threshold. If so, it performs step (8); if not, it performs step (6).
(6) The server takes the first similarity as the similarity to be counted.
When the first similarity is greater than the first threshold and the second similarity is not, the first node is similar to node 111 but dissimilar to node 11. The server therefore selects the first similarity as the similarity used in the subsequent statistics for the possibility characteristic value of the first node.
(7) From the bottom-layer nodes of the specified tree structure, the server selects a node 121 that is on a different branch from node 11, and continues with step (2).
(8) The server selects node 1, which is one layer above node 11, and continues with step (5), until the top-layer node has been selected.
(9) The server repeats the above steps for each webpage in the collection other than the first webpage. Once it has obtained the similarity to be counted for each webpage, it performs statistics on the obtained similarities to obtain the possibility characteristic value of the first node.
Steps (1) to (9) are only an illustrative procedure for calculating the possibility characteristic value. In practice, the server can also determine the largest node similar to the first node in each webpage in other ways and obtain the similarity to be counted for each webpage, so as to calculate the possibility characteristic value. The embodiments of the present invention do not limit this.
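The climb from a similar bottom-layer node to the largest similar node can be sketched as follows. This is one reading of steps (1) to (9), not a definitive implementation: the `TreeNode` class, the word-overlap similarity, treating an upper-layer node's content as the concatenation of its descendants' text, keeping the best similarity when a page yields several candidates, counting 0 for a page with no similar node, and averaging across pages are all assumptions. The sample tree is loosely modelled on the node numbering of Fig. 4, with node 12 added as a hypothetical parent of node 121.

```python
class TreeNode:
    """A node of a page's tree structure (a simplified stand-in for a DOM node)."""

    def __init__(self, text="", children=()):
        self.text = text
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def content(self):
        """Own text followed by the text of all descendants."""
        parts = [self.text] + [child.content() for child in self.children]
        return " ".join(part for part in parts if part)

    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def jaccard(a, b):
    """Word-set overlap; an assumed similarity measure."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def similarity_to_count(first_content, root, first_threshold, similarity=jaccard):
    """Similarity between the first node and the largest similar node of one page."""
    best = 0.0  # assumption: a page with no similar node contributes 0
    for leaf in root.leaves():                    # steps (1), (3), (7)
        current = similarity(first_content, leaf.content())
        if current <= first_threshold:            # step (2): not similar
            continue
        node = leaf
        while node.parent is not None:            # steps (4), (8)
            upper = similarity(first_content, node.parent.content())
            if upper <= first_threshold:          # step (5) fails, so step (6)
                break
            node, current = node.parent, upper
        best = max(best, current)                 # assumption: keep the best candidate
    return best


def possibility_value(first_content, other_roots, first_threshold):
    """Step (9): mean of the per-page similarities to be counted."""
    sims = [similarity_to_count(first_content, r, first_threshold) for r in other_roots]
    return sum(sims) / len(sims) if sims else 0.0


# node 1 -> node 11 -> (node 111, node 112); node 1 -> node 12 -> node 121
node_11 = TreeNode(children=[TreeNode("follow our"), TreeNode("official account")])
node_12 = TreeNode(children=[TreeNode("article two about dogs and cats today")])
node_1 = TreeNode(children=[node_11, node_12])

print(similarity_to_count("follow our official account", node_1, first_threshold=0.4))  # 1.0
```

Here the bottom-layer node "follow our" scores only 0.5 against the first node, but its parent, node 11, matches the first node exactly, while the root scores 4/11 and ends the climb. Using the parent's similarity of 1.0 rather than the leaf's 0.5 is the accuracy gain the text describes.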
203. The server determines the nodes in the webpage collection whose possibility characteristic value is greater than the specified threshold to be specified-type nodes.
The specified threshold can be obtained by the server by analyzing the possibility characteristic value of each node and the number of nodes in the webpage collection. The specified thresholds of different webpage collections may be the same or different, and the embodiments of the present invention do not limit this.
In this embodiment, a node whose possibility characteristic value is greater than the specified threshold is considered similar to many nodes in the other webpages of the collection; that is, the node occurs frequently in the collection, so it is treated as a specified-type node. A node whose possibility characteristic value is not greater than the specified threshold is similar to few nodes in the other webpages; that is, the node does not occur frequently in the collection, so it is not a specified-type node.
204, the server is based on fixed specified type node, treats displayed web page and is filtered.
When the server has determined the specified type node in the collections of web pages, can belong to together to the collections of web pages One group of other webpage to be presented is filtered, and filters out the specified type node in the webpage to be presented.Specifically, the server According to determining specified type node, template profile is generated, it is subsequent to be based on the template profile again, treat displayed web page It is filtered.
It in embodiments of the present invention, can when user wishes to browse webpage under the premise of filtering out specified type node To trigger the operation of access filtering webpage on the terminal, when the terminal gets the operation of access filtering webpage, to the clothes Business device sends network filtering and shows request, which, which shows, requests to carry web page address, which receives the webpage When showing request, it can obtain the web page display according to the web page address and request corresponding original web page, according to the specified rule Then, determining to belong to same group of other collections of web pages with the original web page, the corresponding template profile of the collections of web pages is obtained, then Based on the template profile, which is filtered, filters out the specified type node for including in the original web page, Filtered webpage is sent to the terminal, when which receives the filtered webpage, shows the filtered webpage.It should It include the content of webpage itself in filtered webpage, without including the specified type node unrelated with the web page contents, so that When user browses the webpage, salubriouser viewing experience can be provided for user to avoid the interference of specified type node.
The template configuration file may be a whitelist or a blacklist. Correspondingly, step 204 may include either of the following steps 204a and 204b:
204a. The server outputs the determined specified type nodes to a blacklist template configuration file. When a webpage filtering display request is received, the server obtains the original webpage corresponding to the request and filters the original webpage based on the blacklist template configuration file, so as to filter out the specified type nodes included in the original webpage.
The server can generate a blacklist template configuration file for the webpage collection, output the determined specified type nodes to it, and save it. The nodes in the blacklist template configuration file are then the specified type nodes that should be filtered out. When the server receives a webpage filtering display request sent by a terminal, it obtains the corresponding original webpage and, based on the blacklist template configuration file, removes from the original webpage the nodes that are included in the blacklist template configuration file, thereby filtering out the specified type nodes included in the original webpage.
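A minimal sketch of blacklist filtering is given below. The embodiment does not state how a template entry is matched against a node, so this sketch assumes a node is identified by a hypothetical `(position, content)` pair and that matching is exact equality. The `Node` type and both function names are illustrative only.

```python
from typing import NamedTuple


class Node(NamedTuple):
    position: str  # hypothetical locator, for example a DOM path
    content: str


def build_blacklist(specified_type_nodes):
    """Store the determined specified type nodes as the blacklist template."""
    return frozenset(specified_type_nodes)


def filter_with_blacklist(page, blacklist):
    """Keep every node of the original page that is not in the blacklist."""
    return [node for node in page if node not in blacklist]


# Hypothetical page: one content node and one recurring promotional node.
promo = Node("/body/div[3]", "Follow our account")
story = Node("/body/div[1]", "Today's story")
blacklist = build_blacklist([promo])
filtered_page = filter_with_blacklist([story, promo], blacklist)
```

With a blacklist, a node that was never seen during analysis is kept by default.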
204b. The server outputs the nodes in the multiple webpages other than the specified type nodes to a whitelist template configuration file. When a webpage filtering display request is received, the server obtains the original webpage corresponding to the request and filters the original webpage based on the whitelist template configuration file, so as to filter out the specified type nodes included in the original webpage.
The server can generate a whitelist template configuration file for the webpage collection, output the nodes in the multiple webpages other than the specified type nodes to it, and save it. The nodes in the whitelist template configuration file are then the webpage nodes that should be retained. When the server receives a webpage filtering display request sent by a terminal, it obtains the corresponding original webpage and, based on the whitelist template configuration file, removes from the original webpage the nodes that are not included in the whitelist template configuration file, thereby filtering out the specified type nodes included in the original webpage.
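A minimal sketch of whitelist filtering follows. It assumes, purely for illustration, that each page is a list of `(position, content)` pairs and that the whitelist records the positions of nodes to retain, not their exact content. The embodiment does not specify the template format, and the function names are hypothetical.

```python
def build_whitelist(pages, specified_positions):
    """Collect the positions of all analyzed nodes that are not specified type nodes."""
    return {
        position
        for page in pages
        for position, _content in page
        if position not in specified_positions
    }


def filter_with_whitelist(page, whitelist):
    """Keep only the nodes of the original page whose position is whitelisted."""
    return [(position, content) for position, content in page if position in whitelist]


# Hypothetical analyzed pages: "/footer" was determined to be a specified type node.
analyzed = [
    [("/header", "Title A"), ("/footer", "Follow us")],
    [("/header", "Title B"), ("/footer", "Follow us")],
]
whitelist = build_whitelist(analyzed, specified_positions={"/footer"})
original = [("/header", "Title C"), ("/footer", "Follow us"), ("/aside", "New widget")]
filtered_page = filter_with_whitelist(original, whitelist)
```

Under these assumptions a whitelist also drops nodes at positions that were never analyzed (here `/aside`).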
When the user uses a mobile terminal, this filtering process can be applied in the transcoding process of the server: when the server obtains the original webpage, it transcodes the original webpage based on the template configuration file, so that the transcoded webpage does not include specified type nodes.
It should be noted that this embodiment is described only by taking as an example the case where the server uses the currently generated webpages as the webpages to be analyzed. In practical applications, a website server is likely to update its webpages for reasons such as service upgrades or anti-crawling measures. Once a webpage is updated, the content of the webpage, or the position of that content within the webpage, may change, and so the specified type nodes in the webpage may change as well. To keep the template configuration file up to date, the server also needs to update the template configuration file.
Optionally, the server obtains the multiple webpages generated within a specified duration before the current time point. That is, at intervals of the specified duration, the server obtains the multiple webpages generated within the specified duration before the current time point, performs steps 201-204 above on those webpages to obtain an updated template configuration file, and filters webpages to be displayed based on the updated template configuration file. The specified duration may be determined by the server according to the interval between the time points at which webpages are updated, and may be one day, several days, or the like; this is not limited in the embodiments of the present invention.
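Selecting the webpages generated within the specified duration before the current time point can be sketched as below. The `generated_at` field and the function name are assumptions made for illustration, and the sketch treats both ends of the window as inclusive, which the embodiment does not specify.

```python
from datetime import datetime, timedelta


def pages_in_window(pages, now, specified_duration):
    """Return pages whose generation time falls within the window ending at `now`."""
    start = now - specified_duration
    return [page for page in pages if start <= page["generated_at"] <= now]


# Hypothetical data: a one-day window ending at noon on the filing date.
now = datetime(2014, 11, 14, 12, 0)
pages = [
    {"id": 1, "generated_at": datetime(2014, 11, 14, 9, 0)},
    {"id": 2, "generated_at": datetime(2014, 11, 12, 9, 0)},
]
recent = pages_in_window(pages, now, timedelta(days=1))
```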
To prevent the update process from affecting the server's current services, the server can perform steps 201-204 above offline after obtaining the multiple webpages. During this process, the server can continue to filter webpages to be displayed based on the old template configuration file. When the server obtains the updated template configuration file, it reloads it and from then on filters webpages to be displayed based on the updated template configuration file.
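One way to keep serving the old template while a new one is built offline is to swap the template reference only after training has finished. The sketch below assumes an in-memory template guarded by a lock. `TemplateStore`, `refresh`, and the `fetch_recent_pages` and `train` callbacks are hypothetical names standing in for steps 201-204, not part of the embodiment.

```python
import threading


class TemplateStore:
    """Holds the template currently used for filtering."""

    def __init__(self, template):
        self._lock = threading.Lock()
        self._template = template

    def current(self):
        with self._lock:
            return self._template

    def reload(self, new_template):
        # Replace the old template in one step once the new one is ready.
        with self._lock:
            self._template = new_template


def refresh(store, fetch_recent_pages, train):
    """Build a new template offline, then swap it in.

    While `train` runs, `store.current()` still returns the old template,
    so request handling is not interrupted.
    """
    new_template = train(fetch_recent_pages())
    store.reload(new_template)
```

A scheduler could call `refresh` once per specified duration, in a background thread or a separate process.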
In the related art, filtering configuration files are configured manually. When a website server updates its webpages, the originally configured filtering configuration file becomes invalid, and operations staff must monitor the update status of every webpage to discover the invalid template and then configure a new one, which consumes excessive labor. Moreover, in practice it is difficult for operations staff to discover an invalid template promptly, so timeliness is poor. In the embodiments of the present invention, by contrast, at intervals of the specified duration the server automatically obtains the newly generated webpages and repeats the training steps in a rolling manner, updating the template configuration file in time. The whole training process is unsupervised, automated, and repeated, which greatly reduces labor cost and keeps the template configuration file current. Because training is performed offline, the impact on current services is also avoided.
In the method provided by the embodiments of the present invention, the possibility characteristic value of each node in each webpage of the webpage collection is calculated, and nodes whose possibility characteristic value is greater than the specified threshold are taken as specified type nodes. Webpages to be displayed can then be filtered directly based on the determined specified type nodes, without manually configuring a filtering configuration file, which is simple and fast and saves time and labor. Further, newly generated webpages are obtained automatically and the training steps are repeated to update the template configuration file in time, which greatly reduces labor cost and keeps the template configuration file current, and the use of offline training avoids any impact on current services.
Fig. 6 is a schematic structural diagram of a webpage filtering device provided by an embodiment of the present invention. Referring to Fig. 6, the device includes:
A webpage collection obtaining module 601, configured to obtain a webpage collection to be analyzed, where the webpage collection includes multiple webpages and each webpage includes multiple nodes;
A computing module 602, configured to calculate, for each node in each webpage, the possibility characteristic value of the node, where the possibility characteristic value indicates how likely the node is to be a specified type node;
A specified type node determining module 603, configured to determine nodes whose possibility characteristic value is greater than the specified threshold as specified type nodes;
A filtering module 604, configured to filter the webpage to be displayed based on the determined specified type nodes.
In the device provided by the embodiments of the present invention, the possibility characteristic value of each node in each webpage of the webpage collection is calculated, and nodes whose possibility characteristic value is greater than the specified threshold are taken as specified type nodes. Webpages to be displayed can then be filtered directly based on the determined specified type nodes, without manually configuring a filtering configuration file, which is simple and fast and saves time and labor.
Optionally, the computing module 602 is configured to calculate, according to the content of each node, the similarity between the node and each node in the other webpages of the webpage collection (that is, the webpages other than the one containing the node), and to perform statistics on those similarities to obtain the possibility characteristic value of the node.
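The computation performed by the computing module can be sketched as follows. This section names neither the similarity measure nor the statistic, so both are assumptions here: similarity is `difflib.SequenceMatcher.ratio()` on node text, and the statistic is the fraction of other webpages that contain at least one node whose similarity reaches a hypothetical `cutoff`.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Text similarity in [0, 1]; an assumed stand-in for the content comparison."""
    return SequenceMatcher(None, a, b).ratio()


def possibility_value(node, page_index, pages, cutoff=0.9):
    """Fraction of other pages containing a node similar to `node`.

    pages: list of pages, each a list of node content strings.
    page_index: index of the page that `node` belongs to.
    """
    others = [page for i, page in enumerate(pages) if i != page_index]
    if not others:
        return 0.0
    hits = sum(
        1
        for page in others
        if any(similarity(node, other) >= cutoff for other in page)
    )
    return hits / len(others)


# Hypothetical collection: the navigation text recurs, the article text does not.
pages = [
    ["Home | About", "Story one"],
    ["Home | About", "Another tale"],
    ["Home | About", "Third piece"],
]
nav_value = possibility_value("Home | About", 0, pages)
story_value = possibility_value("Story one", 0, pages)
```

Comparing a node with every node of every other page costs time proportional to the total number of nodes in the collection for each node scored.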
Optionally, the device further includes:
A node grouping module, configured to group the multiple nodes in the multiple webpages according to the position of each node in its corresponding webpage to obtain multiple node sets, where the multiple nodes in each node set are located at the same position in different webpages.
Optionally, the computing module 602 is configured to calculate, for each node in each node set and according to the content of each node, the similarity between the node and the other nodes in the node set, and to perform statistics on those similarities to obtain the possibility characteristic value of the node.
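Grouping nodes by position and then scoring each node only against the other nodes in its own node set can be sketched as below. The page representation (a dictionary from a hypothetical position string to node text), the use of `difflib.SequenceMatcher.ratio()` as the similarity, and the mean as the statistic are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def group_by_position(pages):
    """Group node contents from all pages by their position within the page.

    pages: list of dicts mapping position -> node content.
    Returns a dict mapping position -> list of contents, one per page that has it.
    """
    node_sets = defaultdict(list)
    for page in pages:
        for position, content in page.items():
            node_sets[position].append(content)
    return dict(node_sets)


def possibility_values_for_set(contents):
    """For each node in a node set, the mean similarity to the other nodes."""
    values = []
    for i, content in enumerate(contents):
        others = contents[:i] + contents[i + 1:]
        if not others:
            values.append(0.0)
            continue
        total = sum(SequenceMatcher(None, content, o).ratio() for o in others)
        values.append(total / len(others))
    return values


# Hypothetical pages: div[1] holds a recurring banner, div[2] holds the body text.
pages = [
    {"/body/div[1]": "Follow us", "/body/div[2]": "alpha"},
    {"/body/div[1]": "Follow us", "/body/div[2]": "zzzz"},
]
node_sets = group_by_position(pages)
banner_values = possibility_values_for_set(node_sets["/body/div[1]"])
body_values = possibility_values_for_set(node_sets["/body/div[2]"])
```

Restricting the comparison to one node set means each node is compared with at most one node per other page.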
Optionally, the webpage collection obtaining module 601 is configured to obtain the multiple webpages generated within the specified duration before the current time point, and to group the multiple webpages to obtain multiple webpage collections.
Optionally, the webpage collection obtaining module 601 is specifically configured to group the multiple webpages according to the publishing account of each webpage to obtain multiple webpage collections; or to group the multiple webpages according to the storage directory of each webpage to obtain multiple webpage collections; or to group the multiple webpages according to the subdomain name of each webpage to obtain multiple webpage collections.
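The three grouping rules can be sketched with a single grouping function and three key functions. The page records, the `account` field, and the use of the URL path's directory as the "storage directory" are assumptions for illustration; the embodiment does not say how these attributes are obtained.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit


def key_by_account(page):
    return page["account"]


def key_by_directory(page):
    # Assumption: the storage directory is approximated by host plus URL directory.
    parts = urlsplit(page["url"])
    return (parts.hostname, posixpath.dirname(parts.path))


def key_by_subdomain(page):
    return urlsplit(page["url"]).hostname


def group_pages(pages, key):
    """Group pages into webpage collections using the given key function."""
    collections = defaultdict(list)
    for page in pages:
        collections[key(page)].append(page)
    return dict(collections)


# Hypothetical pages on example.com subdomains.
pages = [
    {"url": "http://news.example.com/sports/a.html", "account": "alice"},
    {"url": "http://news.example.com/tech/b.html", "account": "bob"},
    {"url": "http://blog.example.com/sports/c.html", "account": "alice"},
]
by_subdomain = group_pages(pages, key_by_subdomain)
by_directory = group_pages(pages, key_by_directory)
by_account = group_pages(pages, key_by_account)
```

The same key function would also be applied to an incoming original webpage to look up the webpage collection, and hence the template, that it belongs to.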
Optionally, the filtering module 604 is configured to output the determined specified type nodes to a blacklist template configuration file; when a webpage filtering display request is received, to obtain the original webpage corresponding to the request; and to filter the original webpage based on the blacklist template configuration file, so as to filter out the specified type nodes included in the original webpage.
Optionally, the filtering module 604 is configured to output the nodes in the multiple webpages other than the specified type nodes to a whitelist template configuration file; when a webpage filtering display request is received, to obtain the original webpage corresponding to the request; and to filter the original webpage based on the whitelist template configuration file, so as to filter out the specified type nodes included in the original webpage.
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present invention, which are not described one by one here.
It should be noted that, when the webpage filtering device provided by the above embodiment filters webpages, the division into the functional modules described above is used only as an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the server may be divided into different functional modules to complete all or part of the functions described above. In addition, the webpage filtering device provided by the above embodiment and the webpage filtering method embodiments belong to the same concept; for the specific implementation process, see the method embodiments, which are not repeated here.
Fig. 7 is a schematic structural diagram of a server provided by an embodiment of the present invention. The server can be used to implement the functions performed by the server in the webpage filtering method shown in the above embodiments. Specifically, referring to Fig. 7, the server 700 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 722 (for example, one or more processors), a memory 732, and one or more storage media 730 (for example, one or more mass storage devices) storing application programs 742 or data 744. The memory 732 and the storage medium 730 may provide transient or persistent storage. The program stored in the storage medium 730 may include one or more modules (not marked in the figure).
The server 700 may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input/output interfaces 758, and/or one or more operating systems 741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
One or more programs are stored in the memory and configured to be executed by the one or more processors. The one or more programs include instructions for performing the following operations:
Obtaining a webpage collection to be analyzed, where the webpage collection includes multiple webpages and each webpage includes multiple nodes;
For each node in each webpage, calculating the possibility characteristic value of the node, where the possibility characteristic value indicates how likely the node is to be a specified type node;
Determining nodes whose possibility characteristic value is greater than the specified threshold as specified type nodes;
Filtering the webpage to be displayed based on the determined specified type nodes.
Optionally, the one or more programs further include instructions for performing the following operations:
Calculating, according to the content of each node, the similarity between the node and each node in the other webpages of the webpage collection, that is, the webpages other than the one containing the node;
Performing statistics on the similarities between the node and each node in the other webpages to obtain the possibility characteristic value of the node.
Optionally, the one or more programs further include instructions for performing the following operation:
Grouping the multiple nodes in the multiple webpages according to the position of each node in its corresponding webpage to obtain multiple node sets, where the multiple nodes in each node set are located at the same position in different webpages.
Optionally, the one or more programs further include instructions for performing the following operations:
For each node in each node set, calculating, according to the content of each node, the similarity between the node and the other nodes in the node set;
Performing statistics on the similarities between the node and the other nodes in the node set to obtain the possibility characteristic value of the node.
Optionally, the one or more programs further include instructions for performing the following operations:
Obtaining the multiple webpages generated within the specified duration before the current time point;
Grouping the multiple webpages to obtain multiple webpage collections.
Optionally, the one or more programs further include instructions for performing the following operations:
Grouping the multiple webpages according to the publishing account of each webpage to obtain multiple webpage collections; or
Grouping the multiple webpages according to the storage directory of each webpage to obtain multiple webpage collections; or
Grouping the multiple webpages according to the subdomain name of each webpage to obtain multiple webpage collections.
Optionally, the one or more programs further include instructions for performing the following operations:
Outputting the determined specified type nodes to a blacklist template configuration file;
When a webpage filtering display request is received, obtaining the original webpage corresponding to the request;
Filtering the original webpage based on the blacklist template configuration file, so as to filter out the specified type nodes included in the original webpage.
Optionally, the one or more programs further include instructions for performing the following operations:
Outputting the nodes in the multiple webpages other than the specified type nodes to a whitelist template configuration file;
When a webpage filtering display request is received, obtaining the original webpage corresponding to the request;
Filtering the original webpage based on the whitelist template configuration file, so as to filter out the specified type nodes included in the original webpage.
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present invention, which are not described one by one here.
Those of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes merely preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A webpage filtering method, characterized in that the method comprises:
obtaining multiple webpages generated within a specified duration before the current time point;
grouping the multiple webpages according to the publishing account of each webpage to obtain multiple webpage collections; or grouping the multiple webpages according to the storage directory of each webpage to obtain multiple webpage collections; or grouping the multiple webpages according to the subdomain name of each webpage to obtain multiple webpage collections, each webpage collection including multiple webpages;
for each webpage, dividing the webpage into multiple nodes according to the formats of text nodes, picture nodes, video nodes, and webpage link address nodes;
for each node, calculating a possibility characteristic value of the node, the possibility characteristic value indicating how likely the node is to be a specified type node;
determining nodes whose possibility characteristic value is greater than a specified threshold as specified type nodes;
filtering a webpage to be displayed based on the determined specified type nodes;
the method further comprising:
grouping the multiple nodes in the multiple webpages according to the position of each node in its corresponding webpage to obtain multiple node sets, the multiple nodes in each node set being located at the same position in different webpages;
wherein the calculating, for each node, a possibility characteristic value of the node comprises:
for each node in each node set, calculating, according to the content of each node, the similarity between the node and the other nodes in the node set;
performing statistics on the similarities between the node and the other nodes in the node set to obtain the possibility characteristic value of the node.
2. The method according to claim 1, characterized in that the calculating, for each node, a possibility characteristic value of the node comprises:
calculating, according to the content of each node, the similarity between the node and each node in the other webpages of the webpage collection other than the webpage containing the node;
performing statistics on the similarities between the node and each node in the other webpages to obtain the possibility characteristic value of the node.
3. The method according to claim 1, characterized in that the filtering a webpage to be displayed based on the determined specified type nodes comprises:
outputting the determined specified type nodes to a blacklist template configuration file;
when a webpage filtering display request is received, obtaining the original webpage corresponding to the webpage filtering display request;
filtering the original webpage based on the blacklist template configuration file, so as to filter out the specified type nodes included in the original webpage.
4. The method according to claim 1, characterized in that the filtering a webpage to be displayed based on the determined specified type nodes comprises:
outputting the nodes in the multiple webpages other than the specified type nodes to a whitelist template configuration file;
when a webpage filtering display request is received, obtaining the original webpage corresponding to the webpage filtering display request;
filtering the original webpage based on the whitelist template configuration file, so as to filter out the specified type nodes included in the original webpage.
5. A webpage filtering device, characterized in that the device comprises:
a webpage collection obtaining module, configured to obtain multiple webpages generated within a specified duration before the current time point;
the webpage collection obtaining module being further configured to group the multiple webpages according to the publishing account of each webpage to obtain multiple webpage collections; or to group the multiple webpages according to the storage directory of each webpage to obtain multiple webpage collections; or to group the multiple webpages according to the subdomain name of each webpage to obtain multiple webpage collections, each webpage collection including multiple webpages;
the webpage collection obtaining module being further configured to divide, for each webpage, the webpage into multiple nodes according to the formats of text nodes, picture nodes, video nodes, and webpage link address nodes;
a computing module, configured to calculate, for each node, a possibility characteristic value of the node, the possibility characteristic value indicating how likely the node is to be a specified type node;
a specified type node determining module, configured to determine nodes whose possibility characteristic value is greater than a specified threshold as specified type nodes;
a filtering module, configured to filter a webpage to be displayed based on the determined specified type nodes;
the device further comprising:
a node grouping module, configured to group the multiple nodes in the multiple webpages according to the position of each node in its corresponding webpage to obtain multiple node sets, the multiple nodes in each node set being located at the same position in different webpages;
the computing module being further configured to calculate, for each node in each node set and according to the content of each node, the similarity between the node and the other nodes in the node set, and to perform statistics on the similarities between the node and the other nodes in the node set to obtain the possibility characteristic value of the node.
6. The device according to claim 5, characterized in that the computing module is configured to calculate, according to the content of each node, the similarity between the node and each node in the other webpages of the webpage collection other than the webpage containing the node, and to perform statistics on the similarities between the node and each node in the other webpages to obtain the possibility characteristic value of the node.
7. The device according to claim 5, characterized in that the filtering module is configured to output the determined specified type nodes to a blacklist template configuration file; when a webpage filtering display request is received, to obtain the original webpage corresponding to the webpage filtering display request; and to filter the original webpage based on the blacklist template configuration file, so as to filter out the specified type nodes included in the original webpage.
8. The device according to claim 5, characterized in that the filtering module is configured to output the nodes in the multiple webpages other than the specified type nodes to a whitelist template configuration file; when a webpage filtering display request is received, to obtain the original webpage corresponding to the webpage filtering display request; and to filter the original webpage based on the whitelist template configuration file, so as to filter out the specified type nodes included in the original webpage.
9. A server for filtering webpages, characterized in that the server comprises a memory and one or more processors, wherein one or more programs are stored in the memory and configured to be executed by the one or more processors to implement the webpage filtering method according to any one of claims 1 to 4.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs, the one or more programs being configured to be executed by one or more processors to implement the webpage filtering method according to any one of claims 1 to 4.
CN201410648193.1A 2014-11-14 2014-11-14 Webpage filtering method and device Active CN105653550B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410648193.1A CN105653550B (en) 2014-11-14 2014-11-14 Webpage filtering method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410648193.1A CN105653550B (en) 2014-11-14 2014-11-14 Webpage filtering method and device

Publications (2)

Publication Number Publication Date
CN105653550A CN105653550A (en) 2016-06-08
CN105653550B true CN105653550B (en) 2019-11-05

Family

ID=56479084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410648193.1A Active CN105653550B (en) 2014-11-14 2014-11-14 Webpage filtering method and device

Country Status (1)

Country Link
CN (1) CN105653550B (en)

Also Published As

Publication number Publication date
CN105653550A (en) 2016-06-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant