CN103020123A - Method for searching bad video website - Google Patents

Publication number
CN103020123A
Authority
CN
China
Prior art keywords
website
webpage
video service
database
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012104652132A
Other languages
Chinese (zh)
Other versions
CN103020123B (en
Inventor
朱明�
尹文科
孙永录
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC
Priority to CN201210465213.2A
Publication of CN103020123A
Application granted
Publication of CN103020123B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for searching for bad video websites, comprising the following steps: constructing a search request from search keywords, obtaining the returned search results, and extracting the website addresses and related search keywords from those results; updating the search keywords in a search keyword database according to how closely the related search keywords match the theme of video service websites and how capable they are of producing new bad website addresses; judging whether the web page corresponding to a website address in the search results is a video service page, and if so adding the website name and address to a video service website database, otherwise discarding the address; and judging the health degree of the website addresses in the video service website database and storing those whose health degree is lower than a first health degree threshold in a bad video website database. The invention thereby provides a technical scheme that can find bad video websites quickly and accurately, so that websites providing bad video services can be supervised conveniently and effectively.

Description

A method for searching for bad video websites
Technical field
The present invention relates to the field of Internet information retrieval, and in particular to a method for searching for bad video websites.
Background art
With the rapid development of Internet technology, people rely on the Internet more and more, and Internet content, especially multimedia content, is growing rapidly. At the same time, the bad video content within that multimedia content is also growing rapidly.
At present, bad video service websites on the Internet mainly fall into three classes: (1) video service websites that directly provide on-demand access to bad video content, typically organized under several classification schemes and offered through catalog browsing; (2) P2P service sites that provide shared downloads of bad video resources, such as the download sites that BT seed files mainly point to; (3) service sites that provide real-time P2P live streaming of bad video.
These three classes of bad video service website are numerous, and they keep growing and changing. An information search scheme is therefore needed that can automatically discover and retrieve websites containing bad video content from the mass of information on the Internet. However, existing Internet search engines, such as Google and Baidu, still cannot find websites that provide bad video services accurately and effectively.
Summary of the invention
The purpose of the invention is to provide a method for searching for bad video websites, so that websites containing bad video content can be discovered and retrieved automatically, accurately and effectively from the mass of Internet information.
This purpose is achieved through the following technical solution:
A method for searching for bad video websites comprises:
constructing a search request from a search keyword in a search keyword database;
obtaining the search results that a search engine returns for the search request, and extracting the website addresses and related search keywords from the search results;
updating the search keywords in the search keyword database according to how closely the related search keywords in the current search results match the theme of video service websites and how capable they are of producing new bad website addresses;
judging whether the web page corresponding to a website address in the search results is a video service page; if so, extracting the website name and adding the website name and URL to a video service website database; if not, discarding the website address;
judging the health degree of the website addresses in the video service website database, and storing websites whose health degree is lower than a first health degree threshold in a bad video website database.
The method further comprises:
updating the search keyword database according to the content of the keyword and description meta tags in those web pages of the search results that are related to the video service theme;
and/or,
updating the search keyword database according to the links to other websites in those web pages of the search results that are related to the video service theme.
The step of judging whether the web page corresponding to a website address in the search results is a video service page comprises:
loading the web page corresponding to the website address and running the scripts on the page, then judging whether a characteristic HyperText Markup Language (HTML) tag that generates a player is present; if so, determining the candidate players in the page; then analyzing the visual features of each candidate player object to determine whether the size of the video picture played by the player meets a predetermined size threshold, and if so, determining that the web page is a video service page;
or,
judging whether the web page corresponding to the website address is a video service page according to how well it matches the pages in the video web page templates.
After a web page is determined to be a video service page, the page is saved as a video web page template, which serves as a basis for judging whether other web pages are video service pages.
The step of updating the search keywords in the search keyword database comprises:
judging how closely the related search keywords in the search results match the theme of video service websites: if the proportion of video service website URLs in the currently returned search results exceeds a predetermined value, judging the ability of the current search keyword to produce new website addresses; if the proportion of websites in the currently returned search results that are already contained in a predetermined candidate URL database is lower than a predetermined value, adding the related search keywords in the search results to the search keyword database. The candidate URL database records the websites obtained from earlier search results.
Before the health degree of the website addresses in the video service website database is judged, the method further comprises a step of merging and reducing the non-home-page addresses in the video service website database to the address of the home page of the video service website. This step comprises:
for two different website addresses in the video service website database, judging whether their host names are identical; if so, judging whether their corresponding website names are identical; if so, comparing their path depths and reducing the address with the greater path depth to the address with the smaller path depth; and so on, until all website addresses in the video service website database have been processed.
The website name is extracted as follows:
extracting the content of the title tags of different web pages under the same website, and using a longest common substring algorithm to extract the content that occurs most frequently in those title tags as the name of the website.
The step of judging the health degree of the website addresses in the video service website database comprises:
for a website to be assessed, selecting a predetermined number of web pages according to depth within the website, and building the corresponding web page ontology graph for each page;
calculating the similarity between each web page ontology graph and the ontology graph of the page at the corresponding depth in a website of the predetermined bad website templates, and determining, from that similarity and the score of the website in the bad website template, the health score of the website to be assessed relative to that template website;
calculating the health value of the website to be assessed from its health scores relative to the websites in all bad website templates.
If the health value of the website to be assessed is lower than a set second health degree threshold, the website to be assessed is added to the bad website templates, so as to update the bad website template library that contains them.
The first health degree threshold is greater than the second health degree threshold.
As can be seen from the technical solution above, the embodiments of the invention solve the technical problem that existing Internet search engines cannot find and search for video service websites accurately and efficiently. They provide a technical scheme that can automatically, accurately and effectively discover and retrieve websites containing bad video content from the mass of Internet information, which in turn makes effective control and management of bad video websites more convenient.
Description of drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow diagram of the method provided by an embodiment of the invention;
Fig. 2A is the first overall implementation architecture diagram of an embodiment of the invention;
Fig. 2B is the second overall implementation architecture diagram of an embodiment of the invention;
Fig. 3 is a flow chart of the search agent module in an embodiment of the invention;
Fig. 4 is a flow chart of the search result processing module in an embodiment of the invention;
Fig. 5 is a flow chart of the keyword evaluation module in an embodiment of the invention;
Fig. 6 is a flow chart of the URL analysis module in an embodiment of the invention;
Fig. 7 is a flow chart of the URL merging module in an embodiment of the invention;
Fig. 8 is a flow chart of website health degree evaluation in an embodiment of the invention;
Fig. 9 is a flow chart for building a web page ontology graph in an embodiment of the invention;
Fig. 10 is the web page ontology graph OG D1 in an embodiment of the invention;
Fig. 11 is the web page ontology graph OG D2 in an embodiment of the invention.
Embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments that a person of ordinary skill in the art obtains from the embodiments of the invention without creative effort fall within the protection scope of the invention.
The embodiments of the invention are described in further detail below with reference to the drawings.
An embodiment of the invention provides a method for searching for bad video websites. As shown in Fig. 1, its implementation may include the following processing:
Step 11: construct a search request from a search keyword in the search keyword database, which contains at least one search keyword;
Step 12: obtain the search results that the search engine returns for the search request, and extract the website addresses and related search keywords from the search results;
Here, the related search keywords may be the related searches listed at the bottom of the search results page, such as those shown beneath the results pages returned by Baidu or Google;
Step 13: update the search keywords in the search keyword database according to how closely the related search keywords in the current search results match the theme of video service websites and how capable they are of producing new bad website addresses;
Specifically, updating the search keywords in the search keyword database may include the following processing:
(1) Judge how closely the related search keywords in the search results match the theme of video service websites. If the proportion of video service website URLs in the currently returned search results is lower than a predetermined value (for example, 75%), the related search keywords at the bottom of the current results page are considered to differ too much from the search theme, so they are not added to the search keyword database; otherwise, perform (2);
(2) Judge the ability of the current search keyword to produce new website addresses. If the proportion of websites in the currently returned search results that are already contained in the predetermined candidate URL database is lower than a predetermined value (for example, 75%), the related search keywords at the bottom of the current results page are considered to have a strong ability to produce new video website addresses, so they are added to the search keyword database; otherwise, they are considered to have a weak ability to produce new video website addresses, so they are not added. The candidate URL database records the websites obtained from earlier search results.
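The two checks above can be sketched as a single decision function. This is a minimal illustration, not the patented implementation: the function name, the `is_video_site` callback and both parameter names are assumptions, and the 75% defaults simply reuse the example value from the text.

```python
def should_add_related_keywords(result_urls, is_video_site, known_urls,
                                min_video_ratio=0.75, max_known_ratio=0.75):
    """Decide whether the related keywords of one results page are worth keeping.

    result_urls   -- URLs extracted from the current results page
    is_video_site -- hypothetical callback: URL -> bool (video service site?)
    known_urls    -- set of URLs already in the candidate URL database
    """
    if not result_urls:
        return False
    total = len(result_urls)
    # Check (1): theme relevance, i.e. share of video service URLs in the results.
    video_ratio = sum(1 for url in result_urls if is_video_site(url)) / total
    if video_ratio < min_video_ratio:
        return False
    # Check (2): novelty, i.e. share of results already in the candidate database.
    known_ratio = sum(1 for url in result_urls if url in known_urls) / total
    return known_ratio < max_known_ratio
```

A results page passes only when it is both on theme and mostly new, which matches the order in which the text applies the two tests.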
Step 14: judge whether the web page corresponding to a website address in the search results is a video service page; if so, extract the website name and add the website name and URL to the video service website database; if not, discard the website address;
Specifically, in Step 14, whether the web page corresponding to a website address in the search results is a video service page may be judged in either of the following ways:
(1) Load the web page corresponding to the website address and run the scripts on the page, then judge whether a characteristic HTML (HyperText Markup Language) tag that generates a player is present; if so, determine the candidate players in the page. Then analyze the visual features of each candidate player object, such as its size and coordinates, to determine whether the size of the video picture played by the player meets predetermined size thresholds (for example, whether the distances from the right boundary of the video picture to the right boundary of the page, from its upper boundary to the upper boundary of the page, and from its lower boundary to the lower boundary of the page meet certain thresholds); if so, determine that the web page is a video service page;
(2) Judge whether the web page corresponding to the website address is a video service page according to how well it matches the pages in the video web page templates.
Optionally, because the DOM tree structures of web pages of the same type are essentially the same, after a web page is determined to be a video service page it may also be saved as a video web page template and used as a basis for judging whether other web pages are video service pages.
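Way (1) can be approximated with a small parser, shown here as a rough sketch under strong assumptions. The patent loads the page and runs its scripts before inspecting the rendered player; the code below only inspects static HTML and reads the `width` and `height` attributes. The set of player tags and the minimum size are illustrative guesses, not values from the source.

```python
from html.parser import HTMLParser

# Assumed set of tags that typically embed a player; not taken from the patent.
PLAYER_TAGS = {"video", "embed", "object", "iframe"}


class PlayerFinder(HTMLParser):
    """Collects (tag, width, height) for candidate player elements."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag not in PLAYER_TAGS:
            return
        attr_map = dict(attrs)
        try:
            width = int(attr_map.get("width", ""))
            height = int(attr_map.get("height", ""))
        except (TypeError, ValueError):
            return  # missing or non-pixel size such as "100%": skip this element
        self.candidates.append((tag, width, height))


def looks_like_video_page(html, min_width=320, min_height=240):
    """True if any candidate player meets the (assumed) size thresholds."""
    finder = PlayerFinder()
    finder.feed(html)
    return any(w >= min_width and h >= min_height
               for _, w, h in finder.candidates)
```

The size test is what separates a main player from small embedded objects such as banner advertisements. A faithful implementation would measure the rendered element in a scripted browser instead of trusting attributes.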
Note that Steps 13 and 14 may be executed in either order, or both at the same time.
Step 15: judge the health degree of the website addresses in the video service website database, and store websites whose health degree is lower than the first health degree threshold in the bad video website database;
Specifically, in Step 15, judging the health degree of the website addresses in the video service website database may include, but is not limited to, the following:
First, for a website to be assessed, select a predetermined number of web pages according to depth within the website, and build the corresponding web page ontology graph for each page;
The web page ontology graph may contain the terms that occur in the page, together with the weighted directed edges between different terms and the number of such edges, which are obtained from the adjacency relations between terms in the page and the frequency with which the terms occur;
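One possible reading of this graph structure is sketched below. It assumes the page text has already been split into a list of terms, and it takes the weight of a directed edge to be the number of times one term is immediately followed by another. Both choices, and the function name, are assumptions for illustration; the source does not fix the tokenizer or the exact weighting.

```python
from collections import Counter


def build_term_graph(terms):
    """Build a simple term graph from an ordered list of page terms.

    Returns (nodes, edges):
      nodes -- Counter mapping each term to its frequency in the page
      edges -- Counter mapping (term_a, term_b) to how often term_b
               directly follows term_a (self-loops are ignored)
    """
    nodes = Counter(terms)
    edges = Counter((a, b) for a, b in zip(terms, terms[1:]) if a != b)
    return nodes, edges
```

For the term sequence `["free", "video", "online", "free", "video"]`, the edge from "free" to "video" gets weight 2, while the reverse edge does not exist.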
Next, calculate the similarity between each web page ontology graph and the ontology graph of the page at the corresponding depth in a website of the predetermined bad website templates (for example, for a bad template website A, calculate the mean of the similarities between the different pages of the website to be assessed and those of A, which gives the similarity of the website to be assessed to A), and determine, from that similarity and the score of the website in the bad website template, the health score of the website to be assessed relative to that template website. That is, once the similarity to bad template website A has been obtained, the health score of the website to be assessed relative to A can be derived from the similarity and A's score in the bad website templates using a predetermined algorithm;
Finally, calculate the health value of the website to be assessed from its health scores relative to the websites in all bad website templates. Specifically, a health score of the website to be assessed can be calculated against each of the other template websites in the bad website templates, and the mean of these health scores is then taken as the health value of the website to be assessed.
Optionally, while judging the health degree of the website addresses in the video service website database, the technical solution provided by the embodiments of the invention may also include:
If the health value of the website to be assessed is lower than the set second health degree threshold, the website to be assessed is added to the bad website templates, so as to update the bad website template library that contains them. Further, the first health degree threshold may be greater than the second health degree threshold. Specifically, if the health value of the website to be assessed is lower than the threshold set for bad website templates (the second health degree threshold), for example less than 1, the website may be added to the bad website templates to update the bad website template library; at the same time, the library may be checked for obsolete bad website templates, which are deleted if found; in addition, the website also needs to be added to the bad video website database. If the health value of the website to be assessed is less than the first health degree threshold and greater than the second health degree threshold, for example less than 3 and greater than 1, the website is added only to the bad video website database and not to the bad website template library. If the health value of the website to be assessed is greater than the first health degree threshold, the website to be assessed is discarded.
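The averaging step and the three-way decision can be illustrated as follows. The sketch takes the per-template health scores as given, because the source leaves the scoring algorithm unspecified. The thresholds 3 and 1 reuse the example values from the text; the function names, the outcome labels and the handling of values exactly equal to a threshold are assumptions.

```python
from statistics import mean


def health_value(scores_per_template):
    """Health value = mean of the health scores against each template site."""
    return mean(scores_per_template)


def classify_site(value, first_threshold=3.0, second_threshold=1.0):
    """Map a health value to one of three outcomes.

    "bad_and_template" -- add to the bad video database and the template library
    "bad"              -- add to the bad video database only
    "discard"          -- not a bad video website
    """
    if value < second_threshold:
        return "bad_and_template"
    if value < first_threshold:
        return "bad"
    return "discard"
```

Keeping the second threshold below the first means that only the clearest cases become new templates, which limits the risk of the template library drifting.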
In the processing above, the search keywords in the search keyword database may also be updated in at least one of the following ways:
Mode one: update the search keyword database according to the content of the keyword and description meta tags in those web pages of the search results that are related to the video service theme; that is, after the website evaluation of Step 14 has identified a web page related to the video service theme, parse search keywords from the content of the page's "<meta name="keywords">" and "<meta name="description">" tags and add the parsed keywords to the search keyword database;
Mode two: update the search keyword database according to the links to other websites in those web pages of the search results that are related to the video service theme; for example, follow the URL in a link to another website, obtain the keyword and description tags of the page at that URL, and use them to update the keywords.
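Mode one amounts to reading two meta tags from a page. The following sketch uses Python's standard `html.parser`; the class and function names are illustrative, and splitting the keyword list on both ASCII and full-width commas is an assumption about how such pages are commonly written.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the contents of the keywords and description meta tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.description = ""

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = (attr_map.get("name") or "").lower()
        content = attr_map.get("content") or ""
        if name == "keywords":
            # Accept both "," and the full-width "，" as separators.
            parts = content.replace("，", ",").split(",")
            self.keywords += [p.strip() for p in parts if p.strip()]
        elif name == "description":
            self.description = content.strip()


def extract_meta(html):
    """Return (keywords, description) parsed from an HTML document."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords, parser.description
```

For mode two, the same function could be applied to the pages reached through outbound links, once those pages have been fetched.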
In the embodiments of the invention, before Step 15 is executed to judge the health degree of the website addresses in the video service website database, a processing step may also be included that merges and reduces the non-home-page addresses in the video service website database to the address of the home page of the video service website. This processing step may specifically include:
For two different website addresses in the video service website database, judge whether their host names are identical; if so, judge whether their corresponding website names are identical; if so, compare their path depths and reduce the address with the greater path depth to the address with the smaller path depth; and so on, until all website addresses in the video service website database have been processed. For example, for two different website addresses U1 and U2, first judge whether their host names are identical; if they differ, the addresses cannot be merged. If the host names are identical, further judge whether their corresponding website names are identical; if they differ, the addresses cannot be merged. If the website names are identical, compare the path depths of the two addresses: if the path depth of U1 is smaller than that of U2, U2 is considered part of the website corresponding to U1 and can be reduced to U1; and vice versa.
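The merging rule can be sketched by grouping addresses on the pair (host name, website name) and keeping the shallowest path in each group, which gives the same result as the pairwise comparison described above. The function names and the definition of path depth as the number of non-empty path segments are assumptions made for this example.

```python
from urllib.parse import urlsplit


def path_depth(url):
    """Number of non-empty segments in the URL path."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def merge_site_urls(entries):
    """Reduce (url, site_name) entries to one URL per (host name, site name).

    Within each group the URL with the smallest path depth wins, mirroring
    the rule that the deeper address is reduced to the shallower one.
    """
    merged = {}
    for url, site_name in entries:
        key = (urlsplit(url).hostname, site_name)
        if key not in merged or path_depth(url) < path_depth(merged[key]):
            merged[key] = url
    return sorted(merged.values())
```

Two addresses on the same host but with different website names stay separate, so unrelated sites that share a hosting domain are not collapsed into one.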
Optionally, in the processing above, obtaining the name of a website may involve a processing step that extracts the website name, which may specifically be done as follows:
Extract the content of the title tags (i.e., the "<title>" tags) of different web pages under the same website, and use a longest common substring algorithm to extract the content that occurs most frequently in those title tags as the name of the website, thereby extracting the website name.
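A possible realization of this step is sketched below. It computes the longest common substring for every pair of page titles and returns the substring that occurs most often. Pairwise comparison, the use of `difflib.SequenceMatcher`, and the stripping of separator characters such as "-" and "|" are choices made for this sketch, not details given in the source.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def longest_common_substring(a, b):
    """Longest contiguous substring shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def extract_site_name(titles):
    """Most frequent pairwise longest common substring among page titles."""
    counts = Counter()
    for first, second in combinations(titles, 2):
        common = longest_common_substring(first, second).strip(" -_|")
        if common:
            counts[common] += 1
    return counts.most_common(1)[0][0] if counts else ""
```

With titles such as "Movie One - Example TV" and "News - Example TV", the shared suffix survives as the website name while the page-specific parts drop out.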
For ease of understanding, the implementation of the invention is described in detail below with reference to a concrete application.
In a concrete application, an embodiment of the invention may, as shown in Fig. 2A, include an agent for each search engine together with result extraction, keyword evaluation, URL analysis, URL merging and website health degree evaluation modules. The modules share the same database; each module may be deployed on a separate machine, or all may be deployed on the same machine. The processing architecture shown in Fig. 1 can support an arbitrary "N+1" pattern, where N denotes any number of hosts (each containing the processing modules above) and 1 denotes the shared database. Any number of hosts can thus run the same set of processing modules and exchange data through the shared database. Such an architecture can effectively improve the overall performance of the system, that is, the performance of searching for bad websites.
In Fig. 1, the shared database may include: the search keyword database, the candidate URL database, the video website template database, the video website database, the intermediate temporary database and the bad video website database, as well as the bad video website template database, the video URL template database and the video URL database.
In Fig. 2A, to search for and discover video service websites as fully as possible, the four major search engines Baidu, Google, Bing and Yahoo are used together. Each search engine has its own search engine agent module, while the result extraction, keyword evaluation, URL analysis, URL merging and website health degree evaluation modules can be shared.
As shown in Fig. 2B, the overall implementation architecture of the embodiments of the invention may include a search agent module M1, a search result processing module M2, a keyword evaluation module M3, a URL analysis module M4, a URL merging module M5 and a website health degree evaluation module M6. The function of each module is described below:
Search agent module M1: automatically generates search requests for a search engine from the search keyword database, and obtains the search results returned for each request;
Search result processing module M2: performs search result extraction. Specifically, it parses the search results obtained by the search agent module M1, then locates and extracts the website addresses (i.e., URLs) in the search results and the related search keywords C at the bottom of the returned results page (i.e., the related search keywords);
Keyword evaluation module M3: judges how closely the related search keywords C at the bottom of the current results page match the theme of video service websites (a theme being, for example, food or football) and how capable they are of producing new video website addresses. If the theme relevance of the related search keywords C is low, or their ability to produce new video website addresses is weak, they are not added to the search keyword database;
URL analysis module M4: uses automatic video URL identification knowledge and the similarity between a URL and the video website templates to judge whether the current web page belongs to the video service class. If so, it extracts the website name, adds the website name and URL L to the video service website set D3 (i.e., the video URL database), and uses the content of specific HTML tags in the page to extend the search keyword database D1. If not, it discards the URL and modifies the type of the corresponding URL in the candidate URL database D2, for the keyword evaluation module M3 to consult during keyword evaluation. Modifying the type means marking such websites as non-video websites. During keyword evaluation, the module can then determine whether the results obtained for a keyword contain websites of this type, and how many. If the number of such URLs exceeds a certain threshold, the keyword's expansion ability is considered weak and the keyword can be ignored; if the results contain none or only a few of them, below the threshold, the keyword's expansion ability is considered strong and the keyword can be kept.
URL merging module M5: merges and reduces the non-home-page addresses in the video URL set D3 (i.e., the video URL database) to the address of the home page of the video service website, so as to find the home pages of the video service websites and obtain the video website set D4, i.e., the video website database.
Website health degree evaluation module M6: evaluates the health degree of the video websites in the video website set D4 obtained by the URL merging module M5, so as to identify the bad video websites among them and determine the bad video website set D5, i.e., the bad video website database.
In order better to understand the implication of the general frame synoptic diagram that Fig. 2 presents, the below will be described in detail the function of each processing module in the Organization Chart.
(1) Search agent module M1
As shown in Fig. 3, the processing flow of the search agent module may comprise:
Step 31: the search agent module judges whether the search keyword database still contains an unused search keyword; if so, it takes out that keyword and marks it as used;
Step 32: the search agent module generates a search request for the search engine from the keyword taken out;
Step 33: the returned search results, i.e., the corresponding search result pages, are obtained and stored in the intermediate temporary database;
Step 34: it is judged whether the last page of the search results has been read; if not, the content of the next page is read; if so, the flow returns to step 31, i.e., to judging whether the database still contains an unused keyword.
Specifically, the search agent module may judge whether the last page has been read according to whether the content of the returned search result page changes.
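Steps 31 to 34 can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function name run_search_agent, the fetch_page callback (standing in for the search engine request) and the max_pages safety limit are all assumptions introduced here.

```python
from typing import Callable, Dict, List


def run_search_agent(keywords: Dict[str, bool],
                     fetch_page: Callable[[str, int], str],
                     max_pages: int = 50) -> Dict[str, List[str]]:
    """keywords maps each search keyword to its 'used' flag.

    Returns a dict standing in for the intermediate temporary database:
    keyword -> list of result-page contents.
    """
    temp_db: Dict[str, List[str]] = {}
    while True:
        # Step 31: look for an unused keyword and mark it as used.
        unused = [kw for kw, used in keywords.items() if not used]
        if not unused:
            return temp_db
        keyword = unused[0]
        keywords[keyword] = True

        pages: List[str] = []
        previous = None
        for page_no in range(1, max_pages + 1):
            # Steps 32-33: issue the request and store the returned page.
            content = fetch_page(keyword, page_no)
            # Step 34: an unchanged page means the last page was already read.
            if content == previous:
                break
            pages.append(content)
            previous = content
        temp_db[keyword] = pages
```

Passing the page fetcher in as a callback keeps the loop independent of any particular search engine, which is convenient for trying the flow against canned pages.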
(2) Search result processing module M2
As shown in Fig. 4, the processing flow of the search result processing module may comprise:
Step 41: a used search keyword is obtained from the search keyword database, and the obtained keyword is marked;
Step 42: all search result pages found with this search keyword are located among the returned search result pages in the intermediate temporary database;
Step 43: the located search result pages are read out and then deleted from the intermediate temporary database;
Step 44: the related search keywords shown at the bottom of each returned search result page and the website addresses contained in the page are extracted, and the website addresses are put into the candidate URL database, which later steps of the bad video website search use as the basis for judging whether a URL is newly discovered;
For the first predetermined number of search result pages (e.g., the first 20 pages), the related search keywords extracted from the bottom of the page and the website addresses are additionally put into the search keyword table of the intermediate temporary database. This table records the related search keywords to be evaluated (search keywords to be evaluated, for short) and the websites associated with them.
The above process is repeated until the marks show that all used search keywords in the search keyword database have been processed.
(3) Keyword evaluation module M3
As shown in Fig. 5, the processing flow of the keyword evaluation module may comprise:
Step 51: it is judged whether the search keyword table of the intermediate temporary database contains a search keyword to be evaluated; if not, the program exits; if so, step 52 is executed;
Step 52: all website records associated with this search keyword are taken out, and they are deleted from the intermediate temporary database as they are taken out;
Step 53: the website evaluation module is called to analyse and evaluate these website addresses;
Step 54: it is judged whether the proportion of non-video-service websites exceeds 75%; if so, the flow returns to step 51; otherwise step 55 is executed;
Step 55: the candidate URL database, which stores the websites obtained from all returned search result pages, is used to judge whether these websites are newly discovered;
Step 56: it is judged whether the proportion of URLs that are not newly discovered exceeds 75%; if so, the flow returns to step 51; otherwise step 57 is executed;
Step 57: the evaluated search keyword is put into the search keyword database, thereby updating the search keywords in that database.
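The two 75% tests of steps 54 and 56 can be illustrated with the short sketch below. The names keep_keyword, is_video_site and known_urls are hypothetical; known_urls is assumed to hold the URLs that were already in the candidate URL database before this keyword was searched, and rejecting a keyword with no associated websites is likewise an assumption.

```python
from typing import Callable, Iterable, Set


def keep_keyword(urls: Iterable[str],
                 is_video_site: Callable[[str], bool],
                 known_urls: Set[str],
                 max_ratio: float = 0.75) -> bool:
    """Return True if the keyword should enter the search keyword database."""
    url_list = list(urls)
    if not url_list:
        # Assumption: a keyword with no associated websites is not kept.
        return False

    # Step 54: too many non-video-service websites -> weak topical relevance.
    non_video = sum(1 for u in url_list if not is_video_site(u))
    if non_video / len(url_list) > max_ratio:
        return False

    # Step 56: too many already-known URLs -> little ability to find new sites.
    not_new = sum(1 for u in url_list if u in known_urls)
    if not_new / len(url_list) > max_ratio:
        return False

    # Step 57: the keyword passes both tests.
    return True
```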
(4) URL analysis module M4
As shown in Fig. 6, the processing flow of the URL analysis module may comprise:
Step 61: it is judged whether the candidate URL database contains a URL to be evaluated; if not, the program exits; otherwise step 62 is executed;
Step 62: the video URL template database is read and the similarity between the URL to be evaluated and the video URL templates in it is judged to obtain a preliminary recognition result. If the result shows that the URL is not a video service website, the flow returns to step 61; otherwise the preliminary result is further checked with the automatic video-URL recognition knowledge;
The automatic recognition knowledge mainly comprises: loading the web page and running its scripts, then judging whether a characteristic HTML tag of a player has been generated; and analysing the visual features of the candidate player object in the page, such as size and coordinates, to determine whether the width and height of the playback picture satisfy certain thresholds, and whether the distances from its right boundary to the right boundary of the page, from its top boundary to the top boundary of the page, and from its bottom boundary to the bottom boundary of the page satisfy certain thresholds;
If, after step 62, the URL to be evaluated is determined not to be a video service website address, the flow returns to step 61; otherwise step 63 is executed;
Step 63: the website name is extracted, the search keyword database is updated with the content of the <meta name="keywords"> and <meta name="description"> tags in the page, and the website name and the corresponding URL are put into the video URL database. The page may also be saved as a video URL template (i.e., stored in the video URL template database), so that subsequent pages can be judged to be video service pages according to their similarity to the video URL templates.
Specifically, the website name may be extracted as follows: the content of the <title> tags of different pages under the same website is extracted, the longest common substring algorithm is then used to extract the content that occurs most frequently in the <title> tags, and this string is used as the website name.
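One possible reading of the website name extraction is sketched below: the longest common substring is computed for every pair of titles, and the substring that occurs most often is taken as the name. The pairwise counting and the trimming of separator characters are assumptions made for illustration; the description does not fix these details.

```python
from collections import Counter
from itertools import combinations
from typing import List


def longest_common_substring(a: str, b: str) -> str:
    """Classic dynamic-programming longest common substring."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def site_name(titles: List[str]) -> str:
    """Pick the most frequent pairwise common substring of the <title> texts."""
    counts: Counter = Counter()
    for first, second in combinations(titles, 2):
        # Assumption: separators such as " - " or " | " are not part of the name.
        common = longest_common_substring(first, second).strip(" -|_")
        if common:
            counts[common] += 1
    return counts.most_common(1)[0][0] if counts else ""
```

For example, site_name(["Home - ExampleTube", "Funny cats - ExampleTube", "News | ExampleTube"]) yields "ExampleTube", where ExampleTube is a made-up site name.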
(5) URL merging module M5
As shown in Fig. 7, the processing flow of the URL merging module may comprise:
Step 71: it is judged whether the video URL database contains a URL to be merged; if not, the program exits; otherwise step 72 is executed;
Step 72: a URL U to be merged is taken out, and its website name and host name are obtained;
Step 73: the set of all URLs in the video URL database that have the same host name and website name as U is found;
Step 74: this URL set is processed in a loop;
Specifically, a URL is first taken from the set. If its path depth is less than that of U, U is deleted from the database and the loop ends (i.e., step 71 is executed again); otherwise the record of this URL is deleted and the loop continues (i.e., another URL is taken from the set and the same path-depth comparison is made). After the loop, U remains in the database if its path depth is the smallest; otherwise a URL with a smaller path depth than U remains. The algorithm therefore always yields the expected homepage address of the video service website, and non-homepage addresses are deleted during merging.
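The effect of steps 71 to 74 can be reproduced with a single pass that keeps, for each (host name, website name) pair, the URL with the smallest path depth. The sketch below is an equivalent formulation chosen for brevity, not the pairwise deletion loop described above; path_depth and merge_to_homepages are hypothetical names, and path depth is assumed to be the number of non-empty path segments.

```python
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import urlsplit


def path_depth(url: str) -> int:
    """Number of non-empty path segments, e.g. 0 for 'http://host/'."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def merge_to_homepages(records: Iterable[Tuple[str, str]]) -> List[str]:
    """records are (url, site_name) pairs; returns one URL per website."""
    best: Dict[Tuple[Optional[str], str], str] = {}
    for url, name in records:
        key = (urlsplit(url).hostname, name)
        # Keep the shallowest URL seen so far for this host and site name.
        if key not in best or path_depth(url) < path_depth(best[key]):
            best[key] = url
    return list(best.values())
```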
(6) Website health degree evaluation module M6
As shown in Fig. 8, the processing flow of the website health degree evaluation may comprise:
Step 81: several bad website templates are set manually (forming the bad video website template database), web pages are extracted from them, and a web page ontology graph is built for each page of each website in the bad website templates;
Step 82: for a website to be evaluated, a number of its web pages are taken according to depth within the site, and a web page ontology graph is built for each of them;
Step 83: the similarity between each web page ontology graph of the website to be evaluated and the web page ontology graph of the page at the corresponding depth in a website of the bad website templates is calculated;
Step 84: the mean of the similarities is calculated to obtain the health degree of the website to be evaluated;
For example, for template website A, the mean similarity between the different pages of the website to be evaluated and A is calculated, giving the similarity of the website to be evaluated with respect to A; from this similarity and the score of template website A, the health score of the website to be evaluated with respect to A is then derived by the corresponding algorithm. Health scores are then calculated in the same way against all the other template websites in the bad website templates, and their mean is taken as the health value of the website to be evaluated;
Step 85: it is judged whether the health degree of the website under test (i.e., the website to be evaluated) is below the set thresholds. If the health value is below the bad website template threshold (i.e., the second health degree threshold), e.g., less than 1, the website under test is added to the bad website template library (i.e., the bad video website template database) in order to update it, the library is also checked for obsolete bad website templates, which are deleted, and the website under test is also added to the bad video website database. If the health value is below the health degree threshold (i.e., the first health degree threshold), e.g., less than 3, the website is added directly to the bad video website database. If the health value is greater than the first health degree threshold, the website under test is discarded.
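The threshold logic of step 85 amounts to a three-way decision, sketched below with the example thresholds 1 and 3 from the text. The function name classify_site and the handling of a health value exactly equal to a threshold (treated here as not below it) are assumptions, since the description does not cover the boundary case.

```python
from typing import Tuple


def classify_site(health: float,
                  template_threshold: float = 1.0,
                  bad_threshold: float = 3.0) -> Tuple[bool, bool]:
    """Return (add_to_bad_video_db, add_to_template_library)."""
    if health < template_threshold:
        # Below the second threshold: also becomes a new bad website template.
        return True, True
    if health < bad_threshold:
        # Below the first threshold only: recorded as a bad video website.
        return True, False
    # Otherwise the website under test is discarded.
    return False, False
```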
In the above process, the web page ontology graph is built as shown in Fig. 9, which may comprise:
Step 91: the web page is extracted to obtain its text;
Step 92: the terms in the page that belong to the bad video keyword domain are extracted to obtain a term list;
Step 93: term vectors are constructed from the weighted directed edges between the keywords in the known bad video keyword domain ontology graph, and the web page ontology graph is drawn from the mapping table. The domain ontology graph expresses the closeness between two different terms. It can be obtained by known algorithms: from the semantic, categorical and structural closeness between terms, a value expressing the degree of closeness from one term to another is obtained, and the domain ontology graph is built from these values.
Suppose the known domain ontology graph for keywords (i.e., terms) A, B, C, D and E is as shown in Table 1:
Table 1
      A      B      C      D      E
A     1      0.1    0.2    0.3    0.4
B     0.1    1      0.2    0.3    0.4
C     0.1    0.2    1      0.3    0.4
D     0.1    0.2    0.3    1      0.4
E     0.1    0.2    0.3    0.4    1
Based on Table 1, building the web page ontology graph comprises the following steps:
Step 1: obtain the keyword content of the page text. Suppose d1 is the page under test and d2 is the template page, with:
d1: A – A – B – D (text length = 4), d2: D – D – D – E (text length = 4);
Step 2: obtain the term weights, computed as term weight w = number of occurrences of the term / length of the term list. This gives Td1 = {(A, 0.5), (B, 0.25), (D, 0.25)} (3 terms) and Td2 = {(D, 0.75), (E, 0.25)} (2 terms), where the number after each keyword in parentheses is its term weight;
Step 3: create the weighted term vectors of the web page ontology graph. The procedure may comprise:
Take A in d1 as an example. From step 2 the term weight of A in d1 is 0.5, and from Table 1 its value towards B is 0.1, so the weighted term vector A → B has the value 0.5*0.1 = 0.05; all other values are obtained by analogy;
For example, centred on term node A, the values of all weighted vectors from A to the other nodes and from the other nodes to A are obtained. For the value 0.05 of A → B in Table 2 (row A, column B): Table 1 gives the directed edge A → B as 0.1, and step 2 gives the frequency of A in d1 as 0.5, so the weighted term vector A → B is 0.5*0.1 = 0.05; similarly, B → A is 0.05. Next, centred on B, the values of all weighted vectors from B to the other nodes and from the other nodes to B are obtained; here the value of A → B is the frequency of B from step 2 multiplied by the value of A → B in Table 1, i.e., 0.1*0.25 = 0.025. The weighted vector values centred on the remaining nodes are then computed in the same way, and the table is finally filled in with the maximum value for each position; for example, the maximum for A → B is the value 0.05 obtained with A as centre, so 0.05 is entered at that position;
Step 4: build the web page ontology graph vector list. Where the same vector has several values, the largest is taken. Tables 2 and 3 below are the resulting lists for the web page ontology graphs OG_d1 and OG_d2 respectively:
Table 2
      A      B      C      D      E
A     0.5    0.05   0.1    0.15   0.2
B     0.05   0.25   0.05   0.075  0.1
C     0.05   0.05   -      0.075  -
D     0.05   0.05   0.075  0.25   0.1
E     0.05   0.05   -      0.1    -
Table 3
      A      B      C      D      E
A     -      -      -      0.225  0.1
B     -      -      -      0.225  0.1
C     -      -      -      0.225  0.1
D     0.075  0.15   0.225  0.75   0.3
E     0.025  0.05   0.075  0.3    0.25
The web page ontology graphs of d1 and d2 can be drawn from Tables 2 and 3, as shown in Fig. 10 and Fig. 11 respectively.
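The four steps above can be checked with a small sketch that recomputes Tables 2 and 3 from Table 1. The representation of a graph as a dictionary keyed by (source, target) pairs and the names term_weights and page_ontology_graph are choices made here for illustration; edges that touch no term of the page are simply absent, corresponding to the "-" entries.

```python
from collections import Counter
from typing import Dict, List, Tuple

TERMS = "ABCDE"
# Table 1: closeness from the row term to the column term.
TABLE_1 = [
    [1, 0.1, 0.2, 0.3, 0.4],
    [0.1, 1, 0.2, 0.3, 0.4],
    [0.1, 0.2, 1, 0.3, 0.4],
    [0.1, 0.2, 0.3, 1, 0.4],
    [0.1, 0.2, 0.3, 0.4, 1],
]
DOMAIN = {(s, t): TABLE_1[i][j]
          for i, s in enumerate(TERMS) for j, t in enumerate(TERMS)}


def term_weights(tokens: List[str]) -> Dict[str, float]:
    """Step 2: occurrences of each term divided by the term-list length."""
    counts = Counter(tokens)
    return {term: n / len(tokens) for term, n in counts.items()}


def page_ontology_graph(tokens: List[str]) -> Dict[Tuple[str, str], float]:
    """Steps 3-4: weight every edge touching a page term, keep the maximum."""
    graph: Dict[Tuple[str, str], float] = {}
    for centre, weight in term_weights(tokens).items():
        for other in TERMS:
            # Edges leaving the centre term and edges pointing at it.
            for edge in ((centre, other), (other, centre)):
                value = round(weight * DOMAIN[edge], 6)
                graph[edge] = max(graph.get(edge, 0.0), value)
    return graph


og_d1 = page_ontology_graph(list("AABD"))   # page under test d1
og_d2 = page_ontology_graph(list("DDDE"))   # template page d2
```

With these inputs, og_d1 contains 21 edges and og_d2 contains 16, matching the filled cells of Tables 2 and 3.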
In the above process, the calculation in steps 83 and 84 of the similarities and their mean, from which the health degree of the website to be evaluated is derived, may comprise:
First, the similarity between the page under test and the template page is judged, which may comprise:
The similarity between the page under test and the template page may be judged with the following formula:
[Formula image BDA00002420272900141 is not reproduced in this text]
Here OG_m = OG_d1 ∩ OG_d2 consists of the edges present in both graphs, each taking the smaller of its two values, and its weight is the sum of all these values. The resulting OG_m is listed below:
Table 4
      A      B      C      D      E
A     -      -      -      0.15   0.1
B     -      -      -      0.075  0.1
C     -      -      -      0.075  -
D     0.05   0.05   0.075  0.25   0.1
E     0.025  0.05   -      0.1    -
Hence OG_m = 0.15+0.1+0.075+0.1+0.075+0.05+0.05+0.075+0.25+0.1+0.025+0.05+0.1 = 1.2, so that:
sim(OG_d1, OG_d2) = 1.2/2.275 = 0.527;
Suppose five pages are compared and the other four similarity values are 0.489, 0.501, 0.496 and 0.515. The mean similarity is then mean = (0.527+0.489+0.501+0.496+0.515)/5 = 0.5056, so the similarity of the website under test with respect to this template website is 0.5056.
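The intersection and the averaging in this worked example can be reproduced as sketched below. Because the similarity formula itself is an image that is not available here, the sketch does not derive the denominator; the value 2.275 is simply taken from the example above. The helper names parse_table and intersect are hypothetical.

```python
from typing import Dict, List, Tuple

TERMS = "ABCDE"
Graph = Dict[Tuple[str, str], float]


def parse_table(rows: List[str]) -> Graph:
    """Turn the rows of Table 2 or Table 3 into an edge dictionary."""
    return {(s, t): float(v)
            for s, row in zip(TERMS, rows)
            for t, v in zip(TERMS, row.split()) if v != "-"}


def intersect(g1: Graph, g2: Graph) -> Graph:
    """OG_m: edges present in both graphs, each with the smaller value."""
    return {edge: min(w, g2[edge]) for edge, w in g1.items() if edge in g2}


OG_D1 = parse_table(["0.5 0.05 0.1 0.15 0.2", "0.05 0.25 0.05 0.075 0.1",
                     "0.05 0.05 - 0.075 -", "0.05 0.05 0.075 0.25 0.1",
                     "0.05 0.05 - 0.1 -"])
OG_D2 = parse_table(["- - - 0.225 0.1", "- - - 0.225 0.1", "- - - 0.225 0.1",
                     "0.075 0.15 0.225 0.75 0.3", "0.025 0.05 0.075 0.3 0.25"])

og_m = intersect(OG_D1, OG_D2)
og_m_weight = sum(og_m.values())          # sum of the Table 4 entries

# Denominator copied from the worked example, not derived here.
sim = og_m_weight / 2.275

page_sims = [0.527, 0.489, 0.501, 0.496, 0.515]
mean_sim = sum(page_sims) / len(page_sims)
```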
Afterwards, the health degree of the website to be evaluated is determined from the mean similarity;
Specifically, the website health degree takes the values 0, 1, 2, 3, 4 and 5, and the health value may be computed with the following formula:
Here the health degree of a template website is 0. If there is only one template website, the above formula gives a health degree of 2 for the website to be evaluated, i.e., its health value is 2.
With the technical solution provided by the above embodiment of the invention, bad video websites can be searched for and found quickly and accurately, so that they can be effectively controlled and managed.
The above is only a preferred embodiment of the invention, and the scope of protection of the invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.

Claims (10)

1. A method for searching for bad video websites, characterized by comprising:
constructing a search request from a search keyword in a search keyword database;
obtaining the search results returned by a search engine for the search request, and obtaining the website addresses and associated search keywords in the search results;
updating the search keywords in the search keyword database according to the degree of correlation between the associated search keywords in the current search results and the theme of video service websites, and according to their ability to produce new bad website addresses;
judging whether the web page corresponding to a website address in the search results is a video service web page; if so, extracting the website name and adding the website name and address to a video service website database; if not, discarding the website address;
judging the health degree of the website addresses in the video service website database, and storing the websites whose health degree is below a first health degree threshold in a bad video website database.
2. The method according to claim 1, characterized in that the method further comprises:
updating the search keyword database according to the content of the keyword and description meta tags of web pages in the search results that are relevant to the video service theme;
and/or,
updating the search keyword database according to links to other websites in web pages in the search results that are relevant to the video service theme.
3. The method according to claim 1, characterized in that the step of judging whether the web page corresponding to a website address in the search results is a video service web page comprises:
loading the web page corresponding to the website address and running the scripts on the page, and judging whether a characteristic Hypertext Markup Language (HTML) tag of a generated player exists; if so, determining the candidate player in the page; then analysing the visual features of the candidate player object to determine whether the size of the video picture played by the player satisfies a predetermined size threshold; and if so, determining that the web page is a video service web page;
or,
judging whether the web page corresponding to the website address is a video service web page according to its degree of matching with the pages in video web page templates.
4. The method according to claim 3, characterized in that the method further comprises:
after the web page corresponding to the website address is determined to be a video service web page, saving the page as a video web page template, the video web page template serving as a basis for judging whether other web pages are video service web pages.
5. The method according to claim 1, characterized in that the step of updating the search keywords in the search keyword database comprises:
judging the degree of theme correlation between the associated search keywords in the search results and video service websites; if the proportion of video service website addresses in the currently returned search results exceeds a predetermined value, judging the ability of the current search keyword to produce new website addresses; and if the proportion of websites in the currently returned search results that are already contained in a predetermined candidate URL database is below a predetermined value, adding the associated search keywords in the search results to the search keyword database, the candidate URL database recording the websites obtained from previous search results.
6. The method according to claim 1, characterized in that, before judging the health degree of the website addresses in the video service website database, the method further comprises a step of merging and reducing the non-homepage addresses in the video service website database to the homepage addresses of the video service websites, the step comprising:
for two different website addresses in the video service website database, judging whether their host names are identical; if so, judging whether their website names are identical; if so, comparing their path depths and reducing the address with the larger path depth to the address with the smaller path depth; and so on, until all website addresses in the video service website database have been processed.
7. The method according to claim 6, characterized in that the manner of extracting the website name comprises:
extracting the content of the title tags of different pages under the same website, and using the longest common substring algorithm to extract the content that occurs most frequently in those title tags as the website name.
8. The method according to claim 1, characterized in that the step of judging the health degree of the website addresses in the video service website database comprises:
for a website to be evaluated, selecting a predetermined number of web pages according to depth within the site, and building a corresponding web page ontology graph for each page;
calculating the similarity between each web page ontology graph and the ontology graph of the page at the corresponding depth in a website of predetermined bad website templates, and determining, from the similarity and the score of the website in the bad website templates, the health score of the website to be evaluated with respect to that website;
calculating the health value of the website to be evaluated from its health scores with respect to the websites in all bad website templates.
9. The method according to claim 8, characterized in that the method further comprises:
if the health value of the website to be evaluated is below a set second health degree threshold, adding the website to be evaluated to the bad website templates, so as to update the bad website template library containing the bad website templates.
10. The method according to claim 9, characterized in that the first health degree threshold is greater than the second health degree threshold.
CN201210465213.2A 2012-11-16 2012-11-16 A kind of method searching for bad video website Expired - Fee Related CN103020123B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210465213.2A CN103020123B (en) 2012-11-16 2012-11-16 A kind of method searching for bad video website

Publications (2)

Publication Number Publication Date
CN103020123A true CN103020123A (en) 2013-04-03
CN103020123B CN103020123B (en) 2016-08-24

Family

ID=47968727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210465213.2A Expired - Fee Related CN103020123B (en) 2012-11-16 2012-11-16 A kind of method searching for bad video website

Country Status (1)

Country Link
CN (1) CN103020123B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101604324A (en) * 2009-07-15 2009-12-16 中国科学技术大学 A kind of searching method and system of the video service website based on unit search
CN102523130A (en) * 2011-12-06 2012-06-27 中国科学院计算机网络信息中心 Bad webpage detection method and device
CN102663093A (en) * 2012-04-10 2012-09-12 中国科学院计算机网络信息中心 Method and device for detecting bad website

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160824

Termination date: 20211116
