CN103020123B - A method for searching for bad video websites - Google Patents


Publication number
CN103020123B
Authority
CN
China
Prior art keywords
website
webpage
search
video
video service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210465213.2A
Other languages
Chinese (zh)
Other versions
CN103020123A (en
Inventor
朱明�
尹文科
孙永录
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC
Priority to CN201210465213.2A
Publication of CN103020123A
Application granted
Publication of CN103020123B
Legal status: Expired - Fee Related

Abstract

The invention discloses a method for searching for bad video websites, comprising: constructing a search request from search keywords, obtaining the returned search results, and extracting the website addresses and related search keywords from those results; updating the search keywords in a search keyword database according to how relevant the related search keywords are to the theme of video service websites and how capable they are of producing new bad-website addresses; judging whether the webpage corresponding to each website address in the search results is a video service webpage and, if so, adding the website name and URL to a video service website database, otherwise discarding the website address; and judging the health degree of the website addresses in the video service website database and storing websites whose health degree is below a first health degree threshold in a bad video website database. The invention thereby provides a technical scheme that finds bad video websites quickly and accurately, which makes it easier to supervise websites that provide bad video services effectively.

Description

A method for searching for bad video websites
Technical field
The present invention relates to the field of Internet information retrieval, and in particular to a method for searching for bad video websites.
Background
With the rapid development of Internet technology, people depend on the Internet more and more, and the volume of Internet content, especially multimedia content, is growing rapidly. At the same time, the bad video content within that multimedia content is also growing quickly.
At present, bad video service websites on the Internet mainly include: (1) video service websites that directly provide on-demand bad video content; such websites typically offer the content for on-demand viewing through catalog browsing organized by multiple classification criteria; (2) service websites that provide P2P shared downloading of bad video resources, mainly download sites for bad video resources referenced by BT seed (torrent) files; (3) service websites that provide real-time P2P live streaming of bad video.
These three kinds of bad video service websites are numerous, and they keep increasing and changing. An information search scheme is therefore needed that can automatically discover and retrieve websites containing bad video content from the massive amount of information on the Internet. However, existing Internet search engines, such as Google and Baidu, still cannot find websites that provide bad video services accurately and effectively.
Summary of the invention
The object of the present invention is to provide a method for searching for bad video websites, so that websites containing bad video content can be discovered and retrieved automatically, accurately, and effectively from the massive amount of information on the Internet.
The object of the present invention is achieved through the following technical solution:
A method for searching for bad video websites, comprising:
constructing a search request from the search keywords in a search keyword database;
obtaining the search results that a search engine returns for the search request, and obtaining the website addresses and related search keywords in the search results;
updating the search keywords in the search keyword database according to the degree of relevance between the related search keywords in the current search results and the theme of video service websites, and according to their ability to produce new bad-website addresses;
judging whether the webpage corresponding to a website address in the search results is a video service webpage; if so, extracting the website name and adding the website name and URL to a video service website database; if not, discarding the website address;
judging the health degree of the website addresses in the video service website database, and storing websites whose health degree is below a first health degree threshold in a bad video website database.
The method further comprises:
updating the search keyword database according to the keywords in the element tags, and the content of the description information, of the webpages in the search results that are relevant to the video service theme;
and/or
updating the search keyword database according to the links to other websites in the webpages in the search results that are relevant to the video service theme.
The step of judging whether the webpage corresponding to a website address in the search results is a video service webpage comprises:
loading the webpage corresponding to the website address and running the scripts on it; judging whether there is an HTML (HyperText Markup Language) tag characteristic of generating a player and, if there is, identifying the candidate players in the webpage; then analyzing the visual features of each candidate player object to determine whether the size of the video picture played by the player meets a predetermined size threshold and, if so, determining that the webpage is a video service webpage;
or
judging whether the webpage corresponding to the website address is a video service webpage according to the degree of match between that webpage and the webpages in the video webpage templates.
After the webpage corresponding to a website address is determined to be a video service webpage, the webpage is saved as a video webpage template, which serves as a basis for judging whether other webpages are video service webpages.
The step of updating the search keywords in the search keyword database comprises:
judging the topic relevance of the related search keywords in the search results to video service websites; if the proportion of video service website URLs in the currently returned search results exceeds a predetermined value, judging the ability of the current search keyword to produce new website addresses; and if the proportion of website addresses in the currently returned search results that are already contained in a predetermined candidate URL database is below a predetermined value, adding the related search keywords in the search results to the search keyword database, wherein the candidate URL database records the website addresses obtained from previous search results.
Before judging the health degree of the website addresses in the video service website database, the method further comprises a step of merging and reducing the non-homepage addresses in the video service website database to the addresses of video service website homepages, the step comprising:
for two different website addresses in the video service website database, judging whether their host names are the same; if so, judging whether their corresponding website names are the same; if so, comparing their path depths and reducing the address with the greater path depth to the address with the smaller path depth; and so on, until all website addresses in the video service website database have been processed.
The website name is extracted as follows:
extracting the content of the title tags of different webpages under the same website, and using a longest-common-string algorithm to extract the content that occurs most frequently in those title tags as the website name.
The step of judging the health degree of the website addresses in the video service website database comprises:
for a website to be evaluated, selecting a predetermined number of webpages according to website depth, and building a corresponding webpage ontology graph for each webpage;
calculating the similarity between each webpage ontology graph and the ontology graph of the webpage at the corresponding depth of a website in a predetermined bad-website template, and determining, from the similarity and the score of the website in the bad-website template, the health score of the website to be evaluated relative to that template website;
calculating the health value of the website to be evaluated from its health scores relative to the websites in all bad-website templates.
If the health value of the website to be evaluated is below a set second health degree threshold, the website is added to the bad-website templates, so as to update the bad-website template library that contains the bad-website templates.
The first health degree threshold is greater than the second health degree threshold.
As can be seen from the technical solution provided above, the embodiments of the present invention effectively solve the technical problem that existing network search engines cannot accurately and efficiently discover and search for video service websites. They thus provide a technical solution that can automatically, accurately, and effectively discover and retrieve websites containing bad video content from the massive amount of information on the Internet, which in turn makes it easier to control and manage bad video websites effectively.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the method provided by an embodiment of the present invention;
Fig. 2A is the first schematic diagram of the overall implementation architecture of an embodiment of the present invention;
Fig. 2B is the second schematic diagram of the overall implementation architecture of an embodiment of the present invention;
Fig. 3 is a flowchart of the search agent module in an embodiment of the present invention;
Fig. 4 is a flowchart of the search result processing module in an embodiment of the present invention;
Fig. 5 is a flowchart of the keyword evaluation module in an embodiment of the present invention;
Fig. 6 is a flowchart of the URL analysis module in an embodiment of the present invention;
Fig. 7 is a flowchart of the URL merging module in an embodiment of the present invention;
Fig. 8 is a flowchart of website health degree evaluation in an embodiment of the present invention;
Fig. 9 is a flowchart of webpage ontology graph construction in an embodiment of the present invention;
Fig. 10 is the webpage ontology graph OG_d1 in an embodiment of the present invention;
Fig. 11 is the webpage ontology graph OG_d2 in an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
The embodiments of the present invention are described in further detail below with reference to the drawings.
An embodiment of the present invention provides a method for searching for bad video websites. Its specific implementation, shown in Fig. 1, may include the following process:
Step 11: construct a search request from the search keywords in the search keyword database, which contains at least one search keyword;
Step 12: obtain the search results that the search engine returns for the search request, and obtain the website addresses and related search keywords in the search results;
Here, the related search keywords may be the related-search keywords listed at the bottom of the search results page, such as those shown under the results pages returned by Baidu or Google;
Step 13: update the search keywords in the search keyword database according to the degree of relevance between the related search keywords in the current search results and the theme of video service websites, and according to their ability to produce new bad-website addresses;
Specifically, updating the search keywords in the search keyword database may include the following process:
(1) Judge the topic relevance of the related search keywords in the search results to video service websites: if the proportion of video service website URLs in the currently returned search results is below a predetermined value (for example, 75%), the related search keywords at the bottom of the current results page are considered to be far from the search theme, so the related search keywords in these search results are not added to the search keyword database; otherwise, perform process (2);
(2) Judge the ability of the current search keyword to produce new website addresses: if the proportion of website addresses in the currently returned search results that are already contained in a predetermined candidate URL database is below a predetermined value (for example, 75%), the related search keywords at the bottom of this results page are considered to have a strong ability to produce new video website addresses, so the related search keywords in these search results are added to the search keyword database; otherwise, they are considered to have a poor ability to produce new video website addresses, so the related search keywords in these search results are not added to the search keyword database. Here, the candidate URL database records the website addresses obtained from previous search results.
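The two checks in processes (1) and (2) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the set-based lookups, and the parameter names are assumptions, and the two 0.75 defaults simply reuse the 75% example values given above.

```python
def should_add_related_keywords(result_urls, video_urls, candidate_urls,
                                topic_ratio_min=0.75, known_ratio_max=0.75):
    """Decide whether the related keywords of one results page are worth keeping.

    result_urls    -- website addresses returned for the current keyword
    video_urls     -- addresses already classified as video service websites
    candidate_urls -- addresses recorded from previous search results
    All names and default thresholds are illustrative assumptions.
    """
    if not result_urls:
        return False
    total = len(result_urls)

    # Process (1): topic relevance. Too few video service URLs means the
    # related keywords have drifted away from the search theme.
    video_ratio = sum(url in video_urls for url in result_urls) / total
    if video_ratio < topic_ratio_min:
        return False

    # Process (2): novelty. If most returned URLs are already known,
    # the keyword has little ability to produce new website addresses.
    known_ratio = sum(url in candidate_urls for url in result_urls) / total
    return known_ratio < known_ratio_max
```

Keeping the two ratios separate mirrors the text: relevance is a gate, and novelty is only evaluated for keywords that pass it.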
Step 14: judge whether the webpage corresponding to a website address in the search results is a video service webpage; if so, extract the website name and add the website name and URL to the video service website database; if not, discard the website address;
Specifically, in step 14, judging whether the webpage corresponding to a website address in the search results is a video service webpage may be done in either of the following ways:
(1) Load the webpage corresponding to the website address and run the scripts on it; judge whether there is an HTML (HyperText Markup Language) tag characteristic of generating a player and, if there is, identify the candidate players in the webpage; then analyze the visual features of each candidate player object, such as size and coordinates, to determine whether the size of the video picture played by the player meets a predetermined size threshold (for example, whether the distance from the right border of the video picture to the right border of the page, from its top border to the top border of the page, and from its bottom border to the bottom border of the page each meet certain thresholds); if so, determine that the webpage corresponding to the website address is a video service webpage;
(2) Judge whether the webpage corresponding to the website address is a video service webpage according to the degree of match between that webpage and the webpages in the video webpage templates.
Optionally, because webpages of the same type have essentially the same DOM tree structure, once the webpage corresponding to a website address has been determined to be a video service webpage, it can also be saved as a video webpage template that serves as a basis for judging whether other webpages are video service webpages.
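The player check in way (1) above depends on running page scripts and measuring rendered geometry, which requires a browser engine. The sketch below is only a static approximation under stated assumptions: it scans the raw HTML for a hypothetical set of player-generating tags and compares their declared width and height attributes with hypothetical minimum sizes. The tag set, the thresholds, and all names are illustrative and do not come from the patent.

```python
from html.parser import HTMLParser

# Assumed set of tags that typically embed a player.
PLAYER_TAGS = {"video", "embed", "object", "iframe"}


class PlayerFinder(HTMLParser):
    """Collects candidate player tags together with their declared size."""

    def __init__(self):
        super().__init__()
        self.candidates = []  # list of (tag, width, height)

    def handle_starttag(self, tag, attrs):
        if tag not in PLAYER_TAGS:
            return
        attr_map = dict(attrs)
        try:
            width = int(attr_map.get("width") or 0)
            height = int(attr_map.get("height") or 0)
        except ValueError:
            # Sizes such as "100%" cannot be judged without rendering.
            return
        self.candidates.append((tag, width, height))


def looks_like_video_page(html, min_width=320, min_height=240):
    """Return True if any candidate player meets the assumed size thresholds."""
    finder = PlayerFinder()
    finder.feed(html)
    return any(w >= min_width and h >= min_height
               for _, w, h in finder.candidates)
```

A page whose player is created entirely by script, or sized by CSS, would be missed by this static check, which is why the description calls for loading the page and running its scripts first.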
It should be noted that steps 13 and 14 above may be executed in either order, or the two steps may be executed at the same time.
Step 15: judge the health degree of the website addresses in the video service website database, and store websites whose health degree is below the first health degree threshold in the bad video website database;
Specifically, in step 15, judging the health degree of the website addresses in the video service website database may include, but is not limited to, the following:
First, for a website to be evaluated, a predetermined number of webpages are selected according to website depth, and a corresponding webpage ontology graph is built for each webpage;
Here, a webpage ontology graph may include the terms present in the webpage, the weighted directed edges between different terms, which are derived from the adjacency relations between terms and the frequency with which the terms occur in the webpage, and the values of the directed-edge vectors;
Next, the similarity between each webpage ontology graph and the ontology graph of the webpage at the corresponding depth of a website in a predetermined bad-website template is calculated (for example, for bad template website A, the similarities between the different webpages of the website to be evaluated and bad template website A are averaged, giving the similarity of the website to be evaluated to bad template website A), and the health score of the website to be evaluated relative to that template website is determined from the similarity and the score of the website in the bad-website template. That is, once the similarity of the website to be evaluated to bad template website A has been obtained, a predetermined algorithm can derive, from that similarity and the score of bad template website A in the bad-website templates, the health score of the website to be evaluated relative to bad template website A;
Finally, the health value of the website to be evaluated is calculated from its health scores relative to the websites in all bad-website templates. Specifically, the health score of the website to be evaluated can be calculated in the same way for every other template website contained in the bad-website templates, and the mean of these health scores is then taken as the health value of the website to be evaluated.
Optionally, while judging the health degree of the website addresses in the video service website database, the technical solution provided by the embodiment of the present invention may further include:
If the health value of the website to be evaluated is below a set second health degree threshold, the website is added to the bad-website templates, so as to update the bad-website template library that contains the bad-website templates. In addition, the first health degree threshold may be greater than the second health degree threshold. Specifically, if the health value of the website to be evaluated is below the threshold set for bad-website templates (i.e. the second health degree threshold, for example 1), the website can be added to the bad-website templates to update the bad-website template library; at the same time, the library can be checked for obsolete bad-website templates, which are deleted if found; the website also needs to be added to the bad video website database. If the health value of the website to be evaluated is below the first health degree threshold (for example, 3) but above the second health degree threshold, the website can simply be added to the bad video website database without being added to the bad-website template library. If the health value of the website to be evaluated is above the first health degree threshold, the website is discarded.
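The scoring and two-threshold decision described above can be sketched as follows. The patent leaves the "predetermined algorithm" that combines similarity with the template score unspecified, so the linear interpolation used here is purely an assumption, as are the function names, the `max_health` value, and the result labels. The default thresholds 3 and 1 reuse the example values in the text.

```python
def health_value(similarities, template_scores, max_health=5.0):
    """Average the per-template health scores of a website to be evaluated.

    similarities    -- {template_site: [page-level ontology-graph similarities]}
    template_scores -- {template_site: score of that bad template website}
    The interpolation below is an assumed stand-in for the unspecified
    'predetermined algorithm': a site identical to a template inherits the
    template's (low) score, a completely dissimilar site gets max_health.
    """
    per_template = []
    for site, page_sims in similarities.items():
        sim = sum(page_sims) / len(page_sims)  # mean page similarity
        score = sim * template_scores[site] + (1.0 - sim) * max_health
        per_template.append(score)
    return sum(per_template) / len(per_template)


def classify(value, first_threshold=3.0, second_threshold=1.0):
    """Map a health value to the action described in the text."""
    if value < second_threshold:
        return "bad_and_template"  # store as bad site and add as a template
    if value < first_threshold:
        return "bad"               # store as bad site only
    return "discard"               # treated as healthy enough
```

The text does not say what happens when a health value equals a threshold exactly; this sketch treats equality as falling into the higher band.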
In the above process, the search keywords contained in the search keyword database may also be updated in at least one of the following ways:
Way one: update the search keyword database according to the keywords in the element tags, and the content of the description information, of the webpages in the search results that are relevant to the video service theme. That is, after the website evaluation of step 14 above, the webpages relevant to the video service theme are obtained, the corresponding search keywords are parsed from the content of the <meta name="keywords"> tag and the <meta name="description"> tag in each such webpage, and the parsed search keywords are then added to the search keyword database;
Way two: update the search keyword database according to the links to other websites in the webpages in the search results that are relevant to the video service theme. For example, for the URL in a link to another website, the keyword and description tags of that URL can be obtained and used to update the keywords.
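Way one can be illustrated with the standard-library HTML parser. This is a sketch under assumptions: the class and function names are hypothetical, keywords are assumed to be comma-separated (ASCII or full-width commas), and the description is returned as plain text because the text does not say how keywords are derived from it.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Reads the keywords and description meta tags of one webpage."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.description = ""

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = (attr_map.get("name") or "").lower()
        content = attr_map.get("content") or ""
        if name == "keywords":
            # Accept both ASCII and full-width commas as separators.
            parts = content.replace("\uff0c", ",").split(",")
            self.keywords.extend(p.strip() for p in parts if p.strip())
        elif name == "description":
            self.description = content.strip()


def extract_meta(html):
    """Return (keyword list, description text) for one page."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords, parser.description
```

The returned keyword list is what would be merged into the search keyword database; how the description text is segmented into keywords is left open here.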
In the embodiment of the present invention, before step 15 is performed to judge the health degree of the website addresses in the video service website database, the method may further include a step of merging and reducing the non-homepage addresses in the video service website database to the addresses of video service website homepages. This step may specifically include:
For two different website addresses in the video service website database, judge whether their host names are the same; if so, judge whether their corresponding website names are the same; if so, compare their path depths and reduce the address with the greater path depth to the address with the smaller path depth; and so on, until all website addresses in the video service website database have been processed. For example, for two different website addresses U1 and U2, first judge whether their host names are the same; if they differ, the addresses cannot be merged. If the host names are the same, further judge whether their corresponding website names are the same; if they differ, the addresses cannot be merged. If the website names are also the same, compare the path depths of the two addresses: if the path depth of U1 is smaller than that of U2, U2 is considered part of the website corresponding to U1 and can be reduced to U1; and vice versa.
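The merging rule above can be sketched with the standard URL parser. Grouping by (host name, website name) and keeping the shallowest path gives the same result as the pairwise comparison described in the text. The function names and the definition of path depth as the number of non-empty path segments are assumptions.

```python
from urllib.parse import urlsplit


def path_depth(url):
    """Number of non-empty path segments (assumed definition of path depth)."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def merge_to_homepages(sites):
    """Reduce (url, site_name) pairs so each host/name pair keeps one address.

    Addresses with the same host name and the same website name are reduced
    to the one with the smallest path depth.
    """
    shallowest = {}
    for url, name in sites:
        key = (urlsplit(url).hostname, name)
        if key not in shallowest or path_depth(url) < path_depth(shallowest[key]):
            shallowest[key] = url
    return [(url, name) for (_, name), url in shallowest.items()]
```

For example, "http://v.example.com/a/b/1.html" and "http://v.example.com/" with the same website name would be reduced to the second address, while an address on the same host with a different website name is kept separately.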
Optionally, to obtain the website name used in the above process, the method may further include a step of extracting the website name, which may specifically be done as follows:
Extract the content of the title tags (i.e. the "<title>" tags) of different webpages under the same website, and use a longest-common-string algorithm to extract the content that occurs most frequently in those title tags as the website name.
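One possible reading of the title-based extraction is sketched below. It interprets the longest-common-string algorithm as longest common substring (an assumption), computes it for every pair of page titles, and returns the substring that occurs most often. The separator characters stripped from the result and all function names are also assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def longest_common_substring(a, b):
    """Longest contiguous block shared by two strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def site_name(titles):
    """Most frequent pairwise common substring among a site's page titles."""
    counts = Counter()
    for first, second in combinations(titles, 2):
        # Strip separators that commonly join page title and site name.
        common = longest_common_substring(first, second).strip(" -_|")
        if common:
            counts[common] += 1
    return counts.most_common(1)[0][0] if counts else ""
```

With fewer than two titles there is nothing to compare, so this sketch returns an empty string in that case.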
For ease of understanding, the implementation of the present invention is described in detail below with reference to a concrete application.
In a concrete application, as shown in Fig. 2A, the embodiment of the present invention may include an agent for each search engine, together with result extraction, keyword evaluation, URL analysis, URL merging, and website health degree evaluation modules. The modules share the same database; each module can be deployed on a separate machine, or all of them can be deployed on the same machine. The processing architecture shown in Fig. 1 can support an arbitrary "N+1" mode, where N denotes any number of hosts (each containing the processing modules above) and 1 denotes the shared database. In other words, any number of hosts can run the same group of processing modules, and the hosts exchange data through the shared database. Such a processing architecture can effectively improve the overall performance of the system, i.e. the performance of searching for bad websites.
In the figure, the shared database may include: a search keyword database, a candidate URL database, a video website template database, a video website database, an intermediate temporary database, and a bad video website database, as well as a bad video website template database, a video URL template database, and a video URL database.
In Fig. 2, in order to search for and discover video service websites to the greatest possible extent, four major search engines (Baidu, Google, Bing, and Yahoo) are used together. Each search engine has its own search engine agent module, while the result extraction, keyword evaluation, URL analysis, URL merging, and website health degree evaluation modules can be shared.
The overall implementation architecture of the embodiment of the present invention, shown in Fig. 2B, may include a search agent module M1, a search result processing module M2, a keyword evaluation module M3, a URL analysis module M4, a URL merging module M5, and a website health degree evaluation module M6. The function of each module is described below:
Search agent module M1, used to automatically generate search engine requests from the search keyword database and to obtain the search results returned for those requests;
Search result processing module M2, used for search result extraction; specifically, it parses the search results obtained by search agent module M1, and locates and extracts the website addresses in the search results and the related search keywords C at the bottom of the returned results page;
Keyword evaluation module M3, used to judge how relevant the related search keywords C at the bottom of the currently returned results page are to the theme of video service websites (such as food or football) and how capable they are of producing new video website addresses; if the topic relevance of the related search keywords C is low, or their ability to produce new video website addresses is weak, they are not added to the search keyword database;
URL analysis module M4, used to judge, by means of automatic video URL identification and of the similarity between a URL's webpage and the video website templates, whether the current webpage belongs to the video service class. If it does, the module extracts the website name, adds the website name and URL L to the video service website set D3 (i.e. the video URL database), and extends the search keyword database D1 with the content of specific HTML tags in the webpage. If it does not, the module discards the URL and modifies the type of the corresponding URL in the candidate URL database D2, so that keyword evaluation module M3 can refer to it during keyword evaluation. Modifying the type of a URL here means marking a non-video website as such. During keyword evaluation, whether the results obtained with a keyword contain such websites, and how many, can then be checked: if the results contain many such URLs, exceeding a certain threshold, the keyword's expansion ability is considered weak and the keyword can be ignored; if they contain none or few, below the threshold, the keyword's expansion ability is considered strong and the keyword can be kept.
URL merging module M5, used to merge and reduce the non-homepage addresses in the video URL set D3 (i.e. the video URL database) to the addresses of video service website homepages, so as to find the homepages of the video service websites and obtain the video website set D4, i.e. the video website database.
Website health degree evaluation module M6, used to evaluate the health degree of the video websites in the video website set D4 obtained by URL merging module M5, so as to identify the bad video websites among them and determine the bad video website set D5, i.e. the bad video website database.
In order to be best understood from the implication of the general frame schematic diagram that Fig. 2 presents, each in Organization Chart will be processed below The function of module is described in detail.
(1) search agent module M1
The processing flow of the search agent module is shown in Fig. 3 and may include:
Step 31: the search agent module checks whether the search keyword database still contains unused search keywords; if so, it takes out one such keyword and marks it as used;
Step 32: the search agent module generates a search engine request from the keyword taken out;
Step 33: it obtains the returned search results, i.e. the corresponding search result pages, and stores the returned pages in the intermediate temporary database;
Step 34: it checks whether the last page of the search results has been read; if not, it continues with the next page; if so, it returns to step 31, i.e. checks again whether the database still contains unused keywords;
The search agent module can decide whether the last page has been read by checking whether the content of the returned search result page has changed.
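The loop of steps 31 to 34 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `fetch_page` callable, the `max_pages` safety limit, and the in-memory dictionaries standing in for the search keyword database and the intermediate temporary database are all assumptions. The sketch also assumes that a request beyond the last page returns the same content as the last page, which is how "content no longer changes" is detected here.

```python
def crawl_keyword(keyword, fetch_page, max_pages=100):
    """Read result pages for one keyword until the content stops changing."""
    pages = []
    previous = None
    for page_no in range(1, max_pages + 1):
        content = fetch_page(keyword, page_no)  # steps 32-33: request and fetch
        if content == previous:
            break  # step 34: unchanged content means the last page was read
        pages.append(content)
        previous = content
    return pages


def run_search_agent(keywords, fetch_page):
    """keywords maps each search keyword to a 'used' flag (hypothetical store)."""
    store = {}  # stands in for the intermediate temporary database
    while True:
        unused = [k for k, used in keywords.items() if not used]
        if not unused:
            break  # step 31: no unused keyword left
        keyword = unused[0]
        keywords[keyword] = True  # mark as used
        store[keyword] = crawl_keyword(keyword, fetch_page)
    return store


def fake_fetch(keyword, page_no):
    """Hypothetical search engine with three result pages per keyword."""
    return f"{keyword}-page{min(page_no, 3)}"


result = run_search_agent({"kw1": False, "kw2": True}, fake_fetch)
```

With the hypothetical `fake_fetch`, only the unused keyword `kw1` is crawled, and its three distinct pages are stored.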
(2) Search Results processing module M2
The processing flow of the search result processing module is shown in Fig. 4 and may include:
Step 41: obtain a used search keyword from the search keyword database and mark the obtained keyword;
Step 42: in the intermediate temporary database, find all returned search result pages that were found through this search keyword;
Step 43: read out the search result pages found through this keyword and delete them from the intermediate temporary database;
Step 44: extract the related search keywords returned at the bottom of each search result page and the website addresses contained in the page, and put the returned website addresses into the candidate URL database, so that later stages of the bad video website search can use them to decide whether a URL is newly discovered;
For the first predetermined number of search result pages (e.g. the first 20 pages), the related search keywords extracted from the bottom of the page and the website addresses are additionally put into the search keyword table of the intermediate temporary database; this table records the related search keywords awaiting evaluation (search keywords to be evaluated, for short) and the websites associated with them.
The above process is repeated until, according to the marks, all used search keywords in the search keyword database have been processed.
(3) key word evaluation module M3
The processing flow of the keyword evaluation module is shown in Fig. 5 and may include:
Step 51: check whether the search keyword table of the intermediate temporary database contains a search keyword to be evaluated; if not, the program exits; if so, perform step 52;
Step 52: take out all website records associated with this search keyword to be evaluated, deleting each record from the intermediate temporary database as it is taken out;
Step 53: call the website evaluation module to analyze and evaluate these website addresses;
Step 54: check whether the proportion of non-video-service websites among them exceeds 75%; if so, return to step 51; otherwise, perform step 55;
Step 55: use the candidate URL database to determine whether these websites are newly discovered URLs; the candidate URL database stores the websites obtained from all returned search result pages;
Step 56: check whether the proportion of URLs that are not newly discovered exceeds 75%; if so, return to step 51; otherwise, perform step 57;
Step 57: put the evaluated search keyword into the search keyword database, thereby updating the search keywords in that database.
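The two 75% checks in steps 54 and 56 can be sketched as a single decision function. The function name, the `is_video_site` callable (standing in for the website evaluation of step 53), and the `known_urls` set (standing in for the candidate URL database as it was before this keyword's results were added) are assumptions made for illustration; rejecting a keyword with no associated websites is also an assumption.

```python
def keep_keyword(site_urls, is_video_site, known_urls, max_ratio=0.75):
    """Return True if the evaluated keyword should enter the keyword database."""
    if not site_urls:
        return False  # assumption: nothing to evaluate, so do not keep it

    total = len(site_urls)

    # Step 54: too many non-video-service websites means weak topical relevance.
    non_video = sum(1 for url in site_urls if not is_video_site(url))
    if non_video / total > max_ratio:
        return False

    # Step 56: too many already-known URLs means little ability to find new sites.
    already_known = sum(1 for url in site_urls if url in known_urls)
    if already_known / total > max_ratio:
        return False

    return True  # step 57: keep the keyword


urls = ["a.example", "b.example", "c.example", "d.example"]
video_only_abc = lambda url: url != "d.example"
```

For instance, with one non-video site out of four and one known URL out of four, the keyword is kept; if all four URLs are already known, it is rejected.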
(4) URL analysis module M4
The processing flow of the URL analysis module is shown in Fig. 6 and may include:
Step 61: check whether the candidate URL database contains a URL to be evaluated; if not, the program exits; otherwise, perform step 62;
Step 62: read the video URL template database and compute the similarity between the URL to be evaluated and the video URL templates in it, obtaining a preliminary identification result; if the result shows that it is not a video service website, return to step 61; otherwise, refine the preliminary result using the automatic video-URL identification knowledge;
The automatic identification knowledge mainly includes: loading the webpage and running its scripts, then checking whether a characteristic HTML tag that produces a player has been generated; and analyzing the visual features of the candidate player object in the webpage, such as size and coordinates, to determine whether the width and height of the picture played by the player meet certain thresholds, and whether the distances from its right border to the page's right border, from its top border to the page's top border, and from its bottom border to the page's bottom border meet certain thresholds;
If the processing of step 62 determines that the URL to be evaluated is not a video service website address, return to step 61; otherwise, perform step 63;
Step 63: extract the website name, update the search keyword database with the content of the <meta name="keywords"> and <meta name="description"> tags in the webpage, and put the website name and the corresponding URL into the video URL database; the page can also be saved as a video URL template (i.e. stored in the video URL template database), so that later webpages can be judged to be video service webpages by their similarity to the video URL templates.
Specifically, the website name can be extracted as follows: extract the content of the <title> tags of different webpages under the same website, then use a longest-common-substring algorithm to extract the content that occurs most frequently in the <title> tags, and take that string as the website name.
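One possible reading of this website name extraction is sketched below: the longest common substring is computed for every pair of page titles, and the candidate that occurs most often is taken as the name. The pairwise strategy, the stripping of separator characters, and the example titles are assumptions; the source does not specify these details. The sketch uses `difflib.SequenceMatcher.find_longest_match` from the Python standard library.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def longest_common_substring(a, b):
    """Longest contiguous block shared by strings a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def extract_site_name(titles):
    """Pick the most frequent pairwise common substring of the <title> texts."""
    candidates = Counter()
    for first, second in combinations(titles, 2):
        # Assumption: separators such as " - " or "|" are not part of the name.
        common = longest_common_substring(first, second).strip(" -|_")
        if common:
            candidates[common] += 1
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


titles = [
    "Funny cats - ExampleTube",
    "News today - ExampleTube",
    "Cooking 101 - ExampleTube",
]
site_name = extract_site_name(titles)
```

Here every pair of the hypothetical titles shares the suffix " - ExampleTube", so the extracted name is "ExampleTube".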
(5) URL merging module M5
The processing flow of the URL merging module is shown in Fig. 7 and may include:
Step 71: check whether the video URL database contains a URL to be merged; if not, the program exits; otherwise, perform step 72;
Step 72: take out a URL U to be merged and obtain its website name and host name;
Step 73: find the set of all URLs in the video URL database that have the same host name and website name as U;
Step 74: process this URL set in a loop;
Specifically, first take a URL from the set. If its path depth is smaller than that of U, delete U from the database and end the loop (i.e. step 71 can be executed again); otherwise, delete the record of this URL and continue the loop (i.e. take another URL from the set and compare path depths again). After this loop, if U has the smallest depth, U remains in the database; otherwise, a URL with a smaller path depth than U remains. The algorithm therefore always ends up with the address of the video service website's homepage, and non-homepage addresses are deleted during merging.
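The net effect of this merging can be sketched as keeping, for each (host name, website name) pair, the URL with the smallest path depth. The sketch below computes that result in one pass rather than reproducing the delete-and-loop procedure of steps 71 to 74, and it assumes that path depth means the number of non-empty path segments; the record layout and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def path_depth(url):
    """Number of non-empty segments in the URL path (assumed definition)."""
    return len([segment for segment in urlsplit(url).path.split("/") if segment])


def merge_to_homepages(records):
    """records: iterable of (site_name, url); keep the shallowest URL per site."""
    shallowest = {}
    for site_name, url in records:
        key = (urlsplit(url).hostname, site_name)
        if key not in shallowest or path_depth(url) < path_depth(shallowest[key]):
            shallowest[key] = url
    return shallowest


records = [
    ("ExampleTube", "http://video.example.com/watch/123"),
    ("ExampleTube", "http://video.example.com/"),
    ("ExampleTube", "http://video.example.com/list"),
    ("OtherSite", "http://other.example.org/a/b"),
]
merged = merge_to_homepages(records)
```

In this example the three `video.example.com` entries collapse to the homepage URL, while the single `other.example.org` entry is kept as it is.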
(6) website health degree evaluation module M6
The processing flow of website health degree evaluation is shown in Fig. 8 and may include:
Step 81: manually set a number of bad website templates (i.e. the bad video website template database), extract webpages from them, and build the webpage ontology graph for each webpage of each website in the bad website templates;
Step 82: for the website to be evaluated, take a number of its webpages according to website depth, and build a webpage ontology graph for each of them;
Step 83: compute the similarity between each webpage ontology graph of the website to be evaluated and the ontology graph of the webpage at the corresponding depth in the websites of the bad website templates;
Step 84: compute the mean of the similarities to obtain the health degree of the website to be evaluated;
For example, for template website A, compute the mean similarity between the different webpages of the website to be evaluated and A, which gives the similarity of the website to be evaluated to A; then, from this similarity and the score of template website A, derive with the corresponding algorithm the health score of the website to be evaluated relative to A. Next, compute the health scores of the website to be evaluated relative to all other template websites in the bad website templates, and take the mean of these health scores as the health value of the website to be evaluated;
Step 85: check whether the health degree of the website under test (i.e. the website to be evaluated) is below the set thresholds. If the health value of the website under test is below the bad website template threshold (i.e. the second health degree threshold), e.g. below 1, add the website to the bad website template library (i.e. the bad video website template database) so as to update that library, at the same time check whether the library contains obsolete bad website templates and delete them, and also add the website under test to the bad video website database. If the health value of the website under test is below the health degree threshold (i.e. the first health degree threshold), e.g. below 3, add it directly to the bad video website database. If the health value of the website under test is greater than the first health degree threshold, discard the website.
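The threshold logic of step 85 can be sketched as below, using the example thresholds 3 and 1 from the text. The action names are hypothetical labels, the cleanup of obsolete templates is not modeled, and treating a health value exactly equal to the first threshold as "discard" is an assumption, since the text only covers the cases below and above it.

```python
def classify_site(health_value, first_threshold=3, second_threshold=1):
    """Return the list of actions step 85 would take for one website."""
    actions = []
    if health_value < second_threshold:
        # Very unhealthy: also becomes a new bad website template.
        actions.append("add_to_bad_template_db")
    if health_value < first_threshold:
        actions.append("add_to_bad_video_db")
    if not actions:
        actions.append("discard")
    return actions
```

A site with health value 2, as in the single-template example later in the text, would be added to the bad video website database only.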
In the above process, the webpage ontology graph is built as shown in Fig. 9, which may include:
Step 91: fetch the webpage and obtain its text;
Step 92: extract from the webpage the terms that belong to the bad video keyword domain, obtaining a term list;
Step 93: using the weighted directed edges between keywords in the known bad video keyword domain ontology graph, build the term vectors and derive the webpage ontology graph from the mapping table. Here the domain ontology graph expresses the degree of closeness between two different terms. It can be obtained with known algorithms: a value representing how close one term is to another is derived from their semantic, categorical, and structural relationships, and the domain ontology graph is built from these values.
Assume that a known domain ontology graph over the keywords (i.e. terms) A, B, C, D, E is given as shown in Table 1:
Table 1
    A     B     C     D     E
A   1     0.1   0.2   0.3   0.4
B   0.1   1     0.2   0.3   0.4
C   0.1   0.2   1     0.3   0.4
D   0.1   0.2   0.3   1     0.4
E   0.1   0.2   0.3   0.4   1
Based on Table 1, the webpage ontology graph is built in the following steps:
First step: obtain the keyword content of the webpage text. Assume d1 is the webpage under test and d2 is a template webpage, with:
d1: A A B D (text length = 4), d2: D D D E (text length = 4);
Second step: obtain the term weights, computed as term weight w = number of occurrences of the term / size of the term list, which gives Td1 = {(A, 0.5), (B, 0.25), (D, 0.25)} (number of terms = 3) and Td2 = {(D, 0.75), (E, 0.25)} (number of terms = 2); the number after each keyword in parentheses is the term weight of that keyword;
Third step: create the weighted term vectors of the webpage ontology graph, which may proceed as follows:
Taking A in d1 as an example, the second step gives A a term weight of 0.5 in d1, and Table 1 gives the value of A to B as 0.1, so the weighted term vector A → B has the value 0.5*0.1 = 0.05; all other values are obtained in the same way;
For example, take term node A as the center and compute the values of all weighted vectors from A to other nodes and from other nodes to A. In Table 2, the value 0.05 for A → B (second row, third column) is obtained as follows: Table 1 gives the directed edge A → B as 0.1, and the second step gives the frequency of A in d1 as 0.5, so the weighted term vector A → B is 0.5*0.1 = 0.05; similarly, B → A is 0.05. Next, take B as the center and compute the values of all weighted vectors from B to other nodes and from other nodes to B; for the A → B cell of Table 2 (second row, third column), this gives the frequency of B from the second step multiplied by the value of A → B in Table 1, i.e. 0.1*0.25 = 0.025. The same is then done with every other node as the center, and finally each cell of the table is filled with the maximum of the values obtained for that position; for the second row, third column the maximum is 0.05, obtained with A as the center, so 0.05 is entered there;
Fourth step: build the webpage ontology graph list; where the same vector has several values, the largest is taken. Tables 2 and 3 below are the lists for the webpage ontology graphs OGd1 and OGd2 respectively:
Table 2
    A       B       C       D       E
A   0.5     0.05    0.1     0.15    0.2
B   0.05    0.25    0.05    0.075   0.1
C   0.05    0.05    -       0.075   -
D   0.05    0.05    0.075   0.25    0.1
E   0.05    0.05    -       0.1     -
Table 3
    A       B       C       D       E
A   -       -       -       0.225   0.1
B   -       -       -       0.225   0.1
C   -       -       -       0.225   0.1
D   0.075   0.15    0.225   0.75    0.3
E   0.025   0.05    0.075   0.3     0.25
The webpage ontology graphs of d1 and d2 can be drawn from Tables 2 and 3, as shown in Fig. 10 and Fig. 11 respectively.
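The four construction steps can be checked with a short sketch that rebuilds Tables 2 and 3 from Table 1 and the term lists of d1 and d2. The function and variable names are assumptions; the rule implemented is the one described in the text: each term present in the page weights all edges leaving and entering it by its term weight, and the maximum is kept when an edge receives several values. Edges between two terms that are both absent from the page are left out, which corresponds to the "-" cells.

```python
from collections import Counter

TERMS = "ABCDE"
TABLE_1 = [
    [1, 0.1, 0.2, 0.3, 0.4],
    [0.1, 1, 0.2, 0.3, 0.4],
    [0.1, 0.2, 1, 0.3, 0.4],
    [0.1, 0.2, 0.3, 1, 0.4],
    [0.1, 0.2, 0.3, 0.4, 1],
]
# DOMAIN[src][dst] is the domain ontology edge value from src to dst.
DOMAIN = {src: dict(zip(TERMS, row)) for src, row in zip(TERMS, TABLE_1)}


def term_weights(term_list):
    """Second step: occurrences of each term divided by the term list size."""
    counts = Counter(term_list)
    return {term: count / len(term_list) for term, count in counts.items()}


def page_graph(term_list, domain):
    """Third and fourth steps: weighted edges, keeping the maximum per edge."""
    graph = {}
    for center, weight in term_weights(term_list).items():
        for other in domain:
            # Edges leaving the center node and edges pointing at it.
            for src, dst in ((center, other), (other, center)):
                value = weight * domain[src][dst]
                graph[(src, dst)] = max(graph.get((src, dst), 0.0), value)
    return graph


og_d1 = page_graph("A A B D".split(), DOMAIN)
og_d2 = page_graph("D D D E".split(), DOMAIN)
```

The resulting dictionaries map (source, target) pairs to edge values, so `og_d1[("A", "B")]` corresponds to the second row, third column of Table 2.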
In the above process, computing the similarities and their mean in steps 83 and 84, and from them the health degree of the website to be evaluated, may proceed as follows:
First, the similarity between the webpage under test and a template webpage is determined as follows:
The similarity between the webpage under test and the template webpage can be judged with the following formula: [formula not reproduced in this text], in which OGm = OGd1 ∩ OGd2 denotes the edges that both graphs have, taking for each shared edge the smaller of the two values, and the value of OGm is the sum of all those weights. The resulting OGm list is as follows:
Table 4
    A       B       C       D       E
A   -       -       -       0.15    0.1
B   -       -       -       0.075   0.1
C   -       -       -       0.075   -
D   0.05    0.05    0.075   0.25    0.1
E   0.025   0.05    -       0.1     -
This gives OGm = 0.15+0.1+0.075+0.1+0.075+0.05+0.05+0.075+0.25+0.1+0.025+0.05+0.1 = 1.2, and therefore:
sim(OGd1, OGd2) = 1.2/2.275 = 0.527;
Suppose five webpages are compared and the other four similarity values are 0.489, 0.501, 0.496, and 0.515. The mean similarity is then mean = (0.527+0.489+0.501+0.496+0.515)/5 = 0.5056, so the similarity of the website under test to that template website is 0.5056.
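The arithmetic of this worked example can be reproduced with the sketch below, which takes Tables 2 and 3 as given, sums the smaller value over every edge present in both, and then averages the five page similarities. The names are assumptions. The divisor 2.275 is copied from the worked example as a given number, because the formula that defines it is not reproduced in the text.

```python
N = None  # marks a "-" cell, i.e. an edge missing from the graph

OG_D1 = [  # Table 2
    [0.5, 0.05, 0.1, 0.15, 0.2],
    [0.05, 0.25, 0.05, 0.075, 0.1],
    [0.05, 0.05, N, 0.075, N],
    [0.05, 0.05, 0.075, 0.25, 0.1],
    [0.05, 0.05, N, 0.1, N],
]
OG_D2 = [  # Table 3
    [N, N, N, 0.225, 0.1],
    [N, N, N, 0.225, 0.1],
    [N, N, N, 0.225, 0.1],
    [0.075, 0.15, 0.225, 0.75, 0.3],
    [0.025, 0.05, 0.075, 0.3, 0.25],
]


def shared_weight(graph_1, graph_2):
    """Sum of min(edge values) over edges present in both graphs (OGm)."""
    total = 0.0
    for row_1, row_2 in zip(graph_1, graph_2):
        for value_1, value_2 in zip(row_1, row_2):
            if value_1 is not None and value_2 is not None:
                total += min(value_1, value_2)
    return total


og_m = shared_weight(OG_D1, OG_D2)   # about 1.2, as in Table 4
similarity = og_m / 2.275            # divisor taken from the worked example
page_similarities = [round(similarity, 3), 0.489, 0.501, 0.496, 0.515]
mean_similarity = sum(page_similarities) / len(page_similarities)
```

Rounded to three decimals the page similarity is 0.527, and the mean over the five pages is 0.5056, matching the figures in the text.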
Afterwards, the health degree of the website to be evaluated is determined from the mean similarity;
Specifically, the website health degree takes the values 0, 1, 2, 3, 4, 5, and the health value can be computed by the following formula: [formula not reproduced in this text]
Here the health degree of a template website is 0; if there is only one template website, the above formula gives the website to be evaluated a health degree of 2, i.e. a health value of 2.
Based on the technical solution provided by the above embodiments of the invention, bad video websites can be found quickly and accurately by searching, so that they can be effectively controlled and managed.
The above is only a preferred embodiment of the invention, but the scope of protection of the invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.

Claims (8)

1. A method for searching for bad video websites, characterized by comprising:
constructing a search request from a search keyword in a search keyword database;
obtaining the search results returned by a search engine for the search request, and obtaining the website addresses and the associated search keywords in the search results;
updating the search keywords in the search keyword database according to the degree of correlation between the associated search keywords in the current search results and the theme of video service websites, and according to their ability to produce new bad website addresses; this step comprises: judging the degree of correlation between the associated search keywords in the search results and the theme of video service websites; if the proportion of URLs of the video service website class in the currently returned search results exceeds a predetermined value, judging the ability of the current search keyword to produce new website addresses; and if the proportion of websites in the currently returned search results that are contained in a predetermined candidate URL database is below a predetermined value, adding the associated search keywords in the search results to the search keyword database, the candidate URL database recording the websites obtained from earlier search results;
judging whether the webpage corresponding to a website address in the search results is a video service webpage; if so, extracting the website name and adding the website name and URL to a video service website database; if not, discarding the website address;
judging the health degree of the website addresses in the video service website database, and storing the websites whose health degree is below a first health degree threshold in a bad video website database;
wherein the step of judging the health degree of the website addresses in the video service website database comprises:
for a website to be evaluated, selecting a predetermined number of webpages according to website depth and building the corresponding webpage ontology graph for each webpage; the step of building a webpage ontology graph comprises: fetching the webpage and obtaining its text; extracting from the webpage the terms belonging to the bad video keyword domain to obtain a term list; and, using the weighted directed edges between keywords in the known bad video keyword domain ontology graph, building term vectors and deriving the webpage ontology graph from the mapping table; wherein the domain ontology graph expresses the degree of closeness between two different terms;
computing the similarity between each webpage ontology graph and the ontology graph of the webpage at the corresponding depth in a website of the predetermined bad website templates, and determining, from the similarity and the score of that website in the bad website templates, the health score of the website to be evaluated relative to that website;
computing the health value of the website to be evaluated from its health scores relative to the websites in all bad website templates.
2. The method according to claim 1, characterized in that the method further comprises:
updating the search keyword database according to the content of the keywords and description information in the meta tags of the webpages in the search results that are related to the video service theme;
and/or
updating the search keyword database according to the links to other websites in the webpages in the search results that are related to the video service theme.
3. The method according to claim 1, characterized in that the step of judging whether the webpage corresponding to a website address in the search results is a video service webpage comprises:
loading the webpage corresponding to the website and running the scripts on the webpage, and judging whether there is a characteristic Hypertext Markup Language (HTML) tag that produces a player; if so, determining the candidate player in the webpage; then analyzing the visual features of the candidate player object to determine whether the size of the video picture played by the player meets a predetermined size threshold; if so, determining that the webpage corresponding to the website is a video service webpage;
or
judging whether the webpage corresponding to the website address is a video service webpage according to the degree of matching between that webpage and the webpages in the video webpage templates.
4. The method according to claim 3, characterized in that the method further comprises:
after determining that the webpage corresponding to the website is a video service webpage, saving the webpage as a video webpage template, the video webpage template serving as a basis for judging whether other webpages are video service webpages.
5. The method according to claim 1, characterized in that, before judging the health degree of the website addresses in the video service website database, the method further comprises a step of merging and reducing the non-homepage addresses in the video service website database to the addresses of video service website homepages, this step comprising:
for two different websites in the video service website database, judging whether their host names are identical; if so, judging whether their website names are identical; if so, comparing their path depths and reducing the website with the larger path depth to the website with the smaller path depth; and so on, until all websites in the video service website database have been processed.
6. The method according to claim 5, characterized in that the website name is extracted by:
extracting the content of the title tags of different webpages under the same website, and using a longest-common-substring algorithm to extract the content that occurs most frequently in the title tags of the different webpages of the same website as the website name.
7. The method according to claim 1, characterized in that the method further comprises:
if the health value of the website to be evaluated is below a set second health degree threshold, adding the website to be evaluated to the bad website templates, so as to update the bad website template library containing the bad website templates.
8. The method according to claim 7, characterized in that the first health degree threshold is greater than the second health degree threshold.
CN201210465213.2A 2012-11-16 2012-11-16 A kind of method searching for bad video website Expired - Fee Related CN103020123B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210465213.2A CN103020123B (en) 2012-11-16 2012-11-16 A kind of method searching for bad video website

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210465213.2A CN103020123B (en) 2012-11-16 2012-11-16 A kind of method searching for bad video website

Publications (2)

Publication Number Publication Date
CN103020123A CN103020123A (en) 2013-04-03
CN103020123B true CN103020123B (en) 2016-08-24

Family

ID=47968727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210465213.2A Expired - Fee Related CN103020123B (en) 2012-11-16 2012-11-16 A kind of method searching for bad video website

Country Status (1)

Country Link
CN (1) CN103020123B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699806B (en) * 2015-03-20 2018-05-08 无锡天脉聚源传媒科技有限公司 A kind of video searching method and device
CN105897672A (en) * 2015-11-16 2016-08-24 乐视云计算有限公司 Network broadcast method, device and system
CN106919835B (en) * 2015-12-24 2020-11-24 中国电信股份有限公司 Method and device for processing malicious website
CN108804444B (en) * 2017-04-28 2022-03-04 北京京东尚科信息技术有限公司 Information capturing method and device
CN108647225A (en) * 2018-03-23 2018-10-12 浙江大学 A kind of electric business grey black production public sentiment automatic mining method and system
CN111488511B (en) * 2019-01-25 2024-04-09 深信服科技股份有限公司 Website theme extraction method and system, electronic equipment and storage medium
CN110309423A (en) * 2019-06-28 2019-10-08 北京奇艺世纪科技有限公司 A kind of sensitive information recognition methods, device and electronic equipment
CN110442775A (en) * 2019-08-13 2019-11-12 杭州安恒信息技术股份有限公司 Acquisition methods, device and the electronic equipment of multiple level marketing Website publicity address
CN113536086B (en) * 2021-06-30 2023-07-14 北京百度网讯科技有限公司 Model training method, account scoring method, device, equipment, medium and product
CN116822805B (en) * 2023-08-29 2023-12-15 北京菜鸟无忧教育科技有限公司 Education video quality monitoring method based on big data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101604324A (en) * 2009-07-15 2009-12-16 中国科学技术大学 A kind of searching method and system of the video service website based on unit search
CN102523130A (en) * 2011-12-06 2012-06-27 中国科学院计算机网络信息中心 Bad webpage detection method and device
CN102663093A (en) * 2012-04-10 2012-09-12 中国科学院计算机网络信息中心 Method and device for detecting bad website

Also Published As

Publication number Publication date
CN103020123A (en) 2013-04-03

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160824

Termination date: 20211116

CF01 Termination of patent right due to non-payment of annual fee