WO2018053863A1 - Identifying video pages

Info

Publication number: WO2018053863A1
Application number: PCT/CN2016/100192
Authority: WO
Inventors: Albert Joseph Kishan Thambiratnam; Bo Han; Wangli CHAO
Original assignee: Microsoft Technology Licensing, Llc
Priority date: 2016-09-26
Filing date: 2016-09-26
Publication date: 2018-03-29
Also published as: CN108475275A

Abstract

The present disclosure provides a method, an apparatus and a system for identifying video pages. To identify video pages, structured contents of a web page on the Internet may first be obtained. Video object properties may be extracted from the structured contents of the web page. A page-based classification model built through machine learning may be used for determining whether the web page is a video page. The video object properties may be used as input features for the page-based classification model.

Description

IDENTIFYING VIDEO PAGES

BACKGROUND

Search engines are widely used by Internet users to search for content of interest. In some circumstances, a user may desire to receive only a certain type of search results, such as video pages. For example, the user may want to query a video content, and in response to such a query from the user, a search engine may return a list of video pages that are associated with the video content queried by the user. Here, a video page indicates a web page that includes at least one video and sets the at least one video as a dominant content.

In order to respond to queries for video contents, a search engine provider should establish in advance, for example, a database or a Big Table containing indices of video pages on the Internet, wherein the database and the Big Table may be based on local or distributed storage. Establishing the database or the Big Table first requires identifying video pages on the Internet. There are some existing approaches for identifying video pages, such as a template-based approach, a URL-based approach, a site map-based approach, etc. Usually, these approaches target a small range of video web sites, such as top and popular video web sites, and rely on manually defined rule information of video pages on these web sites, such as template rules, URL rules, etc. For example, for the template-based approach or the URL-based approach, some video web sites may manually design rules, such as page content template rules, page layout template rules or URL rules, for video pages on the video web sites, and a search engine provider may summarize rule information from some video pages on the video web sites and utilize the rule information to further identify video pages on the video web sites. As another example, for the site map-based approach, an operator of a top and popular video web site may proactively provide a search engine provider with a list of web pages on the web site, together with corresponding metadata used to identify whether a web page is a video page, and the search engine provider may then utilize the metadata and a site map of the web site to identify video pages on the web site.

Precision of identifying video pages according to the existing approaches depends on the rule information manually defined. Furthermore, the range of video pages that can be detected usually focuses on a small number of top and popular video web sites.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Embodiments of the present disclosure may provide a method, an apparatus and a system for identifying video pages.

In an aspect, the present disclosure provides a method for identifying video pages. According to the method, structured contents of a web page may be obtained. The structured contents may be used for extracting video object properties. A page-based classification model built through machine learning may be used for determining whether the web page is a video page. The video object properties may be used as input features for the page-based classification model.

In another aspect, the present disclosure provides an apparatus for identifying video pages. The apparatus may comprise a structured content obtaining module, a property extractor and a video page classifier. The structured content obtaining module may be configured for obtaining structured contents of a web page. The property extractor may be configured for extracting video object properties from the structured contents. The video page classifier may be configured for determining whether the web page is a video page by using a page-based classification model built through machine learning, the video object properties being used as input features for the page-based classification model.

In another aspect, the present disclosure provides a system for identifying video pages. The system may comprise one or more processors and a memory. The memory may store computer-executable instructions that, when executed, cause the one or more processors to perform any operations of the methods according to various aspects of the present disclosure.

In another aspect, the present disclosure provides a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods according to various aspects of the present disclosure.

It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of a few of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.

BRIEF DESCRIPTION OF THE DRAWINGS

The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.

FIG. 1 is a flowchart of an exemplary method for identifying video pages according to an embodiment of the present disclosure.

FIG. 2 illustrates an exemplary apparatus for identifying video pages according to an embodiment of the present disclosure.

FIG. 3 is a flowchart of an exemplary method for obtaining structured contents of a web page according to an embodiment of the present disclosure.

FIG. 4 illustrates an exemplary apparatus for identifying video pages according to an embodiment of the present disclosure.

FIG. 5 is a flowchart of an exemplary method for determining a potential video page based on URL according to an embodiment of the present disclosure.

FIG. 6 illustrates an exemplary apparatus for identifying video pages according to an embodiment of the present disclosure.

FIG. 7 illustrates an exemplary system for identifying video pages according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.

Embodiments of the present disclosure may provide a method, an apparatus and a system for identifying video pages. The embodiments of the present disclosure are applicable to various search engines. For example, through implementing the embodiments of the present disclosure, video pages on the Internet can be effectively identified, and may be further added to a database or a Big Table containing indices of video pages maintained for a search engine. Thus, when a user queries a certain video content through the search engine, the search engine may retrieve video page indices from the database or the Big Table and return them to the user.

In an aspect, the present disclosure proposes utilizing a page-based classification model for identifying video pages. The page-based classification model may be built through machine learning. Various machine learning algorithms may be applied for building the page-based classification model based on various video object properties of web pages. By using the page-based classification model built through machine learning, precision of identifying video pages and a recall rate can be increased. In another aspect, the present disclosure proposes further utilizing a URL-based classification model for determining potential video pages. The URL-based classification model may be built through machine learning in advance. By jointly using the URL-based classification model and the page-based classification model, the precision of identifying video pages and the recall rate can be further increased, and the efficiency of identifying video pages can be dramatically improved.

The embodiments of the present disclosure can perform video page identification on a relatively large range of web pages on the Internet, not restricted to the small number of top and popular video web sites. Thus, more video pages could be identified for search engines. The embodiments of the present disclosure can also achieve automatic video page identification, and thus can continually identify video pages on the Internet.

It is to be understood that the exemplary environment described above is only for the purpose of illustration without suggesting any limitations as to the scope of the present disclosure. The present disclosure can be embodied with a different structure and/or functionality.

FIG. 1 is a flowchart of an exemplary method 100 for identifying video pages according to an embodiment of the present disclosure.

The method 100 starts at 102 and proceeds to 104. At 104, structured contents of a web page may be obtained. It is known that a web page may be implemented by various markup languages, such as, HTML, XML, etc. A source code document of a web page, e.g., a HTML document, can be converted into structured contents. The embodiments of the present disclosure may adopt any approaches for obtaining structured contents of a web page. In an implementation, obtaining structured contents of a web page at 104 may merely indicate receiving the structured contents. In this case, the structured contents may be provided by any web page techniques or by any third parties. In another implementation, obtaining structured contents of a web page at 104 may include a process of generating the structured contents. For example, the structured contents may be generated by using Document Object Model (DOM) techniques which provide structured representation approaches for, such as, HTML documents, XML documents, etc. In this case, at 104, for example, a DOM tree may be obtained for the web page. That is, the structured contents may be represented as a DOM tree. Here, the web page, for which structured contents are obtained, may be any one of web pages on the Internet. Alternatively, this web page may be a potential video page determined through a specific approach, as described later.
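As an illustration of generating structured contents at 104, the following is a minimal sketch that parses an HTML source document into a nested tree using Python's standard `html.parser` module. The `Node` and `TreeBuilder` classes and the `parse_to_tree` function are hypothetical names introduced only for this example; a real DOM implementation carries far more information (text nodes, computed layout, etc.).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the simplified tree (hypothetical structure)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a nested Node tree from an HTML source document."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document", [])
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:
            self._current = node  # descend into non-void elements

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any.
        node = self._current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent


def parse_to_tree(source):
    """Return the root of the element tree for an HTML string."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```

Element attributes such as width, height, "class" and "id" are kept on each node, so later steps can read object properties directly from the tree.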

At 106, video object properties may be extracted from the structured contents. Usually, structured contents of a web page may comprise various properties of objects in the web page, such as, logical structure information, layout information, or type information. Thus, properties of video objects in the web page can be obtained from the structured contents.

If the web page includes only one video object, properties of that video object may be obtained at 106. If the web page includes a plurality of video objects, properties of one or more of the plurality of video objects may be obtained at 106. For example, in an implementation, properties of all or some of the plurality of video objects may be obtained. In another implementation, only properties of the video object with the largest size among the plurality of video objects may be obtained.

In an implementation, the operation at 106 may comprise firstly detecting video objects from the structured contents based on video identification information. A video object indicates a video element of a certain video type in the web page. Common video types may include webm, ogg, mp4, avi, flv, etc. The video identification information may be any types of information in the structured contents that can indicate a video object. For example, the video identification information may be HTML tags or HTML5 video tags within which a video object is embedded directly. The video identification information may also be Iframe tags that embed a video page. The embodiments of the present disclosure are not limited to any specific type of video objects or any specific type of video identification information.
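The detection step may be sketched as follows. This is a simplified illustration, not the disclosed implementation: it assumes that a `video` tag always denotes a video object, and that `embed`, `object` and `iframe` tags denote one only when their `src` or `data` attribute ends in one of the video types listed above. Real pages, in particular iframes that embed a video page, would need richer identification information. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# File extensions for the common video types named in the text.
VIDEO_EXTENSIONS = (".webm", ".ogg", ".mp4", ".avi", ".flv")
# Assumed set of tags that may embed a video by URL.
EMBED_TAGS = {"embed", "object", "iframe"}


class VideoObjectDetector(HTMLParser):
    """Collects tags that look like video objects while parsing."""

    def __init__(self):
        super().__init__()
        self.video_objects = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "video":
            self.video_objects.append({"tag": tag, "attrs": attrs})
        elif tag in EMBED_TAGS:
            url = attrs.get("src") or attrs.get("data") or ""
            path = url.split("?", 1)[0].lower()  # ignore the query string
            if path.endswith(VIDEO_EXTENSIONS):
                self.video_objects.append({"tag": tag, "attrs": attrs})


def detect_video_objects(source):
    """Return the candidate video objects found in an HTML string."""
    detector = VideoObjectDetector()
    detector.feed(source)
    detector.close()
    return detector.video_objects
```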

After detecting video objects, the operation at 106 may further obtain properties of the video objects detected from the structured contents. As mentioned above, if the web page includes a plurality of video objects, the operation at 106 may obtain properties of one or more of the plurality of video objects detected from the structured contents.

In an implementation, the extracted video object properties may comprise layout information associated with a video object in the web page. The layout information may be one or more of width, height and position, or any combination thereof, or any derived information therefrom.

For example, in an implementation, the layout information may comprise one or more of width, height, top and left of the video object. The width, height, top and left of the video object are usually defined in the structured contents and thus can be directly extracted from the structured contents. Otherwise, they can be deduced by using various approaches. In another implementation, the layout information may comprise one or more of width, height, top and left of at least one container for the video object. The at least one container for the video object may include one or more containers used for the video object in different levels. The width, height, top and left of the at least one container are also usually defined in the structured contents and thus can be directly extracted from the structured contents. Otherwise, they can be deduced by using various approaches. In another implementation, the layout information may comprise distance between the video object’s horizontal center and the web page’s horizontal center. This distance may be calculated based on the relative position between the video object and the web page. In another implementation, the layout information may comprise distance between the video object’s top or bottom and the web page’s top or bottom. This distance may be, such as, distance between the video object’s top and the web page’s top, distance between the video object’s bottom and the web page’s bottom, distance between the video object’s top and the web page’s bottom, or distance between the video object’s bottom and the web page’s top. It should be appreciated that the layout information may comprise at least one of the above exemplary implementations. Moreover, in some cases, the above layout information may be in a normalized form. For example, the width or height of the web page may be used for normalizing the layout information.
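The layout information described above may be computed as in the following sketch. It assumes that the width, height, top and left of the video object and the width and height of the web page are already known in pixels, and it normalizes horizontal quantities by the page width and vertical quantities by the page height. The function name and the feature names are hypothetical.

```python
def layout_features(video, page_width, page_height):
    """Derive normalized layout features for one video object.

    `video` is a dict with "width", "height", "top" and "left" in pixels.
    """
    video_center_x = video["left"] + video["width"] / 2
    page_center_x = page_width / 2
    video_bottom = video["top"] + video["height"]
    return {
        # Size of the video relative to the page.
        "width_ratio": video["width"] / page_width,
        "height_ratio": video["height"] / page_height,
        # Distance between the horizontal centers of the video and the page.
        "center_offset_x": abs(video_center_x - page_center_x) / page_width,
        # Distance from the video's top to the page's top.
        "top_to_page_top": video["top"] / page_height,
        # Distance from the video's bottom to the page's bottom.
        "bottom_to_page_bottom": (page_height - video_bottom) / page_height,
    }
```

For instance, a 640 by 360 video placed at left 320 and top 100 on a 1280 by 2000 page is horizontally centered (offset 0) and occupies half of the page width.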

In an implementation, the video object properties may comprise video type information. For example, the type of the video object may be determined from the structured contents. As mentioned above, the video type may be flv, webm, ogg, mp4, etc.

In an implementation, the video object properties may comprise container type information. For example, the type of the at least one container for the video object may be determined from the structured contents. The container type may be, such as, div, p, a, span, etc.

In an implementation, the video object properties may comprise depth information of the video object in a DOM tree. As mentioned above, the structured contents may be represented as a DOM tree. Various approaches may be adopted to determine the depth of the video object in the DOM tree.
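One possible way of determining the depth is sketched below: the open elements are tracked on a stack while the HTML source is parsed, and the stack height is recorded whenever a video tag is encountered. The convention that the root element has depth 1 is an assumption of this example, as are the class and function names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are never pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DepthTracker(HTMLParser):
    """Records the tree depth of every element with the target tag."""

    def __init__(self, target_tag="video"):
        super().__init__()
        self.target_tag = target_tag
        self._open = []   # stack of currently open tags
        self.depths = []

    def handle_starttag(self, tag, attrs):
        depth = len(self._open) + 1  # assumed convention: root is depth 1
        if tag == self.target_tag:
            self.depths.append(depth)
        if tag not in VOID_TAGS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Pop up to and including the most recent matching open tag.
            while self._open.pop() != tag:
                pass


def video_depths(source, target_tag="video"):
    """Return the depth of each target element in an HTML string."""
    tracker = DepthTracker(target_tag)
    tracker.feed(source)
    tracker.close()
    return tracker.depths
```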

In an implementation, the video object properties may comprise text information associated with the video object and/or the at least one container for the video object. The text information may indicate, such as, names of “class” or “id” of tags associated with the video object and/or the at least one container for the video object, or any other text.

It should be appreciated that the video object properties may comprise at least one of layout information, video type information, container type information, depth information of the video object in the DOM tree, and text information associated with the video object and/or the at least one container for the video object.

At 108, it may be determined whether the web page is a video page by using a page-based classification model. The video object properties obtained at 106 may be used as input features for the page-based classification model. The page-based classification model may be built through machine learning in advance, for determining whether a web page is a video page based on video object properties extracted for the web page. The building of the page-based classification model through machine learning will be described later.

According to the present disclosure, those web pages containing a video as a dominant content tend to be determined as video pages. As for web pages containing a video but not setting the video as a dominant content, these web pages tend to be determined as non-video pages, because the video is likely to be, such as, an advertisement video and may not be desired by the user. In the case that a web page includes a plurality of videos, if at least one video among the plurality of videos is a dominant content, this web page tends to be determined as a video page.

For example, if a video object is displayed in or near the horizontal center of the web page, or is not near the horizontal center but has a relatively large size, the web page would be determined as a video page with a high possibility. If, however, the video object is displayed close to the edge of the web page with a relatively small size, or even near the horizontal center but with a relatively small size, the web page would be determined as a non-video page with a high possibility. As another example, if the web page has a large vertical size and the video object is displayed at the bottom of the web page, then even if the video object is displayed near the horizontal center of the web page, the web page would still be determined as a non-video page with a high possibility; if instead this video object is displayed in the first screen of the web page and near the horizontal center of the web page, the web page would be determined as a video page with a high possibility. Thus, layout information in the video object properties may be used by the page-based classification model to determine a video page.

Moreover, the depth of the video object in the DOM tree in the video object properties may also be used by the page-based classification model to determine a video page. For example, if it is known that depths of video objects in DOM trees for most video pages lie in a range of 5 to 7, then a web page with its video object having a depth in this range is more likely to be determined as a video page. Furthermore, the type of the video object or the type of the at least one container in the video object properties may also be used by the page-based classification model to determine a video page. Moreover, the text information associated with the video object and/or the at least one container for the video object may also be used by the page-based classification model to determine a video page. For example, if the “id” of a tag associated with the video object is named “video_stage”, the web page is very likely a video page owing to the use of the term “video”, while if the name of the “id” includes “gallery”, this web page may be determined as a non-video page.

The method 100 ends at 110. However, it should be appreciated that the determination result at 108 may be used for any further application scenarios. For example, if a web page is identified through the method 100 as a video page, this web page may be indexed and added into a database or a Big Table containing indices of video pages that is maintained for a search engine.

As mentioned above, in the method 100, the page-based classification model may be built through machine learning. The embodiments of the present disclosure may adopt various techniques for performing machine learning. For example, in an implementation, one or more of Boosted Trees, Random Forest, Neural Network, and Support Vector Machine (SVM) may be used as a machine learning model.

In order to build a machine learning model that is to be used as the page-based classification model, firstly, a number of training web pages (such as, thousands of or more web pages) may be manually labeled as video pages or non-video pages. Video object properties may be extracted from these web pages in a similar way as operations 104 and 106 in FIG. 1. Then, the extracted video object properties of the web pages and the labels of video pages or non-video pages determined for the web pages may be inputted into the machine learning model as input training features. The input training features may be in any form that can be interpreted by the machine learning model. For example, if the input training features are in a numeral form, the video object properties and the labels may be numeralized in advance. Based on the input training features, the machine learning model may be trained so as to establish a machine learning-based mechanism for determining video pages. The trained machine learning model may be used as the page-based classification model for determining whether a web page is a video page. For example, when it is to determine whether a certain web page is a video page, the page-based classification model may receive video object properties of the web page and utilize the built machine learning-based mechanism therein to return a determination result.
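The training flow may be illustrated by the following self-contained sketch. It is not one of the models named above: a simple perceptron is swapped in as a stand-in so that the example needs no third-party library, and in practice Boosted Trees, Random Forest, a Neural Network or an SVM would take its place. The feature set in `numeralize`, the four labeled pages, and all function names are made up for illustration only.

```python
def numeralize(props):
    """Turn extracted properties into a numeric vector (assumed feature set)."""
    text = props.get("id", "").lower()
    return [
        props["width_ratio"],
        props["center_offset_x"],
        props["top_to_page_top"],
        1.0 if "video" in text else 0.0,  # keyword flag from text information
    ]


def predict(weights, bias, x):
    """Return 1 (video page) or 0 (non-video page) for a feature vector."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, max_epochs=1000):
    """Perceptron training: adjust weights on every misclassified sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(samples, labels):
            predicted = predict(weights, bias, x)
            if predicted != label:
                step = label - predicted  # +1 or -1
                weights = [w + step * xi for w, xi in zip(weights, x)]
                bias += step
                mistakes += 1
        if mistakes == 0:  # every training page is classified correctly
            break
    return weights, bias


# Made-up, manually labeled training pages (1 = video page, 0 = non-video page).
labeled_pages = [
    ({"width_ratio": 0.60, "center_offset_x": 0.02,
      "top_to_page_top": 0.10, "id": "video_stage"}, 1),
    ({"width_ratio": 0.50, "center_offset_x": 0.05,
      "top_to_page_top": 0.15, "id": "player"}, 1),
    ({"width_ratio": 0.20, "center_offset_x": 0.40,
      "top_to_page_top": 0.10, "id": "sidebar_ad"}, 0),
    ({"width_ratio": 0.25, "center_offset_x": 0.05,
      "top_to_page_top": 0.90, "id": "gallery"}, 0),
]
samples = [numeralize(props) for props, _ in labeled_pages]
labels = [label for _, label in labeled_pages]
weights, bias = train_perceptron(samples, labels)
```

After training, `predict(weights, bias, numeralize(props))` returns the determination result for the properties of a new web page. With only four training pages the result is of course not meaningful beyond illustrating the flow.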

Through adopting a machine learning approach for building the page-based classification model, the present disclosure may establish a determining mechanism, with higher performance than existing manually defined rules, for determining video pages. Thus, the present disclosure may increase precision of identifying video pages and a recall rate.

FIG. 2 illustrates an exemplary apparatus 200 for identifying video pages according to an embodiment of the present disclosure. In an implementation, the apparatus 200 may be configured for performing the operations of the method 100.

The apparatus 200 may comprise a structured content obtaining module 202, a property extractor 204 and a video page classifier 206. The structured content obtaining module 202 may be configured for obtaining structured contents of a web page. For example, the structured content obtaining module 202 may perform the operation 104 in the method 100. The property extractor 204 may be configured for extracting video object properties from the structured contents. For example, the property extractor 204 may perform the operation 106 in the method 100. The video page classifier 206 may include a page-based classification model built through machine learning, and may be configured for determining whether the web page is a video page by using the page-based classification model. For example, the video page classifier 206 may perform the operation 108 in the method 100.

FIG. 3 is a flowchart of an exemplary method 300 for obtaining structured content of a web page according to an embodiment of the present disclosure. The method 300 is an exemplary implementation of the operation 104 in the method 100. It should be appreciated that the operation 104 in the method 100 may also be implemented by any other approaches.

The method 300 starts at 302 and proceeds to 304. At 304, a crawling operation may be performed on a web page based on a URL of the web page. In the implementations of the present disclosure, web pages on the Internet may be classified as static web pages or dynamic web pages. For a static web page, after corresponding source codes (such as HTML codes) are generated, contents and display effects of the page will not change unless the source codes are amended. For a dynamic web page, by contrast, even if its source codes are not amended, at least a part of the displayed contents can still change with time, database operations, etc. A dynamic web page may be generated by a combination of HTML with other high-level programming languages (such as Java, C#, C++, etc.). For example, the dynamic web page may include a segment of program codes, and through executing the segment of program codes, interactions with background databases, web servers or users can be performed.

Various crawling approaches may be adopted by the embodiments of the present disclosure. Web pages may be crawled in a static crawling mode or in a dynamic crawling mode. Usually, the static crawling mode may be used for crawling static web pages, and the dynamic crawling mode may be used for crawling dynamic web pages. However, in some cases, for example, if source codes of a web page including dynamic contents could generate a video player in a standard web browser with a single source code document, this web page may also be crawled in the static crawling mode rather than the dynamic crawling mode. In some implementations, the “domain” field in the URL of a web page may also be used to determine which crawling mode can be applied for this web page. Different crawling results may be obtained for the static crawling mode and the dynamic crawling mode respectively. For example, if the web page is a static web page, the operation 304 may crawl a single source code document of the static web page, such as an HTML document, in the static crawling mode. If the web page is a dynamic web page, the operation 304 may usually crawl a source code document and at least one scripting file of the dynamic web page in the dynamic crawling mode. The scripting file may be a file that helps achieve dynamic display of certain objects in a dynamic web page. For example, the scripting file may be or may include a segment of program codes (such as JavaScript codes) that, when executed, perform interactions with background databases, web servers or users. Moreover, as mentioned above, in some cases, the operation 304 may also crawl a single source code document of a dynamic web page in the static crawling mode.

At 306, it may be determined whether to perform static parsing or dynamic parsing. Various approaches may be adopted for the determination at 306. For example, if a single source code document is crawled at 304, it may be determined at 306 to perform static parsing, and thus the method 300 proceeds to 308. At 308, the source code document may be parsed into structured contents.

Otherwise, if a source code document and at least one scripting file are crawled at 304, it may be determined at 306 to perform dynamic parsing, and thus the method 300 proceeds to 310. At 310, the source code document and the at least one scripting file may be parsed into structured contents.
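The determination at 306 may be as simple as inspecting what was crawled at 304, as in the following sketch. The `CrawlResult` container and the `choose_parsing_mode` function are hypothetical names used only for illustration; other criteria for the determination are equally possible.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlResult:
    """Hypothetical container for what the crawler fetched for one URL."""
    source_document: str
    scripting_files: list = field(default_factory=list)


def choose_parsing_mode(result):
    """Return "dynamic" when scripting files were crawled, else "static"."""
    return "dynamic" if result.scripting_files else "static"
```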

In an implementation, the at least one scripting file may be automatically executed in order to facilitate the parsing operation at 310. That is, the parsing operation at 310 may include parsing the source code document and the executed at least one scripting file into structured contents. For example, if the scripting file is a JavaScript file, then codes in the JavaScript file may be executed such that further object information of the web page can be obtained and used for generating the structured contents. As an example, in the case that a dynamic web page includes a play button corresponding to a scripting file, an operation of clicking on the play button may be automatically simulated. Such an operation may cause the scripting file to be executed directly, or cause another HTTP request to be sent to a server to get an updated dynamic web page which may contain a video. When the updated dynamic web page is received, the above crawling and parsing operations may be further applied to the updated dynamic web page.

The method 300 ends at 312. However, it should be appreciated that the structured contents obtained at 308 and 310 may be further provided to the operation 106 of the method 100 in FIG. 1 for subsequent processes.

FIG. 4 illustrates an exemplary apparatus 400 for identifying video pages according to an embodiment of the present disclosure. The apparatus 400 is a further implementation of the apparatus 200 in FIG. 2.

As shown in FIG. 4, the structured content obtaining module 202 may further include a page crawler 402 and a page parser 404. The page crawler 402 and the page parser 404 may be jointly configured for performing the operations of the method 300.

The page crawler 402 may be configured for performing crawling on a web page based on a URL of the web page. For example, the page crawler 402 may perform the operation 304 in the method 300. In the static crawling mode, the page crawler 402 may crawl a source code document of a web page, and in the dynamic crawling mode, the page crawler 402 may crawl a source code document and at least one scripting file of a web page.

The page parser 404 may be configured for performing parsing on the document or file crawled by the page crawler 402 so as to generate structured contents of the web page. For example, the page parser 404 may perform any of the operations 306, 308 and 310 in the method 300.

The page parser 404 may determine whether to perform static parsing or dynamic parsing. If a single source code document is crawled by the page crawler 402, the page parser 404 may parse the source code document into structured contents. If instead a source code document and at least one scripting file are crawled by the page crawler 402, the page parser 404 may parse the source code document and the at least one scripting file into structured contents.

The property extractor 204 and the video page classifier 206 in FIG. 4 are the same as the property extractor 204 and the video page classifier 206 shown in FIG. 2.

Returning to FIG. 1, the method 100 may be applied to any web pages on the Internet. Alternatively, in some embodiments, the method 100 may be applied only to potential video pages. Through determining potential video pages before applying the page-based method for identifying video pages as shown in FIG. 1, the present disclosure may largely decrease the number of web pages that are to be processed, thus dramatically improving efficiency of identifying video pages. Furthermore, by predetermining whether a web page is a potential video page, the present disclosure can further increase precision of identification and ensure a high recall rate. Various approaches may be adopted for determining potential video pages. According to an embodiment of the present disclosure, a URL of a web page may be utilized for determining whether this web page is a potential video page.

All the web sites on the Internet may be divided into two groups, one group including top and popular video web sites, and another group including all other web sites. For top and popular video web sites, a URL pattern-based approach for determining potential video pages may be used. For example, a set of URL patterns obtained from a URL of a web page on a certain top and popular video web site may be used for determining whether this web page is a potential video page. For other web sites than top and popular video web sites, a URL keyword-based approach for determining potential video pages may be used. For example, a set of URL keywords obtained from a URL of a web page on said other web sites may be used for determining whether this web page is a potential video page.

FIG. 5 is a flowchart of an exemplary method 500 for determining a potential video page based on URL according to an embodiment of the present disclosure.

The method 500 starts at 502, and proceeds to 504. At 504, a URL of a web page may be obtained. The URL of the web page may be obtained through various approaches. For example, web crawlers may be used for automatically obtaining URLs of web pages on the Internet. In this case, the operations of the method 500 may be triggered each time a new URL is obtained.

At 506, it may be determined whether a URL pattern-based approach or a URL keyword-based approach is to be applied for determining potential video pages. For example, a “domain” field of the URL obtained at 504 may be extracted and used for determining whether this URL directs to a web page on a top and popular video web site, in which case the URL pattern-based approach should be applied.
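The decision at 506 can be illustrated with a minimal Python sketch. The registry `TOP_VIDEO_SITES`, its single entry, and the helper names `extract_domain` and `choose_approach` are hypothetical and introduced only for illustration; dropping a leading "www." from the host is likewise an assumption about how the “domain” field might be derived.

```python
from urllib.parse import urlsplit

# Hypothetical registry of top and popular video web sites (illustrative value only).
TOP_VIDEO_SITES = {"abcde.com"}


def extract_domain(url):
    """Return the lowercased host of the URL with a leading 'www.' dropped (assumption)."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def choose_approach(url):
    """Sketch of operation 506: 'pattern' for registered top sites, otherwise 'keyword'."""
    return "pattern" if extract_domain(url) in TOP_VIDEO_SITES else "keyword"
```

Under these assumptions, a URL on a registered site is routed to the URL pattern-based branch (508, 510), and any other URL is routed to the URL keyword-based branch (512, 514).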

If it is determined to apply the URL pattern-based approach at 506, the method 500 proceeds to 508. At 508, URL pattern parsing may be performed on the URL so as to obtain a set of URL patterns corresponding to the URL. In an implementation, each of the set of URL patterns may be a combination of one or more URL features of the URL. The URL features may comprise features extracted from at least one of scheme, domain, path list, suffix, query list, etc. in the URL. For example, assuming that a URL is “http://www.abcde.com/video/cn/263578.html”, then a set of URL features may be extracted, such as, scheme = “http”, domain = “abcde.com”, path 1 = “video”, path 2 = “cn”, path 3 = “263578”, path 3 is a decimal number, etc. Accordingly, a set of URL patterns may be formed from any combinations of one or more of these URL features. For example, a first URL pattern may be a combination of [scheme = “http”, domain = “abcde.com”, path 1 = “video”], a second URL pattern may be a combination of [domain = “abcde.com”, path 1 = “video”, path 2 = “cn”, path 3 is a decimal number], and so on.
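The URL pattern parsing at 508 can be sketched as follows. This is an illustrative sketch only: the function names `url_features` and `url_patterns`, the string encoding of each feature, and the `max_size` cap on combination size are assumptions that do not appear in the disclosure, and query-list features are omitted for brevity.

```python
from itertools import combinations
from urllib.parse import urlsplit


def url_features(url):
    """Extract scheme, domain, path-list and suffix features from a URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    domain = host[4:] if host.startswith("www.") else host  # assumption: drop 'www.'
    features = ["scheme=" + parts.scheme, "domain=" + domain]

    segments = [s for s in parts.path.split("/") if s]
    suffix = ""
    if segments and "." in segments[-1]:
        # Split a trailing file extension such as 'html' off the last segment.
        segments[-1], suffix = segments[-1].rsplit(".", 1)

    for i, seg in enumerate(segments, start=1):
        features.append("path%d=%s" % (i, seg))
        if seg.isdigit():
            # Mirrors the 'path 3 is a decimal number' feature in the example.
            features.append("path%d:decimal" % i)
    if suffix:
        features.append("suffix=" + suffix)
    return features


def url_patterns(features, max_size=3):
    """Form patterns as combinations of 1..max_size features (the cap is an assumption)."""
    patterns = []
    for size in range(1, max_size + 1):
        patterns.extend(combinations(features, size))
    return patterns
```

Capping the combination size keeps the number of patterns manageable; the text itself allows any combination of one or more features.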

At 510, URL pattern-based classification may be performed based on the set of URL patterns obtained at 508 so as to determine whether the web page is a potential video page. In an implementation, a URL pattern-based classification model corresponding to the top and popular video web site to which the web page belongs may be used for performing the classification at 510. In this case, the operation at 508 may alternatively include selecting the URL pattern-based classification model corresponding to the top and popular video web site among a number of URL pattern-based classification models built for a number of top and popular video web sites respectively. For example, the selecting may be performed based on the “domain” field of the URL. The set of URL patterns obtained at 508 may be used as input features for the URL pattern-based classification model. The URL pattern-based classification model may be built through machine learning in advance, for determining whether a web page is a potential video page based on URL patterns obtained from the URL of the web page. The building of the URL pattern-based classification model through machine learning will be described later.

If it is determined not to apply the URL pattern-based approach at 506, the method 500 proceeds to 512. At 512, URL keyword parsing may be performed on the URL so as to obtain a set of URL keywords corresponding to the URL. For example, assuming that a URL is “http://www.edcba.com/video/show/145937.html”, then a set of URL keywords, such as, “video”, “show”, etc. may be extracted from the URL.
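A minimal sketch of the URL keyword parsing at 512 is given below. The tokenization rules are assumptions made for illustration: only the path and query are scanned, a trailing file extension is dropped, and keywords are taken to be runs of lowercase letters. The function name `url_keywords` is hypothetical.

```python
import re
from urllib.parse import urlsplit


def url_keywords(url):
    """Return alphabetic keywords from the URL path and query, in order, without duplicates."""
    parts = urlsplit(url)
    # Assumption: a trailing file extension such as '.html' is not a keyword.
    path = re.sub(r"\.[A-Za-z0-9]+$", "", parts.path)
    text = (path + " " + parts.query).lower()
    tokens = re.findall(r"[a-z]+", text)
    # dict.fromkeys preserves first-seen order while removing duplicates.
    return list(dict.fromkeys(tokens))
```

For the example URL above, this sketch yields the keywords “video” and “show”; the numeric segment contributes no keyword.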

At 514, URL keyword-based classification may be performed based on the set of URL keywords obtained at 512 so as to determine whether the web page is a potential video page. In an implementation, a URL keyword-based classification model built through machine learning may be used for performing the classification at 514. The set of URL keywords obtained at 512 may be used as input features for the URL keyword-based classification model. The URL keyword-based classification model may be built through machine learning in advance, for determining whether a web page is a potential video page based on URL keywords obtained from the URL of the web page. The building of the URL keyword-based classification model through machine learning will be described later.

At 516, if the web page is determined as a potential video page, the method 500 proceeds to 518. At 518, the determined potential video page is returned. Otherwise, if the web page is determined as not a potential video page, the method 500 ends at 520.

Although the method 500 ends at 520, the potential video page determined by the method 500 may be further provided to the operation 104 of the method 100 in FIG. 1, and thus the following operations of the method 100 may be applied on the potential video page.

As mentioned above, in the method 500, the URL pattern-based classification model and the URL keyword-based classification model may be built through machine learning. The embodiments of the present disclosure may adopt various techniques for performing machine learning. For example, in an implementation, one or more of Boosted Trees, Random Forest, Neural Network, and SVM may be used as a machine learning model.

It should be appreciated that URL pattern-based classification models are specific to respective top and popular video web sites. That is, for each top and popular video web site, a URL pattern-based classification model shall be built separately. In order to build a machine learning model that is to be used as the URL pattern-based classification model corresponding to a certain top and popular video web site, URLs of a number of training web pages on this web site that have been labeled as video pages or non-video pages may first be obtained. In an implementation, these training web pages may be manually labeled and provided. In another implementation, these training web pages may be automatically labeled and provided by use of the apparatus 200 for identifying video pages as described above. For example, for a certain top and popular video web site, the apparatus 200 may be used for classifying web pages on the web site into video pages or non-video pages. Accordingly, these web pages may be labeled and provided as training web pages for the URL pattern-based classification model. In this case, since the training web pages can be provided by use of the apparatus 200 for identifying video pages, the expense of training models may be reduced, and the training efficiency may be increased. URL patterns may be extracted from the URLs of these web pages in a similar way as the operation 508 in FIG. 5. Then, the extracted URL patterns and the labels of video pages or non-video pages may be inputted into the machine learning model as input training features. Based on the input training features, the machine learning model may be trained so as to establish a machine learning-based mechanism for determining a potential video page. The trained machine learning model may be used as the URL pattern-based classification model for determining whether a web page on the top and popular video web site is a potential video page.

For example, to determine whether a certain web page on a certain top and popular video web site is a potential video page, a URL pattern-based classification model corresponding to the web site may receive a set of URL patterns of the web page and utilize the built machine learning-based mechanism therein to return a determination result.
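The training and use of a per-site URL pattern-based classification model can be sketched in Python. To keep the example self-contained, a simple perceptron over binary pattern features stands in for the machine learning models named in the disclosure (Boosted Trees, Random Forest, Neural Network, SVM); it is not the disclosed model. The function names, the encoding of patterns as strings, and the tiny training set are hypothetical.

```python
from collections import defaultdict


def train_perceptron(samples, epochs=10):
    """Train on (patterns, label) pairs; label is 1 for video page, 0 for non-video page."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for patterns, label in samples:
            score = bias + sum(weights[p] for p in patterns)
            predicted = 1 if score > 0 else 0
            error = label - predicted
            if error:
                # Standard perceptron update: move weights toward the correct label.
                bias += error
                for p in patterns:
                    weights[p] += error
    return weights, bias


def is_potential_video_page(model, patterns):
    """Apply the trained model to the URL patterns of a new web page."""
    weights, bias = model
    return bias + sum(weights.get(p, 0.0) for p in patterns) > 0


# Hypothetical labeled training pages for one top and popular video web site.
training = [
    ({"path1=video", "path3:decimal"}, 1),
    ({"path1=video", "path2=cn"}, 1),
    ({"path1=news"}, 0),
    ({"path1=about"}, 0),
]
site_model = train_perceptron(training)
```

One such model would be trained per top and popular video web site, with labels supplied either manually or by the apparatus 200 as described above.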

The URL keyword-based classification model may be built through machine learning in a similar way as the URL pattern-based classification model, except that the machine learning model is trained with URL keywords extracted from the URLs of the training web pages as in the operation 512 in FIG. 5. For example, the trained machine learning model may include a number of learned keywords (such as “v”, “show”, “play”, “video”, “tv”, “vplay”, etc.) and corresponding weights of these keywords.
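One way learned keywords and their weights could be applied is sketched below. The linear scoring with a logistic function is an assumption chosen for illustration; the disclosure does not specify how the weights are combined. The weight values, the bias, the threshold, and the function names are all hypothetical.

```python
import math

# Hypothetical learned weights; the numeric values are illustrative only.
KEYWORD_WEIGHTS = {
    "v": 0.8, "show": 0.6, "play": 1.1, "video": 1.5, "tv": 0.9, "vplay": 1.3,
}
BIAS = -1.0  # hypothetical bias term


def potential_video_probability(keywords):
    """Sum the weights of known keywords and squash the score with a logistic function."""
    score = BIAS + sum(KEYWORD_WEIGHTS.get(k, 0.0) for k in keywords)
    return 1.0 / (1.0 + math.exp(-score))


def is_potential_video_page(keywords, threshold=0.5):
    """Flag the page as a potential video page when the probability reaches the threshold."""
    return potential_video_probability(keywords) >= threshold
```

Lowering the threshold in such a scheme would favor recall over precision, which may suit a pre-filter whose output is checked again by the page-based classification model.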

Through adopting a machine learning approach for building the URL pattern-based classification model and the URL keyword-based classification model, the present disclosure may establish a determining mechanism with high performance for determining a potential video page.

FIG. 6 illustrates an exemplary apparatus 600 for identifying video pages according to an embodiment of the present disclosure. The apparatus 600 is a further implementation of the apparatus 200 in FIG. 2.

As shown in FIG. 6, in addition to the structured content obtaining module 202, the property extractor 204 and the video page classifier 206, the apparatus 600 for identifying video pages may further include a URL parser 602 and a URL classifier 604. The URL parser 602 and the URL classifier 604 may be jointly configured for performing the method 500 for determining a potential video page based on URL as shown in FIG. 5.

The URL parser 602 may be configured for performing URL parsing on a URL of a web page. In an implementation, the URL parser 602 may include a URL pattern parser 612 and a URL keyword parser 614. For example, the URL pattern parser 612 may be configured for performing URL pattern parsing on a URL which directs to a web page on a top and popular video web site, and the URL keyword parser 614 may be configured for performing URL keyword parsing on a URL which does not direct to a web page on any top and popular video web site.

The URL classifier 604 may be configured for performing URL-based classification by using a URL-based classification model. In an implementation, the URL classifier 604 may further include a URL pattern-based classifier 616 and a URL keyword-based classifier 618, and the URL-based classification model may further include a URL pattern-based classification model and a URL keyword-based classification model. For example, the URL pattern-based classifier 616 may include the URL pattern-based classification model, and may be configured for performing URL pattern-based classification by using the URL pattern-based classification model. The URL keyword-based classifier 618 may include the URL keyword-based classification model, and may be configured for performing URL keyword-based classification by using the URL keyword-based classification model.

The potential video page determined by the URL classifier 604 may be further provided to the subsequent modules for performing corresponding processes. For example, the structured content obtaining module 202 may obtain structured contents of the potential video page determined by the URL classifier 604.
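The hand-off between the modules of the apparatus 600 can be sketched as a simple pipeline. The function name `identify_video_pages` and the four callables passed to it are hypothetical stand-ins for the URL classifier 604, the structured content obtaining module 202, the property extractor 204 and the video page classifier 206; their internals are not modeled here.

```python
def identify_video_pages(urls, is_potential_video_page, fetch_structured_contents,
                         extract_properties, classify_page):
    """Run the URL-based pre-filter first, then the page-based classification."""
    video_pages = []
    for url in urls:
        if not is_potential_video_page(url):
            # Pages rejected by the URL classifier are never crawled or parsed.
            continue
        contents = fetch_structured_contents(url)
        properties = extract_properties(contents)
        if classify_page(properties):
            video_pages.append(url)
    return video_pages
```

Because the URL check runs before any crawling, only potential video pages incur the cost of obtaining structured contents, which is the efficiency gain described for the method 500.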

FIG. 7 illustrates an exemplary system 700 for identifying video pages according to an embodiment of the present disclosure. The system 700 may comprise one or more processors 702. The system 700 may further comprise a memory 704 that is connected with the one or more processors 702. The memory 704 may store computer-executable instructions that, when executed, cause the one or more processors 702 to perform any operations of the methods for identifying video pages according to the embodiments of the present disclosure as mentioned above.

The embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for identifying video pages according to the embodiments of the present disclosure as mentioned above.

It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.

It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.

Processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, microcontroller, DSP, or other suitable platform.

Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc. The software may reside on a computer-readable medium. A computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip) , an optical disk, a smart card, a flash memory device, random access memory (RAM) , read only memory (ROM) , programmable ROM (PROM) , erasable PROM (EPROM) , electrically erasable PROM (EEPROM) , a register, or a removable disk. Although memory is shown separate from the processors in the various aspects presented throughout the present disclosure, the memory may be internal to the processors (e.g., cache or register) .

The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.

Claims

1. A method for identifying video pages, comprising:

obtaining structured contents of a web page;

extracting video object properties from the structured contents; and

determining whether the web page is a video page by using a page-based classification model built through machine learning, the video object properties being used as input features for the page-based classification model.

2. The method of claim 1, prior to the obtaining, further comprising:

determining that a URL of the web page directs to a potential video page, by using a URL keyword-based classification model or a URL pattern-based classification model built through machine learning.

3. The method of claim 1, wherein the obtaining comprises:

crawling a source code document of the web page in a static crawling mode; and

parsing the source code document into the structured contents.

4. The method of claim 1, wherein the obtaining comprises:

crawling a source code document and at least one scripting file of the web page in a dynamic crawling mode; and

parsing the source code document and the at least one scripting file into the structured contents.

5. The method of claim 4, further comprising:

executing the at least one scripting file, and

wherein the parsing comprises parsing the source code document and the executed at least one scripting file into the structured contents.

6. The method of claim 1, wherein the structured contents are represented as a DOM tree.
7. The method of claim 1, wherein the video object properties comprise at least one of: layout information, video type information, container type information, depth information of a video object in a DOM tree, and text information associated with the video object and/or at least one container for the video object.

8. The method of claim 7, wherein the layout information comprises at least one of:

(a) one or more of width, height, top and left of the video object;

(b) one or more of width, height, top and left of at least one container for the video object;

(c) distance between the video object’s horizontal center and the web page’s horizontal center; and

(d) distance between the video object’s top or bottom and the web page’s top or bottom.

9. The method of claim 8, wherein the layout information is normalized by use of height or width of the web page.

10. The method of claim 1, wherein the extracting comprises:

detecting video objects from the structured contents based on video identification information.

11. The method of claim 10, wherein the video identification information comprises at least one of: HTML tags embedding a video object directly, HTML5 video tags embedding a video object directly, and Iframe tags embedding a video page.

12. The method of claim 1, wherein the web page includes a plurality of video objects, and the extracting further comprises:

extracting video object properties of one or more of the plurality of video objects from the structured contents.

13. The method of claim 1, wherein the page-based classification model is built by using one or more of Boosted Trees, Random Forest, Neural Network, and Support Vector Machine (SVM) as a machine learning model during the machine learning.

14. The method of claim 1, wherein the page-based classification model is built by performing the machine learning on video object properties of a number of web pages with labels of video pages or non-video pages and the labels of the number of web pages.
15. An apparatus for identifying video pages, comprising:

a structured content obtaining module, for obtaining structured contents of a web page;

a property extractor, for extracting video object properties from the structured contents; and

a video page classifier, for determining whether the web page is a video page by using a page-based classification model built through machine learning, the video object properties being used as input features for the page-based classification model.

16. The apparatus of claim 15, further comprising:

a URL classifier, for determining that a URL of the web page directs to a potential video page, by using a URL keyword-based classification model or a URL pattern-based classification model built through machine learning.

17. The apparatus of claim 15, wherein the video object properties comprise at least one of: layout information, video type information, container type information, depth information of a video object in a DOM tree, and text information associated with the video object and/or at least one container for the video object.

18. The apparatus of claim 17, wherein the layout information comprises at least one of:

(a) one or more of width, height, top and left of the video object;

(b) one or more of width, height, top and left of at least one container for the video object;

(c) distance between the video object’s horizontal center and the web page’s horizontal center; and

(d) distance between the video object’s top or bottom and the web page’s top or bottom.

19. The apparatus of claim 18, wherein the web page includes a plurality of video objects, and the property extractor is further configured for:

extracting video object properties of one or more of the plurality of video objects from the structured contents.

20. A system for identifying video pages, comprising:

one or more processors; and

a memory, storing computer-executable instructions that, when executed, cause the one or more processors to perform the method according to any one of claims 1-14.