CN108475275A - Identify video page - Google Patents


Info

Publication number
CN108475275A
Authority
CN
China
Prior art keywords
video
page
web page
video object
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201680077528.6A
Other languages
Chinese (zh)
Inventor
A·J·K·塔姆比拉纳姆
韩博
巢望礼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC
Publication of CN108475275A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The present disclosure provides methods, apparatuses, and systems for identifying video pages. To identify a video page, the structured content of a web page on the Internet may first be obtained. Video object attributes may be extracted from the structured content of the web page. A page-based classification model built via machine learning may be used to determine whether the web page is a video page. The video object attributes are used as input features of the page-based classification model.

Description

Identifying video pages
Background
Search engines are widely used by Internet users to search for content of interest. In some cases, a user may wish to receive only search results of a certain type, such as video pages. For example, a user may want to query video content, and in response to the query the search engine can return a list of video pages associated with the queried video content. Herein, a video page denotes a web page that includes at least one video and presents at least one video as its main content.
To respond to queries for video content, a search engine provider should establish in advance, for example, a database or big table (Big Table) containing an index of video pages on the Internet, where the database or big table may be based on local or distributed storage. Building the database or big table first requires identifying the video pages on the Internet. Several approaches to identifying video pages already exist, for example template-based, URL-based, and sitemap-based approaches. In general, these approaches target a small range of video websites, such as popular mainstream video websites, and depend on manually defined rule information about the video pages on those websites, such as template rules and URL rules. For example, in the template-based or URL-based approach, a video website may manually design rules for its video pages, for example page content template rules, page layout template rules, or URL rules. A search engine provider can then derive rule information from some video pages on the website and use that rule information to identify further video pages on the same website. In another example, in the sitemap-based approach, the operator of a popular mainstream video website may first provide the search engine provider with a list of the website's web pages together with metadata identifying whether each web page is a video page. The search engine provider can then use the metadata and the website's sitemap to identify the video pages on that website.
The accuracy of identifying video pages with the existing approaches depends on manually defined rule information. In addition, the range of video pages that can be detected is generally confined to a small number of popular mainstream video websites.
Summary
This Summary is provided to introduce, in simplified form, a selection of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.
Embodiments of the present disclosure can provide methods, apparatuses, and systems for identifying video pages.
In one aspect, the present disclosure provides a method for identifying video pages. According to the method, the structured content of a web page may be obtained. Video object attributes may be extracted from the structured content. A page-based classification model built via machine learning may be used to determine whether the web page is a video page. The video object attributes are used as input features of the page-based classification model.
In another aspect, the present disclosure provides an apparatus for identifying video pages. The apparatus may include a structured content obtaining module, an attribute extractor, and a video page classifier. The structured content obtaining module may be configured to obtain the structured content of a web page. The attribute extractor may be configured to extract video object attributes from the structured content. The video page classifier may be configured to determine whether the web page is a video page by using a page-based classification model built via machine learning, with the video object attributes serving as input features of the page-based classification model.
In another aspect, the present disclosure provides a system for identifying video pages. The system may include one or more processors and a memory. The memory may store computer-executable instructions that, when executed, cause the one or more processors to perform any step of the methods according to the various aspects of the present disclosure.
In another aspect, the present disclosure provides a non-volatile computer-readable medium. The non-volatile computer-readable medium may include instructions that, when executed, cause one or more processors to perform any step of the methods according to the various aspects of the present disclosure.
It should be noted that the one or more aspects above include the features described in detail below and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features indicate only some of the various ways in which the principles of the various aspects may be implemented, and the present disclosure is intended to include all such aspects and their equivalents.
Brief description of the drawings
The disclosed aspects are described below with reference to the drawings, which are provided to illustrate and not to limit the disclosed aspects.
Fig. 1 is a flowchart of an exemplary method for identifying video pages according to an embodiment of the present disclosure.
Fig. 2 shows an exemplary apparatus for identifying video pages according to an embodiment of the present disclosure.
Fig. 3 is a flowchart of an exemplary method for obtaining the structured content of a web page according to an embodiment of the present disclosure.
Fig. 4 shows an exemplary apparatus for identifying video pages according to an embodiment of the present disclosure.
Fig. 5 is a flowchart of an exemplary method for determining potential video pages based on URLs according to an embodiment of the present disclosure.
Fig. 6 shows an exemplary apparatus for identifying video pages according to an embodiment of the present disclosure.
Fig. 7 shows an exemplary system for identifying video pages according to an embodiment of the present disclosure.
Detailed description
The present disclosure is now discussed with reference to various exemplary embodiments. It should be understood that these embodiments are discussed only to enable those skilled in the art to better understand and thereby implement the embodiments of the present disclosure, and do not imply any limitation on the scope of the present disclosure.
Embodiments of the present disclosure can provide methods, apparatuses, and systems for identifying video pages. The embodiments can be applied to various search engines. For example, by implementing embodiments of the present disclosure, video pages on the Internet can be identified efficiently and then added to a database or big table that maintains the video page index for a search engine. When a user queries some video content through the search engine, the search engine can then retrieve video page index entries from the database or big table and return them to the user.
In one aspect, the present disclosure proposes using a page-based classification model to identify video pages. The page-based classification model may be built via machine learning. Various machine learning algorithms may be applied to various video object attributes of web pages to build the page-based classification model. Using a page-based classification model built via machine learning can improve both the accuracy and the recall of video page identification. In another aspect, the present disclosure also proposes using a URL-based classification model to determine potential video pages. The URL-based classification model may be built in advance via machine learning. Using the URL-based classification model together with the page-based classification model can further improve the accuracy and recall of video page identification and can significantly improve its efficiency.
Embodiments of the present disclosure can perform video page identification over a wide range of web pages on the Internet and are not limited to a small number of popular mainstream video websites. More video pages can therefore be identified for search engines. Embodiments of the present disclosure can also perform video page identification automatically, so that video pages on the Internet can be identified continuously.
It should be understood that the exemplary environment described above is for illustrative purposes only and does not imply any limitation on the scope of the present disclosure. The present disclosure may be implemented with different structures and/or functions.
Fig. 1 is a flowchart of an exemplary method 100 for identifying video pages according to an embodiment of the present disclosure.
Method 100 starts at 102 and proceeds to 104. At 104, the structured content of a web page may be obtained. As is well known, a web page may be implemented in various markup languages, such as HTML and XML. The source code document of the web page, such as an HTML document, can be converted into structured content. Embodiments of the present disclosure may use any approach to obtain the structured content of a web page. In one implementation, obtaining the structured content of the web page at 104 may simply mean receiving the structured content. In that case, the structured content may be provided by any web page technology or any third party. In another implementation, obtaining the structured content at 104 may include the process of generating it. For example, Document Object Model (DOM) technology, which provides a structured representation of documents such as HTML and XML documents, may be used to generate the structured content. In this case, a DOM tree corresponding to the web page may be obtained at 104. That is, the structured content can be represented as a DOM tree. Herein, the web page whose structured content is obtained may be any web page on the Internet. Optionally, the web page may be a potential video page determined in a specific manner described below.
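As an illustration of generating structured content at 104, the following minimal sketch builds a simplified DOM-like tree from an HTML source document using Python's standard `html.parser` module. The `Node`, `DomBuilder`, and `build_dom` names are hypothetical helpers introduced here for illustration; the disclosure does not prescribe any particular parser, and a production system would more likely rely on a full browser-grade DOM implementation.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML, so they must not stay open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a simplified DOM tree (hypothetical helper)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a Node tree while the HTML source is parsed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self._stack[-1])
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break


def build_dom(html):
    """Return the root of a simplified DOM tree for an HTML string."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

The tree keeps only element tags and attributes, which is enough to locate video objects and their containers in the later steps; text nodes and computed layout are deliberately left out of this sketch.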
At 106, video object attributes may be extracted from the structured content. In general, the structured content of a web page may include various attributes of the objects in the web page, for example logical structure information, layout information, or type information. The attributes of a video object in the web page can therefore be obtained from the structured content.
If the web page includes only one video object, the attributes of that video object may be obtained at 106. If the web page includes multiple video objects, however, the attributes of one or more of those video objects may be obtained at 106. For example, in one implementation, the attributes of all or some of the multiple video objects may be obtained. In another implementation, only the attributes of the video object with the largest size among the multiple video objects may be obtained.
In one implementation, the operation at 106 may include first detecting a video object in the structured content based on video identifier information. A video object denotes a video unit of some video type in a web page. Common video types may include webm, ogg, mp4, avi, flv, and so on. The video identifier information may be any kind of information capable of indicating a video object in the structured content. For example, the video identifier information may be an HTML tag that directly embeds a video object, or an HTML5 video tag. The video identifier information may also be an iframe tag in which a video page is embedded. Embodiments of the present disclosure are not limited to any particular type of video object or any particular type of video identifier information.
After a video object is detected, the operation at 106 may further obtain the attributes of the detected video object from the structured content. As described above, if the web page includes multiple video objects, the operation at 106 may obtain the attributes of one or more of the multiple video objects detected in the structured content.
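The detection step can be sketched as follows, assuming the video identifier information is limited to a few embedding tags and file extensions. The tag set `EMBEDDING_TAGS`, the rule that an iframe counts when its `src` contains the word "video", and the helper names are assumptions made for this example only; the disclosure leaves the kinds of video identifier information open.

```python
from html.parser import HTMLParser

# Video types named in the description.
VIDEO_EXTENSIONS = (".webm", ".ogg", ".mp4", ".avi", ".flv")
# Assumed set of tags that can embed a video object.
EMBEDDING_TAGS = {"video", "embed", "object", "iframe"}


class VideoDetector(HTMLParser):
    """Collects candidate video objects while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in EMBEDDING_TAGS:
            return
        attrs = dict(attrs)
        src = attrs.get("src") or attrs.get("data") or ""
        path = src.lower().split("?")[0]
        is_video_file = path.endswith(VIDEO_EXTENSIONS)
        # Assumed heuristic: an iframe embeds a video page if its URL says so.
        is_video_iframe = tag == "iframe" and "video" in src.lower()
        if tag == "video" or is_video_file or is_video_iframe:
            self.found.append({"tag": tag, "src": src,
                               "width": attrs.get("width"),
                               "height": attrs.get("height")})


def detect_videos(html):
    """Return the list of detected video objects in document order."""
    detector = VideoDetector()
    detector.feed(html)
    detector.close()
    return detector.found


def area(video):
    """Pixel area from width/height attributes, or 0 when they are unusable."""
    try:
        return int(video["width"]) * int(video["height"])
    except (TypeError, ValueError):
        return 0
```

When a page contains several video objects, `max(detect_videos(html), key=area)` selects the largest one, which corresponds to the implementation that keeps only the video object with the largest size.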
In one implementation, the extracted video object attributes may include layout information associated with the video object in the web page. The layout information may be one or more of width, height, and position, any combination of these, or any information derived from them.
For example, in one implementation, the layout information may include one or more of the width, height, top, and left of the video object. In general, the width, height, top, and left of the video object are defined in the structured content and can therefore be extracted from it directly. Otherwise, they can be derived in various ways. In another implementation, the layout information may include one or more of the width, height, top, and left of at least one container corresponding to the video object. The at least one container corresponding to the video object may include one or more containers, at different levels, that hold the video object. In general, the width, height, top, and left of the at least one container are defined in the structured content and can therefore be extracted from it directly. Otherwise, they can be derived in various ways. In another implementation, the layout information may include the distance between the horizontal center of the video object and the horizontal center of the web page. This distance can be calculated from the relative position of the video object and the web page. In another implementation, the layout information may include the distance between the top or bottom of the video object and the top or bottom of the web page. This distance may be, for example, the distance between the top of the video object and the top of the web page, between the bottom of the video object and the bottom of the web page, between the top of the video object and the bottom of the web page, or between the bottom of the video object and the top of the web page. It should be understood that the layout information may include at least one of the foregoing examples. In addition, in some cases the layout information above may take a normalized form. For example, the width or height of the web page may be used to normalize the layout information.
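The layout quantities described above reduce to simple arithmetic on bounding boxes. The sketch below is one possible way to compute them in normalized form, assuming the boxes are already known in page coordinates; the `Box` type, the feature names, and the choice to normalize horizontal values by page width and vertical values by page height are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box in page coordinates (pixels)."""
    left: float
    top: float
    width: float
    height: float


def layout_features(video, page):
    """Normalized layout information for a video object within a web page."""
    video_center_x = video.left + video.width / 2
    page_center_x = page.left + page.width / 2
    video_bottom = video.top + video.height
    page_bottom = page.top + page.height
    return {
        # Size of the video object relative to the page.
        "width": video.width / page.width,
        "height": video.height / page.height,
        # Distance between the two horizontal centers.
        "center_offset": abs(video_center_x - page_center_x) / page.width,
        # Distance from the top of the video object to the top of the page.
        "top_to_page_top": (video.top - page.top) / page.height,
        # Distance from the bottom of the video object to the bottom of the page.
        "bottom_to_page_bottom": (page_bottom - video_bottom) / page.height,
    }
```

Normalizing by page size keeps the features comparable between a short page and a very tall one, which matters for the first-screen versus bottom-of-page distinction discussed later in the description.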
In one implementation, the video object attributes may include video type information. For example, the type of the video object can be determined from the structured content. As described above, the video type may be flv, webm, ogg, mp4, and so on.
In one implementation, the video object attributes may include container type information. For example, the type of at least one container corresponding to the video object can be determined from the structured content. The container type may be, for example, div, p, a, span, and so on.
In one implementation, the video object attributes may include depth information of the video object in the DOM tree. As described above, the structured content can be represented as a DOM tree. The depth of the video object in the DOM tree may be determined in various ways.
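One straightforward way to determine the depth is to count the open ancestor elements at the moment the video element is encountered during parsing. The sketch below does this with Python's `html.parser`; it assumes reasonably well-formed markup and counts the root element as depth 1, both of which are conventions chosen for this example rather than requirements of the disclosure.

```python
from html.parser import HTMLParser

# Void elements never take an end tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DepthFinder(HTMLParser):
    """Records the DOM depth of every element with the target tag."""

    def __init__(self, target="video"):
        super().__init__()
        self.target = target
        self.depth = 0            # number of currently open elements
        self.target_depths = []

    def handle_starttag(self, tag, attrs):
        if tag == self.target:
            self.target_depths.append(self.depth + 1)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def video_depths(html, target="video"):
    """Return the depth of each target element, in document order."""
    finder = DepthFinder(target)
    finder.feed(html)
    finder.close()
    return finder.target_depths
```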
In one implementation, the video object attributes may include text information associated with the video object and/or with at least one container corresponding to the video object. The text information may denote, for example, the "class" or "id" name of a tag associated with the video object and/or with the at least one container corresponding to the video object, or any other text.
It should be understood that the video object attributes may include at least one of the following: layout information, video type information, container type information, depth information of the video object in the DOM tree, and text information associated with the video object and/or with at least one container corresponding to the video object.
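Before these attributes can be fed to a classification model, they typically have to be encoded as numbers. The following sketch shows one possible encoding, assuming one-hot vectors for the type attributes and keyword flags for the text information. The dictionary keys, the vocabulary lists, and the keyword "player" are assumptions made for illustration; "video" and "gallery" echo the examples used elsewhere in the description.

```python
VIDEO_TYPES = ["flv", "webm", "ogg", "mp4", "avi"]
CONTAINER_TYPES = ["div", "p", "a", "span"]
# Assumed keyword list for the "class"/"id" text of the video object or container.
KEYWORDS = ["video", "player", "gallery"]


def one_hot(value, vocabulary):
    """Encode a categorical value as a 0/1 vector over a fixed vocabulary."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def to_feature_vector(attrs):
    """Flatten video object attributes into a numeric feature vector."""
    text = attrs.get("text", "").lower()
    return (
        # Layout information (assumed to be normalized already).
        [attrs["width"], attrs["height"],
         attrs["center_offset"], attrs["top_offset"]]
        # Depth of the video object in the DOM tree.
        + [float(attrs["dom_depth"])]
        # Video type and container type.
        + one_hot(attrs["video_type"], VIDEO_TYPES)
        + one_hot(attrs["container_type"], CONTAINER_TYPES)
        # Text information: one flag per keyword found in the text.
        + [1.0 if keyword in text else 0.0 for keyword in KEYWORDS]
    )
```

An unknown video or container type simply yields an all-zero one-hot block, so the vector length stays fixed regardless of what the page contains.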
At 108, whether the web page is a video page may be determined by using the page-based classification model. The video object attributes obtained at 106 are used as input features of the page-based classification model. The page-based classification model may be built in advance via machine learning and is used to determine, based on the video object attributes extracted for a web page, whether that web page is a video page. Building the page-based classification model via machine learning is described later.
According to the present disclosure, web pages that contain a video as their main content tend to be determined to be video pages. Web pages that contain a video but do not present it as main content tend to be determined to be non-video pages, because the video may be, for example, an advertisement video and may not be what the user wants. Where a web page contains multiple videos, the web page tends to be determined to be a video page if at least one of the videos is main content.
For example, if a video object is displayed at or near the horizontal center of the web page, or has a relatively large size even though it is not near the horizontal center, the web page will be determined to be a video page with higher probability. If, however, the video object is displayed near the edge of the web page and has a relatively small size, or has a relatively small size even though it is near the horizontal center, the web page will be determined to be a non-video page with higher probability. As another example, if the web page has a large vertical size and the video object is displayed at the bottom of the web page, the web page may still be determined to be a non-video page with higher probability even if the video object is near the horizontal center. If, however, the video object is displayed within the first screen of the web page and near its horizontal center, the web page will be determined to be a video page with higher probability. The page-based classification model can thus use the layout information in the video object attributes to determine video pages.
In addition, the page-based classification model can use the depth of the video object in the DOM tree, taken from the video object attributes, to determine video pages. For example, if it is known that for most video pages the depth of the video object in the DOM tree lies in the range of 5 to 7, a web page whose video object has a depth in that range is more likely to be determined to be a video page. The page-based classification model can also use the type of the video object or the type of the at least one container, taken from the video object attributes, to determine video pages. Furthermore, the page-based classification model can use the text information associated with the video object and/or with the at least one container corresponding to the video object to determine video pages. For example, if the "id" of a tag associated with the video object is named "video_stage", the web page is very likely to be a video page because the word "video" is used, whereas if the "id" name contains "gallery", the web page may be determined to be a non-video page.
Method 100 ends at 110. It should be understood, however, that the determination result at 108 may be used in any further application scenario. For example, if method 100 identifies the web page as a video page, the web page may be indexed and added to the database or big table that maintains the video page index for the search engine.
As described above, in method 100 the page-based classification model may be built via machine learning. Embodiments of the present disclosure may use various techniques to perform the machine learning. For example, in one implementation, one or more of boosted trees (Boosted Tree), random forests, neural networks, and support vector machines (SVM) may be used as the machine learning model.
To build the machine learning model that will serve as the page-based classification model, a large number of training web pages (for example, thousands of web pages or more) may first be manually labeled as video pages or non-video pages. Video object attributes may be extracted from these web pages in a manner similar to operations 104 and 106 in Fig. 1. The extracted video object attributes of each web page, together with the video page or non-video page label determined for that web page, may then be input to the machine learning model as input training features. The input training features may take any form that the machine learning model can interpret. For example, if the input training features take numeric form, the video object attributes and labels may be converted to numbers in advance. Based on the input training features, the machine learning model can be trained to establish a machine-learning-based mechanism for determining video pages. The trained machine learning model can then serve as the page-based classification model for determining whether a web page is a video page. For example, to determine whether some web page is a video page, the page-based classification model can receive the video object attributes of the web page and return a determination result using the machine-learning-based mechanism built into it.
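The training procedure can be illustrated with a deliberately small stand-in model. The sketch below trains a single logistic unit (the simplest neural-network case) by batch gradient descent in pure Python. It is not one of the boosted-tree, random-forest, or SVM models a real system would more plausibly use, and the function names, hyperparameters, and two-feature input of normalized width and center offset are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(model, features):
    """Probability that a page with these features is a video page."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, features)))


def train(samples, labels, learning_rate=1.0, epochs=20000):
    """Fit a logistic unit on labeled feature vectors (1 = video page)."""
    n = len(samples)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = score((weights, bias), x) - y   # gradient of log loss
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def is_video_page(model, features):
    return score(model, features) >= 0.5


# Synthetic, hand-made training data for illustration only:
# each sample is (normalized video width, normalized center offset).
TRAIN_X = [(0.60, 0.02), (0.50, 0.05), (0.70, 0.00), (0.55, 0.10),
           (0.20, 0.40), (0.15, 0.35), (0.25, 0.30), (0.10, 0.45)]
TRAIN_Y = [1, 1, 1, 1, 0, 0, 0, 0]
```

Calling `model = train(TRAIN_X, TRAIN_Y)` and then `is_video_page(model, features)` mirrors the two phases in the text: manually labeled pages are turned into numeric training features, and the trained model later receives the attributes of an unseen page and returns a determination.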
By building the page-based classification model through machine learning, the present disclosure can establish a mechanism for determining video pages that performs better than the existing manually defined rules. The present disclosure can therefore improve the accuracy and recall of video page identification.
Fig. 2 shows an exemplary apparatus 200 for identifying video pages according to an embodiment of the present disclosure. In one implementation, apparatus 200 may be configured to perform the operations of method 100.
Apparatus 200 may include a structured content obtaining module 202, an attribute extractor 204, and a video page classifier 206. The structured content obtaining module 202 may be configured to obtain the structured content of a web page. For example, the structured content obtaining module 202 may perform operation 104 of method 100. The attribute extractor 204 may be configured to extract video object attributes from the structured content. For example, the attribute extractor 204 may perform operation 106 of method 100. The video page classifier 206 may include the page-based classification model built via machine learning and may be configured to determine whether the web page is a video page by using the page-based classification model. For example, the video page classifier 206 may perform operation 108 of method 100.
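The division of apparatus 200 into three cooperating components can be sketched as a small pipeline. In the example below each component is passed in as a plain callable; the class name `VideoPageIdentifier` and the callable signatures are hypothetical and only illustrate how the output of one component becomes the input of the next.

```python
class VideoPageIdentifier:
    """Chains the three components of apparatus 200 (illustrative sketch)."""

    def __init__(self, obtain_structured_content, extract_attributes, classify):
        self.obtain_structured_content = obtain_structured_content  # module 202
        self.extract_attributes = extract_attributes                # extractor 204
        self.classify = classify                                    # classifier 206

    def is_video_page(self, url):
        content = self.obtain_structured_content(url)    # operation 104
        attributes = self.extract_attributes(content)    # operation 106
        return self.classify(attributes)                 # operation 108
```

Keeping the components behind simple callables makes it easy to swap, for example, a static parser for a dynamic one without touching the extractor or the classifier.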
Fig. 3 is a flowchart of an exemplary method 300 for obtaining the structured content of a web page according to an embodiment of the present disclosure. Method 300 is an exemplary implementation of operation 104 in method 100. It should be understood that operation 104 in method 100 may also be implemented in any other way.
Method 300 starts at 302 and proceeds to 304. At 304, a crawl operation may be performed on the web page based on the URL of the web page. In embodiments of the present disclosure, web pages on the Internet may be classified as static web pages or dynamic web pages. For a static web page, once the corresponding source code (for example, HTML code) has been generated, the content and display of the page do not change unless the source code is modified. For a dynamic web page, at least part of the displayed content can change with time, database operations, and so on, even if its source code is not modified. Dynamic web pages may be generated by combining HTML with other high-level programming languages (for example, Java, C#, C++, and so on). For example, a dynamic web page may include program code segments, and running those segments can perform interaction with a back-end database, a web server, or the user.
Embodiments of the present disclosure may use various crawl modes. A web page may be crawled in a static crawl mode or a dynamic crawl mode. In general, the static crawl mode may be used to crawl static web pages and the dynamic crawl mode may be used to crawl dynamic web pages. In some cases, however, the static crawl mode may be used instead of the dynamic crawl mode, for example when the source code of a web page with dynamic content can generate a video player in a standard web browser from a single source code document. In some implementations, the "domain" field in the URL of a web page may be used to determine which crawl mode to apply to the web page. The static crawl mode and the dynamic crawl mode yield different crawl results. For example, if the web page is a static web page, operation 304 may crawl a single source code document of the static web page, for example an HTML document, in the static crawl mode. If the web page is a dynamic web page, operation 304 may usually crawl the source code document and at least one script file of the dynamic web page in the dynamic crawl mode. A script file may be a file that helps implement the dynamic display of certain objects in the dynamic web page. For example, a script file may be or may include a program code (for example, Javascript code) segment that, when run, performs interaction with a back-end database, a web server, or the user. In addition, as described above, in some cases operation 304 may also crawl a single source code document of a dynamic web page in the static crawl mode.
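Choosing the crawl mode from the "domain" field of the URL can be sketched as a simple lookup. The set of domains known to need dynamic crawling below is a hypothetical placeholder; the disclosure does not say how such a list would be obtained or maintained.

```python
from urllib.parse import urlparse

# Hypothetical list of domains whose pages are assumed to need dynamic crawling.
DYNAMIC_DOMAINS = {"dynamic.example.com", "player.example.org"}


def choose_crawl_mode(url, dynamic_domains=DYNAMIC_DOMAINS):
    """Return "dynamic" or "static" based on the domain of the URL."""
    host = (urlparse(url).hostname or "").lower()
    return "dynamic" if host in dynamic_domains else "static"
```

Defaulting to the static mode keeps the cheaper path for unknown domains, which matches the observation that some pages with dynamic content can still be crawled statically.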
At 306, it may be determined whether to perform static parsing or dynamic parsing. The determination at 306 may be made in various ways. For example, if a single source code document was crawled at 304, it may be determined at 306 to perform static parsing, and method 300 proceeds to 308. At 308, the source code document may be parsed into structured content.
Otherwise, if a source code document and at least one script file were crawled at 304, it may be determined at 306 to perform dynamic parsing, and method 300 proceeds to 310. At 310, the source code document and the at least one script file may be parsed into structured content.
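The decision at 306 depends only on what the crawl returned. A minimal sketch, assuming the crawl result is represented as a source document plus a possibly empty list of script files, might look like this; the `CrawlResult` type and the two parser callables are hypothetical stand-ins for operations 308 and 310.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlResult:
    """What the crawl at 304 produced for one web page (hypothetical type)."""
    source_document: str
    script_files: list = field(default_factory=list)


def choose_parse_mode(result):
    """Static parsing for a single source document, dynamic parsing otherwise."""
    return "dynamic" if result.script_files else "static"


def parse(result, static_parser, dynamic_parser):
    """Dispatch to the parser matching the crawl result (operations 308 / 310)."""
    if choose_parse_mode(result) == "static":
        return static_parser(result.source_document)
    return dynamic_parser(result.source_document, result.script_files)
```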
In one implementation, the at least one script file may be run automatically to assist the parsing operation at 310. That is, the parsing operation at 310 may include parsing the source code document and the executed at least one script file into structured content. For example, if a script file is a Javascript file, the code in the Javascript file may be run so that further object information of the web page can be obtained and used to generate the structured content. As an example, where a dynamic web page includes a play button corresponding to a script file, a click on the play button may be simulated automatically. Such an operation may cause the script file to be run directly, or cause another HTTP request to be sent to the server to obtain an updated dynamic web page that may include a video. When the updated dynamic web page is received, the crawl and parsing operations described above may be applied to it in turn.
Method 300 ends at 312. It should be understood, however, that the structured content obtained at 308 or 310 may in turn be provided to operation 106 in method 100 of Fig. 1 for subsequent processing.
Fig. 4 shows an exemplary device 400 for identifying video pages according to an embodiment of the present disclosure. Device 400 is a further implementation of device 200 in Fig. 2.
As shown in Fig. 4, the structured content acquisition module 202 may further include a page grabber 402 and a page parser 404. The page grabber 402 and the page parser 404 may be jointly configured to perform the operations of method 300.
The page grabber 402 may be configured to crawl a web page based on the URL of the web page. For example, the page grabber 402 may perform operation 304 in method 300. In the static crawling mode, the page grabber 402 may crawl the source code document of the web page; in the dynamic crawling mode, it may crawl the source code document of the web page and at least one script file.
The page parser 404 may be configured to parse the documents or files crawled by the page grabber 402 so as to generate the structured content of the web page. For example, the page parser 404 may perform any of operations 306, 308 and 310 of method 300.
The page parser 404 may determine whether to perform static parsing or dynamic parsing. If the page grabber 402 crawled a single source code document, the page parser 404 may parse the source code document into structured content. If the page grabber 402 crawled a source code document and at least one script file, the page parser 404 may parse the source code document and the at least one script file into structured content.
The attribute extractor 204 and the video page classifier 206 in Fig. 4 are the same as the attribute extractor 204 and the video page classifier 206 shown in Fig. 2.
Returning to Fig. 1, method 100 may be applied to any web page on the Internet. Optionally, in some implementations, method 100 may be applied only to potential video pages. By determining potential video pages before applying the page-based method for identifying video pages shown in Fig. 1, the present disclosure may greatly reduce the number of web pages to be processed and thereby significantly improve the efficiency of identifying video pages. In addition, by determining whether a web page is a potential video page, the present disclosure may further improve identification accuracy and ensure a higher recall rate. Potential video pages may be determined in various ways. According to an embodiment of the present disclosure, the URL of a web page may be used to determine whether the web page is a potential video page.
All websites on the Internet may be divided into two groups: one group includes mainstream popular video websites, and the other includes all other websites. For the mainstream popular video websites, a URL-pattern-based approach may be used to determine potential video pages. For example, a set of URL patterns obtained from the URL of a web page on a given mainstream popular video website may be used to determine whether the web page is a potential video page. For websites other than the mainstream popular video websites, a URL-keyword-based approach may be used to determine potential video pages. For example, a set of URL keywords obtained from the URL of a web page on such a website may be used to determine whether the web page is a potential video page.
Fig. 5 is a flowchart of an exemplary method 500 for determining potential video pages based on URLs according to an embodiment of the present disclosure.
Method 500 starts at 502 and proceeds to 504. At 504, the URL of a web page may be obtained. The URL may be obtained in various ways. For example, a web crawler may be used to automatically obtain the URLs of web pages on the Internet. In this case, method 500 may be triggered each time a new URL is obtained.
At 506, it may be determined whether potential video pages are to be determined by the URL-pattern-based approach or by the URL-keyword-based approach. For example, the "domain" field of the URL obtained at 504 may be extracted and used to determine whether the URL points to a web page on a mainstream popular video website, in which case the URL-pattern-based approach should be applied.
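The branch at 506 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the set MAINSTREAM_VIDEO_DOMAINS and the function names are hypothetical, and deriving the "domain" field by stripping a leading "www." from the host name is a simplification chosen for the sketch.

```python
from urllib.parse import urlparse

# Hypothetical list of mainstream popular video websites (assumption).
MAINSTREAM_VIDEO_DOMAINS = {"abcde.com"}


def extract_domain(url):
    """Return the host name without a leading 'www.' (simplified 'domain' field)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def choose_approach(url):
    """Operation 506: pick the URL-pattern or URL-keyword approach by domain."""
    if extract_domain(url) in MAINSTREAM_VIDEO_DOMAINS:
        return "url_pattern"
    return "url_keyword"
```

A URL whose domain is on the list is routed to URL pattern parsing (508); every other URL is routed to URL keyword parsing (512).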
If it is determined at 506 that the URL-pattern-based approach is to be applied, method 500 proceeds to 508. At 508, URL pattern parsing may be performed on the URL to obtain a set of URL patterns corresponding to the URL. In one implementation, each URL pattern in the set may be a combination of one or more URL features of the URL. The URL features may include features extracted from at least one of the scheme, domain, path list, suffix, query list, etc. in the URL. For example, assuming the URL is "http://www.abcde.com/video/cn/263578.html", a set of URL features may be extracted, e.g., scheme="http", domain="abcde.com", path 1="video", path 2="cn", path 3="263578", path 3 is a decimal number, and so on. A set of URL patterns may then be formed from any combination of one or more of these URL features. For example, a first URL pattern may be the combination [scheme="http", domain="abcde.com", path 1="video"], a second URL pattern may be the combination [domain="abcde.com", path 1="video", path 2="cn", path 3 is a decimal number], and so on.
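The URL pattern parsing at 508 can be sketched as follows, using the example URL from the text. The feature string format ("path1=video", "path3:decimal"), the function names, the stripping of a leading "www." to obtain the domain, and the max_size cap on combination size are all assumptions made for illustration; the disclosure itself only states that a pattern is a combination of one or more URL features.

```python
import posixpath
from itertools import combinations
from urllib.parse import parse_qsl, urlparse


def extract_url_features(url):
    """Extract scheme, domain, path, suffix and query features from a URL."""
    parts = urlparse(url)
    host = parts.hostname or ""
    domain = host[4:] if host.startswith("www.") else host
    path, suffix = posixpath.splitext(parts.path)

    features = [f"scheme={parts.scheme}", f"domain={domain}"]
    segments = [s for s in path.split("/") if s]
    for i, segment in enumerate(segments, start=1):
        features.append(f"path{i}={segment}")
        if segment.isdecimal():
            # Generalized feature: the segment is a decimal number.
            features.append(f"path{i}:decimal")
    if suffix:
        features.append(f"suffix={suffix.lstrip('.')}")
    for key, _value in parse_qsl(parts.query, keep_blank_values=True):
        features.append(f"query={key}")
    return features


def form_url_patterns(features, max_size=3):
    """Form patterns as combinations of 1..max_size features (cap is an assumption)."""
    patterns = []
    for size in range(1, max_size + 1):
        patterns.extend(combinations(features, size))
    return patterns
```

For the example URL, the first pattern named in the text corresponds to the tuple ("scheme=http", "domain=abcde.com", "path1=video"). Capping the combination size keeps the number of patterns manageable, since the number of subsets grows exponentially with the number of features.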
At 510, URL-pattern-based classification may be performed based on the set of URL patterns obtained at 508 to determine whether the web page is a potential video page. In one implementation, a URL-pattern-based classification model corresponding to the mainstream popular video website to which the web page belongs may be used to perform the classification at 510. The URL-pattern-based classification model may be built via machine learning. In this case, the operation at 508 may optionally include selecting, from among a plurality of URL-pattern-based classification models built respectively for a plurality of mainstream popular video websites, the URL-pattern-based classification model corresponding to that mainstream popular video website. For example, the selection may be performed based on the "domain" field in the URL. The set of URL patterns obtained at 508 may be used as input features of the URL-pattern-based classification model. The model may be built in advance via machine learning so as to determine, based on the URL patterns obtained from the URL of a web page, whether the web page is a potential video page. The building of the URL-pattern-based classification model via machine learning is described later.
If it is determined at 506 that the URL-pattern-based approach is not to be applied, method 500 proceeds to 512. At 512, URL keyword parsing may be performed on the URL to obtain a set of URL keywords corresponding to the URL. For example, assuming the URL is "http://www.edcba.com/video/show/145937.html", a set of URL keywords, such as "video" and "show", may be extracted from the URL.
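The URL keyword parsing at 512 can be sketched as below. The tokenization rule is an assumption: the sketch drops the file suffix, splits the path on any run of non-letter characters, and keeps the remaining alphabetic tokens in order; the function name extract_url_keywords is hypothetical.

```python
import posixpath
import re
from urllib.parse import urlparse


def extract_url_keywords(url):
    """Return the alphabetic tokens of the URL path, in order, without duplicates."""
    path, _suffix = posixpath.splitext(urlparse(url).path)
    keywords = []
    for token in re.split(r"[^A-Za-z]+", path):
        token = token.lower()
        if token and token not in keywords:
            keywords.append(token)
    return keywords
```

Applied to the example URL in the text, this yields the keywords "video" and "show"; the numeric segment "145937" and the "html" suffix contribute nothing.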
At 514, URL-keyword-based classification may be performed based on the set of URL keywords obtained at 512 to determine whether the web page is a potential video page. In one implementation, a URL-keyword-based classification model built via machine learning may be used to perform the classification at 514. The set of URL keywords obtained at 512 may be used as input features of the URL-keyword-based classification model. The model may be built in advance via machine learning so as to determine, based on the URL keywords obtained from the URL of a web page, whether the web page is a potential video page. The building of the URL-keyword-based classification model via machine learning is described later.
At 516, if the web page is determined to be a potential video page, method 500 proceeds to 518. At 518, the determined potential video page is returned. Otherwise, if the web page is determined not to be a potential video page, method 500 ends at 520.
Although method 500 ends at 520, a potential video page determined by method 500 may in turn be provided to operation 104 in method 100 of Fig. 1, so that the subsequent operations of method 100 may be applied to the potential video page.
As described above, in method 500, the URL-pattern-based classification model and the URL-keyword-based classification model may be built via machine learning. Embodiments of the present disclosure may employ various techniques to perform the machine learning. For example, in one implementation, one or more of boosted trees, random forests, neural networks and support vector machines (SVMs) may be used as the machine learning model.
It should be appreciated that a URL-pattern-based classification model is specific to each mainstream popular video website. That is, a separate URL-pattern-based classification model should be built for each mainstream popular video website. To build the machine learning model that will serve as the URL-pattern-based classification model for a given mainstream popular video website, the URLs of a plurality of training web pages on that website that have been labeled as video pages or non-video pages may first be obtained. In one implementation, these training web pages may be labeled and provided manually. In another implementation, the device 200 for identifying video pages described above may be used to label and provide these training web pages automatically. For example, for a given mainstream popular video website, the device 200 may be used to classify the web pages on the website into video pages and non-video pages. These web pages may thus be labeled and provided as training web pages for the URL-pattern-based classification model. In this case, because the training web pages can be provided by the device 200 for identifying video pages, the cost of training the model may be reduced and training efficiency improved. URL patterns may be extracted from the URLs of these web pages in a manner similar to operation 508 in Fig. 5. The extracted URL patterns and the video page or non-video page labels may then be input to the machine learning model as input training features. Based on the input training features, the machine learning model may be trained so as to establish a machine-learning-based mechanism for determining potential video pages. The trained machine learning model may be used as the URL-pattern-based classification model, which is used to determine whether a web page on the mainstream popular video website is a potential video page. For example, when determining whether a web page on a given mainstream popular video website is a potential video page, the URL-pattern-based classification model corresponding to that website may receive the set of URL patterns of the web page and return a determination result using the machine-learning-based mechanism built into it.
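The training flow described above (labeled URL patterns in, a per-website classifier out) can be illustrated with a small, dependency-free Python sketch. Note the substitution: the sketch trains a plain logistic regression by stochastic gradient descent, which is a stand-in chosen for brevity and is not one of the model families named in the disclosure (boosted trees, random forests, neural networks, SVMs). The function names, the hyperparameters and the pattern strings are hypothetical.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_pattern_model(samples, epochs=200, learning_rate=0.5):
    """Train a logistic-regression stand-in on (pattern_set, label) pairs.

    label is 1 for a video page and 0 for a non-video page.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for patterns, label in samples:
            z = bias + sum(weights.get(p, 0.0) for p in patterns)
            error = label - _sigmoid(z)
            # Gradient step on the log-loss for each active binary feature.
            bias += learning_rate * error
            for p in patterns:
                weights[p] = weights.get(p, 0.0) + learning_rate * error
    return weights, bias


def is_potential_video_page(patterns, model, threshold=0.5):
    """Apply the trained model to the URL patterns of one web page."""
    weights, bias = model
    z = bias + sum(weights.get(p, 0.0) for p in patterns)
    return _sigmoid(z) >= threshold


# Hypothetical labeled training pages from one website (made-up data).
TRAINING_SAMPLES = [
    ({"path1=video", "path3:decimal"}, 1),
    ({"path1=video", "path2=cn"}, 1),
    ({"path1=news", "path3:decimal"}, 0),
    ({"path1=help"}, 0),
]
```

One such model would be trained per mainstream popular video website, and the model to apply would be selected by the domain of the URL being classified.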
The URL-keyword-based classification model may be built via machine learning in a manner similar to the URL-pattern-based classification model, except that the machine learning model is trained using URL keywords extracted from the URLs of the training web pages, as shown in operation 512 in Fig. 5. For example, the trained machine learning model may include a plurality of learned keywords (e.g., "v", "show", "play", "video", "tv", "vplay", etc.) and the respective weights of these keywords.
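One way to picture a trained model that consists of learned keywords and their weights is a linear score passed through a sigmoid. The sketch below assumes exactly that form; the disclosure does not specify how the weights are combined, and the weight values, the bias and the function names are made up for illustration. Only the keywords themselves come from the text.

```python
import math

# Keywords from the text; the weights and bias are invented for illustration.
KEYWORD_WEIGHTS = {
    "v": 0.8, "show": 1.1, "play": 1.6,
    "video": 2.0, "tv": 1.2, "vplay": 2.2,
}
BIAS = -1.5


def keyword_score(keywords):
    """Return a probability-like score from the weights of the known keywords."""
    z = BIAS + sum(KEYWORD_WEIGHTS.get(k, 0.0) for k in keywords)
    return 1.0 / (1.0 + math.exp(-z))


def classify_by_keywords(keywords, threshold=0.5):
    """Treat the page as a potential video page if the score reaches the threshold."""
    return keyword_score(keywords) >= threshold
```

With these made-up weights, the keyword set extracted from a URL such as ".../video/show/..." scores above the threshold, while a URL with no learned keywords falls back to the negative bias and is rejected.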
By building the URL-pattern-based classification model and the URL-keyword-based classification model via machine learning, the present disclosure may establish a higher-performance mechanism for determining potential video pages.
Fig. 6 shows an exemplary device 600 for identifying video pages according to an embodiment of the present disclosure. Device 600 is a further implementation of device 200 in Fig. 2.
As shown in Fig. 6, in addition to the structured content acquisition module 202, the attribute extractor 204 and the video page classifier 206, the device 600 for identifying video pages may further include a URL parser 602 and a URL classifier 604. The URL parser 602 and the URL classifier 604 may be jointly configured to perform the method 500 for determining potential video pages based on URLs shown in Fig. 5.
The URL parser 602 may be configured to perform URL parsing on the URL of a web page. In one implementation, the URL parser 602 may include a URL pattern parser 612 and a URL keyword parser 614. For example, the URL pattern parser 612 may be configured to perform URL pattern parsing on URLs that point to web pages on mainstream popular video websites, and the URL keyword parser 614 may be configured to perform URL keyword parsing on URLs that do not point to web pages on any mainstream popular video website.
The URL classifier 604 may be configured to perform URL-based classification using a URL-based classification model. In one implementation, the URL classifier 604 may further include a URL-pattern-based classifier 616 and a URL-keyword-based classifier 618, and the URL-based classification model may further include a URL-pattern-based classification model and a URL-keyword-based classification model. For example, the URL-pattern-based classifier 616 may include the URL-pattern-based classification model and may be configured to perform URL-pattern-based classification using that model. The URL-keyword-based classifier 618 may include the URL-keyword-based classification model and may be configured to perform URL-keyword-based classification using that model.
A potential video page determined by the URL classifier 604 may in turn be provided to subsequent modules for corresponding processing. For example, the structured content acquisition module 202 may obtain the structured content of the potential video page determined by the URL classifier 604.
Fig. 7 shows an exemplary system 700 for identifying video pages according to an embodiment of the present disclosure. System 700 may include one or more processors 702. System 700 may further include a memory 704 connected to the one or more processors 702. The memory 704 may store computer-executable instructions that, when executed, cause the one or more processors 702 to perform any operation of the method for identifying video pages according to the embodiments of the present disclosure described above.
Embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any operation of the method for identifying video pages according to the embodiments of the present disclosure described above.
It should be appreciated that all operations in the methods described above are merely exemplary. The present disclosure is not limited to any operation in the methods or to the order of these operations, and should cover all other equivalents under the same or similar concepts.
It should also be appreciated that all modules in the devices described above may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further divided into submodules or combined with other modules.
Processors have been described in connection with various devices and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend on the particular application and the overall design constraints imposed on the system. As an example, a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented as a microprocessor, a microcontroller, a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, or other suitable processing components configured to perform the various functions described in the present disclosure. The functions of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented as software executed by a microprocessor, a microcontroller, a DSP, or another suitable platform.
Software should be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, and so on. Software may reside on a computer-readable medium. A computer-readable medium may include, for example, a memory, which may be, for example, a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk. Although memory is shown as separate from the processor in the various aspects presented in the present disclosure, the memory may be internal to the processor (e.g., a cache or a register).
The above description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.

Claims (20)

1. A method for identifying a video page, comprising:
obtaining structured content of a web page;
extracting video object attributes from the structured content; and
determining whether the web page is a video page by using a page-based classification model built via machine learning, the video object attributes being input features of the page-based classification model.
2. The method according to claim 1, further comprising, before the obtaining:
determining that the URL of the web page points to a potential video page by using a URL-keyword-based classification model or a URL-pattern-based classification model built via machine learning.
3. The method according to claim 1, wherein the obtaining comprises:
crawling a source code document of the web page in a static crawling mode; and
parsing the source code document into the structured content.
4. The method according to claim 1, wherein the obtaining comprises:
crawling a source code document of the web page and at least one script file in a dynamic crawling mode; and
parsing the source code document and the at least one script file into the structured content.
5. The method according to claim 4, further comprising:
running the at least one script file,
wherein the parsing comprises parsing the source code document and the at least one executed script file into the structured content.
6. The method according to claim 1, wherein the structured content is represented as a DOM tree.
7. The method according to claim 1, wherein the video object attributes include at least one of the following: layout information, video type information, container type information, depth information of a video object in a DOM tree, and text information associated with the video object and/or at least one container corresponding to the video object.
8. The method according to claim 7, wherein the layout information includes at least one of the following:
(a) one or more of the width, height, top and left of the video object;
(b) one or more of the width, height, top and left of at least one container corresponding to the video object;
(c) the distance between the horizontal center of the video object and the horizontal center of the web page; and
(d) the distance between the top or bottom of the video object and the top or bottom of the web page.
9. The method according to claim 8, wherein the layout information is normalized using the height or width of the web page.
10. The method according to claim 1, wherein the extracting comprises:
detecting a video object from the structured content based on video identification information.
11. The method according to claim 10, wherein the video identification information includes at least one of the following: an HTML tag directly embedding a video object, an HTML5 video tag directly embedding a video object, and an iframe tag embedding a video page.
12. The method according to claim 1, wherein the web page includes a plurality of video objects, and the extracting further comprises:
extracting, from the structured content, video object attributes of one or more of the plurality of video objects.
13. The method according to claim 1, wherein the page-based classification model is built by using one or more of boosted trees, random forests, neural networks and support vector machines (SVMs) as a machine learning model during the machine learning.
14. The method according to claim 1, wherein the machine learning is performed using video object attributes of a plurality of web pages labeled as video pages or non-video pages, together with the labels of the plurality of web pages, to build the page-based classification model.
15. A device for identifying a video page, comprising:
a structured content acquisition module for obtaining structured content of a web page;
an attribute extractor for extracting video object attributes from the structured content; and
a video page classifier for determining whether the web page is a video page by using a page-based classification model built via machine learning, the video object attributes being input features of the page-based classification model.
16. The device according to claim 15, further comprising:
a URL classifier for determining that the URL of the web page points to a potential video page by using a URL-keyword-based classification model or a URL-pattern-based classification model built via machine learning.
17. The device according to claim 15, wherein the video object attributes include at least one of the following: layout information, video type information, container type information, depth information of a video object in a DOM tree, and text information associated with the video object and/or at least one container corresponding to the video object.
18. The device according to claim 17, wherein the layout information includes at least one of the following:
(a) one or more of the width, height, top and left of the video object;
(b) one or more of the width, height, top and left of at least one container corresponding to the video object;
(c) the distance between the horizontal center of the video object and the horizontal center of the web page; and
(d) the distance between the top or bottom of the video object and the top or bottom of the web page.
19. The device according to claim 18, wherein the web page includes a plurality of video objects, and the attribute extractor is further configured to:
extract, from the structured content, video object attributes of one or more of the plurality of video objects.
20. A system for identifying a video page, comprising:
one or more processors; and
a memory storing computer-executable instructions that, when executed, cause the one or more processors to perform the method according to any one of claims 1-14.
CN201680077528.6A 2016-09-26 2016-09-26 Identify video page Pending CN108475275A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/100192 WO2018053863A1 (en) 2016-09-26 2016-09-26 Identifying video pages

Publications (1)

Publication Number Publication Date
CN108475275A true CN108475275A (en) 2018-08-31

Family

ID=61689295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201680077528.6A Pending CN108475275A (en) 2016-09-26 2016-09-26 Identify video page

Country Status (2)

Country Link
CN (1) CN108475275A (en)
WO (1) WO2018053863A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11442749B2 (en) 2019-11-11 2022-09-13 Klarna Bank Ab Location and extraction of item elements in a user interface
US11379092B2 (en) 2019-11-11 2022-07-05 Klarna Bank Ab Dynamic location and extraction of a user interface element state in a user interface that is dependent on an event occurrence in a different user interface
US11366645B2 (en) 2019-11-11 2022-06-21 Klarna Bank Ab Dynamic identification of user interface elements through unsupervised exploration
US11726752B2 (en) 2019-11-11 2023-08-15 Klarna Bank Ab Unsupervised location and extraction of option elements in a user interface
US11409546B2 (en) 2020-01-15 2022-08-09 Klarna Bank Ab Interface classification system
US11386356B2 (en) 2020-01-15 2022-07-12 Klama Bank AB Method of training a learning system to classify interfaces
US10846106B1 (en) 2020-03-09 2020-11-24 Klarna Bank Ab Real-time interface classification in an application
US11496293B2 (en) 2020-04-01 2022-11-08 Klarna Bank Ab Service-to-service strong authentication

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101715004A (en) * 2009-11-12 2010-05-26 中国科学院计算技术研究所 Internet video-oriented distributed acquisition method and system
CN102411587A (en) * 2010-09-21 2012-04-11 腾讯科技(深圳)有限公司 Webpage classification method and device
CN103309862A (en) * 2012-03-07 2013-09-18 腾讯科技(深圳)有限公司 Webpage type recognition method and system
CN103455600A (en) * 2013-09-03 2013-12-18 小米科技有限责任公司 Video URL (Uniform Resource Locator) grabbing method and device and server equipment
US20150356195A1 (en) * 2014-06-05 2015-12-10 Apple Inc. Browser with video display history
US20160037071A1 (en) * 2013-08-21 2016-02-04 Xerox Corporation Automatic mobile photo capture using video analysis

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559234B (en) * 2013-10-24 2017-01-25 北京邮电大学 System and method for automated semantic annotation of RESTful Web services
CN104077389A (en) * 2014-06-27 2014-10-01 北京奇虎科技有限公司 Display method of webpage element information and browser device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liu Zhilong: "Research on Web Video Information Extraction" (Web视频信息提取研究), China Master's Theses Full-text Database (《中国优秀硕士学位论文全文数据库》) *

Also Published As

Publication number Publication date
WO2018053863A1 (en) 2018-03-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180831