CN108475275A - Identify video page - Google Patents
Identify video page
- Publication number
- CN108475275A (application number CN201680077528.6A)
- Authority
- CN
- China
- Prior art keywords
- video
- page
- web page
- video object
- url
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Abstract
The present disclosure provides methods, apparatuses and systems for identifying video pages. To identify a video page, the structured content of a web page on the internet may first be obtained. Video object attributes may be extracted from the structured content of the web page. A page-based classification model built via machine learning may be used to determine whether the web page is a video page. The video object attributes are used as input features of the page-based classification model.
Description
Background
Internet users widely use search engines to find content of interest. In some cases, a user may want to receive only search results of a certain type, such as video pages. For example, a user may query for video content, and in response the search engine may return a list of video pages associated with the queried video content. Herein, a video page denotes a web page that includes at least one video and presents at least one video as its main content.
To respond to queries for video content, a search engine provider should establish in advance, for example, a database or big table (Big Table) containing an index of the video pages on the internet, where the database or big table may be based on local or distributed storage. Building the database or big table first requires identifying the video pages on the internet. Several approaches to identifying video pages already exist, for example template-based, URL-based and sitemap-based approaches. In general, these approaches target a small range of video websites, such as mainstream popular video websites, and depend on manually defined rule information for the video pages on those websites, such as template rules and URL rules. For example, in the template-based or URL-based approach, some video websites manually design rules for their video pages, such as page content template rules, page layout template rules or URL rules; a search engine provider can summarize the rule information from some video pages on such a website and use it to identify further video pages on that website. As another example, in the sitemap-based approach, the operator of a mainstream popular video website may itself provide the search engine provider with a list of the website's web pages together with metadata indicating whether each web page is a video page; the search engine provider can then use the metadata and the website's sitemap to identify the video pages on that website.
The accuracy of these existing approaches in identifying video pages depends on the manually defined rule information. In addition, the range of video pages that can be detected is generally concentrated on a small number of mainstream popular video websites.
Summary
This Summary is provided to introduce a selection of concepts in simplified form that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.
Embodiments of the present disclosure may provide methods, apparatuses and systems for identifying video pages.
In one aspect, the present disclosure provides a method for identifying video pages. According to the method, the structured content of a web page may be obtained. Video object attributes may be extracted from the structured content. A page-based classification model built via machine learning may be used to determine whether the web page is a video page. The video object attributes are used as input features of the page-based classification model.
In another aspect, the present disclosure provides an apparatus for identifying video pages. The apparatus may include a structured content obtaining module, an attribute extractor and a video page classifier. The structured content obtaining module may be configured to obtain the structured content of a web page. The attribute extractor may be configured to extract video object attributes from the structured content. The video page classifier may be configured to determine whether the web page is a video page by using a page-based classification model built via machine learning, with the video object attributes serving as input features of the page-based classification model.
In another aspect, the present disclosure provides a system for identifying video pages. The system may include one or more processors and a memory. The memory may store computer-executable instructions that, when executed, cause the one or more processors to perform any step of the methods according to the various aspects of the present disclosure.
In another aspect, the present disclosure provides a non-volatile computer-readable medium. The non-volatile computer-readable medium may include instructions that, when executed, cause one or more processors to perform any step of the methods according to the various aspects of the present disclosure.
It should be noted that the above one or more aspects include the features described in detail below and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features indicate only some of the various ways in which the principles of the various aspects may be implemented, and the present disclosure is intended to include all such aspects and their equivalents.
Brief Description of the Drawings
The disclosed aspects are described below with reference to the accompanying drawings, which are provided to illustrate and not to limit the disclosed aspects.
Fig. 1 is a flowchart of an exemplary method for identifying video pages according to an embodiment of the present disclosure.
Fig. 2 shows an exemplary apparatus for identifying video pages according to an embodiment of the present disclosure.
Fig. 3 is a flowchart of an exemplary method for obtaining the structured content of a web page according to an embodiment of the present disclosure.
Fig. 4 shows an exemplary apparatus for identifying video pages according to an embodiment of the present disclosure.
Fig. 5 is a flowchart of an exemplary method for determining potential video pages based on URLs according to an embodiment of the present disclosure.
Fig. 6 shows an exemplary apparatus for identifying video pages according to an embodiment of the present disclosure.
Fig. 7 shows an exemplary system for identifying video pages according to an embodiment of the present disclosure.
Detailed Description
The present disclosure will now be discussed with reference to various exemplary embodiments. It should be understood that these embodiments are discussed only to enable those skilled in the art to better understand and thereby implement the embodiments of the present disclosure, and do not suggest any limitation on the scope of the present disclosure.
Embodiments of the present disclosure may provide methods, apparatuses and systems for identifying video pages, and may be applied to various search engines. For example, by implementing embodiments of the present disclosure, video pages on the internet can be identified efficiently and then added to a database or big table that maintains the video page index for a search engine. Thus, when a user queries certain video content through the search engine, the search engine can retrieve video page index entries from the database or big table and return them to the user.
In one aspect, the present disclosure proposes using a page-based classification model to identify video pages. The page-based classification model may be built via machine learning: various machine learning algorithms may be applied to various video object attributes of web pages to build it. Using a page-based classification model built via machine learning can improve the accuracy and recall of video page identification. In another aspect, the present disclosure also proposes using a URL-based classification model to determine potential video pages. The URL-based classification model may be built in advance via machine learning. Using the URL-based classification model together with the page-based classification model can further improve the accuracy and recall of video page identification, and can significantly improve its efficiency.
Embodiments of the present disclosure can perform video page identification on a wide range of web pages on the internet, rather than being limited to a small number of mainstream popular video websites. More video pages can therefore be identified for a search engine. Embodiments of the present disclosure can also identify video pages automatically, so that video pages on the internet can be identified continuously.
It should be understood that the exemplary environment described above is for illustration only and does not suggest any limitation on the scope of the present disclosure. The present disclosure may be implemented with different structures and/or functions.
Fig. 1 is a flowchart of an exemplary method 100 for identifying video pages according to an embodiment of the present disclosure.
Method 100 begins at 102 and proceeds to 104. At 104, the structured content of a web page may be obtained. As is well known, web pages may be implemented in various markup languages, such as HTML and XML. A source code document of a web page, such as an HTML document, can be converted into structured content. Embodiments of the present disclosure may obtain the structured content of a web page in any manner. In one implementation, obtaining the structured content at 104 may simply mean receiving the structured content; in that case, the structured content may be provided by any web page technology or any third party. In another implementation, obtaining the structured content at 104 may include a process of generating it. For example, the structured content may be generated using Document Object Model (DOM) technology, which provides a structured representation of documents such as HTML and XML documents. In this case, a DOM tree corresponding to the web page may be obtained at 104; that is, the structured content may be represented as a DOM tree. Herein, the web page whose structured content is obtained may be any web page on the internet. Optionally, the web page may be a potential video page determined in a particular manner described below.
At 106, video object attributes may be extracted from the structured content. In general, the structured content of a web page includes various attributes of the objects in the web page, for example logical structure information, layout information or type information. The attributes of a video object in the web page can therefore be obtained from the structured content.
If the web page includes only one video object, the attributes of that video object may be obtained at 106. If the web page includes multiple video objects, the attributes of one or more of them may be obtained at 106. For example, in one implementation, the attributes of all or some of the video objects may be obtained. In another implementation, only the attributes of the video object with the largest size may be obtained.
In one implementation, the operation at 106 may include first detecting video objects in the structured content based on video identification information. A video object denotes a video unit of a certain video type in the web page. Common video types include webm, ogg, mp4, avi, flv and so on. The video identification information may be any kind of information in the structured content that can indicate a video object. For example, it may be an HTML tag or HTML5 video tag that directly embeds a video object, or an iframe tag that embeds a video page. Embodiments of the present disclosure are not limited to any particular type of video object or any particular type of video identification information.
After a video object is detected, the operation at 106 may further obtain the attributes of the detected video object from the structured content. As noted above, if the web page includes multiple video objects, the operation at 106 may obtain the attributes of one or more of the video objects detected in the structured content.
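The detection step can be sketched with Python's standard `html.parser` module. This is only an illustration, not the patent's implementation: the tag set `VIDEO_TAGS` (in particular `embed` and `object`), the function names, and the use of declared `width` and `height` attributes to pick the largest object are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Tags treated as video identification information in this sketch (an assumption).
VIDEO_TAGS = {"video", "embed", "object", "iframe"}


class VideoObjectFinder(HTMLParser):
    """Collects the attributes of tags that may embed a video."""

    def __init__(self):
        super().__init__()
        self.video_objects = []

    def handle_starttag(self, tag, attrs):
        if tag in VIDEO_TAGS:
            # attrs is a list of (name, value) pairs; keep them as a dict.
            self.video_objects.append({"tag": tag, **dict(attrs)})


def find_video_objects(html):
    """Return one attribute dict per candidate video object, in document order."""
    finder = VideoObjectFinder()
    finder.feed(html)
    finder.close()
    return finder.video_objects


def largest_video_object(video_objects):
    """Pick the object with the largest declared width * height, if any."""
    def area(obj):
        try:
            return int(obj.get("width", 0)) * int(obj.get("height", 0))
        except (TypeError, ValueError):
            return 0  # missing or non-numeric sizes such as "100%"
    return max(video_objects, key=area, default=None)
```

A real crawler would more likely read rendered sizes from a browser engine, since declared attributes are often absent or overridden by CSS.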
In one implementation, the extracted video object attributes may include layout information associated with a video object in the web page. The layout information may be one or more of width, height and position, any combination thereof, or any information derived therefrom.
For example, in one implementation, the layout information may include one or more of the width, height, top and left of the video object. These values are usually defined in the structured content and can be extracted from it directly; otherwise, they can be derived in various ways. In another implementation, the layout information may include one or more of the width, height, top and left of at least one container corresponding to the video object. The at least one container may include one or more containers that enclose the video object at different levels. The width, height, top and left of the at least one container are likewise usually defined in the structured content and can be extracted directly; otherwise, they can be derived in various ways. In another implementation, the layout information may include the distance between the horizontal center of the video object and the horizontal center of the web page, which can be calculated from the relative position of the video object within the web page. In another implementation, the layout information may include a distance between the top or bottom of the video object and the top or bottom of the web page. This distance may be, for example, the distance between the top of the video object and the top of the web page, between the bottom of the video object and the bottom of the web page, between the top of the video object and the bottom of the web page, or between the bottom of the video object and the top of the web page. It should be understood that the layout information may include at least one of the foregoing examples. In addition, in some cases the layout information may be normalized, for example by the width or height of the web page.
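As a rough illustration of the layout features described above, the sketch below computes a few of them in normalized form. The `Box` type, the feature names and the choice to normalize horizontal values by page width and vertical values by page height are assumptions for the example; the page dimensions are assumed to be positive.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle in page coordinates (pixels)."""
    left: float
    top: float
    width: float
    height: float


def layout_features(video, page_width, page_height):
    """Return normalized layout features for one video object."""
    video_center_x = video.left + video.width / 2
    page_center_x = page_width / 2
    video_bottom = video.top + video.height
    return {
        "width": video.width / page_width,
        "height": video.height / page_height,
        # Distance between the horizontal centers of the video and the page.
        "center_offset": abs(video_center_x - page_center_x) / page_width,
        # Distance from the top of the video to the top of the page.
        "top_to_page_top": video.top / page_height,
        # Distance from the bottom of the video to the bottom of the page.
        "bottom_to_page_bottom": (page_height - video_bottom) / page_height,
    }
```

The same function could be applied to each enclosing container to obtain container-level layout features.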
In one implementation, the video object attributes may include video type information. For example, the type of the video object may be determined from the structured content. As noted above, the video type may be flv, webm, ogg, mp4 and so on.
In one implementation, the video object attributes may include container type information. For example, the type of at least one container corresponding to the video object may be determined from the structured content. The container type may be, for example, div, p, a or span.
In one implementation, the video object attributes may include depth information of the video object in the DOM tree. As noted above, the structured content may be represented as a DOM tree. The depth of the video object in the DOM tree may be determined in various ways.
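One possible way to determine the depth is sketched below with Python's standard `html.parser`, tracking a stack of open elements. The convention that the root element has depth 1, the class and function names, and the default target tag are assumptions for the example. Note that this measures depth in the parsed markup, which can differ from a browser's DOM when the browser inserts implied elements.

```python
from html.parser import HTMLParser

# HTML void elements never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DepthTracker(HTMLParser):
    """Records the nesting depth of every occurrence of a target tag."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag
        self.stack = []    # currently open elements
        self.depths = []   # depth of each target element found

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.depths.append(len(self.stack) + 1)  # root element has depth 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: record it, but never push it.
        if tag == self.target_tag:
            self.depths.append(len(self.stack) + 1)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def video_depths(html, target_tag="video"):
    """Return the depth of each target element, in document order."""
    tracker = DepthTracker(target_tag)
    tracker.feed(html)
    tracker.close()
    return tracker.depths
```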
In one implementation, the video object attributes may include text information associated with the video object and/or at least one container corresponding to the video object. The text information may be, for example, the "class" or "id" name of a tag associated with the video object and/or the at least one container, or any other text.
It should be understood that the video object attributes may include at least one of the following: layout information, video type information, container type information, depth information of the video object in the DOM tree, and text information associated with the video object and/or at least one container corresponding to the video object.
At 108, a page-based classification model may be used to determine whether the web page is a video page. The video object attributes obtained at 106 are used as input features of the page-based classification model. The model may be built in advance via machine learning and determines whether a web page is a video page based on the video object attributes extracted for that page. Building the page-based classification model via machine learning is described later.
According to the present disclosure, web pages that contain a video as main content tend to be determined as video pages. Web pages that contain a video but do not present it as main content tend to be determined as non-video pages, because the video may be, for example, an advertisement and may not be what the user wants. Where a web page contains multiple videos, the web page tends to be determined as a video page if at least one of the videos is main content.
For example, if the video object is displayed at or near the horizontal center of the web page, or has a relatively large size even though it is not near the horizontal center, the web page will be determined as a video page with higher probability. However, if the video object is displayed near the edge of the web page with a relatively small size, or has a relatively small size even though it is near the horizontal center, the web page will be determined as a non-video page with higher probability. As another example, if the web page has a large vertical size and the video object is displayed at the bottom of the web page, the web page may still be determined as a non-video page with higher probability even if the video object is near the horizontal center; whereas if the video object is displayed within the first screen of the web page and near the horizontal center, the web page will be determined as a video page with higher probability. The page-based classification model can thus use the layout information in the video object attributes to determine video pages.
In addition, the page-based classification model may use the depth of the video object in the DOM tree, as included in the video object attributes, to determine video pages. For example, if it is known that for most video pages the depth of the video object in the DOM tree lies in the range of 5 to 7, a web page whose video object has a depth within this range is more likely to be determined as a video page. The page-based classification model may also use the type of the video object or the type of the at least one container, as included in the video object attributes, to determine video pages. The page-based classification model may further use the text information associated with the video object and/or at least one container corresponding to the video object. For example, if the "id" of a tag associated with the video object is named "video_stage", the web page is very likely a video page because of the word "video"; if the "id" name includes "gallery", the web page may be determined as a non-video page.
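The text cues in this example can be turned into numeric features along the following lines. This is a minimal sketch: the function name, the feature names and the use of case-insensitive substring matching are assumptions, and the only keywords used are the two mentioned in the text ("video" and "gallery").

```python
def text_features(names):
    """Turn "class"/"id" names of a video object and its containers into flags."""
    joined = " ".join(name.lower() for name in names if name)
    return {
        "mentions_video": int("video" in joined),
        "mentions_gallery": int("gallery" in joined),
    }
```

In this form the flags are only inputs to the classification model, which weighs them together with the layout, type and depth features rather than applying them as hard rules.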
Method 100 ends at 110. It should be understood, however, that the determination result at 108 may be used in any further application scenario. For example, if method 100 identifies a web page as a video page, the web page may be indexed and added to the database or big table that maintains the video page index for a search engine.
As noted above, the page-based classification model in method 100 may be built via machine learning. Embodiments of the present disclosure may use various techniques to perform the machine learning. For example, in one implementation, one or more of boosted trees (Boosted Tree), random forests, neural networks and support vector machines (SVM) may be used as the machine learning model.
To build the machine learning model that will serve as the page-based classification model, a large number of training web pages (for example, thousands or more) may first be manually labeled as video pages or non-video pages. Video object attributes may be extracted from these web pages in a manner similar to operations 104 and 106 in Fig. 1. The extracted video object attributes, together with the label indicating whether each web page is a video page or a non-video page, may then be input to the machine learning model as input training features. The input training features may take any form that the machine learning model can interpret. For example, if the input training features are numeric, the video object attributes and labels may be converted to numbers beforehand. The machine learning model can be trained on the input training features to establish a machine-learning-based mechanism for determining video pages. The trained model can then serve as the page-based classification model for determining whether a web page is a video page. For example, to determine whether a given web page is a video page, the page-based classification model receives the video object attributes of the web page and uses its internal machine-learning-based mechanism to return the determination result.
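The train-then-predict flow can be illustrated with a deliberately tiny stand-in. The sketch below uses a single perceptron written in plain Python, which is not one of the models named above (boosted trees, random forests, neural networks, SVMs); it is swapped in only to keep the example dependency-free. The feature layout and the six labeled samples are made up for illustration, whereas a real training set would contain thousands of manually labeled pages.

```python
def score(weights, bias, x):
    """Linear score of one feature vector."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def train_perceptron(samples, labels, epochs=1000, lr=1.0):
    """Train a perceptron; labels are 1 (video page) or 0 (non-video page)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            pred = 1 if score(weights, bias, x) > 0 else 0
            if pred != y:
                step = lr * (y - pred)  # +lr for a missed video page, -lr otherwise
                weights = [w + step * v for w, v in zip(weights, x)]
                bias += step
                errors += 1
        if errors == 0:  # every training page is classified correctly
            break
    return weights, bias


def predict(model, x):
    """Return 1 if the page is classified as a video page, else 0."""
    weights, bias = model
    return 1 if score(weights, bias, x) > 0 else 0


# Hypothetical numeric features: [normalized width, center offset, mentions "video"].
TOY_FEATURES = [
    [0.60, 0.02, 1], [0.50, 0.05, 1], [0.70, 0.00, 0],   # labeled video pages
    [0.20, 0.40, 0], [0.15, 0.35, 0], [0.25, 0.45, 1],   # labeled non-video pages
]
TOY_LABELS = [1, 1, 1, 0, 0, 0]

model = train_perceptron(TOY_FEATURES, TOY_LABELS)
```

After training, `predict(model, features)` plays the role of the page-based classification model at operation 108: it receives the numeric video object attributes of a page and returns the determination result.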
By building the page-based classification model with machine learning, the present disclosure can establish a determination mechanism for video pages that performs better than existing manually defined rules. The present disclosure can therefore improve the accuracy and recall of video page identification.
Fig. 2 shows an exemplary apparatus 200 for identifying video pages according to an embodiment of the present disclosure. In one implementation, apparatus 200 may be configured to perform the operations of method 100.
Apparatus 200 may include a structured content obtaining module 202, an attribute extractor 204 and a video page classifier 206. The structured content obtaining module 202 may be configured to obtain the structured content of a web page. For example, the structured content obtaining module 202 may perform operation 104 in method 100. The attribute extractor 204 may be configured to extract video object attributes from the structured content. For example, the attribute extractor 204 may perform operation 106 in method 100. The video page classifier 206 may include a page-based classification model built via machine learning, and may be configured to determine whether the web page is a video page by using the page-based classification model. For example, the video page classifier 206 may perform operation 108 in method 100.
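The three-module structure can be pictured as a simple pipeline. In the sketch below the class name `VideoPageIdentifier` and its callable parameters are hypothetical; each callable stands in for one module of apparatus 200, and the patent does not prescribe this interface.

```python
from typing import Any, Callable, Mapping


class VideoPageIdentifier:
    """Chains three stages that mirror modules 202, 204 and 206."""

    def __init__(
        self,
        obtain_content: Callable[[str], Any],
        extract_attributes: Callable[[Any], Mapping[str, float]],
        classify: Callable[[Mapping[str, float]], bool],
    ) -> None:
        self.obtain_content = obtain_content
        self.extract_attributes = extract_attributes
        self.classify = classify

    def is_video_page(self, url: str) -> bool:
        content = self.obtain_content(url)             # operation 104
        attributes = self.extract_attributes(content)  # operation 106
        return self.classify(attributes)               # operation 108
```

Keeping the stages as injected callables lets the content-obtaining stage be swapped between static and dynamic implementations without touching the classifier.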
Fig. 3 is a flowchart of an exemplary method 300 for obtaining the structured content of a web page according to an embodiment of the present disclosure. Method 300 is an exemplary implementation of operation 104 in method 100. It should be understood that operation 104 in method 100 may also be implemented in any other manner.
Method 300 begins at 302 and proceeds to 304. At 304, a crawling operation may be performed on the web page based on its URL. In embodiments of the present disclosure, web pages on the internet may be classified as static web pages or dynamic web pages. For a static web page, once the corresponding source code (for example, HTML code) has been generated, the content and display of the page do not change unless the source code is modified. For a dynamic web page, at least part of the displayed content can change with time, database operations and so on, even if the source code is not modified. A dynamic web page may be generated by combining HTML with other high-level programming languages (for example, Java, C#, C++). For example, a dynamic web page may include program code segments which, when run, interact with a back-end database, a web server or the user.
Embodiments of the present disclosure may use various crawling modes. A web page may be crawled in a static crawling mode or a dynamic crawling mode. In general, the static crawling mode is used for static web pages and the dynamic crawling mode for dynamic web pages. In some cases, however, the static crawling mode may be used instead, for example if the source code of a web page with dynamic content can produce a video player in a standard web browser from a single source code document. In some implementations, the "domain" field in the URL of the web page may be used to determine which crawling mode to apply to the web page. The static and dynamic crawling modes yield different crawl results. For example, if the web page is a static web page, operation 304 may crawl the single source code document of the page, such as an HTML document, in the static crawling mode. If the web page is a dynamic web page, operation 304 may usually crawl the source code document of the page and at least one script file in the dynamic crawling mode. A script file may be a file that helps realize the dynamic display of certain objects in the dynamic web page. For example, a script file may be or may include a program code (for example, JavaScript code) segment which, when run, interacts with a back-end database, a web server or the user. In addition, as noted above, in some cases operation 304 may also crawl the single source code document of a dynamic web page in the static crawling mode.
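Selecting the crawling mode from the domain of the URL might look like the sketch below. The set `DYNAMIC_DOMAINS`, its single entry and the function name are hypothetical; how such a list would be compiled is not specified in the text.

```python
from urllib.parse import urlsplit

# Hypothetical set of domains whose pages are known to need the dynamic mode.
DYNAMIC_DOMAINS = {"videos.example.com"}


def choose_crawl_mode(url, dynamic_domains=DYNAMIC_DOMAINS):
    """Return "dynamic" or "static" based on the host part of the URL."""
    host = (urlsplit(url).hostname or "").lower()
    return "dynamic" if host in dynamic_domains else "static"
```

Defaulting to the static mode keeps the common case cheap, since dynamic crawling also has to fetch and later run script files.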
At 306, it may be determined whether to perform static parsing or dynamic parsing. The determination at 306 may be made in various ways. For example, if a single source code document was crawled at 304, it may be determined at 306 to perform static parsing, and method 300 proceeds to 308. At 308, the source code document may be parsed into structured content.
Otherwise, if a source code document and at least one script file were crawled at 304, it may be determined at 306 to perform dynamic parsing, and method 300 proceeds to 310. At 310, the source code document and the at least one script file may be parsed into structured content.
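The branch at 306 can be expressed as a small dispatch on what was crawled. The `CrawlResult` container and the function name below are hypothetical, and the rule shown (dynamic parsing whenever at least one script file is present) is only the example given in the text, not the only possible criterion.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CrawlResult:
    """What operation 304 fetched for one page (hypothetical container)."""
    source_document: str
    script_files: List[str] = field(default_factory=list)


def choose_parsing(result: CrawlResult) -> str:
    """Decision at 306: static parsing if only a source document was crawled."""
    return "dynamic" if result.script_files else "static"
```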
In one implementation, the at least one script file may be run automatically to assist the parsing operation at 310. That is, the parsing operation at 310 may include parsing the source code document and the executed at least one script file into structured content. For example, if the script file is a JavaScript file, the code in it may be run to obtain further object information of the web page, which is then used to generate the structured content. As an example, where a dynamic web page includes a play button corresponding to a script file, a click on the play button may be simulated automatically. Such an operation may cause the script file to be run directly, or cause another HTTP request to be sent to a server to obtain an updated dynamic web page that may include a video. When the updated dynamic web page is received, the crawling and parsing operations described above may be applied to it in turn.
Method 300 ends at 312. It should be understood, however, that the structured content obtained at 308 or 310 may then be provided to operation 106 in method 100 of Fig. 1 for subsequent processing.
Fig. 4 shows the exemplary means 400 of the video page for identification according to the embodiment of the present disclosure.Device 400 is Fig. 2
The further implementation of middle device 200.
As shown in Figure 4, structured content acquisition module 202 may further include page grabber 402 and page solution
Parser 404.Page grabber 402 and page parsing device 404 can be jointly configured as the operation of execution method 300.
Page grabber 402 can be configured as the URL based on web page to execute crawl to web page.For example, page
Face grabber 402 can execute the operation 304 in method 300.In static grasp mode, page grabber 402 can capture
The source code document of web page, and in dynamic grasp mode, page grabber 402 can capture the source code text of web page
Shelves and at least one script file.
Page parsing device 404 can be configured as the document captured to page grabber 402 or file executes parsing, with
Just the structured content of web page is generated.For example, page parsing device 404 can execute the operation 306,308 and 310 of method 300
In any operation.
The page parser 404 can determine whether to perform static parsing or dynamic parsing. If the page grabber 402 captured a single source code document, the page parser 404 can parse the source code document into structured content. If the page grabber 402 captured a source code document and at least one script file, the page parser 404 can parse the source code document and the at least one script file into structured content.
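The parsing step can be illustrated with a minimal sketch. It assumes the structured content is a tree of nested dictionaries standing in for a DOM tree, and that dynamic parsing receives HTML fragments already produced by running the script files. The names `TreeBuilder`, `parse_page` and `script_outputs` are hypothetical and do not come from the disclosure.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Builds a nested-dict tree (a stand-in for a DOM tree) from HTML."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "attrs": {}, "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break


def parse_page(source_document, script_outputs=()):
    """Static parsing when script_outputs is empty, dynamic parsing otherwise.

    script_outputs is an assumed input: HTML fragments obtained by running
    the captured script files elsewhere.
    """
    builder = TreeBuilder()
    builder.feed(source_document)
    for fragment in script_outputs:
        builder.feed(fragment)
    builder.close()
    return builder.root
```

A real implementation would more likely rely on a browser engine to run scripts and expose the resulting DOM; the sketch only shows how one parser can serve both modes.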
The attribute extractor 204 and the video page classifier 206 in Fig. 4 are the same as the attribute extractor 204 and the video page classifier 206 shown in Fig. 2.
Returning to Fig. 1, method 100 can be applied to any web page on the Internet. Optionally, in some embodiments, method 100 can be applied only to potential video pages. By determining potential video pages before applying the page-based method for identifying video pages shown in Fig. 1, the present disclosure can greatly reduce the number of web pages to be processed and thereby significantly improve the efficiency of identifying video pages. In addition, by determining whether a web page is a potential video page, the present disclosure can further improve recognition accuracy and ensure a higher recall rate. Potential video pages can be determined in various ways. According to an embodiment of the present disclosure, the URL of a web page can be used to determine whether the web page is a potential video page.
All websites on the Internet can be divided into two groups: one group includes mainstream popular video websites, and the other group includes all other websites. For mainstream popular video websites, a URL pattern-based approach can be used to determine potential video pages. For example, a set of URL patterns obtained from the URL of a web page on a mainstream popular video website can be used to determine whether the web page is a potential video page. For websites other than mainstream popular video websites, a URL keyword-based approach can be used to determine potential video pages. For example, a set of URL keywords obtained from the URL of a web page on such a website can be used to determine whether the web page is a potential video page.
Fig. 5 is a flowchart of an exemplary method 500 for determining potential video pages based on URLs according to an embodiment of the present disclosure.
Method 500 starts at 502 and proceeds to 504. At 504, the URL of a web page can be obtained. The URL can be obtained in various ways. For example, a web crawler can be used to automatically obtain the URLs of web pages on the Internet. In this case, the operations of method 500 can be triggered whenever a new URL is obtained.
At 506, it can be determined whether the URL pattern-based approach or the URL keyword-based approach is to be used to determine potential video pages. For example, the "domain" field of the URL obtained at 504 can be extracted and used to determine whether the URL points to a web page on a mainstream popular video website, in which case the URL pattern-based approach should be applied.
If it is determined at 506 that the URL pattern-based approach is to be applied, method 500 proceeds to 508. At 508, URL pattern parsing can be performed on the URL to obtain a set of URL patterns corresponding to the URL. In one embodiment, each URL pattern in the set can be a combination of one or more URL features of the URL. The URL features may include features extracted from at least one of the scheme, domain, path list, suffix, query list, etc. of the URL. For example, assuming the URL is "http://www.abcde.com/video/cn/263578.html", a set of URL features can be extracted, for example, scheme="http", domain="abcde.com", path 1="video", path 2="cn", path 3="263578", path 3 is a decimal number, etc. A set of URL patterns can then be formed from any combination of one or more of these URL features. For example, a first URL pattern can be the combination [scheme="http", domain="abcde.com", path 1="video"], a second URL pattern can be the combination [domain="abcde.com", path 1="video", path 2="cn", path 3 is a decimal number], and so on.
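The URL pattern parsing at 508 can be sketched as follows. This is only an illustration under assumptions: the feature names, the stripping of a leading "www." from the host, and the cap on pattern size (`max_size`) are choices made here, not details given in the disclosure.

```python
from itertools import combinations
from urllib.parse import urlsplit, parse_qsl


def extract_url_features(url):
    """Return URL features as (name, value) pairs."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Assumption: the "domain" feature drops a leading "www.".
    domain = host[4:] if host.startswith("www.") else host
    features = [("scheme", parts.scheme), ("domain", domain)]

    # Split a file suffix such as "html" off the last path segment.
    path, suffix = parts.path, ""
    if "." in path.rsplit("/", 1)[-1]:
        path, suffix = path.rsplit(".", 1)

    segments = [s for s in path.split("/") if s]
    for i, segment in enumerate(segments, start=1):
        features.append((f"path {i}", segment))
        if segment.isdigit():
            features.append((f"path {i} type", "decimal"))
    if suffix:
        features.append(("suffix", suffix))
    for key, value in parse_qsl(parts.query):
        features.append((f"query {key}", value))
    return features


def url_patterns(features, max_size=3):
    """Form URL patterns as combinations of one or more URL features."""
    patterns = []
    for size in range(1, max_size + 1):
        patterns.extend(frozenset(c) for c in combinations(features, size))
    return patterns
```

Because the number of combinations grows quickly with the number of features, a practical system would probably restrict which combinations are kept; the size cap here is one simple way to do that.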
At 510, URL pattern-based classification can be performed based on the set of URL patterns obtained at 508, so as to determine whether the web page is a potential video page. In one embodiment, a URL pattern-based classification model corresponding to the mainstream popular video website to which the web page belongs can be used to perform the classification at 510. The URL pattern-based classification model can be built via machine learning. In this case, the operation at 508 may optionally include selecting, from among multiple URL pattern-based classification models built respectively for multiple mainstream popular video websites, the URL pattern-based classification model corresponding to this mainstream popular video website. For example, the selection can be performed based on the "domain" field in the URL. The set of URL patterns obtained at 508 can be used as input features of the URL pattern-based classification model. The URL pattern-based classification model can be built in advance via machine learning, for determining whether a web page is a potential video page based on the URL patterns obtained from the URL of the web page. The building of the URL pattern-based classification model via machine learning is described later.
If it is determined at 506 that the URL pattern-based approach is not to be applied, method 500 proceeds to 512. At 512, URL keyword parsing can be performed on the URL to obtain a set of URL keywords corresponding to the URL. For example, assuming the URL is "http://www.edcba.com/video/show/145937.html", a set of URL keywords can be extracted from the URL, for example, "video", "show", etc.
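One possible form of the URL keyword parsing at 512 is sketched below. It assumes that keywords are the alphabetic tokens of the path and query, and that common file suffixes are ignored; the disclosure does not specify the tokenization, so both choices are assumptions, as is the function name.

```python
import re
from urllib.parse import urlsplit

# Assumed list of file suffixes that are not treated as keywords.
IGNORED_TOKENS = {"html", "htm", "php", "asp", "aspx", "jsp"}


def extract_url_keywords(url):
    """Return the alphabetic tokens found in the path and query of a URL."""
    parts = urlsplit(url)
    text = f"{parts.path} {parts.query}".lower()
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in IGNORED_TOKENS]
```

For the example URL in the text, this yields the keywords "video" and "show"; the numeric segment is dropped because it contains no letters.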
At 514, URL keyword-based classification can be performed based on the set of URL keywords obtained at 512, so as to determine whether the web page is a potential video page. In one embodiment, a URL keyword-based classification model built via machine learning can be used to perform the classification at 514. The set of URL keywords obtained at 512 can be used as input features of the URL keyword-based classification model. The URL keyword-based classification model can be built in advance via machine learning, for determining whether a web page is a potential video page based on the URL keywords obtained from the URL of the web page. The building of the URL keyword-based classification model via machine learning is described later.
At 516, if the web page is determined to be a potential video page, method 500 proceeds to 518. At 518, the determined potential video page is returned. Otherwise, if the web page is determined not to be a potential video page, method 500 ends at 520.
Although method 500 ends at 520, the potential video pages determined by method 500 can in turn be provided to operation 104 in method 100 of Fig. 1, so that the subsequent operations of method 100 can be applied to the potential video pages.
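The overall flow of method 500 can be summarized in a short sketch. The parsers and classification models are passed in as plain callables, and the choice at 506 is made by checking whether a per-site model exists for the URL's domain. These are simplifying assumptions, and the function and parameter names are hypothetical.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Extract the "domain" field, dropping a leading "www." (an assumption)."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def determine_potential_video_page(url, pattern_models, keyword_model,
                                   parse_patterns, parse_keywords):
    """Return the URL if it is judged a potential video page, else None.

    pattern_models maps the domain of each mainstream popular video website
    to its own URL pattern-based model; keyword_model covers all other sites.
    """
    site_model = pattern_models.get(domain_of(url))         # operation 506
    if site_model is not None:
        is_potential = site_model(parse_patterns(url))       # operations 508, 510
    else:
        is_potential = keyword_model(parse_keywords(url))    # operations 512, 514
    return url if is_potential else None                     # operations 516-520
```

Returning `None` corresponds to ending at 520 without handing the page on to method 100.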
As described above, in method 500, the URL pattern-based classification model and the URL keyword-based classification model can be built via machine learning. Embodiments of the present disclosure can use various techniques to perform the machine learning. For example, in one embodiment, one or more of boosted trees, random forests, neural networks and support vector machines (SVM) can be used as the machine learning model.
It should be appreciated that the URL pattern-based classification model is specific to each mainstream popular video website. That is, a URL pattern-based classification model should be built separately for each mainstream popular video website. To build the machine learning model that will serve as the URL pattern-based classification model for a given mainstream popular video website, the URLs of multiple training web pages on that website that have been labeled as video pages or non-video pages can first be obtained.

In one embodiment, these training web pages can be labeled and provided manually. In another embodiment, the device 200 for identifying video pages described above can be used to label and provide these training web pages automatically. For example, for a given mainstream popular video website, the device 200 can be used to classify the web pages on the website into video pages or non-video pages. These web pages can thus be labeled and provided as training web pages for the URL pattern-based classification model. In this case, since the training web pages can be provided by the device 200 for identifying video pages, the cost of training the model can be reduced and the training efficiency improved.

URL patterns can be extracted from the URLs of these web pages in a manner similar to operation 508 in Fig. 5. The extracted URL patterns and the video page or non-video page labels can then be input to the machine learning model as input training features. Based on the input training features, the machine learning model can be trained, so as to establish a machine learning-based mechanism for determining potential video pages. The trained machine learning model can be used as the URL pattern-based classification model, which is used to determine whether a web page on the mainstream popular video website is a potential video page. For example, when determining whether a web page on a given mainstream popular video website is a potential video page, the URL pattern-based classification model corresponding to that website can receive the set of URL patterns of the web page and return a determination result using the machine learning-based mechanism built into it.
The URL keyword-based classification model can be built via machine learning in a manner similar to the URL pattern-based classification model, except that the machine learning model is trained using URL keywords extracted from the URLs of the training web pages, as shown in operation 512 in Fig. 5. For example, the trained machine learning model may include multiple learned keywords (for example, "v", "show", "play", "video", "tv", "vplay", etc.) and respective weights of these keywords.
By building the URL pattern-based classification model and the URL keyword-based classification model via machine learning, the present disclosure can establish a higher-performance mechanism for determining potential video pages.
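To make the idea of learned keywords with weights concrete, the sketch below learns one weight per keyword from labeled training URLs. It uses smoothed log-odds, a simple naive-Bayes-style weighting chosen here for brevity; it stands in for the boosted tree, random forest, neural network or SVM models named in the text and is not the disclosed training procedure. The function names and the toy training data are hypothetical.

```python
import math
from collections import Counter


def learn_keyword_weights(examples):
    """Learn one weight per keyword from (keywords, is_video_page) pairs.

    The weight is the add-one-smoothed log-odds of a keyword appearing in
    URLs of video pages versus URLs of non-video pages.
    """
    pos, neg = Counter(), Counter()
    n_pos = n_neg = 0
    for keywords, is_video_page in examples:
        if is_video_page:
            n_pos += 1
            pos.update(set(keywords))
        else:
            n_neg += 1
            neg.update(set(keywords))
    weights = {}
    for kw in set(pos) | set(neg):
        weights[kw] = (math.log((pos[kw] + 1) / (n_pos + 2))
                       - math.log((neg[kw] + 1) / (n_neg + 2)))
    return weights


def is_potential_video_page(keywords, weights):
    """Classify by the sign of the summed weights; unseen keywords count as 0."""
    return sum(weights.get(kw, 0.0) for kw in set(keywords)) > 0


# Toy, hand-made training data for illustration only.
training = [
    (["video", "show"], True),
    (["v", "play"], True),
    (["video", "play"], True),
    (["news", "article"], False),
    (["about"], False),
    (["news", "show"], False),
]
weights = learn_keyword_weights(training)
```

With this toy data, "video" receives a positive weight, "news" a negative one, and "show", which appears equally often in both classes, a weight of zero.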
Fig. 6 shows an exemplary device 600 for identifying video pages according to an embodiment of the present disclosure. Device 600 is a further implementation of device 200 in Fig. 2.
As shown in Fig. 6, in addition to the structured content acquisition module 202, the attribute extractor 204 and the video page classifier 206, the device 600 for identifying video pages may also include a URL parser 602 and a URL classifier 604. The URL parser 602 and the URL classifier 604 can be jointly configured to perform the method 500 for determining potential video pages based on URLs shown in Fig. 5.
The URL parser 602 can be configured to perform URL parsing on the URL of a web page. In one embodiment, the URL parser 602 may include a URL pattern parser 612 and a URL keyword parser 614. For example, the URL pattern parser 612 can be configured to perform URL pattern parsing on URLs that point to web pages on mainstream popular video websites, and the URL keyword parser 614 can be configured to perform URL keyword parsing on URLs that do not point to web pages on any mainstream popular video website.
The URL classifier 604 can be configured to perform URL-based classification using a URL-based classification model. In one embodiment, the URL classifier 604 may further include a URL pattern-based classifier 616 and a URL keyword-based classifier 618, and the URL-based classification model may further include a URL pattern-based classification model and a URL keyword-based classification model. For example, the URL pattern-based classifier 616 may include the URL pattern-based classification model and can be configured to perform URL pattern-based classification using that model. The URL keyword-based classifier 618 may include the URL keyword-based classification model and can be configured to perform URL keyword-based classification using that model.
The potential video pages determined by the URL classifier 604 can in turn be provided to subsequent modules for corresponding processing. For example, the structured content acquisition module 202 can obtain the structured content of the potential video pages determined by the URL classifier 604.
Fig. 7 shows an exemplary system 700 for identifying video pages according to an embodiment of the present disclosure. System 700 may include one or more processors 702. System 700 may also include a memory 704 connected to the one or more processors 702. The memory 704 can store computer-executable instructions that, when executed, cause the one or more processors 702 to perform any operation of the methods for identifying video pages according to the embodiments of the present disclosure described above.
Embodiments of the present disclosure can be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any operation of the methods for identifying video pages according to the embodiments of the present disclosure described above.
It should be appreciated that all operations in the methods described above are merely exemplary. The present disclosure is not limited to any operation in the methods or to the order of these operations, and should cover all other equivalent variations under the same or similar concepts.
It should also be appreciated that all modules in the devices described above can be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. In addition, any of these modules may be further divided into sub-modules or combined with other modules.
Processors have been described in connection with various devices and methods. These processors can be implemented using electronic hardware, computer software, or any combination thereof. Whether these processors are implemented as hardware or software will depend on the specific application and the overall design constraints imposed on the system. As an example, a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented as a microprocessor, a microcontroller, a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in the present disclosure. The functions of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented as software executed by a microprocessor, a microcontroller, a DSP, or another suitable platform.
Software should be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc. Software can reside in a computer-readable medium. A computer-readable medium may include, for example, a memory, and the memory may be, for example, a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk. Although memory is shown as separate from the processor in various aspects presented in the present disclosure, the memory may also be internal to the processor (e.g., a cache or a register).
This description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those skilled in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.
Claims (20)
1. A method for identifying video pages, comprising:
obtaining structured content of a web page;
extracting video object attributes from the structured content; and
determining whether the web page is a video page by using a page-based classification model built via machine learning, the video object attributes serving as input features of the page-based classification model.
2. The method according to claim 1, further comprising, before the obtaining:
determining that the URL of the web page points to a potential video page by using a URL keyword-based classification model or a URL pattern-based classification model built via machine learning.
3. The method according to claim 1, wherein the obtaining comprises:
capturing a source code document of the web page in a static crawling mode; and
parsing the source code document into the structured content.
4. The method according to claim 1, wherein the obtaining comprises:
capturing a source code document of the web page and at least one script file in a dynamic crawling mode; and
parsing the source code document and the at least one script file into the structured content.
5. The method according to claim 4, further comprising:
running the at least one script file,
wherein the parsing comprises parsing the source code document and the at least one script file that has been run into the structured content.
6. The method according to claim 1, wherein the structured content is represented as a DOM tree.
7. The method according to claim 1, wherein the video object attributes include at least one of: layout information, video type information, container type information, depth information of a video object in a DOM tree, and text information associated with the video object and/or at least one container corresponding to the video object.
8. The method according to claim 7, wherein the layout information includes at least one of:
(a) one or more of the width, height, top and left of the video object;
(b) one or more of the width, height, top and left of at least one container corresponding to the video object;
(c) the distance between the horizontal center of the video object and the horizontal center of the web page; and
(d) the distance between the top or bottom of the video object and the top or bottom of the web page.
9. The method according to claim 8, wherein the layout information is normalized using the height or width of the web page.
10. The method according to claim 1, wherein the extracting comprises:
detecting a video object from the structured content based on video identifier information.
11. The method according to claim 10, wherein the video identifier information includes at least one of: an HTML tag in which the video object is directly embedded, an HTML5 video tag in which the video object is directly embedded, and an Iframe tag in which a video page is embedded.
12. The method according to claim 1, wherein the web page includes multiple video objects, and the extracting further comprises:
extracting, from the structured content, the video object attributes of one or more of the multiple video objects.
13. The method according to claim 1, wherein the page-based classification model is built by using one or more of a boosted tree, a random forest, a neural network and a support vector machine (SVM) as a machine learning model during the machine learning.
14. The method according to claim 1, wherein the machine learning is performed using video object attributes of multiple web pages labeled as video pages or non-video pages, together with the labels of the multiple web pages, so as to build the page-based classification model.
15. A device for identifying video pages, comprising:
a structured content acquisition module for obtaining structured content of a web page;
an attribute extractor for extracting video object attributes from the structured content; and
a video page classifier for determining whether the web page is a video page by using a page-based classification model built via machine learning, the video object attributes serving as input features of the page-based classification model.
16. The device according to claim 15, further comprising:
a URL classifier for determining that the URL of the web page points to a potential video page by using a URL keyword-based classification model or a URL pattern-based classification model built via machine learning.
17. The device according to claim 15, wherein the video object attributes include at least one of: layout information, video type information, container type information, depth information of a video object in a DOM tree, and text information associated with the video object and/or at least one container corresponding to the video object.
18. The device according to claim 17, wherein the layout information includes at least one of:
(a) one or more of the width, height, top and left of the video object;
(b) one or more of the width, height, top and left of at least one container corresponding to the video object;
(c) the distance between the horizontal center of the video object and the horizontal center of the web page; and
(d) the distance between the top or bottom of the video object and the top or bottom of the web page.
19. The device according to claim 18, wherein the web page includes multiple video objects, and the attribute extractor is further configured to:
extract, from the structured content, the video object attributes of one or more of the multiple video objects.
20. A system for identifying video pages, comprising:
one or more processors; and
a memory storing computer-executable instructions that, when executed, cause the one or more processors to perform the method according to any one of claims 1-14.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2016/100192 WO2018053863A1 (en) | 2016-09-26 | 2016-09-26 | Identifying video pages |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108475275A (en) | 2018-08-31 |
Family
ID=61689295
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201680077528.6A Pending CN108475275A (en) | 2016-09-26 | 2016-09-26 | Identify video page |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108475275A (en) |
WO (1) | WO2018053863A1 (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101715004A (en) * | 2009-11-12 | 2010-05-26 | 中国科学院计算技术研究所 | Internet video-oriented distributed acquisition method and system |
CN102411587A (en) * | 2010-09-21 | 2012-04-11 | 腾讯科技(深圳)有限公司 | Webpage classification method and device |
CN103309862A (en) * | 2012-03-07 | 2013-09-18 | 腾讯科技(深圳)有限公司 | Webpage type recognition method and system |
CN103455600A (en) * | 2013-09-03 | 2013-12-18 | 小米科技有限责任公司 | Video URL (Uniform Resource Locator) grabbing method and device and server equipment |
US20150356195A1 (en) * | 2014-06-05 | 2015-12-10 | Apple Inc. | Browser with video display history |
US20160037071A1 (en) * | 2013-08-21 | 2016-02-04 | Xerox Corporation | Automatic mobile photo capture using video analysis |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103559234B (en) * | 2013-10-24 | 2017-01-25 | 北京邮电大学 | System and method for automated semantic annotation of RESTful Web services |
CN104077389A (en) * | 2014-06-27 | 2014-10-01 | 北京奇虎科技有限公司 | Display method of webpage element information and browser device |
Non-Patent Citations (1)
Title |
---|
Liu Zhilong (刘志龙): "Research on Web Video Information Extraction" (Web视频信息提取研究), China Master's Theses Full-text Database * |
Also Published As
Publication number | Publication date |
---|---|
WO2018053863A1 (en) | 2018-03-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180831 |