CN103778256A - Method for realizing extraction of Internet audiovisual programs based on context - Google Patents

Info

Publication number
CN103778256A
Authority
CN
China
Prior art keywords
audiovisual material
audiovisual
webpage
website
internet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410065728.2A
Other languages
Chinese (zh)
Other versions
CN103778256B (en)
Inventor
逯利军
钱培专
焦建华
林强
戚永蕾
张昆
张树民
宋聚平
侯卫东
李克民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI CERTUSNET INFORMATION TECHNOLOGY CO., LTD.
Original Assignee
CERTUSNET CORP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-02-26
Filing date
2014-02-26
Publication date
2014-05-07
Application filed by CERTUSNET CORP
Priority to CN201410065728.2A
Publication of CN103778256A
Application granted
Publication of CN103778256B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The invention relates to a method for extracting Internet audiovisual programs based on context. The method comprises the following steps: loading a predefined audiovisual program metadata database; loading the seed addresses of a website from which audiovisual programs are to be extracted; downloading the webpage content of that website; judging whether a downloaded webpage is the playback page of an audiovisual program; if it is, searching for the preceding text of the audiovisual program and generating an audiovisual program list; if it is not, quantifying the downloaded webpage content according to the loaded metadata database and saving it into a preceding-text set as the preceding text of an audiovisual program. With this method, the audiovisual programs of websites across the Internet can be extracted without creating an extraction template for each specific website, unnecessary interference in the capturing process can be avoided, and the accuracy of the captured audiovisual program information is ensured. The method therefore has a wider range of application.

Description

Method for realizing extraction of Internet audiovisual programs based on context
Technical field
The present invention relates to the field of Internet technology, in particular to the extraction of Internet audiovisual program information and playback links, and specifically to a method for extracting Internet audiovisual programs based on context.
Background art
The common existing approach to extracting Internet audiovisual programs is as follows: an extraction template is created for each type of program on a website, specifying the detailed element paths from which program information is extracted; a crawler then collects page elements according to the template; finally, the collected elements are aggregated into video program information. Given the enormous number of audiovisual websites on the Internet, this scheme requires one extraction template per website, and whenever a website is redesigned or its page structure is updated, the corresponding crawler template must be revised.
Under the prior art, crawling the audiovisual programs of the entire Internet into a consistent audiovisual program table would require an astronomical number of templates. Because websites are also upgraded and updated continually, maintaining that many templates is practically impossible.
Summary of the invention
The object of the invention is to overcome the above shortcomings of the prior art and to provide a context-based method for extracting Internet audiovisual programs that extracts the audiovisual programs of websites across the Internet without creating an extraction template for each specific website, ensures the accuracy of the captured audiovisual program information, and has a wider range of application.
To achieve this object, the method of the present invention is constituted as follows:
The method for extracting Internet audiovisual programs based on context is mainly characterized in that it comprises the following steps:
(1) loading a predefined audiovisual program metadata database;
(2) loading the seed addresses of the website from which audiovisual programs are to be extracted;
(3) downloading the webpage content of that website;
(4) judging whether the downloaded webpage is the playback page of an audiovisual program; if so, continuing with step (5); otherwise, continuing with step (6);
(5) searching for the preceding text of the audiovisual program and generating an audiovisual program list;
(6) quantifying the downloaded webpage content according to the loaded audiovisual program metadata database, and depositing it in the preceding-text set as the preceding text of an audiovisual program.
Preferably, the audiovisual program metadata comprises the director, leading actors, cast, release time, update time, and program synopsis of the audiovisual program.
Preferably, loading the seed addresses of the website from which audiovisual programs are to be extracted is specifically:
Loading the seed addresses of the website to be crawled from an XML file or a database.
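As a minimal sketch of loading seed addresses from XML, the Python fragment below parses a small document with the standard library. The element names `seeds` and `seed`, the function name `load_seeds`, and the example URLs are assumptions made for illustration; the patent does not specify a file layout.

```python
import xml.etree.ElementTree as ET


def load_seeds(xml_text):
    """Return the seed URLs found in <seed> elements (element name is assumed)."""
    root = ET.fromstring(xml_text)
    return [
        element.text.strip()
        for element in root.iter("seed")
        if element.text and element.text.strip()
    ]


# Hypothetical seed file content.
SAMPLE = """<seeds>
  <seed>http://video.example.com/</seed>
  <seed>http://tv.example.org/list</seed>
</seeds>"""

seeds = load_seeds(SAMPLE)
```

Reading the same list from a database would only change the source of the strings; the rest of the crawl would presumably be unaffected.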
Preferably, downloading the webpage content of the website from which audiovisual programs are to be extracted is specifically:
Using an HTTP client or a crawler to download the content of the specified webpage of the target website from the server to local storage.
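The download step could be sketched with Python's standard HTTP client as follows. The function name `download_page`, the `opener` parameter (added only so the function can be exercised without network access), the User-Agent string, and the UTF-8 fallback are all assumptions, not details from the patent.

```python
from urllib.request import Request, urlopen


def download_page(url, timeout=10.0, opener=urlopen):
    """Fetch one page and return its body as text."""
    request = Request(url, headers={"User-Agent": "av-extractor-sketch/0.1"})
    with opener(request, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared by the server, falling back to UTF-8 (assumed).
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

A production crawler would likely add retries, politeness delays, and robots.txt handling, none of which the text above describes.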
Preferably, searching for the preceding text of the audiovisual program and generating the audiovisual program list comprises the following steps:
(51) identifying the playback type corresponding to the audiovisual program;
(52) looking up the preceding text of the audiovisual program in the preceding-text set;
(53) merging the metadata information of the preceding text with the content data of the downloaded webpage to generate a complete record of the audiovisual program.
More preferably, identifying the playback type corresponding to the audiovisual program is specifically:
Identifying the playback type corresponding to the audiovisual program and using the corresponding player to verify that the audiovisual program can be played.
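Steps (52) and (53) can be sketched as a lookup followed by a merge. The sketch assumes that the preceding-text set is a dictionary keyed by the URL of the playback page that a details page links to; that keying, the function name `build_record`, and the rule that playback-page values win on conflict are illustrative assumptions.

```python
def build_record(play_url, play_info, preceding_set):
    """Merge details-page metadata with playback-page information."""
    preceding = preceding_set.get(play_url)
    if preceding is None:
        return None  # no preceding text is known for this playback page
    record = dict(preceding)   # metadata gathered from the details page
    record.update(play_info)   # playback-page data overrides on conflict (assumed)
    record["play_url"] = play_url
    return record


# Hypothetical data for illustration.
preceding_set = {"http://example.com/play/1": {"director": "A", "cast": "B"}}
record = build_record(
    "http://example.com/play/1", {"player_type": "flash"}, preceding_set
)
```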
Preferably, quantifying the downloaded webpage content according to the loaded audiovisual program metadata database and depositing it in the preceding-text set as the preceding text of an audiovisual program comprises the following steps:
(61) judging whether the webpage is the details page of an audiovisual program; if so, continuing with step (62); otherwise, continuing with step (3);
(62) quantifying the webpage according to the rules defined in the audiovisual program metadata database and judging whether the webpage is the preceding text of an audiovisual program; if so, continuing with step (63); otherwise, continuing with step (64);
(63) depositing the webpage in the preceding-text set as the preceding text of an audiovisual program, then continuing with step (64);
(64) judging whether the webpage is the last webpage of the website; if so, ending and exiting; otherwise, continuing with step (65);
(65) analyzing the hyperlinks of the webpage and adding them to the queue of webpages to be downloaded, then continuing with step (3).
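Step (65) can be sketched as a small link collector that feeds a download queue. The class `LinkCollector`, the function `enqueue_links`, and the use of a `seen` set to avoid revisiting pages are assumptions; in this sketch the check in step (64) would correspond to the queue becoming empty.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                self.links.append(absolute)


def enqueue_links(html, base_url, queue, seen):
    """Add every link not seen before to the queue of pages to download."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    for link in parser.links:
        if link not in seen:
            seen.add(link)
            queue.append(link)


queue, seen = deque(), {"http://example.com/"}
page = '<a href="/a">A</a><a href="/a#top">A2</a><a href="http://example.com/">home</a>'
enqueue_links(page, "http://example.com/", queue, seen)
```

The sketch does not restrict links to the seed website; a real crawler would probably filter by host before queueing.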
The method for extracting Internet audiovisual programs based on context according to this invention has the following beneficial effects:
(1) The feature quantification of audiovisual program information avoids unnecessary interference during information capture, which ensures that the captured audiovisual programs are accurate.
(2) Because audiovisual program metadata is invariant, when a website's layout or content is updated, an incremental crawl with this method is enough to capture the website's latest audiovisual program information.
(3) Verification against player rules ensures that every captured audiovisual program can actually be played.
(4) With only a small amount of configuration that is not tied to any specific website, audiovisual programs on the Internet are identified through the relationships between webpages, yielding the basic information and playback address of each program. The audiovisual programs of websites across the Internet can thus be extracted without creating an extraction template for each specific website, giving the method a wide range of application.
Brief description of the drawings
Fig. 1 is the flow chart of the method for extracting Internet audiovisual programs based on context according to the present invention.
Detailed description of the embodiments
To describe the technical content of the present invention more clearly, it is further described below with reference to specific embodiments.
Existing methods for capturing Internet audiovisual programs all configure templates for page layout and content in order to identify audiovisual programs.
The present invention starts from the audiovisual program itself and abstracts its metadata. For example, an audiovisual program generally has a release time or update time, a director, and actors. The present invention configures templates for these metadata, identifies the metadata in the main content display area of a webpage, and from them forms the information record that serves as the preceding text of the audiovisual program.
According to the present invention, the audiovisual program metadata template only needs to be configured once (or a small number of times). This avoids configuring a large number of templates for different websites, as required under the prior art, and also avoids later maintenance after a website's layout is updated, because the basic metadata of an existing audiovisual program does not change. For example, the director and actors of the film "Decisive Battle" never change.
An audiovisual program on the Internet has a details page, which gathers most of the program's metadata; these data form one part of the audiovisual program information. The details page contains a link to the playback page. The information of the playback page combined with the information of the details page forms the context of an audiovisual program, and from this context the system generates an audiovisual program record.
Implementation flow:
1. The system starts and loads the metadata categories and definitions from the predefined audiovisual program metadata database, together with the recognition features of webpage players.
2. The configured seed addresses of the web crawler are loaded; the expected audiovisual program information may be found at these addresses.
3. The webpage content in the queue to be crawled is downloaded using the network download logic defined by the crawler.
4. The webpage content is analyzed:
First, the player identification module identifies whether the page is the playback page of an audiovisual program.
The audiovisual program metadata collection module identifies whether the webpage is the details page of an audiovisual program.
The URL analysis module collects the hyperlinks of the page. These hyperlinks may lead to the following text of an audiovisual program's preceding text, or to the preceding text of a new audiovisual program. They are added to the crawler's queue to be crawled so that the next page can be fetched, and in this way the whole website is traversed.
5. If the current page is the playback page of an audiovisual program, the preceding text of this page is looked up in the preceding-text set, and the metadata of the preceding text is merged with the metadata of this page to generate a complete record of an audiovisual program.
6. If the current page is not the playback page of an audiovisual program, the page is quantified according to the rules of the metadata definition to judge whether it is the preceding text of an audiovisual program. If the quantification result satisfies the preceding-text rules of an audiovisual program, the current page is deposited in the preceding-text set.
7. If the system needs to crawl further, it jumps to step 3.
8. Once the system has analyzed all pending pages, the extraction of audiovisual programs is complete.
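The flow above can be sketched as a single loop with the individual modules passed in as functions. This is an illustrative sketch, not the patented implementation: the function `crawl`, its parameters, the `threshold` rule, and the choice to key the preceding-text set by the URLs a details page links to are all assumptions. The toy `SITE` dictionary stands in for real pages so that the loop runs without network access.

```python
from collections import deque


def crawl(seeds, fetch, is_play_page, quantify, extract_links, threshold=2):
    """Traverse pages from the seeds and return merged program records."""
    queue, seen = deque(seeds), set(seeds)
    preceding_set, records = {}, []
    while queue:                                  # steps 3 and 7
        url = queue.popleft()
        page = fetch(url)
        links = extract_links(page)               # step 4: URL analysis
        if is_play_page(page):                    # step 5: playback page
            meta = preceding_set.get(url)
            if meta is not None:
                records.append({**meta, "play_url": url})
        else:                                     # step 6: quantify the page
            meta = quantify(page)
            if len(meta) >= threshold:
                # Remember this page as preceding text for the pages it links to.
                for link in links:
                    preceding_set.setdefault(link, meta)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return records                                # step 8


# Hypothetical three-page site: home -> details -> playback.
SITE = {
    "/": {"links": ["/detail"], "meta": {}, "player": False},
    "/detail": {"links": ["/play"], "meta": {"director": "A", "cast": "B"}, "player": False},
    "/play": {"links": [], "meta": {}, "player": True},
}

records = crawl(
    ["/"],
    fetch=SITE.__getitem__,
    is_play_page=lambda page: page["player"],
    quantify=lambda page: page["meta"],
    extract_links=lambda page: page["links"],
)
```

Passing the modules as parameters keeps the loop independent of how player detection or quantification is actually done, which mirrors the separation into modules described in step 4.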
To make the object, technical solution, and advantages of the present invention clearer, specific embodiments of the invention are further described below with reference to the accompanying drawing.
Fig. 1 shows a method for extracting Internet audiovisual programs based on context provided by an embodiment of the present invention, comprising:
Step (1): load the audiovisual program feature database, that is, the metadata database.
Specifically, an audiovisual program is accompanied by a director, leading actors, cast, release time, update time, program synopsis, and so on; different metadata combinations can be configured for different types of audiovisual program.
Step (2): load the seed addresses of the website from which audiovisual programs are to be extracted.
Specifically, the websites to be crawled for audiovisual programs can be loaded from an XML file or a database.
Step (3): download the webpage content.
Specifically, an HTTP client or a crawler is used to download the specified webpage of the target website from the server to local storage.
Step (4): analyze the webpage content and determine whether the page is the playback page of an audiovisual program.
Specifically, the player identification module first identifies whether the page is the playback page of an audiovisual program and which kind of player it uses, for example a Flash player. The audiovisual program metadata collection module identifies whether the webpage is the details page of an audiovisual program and sorts out the content required by the audiovisual program metadata definition. The URL analysis module collects the links of the page so that the next page can be crawled.
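A player identification module of the kind described in step (4) might be sketched as a scan of the page's tags against a table of recognition features. The feature table `PLAYER_FEATURES`, the names `PlayerDetector` and `detect_player`, and the specific rules (an HTML5 `video` tag, or an `embed`/`object` tag with a Flash MIME type) are assumptions chosen for illustration; the patent only names the Flash player as an example.

```python
from html.parser import HTMLParser

# Hypothetical recognition features: player type -> predicate over a start tag.
PLAYER_FEATURES = {
    "html5": lambda tag, attrs: tag == "video",
    "flash": lambda tag, attrs: tag in ("embed", "object")
    and "shockwave-flash" in (attrs.get("type") or ""),
}


class PlayerDetector(HTMLParser):
    """Record the first player type whose feature matches a tag on the page."""

    def __init__(self):
        super().__init__()
        self.player_type = None

    def handle_starttag(self, tag, attrs):
        if self.player_type is None:
            attr_map = dict(attrs)
            for name, matches in PLAYER_FEATURES.items():
                if matches(tag, attr_map):
                    self.player_type = name
                    break


def detect_player(html):
    """Return the detected player type, or None if the page has no player."""
    detector = PlayerDetector()
    detector.feed(html)
    return detector.player_type
```

A page for which `detect_player` returns a value would be treated as a playback page; verifying that the program actually plays, as the method also requires, is outside this sketch.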
Step (5): search for the preceding text of the audiovisual program and generate the audiovisual program list.
Specifically, based on the audiovisual program information obtained in step (4), if the current page is the playback page of an audiovisual program, the preceding text of this page is looked up in the preceding-text set, and the metadata of the preceding text is merged with the metadata of this page to generate a complete record of an audiovisual program.
Step (6): quantify the webpage content using the audiovisual program feature database as the criterion, and treat it as the preceding text of an audiovisual program.
Specifically, based on the audiovisual program information obtained in step (4), the page is quantified to judge whether it is the preceding text of an audiovisual program. If the quantification result satisfies the preceding-text rules of an audiovisual program, the current page is deposited in the preceding-text set, which can be a hash table or a data table in a database.
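One way to picture the quantification in step (6) is to count how many metadata fields can be recognized on a page and to store the page only when enough are found. In the sketch below the labels in `LABELS`, the `min_fields` threshold, and the function names are assumptions; a Python dictionary plays the role of the hash table mentioned above.

```python
import re

# Hypothetical metadata definition: field name -> label expected on the page.
LABELS = {
    "director": "Director",
    "cast": "Cast",
    "release_time": "Release date",
    "synopsis": "Synopsis",
}
PATTERNS = {
    field: re.compile(rf"^\s*{re.escape(label)}\s*[:：]\s*(.+)$", re.IGNORECASE)
    for field, label in LABELS.items()
}


def quantify(text):
    """Return the metadata fields recognized in the page text."""
    found = {}
    for line in text.splitlines():
        for field, pattern in PATTERNS.items():
            match = pattern.match(line)
            if match and field not in found:
                found[field] = match.group(1).strip()
    return found


def store_if_preceding(url, text, preceding_set, min_fields=2):
    """Store the page as preceding text if enough fields were recognized."""
    meta = quantify(text)
    if len(meta) >= min_fields:
        preceding_set[url] = meta
        return True
    return False
```

Because the rules refer to metadata labels rather than to page element paths, the same definition could in principle be applied to pages from different websites, which is the property the method relies on.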
The method for extracting Internet audiovisual programs based on context according to this invention has the following beneficial effects:
(1) The feature quantification of audiovisual program information avoids unnecessary interference during information capture, which ensures that the captured audiovisual programs are accurate.
(2) Because audiovisual program metadata is invariant, when a website's layout or content is updated, an incremental crawl with this method is enough to capture the website's latest audiovisual program information.
(3) Verification against player rules ensures that every captured audiovisual program can actually be played.
(4) With only a small amount of configuration that is not tied to any specific website, audiovisual programs on the Internet are identified through the relationships between webpages, yielding the basic information and playback address of each program. The audiovisual programs of websites across the Internet can thus be extracted without creating an extraction template for each specific website, giving the method a wide range of application.
In this specification, the present invention has been described with reference to specific embodiments. Nevertheless, various modifications and changes can evidently be made without departing from the spirit and scope of the invention. The specification and drawing are therefore to be regarded as illustrative rather than restrictive.

Claims (7)

1. A method for extracting Internet audiovisual programs based on context, characterized in that the method comprises the following steps:
(1) loading a predefined audiovisual program metadata database;
(2) loading the seed addresses of the website from which audiovisual programs are to be extracted;
(3) downloading the webpage content of that website;
(4) judging whether the downloaded webpage is the playback page of an audiovisual program; if so, continuing with step (5); otherwise, continuing with step (6);
(5) searching for the preceding text of the audiovisual program and generating an audiovisual program list;
(6) quantifying the downloaded webpage content according to the loaded audiovisual program metadata database, and depositing it in the preceding-text set as the preceding text of an audiovisual program.
2. The method for extracting Internet audiovisual programs based on context according to claim 1, characterized in that the audiovisual program metadata comprises the director, leading actors, cast, release time, update time, and program synopsis of the audiovisual program.
3. The method for extracting Internet audiovisual programs based on context according to claim 1, characterized in that loading the seed addresses of the website from which audiovisual programs are to be extracted is specifically:
Loading the seed addresses of the website to be crawled from an XML file or a database.
4. The method for extracting Internet audiovisual programs based on context according to claim 1, characterized in that downloading the webpage content of the website from which audiovisual programs are to be extracted is specifically:
Using an HTTP client or a crawler to download the content of the specified webpage of the target website from the server to local storage.
5. The method for extracting Internet audiovisual programs based on context according to claim 1, characterized in that searching for the preceding text of the audiovisual program and generating the audiovisual program list comprises the following steps:
(51) identifying the playback type corresponding to the audiovisual program;
(52) looking up the preceding text of the audiovisual program in the preceding-text set;
(53) merging the metadata information of the preceding text with the content data of the downloaded webpage to generate a complete record of the audiovisual program.
6. The method for extracting Internet audiovisual programs based on context according to claim 5, characterized in that identifying the playback type corresponding to the audiovisual program is specifically:
Identifying the playback type corresponding to the audiovisual program and using the corresponding player to verify that the audiovisual program can be played.
7. The method for extracting Internet audiovisual programs based on context according to claim 1, characterized in that quantifying the downloaded webpage content according to the loaded audiovisual program metadata database and depositing it in the preceding-text set as the preceding text of an audiovisual program comprises the following steps:
(61) judging whether the webpage is the details page of an audiovisual program; if so, continuing with step (62); otherwise, continuing with step (3);
(62) quantifying the webpage according to the rules defined in the audiovisual program metadata database and judging whether the webpage is the preceding text of an audiovisual program; if so, continuing with step (63); otherwise, continuing with step (64);
(63) depositing the webpage in the preceding-text set as the preceding text of an audiovisual program, then continuing with step (64);
(64) judging whether the webpage is the last webpage of the website; if so, ending and exiting; otherwise, continuing with step (65);
(65) analyzing the hyperlinks of the webpage and adding them to the queue of webpages to be downloaded, then continuing with step (3).
CN201410065728.2A 2014-02-26 2014-02-26 Method for realizing extraction of Internet audiovisual programs based on context Active CN103778256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410065728.2A CN103778256B (en) 2014-02-26 2014-02-26 Method for realizing extraction of Internet audiovisual programs based on context

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410065728.2A CN103778256B (en) 2014-02-26 2014-02-26 Method for realizing extraction of Internet audiovisual programs based on context

Publications (2)

Publication Number Publication Date
CN103778256A (en) 2014-05-07
CN103778256B (en) 2017-02-01

Family

ID=50570491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410065728.2A Active CN103778256B (en) 2014-02-26 2014-02-26 Method for realizing extraction of Internet audiovisual programs based on context

Country Status (1)

Country Link
CN (1) CN103778256B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241307A (en) * 2018-08-29 2019-01-18 山东浪潮商用系统有限公司 A kind of performers and clerks' contents management method and system
CN115002068A (en) * 2022-05-09 2022-09-02 北京市博汇科技股份有限公司 Internet audio-visual program address automatic analysis method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090049122A1 (en) * 2006-08-14 2009-02-19 Benjamin Wayne System and method for providing a video media toolbar
CN102630041A (en) * 2012-04-01 2012-08-08 央视国际网络有限公司 Processing method, device and system for television program data
CN102902785A (en) * 2012-09-29 2013-01-30 合一网络技术(北京)有限公司 Webpage information acquisition system and method
CN103428525A (en) * 2013-07-22 2013-12-04 华中科技大学 Online inquiry and play control method and system for network videos and television programs

Also Published As

Publication number Publication date
CN103778256B (en) 2017-02-01

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20171214

Address after: KIC Business Center No. 433 Shanghai 200433 Yangpu District Shanghai Road No. 6 building 11 layer

Patentee after: SHANGHAI CERTUSNET INFORMATION TECHNOLOGY CO., LTD.

Address before: 210042 Xuanwu District, Xuanwu District, Jiangsu, Nanjing, No. 699-22, building 18

Patentee before: CERTUSNET CORP.

TR01 Transfer of patent right