CN101404026A - Crawler system construction method for video-previewing search engine - Google Patents

Crawler system construction method for video-previewing search engine

Info

Publication number
CN101404026A
Authority
CN
China
Prior art keywords
video
hyperlink
search engine
crawler system
construction method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008101808250A
Other languages
Chinese (zh)
Inventor
杨溥
郭军
陈�光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2008-11-25
Filing date
2008-11-25
Publication date
2009-04-08
Application filed by Beijing University of Posts and Telecommunications
Priority to CNA2008101808250A
Publication of CN101404026A
Legal status: Pending

Abstract

The invention discloses a method for constructing a crawler system of a previewable video search engine. The method comprises the following steps: (1) hyperlinks are mapped into a list; (2) the list state is detected; (3) an abstract picture is processed; (4) a video is processed; and (5) a video title is processed. The method can provide a universal design method for the crawler system of the previewable video search engine, and can provide a previewing data set for the previewable video search engine, thus simplifying the design and development of other parts of the previewable video search engine, and greatly reducing the development cost of the crawler system of the previewable video search engine as well as the previewable video search engine.

Description

Method for constructing a crawler system for a video-previewing search engine
Technical field
The present invention relates to methods for constructing network data acquisition systems, and in particular to a method for constructing a crawler system for a video-previewing search engine.
Background art
With the arrival of the information age and the development of image and video technology, video attracts a growing audience because of its strong visual impact. However, because video data is large and network bandwidth is generally limited, it is difficult for people to watch video conveniently on a local machine. Mainly for this reason, many video websites have been set up on the wide area network to play video online, so that people can watch video in real time. As the amount of video on these websites has surged, it has become difficult to find a desired video quickly and conveniently on the wide area network, and video search engines have emerged in response. Although a video search engine is a great convenience, video is not as easy to identify as text. Online video must also be downloaded and buffered for smooth playback, and its large data volume consumes considerable bandwidth, while the user's bandwidth and traffic are limited. The user therefore wishes to judge, before opening a video page, whether the video is the one being sought and whether it is worth watching. If it is not, there is no need to spend time and bandwidth watching it. Preview in video search engines has therefore attracted keen interest.
Video websites include a summary picture and a name for each video, and these two items reflect the main visual content of the video in condensed form, so the user can preview and judge a video from its summary picture and title. Collecting the preview data of videos is therefore the most important task in building a previewable video search engine. At present there is no effective, systematic method for constructing a system that collects video preview data. The present invention collects video preview data effectively by introducing a hyperlink mapping list and a lookup technique based on that list.
Summary of the invention
In view of the problems in the prior art, the object of the present invention is to provide a method for constructing a crawler system for a video-previewing search engine.
To achieve this object, the method of the present invention comprises the following steps:
(1) mapping hyperlinks into a list;
(2) detecting the list state;
(3) processing the summary picture;
(4) processing the video;
(5) processing the video title.
In the above method, step (3) further comprises:
(31) searching the hyperlink mapping list for the summary picture;
(32) downloading and storing the summary picture.
In the above method, step (4) further comprises:
(41) searching the hyperlink mapping list for the video;
(42) downloading and storing the video.
In the above method, step (5) further comprises:
(51) downloading the video playback page;
(52) extracting and storing the video title.
The beneficial effect of the present invention is that the described method provides a general design method for the crawler system of a video-previewing search engine. It supplies the video-previewing search engine with a preview data set, which simplifies the design and development of the other parts of the engine and significantly reduces the development cost of both the crawler system and the search engine as a whole.
Other features and advantages of the present invention will become clearer from the following description of preferred embodiments, which explain the principle of the invention by way of example with reference to the accompanying drawing.
Description of drawings
Fig. 1 is a flowchart of the method according to one embodiment of the present invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawing.
Fig. 1 is a flowchart of the method according to one embodiment of the present invention. The flow starts at step 101. Note that the video websites mentioned below are only examples; the particular video website does not limit the invention. Then, in step 102, all hyperlinks of the video web page are analyzed and extracted one by one in the order in which they appear in the page source code, from top to bottom and from left to right, and are finally mapped into a list. The start page should be a web page rich in video hyperlinks, such as a video playback page. This is only a preferred example, and the choice of initial video page does not limit the invention.
One embodiment of mapping hyperlinks into a list is to analyze the structure of the video web page and extract the hyperlinks into a table. This is illustrated by the following example.
<a href="http://www.tudou.com/programs/view/c74iyYGuDIc/" title="dominoes new record" target="new" class="inner"><img src="http://i01.img.tudou.com/data/imgs/i/023/746/281/m10.jpg" alt="dominoes new record" width="120" height="90" class="pack_clipImg"/></a>
The above is a section of source code from a video web page that contains a video hyperlink. It contains two hyperlinks:
http://www.tudou.com/programs/view/c74iyYGuDIc/
http://i01.img.tudou.com/data/imgs/i/023/746/281/m10.jpg
The first is the hyperlink address of the video playback page, and the second is the summary picture corresponding to that video hyperlink. A characteristic of how video websites are structured is that the hyperlink pointing to the video playback page and its corresponding summary picture are adjacent, and both are marked up in HTML. As the code snippet above shows, there is no other hyperlink between the two; the hyperlink pointing to the video playback page is marked with href=, and the corresponding summary picture is marked with img src=. Therefore, for a web page containing videos, a regular expression that matches the href= and img src= markers can find all hyperlinks in the page. In the example above, this finds href="http://www.tudou.com/programs/view/c74iyYGuDIc/" and img src="http://i01.img.tudou.com/data/imgs/i/023/746/281/m10.jpg".
All hyperlinks are then listed in the order in which they were found, which generates the hyperlink mapping list, and finally the mapping list of the current page is appended to the end of the existing hyperlink mapping list. One way to store the mapping list is as text, writing the current mapping list directly to the end of the existing list. Text is only an example; other storage forms such as a relational database can be used, and the specific storage form does not limit the invention. The above is one embodiment of mapping hyperlinks into a list; other embodiments do not limit the invention.
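The mapping of step 102 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names `map_hyperlinks` and `append_to_mapping_list`, the exact regular expression, and the choice to keep each entry as a text line of the form marker="url" are assumptions made for the example. The pattern only recognizes an img tag whose src attribute comes first, as in the snippet above.

```python
import re

# Matches the two markers named in the text: href="..." and img src="...".
# Spaces just inside the quotes are tolerated (an assumption).
LINK_RE = re.compile(r'(href|img\s*src)\s*=\s*"\s*([^"\s]+)\s*"', re.IGNORECASE)

def map_hyperlinks(page_source):
    """Return the page's hyperlinks, in source order, as text entries."""
    entries = []
    for match in LINK_RE.finditer(page_source):
        marker = "href" if match.group(1).lower() == "href" else "img src"
        entries.append(f'{marker}="{match.group(2)}"')
    return entries

def append_to_mapping_list(mapping_list, page_source):
    """Append the current page's entries to the end of the existing list."""
    mapping_list.extend(map_hyperlinks(page_source))
    return mapping_list

SNIPPET = (
    '<a href="http://www.tudou.com/programs/view/c74iyYGuDIc/" '
    'title="dominoes new record" target="new" class="inner">'
    '<img src="http://i01.img.tudou.com/data/imgs/i/023/746/281/m10.jpg" '
    'alt="dominoes new record" width="120" height="90" class="pack_clipImg"/></a>'
)

if __name__ == "__main__":
    for entry in map_hyperlinks(SNIPPET):
        print(entry)
```

Keeping the marker with each URL preserves the distinction between playback-page links and summary pictures that the later lookup steps rely on.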
After step 102, the flow proceeds to step 103.
In step 103, the state of the hyperlink mapping list is analyzed and detected. One embodiment marks the hyperlinks in the list that have already been processed, increments the marker by one each time, and checks whether the entry at the marker is empty. If it is empty, the whole mapping list has been processed; if it is not empty, the mapping list has not been fully processed. The above is one embodiment of detecting the state of the hyperlink mapping list; other embodiments do not limit the invention.
If the list has not been fully processed, the flow proceeds to step 104; if it has been fully processed, the flow proceeds to step 110.
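One way to picture the state check of step 103 is a list with a cursor that marks the next unprocessed entry. The sketch below assumes this reading of the embodiment; the class name `MappingList` and its methods are hypothetical and chosen only for illustration.

```python
class MappingList:
    """Hyperlink mapping list with a marker for the next unprocessed entry."""

    def __init__(self, entries=None):
        self.entries = list(entries or [])
        self.cursor = 0  # index of the next entry to process

    def current(self):
        """Return the entry at the marker, or None if the marker is past the end."""
        if self.cursor < len(self.entries):
            return self.entries[self.cursor]
        return None

    def is_finished(self):
        """True when the entry at the marker is empty, i.e. all entries are processed."""
        return self.current() is None

    def advance(self):
        """Mark the current entry as processed by moving the marker forward by one."""
        self.cursor += 1
```

With such a structure, the branch of Fig. 1 reduces to a single test: go to step 104 while `is_finished()` is false, and to step 110 once it is true.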
In step 104, the hyperlink mapping list generated in step 102 is searched for a summary picture hyperlink. One embodiment uses string matching: as in the code snippet of step 102, the string img src= is matched in the hyperlink mapping list, and the content that follows it, http://i01.img.tudou.com/data/imgs/i/023/746/281/m10.jpg, is the hyperlink of the summary picture. The above is one embodiment of searching for the summary picture; other embodiments do not limit the invention.
After step 104, the flow proceeds to step 105.
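The string matching of step 104 can be illustrated with a short Python sketch. It assumes the mapping list is held as a list of text entries of the form href="url" or img src="url"; the function name `find_next_picture` and the `start` parameter are hypothetical.

```python
PICTURE_MARKER = 'img src="'

def find_next_picture(entries, start=0):
    """Return (index, url) of the first summary picture entry at or after start.

    Returns None when no entry from that position onward carries the marker.
    """
    for index in range(start, len(entries)):
        entry = entries[index]
        if entry.startswith(PICTURE_MARKER) and entry.endswith('"'):
            # The content after the marker, without the closing quote, is the URL.
            return index, entry[len(PICTURE_MARKER):-1]
    return None
```

Returning the index together with the URL lets a later step look at the neighbouring entries without searching the list again.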
In step 105, the summary picture found in step 104 is downloaded and stored. One embodiment stores the downloaded summary picture in a relational database system, which makes the data easy to manage. The above is one embodiment of downloading and storing the summary picture; other embodiments do not limit the invention.
After step 105, the flow proceeds to step 106.
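As a rough sketch of step 105, the following Python code stores a downloaded summary picture in SQLite, used here only as a convenient stand-in for a relational database system. The table name `preview`, its columns, and the `fetch` callable (any function that maps a URL to bytes) are assumptions for the example; the patent does not specify a schema.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) preview table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS preview ("
        " id INTEGER PRIMARY KEY,"
        " picture_url TEXT UNIQUE,"
        " picture BLOB)"
    )
    return conn

def store_picture(conn, picture_url, fetch):
    """Download the picture with fetch(url) -> bytes and store it; return the row id.

    A picture that is already stored is not downloaded again.
    """
    row = conn.execute(
        "SELECT id FROM preview WHERE picture_url = ?", (picture_url,)
    ).fetchone()
    if row:
        return row[0]
    cursor = conn.execute(
        "INSERT INTO preview (picture_url, picture) VALUES (?, ?)",
        (picture_url, fetch(picture_url)),
    )
    conn.commit()
    return cursor.lastrowid
```

In a real crawler, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`; passing it in as a parameter keeps the storage logic independent of the network.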
In step 106, the hyperlink mapping list generated in step 102 is searched for the hyperlink of the video that corresponds to the summary picture downloaded and stored in step 105. One embodiment first locates the summary picture hyperlink by string matching and then matches the hyperlink immediately before it in the list. This embodiment rests on the following principle: in the structure of a video website, the video hyperlink and the corresponding summary picture hyperlink are adjacent, with no other hyperlink between them, and the video hyperlink precedes the picture hyperlink. As in the code snippet of step 102, the two hyperlinks are immediately adjacent in the hyperlink mapping list. Given the summary picture hyperlink from step 104, the entry img src="http://i01.img.tudou.com/data/imgs/i/023/746/281/m10.jpg" is located in the hyperlink mapping list, and matching the href= marker immediately before it yields the hyperlink of the corresponding video: http://www.tudou.com/programs/view/c74iyYGuDIc/. The above is one embodiment of searching for the corresponding video; other embodiments do not limit the invention.
After step 106, the flow proceeds to step 107.
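The backward match of step 106 can be sketched as follows, again assuming the mapping list is a list of text entries of the form href="url" or img src="url". The function name `find_video_for_picture` is hypothetical.

```python
VIDEO_MARKER = 'href="'

def find_video_for_picture(entries, picture_url):
    """Return the video hyperlink stored immediately before the given picture.

    Returns None if the picture is not in the list or if the preceding
    entry is not an href entry.
    """
    target = f'img src="{picture_url}"'
    try:
        index = entries.index(target)  # locate the summary picture entry
    except ValueError:
        return None
    if index == 0:
        return None  # nothing precedes the picture
    previous = entries[index - 1]  # step back exactly one hyperlink
    if previous.startswith(VIDEO_MARKER) and previous.endswith('"'):
        return previous[len(VIDEO_MARKER):-1]
    return None
```

The check on the preceding entry guards against pages where a picture is not wrapped in a link, a case the adjacency principle does not cover.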
In step 107, the video found in step 106 is downloaded and stored. One embodiment first obtains the real video address by an address-redirection technique, and then stores the downloaded video in a relational database system, where it can be inserted after the summary picture data stored in step 105, so that an associated data set of the two is obtained. The above is one embodiment of downloading and storing the video; other embodiments do not limit the invention.
After step 107, the flow proceeds to step 108.
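The text does not detail how the real video address is obtained. If it is read as following a chain of redirects from the playback link to the media file, the idea can be sketched as below. This reading is an assumption, as are the function name `resolve_real_address` and the `get_location` callable, which is expected to return the redirect target for a URL or None when the URL is final.

```python
from urllib.parse import urljoin

def resolve_real_address(url, get_location, max_hops=5):
    """Follow redirects reported by get_location(url) until a final address is reached.

    Raises ValueError on a redirect loop or after max_hops redirects.
    """
    seen = {url}
    for _ in range(max_hops):
        location = get_location(url)
        if not location:
            return url  # no further redirect: this is the real address
        url = urljoin(url, location)  # resolve relative redirect targets
        if url in seen:
            raise ValueError("redirect loop")
        seen.add(url)
    raise ValueError("too many redirects")
```

How a particular video website exposes its media address is site-specific, so `get_location` would have to be written per site; the sketch only shows the bounded follow-until-stable loop.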
In step 108, the video playback page is downloaded; that is, the video hyperlink found in step 106 is downloaded. One embodiment sends a data request to the host that corresponds to the hyperlink. In the example of step 102, a request for programs/view/c74iyYGuDIc is sent to the host www.tudou.com and the data is downloaded. The above is one embodiment of downloading the video playback page; other embodiments do not limit the invention.
After step 108, the flow proceeds to step 109.
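To make the request of step 108 concrete, the sketch below splits a hyperlink into host and path and builds the text of a plain HTTP GET request. The function name `build_request` and the choice of HTTP/1.1 with a Connection: close header are assumptions for illustration; the text only says that a data request is sent to the host.

```python
from urllib.parse import urlsplit

def build_request(url):
    """Return (host, path, request_text) for a simple HTTP GET of the given URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request_text = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {parts.netloc}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return parts.netloc, path, request_text

if __name__ == "__main__":
    host, path, text = build_request("http://www.tudou.com/programs/view/c74iyYGuDIc/")
    print(host)
    print(path)
```

In practice a library call such as `urllib.request.urlopen(url).read()` performs the same exchange, including the network connection, in one step.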
In step 109, the title of the video is extracted and stored; that is, the title tag <title> is searched for in the video playback page downloaded in step 108. One embodiment finds <title> in the playback page by string search. In the example of step 102, searching for <title> in the video playback page http://www.tudou.com/programs/view/c74iyYGuDIc/ yields <title>Dutch dominoes new record</title>. The text between the tags is the title of the video. This middle part is extracted and stored in a relational database system, where it can be inserted before the summary picture data stored in step 105, so that an associated data set of all three is obtained. The above is one embodiment of extracting and storing the video title; other embodiments do not limit the invention.
After step 109, the video playback page downloaded in step 108 is processed by step 102.
In step 110, the system ends.
Specific embodiments of the present invention have been described above with reference to the drawing. Nothing in this description limits the substance of the invention, and the invention is not limited to the implementation details given above; it can be realized in other embodiments without departing from its features. A person of ordinary skill in the art, having read the specification, may modify or vary the embodiments described above without departing from the substance and scope of the invention.
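The string search of step 109 can be sketched in a few lines of Python. The function name `extract_title` is hypothetical, and the sketch assumes the page uses a lowercase <title> tag without attributes.

```python
def extract_title(page_source):
    """Return the text between <title> and </title>, or None if either tag is missing."""
    open_tag, close_tag = "<title>", "</title>"
    start = page_source.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)
    end = page_source.find(close_tag, start)
    if end == -1:
        return None
    return page_source[start:end].strip()

if __name__ == "__main__":
    page = "<html><head><title>Dutch dominoes new record</title></head></html>"
    print(extract_title(page))
```

Titles on real sites often carry a site-name suffix, so a production crawler would probably need an extra site-specific trimming step that is not shown here.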
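The overall flow of Fig. 1 (steps 101 to 110) can be summarized in one loop. The sketch below is an illustration under several assumptions: `fetch_text` is any caller-supplied function that returns page source for a URL, the helper names are hypothetical, and the downloading of picture and video data (steps 105 and 107) is left out. The `seen_pages` set and the `max_pages` limit are additions not described in the text; without some such guard, pages that link to each other would keep the list growing indefinitely.

```python
import re

LINK_RE = re.compile(r'(href|img\s*src)\s*=\s*"\s*([^"\s]+)\s*"', re.IGNORECASE)

def extract_links(page):
    """Step 102: list the page's hyperlinks in source order as marker="url" entries."""
    entries = []
    for m in LINK_RE.finditer(page):
        marker = "href" if m.group(1).lower() == "href" else "img src"
        entries.append(f'{marker}="{m.group(2)}"')
    return entries

def url_of(entry, marker):
    """Return the URL of an entry carrying the given marker, else None."""
    prefix = marker + '="'
    if entry.startswith(prefix) and entry.endswith('"'):
        return entry[len(prefix):-1]
    return None

def extract_title(page):
    """Step 109: return the text between <title> and </title>, or None."""
    start = page.find("<title>")
    end = page.find("</title>", start + 7) if start != -1 else -1
    return page[start + 7:end].strip() if end != -1 else None

def crawl(start_url, fetch_text, max_pages=10):
    """Run the flow of Fig. 1 and return the collected preview records."""
    mapping = extract_links(fetch_text(start_url))        # step 102
    seen_pages = {start_url}
    records = []
    cursor = 0
    while cursor < len(mapping):                          # step 103
        picture = url_of(mapping[cursor], "img src")      # step 104
        video = url_of(mapping[cursor - 1], "href") if cursor > 0 else None  # step 106
        if picture and video and video not in seen_pages and len(seen_pages) < max_pages:
            seen_pages.add(video)
            page = fetch_text(video)                      # step 108
            records.append({"title": extract_title(page), # step 109
                            "picture": picture, "video": video})
            mapping.extend(extract_links(page))           # back to step 102
        cursor += 1
    return records                                        # step 110
```

Each record groups the three items the method associates: title, summary picture, and video link.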

Claims (4)

1. A method for constructing a crawler system for a video-previewing search engine, characterized by comprising the following steps:
(1) mapping hyperlinks into a list;
(2) detecting the list state;
(3) processing the summary picture;
(4) processing the video;
(5) processing the video title.
2. The method for constructing a crawler system for a video-previewing search engine according to claim 1, characterized in that step (3) further comprises:
(31) searching the hyperlink mapping list for the summary picture;
(32) downloading and storing the summary picture.
3. The method for constructing a crawler system for a video-previewing search engine according to claim 1, characterized in that step (4) further comprises:
(41) searching the hyperlink mapping list for the video;
(42) downloading and storing the video.
4. The method for constructing a crawler system for a video-previewing search engine according to claim 1, characterized in that step (5) further comprises:
(51) downloading the video playback page;
(52) extracting and storing the video title.
CNA2008101808250A 2008-11-25 2008-11-25 Crawler system construction method for video-previewing search engine Pending CN101404026A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008101808250A CN101404026A (en) 2008-11-25 2008-11-25 Crawler system construction method for video-previewing search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008101808250A CN101404026A (en) 2008-11-25 2008-11-25 Crawler system construction method for video-previewing search engine

Publications (1)

Publication Number Publication Date
CN101404026A true CN101404026A (en) 2009-04-08

Family

ID=40538038

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008101808250A Pending CN101404026A (en) 2008-11-25 2008-11-25 Crawler system construction method for video-previewing search engine

Country Status (1)

Country Link
CN (1) CN101404026A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102325225A (en) * 2011-09-20 2012-01-18 北京鹏润鸿途科技有限公司 Method and device for playing video of mobile phone website
CN102982161A (en) * 2012-12-05 2013-03-20 北京奇虎科技有限公司 Method and device for acquiring webpage information

Similar Documents

Publication Publication Date Title
US20230325431A1 (en) System And Method For Labeling Objects For Use In Vehicle Movement
JP4062908B2 (en) Server device and image display device
KR100403714B1 (en) System and method for facilitating internet search by providing web document layout image and web site structure
KR101475126B1 (en) System and method of inclusion of interactive elements on a search results page
CA2610208C (en) Learning facts from semi-structured text
US7725451B2 (en) Generating clusters of images for search results
US7606794B2 (en) Active Abstracts
CN102402604B (en) Effective forward ordering of search engine
CN103699700B (en) A kind of generation method of search index, system and associated server
US11188591B2 (en) Video matching service to offline counterpart
US20010049677A1 (en) Methods and systems for enabling efficient retrieval of documents from a document archive
CN103544176A (en) Method and device for generating page structure template corresponding to multiple pages
CN104462590B (en) Information search method and device
CN103838862B (en) Video searching method, device and terminal
CN105631051A (en) Character recognition based mobile augmented reality reading method and reading system thereof
US7421416B2 (en) Method of managing web sites registered in search engine and a system thereof
CN102511048A (en) Method and system for preprocessing the region of video containing text
CN101446954A (en) Wide area network crawler system for a video website
CN105072460A (en) Information annotation and association method, system and device based on VCE
CN101404026A (en) Crawler system construction method for video-previewing search engine
JP2007317105A (en) On demand link producing system
Oyri News Item Extraction for Text Mining in Web Newspapers
CN103425766B (en) browse synchronous method and device
CN104504070B (en) A kind of method and apparatus of search
Fung et al. Discover information and knowledge from websites using an integrated summarization and visualization framework

Legal Events

Date Code Title Description
C06 Publication
C57 Notification of unclear or unknown address
DD01 Delivery of document by public notice

Addressee: Xu Weiran

Document name: Notification of Passing Preliminary Examination of the Application for Invention

PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20090408