WO2017107403A1 - Method and apparatus for scheduling updated e-book chapters - Google Patents

Method and apparatus for scheduling updated e-book chapters

Info

Publication number
WO2017107403A1
WO2017107403A1 PCT/CN2016/085545 CN2016085545W WO2017107403A1 WO 2017107403 A1 WO2017107403 A1 WO 2017107403A1 CN 2016085545 W CN2016085545 W CN 2016085545W WO 2017107403 A1 WO2017107403 A1 WO 2017107403A1
Authority
WO
WIPO (PCT)
Prior art keywords
book
url
pattern
site
chapter
Application number
PCT/CN2016/085545
Other languages
English (en)
French (fr)
Inventor
邝景胜
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Application filed by 北京奇虎科技有限公司, 奇智软件(北京)有限公司
Publication of WO2017107403A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558: Details of hyperlinks; Management of linked annotations

Definitions

  • the present invention relates to the field of electronic book technologies, and in particular, to a method and apparatus for scheduling an e-book update chapter.
  • the existing approach continuously crawls the chapter list page of a target e-book on each e-book related site and compares the currently crawled chapter list page with the previously crawled one to determine which chapter list pages have been updated; the updated chapters of the target e-book are then crawled based on the updated chapter list pages.
  • the present invention has been made in order to provide a scheduling method and apparatus for an e-book update chapter that overcomes the above problems or at least partially solves or alleviates the above problems.
  • a method for scheduling an e-book update chapter including:
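  • for a uniform resource locator (URL) newly added within an e-book related site, determining the Pattern of the URL according to the URL Pattern dictionary of the site to which the URL belongs;
  • according to the Pattern of the URL, identifying in reverse the e-book corresponding to the URL from a preset e-book mode information base; and determining the identified e-book as an updated e-book;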
  • a schedule is initiated for the chapter list page of the updated e-book within the site to which the URL belongs, from which all updated chapters of the updated e-book are retrieved.
  • a scheduling apparatus for an e-book update chapter including:
  • a URL mode determining module configured to determine, for a URL newly added within an e-book related site, the Pattern of the URL according to the URL Pattern dictionary of the site to which the URL belongs;
  • an updated e-book identification module configured to identify in reverse, according to the Pattern of the URL, the e-book corresponding to the URL from a preset e-book mode information base, and to determine the identified e-book as an updated e-book;
  • an updated chapter scheduling module configured to, for the updated e-book identified by the updated e-book identification module, initiate scheduling of the chapter list of that e-book within the site to which the URL belongs, and to crawl all updated chapters of the e-book from it.
  • a computer program comprising computer readable code which, when run on a computing device, causes the computing device to perform any of the above scheduling methods for an e-book update chapter.
  • a computer readable medium storing a computer program as described above is provided.
  • the URLs newly added within an e-book related site can be identified in advance; then, according to the Pattern of a newly added URL, the e-book corresponding to that URL is identified as an updated e-book, and all updated chapters are crawled through the chapter list page of the updated e-book.
  • compared with the existing approach of determining updated chapter list pages by continuously crawling chapter list pages, the present invention determines the updated e-book in reverse from the newly added chapter content page URL, then determines the updated chapter list page, and crawls all updated chapters based on it.
  • the updated e-book can be identified in time without frequently crawling the chapter list pages of e-books, which ensures the timeliness of the updated chapters and greatly reduces the number of crawling operations, improving the crawling efficiency of updated chapters.
  • FIG. 1 is a schematic flow chart showing a scheduling method of an e-book update chapter according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram showing a storage structure of an e-book mode information library according to an embodiment of the present invention
  • FIGS. 3a and 3b are schematic diagrams showing the internal structure of a scheduling apparatus for an e-book update chapter according to an embodiment of the present invention
  • FIG. 4 is a block diagram schematically showing a computing device for performing a scheduling method of an e-book update chapter according to the present invention
  • Fig. 5 schematically shows a storage unit for holding or carrying program code implementing a scheduling method of an e-book update chapter according to the present invention.
  • the inventor of the present invention has found that, in an actual application, when an e-book is updated at a certain site, a chapter content page URL (Uniform Resource Locator) of the updated chapter is added to the site.
  • the inventor of the present invention therefore considers that the chapter content page URLs newly added within a site can be determined first; the e-book corresponding to a newly added chapter content page URL is then identified in reverse as an updated e-book, and the chapter list page of the updated e-book is determined; all updated chapters are crawled through that chapter list page.
  • compared with the existing scheme, which first determines the updated chapter list page and then finds the updated chapters in it, the present invention determines the updated chapter list page in reverse from the updated chapter content page URL. This quickly identifies the updated e-book and removes the need to crawl e-book chapter list pages frequently, which improves the crawling efficiency of updated chapters.
  • the inventor also found that an e-book related site usually has active index pages, that is, pages that change frequently and continuously expose new links within the site. The chapter content page URLs newly added within the site can therefore be identified quickly through the active index pages of the e-book related site, which further improves the crawling efficiency of updated chapters.
  • the URL Pattern dictionary of the e-book related site may be established in advance.
  • a set number of URLs belonging to the site may be collected in advance.
  • the set number is set by a person skilled in the art based on experience.
  • the collected URLs are grouped according to the fragment structure of the URL; the URLs in the same group share the same fragment structure.
  • for each group, the fragment structure shared by the URLs in the group is identified as a Pattern of the site, and each identified Pattern is added to the URL Pattern dictionary of the site.
  • the collected URLs described above can be divided into two groups.
  • one group shares the fragment structure "http://vipreader.***.com/BookReader/vip,(\d+).aspx", and the other group shares the fragment structure "http://read.***.com/BookReader/(\d+).aspx".
  • here (\d+) is a wildcard; within each group, the URLs differ in the content at the wildcard position.
  • the embodiment of the present invention provides a method for scheduling an e-book update chapter based on a pre-established URL Pattern dictionary of each site. As shown in FIG. 1 , the process specifically includes the following steps:
  • S101 Determine a Pattern of the newly added URL according to a URL pattern dictionary of the site to which the newly added URL belongs, for the newly added URL in the e-book related site.
  • the newly added URL in the e-book related site may be identified in advance, and then the e-book corresponding to the newly added URL is identified through subsequent steps, thereby identifying the updated e-book.
  • the identification of new URLs in the e-book related sites will be supplemented later.
  • the site to which each newly added URL belongs may be determined, and the URL Pattern dictionary of that site obtained. Then, for each newly added URL, the Pattern of the URL is determined according to the URL Pattern dictionary of its site. For example, the newly added URL can be compared with each Pattern in the URL Pattern dictionary to find the Pattern that matches it best, and that Pattern is taken as the Pattern of the newly added URL.
  • according to the Pattern of the newly added URL, the e-book corresponding to the URL is identified in reverse from the preset e-book mode information base, and the identified e-book is determined as the updated e-book.
  • the e-book mode information base is set up in advance. For each e-book related site, it stores the Patterns of the site, the common part location of each Pattern, and the IDs (identity codes) of all e-books currently included in the site. How the e-book mode information base is set up is described in detail later.
  • the Pattern of the newly added URL is determined by step S101.
  • the common part location corresponding to the Pattern of the newly added URL can be looked up in the preset e-book mode information base.
  • specifically, the Pattern of the newly added URL may be compared with the Patterns of the site stored in the e-book mode information base to find the stored Pattern identical to it; the common part location stored with the found Pattern is taken as the common part location corresponding to the Pattern of the newly added URL.
  • the content of the newly added URL at the found common part location is extracted as the ID of the URL.
  • the e-book corresponding to the ID of the newly added URL in the e-book mode information base is identified as the e-book corresponding to the URL. Since the URL is new in the site and has the same Pattern and ID as the URL of the chapter list page of the identified e-book, it can be determined that the URL is an updated chapter page of the identified e-book; at the same time, the identified e-book can be determined as an updated e-book.
  • when chapters of an e-book are updated, a chapter content page is added to the site for each updated chapter, and the chapter list page of the e-book may also be updated so that links to the new chapter content pages are added to it.
  • scheduling of the chapter list page of the updated e-book may therefore be initiated within the site to which the newly added URL belongs, and the links to the newly added chapter content pages are determined from the chapter list page so that all updated chapters of the updated e-book are crawled.
  • the updated e-book is inversely determined from the newly added chapter content page URL, and then the updated chapter list page is determined.
  • the solution of the invention can identify the updated e-book in time, which ensures the timeliness of its updated chapters and greatly reduces the number of crawling operations, improving the crawling efficiency of updated chapters.
  • the newly added URL in the e-book related site mentioned in step S101 can be identified by the following method:
  • the existing link library stores all URLs that the e-book related sites contained before the current scheduling run.
  • the existing link library can use the Bloom Filter algorithm to store all URLs contained in the e-book related sites.
  • the Bloom Filter algorithm can be used to query whether the URL belongs to an existing link library.
  • the e-book mode information library mentioned in step S102 can be set by the following method:
  • the Pattern dictionary of the site contains all the Patterns involved in the eBook.
  • for each e-book, the common part shared by the chapter content page URLs of the e-book under their Patterns is extracted and used as the ID of the e-book at the site.
  • the location of the ID of the e-book in the Pattern is determined and used as the common part location of the Pattern.
  • Fig. 2 shows a storage structure of an e-book mode information base.
  • an e-book can have a unique code or multiple codes within a site.
  • the code of the e-book can be a numeric code consisting of a set of numbers (eg, 3615564), or a combination of letters and numbers (eg, dQv0z50b27Q1).
  • where an e-book has a unique code within the site, the e-book has the same ID at the common part location of each of its Patterns.
  • where an e-book has several codes within the site, its ID may differ between the common part locations of different Patterns.
  • for example, under Pattern 1 the ID of the e-book may be a numeric code, while under Pattern 2 the ID of the e-book is a combination code.
  • when an e-book is newly added to a site, the URLs of the chapter content pages it currently contains can be obtained; according to the URL Pattern dictionary of the site, the Pattern of the URL of each chapter content page of the newly added e-book is determined, and the Patterns involved in the newly added e-book are thereby collected.
  • the common part shared by the chapter content page URLs of the newly added e-book under their Patterns is extracted and used as the ID of the newly added e-book at the site.
  • for each Pattern involved in the e-book, the location of the ID of the e-book in the Pattern is determined and used as the common part location of the Pattern.
  • the newly added e-book, together with its ID at the common part location of each Pattern involved, is added to the e-book mode information base.
  • the present invention further provides a scheduling apparatus for an e-book update chapter.
  • the apparatus includes: a URL mode determining module 301, an updated e-book identification module 302, and an updated chapter scheduling module 303.
  • the URL mode determining module 301 is configured to determine a Pattern of the newly added URL according to a URL pattern dictionary of the site to which the newly added URL belongs, for the newly added URL in the e-book related site.
  • the updated e-book identification module 302 is configured to identify in reverse, according to the Pattern of the newly added URL, the e-book corresponding to the URL from the preset e-book mode information base, and to determine the identified e-book as the updated e-book.
  • the updated e-book identification module 302 can look up the common part location corresponding to the Pattern of the newly added URL in the preset e-book mode information base, and extract the content of the newly added URL at that location as the ID of the URL. The updated e-book identification module 302 can then identify the e-book corresponding to the extracted ID in the e-book mode information base as the e-book corresponding to the URL.
  • the updated chapter scheduling module 303 is configured to, for the updated e-book identified by the updated e-book identification module 302, initiate scheduling of the chapter list of that e-book within the site to which the newly added URL belongs, and to crawl all updated chapters of the e-book from it.
  • in addition to the URL mode determining module 301, the updated e-book identification module 302, and the updated chapter scheduling module 303, the scheduling apparatus provided by the present invention may further include a newly added URL identification module 304.
  • the newly added URL identification module 304 is configured to obtain an active index page of the e-book related site according to a preset scheduling period; and for each URL in the active index page, query whether the URL belongs to an existing link library; if not, The URL is identified as a new URL.
  • the existing link library stores all the URLs included in the e-book related site before the current schedule.
  • the scheduling device of the e-book update chapter may further include: a URL storage module 305.
  • the URL storage module 305 is configured to use the Bloom Filter algorithm to store all URLs currently included in the e-book related site in the existing link library.
  • accordingly, the newly added URL identification module 304 can obtain the active index page of the e-book related site according to a preset scheduling period; for each URL in the active index page, it uses the Bloom Filter algorithm to query whether the URL belongs to the existing link library. If not, the URL is identified as a newly added URL.
  • the scheduling apparatus of the e-book update chapter provided by the present invention may further include: a URL Pattern dictionary construction module 306.
  • the URL Pattern dictionary construction module 306 is configured to: collect in advance, for each e-book related site, a set number of URLs belonging to the site; group the collected URLs according to the fragment structure of the URL, where the URLs in the same group share the same fragment structure; for each group, identify the fragment structure shared by the URLs in the group as a Pattern of the site; and add each identified Pattern of the site to the URL Pattern dictionary of the site.
  • the scheduling apparatus of the e-book update chapter provided by the present invention may further include: an e-book mode information base setting module 307.
  • the e-book mode information base setting module 307 is configured to: collect in advance, for each e-book related site, all e-books currently included in the site; determine the Pattern of the URL of each chapter content page of an e-book according to the URL Pattern dictionary of the site, and thereby collect the Patterns involved in the e-book; use the common part shared by the chapter content page URLs under their Patterns as the ID of the e-book at the site; for each Pattern involved in the e-book, determine the location of the ID in the Pattern and use it as the common part location of the Pattern; and build an inverted index from the Patterns, the common part locations of the Patterns, and the ID of the e-book at the site, and store it in the e-book mode information base.
  • in the technical solution of the present invention, the URLs newly added within an e-book related site can be identified in advance; then, according to the Pattern of a newly added URL, the e-book corresponding to that URL is identified as an updated e-book, and all updated chapters are crawled through the chapter list page of the updated e-book.
  • compared with the existing approach of determining updated chapter list pages by continuously crawling chapter list pages, the present invention determines the updated e-book in reverse from the newly added chapter content page URL, then determines the updated chapter list page, and crawls all updated chapters based on it.
  • the updated e-book can be identified in time without frequently crawling the chapter list pages of e-books, which ensures the timeliness of the updated chapters and greatly reduces the number of crawling operations, improving the crawling efficiency of updated chapters.
  • modules in the devices of the embodiments can be adaptively changed and placed in one or more devices different from the embodiment.
  • the modules or units or components of the embodiments may be combined into one module or unit or component, and further they may be divided into a plurality of sub-modules or sub-units or sub-components.
  • except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination.
  • Each feature disclosed in this specification (including the accompanying claims, the abstract and the drawings) may be replaced by alternative features that provide the same, equivalent or similar purpose.
  • the various component embodiments of the present invention may be implemented in hardware, or in a software module running on one or more processors, or in a combination thereof.
  • Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the scheduling apparatus for an e-book update chapter according to embodiments of the present invention.
  • the invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.
  • Such a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • Figure 4 illustrates a computing device that can implement a scheduling method for an e-book update chapter in accordance with the present invention.
  • the computing device conventionally includes a processor 410 and a computer program product or computer readable medium in the form of a memory 420.
  • the memory 420 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM.
  • Memory 420 has a memory space 430 for program code 431 for performing any of the method steps described above.
  • storage space 430 for program code may include various program code 431 for implementing various steps in the above methods, respectively.
  • the program code can be read from or written to one or more computer program products.
  • These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks. Such computer program products are typically portable or fixed storage units as described with reference to FIG. 5.
  • the storage unit may have storage segments, storage spaces, and the like arranged similarly to the memory 420 in the computing device of FIG. 4.
  • the program code can, for example, be compressed in an appropriate form.
  • the storage unit includes computer readable code 431', that is, code readable by a processor such as 410, which, when run by a computing device, causes the computing device to perform each step of the methods described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

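A method and apparatus for scheduling updated e-book chapters. The method comprises: for a uniform resource locator (URL) newly added within an e-book related site, determining the Pattern of the URL according to the URL Pattern dictionary of the site to which the URL belongs (S101); according to the Pattern of the URL, identifying in reverse the e-book corresponding to the URL from a preset e-book mode information base, and determining the identified e-book as an updated e-book (S102); and initiating scheduling of the chapter list page of the updated e-book within the site to which the URL belongs, and crawling all updated chapters of the updated e-book from it (S103). With this scheduling method and apparatus, updated e-books can be identified quickly, which speeds up the crawling of updated chapters; moreover, frequent crawling operations are unnecessary, which improves the crawling efficiency of updated chapters.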

Description

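Method and apparatus for scheduling updated e-book chapters
Technical Field
The present invention relates to the field of e-book technology, and in particular to a method and apparatus for scheduling updated e-book chapters.
Background
In recent years, following serialized books online has become fashionable, and readers who follow a book want to see the newly released content of a serialized e-book (for example, a novel) as soon as it appears. The demand for timely delivery of updated e-book chapters is therefore pressing.
If the update time of an e-book were known exactly, its updated chapters could be crawled promptly. However, authors update an e-book somewhat randomly, so update times are hard to predict.
At present, the existing approach continuously crawls, for a target e-book, the chapter list page of that e-book on each e-book related site, and compares the currently crawled chapter list page with the previously crawled one to determine which chapter list pages have been updated; the updated chapters of the target e-book are then crawled based on the updated chapter list pages.
Crawling updated chapters with the existing scheme does not require predicting the update time of an e-book, but it does require frequent crawling of e-book chapter list pages on every site, so the crawling volume is large. Moreover, when an e-book has not been updated, a large number of these crawling operations are wasted, which makes the crawling of updated chapters inefficient.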
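Summary of the Invention
In view of the above problems, the present invention is proposed to provide a method and apparatus for scheduling updated e-book chapters that overcome the above problems or at least partially solve or alleviate them.
According to one aspect of the present invention, a method for scheduling updated e-book chapters is provided, comprising:
for a uniform resource locator (URL) newly added within an e-book related site, determining the Pattern of the URL according to the URL Pattern dictionary of the site to which the URL belongs;
according to the Pattern of the URL, identifying in reverse the e-book corresponding to the URL from a preset e-book mode information base, and determining the identified e-book as an updated e-book;
initiating scheduling of the chapter list page of the updated e-book within the site to which the URL belongs, and crawling all updated chapters of the updated e-book from it.
According to another aspect of the present invention, an apparatus for scheduling updated e-book chapters is provided, comprising:
a URL mode determining module, configured to determine, for a URL newly added within an e-book related site, the Pattern of the URL according to the URL Pattern dictionary of the site to which the URL belongs;
an updated e-book identification module, configured to identify in reverse, according to the Pattern of the URL, the e-book corresponding to the URL from a preset e-book mode information base, and to determine the identified e-book as an updated e-book;
an updated chapter scheduling module, configured to, for the updated e-book identified by the updated e-book identification module, initiate scheduling of the chapter list of that e-book within the site to which the URL belongs, and to crawl all updated chapters of the e-book from it.
According to yet another aspect of the present invention, a computer program is provided, comprising computer readable code which, when run on a computing device, causes the computing device to perform any of the above methods for scheduling updated e-book chapters.
According to still another aspect of the present invention, a computer readable medium is provided, in which the computer program described above is stored.
The beneficial effects of the present invention are as follows:
In the technical solution of the present invention, the URLs newly added within an e-book related site can be identified in advance; then, according to the Pattern of a newly added URL, the e-book corresponding to that URL is identified as an updated e-book, and all updated chapters are crawled through the chapter list page of the updated e-book. Thus, compared with the existing approach of determining updated chapter list pages by continuously crawling chapter list pages, the present invention determines the updated e-book in reverse from the newly added chapter content page URL, then determines the updated chapter list page, and crawls all updated chapters based on it. The solution identifies updated e-books in time without frequent crawling of e-book chapter list pages, which ensures the timeliness of the updated chapters, greatly reduces the number of crawling operations, and improves the crawling efficiency of updated chapters.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.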
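Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 schematically shows a flow chart of a method for scheduling updated e-book chapters according to an embodiment of the present invention;
FIG. 2 schematically shows the storage structure of an e-book mode information base according to an embodiment of the present invention;
FIGS. 3a and 3b schematically show the internal structure of an apparatus for scheduling updated e-book chapters according to an embodiment of the present invention;
FIG. 4 schematically shows a block diagram of a computing device for performing the method for scheduling updated e-book chapters according to the present invention; and
FIG. 5 schematically shows a storage unit for holding or carrying program code that implements the method for scheduling updated e-book chapters according to the present invention.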
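Detailed Description
The present invention is further described below with reference to the drawings and specific embodiments.
The inventor of the present invention found that, in practice, when an e-book is updated on a site, chapter content page URLs (Uniform Resource Locators) for the updated chapters are added to that site.
The inventor therefore considers that the chapter content page URLs newly added within a site can be determined first; the e-book corresponding to a newly added chapter content page URL is then identified in reverse as an updated e-book, and the chapter list page of the updated e-book is determined; all updated chapters are crawled through that chapter list page. Compared with the existing scheme, which first determines the updated chapter list page and then finds the updated chapters in it, determining the updated chapter list page in reverse from the updated chapter content page URL quickly identifies the updated e-book and removes the need to crawl e-book chapter list pages frequently, which improves the crawling efficiency of updated chapters.
Further, the inventor also found that, in practice, an e-book related site usually has active index pages, that is, pages that change frequently and continuously expose new links within the site. The inventor therefore considers that the chapter content page URLs newly added within a site can be identified quickly through the active index pages of the e-book related site, which further improves the crawling efficiency of updated chapters.
The technical solution of the present invention is described in detail below with reference to the drawings.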
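In the embodiments of the present invention, before updated e-book chapters are scheduled, a URL Pattern dictionary can be built in advance for each e-book related site.
Specifically, for each e-book related site, a set number of URLs belonging to the site can be collected in advance. The set number is chosen by those skilled in the art based on experience.
The collected URLs are then grouped according to the fragment structure of the URL. The URLs in the same group share the same fragment structure.
For each group, the fragment structure shared by the URLs in the group is identified as a Pattern of the site. Each identified Pattern of the site is added to the URL Pattern dictionary of the site.
For example, the following URLs were collected for a certain e-book related site:
http://read.***.com/BookReader/dQv0z50b27Q1,PGnD-w-U8wQex0RJOkJclQ2.aspx
http://vipreader.***.com/BookReader/vip,3615564,91338738.aspx
http://read.***.com/BookReader/dQv0z50b27Q1,4JP4TS5F50cex0RJOkJclQ2.aspx
http://read.***.com/BookReader/3qBoEm8m3j41,ua6SKk-yETAex0RJOkJclQ2.aspx
http://read.***.com/BookReader/dQv0z50b27Q1,q73U8ohhJScex0RJOkJclQ2.aspx
http://vipreader.***.com/BookReader/vip,3615564,91338739.aspx
http://vipreader.***.com/BookReader/vip,3588468,88861202.aspx
According to the fragment structure of the URLs, the collected URLs can be divided into two groups.
One group shares the fragment structure "http://vipreader.***.com/BookReader/vip,(\d+).aspx".
The other group shares the fragment structure "http://read.***.com/BookReader/(\d+).aspx".
Here (\d+) is a wildcard; within each group, the URLs differ in the content at the wildcard position.
After grouping, "http://vipreader.***.com/BookReader/vip,(\d+).aspx" and "http://read.***.com/BookReader/(\d+).aspx" can be identified as two Patterns of the site, and both are added to the URL Pattern dictionary of the site.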
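The grouping step above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the procedure fixed by the specification: the helper names `structure_key` and `build_pattern_dictionary`, the `min_group_size` parameter, the `example.com` hosts standing in for the masked `***` host, and the heuristic of treating the comma- and dot-separated tokens of the last path segment as fragments are all hypothetical. The sketch emits one wildcard per varying token, so its Patterns are more detailed than the simplified ones quoted above.

```python
import re
from collections import defaultdict

# Sample URLs from the text, with the masked host replaced by example.com (assumption).
SAMPLE_URLS = [
    "http://read.example.com/BookReader/dQv0z50b27Q1,PGnD-w-U8wQex0RJOkJclQ2.aspx",
    "http://vipreader.example.com/BookReader/vip,3615564,91338738.aspx",
    "http://read.example.com/BookReader/dQv0z50b27Q1,4JP4TS5F50cex0RJOkJclQ2.aspx",
    "http://read.example.com/BookReader/3qBoEm8m3j41,ua6SKk-yETAex0RJOkJclQ2.aspx",
    "http://read.example.com/BookReader/dQv0z50b27Q1,q73U8ohhJScex0RJOkJclQ2.aspx",
    "http://vipreader.example.com/BookReader/vip,3615564,91338739.aspx",
    "http://vipreader.example.com/BookReader/vip,3588468,88861202.aspx",
]


def structure_key(url):
    """Return a key shared by URLs with the same fragment structure (hypothetical heuristic)."""
    prefix, _, last = url.rpartition("/")
    return prefix, tuple(re.findall(r"[,.]", last))


def build_pattern_dictionary(urls, min_group_size=2):
    """Group URLs by structure and turn each group into one regular-expression Pattern."""
    groups = defaultdict(list)
    for url in urls:
        groups[structure_key(url)].append(url)

    patterns = []
    for (prefix, delimiters), members in groups.items():
        if len(members) < min_group_size:
            continue  # too few samples to tell fixed fragments from variable ones
        rows = [re.split(r"[,.]", u.rpartition("/")[2]) for u in members]
        pieces = []
        for column in zip(*rows):
            if len(set(column)) == 1:
                pieces.append(re.escape(column[0]))  # fixed fragment
            elif all(token.isdigit() for token in column):
                pieces.append(r"(\d+)")  # numeric wildcard
            else:
                pieces.append(r"([^/,.]+)")  # generic wildcard
        tail = pieces[0]
        for delimiter, piece in zip(delimiters, pieces[1:]):
            tail += re.escape(delimiter) + piece
        patterns.append(re.escape(prefix) + "/" + tail)
    return patterns
```

Calling `build_pattern_dictionary(SAMPLE_URLS)` yields one Pattern per group; a production crawler would likely need a more careful tokenizer and a larger URL sample per site.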
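Based on the URL Pattern dictionaries built in advance for each site, an embodiment of the present invention provides a method for scheduling updated e-book chapters. As shown in FIG. 1, the process comprises the following steps:
S101: For a URL newly added within an e-book related site, determine the Pattern of the newly added URL according to the URL Pattern dictionary of the site to which it belongs.
In practice, when chapters of an e-book are updated, a chapter content page is usually added to the site for each updated chapter. Therefore, in the embodiments of the present invention, the URLs newly added within an e-book related site can be identified in advance, and the e-book corresponding to each newly added URL is then identified in the subsequent steps, which identifies the updated e-book. How newly added URLs within an e-book related site are identified is described later.
In the embodiments of the present invention, after the URLs newly added within an e-book related site have been identified, the site to which each newly added URL belongs can be determined and the URL Pattern dictionary of that site obtained. Then, for each newly added URL, the Pattern of the URL is determined according to the URL Pattern dictionary of its site. For example, the newly added URL can be compared one by one with each Pattern in the URL Pattern dictionary to find the Pattern that matches it best, and that Pattern is taken as the Pattern of the newly added URL.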
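A minimal Python sketch of step S101 follows. The specification does not define how the "matching degree" is measured, so the measure used here (among the Patterns that fully match the URL, prefer the one with the most literal characters) is an assumption, as are the function names `literal_length` and `determine_pattern` and the representation of Patterns as regular-expression strings.

```python
import re


def literal_length(pattern):
    """Rough specificity score: pattern length once simple capture groups are removed."""
    return len(re.sub(r"\([^)]*\)", "", pattern))


def determine_pattern(url, pattern_dictionary):
    """Return the most specific Pattern that fully matches url, or None if none matches."""
    candidates = [p for p in pattern_dictionary if re.fullmatch(p, url)]
    if not candidates:
        return None
    return max(candidates, key=literal_length)
```

Returning `None` for a URL that matches no Pattern lets the caller skip links on the index page that are not chapter content pages.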
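S102: According to the Pattern of the newly added URL, identify in reverse the e-book corresponding to the URL from the preset e-book mode information base, and determine the identified e-book as an updated e-book.
The e-book mode information base is set up in advance. For each e-book related site, it stores the Patterns of the site, the common part location of each Pattern, and the IDs (identity codes) of all e-books currently included in the site. How the e-book mode information base is set up is described in detail later.
In the embodiments of the present invention, after the Pattern of the newly added URL has been determined in step S101, the common part location corresponding to that Pattern can be looked up in the preset e-book mode information base. Specifically, the Pattern of the newly added URL can be compared with the Patterns of the site stored in the e-book mode information base to find the stored Pattern identical to it, and the common part location stored with the found Pattern is taken as the common part location corresponding to the Pattern of the newly added URL.
The content of the newly added URL at the found common part location is then extracted as the ID of the URL, and the e-book corresponding to that ID in the e-book mode information base is identified as the e-book corresponding to the URL. Since the URL is new in the site and has the same Pattern and ID as the URL of the chapter list page of the identified e-book, it can be determined that the URL is an updated chapter page of the identified e-book; at the same time, the identified e-book can be determined as an updated e-book.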
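The lookup in step S102 might be sketched as below. The layout of the information base (a dictionary keyed by Pattern, with the common part location stored as a capture-group index under the hypothetical key `id_group`) and the book names `book-A` and `book-B` are illustrative assumptions; the specification only requires that Patterns, common part locations and e-book IDs be stored together.

```python
import re

# Hypothetical information base for one site: Pattern -> common part location + ID index.
VIP_PATTERN = r"http://vipreader\.example\.com/BookReader/vip,(\d+),(\d+)\.aspx"
INFO_BASE = {
    VIP_PATTERN: {
        "id_group": 1,  # the e-book ID sits in the first capture group
        "books": {"3615564": "book-A", "3588468": "book-B"},
    },
}


def identify_updated_book(url, pattern, info_base):
    """Extract the ID at the Pattern's common part location and look up the e-book."""
    entry = info_base.get(pattern)
    if entry is None:
        return None  # Pattern not known for this site
    match = re.fullmatch(pattern, url)
    if match is None:
        return None  # URL does not actually follow the Pattern
    book_id = match.group(entry["id_group"])
    return entry["books"].get(book_id)
```

A `None` result for an unknown ID could signal a newly added e-book rather than an updated one, which the text handles separately.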
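S103: Initiate scheduling of the chapter list page of the updated e-book within the site to which the newly added URL belongs, and crawl all updated chapters of the updated e-book from it.
In practice, when chapters of an e-book are updated, in addition to a chapter content page being added to the site for each updated chapter, the chapter list page of the e-book may also be updated so that links to the newly added chapter content pages are added to it.
Therefore, in the embodiments of the present invention, after the updated e-book has been determined in steps S101 and S102, scheduling of the chapter list page of the updated e-book can be initiated within the site to which the newly added URL belongs, and the links to the newly added chapter content pages are determined from the chapter list page so that all updated chapters of the updated e-book are crawled.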
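One way to picture step S103 is the following sketch, which takes the already fetched HTML of a chapter list page and returns the chapter links that were not known before. Fetching the page is left out, and the names `LinkCollector` and `updated_chapter_urls`, as well as the idea of keeping a set of previously known chapter URLs per e-book, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def updated_chapter_urls(chapter_list_html, known_chapter_urls):
    """Return chapter links on the list page that are not yet in known_chapter_urls."""
    collector = LinkCollector()
    collector.feed(chapter_list_html)
    return [url for url in collector.links if url not in known_chapter_urls]
```

Each returned URL would then be queued for crawling as an updated chapter of the e-book.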
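In the solution of the present invention, the updated e-book is determined in reverse from the newly added chapter content page URL, and the updated chapter list page is then determined. Compared with the existing approach of determining updated chapter list pages by continuously crawling chapter list pages, the solution identifies updated e-books in time, which ensures the timeliness of the updated chapters, greatly reduces the number of crawling operations, and improves the crawling efficiency of updated chapters.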
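In the embodiments of the present invention, the newly added URLs within an e-book related site mentioned in step S101 can be identified as follows:
According to a preset scheduling period, obtain the active index page of the e-book related site; for each URL in the active index page, query whether the URL belongs to the existing link library; if not, identify the URL as a newly added URL.
The existing link library stores all URLs that the e-book related site contained before the current scheduling run. In practice, the existing link library can use the Bloom Filter algorithm to store all URLs contained in the e-book related site.
Thus, for each URL in the active index page, the Bloom Filter algorithm can be used to query whether the URL belongs to the existing link library.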
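The membership check described above can be illustrated with a small hand-written Bloom filter. This is a sketch under assumptions, not the library the inventors used: the class `BloomFilter`, its default sizes, the SHA-256 based double hashing, and the helper `find_new_urls` are all hypothetical. Because a Bloom filter can report false positives, a genuinely new URL may occasionally be treated as already known; it never reports a stored URL as missing.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings, using double hashing derived from SHA-256."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step so positions vary
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def find_new_urls(index_page_urls, link_library):
    """Return URLs from the active index page that the link library has not seen yet."""
    new_urls = []
    for url in index_page_urls:
        if url not in link_library:
            new_urls.append(url)
            link_library.add(url)  # remember it for the next scheduling period
    return new_urls
```

Running `find_new_urls` once per scheduling period on the links of the active index page keeps memory use fixed regardless of how many URLs the site accumulates.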
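In the embodiments of the present invention, the e-book mode information base mentioned in step S102 can be set up as follows:
For each e-book related site, collect in advance all e-books currently included in the site. According to the URL Pattern dictionary of the site, determine the Pattern of the URL of each chapter content page of an e-book, and thereby collect the Patterns involved in the e-book. In fact, the URL Pattern dictionary of the site contains all the Patterns involved in the e-book. Then, for each e-book, extract the common part shared by the chapter content page URLs of the e-book under their Patterns, and use it as the ID of the e-book at the site. Then, for each Pattern involved in the e-book, determine the location of the ID of the e-book in that Pattern and use it as the common part location of the Pattern.
From the Patterns involved in the e-book, the common part locations of the Patterns, and the ID of the e-book at the site, build an inverted index and store it in the e-book mode information base. FIG. 2 shows a storage structure of the e-book mode information base.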
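The construction of the inverted index can be sketched as follows. The function name `build_info_base`, the dictionary layout with the hypothetical key `id_group`, the book names, and the rule for finding the common part (the first capture group whose value is identical across all of a book's chapter URLs under a Pattern) are assumptions for illustration; the storage structure of FIG. 2 is not reproduced here.

```python
import re
from collections import defaultdict


def build_info_base(books, pattern_dictionary):
    """Build Pattern -> {"id_group": index, "books": {id: book}} from chapter URLs.

    books maps a book name to the list of its chapter content page URLs.
    """
    info_base = {}
    for book, chapter_urls in books.items():
        captured = defaultdict(list)  # Pattern -> captured groups of each chapter URL
        for url in chapter_urls:
            for pattern in pattern_dictionary:
                match = re.fullmatch(pattern, url)
                if match:
                    captured[pattern].append(match.groups())
                    break
        for pattern, rows in captured.items():
            # The common part is the first group that is identical across chapters.
            for index, column in enumerate(zip(*rows), start=1):
                if len(set(column)) == 1:
                    entry = info_base.setdefault(
                        pattern, {"id_group": index, "books": {}}
                    )
                    if entry["id_group"] == index:
                        entry["books"][column[0]] = book
                    break
    return info_base
```

With only one chapter URL per book every group looks constant, so the sketch falls back to the first group; a real system would want several chapters per book before trusting the inferred location.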
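In practice, different e-books have different identity codes. Therefore, as shown in FIG. 2, for each Pattern of a site, different e-books have different content at the common part location of that Pattern. For example, the content of e-book 1 at the common part location of Pattern1 is ID1, while the content of e-book 2 at the common part location of Pattern1 is ID2.
In practice, an e-book may have a single code or several codes within a site. The code of an e-book may be a numeric code consisting of a string of digits (for example, 3615564), or a combination code consisting of letters and digits (for example, dQv0z50b27Q1).
Where an e-book has a single code within the site, the e-book has the same ID at the common part location of each of its Patterns.
Where an e-book has several codes within the site, its ID may differ between the common part locations of different Patterns. For example, under Pattern 1 the ID of the e-book may be a numeric code, while under Pattern 2 the ID of the e-book is a combination code.
In practice, when an e-book is newly added to a site, the URLs of the chapter content pages it currently contains can be obtained; according to the URL Pattern dictionary of the site, the Pattern of the URL of each chapter content page of the newly added e-book is determined, and the Patterns involved in the newly added e-book are thereby collected.
Then the common part shared by the chapter content page URLs of the newly added e-book under their Patterns is extracted and used as the ID of the newly added e-book at the site. For each Pattern involved in the e-book, the location of the ID of the e-book in that Pattern is determined and used as the common part location of the Pattern. The newly added e-book, together with its ID at the common part location of each Pattern involved, is then added to the e-book mode information base.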
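Based on the above method for scheduling updated e-book chapters, the present invention further provides an apparatus for scheduling updated e-book chapters. As shown in FIG. 3a, the apparatus includes: a URL mode determining module 301, an updated e-book identification module 302, and an updated chapter scheduling module 303.
The URL mode determining module 301 is configured to determine, for a URL newly added within an e-book related site, the Pattern of the newly added URL according to the URL Pattern dictionary of the site to which it belongs.
The updated e-book identification module 302 is configured to identify in reverse, according to the Pattern of the newly added URL, the e-book corresponding to the URL from the preset e-book mode information base, and to determine the identified e-book as an updated e-book.
Specifically, the updated e-book identification module 302 can look up the common part location corresponding to the Pattern of the newly added URL in the preset e-book mode information base, and extract the content of the newly added URL at that location as the ID of the URL. The updated e-book identification module 302 can then identify the e-book corresponding to the extracted ID in the e-book mode information base as the e-book corresponding to the URL.
The updated chapter scheduling module 303 is configured to, for the updated e-book identified by the updated e-book identification module 302, initiate scheduling of the chapter list of that e-book within the site to which the newly added URL belongs, and to crawl all updated chapters of the e-book from it.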
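Further, in the embodiments of the present invention, as shown in FIG. 3b, in addition to the URL mode determining module 301, the updated e-book identification module 302, and the updated chapter scheduling module 303, the apparatus provided by the present invention may further include a newly added URL identification module 304.
The newly added URL identification module 304 is configured to obtain the active index page of the e-book related site according to a preset scheduling period; for each URL in the active index page, query whether the URL belongs to the existing link library; and, if not, identify the URL as a newly added URL.
The existing link library stores all URLs that the e-book related site contained before the current scheduling run.
Further, the apparatus may also include a URL storage module 305. The URL storage module 305 is configured to use the Bloom Filter algorithm to store all URLs currently contained in the e-book related site in the existing link library.
Accordingly, the newly added URL identification module 304 can obtain the active index page of the e-book related site according to a preset scheduling period; for each URL in the active index page, it uses the Bloom Filter algorithm to query whether the URL belongs to the existing link library; if not, the URL is identified as a newly added URL.
Preferably, as shown in FIG. 3b, the apparatus provided by the present invention may further include a URL Pattern dictionary construction module 306.
The URL Pattern dictionary construction module 306 is configured to: collect in advance, for each e-book related site, a set number of URLs belonging to the site; group the collected URLs according to the fragment structure of the URL, where the URLs in the same group share the same fragment structure; for each group, identify the fragment structure shared by the URLs in the group as a Pattern of the site; and add each identified Pattern of the site to the URL Pattern dictionary of the site.
Preferably, as shown in FIG. 3b, the apparatus provided by the present invention may further include an e-book mode information base setting module 307.
The e-book mode information base setting module 307 is configured to: collect in advance, for each e-book related site, all e-books currently included in the site; determine the Pattern of the URL of each chapter content page of an e-book according to the URL Pattern dictionary of the site, and thereby collect the Patterns involved in the e-book; use the common part shared by the chapter content page URLs under their Patterns as the ID of the e-book at the site; for each Pattern involved in the e-book, determine the location of the ID in that Pattern and use it as the common part location of the Pattern; and build an inverted index from the Patterns involved in the e-book, the common part locations of the Patterns, and the ID of the e-book at the site, and store it in the e-book mode information base.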
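In the technical solution of the present invention, the URLs newly added within an e-book related site can be identified in advance; then, according to the Pattern of a newly added URL, the e-book corresponding to that URL is identified as an updated e-book, and all updated chapters are crawled through the chapter list page of the updated e-book. Thus, compared with the existing approach of determining updated chapter list pages by continuously crawling chapter list pages, the present invention determines the updated e-book in reverse from the newly added chapter content page URL, then determines the updated chapter list page, and crawls all updated chapters based on it. The solution identifies updated e-books in time without frequent crawling of e-book chapter list pages, which ensures the timeliness of the updated chapters, greatly reduces the number of crawling operations, and improves the crawling efficiency of updated chapters.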
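Numerous specific details are set forth in the description provided herein. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components in an embodiment may be combined into one module or unit or component, and they may furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include certain features included in other embodiments but not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the apparatus for scheduling updated chapters of an e-book according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.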
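For example, FIG. 4 shows a computing device that can implement the method for scheduling updated chapters of an e-book according to the invention. The computing device conventionally includes a processor 410 and a computer program product or computer-readable medium in the form of a memory 420. The memory 420 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. The memory 420 has a storage space 430 for program code 431 for performing any of the method steps described above. For example, the storage space 430 for program code may include individual program codes 431 for implementing the various steps of the above method. These program codes may be read from or written into one or more computer program products. These computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 5. The storage unit may have storage segments, storage space, and the like arranged similarly to the memory 420 in the computing device of FIG. 4. The program code may, for example, be compressed in a suitable form. Typically, the storage unit includes computer-readable code 431', that is, code that can be read by a processor such as 410, which, when run by a computing device, causes the computing device to perform the steps of the method described above.
References herein to "one embodiment", "an embodiment", or "one or more embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. In addition, note that instances of the phrase "in one embodiment" here do not necessarily all refer to the same embodiment.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order. These words may be interpreted as names.
Moreover, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made herein is illustrative rather than restrictive, and the scope of the invention is defined by the appended claims.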

Claims (16)

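  1. A method for scheduling updated chapters of an e-book, comprising:
    for a newly added uniform resource locator (URL) in an e-book related site, determining the Pattern of the URL according to the URL Pattern dictionary of the site to which the URL belongs;
    reversely identifying, according to the Pattern of the URL, the e-book corresponding to the URL from a preset e-book pattern information library, and determining the identified e-book as an updated e-book;
    initiating scheduling of the chapter list page of the updated e-book within the site to which the URL belongs, and fetching all updated chapters of the updated e-book from it.
  2. The method of claim 1, wherein the URL Pattern dictionary of a site is established in advance as follows:
    for each e-book related site, collecting in advance a set number of URLs belonging to the site;
    grouping the collected URLs according to their segment structure, wherein the URLs in one group share the same segment structure;
    for each group, identifying the segment structure shared by the URLs in the group as one Pattern of the site;
    adding each identified Pattern of the site to the URL Pattern dictionary of the site.
  3. The method of claim 2, wherein the e-book pattern information library is set up in advance as follows:
    for each e-book related site, collecting in advance all e-books currently contained in the site;
    determining, according to the URL Pattern dictionary of the site, the Pattern of the URL of each chapter content page of an e-book, and thereby collecting the Patterns involved in the e-book;
    taking the common part between the Patterns of the URLs of the chapter content pages as the ID of the e-book at the site;
    for each Pattern involved in the e-book, determining the position of the ID within that Pattern and taking it as the common part position of that Pattern;
    building an inverted index according to the Patterns involved in the e-book, the common part positions of the Patterns, and the ID of the e-book at the site, and storing it in the e-book pattern information library.
  4. The method of claim 3, wherein reversely identifying, according to the Pattern of the URL, the e-book corresponding to the URL from the preset e-book pattern information library specifically comprises:
    looking up, in the preset e-book pattern information library, the common part position corresponding to the Pattern of the URL;
    extracting the content located at the common part position in the URL as the identity code (ID) of the URL;
    identifying the e-book corresponding to the ID in the e-book pattern information library as the e-book corresponding to the URL.
  5. The method of any one of claims 1-4, wherein the newly added URL in the e-book related site is identified in advance as follows:
    obtaining the active index pages of the e-book related site according to a preset scheduling period;
    for each URL in the active index pages, querying whether the URL belongs to an existing link library, and if not, identifying the URL as a newly added URL;
    wherein the existing link library stores all URLs that the e-book related site contained before the current scheduling.
  6. The method of claim 5, wherein the existing link library stores all URLs contained in the e-book related site using the Bloom Filter algorithm.
  7. The method of claim 6, wherein querying, for each URL in the active index pages, whether the URL belongs to the existing link library specifically comprises:
    for each URL in the active index pages, querying through the Bloom Filter algorithm whether the URL belongs to the existing link library.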
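  8. An apparatus for scheduling updated chapters of an e-book, comprising:
    a URL pattern determination module, adapted to, for a newly added URL in an e-book related site, determine the Pattern of the URL according to the URL Pattern dictionary of the site to which the URL belongs;
    an updated e-book identification module, adapted to reversely identify, according to the Pattern of the URL, the e-book corresponding to the URL from a preset e-book pattern information library, and to determine the identified e-book as an updated e-book;
    an updated chapter scheduling module, adapted to, for the updated e-book identified by the updated e-book identification module, initiate scheduling of the chapter list of the e-book within the site to which the URL belongs, and fetch all updated chapters of the e-book from it.
  9. The apparatus of claim 8, further comprising:
    a URL Pattern dictionary construction module, adapted to, for each e-book related site, collect in advance a set number of URLs belonging to the site; group the collected URLs according to their segment structure, wherein the URLs in one group share the same segment structure; for each group, identify the segment structure shared by the URLs in the group as one Pattern of the site; and add each identified Pattern of the site to the URL Pattern dictionary of the site.
  10. The apparatus of claim 9, further comprising:
    an e-book pattern information library setting module, adapted to, for each e-book related site, collect in advance all e-books currently contained in the site; determine, according to the URL Pattern dictionary of the site, the Pattern of the URL of each chapter content page of an e-book, and thereby collect the Patterns involved in the e-book; take the common part between the Patterns of the URLs of the chapter content pages as the ID of the e-book at the site; for each Pattern involved in the e-book, determine the position of the ID within that Pattern and take it as the common part position of that Pattern; and build an inverted index according to the Patterns involved in the e-book, the common part positions of the Patterns, and the ID of the e-book at the site, and store it in the e-book pattern information library.
  11. The apparatus of claim 10, wherein
    the updated e-book identification module is further adapted to look up, in the preset e-book pattern information library, the common part position corresponding to the Pattern of the URL; extract the content located at the common part position in the URL as the ID of the URL; and identify the e-book corresponding to the ID in the e-book pattern information library as the e-book corresponding to the URL.
  12. The apparatus of any one of claims 8-11, further comprising:
    a new URL identification module, adapted to obtain the active index pages of the e-book related site according to a preset scheduling period; for each URL in the active index pages, query whether the URL belongs to an existing link library; and if not, identify the URL as a newly added URL;
    wherein the existing link library stores all URLs that the e-book related site contained before the current scheduling.
  13. The apparatus of claim 12, further comprising:
    a URL storage module, adapted to store all URLs currently contained in the e-book related site in the existing link library using the Bloom Filter algorithm.
  14. The apparatus of claim 13, wherein
    the new URL identification module is further adapted to obtain the active index pages of the e-book related site according to the preset scheduling period; for each URL in the active index pages, query through the Bloom Filter algorithm whether the URL belongs to the existing link library; and if not, identify the URL as a newly added URL.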
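  15. A computer program, comprising computer-readable code which, when run on a computing device, causes the computing device to perform the method for scheduling updated chapters of an e-book according to any one of claims 1-7.
  16. A computer-readable medium, in which the computer program of claim 15 is stored.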
PCT/CN2016/085545 2015-12-23 2016-06-13 Method and apparatus for scheduling updated chapters of an e-book WO2017107403A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510977669.0A CN105630942B (zh) 2015-12-23 2015-12-23 Method and apparatus for scheduling updated chapters of an e-book
CN201510977669.0 2015-12-23

Publications (1)

Publication Number Publication Date
WO2017107403A1 true WO2017107403A1 (zh) 2017-06-29

Family

ID=56045875

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/085545 WO2017107403A1 (zh) 2015-12-23 2016-06-13 Method and apparatus for scheduling updated chapters of an e-book

Country Status (2)

Country Link
CN (1) CN105630942B (zh)
WO (1) WO2017107403A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177522A (zh) * 2018-11-09 2020-05-19 百度在线网络技术(北京)有限公司 Page aggregation method and apparatus, computer device, and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105630942B (zh) * 2015-12-23 2019-05-21 北京奇虎科技有限公司 Method and apparatus for scheduling updated chapters of an e-book

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
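CN102622383A (zh) * 2011-03-14 2012-08-01 北京小米科技有限责任公司 Method for reading online chaptered files
CN103123640A (zh) * 2012-02-22 2013-05-29 深圳市谷古科技有限公司 Novel search method and apparatus
CN103164435A (zh) * 2011-12-13 2013-06-19 北大方正集团有限公司 Method and system for collecting network data
CN104361005A (zh) * 2014-10-11 2015-02-18 北京中搜网络技术股份有限公司 Method for scheduling information units in a vertical search engine
CN105630942A (zh) * 2015-12-23 2016-06-01 北京奇虎科技有限公司 Method and apparatus for scheduling updated chapters of an e-book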

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
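CN101853300B (zh) * 2010-05-26 2013-01-30 中国科学技术大学 Method and system for identifying and evaluating video download service websites
CN103049451B (zh) * 2011-10-14 2016-08-10 腾讯科技(深圳)有限公司 Method and apparatus for tracking network content updates
CN103295426B (zh) * 2012-02-22 2015-12-16 腾讯科技(深圳)有限公司 Method, device, and system for processing e-books
CN103823879B (zh) * 2014-02-28 2017-06-16 中国科学院计算技术研究所 Method and system for automatically updating a knowledge base for online encyclopedias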

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
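CN111177522A (zh) * 2018-11-09 2020-05-19 百度在线网络技术(北京)有限公司 Page aggregation method and apparatus, computer device, and storage medium
CN111177522B (zh) * 2018-11-09 2023-08-18 百度在线网络技术(北京)有限公司 Page aggregation method and apparatus, computer device, and storage medium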

Also Published As

Publication number Publication date
CN105630942B (zh) 2019-05-21
CN105630942A (zh) 2016-06-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16877216

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16877216

Country of ref document: EP

Kind code of ref document: A1