WO2019056797A1 - Network picture crawling method, program, and application server - Google Patents

Network picture crawling method, program, and application server

Info

Publication number
WO2019056797A1
Authority
WO
WIPO (PCT)
Prior art keywords
picture
url
image
picture information
folder
Prior art date
Application number
PCT/CN2018/089449
Other languages
English (en)
French (fr)
Inventor
蔡俊
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019056797A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Definitions

  • The present application relates to the field of communications technologies, and in particular to a network picture crawling method, a program, and an application server.
  • Web crawling refers to a process or thread in the collection subsystem of a web search system that crawls a page based on its Uniform Resource Locator (URL).
  • A web crawler, or web spider, finds web pages through their link addresses: it starts from one page of a website (usually the home page), reads the content of that page, finds the other link addresses in it, and then follows those addresses to the next pages, looping until all pages of the site have been crawled. If the entire Internet is treated as one website, a web spider can use the same principle to crawl all web pages on the Internet.
  • The present application provides a network picture crawling method, a program, and an application server.
  • While crawling webpage pictures, the corresponding target pictures can be crawled quickly, and the crawled pictures can also be automatically classified and stored according to a preset policy, which enables fast resource retrieval and organization.
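  • The present application provides an application server, which includes a memory, a processor, and a network picture crawler stored on the memory and operable on the processor. When executed by the processor, the network picture crawler implements the following steps:
  • acquiring the URL of a target webpage;
  • crawling a predetermined number of pictures on the target webpage;
  • acquiring the picture information;
  • creating a folder according to the picture information and selecting the pictures; and
  • storing pictures having the same picture information to the same folder.
  • The present application further provides a network picture crawling method, which is applied to an application server and includes: acquiring the URL of a target webpage; crawling a predetermined number of pictures on the target webpage; acquiring the picture information; creating a folder according to the picture information and selecting the pictures; and storing pictures having the same picture information to the same folder.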
  • The present application further provides a network picture crawling program, which includes:
  • a first obtaining module, configured to acquire the URL of a target webpage;
  • a picture crawling module, configured to crawl a predetermined number of pictures on the target webpage;
  • a second obtaining module, configured to acquire the picture information;
  • a creating module, configured to create a folder according to the picture information and select the pictures; and
  • a storage module, configured to store pictures having the same picture information into the same folder.
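As a rough illustration of how the five modules listed above might fit together, the following Python sketch wires them into one pipeline. It is not the applicant's implementation: the function names, the in-memory `FAKE_WEB` dictionary that stands in for real HTTP requests, and the choice of "URL prefix" as the picture information are all assumptions made for this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical in-memory "web": page URL -> picture URLs found on that page.
FAKE_WEB = {
    "http://example.com/gallery": [
        "http://example.com/img/a/1.jpg",
        "http://example.com/img/a/2.jpg",
        "http://example.com/img/b/1.jpg",
    ],
}


def get_target_url():
    """First obtaining module (201): return the URL of the target webpage."""
    return "http://example.com/gallery"


def crawl_pictures(page_url, limit):
    """Picture crawling module (202): return at most `limit` picture URLs."""
    return FAKE_WEB.get(page_url, [])[:limit]


def get_picture_info(picture_url):
    """Second obtaining module (203): use the URL prefix (everything before
    the last '/') as the picture information. This is an assumed choice."""
    return picture_url.rsplit("/", 1)[0]


def create_folders(infos):
    """Creating module (204): derive one folder name per distinct info."""
    return {info: urlsplit(info).path.strip("/").replace("/", "_") for info in infos}


def store_pictures(picture_urls, folders):
    """Storage module (205): group pictures by the folder of their info."""
    stored = defaultdict(list)
    for url in picture_urls:
        stored[folders[get_picture_info(url)]].append(url)
    return dict(stored)


def run_pipeline(limit=20):
    """Run the five steps in order and return folder name -> picture URLs."""
    page = get_target_url()
    pictures = crawl_pictures(page, limit)
    folders = create_folders({get_picture_info(u) for u in pictures})
    return store_pictures(pictures, folders)
```

Calling `run_pipeline()` on the sample data groups the two pictures under `/img/a` into one folder and the picture under `/img/b` into another; a real crawler would replace `FAKE_WEB` with network requests and write files to disk.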
  • The present application further provides a computer-readable storage medium storing a network picture crawler that is executable by at least one processor to cause the at least one processor to perform the following steps: acquiring the URL of a target webpage; crawling a predetermined number of pictures on the target webpage; acquiring the picture information; creating a folder according to the picture information and selecting the pictures; and storing pictures having the same picture information to the same folder.
  • Compared with the prior art, the application server, network picture crawling method, program, and computer-readable storage medium proposed by the present application first acquire the URL of the target webpage; second, crawl a predetermined number of pictures on the target webpage; then acquire the picture information; next, create a folder according to the picture information and select the pictures; and finally store pictures having the same picture information to the same folder.
  • This avoids the drawback of the prior art, in which crawled pictures cannot be effectively sorted and classified in real time.
  • While crawling webpage pictures, the corresponding target pictures can be crawled quickly, and the crawled pictures can also be automatically classified and stored according to a preset policy, which enables fast resource retrieval and organization.
  • FIG. 1 is a schematic diagram of an optional application environment of each embodiment of the present application.
  • FIG. 2 is a schematic diagram of an optional hardware architecture of the application server of FIG. 1;
  • FIG. 3 is a schematic diagram of functional modules of a first embodiment of a crawling program of a network picture of the present application
  • FIG. 4 is a schematic diagram of an implementation process of a first embodiment of a method for crawling a network picture according to the present application
  • FIG. 5 is a schematic flowchart of implementing a second embodiment of a method for crawling a network picture according to the present application
  • FIG. 6 is a schematic diagram of an implementation process of a third embodiment of a method for crawling a network picture according to the present application.
  • Reference numerals:
  • 1: mobile terminal
  • 2: application server
  • 3: network
  • 11: memory
  • 12: processor
  • 13: network interface
  • 200: network picture crawler
  • 201: first obtaining module
  • 202: picture crawling module
  • 203: second obtaining module
  • 204: creating module
  • 205: storage module
  • Descriptions such as "first" and "second" in the present application are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to.
  • Accordingly, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature.
  • The technical solutions of the various embodiments may be combined with each other, but only on the basis that a person of ordinary skill in the art can implement the combination. When a combination of technical solutions is contradictory or cannot be implemented, the combination should be considered not to exist and is not within the scope of protection claimed by this application.
  • Referring to FIG. 1, it is a schematic diagram of an optional application environment of each embodiment of the present application.
  • The present application is applicable to an application environment including, but not limited to, a mobile terminal 1, an application server 2, and a network 3.
  • The mobile terminal 1 may be a mobile device such as a mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a navigation device, or an in-vehicle device, or a fixed terminal such as a digital TV, a desktop computer, a notebook, or a server.
  • The application server 2 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server, and may be a stand-alone server or a server cluster composed of multiple servers.
  • The network 3 may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or a call network.
  • The application server 2 is communicatively connected through the network 3 to one or more mobile terminals 1 (only one is shown in the figure), and each mobile terminal 1 has installed and running on it an application client corresponding to the application server 2 (hereinafter referred to as the "mobile terminal client").
  • The mobile terminal client is configured to create, in response to an operation of the mobile terminal user, a long (persistent) connection between the mobile terminal client and the application server 2, so that the mobile terminal client can transmit data to and interact with the application server 2 over that connection.
  • In this embodiment, when the network picture crawler 200 is installed and run in the application server 2, it first acquires the URL of the target webpage; second, crawls a predetermined number of pictures on the target webpage; then acquires the picture information; next, creates a folder according to the picture information and selects the pictures; and finally stores pictures having the same picture information to the same folder.
  • This avoids the drawback of the prior art, in which crawled pictures cannot be effectively sorted and classified in real time.
  • While crawling webpage pictures, the corresponding target pictures can be crawled quickly, and the crawled pictures can also be automatically classified and stored according to a preset policy, which enables fast resource retrieval and organization.
  • Referring to FIG. 2, the application server 2 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicatively connected to each other through a system bus. Note that FIG. 2 only shows the application server 2 with components 11-13; not all of the illustrated components are required, and more or fewer components may be implemented instead.
  • The memory 11 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like.
  • In some embodiments, the memory 11 may be an internal storage unit of the application server 2, such as its hard disk or main memory.
  • In other embodiments, the memory 11 may be an external storage device of the application server 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the application server 2.
  • The memory 11 may also include both the internal storage unit of the application server 2 and its external storage device.
  • In this embodiment, the memory 11 is generally used to store the operating system installed on the application server 2 and various types of application software, such as the program code of the network picture crawler 200. The memory 11 can also be used to temporarily store various types of data that have been output or are to be output.
  • In some embodiments, the processor 12 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip.
  • The processor 12 is typically used to control the overall operation of the application server 2, for example to perform control and processing related to data interaction or communication with the mobile terminal 1.
  • In this embodiment, the processor 12 is configured to run the program code or process the data stored in the memory 11, for example to run the network picture crawler 200.
  • The network interface 13 may include a wireless network interface or a wired network interface and is typically used to establish a communication connection between the application server 2 and other electronic devices.
  • In this embodiment, the network interface 13 is mainly used to connect the application server 2 to one or more mobile terminals 1 through the network 3 and to establish a data transmission channel and a communication connection between the application server 2 and the one or more mobile terminals 1.
  • First, the present application proposes a network picture crawler 200.
  • In this embodiment, the network picture crawler 200 may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out the present application.
  • For example, in FIG. 3 the network picture crawler 200 is divided into a first obtaining module 201, a picture crawling module 202, a second obtaining module 203, a creating module 204, and a storage module 205.
  • A functional module in the present application refers to a series of computer program instruction segments capable of performing a specific function; modules are better suited than the program as a whole for describing how the network picture crawler 200 executes in the application server 2.
  • The functions of the functional modules 201-205 are described in detail below.
  • The first obtaining module 201 is configured to acquire the Uniform Resource Locator (URL) of the target webpage.
  • Specifically, the first obtaining module 201 acquires the URL of the target webpage through a web crawling application, which is written in Python.
  • Python is an object-oriented, interpreted programming language with a rich and powerful set of libraries. It is often nicknamed a glue language because it can easily connect modules written in other languages (especially C/C++).
  • A common scenario is to use Python to quickly prototype a program (sometimes even its final interface) and then rewrite the performance-critical parts in a more suitable language. For example, the graphics rendering module of a 3D game has particularly high performance requirements, so it can be rewritten in C/C++ and then wrapped as an extension library that Python can call.
  • The picture crawling module 202 is configured to crawl a predetermined number of pictures on the target webpage.
  • Specifically, the picture crawling module 202 uses a loop command to make the web crawling application repeatedly crawl until a predetermined number of pictures on the target webpage has been collected.
  • In this embodiment, the picture crawling module 202 obtains the URL of the target webpage through the getPage function and at the same time crawls a predetermined number of pictures on the target webpage, for example 20 pictures. The implementation statement is: def getPage(self, pageNum): for i in range(1, 21).
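The source gives only the signature of `getPage` and its loop header. A minimal sketch of how such a loop could collect exactly 20 pictures is shown below; the `PictureCrawler` class, the URL layout, and the injected `fetch` callable (used so the example runs without network access) are assumptions for illustration, not details from the application.

```python
class PictureCrawler:
    """Hypothetical crawler wrapper around a getPage loop."""

    def __init__(self, base_url, fetch):
        self.base_url = base_url
        # `fetch` is any callable that maps a URL to bytes (or None on failure).
        self.fetch = fetch

    def getPage(self, pageNum):
        pictures = []
        # range(1, 21) yields 1..20, so 20 picture URLs are requested.
        for i in range(1, 21):
            url = f"{self.base_url}/page/{pageNum}/pic_{i}.jpg"  # assumed layout
            data = self.fetch(url)
            if data is not None:
                pictures.append((url, data))
        return pictures


def fake_fetch(url):
    """Stand-in for an HTTP GET; returns dummy bytes for every URL."""
    return b"image-bytes:" + url.encode("utf-8")


crawler = PictureCrawler("http://example.com", fake_fetch)
pages = crawler.getPage(1)
```

Changing the upper bound of `range` is what changes the "predetermined number"; making it a parameter instead of the literal 21 would be a natural refinement.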
  • The second obtaining module 203 is configured to acquire the picture information.
  • The second obtaining module 203 acquires the picture information mainly as follows:
  • The second obtaining module 203 splices the URL of the picture and acquires the picture information according to that URL. Specifically, the second obtaining module 203 splices together the URL of the target webpage, the picture prefix, the webpage page number, and the number of links from the target webpage to the picture, thereby forming the URL of the picture.
  • In this embodiment, the picture information may be the URL prefix and depth of multiple pictures, where the depth is the number of links from the target webpage to the picture. For example, after finding a website, reaching a particular picture may require first clicking a link on the target webpage and then clicking another link on the linked page before the target picture is obtained. The number of link hops in this process is called the depth.
  • In other embodiments, the picture information may also be link text. For files such as multimedia and pictures, the content is generally determined from the anchor text of the link (that is, the link text) and the related file comments.
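The splicing step and the (URL prefix, depth) picture information described above can be sketched as follows. The application does not specify how the four parts are joined, so the `/`-separated layout and the `p`/`d` markers here are purely illustrative assumptions, as are the function and class names.

```python
from typing import NamedTuple


class PictureInfo(NamedTuple):
    url_prefix: str
    depth: int


def splice_picture_url(target_url, picture_prefix, page_num, depth):
    """Join target URL, picture prefix, page number, and link count.

    The layout <target>/<prefix>/p<page>/d<depth> is an assumed convention.
    """
    return f"{target_url.rstrip('/')}/{picture_prefix}/p{page_num}/d{depth}"


def picture_info_from_url(picture_url):
    """Recover (URL prefix, depth) from a URL built by splice_picture_url."""
    head, depth_part = picture_url.rsplit("/", 1)
    prefix, _page_part = head.rsplit("/", 1)
    return PictureInfo(url_prefix=prefix, depth=int(depth_part[1:]))


url = splice_picture_url("http://example.com/", "star_a", 3, 2)
info = picture_info_from_url(url)
```

Two pictures spliced with the same prefix and depth but different page numbers yield equal `PictureInfo` values, which is the property the later storage step relies on.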
  • The creating module 204 is configured to create a folder according to the picture information and select the pictures.
  • Specifically, the creating module 204 creates the folder, named according to the picture information, and determines the path of the folder; it then uses Beautiful Soup to parse the picture information and obtain the pictures and picture content.
  • Beautiful Soup is a parser for HyperText Markup Language (HTML) and Extensible Markup Language (XML) written in Python. It handles non-standard markup well and generates a parse tree, and it provides simple, commonly used operations for navigating, searching, and modifying the parse tree, which can save a great deal of programming time.
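To show what "parsing the page and obtaining the pictures" can look like, the sketch below extracts `<img>` tags from deliberately sloppy HTML. To keep the example dependency-free it uses `html.parser.HTMLParser` from the Python standard library as a stand-in for Beautiful Soup; with Beautiful Soup installed, a comparable result could be obtained with `BeautifulSoup(html, "html.parser").find_all("img")`. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect (src, alt) pairs from every <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr = dict(attrs)
            if attr.get("src"):
                self.images.append((attr["src"], attr.get("alt", "")))


def extract_images(html):
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.images


# The <p> tag is never closed; the parser still reports both images.
sample_html = (
    '<div><p>Gallery<img src="/img/a/1.jpg" alt="first">'
    '<img src="/img/a/2.jpg"></div>'
)
images = extract_images(sample_html)
```

The `src` values found this way are what a crawler would then download, and the `alt` text is one possible source of "picture content".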
  • The storage module 205 is configured to store pictures having the same picture information to the same folder.
  • In this embodiment, pictures having the same picture information are pictures having the same URL prefix and depth. Whether two pictures share a URL prefix and depth can be determined from the information in their spliced URLs. For example, pictures with the same URL prefix and depth are judged to show the same person; otherwise they are judged not to show the same person.
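A minimal sketch of the storage rule above, assuming the picture information is a (URL prefix, depth) pair and that the prefix has already been reduced to a filesystem-safe label. The folder naming scheme `<prefix>_depth<n>` and the function name are assumptions for this example.

```python
import os
import tempfile
from collections import defaultdict


def store_by_info(pictures, root_dir):
    """Write each picture into the folder that matches its picture info.

    pictures: iterable of (file_name, url_prefix, depth, data_bytes).
    Returns a mapping of folder name -> list of file names stored there.
    """
    folders = defaultdict(list)
    for file_name, url_prefix, depth, data in pictures:
        folder_name = f"{url_prefix}_depth{depth}"  # assumed naming scheme
        folder_path = os.path.join(root_dir, folder_name)
        os.makedirs(folder_path, exist_ok=True)  # create the folder once
        with open(os.path.join(folder_path, file_name), "wb") as fh:
            fh.write(data)
        folders[folder_name].append(file_name)
    return dict(folders)


sample_pictures = [
    ("1.jpg", "star_a", 1, b"x"),
    ("2.jpg", "star_a", 1, b"y"),
    ("1.jpg", "star_b", 2, b"z"),
]
with tempfile.TemporaryDirectory() as root:
    layout = store_by_info(sample_pictures, root)
    on_disk = sorted(os.listdir(root))
```

Because the folder name is derived only from the picture information, pictures with equal prefix and depth always land in the same folder regardless of crawl order.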
  • Through the above functional modules 201-205 of the application server 2, the first obtaining module 201 acquires the URL of the target webpage; the picture crawling module 202 crawls a predetermined number of pictures on the target webpage; the second obtaining module 203 acquires the picture information; the creating module 204 creates a folder according to the picture information and selects the pictures; and the storage module 205 stores pictures having the same picture information to the same folder.
  • This avoids the drawback of the prior art, in which crawled pictures cannot be effectively sorted and classified in real time.
  • While crawling webpage pictures, the corresponding target pictures can be crawled quickly, and the crawled pictures can also be automatically classified and stored according to a preset policy, which enables fast resource retrieval and organization.
  • In summary, the network picture crawler 200 proposed by the present application first acquires the URL of the target webpage; second, crawls a predetermined number of pictures on the target webpage; then acquires the picture information; next, creates a folder according to the picture information and selects the pictures; and finally stores pictures having the same picture information to the same folder.
  • The present application also proposes a network picture crawling method.
  • Referring to FIG. 4, it is a schematic flowchart of the first embodiment of the network picture crawling method of the present application.
  • Depending on requirements, the order of the steps in the flowchart shown in FIG. 4 may be changed, and some steps may be omitted.
  • Step S401: acquire the Uniform Resource Locator (URL) of the target webpage.
  • Specifically, the application server 2 acquires the URL of the target webpage through a web crawling application, which is written in Python.
  • Python is an object-oriented, interpreted programming language with a rich and powerful set of libraries. It is often nicknamed a glue language because it can easily connect modules written in other languages (especially C/C++).
  • A common scenario is to use Python to quickly prototype a program (sometimes even its final interface) and then rewrite the performance-critical parts in a more suitable language. For example, the graphics rendering module of a 3D game has particularly high performance requirements, so it can be rewritten in C/C++ and then wrapped as an extension library that Python can call.
  • Step S402: crawl a predetermined number of pictures on the target webpage.
  • The specific steps of crawling a predetermined number of pictures on the target webpage are detailed in the second embodiment (FIG. 5) of the network picture crawling method of the present application.
  • Specifically, the application server 2 uses a loop command to make the web crawling application repeatedly crawl until a predetermined number of pictures on the target webpage has been collected.
  • In this embodiment, the application server 2 obtains the URL of the target webpage through the getPage function and at the same time crawls a predetermined number of pictures on the target webpage, for example 20 pictures. The implementation statement is: def getPage(self, pageNum): for i in range(1, 21).
  • Step S403: acquire the picture information.
  • Step S404: create a folder according to the picture information and select the pictures. The specific steps are detailed in the third embodiment (FIG. 6) of the network picture crawling method of the present application.
  • Step S405: store pictures having the same picture information into the same folder.
  • In this embodiment, pictures having the same picture information are pictures having the same URL prefix and depth.
  • Through steps S401-S405, the application server 2 acquires the URL of the target webpage; crawls a predetermined number of pictures on the target webpage; acquires the picture information; creates a folder according to the picture information and selects the pictures; and stores pictures having the same picture information in the same folder.
  • This avoids the drawback of the prior art, in which crawled pictures cannot be effectively sorted and classified in real time.
  • While crawling webpage pictures, the corresponding target pictures can be crawled quickly, and the crawled pictures can also be automatically classified and stored according to a preset policy, which enables fast resource retrieval and organization.
  • In summary, the network picture crawling method proposed by the present application first acquires the URL of the target webpage; second, crawls a predetermined number of pictures on the target webpage; then acquires the picture information; next, creates a folder according to the picture information and selects the pictures; and finally stores pictures having the same picture information to the same folder.
  • Referring to FIG. 5, it is a schematic flowchart of the second embodiment of the network picture crawling method of the present application.
  • Depending on requirements, the order of the steps in the flowchart shown in FIG. 5 may be changed, and some steps may be omitted.
  • The step of acquiring the picture information specifically includes:
  • Step S501: splice the URL of the picture.
  • Step S502: acquire the picture information according to the URL of the picture.
  • The application server 2 splices the URL of the picture mainly by joining the URL of the target webpage, the picture prefix, the webpage page number, and the number of links from the target webpage to the picture.
  • In this embodiment, the picture information may be the URL prefix and depth of multiple pictures, where the depth is the number of links from the target webpage to the picture. For example, after finding a website, reaching a particular picture may require first clicking a link on the target webpage and then clicking another link on the linked page before the target picture is obtained. The number of link hops in this process is called the depth.
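The notion of depth as "number of link hops from the target webpage" can be made concrete with a breadth-first traversal. The sketch below runs on a small hypothetical link graph held in memory (`LINKS`) rather than on real pages; the names are assumptions for illustration.

```python
from collections import deque

# Hypothetical link graph: page -> pages or pictures it links to.
LINKS = {
    "home": ["list", "about"],
    "list": ["detail"],
    "detail": ["photo.jpg"],
    "about": [],
}


def link_depths(start, links):
    """Return the minimum number of link hops from `start` to each node."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, []):
            if nxt not in depths:  # first visit is the shortest path in BFS
                depths[nxt] = depths[page] + 1
                queue.append(nxt)
    return depths


depths = link_depths("home", LINKS)
```

Here `photo.jpg` is reached by following three links from `home`, so its depth is 3, while `about` has depth 1.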
  • In other embodiments, the picture information may also be link text. For files such as multimedia and pictures, the content is generally determined from the anchor text of the link (that is, the link text) and the related file comments.
  • By splicing the URL of the picture and acquiring the picture information according to that URL, the network picture crawling method proposed by the present application can quickly crawl the corresponding target pictures.
  • Referring to FIG. 6, it is a schematic flowchart of the third embodiment of the network picture crawling method of the present application.
  • Depending on requirements, the order of the steps in the flowchart shown in FIG. 6 may be changed, and some steps may be omitted.
  • The step of creating a folder according to the picture information and selecting the pictures includes:
  • Step S601: create the folder, named according to the picture information, and determine the path of the folder.
  • Step S602: parse the picture information using Beautiful Soup and obtain the pictures and picture content.
  • Beautiful Soup is a parser for HyperText Markup Language (HTML) and Extensible Markup Language (XML) written in Python. It handles non-standard markup well and generates a parse tree, and it provides simple, commonly used operations for navigating, searching, and modifying the parse tree, which can save a great deal of programming time.
  • By using Beautiful Soup to parse the picture information and obtain the pictures and picture content, the network picture crawling method proposed by the present application can save a great deal of programming time.
  • The methods of the foregoing embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation.
  • Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

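The present application discloses a network picture crawling method, the method including: acquiring the URL of a target webpage; crawling a predetermined number of pictures on the target webpage; acquiring the picture information; creating a folder according to the picture information and selecting the pictures; and storing pictures having the same picture information to the same folder. The present application further provides a network picture crawling program and an application server. With the application server and the network picture crawling method and program provided by the present application, while crawling webpage pictures, the corresponding target pictures can be crawled quickly, and the crawled pictures can also be automatically classified and stored according to a preset policy, which enables fast resource retrieval and organization.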

Description

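Network picture crawling method, program, and application server
Priority claim
The present application claims priority under the Paris Convention to Chinese patent application No. CN201710868857.9, filed on September 22, 2017 and entitled "网络图片的爬取方法及应用服务器" (network picture crawling method and application server), the entire content of which is incorporated herein by reference.
Technical field
The present application relates to the field of communications technologies, and in particular to a network picture crawling method, a program, and an application server.
Background
Web crawling refers to a process or thread in the collection subsystem of a web search system that crawls a page based on its Uniform Resource Locator (URL). For a search engine, web crawling means that a web spider finds web pages through their link addresses: it starts from one page of a website (usually the home page), reads the content of that page, finds the other link addresses in it, and then follows those addresses to the next pages, looping until all pages of the site have been crawled. If the entire Internet is treated as one website, a web spider can use the same principle to crawl all web pages on the Internet. However, in current web crawling, and particularly in picture crawling, although the target pictures can be crawled effectively, the crawled pictures cannot be effectively sorted and classified in real time. For subsequent applications that rely on web crawling, this limits the use of web crawling, hinders performance improvements in those applications, and degrades the user experience.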
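Summary
In view of this, the present application proposes a network picture crawling method, a program, and an application server. While crawling webpage pictures, the corresponding target pictures can be crawled quickly, and the crawled pictures can also be automatically classified and stored according to a preset policy, which enables fast resource retrieval and organization.
First, to achieve the above objective, the present application proposes an application server, which includes a memory, a processor, and a network picture crawling program stored on the memory and operable on the processor. When executed by the processor, the network picture crawling program implements the following steps:
acquiring the URL of a target webpage;
crawling a predetermined number of pictures on the target webpage;
acquiring the picture information;
creating a folder according to the picture information and selecting the pictures; and
storing pictures having the same picture information to the same folder.
In addition, to achieve the above objective, the present application further provides a network picture crawling method, which is applied to an application server and includes:
acquiring the URL of a target webpage;
crawling a predetermined number of pictures on the target webpage;
acquiring the picture information;
creating a folder according to the picture information and selecting the pictures; and
storing pictures having the same picture information to the same folder.
In addition, to achieve the above objective, the present application further provides a network picture crawling program, which includes:
a first obtaining module, configured to acquire the URL of a target webpage;
a picture crawling module, configured to crawl a predetermined number of pictures on the target webpage;
a second obtaining module, configured to acquire the picture information;
a creating module, configured to create a folder according to the picture information and select the pictures; and
a storage module, configured to store pictures having the same picture information to the same folder.
Further, to achieve the above objective, the present application also provides a computer-readable storage medium storing a network picture crawling program that is executable by at least one processor to cause the at least one processor to perform the following steps:
acquiring the URL of a target webpage;
crawling a predetermined number of pictures on the target webpage;
acquiring the picture information;
creating a folder according to the picture information and selecting the pictures; and
storing pictures having the same picture information to the same folder.
Compared with the prior art, the application server, network picture crawling method, program, and computer-readable storage medium proposed by the present application first acquire the URL of the target webpage; second, crawl a predetermined number of pictures on the target webpage; then acquire the picture information; next, create a folder according to the picture information and select the pictures; and finally store pictures having the same picture information to the same folder. This avoids the drawback of the prior art, in which crawled pictures cannot be effectively sorted and classified in real time. While crawling webpage pictures, the corresponding target pictures can be crawled quickly, and the crawled pictures can also be automatically classified and stored according to a preset policy, which enables fast resource retrieval and organization.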
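Brief description of the drawings
FIG. 1 is a schematic diagram of an optional application environment of each embodiment of the present application;
FIG. 2 is a schematic diagram of an optional hardware architecture of the application server of FIG. 1;
FIG. 3 is a schematic diagram of the functional modules of a first embodiment of the network picture crawling program of the present application;
FIG. 4 is a schematic flowchart of a first embodiment of the network picture crawling method of the present application;
FIG. 5 is a schematic flowchart of a second embodiment of the network picture crawling method of the present application;
FIG. 6 is a schematic flowchart of a third embodiment of the network picture crawling method of the present application.
Reference numerals:
Mobile terminal 1
Application server 2
Network 3
Memory 11
Processor 12
Network interface 13
Network picture crawling program 200
First obtaining module 201
Picture crawling module 202
Second obtaining module 203
Creating module 204
Storage module 205
The realization of the objectives, the functional features, and the advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.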
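Detailed description
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
It should be noted that descriptions such as "first" and "second" in the present application are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Accordingly, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, but only on the basis that a person of ordinary skill in the art can implement the combination. When a combination of technical solutions is contradictory or cannot be implemented, the combination should be considered not to exist and is not within the scope of protection claimed by the present application.
Referring to FIG. 1, it is a schematic diagram of an optional application environment of each embodiment of the present application.
In this embodiment, the present application is applicable to an application environment including, but not limited to, a mobile terminal 1, an application server 2, and a network 3. The mobile terminal 1 may be a mobile device such as a mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a navigation device, or an in-vehicle device, or a fixed terminal such as a digital TV, a desktop computer, a notebook, or a server. The application server 2 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server, and may be a stand-alone server or a server cluster composed of multiple servers. The network 3 may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or a call network.
The application server 2 is communicatively connected through the network 3 to one or more mobile terminals 1 (only one is shown in the figure), and each mobile terminal 1 has installed and running on it an application client corresponding to the application server 2 (hereinafter referred to as the "mobile terminal client"). The mobile terminal client is configured to create, in response to an operation of the mobile terminal user, a long (persistent) connection between the mobile terminal client and the application server 2, so that the mobile terminal client can transmit data to and interact with the application server 2 over that connection.
In this embodiment, when the network picture crawling program 200 is installed and run in the application server 2, it first acquires the URL of the target webpage; second, crawls a predetermined number of pictures on the target webpage; then acquires the picture information; next, creates a folder according to the picture information and selects the pictures; and finally stores pictures having the same picture information to the same folder. This avoids the drawback of the prior art, in which crawled pictures cannot be effectively sorted and classified in real time. While crawling webpage pictures, the corresponding target pictures can be crawled quickly, and the crawled pictures can also be automatically classified and stored according to a preset policy, which enables fast resource retrieval and organization.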
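Referring to FIG. 2, a schematic diagram of an optional hardware architecture of the application server 2 in FIG. 1 is shown. In this embodiment, the application server 2 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicatively connected to one another through a system bus. It should be noted that FIG. 2 shows only the application server 2 with components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
The memory 11 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 11 may be an internal storage unit of the application server 2, such as its hard disk or internal memory. In other embodiments, the memory 11 may be an external storage device of the application server 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the application server 2. The memory 11 may of course include both an internal storage unit of the application server 2 and an external storage device. In this embodiment, the memory 11 is generally used to store the operating system and the various application software installed on the application server 2, for example the program code of the network picture crawling program 200. In addition, the memory 11 may be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the application server 2, for example to perform control and processing related to data interaction or communication with the mobile terminal 1. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the network picture crawling program 200.
The network interface 13 may include a wireless network interface or a wired network interface and is generally used to establish a communication connection between the application server 2 and other electronic devices. In this embodiment, the network interface 13 is mainly used to connect the application server 2 to one or more mobile terminals 1 through the network 3 and to establish a data transmission channel and a communication connection between the application server 2 and the one or more mobile terminals 1.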
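The application environment of the embodiments of the present application and the hardware structure and functions of the related devices have now been described in detail. The embodiments of the present application are presented below on the basis of this application environment and these devices.
First, the present application provides a network picture crawling program 200.
Referring to FIG. 3, a functional block diagram of a first embodiment of the network picture crawling program 200 of the present application is shown. In this embodiment, the network picture crawling program 200 may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out the present application. For example, in FIG. 3 the network picture crawling program 200 may be divided into a first acquisition module 201, a picture crawling module 202, a second acquisition module 203, a creation module 204, and a storage module 205. A functional module, as the term is used in the present application, is a series of computer program instruction segments capable of performing a specific function, and is better suited than a program to describing the execution of the network picture crawling program 200 in the application server 2. The functions of the functional modules 201-205 are described in detail below.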
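The first acquisition module 201 is configured to obtain the Uniform Resource Locator (URL) of a target web page.
Specifically, the first acquisition module 201 obtains the URL of the target web page through a web crawling application, which is written in Python.
In this embodiment, Python is an object-oriented, interpreted programming language with a rich and powerful set of libraries. It is often nicknamed a glue language because it can easily join together modules written in other languages (especially C/C++). A common use case is to build a prototype of a program quickly in Python (sometimes even the program's final interface) and then rewrite the performance-critical parts in a more suitable language. For example, the graphics rendering module of a 3D game has particularly high performance requirements, so it can be rewritten in C/C++ and then wrapped as an extension library callable from Python.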
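The picture crawling module 202 is configured to crawl a predetermined number of pictures on the target web page.
Specifically, the picture crawling module 202 uses a loop statement to make the web crawling application crawl repeatedly until a predetermined number of pictures on the target web page has been crawled. In this embodiment, the picture crawling module 202 obtains the URL of the target web page through a getPage function and can at the same time crawl a predetermined number of pictures on the target web page, for example 20 pictures. The statements used are: def getPage(self, pageNum): for i in range(1, 21), where range(1, 21) yields the 20 values 1 through 20.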
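The bounded loop described above can be sketched as follows. Only the name getPage and the expression range(1, 21) come from the text; the Spider class, the injected fetch callable, the "?page=" query parameter, and the regular expression used to find pictures are all assumptions made so that the sketch runs without network access.

```python
import re
from typing import Callable, List

# Assumption: pictures appear as <img ... src="..."> tags in the page HTML.
IMG_SRC = re.compile(r'<img[^>]+src="([^"]+)"', re.IGNORECASE)


class Spider:
    """Hypothetical crawler; only getPage and range(1, 21) come from the text."""

    def __init__(self, base_url: str, fetch: Callable[[str], str]):
        self.base_url = base_url
        self.fetch = fetch  # injected downloader (assumption), returns page HTML

    def getPage(self, pageNum: int) -> List[str]:
        # Assumption: pages are addressed with a "?page=N" query string.
        html = self.fetch(f"{self.base_url}?page={pageNum}")
        found = IMG_SRC.findall(html)
        images = []
        # range(1, 21) yields 1..20, so at most 20 pictures are kept.
        for i in range(1, 21):
            if i > len(found):
                break  # the page holds fewer than 20 pictures
            images.append(found[i - 1])
        return images
```

Passing the downloader in as a parameter keeps the loop logic testable on its own; a real deployment would supply a function that performs the HTTP request.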
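The second acquisition module 203 is configured to obtain the picture information.
The second acquisition module 203 obtains the picture information mainly as follows:
The second acquisition module 203 splices the URL of the picture and obtains the picture information according to the URL of the picture. Specifically, the second acquisition module 203 is further configured to splice together the URL of the target web page, a picture prefix, a web page number, and the number of links from the target web page to the picture, thereby implementing the step of splicing the URL of the picture.
In this embodiment, the picture information may be the URL prefix and the depth of multiple pictures, where the depth is the number of links from the target web page to the picture. For example, after finding a website and wanting to open a particular picture, a user may first have to click a link on the target web page and then, on the linked site, click a further link before reaching the target picture; this sequence of link hops is called the depth. In other embodiments, the picture information may instead be link text. For files such as multimedia files and pictures, the content is generally judged from the anchor text of the link (that is, the link text) and from related file comments. For example, if a link has the text "张曼玉照片" ("photo of Maggie Cheung") and points to a picture in BMP format, a web spider knows that the content of the picture is "a photo of Maggie Cheung". A search engine can then find this picture in searches for either "张曼玉" (Maggie Cheung) or "照片" (photo).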
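As a rough illustration of the splicing step, the sketch below joins the four components named in the text into one picture URL. The text does not specify the order or the separators, so the slash-separated layout, the PictureRef class, and its field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PictureRef:
    """Hypothetical container for the four components named in the text."""

    page_url: str   # URL of the target web page
    prefix: str     # picture prefix
    page_num: int   # web page number
    depth: int      # number of links from the target web page to the picture

    def url(self) -> str:
        # Assumed layout: <page_url>/<prefix>/<page_num>/<depth>
        base = self.page_url.rstrip("/")
        return f"{base}/{self.prefix}/{self.page_num}/{self.depth}"
```

For instance, PictureRef("http://example.com/gallery/", "star", 3, 2).url() gives "http://example.com/gallery/star/3/2". Because the prefix and the depth are embedded in the spliced URL, they can later be read back from it.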
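The creation module 204 is configured to create a folder according to the picture information and to select the pictures.
Specifically, the creation module 204 names and creates the folder according to the picture information and determines the path of the folder; it then uses Beautiful Soup to parse the picture information and obtain the pictures and their content.
In this embodiment, Beautiful Soup is a parser for Hypertext Markup Language (HTML) and Extensible Markup Language (XML) written in Python. It handles non-standard markup well and generates a parse tree. It provides simple, commonly used operations for navigating, searching, and modifying the parse tree, which can save considerable programming time.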
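The parsing step can be illustrated with a small sketch. To keep it runnable without third-party packages, it uses the standard-library html.parser module as a stand-in for Beautiful Soup; with Beautiful Soup installed, roughly the same result could be obtained from BeautifulSoup(html, "html.parser").find_all("img"). The ImgCollector class and the extract_images function are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class ImgCollector(HTMLParser):
    """Collects (src, alt) pairs from <img> tags; stand-in for Beautiful Soup."""

    def __init__(self) -> None:
        super().__init__()
        self.images: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr = dict(attrs)
            if "src" in attr:
                # alt text is treated as the picture's descriptive content
                self.images.append((attr["src"], attr.get("alt", "")))


def extract_images(html: str) -> List[Tuple[str, str]]:
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.images
```

Like Beautiful Soup, this parser tolerates sloppy markup such as unquoted attribute values or unclosed tags, which is common on real web pages.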
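The storage module 205 is configured to store pictures having the same picture information in the same folder. In this embodiment, pictures having the same picture information are pictures having the same URL prefix and depth. Whether pictures have the same URL prefix and depth can be determined from the information in their spliced URLs. For example, pictures with the same URL prefix and depth are judged to show the same person; otherwise they are judged not to show the same person.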
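The grouping rule can be sketched as a dictionary keyed by the pair (URL prefix, depth). The Picture tuple layout and the group_by_info function are assumptions; the sketch only shows which pictures would end up in the same folder and does not touch the file system.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed record layout: (picture URL, URL prefix, depth)
Picture = Tuple[str, str, int]


def group_by_info(pictures: Iterable[Picture]) -> Dict[Tuple[str, int], List[str]]:
    """Group picture URLs so that equal (prefix, depth) pairs share one bucket."""
    groups: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for url, prefix, depth in pictures:
        groups[(prefix, depth)].append(url)
    return dict(groups)
```

Each key of the returned dictionary corresponds to one folder. Two pictures that share a prefix but differ in depth land in different buckets, matching the rule that both values must be equal.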
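As described above, in the application server 2, the first acquisition module 201 obtains the URL of the target web page; the picture crawling module 202 crawls a predetermined number of pictures on the target web page; the second acquisition module 203 obtains the picture information; the creation module 204 creates a folder according to the picture information and selects the pictures; and the storage module 205 stores pictures having the same picture information in the same folder. This avoids the drawback of the prior art, in which crawled pictures cannot be effectively sorted and classified in real time. During web picture crawling, the target pictures can be crawled quickly and the crawled pictures can also be automatically classified and stored according to a preset policy, which achieves fast resource retrieval together with classification and organization.
Through the functional modules 201-205, the network picture crawling program 200 proposed in the present application first obtains the URL of the target web page; second, crawls a predetermined number of pictures on the target web page; then obtains the picture information; next, creates a folder according to the picture information and selects the pictures; and finally stores pictures having the same picture information in the same folder. This avoids the drawback of the prior art, in which crawled pictures cannot be effectively sorted and classified in real time. During web picture crawling, the target pictures can be crawled quickly and the crawled pictures can also be automatically classified and stored according to a preset policy, which achieves fast resource retrieval together with classification and organization.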
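In addition, the present application further provides a network picture crawling method.
Referring to FIG. 4, a schematic flowchart of a first embodiment of the network picture crawling method of the present application is shown. In this embodiment, the order in which the steps in the flowchart of FIG. 4 are performed may be changed, and some steps may be omitted, according to different requirements.
Step S401: obtain the Uniform Resource Locator (URL) of a target web page.
Specifically, the application server 2 obtains the URL of the target web page through a web crawling application, which is written in Python.
In this embodiment, Python is an object-oriented, interpreted programming language with a rich and powerful set of libraries. It is often nicknamed a glue language because it can easily join together modules written in other languages (especially C/C++). A common use case is to build a prototype of a program quickly in Python (sometimes even the program's final interface) and then rewrite the performance-critical parts in a more suitable language. For example, the graphics rendering module of a 3D game has particularly high performance requirements, so it can be rewritten in C/C++ and then wrapped as an extension library callable from Python.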
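Step S402: crawl a predetermined number of pictures on the target web page.
Specifically, the application server 2 uses a loop statement to make the web crawling application crawl repeatedly until a predetermined number of pictures on the target web page has been crawled. In this embodiment, the application server 2 obtains the URL of the target web page through a getPage function and can at the same time crawl a predetermined number of pictures on the target web page, for example 20 pictures. The statements used are: def getPage(self, pageNum): for i in range(1, 21), where range(1, 21) yields the 20 values 1 through 20.
Step S403: obtain the picture information. The specific steps of obtaining the picture information are detailed in the second embodiment of the network picture crawling method of the present application (FIG. 5).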
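Step S404: create a folder according to the picture information and select the pictures. The specific steps of creating the folder according to the picture information and selecting the pictures are detailed in the third embodiment of the network picture crawling method of the present application (FIG. 6).
Step S405: store pictures having the same picture information in the same folder. Specifically, pictures having the same picture information are pictures having the same URL prefix and depth. Whether pictures have the same URL prefix and depth can be determined from the information in their spliced URLs. For example, pictures with the same URL prefix and depth are judged to show the same person; otherwise they are judged not to show the same person.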
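As described above, the application server 2 obtains the URL of the target web page; crawls a predetermined number of pictures on the target web page; obtains the picture information; creates a folder according to the picture information and selects the pictures; and stores pictures having the same picture information in the same folder. This avoids the drawback of the prior art, in which crawled pictures cannot be effectively sorted and classified in real time. During web picture crawling, the target pictures can be crawled quickly and the crawled pictures can also be automatically classified and stored according to a preset policy, which achieves fast resource retrieval together with classification and organization.
Through steps S401-S405, the network picture crawling method proposed in the present application first obtains the URL of the target web page; second, crawls a predetermined number of pictures on the target web page; then obtains the picture information; next, creates a folder according to the picture information and selects the pictures; and finally stores pictures having the same picture information in the same folder. This avoids the drawback of the prior art, in which crawled pictures cannot be effectively sorted and classified in real time. During web picture crawling, the target pictures can be crawled quickly and the crawled pictures can also be automatically classified and stored according to a preset policy, which achieves fast resource retrieval together with classification and organization.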
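The five steps S401-S405 can be wired together roughly as shown below. This is a minimal sketch under stated assumptions, not the applicant's implementation: the crawl_and_store function, its three injected callables (list_pictures, picture_info, download), the folder naming scheme, and the ".jpg" file extension are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def crawl_and_store(
    page_url: str,                                    # S401: URL of the target page
    list_pictures: Callable[[str], List[str]],        # assumed: page URL -> picture URLs
    picture_info: Callable[[str], Tuple[str, int]],   # assumed: picture URL -> (prefix, depth)
    download: Callable[[str], bytes],                 # assumed: picture URL -> raw bytes
    root: Path,
    limit: int = 20,
) -> Dict[str, List[str]]:
    """Return a mapping of folder name -> stored file names."""
    stored: Dict[str, List[str]] = {}
    # S402: keep at most `limit` pictures from the target page
    for index, pic_url in enumerate(list_pictures(page_url)[:limit], start=1):
        prefix, depth = picture_info(pic_url)          # S403: obtain picture information
        folder = root / f"{prefix}_depth{depth}"       # S404: folder named after the info
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / f"{index}.jpg"               # assumed file naming
        target.write_bytes(download(pic_url))          # S405: same info -> same folder
        stored.setdefault(folder.name, []).append(target.name)
    return stored
```

Because the fetching, parsing, and downloading functions are passed in, the classification logic can be exercised against a temporary directory (for example one created with tempfile.TemporaryDirectory) before being pointed at a real site.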
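Referring to FIG. 5, a schematic flowchart of a second embodiment of the network picture crawling method of the present application is shown. In this embodiment, the order in which the steps in the flowchart of FIG. 5 are performed may be changed, and some steps may be omitted, according to different requirements.
In this embodiment, the step of obtaining the picture information specifically includes:
Step S501: splice the URL of the picture.
Step S502: obtain the picture information according to the URL of the picture.
In this embodiment, the application server 2 splices the URL of the picture mainly as follows: the application server 2 splices together the URL of the target web page, a picture prefix, a web page number, and the number of links from the target web page to the picture.
In this embodiment, the picture information may be the URL prefix and the depth of multiple pictures, where the depth is the number of links from the target web page to the picture. For example, after finding a website and wanting to open a particular picture, a user may first have to click a link on the target web page and then, on the linked site, click a further link before reaching the target picture; this sequence of link hops is called the depth. In other embodiments, the picture information may instead be link text. For files such as multimedia files and pictures, the content is generally judged from the anchor text of the link (that is, the link text) and from related file comments. For example, if a link has the text "张曼玉照片" ("photo of Maggie Cheung") and points to a picture in BMP format, a web spider knows that the content of the picture is "a photo of Maggie Cheung". A search engine can then find this picture in searches for either "张曼玉" (Maggie Cheung) or "照片" (photo).
Through steps S501-S502, the network picture crawling method proposed in the present application can splice the URL of the picture and obtain the picture information according to that URL, so that the target pictures can be crawled quickly.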
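Step S502 can be pictured as reading the prefix and depth back out of a spliced URL. The sketch below assumes a hypothetical layout in which the last two path segments are the page number and the depth, and everything before them is the URL prefix; the text does not fix this layout, and the picture_info function name is likewise an assumption.

```python
from typing import Dict, Union
from urllib.parse import urlsplit


def picture_info(picture_url: str) -> Dict[str, Union[str, int]]:
    """Recover (prefix, page, depth) from a URL of the assumed form
    <scheme>://<host>/.../<prefix>/<page>/<depth>."""
    parts = urlsplit(picture_url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 3:
        raise ValueError("expected a path ending in <prefix>/<page>/<depth>")
    *head, page, depth = segments
    url_prefix = f"{parts.scheme}://{parts.netloc}/" + "/".join(head)
    return {"prefix": url_prefix, "page": int(page), "depth": int(depth)}
```

Two pictures taken from different pages of the same gallery then yield the same "prefix" and "depth" values, which is the condition the method uses to place them in one folder.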
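Referring to FIG. 6, a schematic flowchart of a third embodiment of the network picture crawling method of the present application is shown. In this embodiment, the order in which the steps in the flowchart of FIG. 6 are performed may be changed, and some steps may be omitted, according to different requirements.
In this embodiment, the step of creating a folder according to the picture information and selecting the pictures specifically includes:
Step S601: name and create the folder according to the picture information, and determine the path of the folder.
Step S602: use Beautiful Soup to parse the picture information and obtain the pictures and their content.
In this embodiment, Beautiful Soup is a parser for Hypertext Markup Language (HTML) and Extensible Markup Language (XML) written in Python. It handles non-standard markup well and generates a parse tree. It provides simple, commonly used operations for navigating, searching, and modifying the parse tree, which can save considerable programming time.
Through steps S601-S602, the network picture crawling method proposed in the present application can use Beautiful Soup to parse the picture information and obtain the pictures and their content, which can save considerable programming time.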
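Step S601 (naming the folder after the picture information and fixing its path) might look like the following sketch. The folder_for function, the character-replacement rule, and the "_d<depth>" suffix are assumptions made for illustration; the text only states that the folder is named according to the picture information.

```python
import re
import tempfile
from pathlib import Path


def folder_for(root: Path, url_prefix: str, depth: int) -> Path:
    """Create (if needed) and return the folder for one (prefix, depth) pair."""
    # Replace characters that are unsafe in directory names with underscores.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", url_prefix).strip("_")
    folder = root / f"{safe}_d{depth}"
    # exist_ok=True makes repeated calls for the same picture information harmless.
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

Deriving the name deterministically from the picture information means that every picture with the same URL prefix and depth resolves to the same path, so no separate lookup table is required.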
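The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
From the description of the above embodiments, a person skilled in the art will clearly understand that the methods of the above embodiments may be implemented by software together with a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the essence of the technical solution of the present application, or the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.
The above are merely preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structural or procedural transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.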

Claims (20)

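  1. A network picture crawling method, applied to an application server, characterized in that the method comprises:
    obtaining a URL of a target web page;
    crawling a predetermined number of pictures on the target web page;
    obtaining the picture information;
    creating a folder according to the picture information and selecting the pictures; and
    storing pictures having the same picture information in the same folder.
  2. The network picture crawling method according to claim 1, characterized in that the step of obtaining the picture information specifically comprises:
    splicing a URL of the picture; and
    obtaining the picture information according to the URL of the picture.
  3. The network picture crawling method according to claim 2, characterized in that the step of splicing the URL of the picture specifically comprises:
    splicing the URL of the target web page, a picture prefix, a web page number, and the number of links from the target web page to the picture.
  4. The network picture crawling method according to claim 1, characterized in that the step of creating a folder according to the picture information and selecting the pictures specifically comprises:
    naming and creating the folder according to the picture information, and determining a path of the folder; and
    using Beautiful Soup to parse the picture information and obtain the pictures and picture content.
  5. The network picture crawling method according to claim 1, characterized in that the pictures having the same picture information are pictures having the same URL prefix and depth.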
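  6. An application server, characterized in that the application server comprises a memory, a processor, and a network picture crawling program stored in the memory and executable on the processor, the network picture crawling program, when executed by the processor, implementing the following steps:
    obtaining a URL of a target web page;
    crawling a predetermined number of pictures on the target web page;
    obtaining the picture information;
    creating a folder according to the picture information and selecting the pictures; and
    storing pictures having the same picture information in the same folder.
  7. The application server according to claim 6, characterized in that the step of obtaining the picture information specifically comprises:
    splicing a URL of the picture; and
    obtaining the picture information according to the URL of the picture.
  8. The application server according to claim 7, characterized in that the step of splicing the URL of the picture specifically comprises:
    splicing the URL of the target web page, a picture prefix, a web page number, and the number of links from the target web page to the picture.
  9. The application server according to claim 6, characterized in that the step of creating a folder according to the picture information and selecting the pictures specifically comprises:
    naming and creating the folder according to the picture information, and determining a path of the folder; and
    using Beautiful Soup to parse the picture information and obtain the pictures and picture content.
  10. The application server according to claim 6, characterized in that the pictures having the same picture information are pictures having the same URL prefix and depth.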
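  11. A network picture crawling program, characterized in that the network picture crawling program comprises:
    a first acquisition module, configured to obtain a URL of a target web page;
    a picture crawling module, configured to crawl a predetermined number of pictures on the target web page;
    a second acquisition module, configured to obtain the picture information;
    a creation module, configured to create a folder according to the picture information and select the pictures; and
    a storage module, configured to store pictures having the same picture information in the same folder.
  12. The network picture crawling program according to claim 11, characterized in that the second acquisition module is specifically configured to:
    splice a URL of the picture; and
    obtain the picture information according to the URL of the picture.
  13. The network picture crawling program according to claim 12, characterized in that the second acquisition module is further configured to:
    splice the URL of the target web page, a picture prefix, a web page number, and the number of links from the target web page to the picture.
  14. The network picture crawling program according to claim 11, characterized in that the creation module is specifically configured to:
    name and create the folder according to the picture information, and determine a path of the folder; and
    use Beautiful Soup to parse the picture information and obtain the pictures and picture content.
  15. The network picture crawling program according to claim 11, characterized in that the pictures having the same picture information are pictures having the same URL prefix and depth.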
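  16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a network picture crawling program, the network picture crawling program being executable by at least one processor to cause the at least one processor to perform the following steps:
    obtaining a URL of a target web page;
    crawling a predetermined number of pictures on the target web page;
    obtaining the picture information;
    creating a folder according to the picture information and selecting the pictures; and
    storing pictures having the same picture information in the same folder.
  17. The computer-readable storage medium according to claim 16, characterized in that the step of obtaining the picture information specifically comprises:
    splicing a URL of the picture; and
    obtaining the picture information according to the URL of the picture.
  18. The computer-readable storage medium according to claim 17, characterized in that the step of splicing the URL of the picture specifically comprises:
    splicing the URL of the target web page, a picture prefix, a web page number, and the number of links from the target web page to the picture.
  19. The computer-readable storage medium according to claim 16, characterized in that the step of creating a folder according to the picture information and selecting the pictures specifically comprises:
    naming and creating the folder according to the picture information, and determining a path of the folder; and
    using Beautiful Soup to parse the picture information and obtain the pictures and picture content.
  20. The computer-readable storage medium according to claim 16, characterized in that the pictures having the same picture information are pictures having the same URL prefix and depth.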
PCT/CN2018/089449 2017-09-22 2018-06-01 Network picture crawling method, program, and application server WO2019056797A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710868857.9 2017-09-22
CN201710868857.9A CN107870975A (zh) 2017-09-22 2017-09-22 Network picture crawling method and application server

Publications (1)

Publication Number Publication Date
WO2019056797A1 true WO2019056797A1 (zh) 2019-03-28

Family

ID=61752715

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/089449 WO2019056797A1 (zh) 2017-09-22 2018-06-01 Network picture crawling method, program, and application server

Country Status (2)

Country Link
CN (1) CN107870975A (zh)
WO (1) WO2019056797A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
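CN107870975A (zh) * 2017-09-22 2018-04-03 平安科技(深圳)有限公司 Network picture crawling method and application server
CN109086402A (zh) * 2018-07-31 2018-12-25 武汉斗鱼网络科技有限公司 Method for obtaining bullet-screen avatar URLs in Android
CN109766403A (zh) * 2019-01-18 2019-05-17 郑州轻工业学院 Method and device for acquiring Internet location picture data
CN110647826B (zh) * 2019-09-05 2022-04-29 北京百度网讯科技有限公司 Method and device for acquiring commodity training pictures, computer equipment, and storage medium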

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
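CN102609412A (zh) * 2011-01-07 2012-07-25 华东师范大学 Control method and system for RSS-based multi-threaded synchronous crawling of image and text information
CN105528422A (zh) * 2015-12-07 2016-04-27 中国建设银行股份有限公司 Topic crawler processing method and device
CN105893583A (zh) * 2016-04-01 2016-08-24 北京鼎泰智源科技有限公司 Data acquisition method and system based on artificial intelligence
CN107870975A (zh) * 2017-09-22 2018-04-03 平安科技(深圳)有限公司 Network picture crawling method and application server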

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8290270B2 (en) * 2006-10-13 2012-10-16 Syscom, Inc. Method and system for converting image text documents in bit-mapped formats to searchable text and for searching the searchable text
CN106503253A (zh) * 2016-11-11 2017-03-15 张军 Framework for web crawler URL extraction, indexing, and mapping for picture formats

Also Published As

Publication number Publication date
CN107870975A (zh) 2018-04-03

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18858891

Country of ref document: EP

Kind code of ref document: A1