WO2020211367A1 - Data crawling method, apparatus, computer device and storage medium - Google Patents

Data crawling method, apparatus, computer device and storage medium

Info

Publication number
WO2020211367A1
WO2020211367A1 (PCT/CN2019/118419)
Authority
WO
WIPO (PCT)
Prior art keywords
crawling
data
crawler
code
configure
Prior art date
Application number
PCT/CN2019/118419
Other languages
English (en)
French (fr)
Inventor
张师琲
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020211367A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation

Definitions

  • This application relates to the field of crawling technology, and in particular to a data crawling method, device, computer equipment and storage medium.
  • the embodiments of the present application provide a data crawling method, device, computer equipment, and storage medium, which can meet different requirements for data crawling.
  • an embodiment of the present application provides a data crawling method, including:
  • the database includes multiple code blocks, and the pre-building process of the database includes:
  • Data crawling is performed on multiple preset websites respectively, and the computer code corresponding to each crawling step in the data crawling process is taken as a code block.
  • an embodiment of the present application further provides a data crawling device, including:
  • the sequence determination module is used to select the required code blocks from the pre-built database according to the data crawling requirements, and to sort the selected code blocks according to their execution order to obtain the corresponding code block sequence;
  • the crawler configuration module is used to configure the required crawler according to the code block sequence;
  • the data crawling module is used to crawl data using the configured required crawler to obtain crawled data;
  • the database construction module is used to construct the database in advance, and the database includes a plurality of code blocks.
  • the database construction module is specifically used to crawl data from multiple preset websites separately, and to treat the computer code corresponding to each crawling step in the data crawling process as a code block.
  • an embodiment of the present application further provides a computer device, including a memory and a processor, where the memory stores computer-readable instructions that, when executed by the processor, cause the processor to execute the steps of the data crawling method described above.
  • the embodiments of the present application also provide a non-volatile readable storage medium storing computer readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the data crawling method described above.
  • The required code blocks are selected from the database according to the data crawling requirements; the selected code blocks are then sorted in the order in which the steps are executed to obtain a code block sequence; the required crawler is configured according to the code block sequence; and finally the configured crawler is used to crawl data. Because the embodiment of the present application can select the required code blocks according to the data crawling requirements and then sort them, this is equivalent to selecting multiple crawling steps according to the data crawling requirements and then combining and ordering those steps, so the configured crawler can meet various user needs, for example, whether to download an entire webpage or crawl precisely, and whether to crawl javascript or non-javascript webpages. Moreover, the data crawling method provided in the embodiments of the application is simple and easy to configure, and can crawl different websites and different forms of data.
  • Figure 1 is a block diagram of the internal structure of a computer device in an embodiment
  • Figure 2 is a flowchart of a data crawling method in an embodiment
  • Fig. 3 is a structural block diagram of a data crawling device in an embodiment.
  • Figure 1 is a schematic structural diagram of a computer device in an embodiment of the application.
  • the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus.
  • the non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions.
  • the database may store control information sequences.
  • When the computer-readable instructions are executed by the processor, the processor can implement a data crawling method.
  • the processor of the computer device provides computation and control capabilities and supports the operation of the entire computer device.
  • a computer readable instruction may be stored in the memory of the computer device, and when the computer readable instruction is executed by the processor, the processor may execute a data crawling method.
  • the network interface of the computer device is used to connect and communicate with the terminal.
  • FIG. 1 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the embodiment of the present application provides a data crawling method, which can be executed by the computer device in FIG. 1. As shown in Figure 2, the method includes the following steps:
  • the database includes a plurality of code blocks
  • The pre-construction process of the database includes: crawling data from multiple preset websites respectively, and treating the computer code corresponding to each crawling step in the data crawling process as a code block.
  • the foregoing computer code is the code corresponding to the crawling step, and may be referred to as crawling code for short.
  • The above-mentioned preset websites are, for example, a shopping website, a dating website, a news website, a database website, and the like.
  • the code blocks in the database are more comprehensive and can be configured into various crawlers.
  • the code corresponding to each crawling step is regarded as a code block, and a code block can also be called a component, that is, a step corresponds to a code block or a component.
  • the so-called steps include, for example, the login steps when crawling a webpage, the steps to enter the list, the steps to turn pages, the steps to scroll down and so on. It can be seen that storing the computer code corresponding to each step as a code block in the database is equivalent to storing each step as a separate component.
  • the above process of crawling data for multiple preset websites may include: writing corresponding computer codes for the multiple preset websites, and using the computer code corresponding to each website to Web site for data crawling.
  • Computer code is first written for each preset website, yielding a crawler suited to crawling that website; the computer code corresponding to each preset website (that is, the crawler for that website) is then used to perform data crawling, and the code corresponding to each step in the crawling process is saved as a code block (also called a component) in the database.
  • This way of writing computer code for each preset website can get a crawler that is very suitable for the website, so that each step in the data crawling process can complete the crawling work very effectively.
  • The foregoing process of separately writing corresponding computer code for the multiple preset websites may include: using a fine-grained decomposition method to separately write the corresponding computer code for data crawling on the multiple preset websites.
  • the specific process may include: when writing corresponding computer code for data crawling for each preset website, writing computer code for different crawling objects; wherein, the crawling objects include graphics, audio, At least one of video and text information.
  • For example, when writing computer code for a news website, separate computer code is written with the pictures in the website as the crawling object, with the audio as the crawling object, with the video as the crawling object, and with the text information as the crawling object.
  • Subdividing the crawling objects for each website makes the code blocks in the database more comprehensive, so that various data crawling requirements can be met.
  • The multiple steps corresponding to the multiple code blocks in the database constructed through the above process may include: (1) log in and record the cookie; (2) enter the list page and crawl the network addresses (URLs); (3) enter the article page and crawl the article content; (4) click Next to turn to the next page and continue execution; (5) enter the article page and crawl the article content; (6) scroll down until the next page of content appears; (7) enter content in the search box and search.
  • The aforementioned data crawling requirements can be diverse, for example, which website is to be crawled and what kind of content on that website (pictures, audio, video, text, etc.) is to be crawled. Different data crawling requirements require different code blocks.
  • The embodiment of the present application selects the required code blocks from the database according to the data crawling requirements. Since different code blocks correspond to different steps, the execution order of the code blocks corresponds to the execution order of the steps; therefore the code blocks need to be sorted, which is equivalent to sorting the steps in their order of execution.
  • For example, when crawling Weibo, the crawling steps include: log in; search hot words; crawl the Weibo ID, Weibo content, release time, etc.; and turn pages. According to the above list, the order of the steps is roughly (1)-(7)-(3)-(4), so the code blocks corresponding to steps (1), (3), (4) and (7) are selected from the database, and the four code blocks are then sorted in the execution order (1)-(7)-(3)-(4) to obtain the corresponding code block sequence.
  • In another example, the crawling steps include: enter the list page and crawl the URLs; enter the article page; and scroll down to turn the page. As can be seen from the above list, the order of the steps is roughly (2)-(3)-(6), so the code blocks corresponding to steps (2), (3) and (6) are selected from the database, and these three code blocks are then sorted in the execution order (2)-(3)-(6) to obtain the corresponding code block sequence.
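The selection and ordering described in these examples can be sketched as follows; the registry of code blocks and the way each block is modelled as a function are illustrative assumptions, not the patent's actual implementation:

```python
# Hypothetical registry mapping step numbers to code blocks (components).
# Each "code block" is modelled as a function; the step numbers follow
# the list (1)-(7) above.
CODE_BLOCKS = {
    1: lambda ctx: ctx.append("login, record cookie"),
    2: lambda ctx: ctx.append("enter list page, crawl URLs"),
    3: lambda ctx: ctx.append("enter article page, crawl content"),
    4: lambda ctx: ctx.append("click Next, turn page"),
    6: lambda ctx: ctx.append("scroll until next page appears"),
    7: lambda ctx: ctx.append("enter keyword in search box"),
}

def build_sequence(step_order):
    """Select the required code blocks from the registry and sort them
    in the given execution order, yielding the code block sequence."""
    return [CODE_BLOCKS[step] for step in step_order]

def run(sequence):
    """Execute the configured sequence; ctx stands in for crawler state."""
    ctx = []
    for block in sequence:
        block(ctx)
    return ctx

# The Weibo example above: execution order (1)-(7)-(3)-(4).
trace = run(build_sequence([1, 7, 3, 4]))
```

Configuring a crawler for the second example would simply pass `[2, 3, 6]` instead; the registry itself never changes, which is the point of storing each step as a reusable component.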
  • The process of configuring the required crawler is actually a process of generating a configuration file, and the configuration of the required crawler is completed once the configuration file is obtained. Therefore, the specific process of the foregoing step S22 may include: determining the configuration file of the required crawler according to the code block sequence and a preset description document. The description document stores descriptive information that assists the user in generating the configuration file, for example, the process steps for generating the configuration file and what information is needed in each step.
  • the code in the configuration file can be in the form of XML, which can improve the versatility of the crawler required.
  • For example, if the code block sequence is the one corresponding to steps (1)-(7)-(3)-(4), the configuration file can be generated according to this code block sequence.
  • Data crawling requirements include not only which website to crawl and what content to crawl, but also whether to crawl in full or incrementally, whether to crawl javascript or non-javascript web content, from which level of webpage to start fetching content, whether the page turning mode is pull-down sliding, what attributes the fields to be fetched have, and so on. Therefore, these items need to be configured.
  • The process of configuring the crawler according to the code block sequence may include: configuring at least one of the seed, the address of the seed, the area where the seed is located, whether it is a full crawl, the keywords required for crawling, the page turning mode, the attributes of the fields to be fetched, the webpage level at which to start fetching, and whether to fetch javascript web content.
  • the specific process can include the following steps:
  • seed is the seed; as the name implies, crawling is introduced and routed through the seed to capture content;
  • url is the address of the seed.
  • For example, the url may be configured as http://www.chinanews.com/business/gd.shtml;
  • javascript indicates whether the page is a javascript webpage (a javascript value of 1 means yes, 0 means no);
  • keyword is the crawling keyword, and may be left unset in the code;
  • seedArea is the area where the seed is located; if it is not filled in, all URL addresses of the entire webpage are taken. In the snippet referenced above, the area where the seed is located is <![CDATA[#content_right>div.content_list]]>;
  • start indicates from which webpage level to start crawling content, for example, starting from the second-level webpage;
  • turning is the page turning mode; when turning is configured as slider, the page turning mode is pull-down sliding;
  • meta holds the attributes of the fields to be captured; for example, field is the field, site is the address, tag is the label, index is the index, and pic is the picture.
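The field descriptions above refer to a configuration snippet that is not reproduced in this text. A minimal sketch of what such an XML configuration file might look like, with element names and values taken from the descriptions above (the overall nesting and attribute layout are assumptions):

```xml
<!-- Hypothetical crawler configuration; element names follow the field
     descriptions above (seed, url, javascript, keyword, seedArea,
     start, turning, meta). -->
<seed>
  <url>http://www.chinanews.com/business/gd.shtml</url>
  <javascript>0</javascript>  <!-- 0: non-javascript page -->
  <keyword></keyword>         <!-- may be left unset -->
  <seedArea><![CDATA[#content_right>div.content_list]]></seedArea>
  <start>2</start>            <!-- start crawling from the 2nd-level page -->
  <turning>slider</turning>   <!-- pull-down sliding page turning -->
  <meta>
    <field site="http://www.chinanews.com" tag="div" index="0" pic="0"/>
  </meta>
</seed>
```

Expressing the configuration in XML, as the text notes, keeps the generated crawler configuration portable across tools.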
  • A javascript webpage or a non-javascript webpage can be selected, which means that both javascript webpage crawling and non-javascript webpage crawling can be realized.
  • When a javascript webpage is selected, the javascript code can be accurately interpreted and then converted into normally tagged html code. It is understandable that javascript webpages are dynamically generated pages, while non-javascript webpages are statically generated pages.
  • Because the crawler is configured according to the sorted code block sequence, the configured crawler can achieve both complete page download and accurate capture, for example, capturing only images.
  • cluster distributed crawling can also be implemented to improve crawling speed.
  • The configuration file can also be uploaded to a server for storage, so that for the same data crawling requirement it can later be obtained directly, that is, the configuration file is fetched from the server and data crawling is performed according to that configuration file, which is more convenient.
  • The so-called anti-crawling mechanism means that when a proxy IP address visits a website frequently, the website restricts access from that proxy IP address. This problem can be mitigated in either of two ways:
  • the crawler sends a login request to the server of the website to be logged in.
  • The login request carries the proxy address (i.e., the proxy IP address) of the server used to log in to the website, and the proxy address is modified periodically, which avoids being restricted for frequently visiting the website with the same proxy address.
  • For example, the crawler modifies the proxy address every half hour and stores the modified address; when it needs to access the website, it retrieves the modified proxy address.
  • the crawler sends a login request to the server of the website to be logged in.
  • The login request carries the proxy address (i.e., the proxy IP address) of the server used to log in to the website, and the crawler modifies the proxy address when it encounters restricted access or an access error.
  • When the server finds that a proxy address visits its website frequently, it intercepts the request and returns restricted-access or error information to the sender of the login request, that is, the crawler.
  • When the crawler receives this information, it modifies the proxy address and sends the login request again.
  • The new login request carries the modified proxy address.
  • Because the proxy address has been modified, the website server will not intercept it.
  • In other words, when the crawler sends a login request to the website server and receives feedback that access is restricted or erroneous, it modifies the proxy address in the login request and resends the request with the modified address, thereby successfully logging on to the website.
  • The proxy address can be modified as needed; for example, if the proxy address used last time is 192.168.1.1, the proxy address used next time can be changed to 192.168.2.1.
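The two proxy-rotation strategies above can be sketched as follows; the ProxyRotator class, the proxy pool, and the choice of HTTP status codes treated as "restricted" are illustrative assumptions rather than the patent's actual implementation:

```python
import itertools
import time

class ProxyRotator:
    """Sketch of the two strategies described above: periodic rotation,
    and rotation on restricted/error responses."""

    def __init__(self, proxies, interval_seconds=1800):
        self._cycle = itertools.cycle(proxies)  # round-robin proxy pool
        self.interval = interval_seconds        # e.g. half an hour
        self.current = next(self._cycle)
        self._last_rotated = time.monotonic()

    def rotate(self):
        """Switch to the next proxy address and record when."""
        self.current = next(self._cycle)
        self._last_rotated = time.monotonic()
        return self.current

    def proxy_for_request(self):
        """Strategy 1: rotate periodically (every self.interval seconds)."""
        if time.monotonic() - self._last_rotated >= self.interval:
            self.rotate()
        return self.current

    def handle_response(self, status_code):
        """Strategy 2: rotate when access is restricted or errors out.
        HTTP 403/429 and 5xx are used here as illustrative codes."""
        if status_code in (403, 429) or status_code >= 500:
            self.rotate()
        return self.current

rotator = ProxyRotator(["192.168.1.1:8080", "192.168.2.1:8080"])
first = rotator.current
after_block = rotator.handle_response(403)  # restricted, so a new proxy
```

A real crawler would attach `rotator.current` to each outgoing login request and call `handle_response` with the server's reply code.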
  • the crawled data obtained after data crawling may have duplicate pages and/or advertisements.
  • A locality-sensitive hashing algorithm may be used to deduplicate and filter the crawled data.
  • For example, the locality-sensitive hashing algorithm is the simhash algorithm.
  • The principle of the simhash algorithm is roughly as follows: the crawled text is first given basic preprocessing, such as removing stop words (numerals, quantifiers, function words and other words without substantive meaning), root restoration (stemming), and segmentation (chunking), finally yielding multiple vectors. A hash conversion is performed on each vector to obtain a hash code of length f bits, and the 1/0 value of each bit is then converted into a positive or negative weight: for example, when bit f1 is 1 the weight is set to +weight, and when bit f1 is 0 the weight is set to -weight, so that each vector corresponds to an f-bit weight vector.
  • The weight vectors corresponding to all vectors are then accumulated bit by bit, finally yielding an f-bit weight array. Each entry of this array is converted back to a bit, 1 where the accumulated weight is positive and 0 otherwise, producing a new 1-0 array. This new 1-0 array, which is a new hash code, is the hash fingerprint; the hash fingerprint is then used for deduplication and filtering to remove a large number of duplicate pages and advertisements.
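As a sketch, the fingerprinting and deduplication described above might look like the following; the whitespace tokenisation, the md5-based per-token hash, the uniform weights, and the Hamming-distance threshold of 3 are illustrative assumptions rather than the patent's actual parameters (a real system would also remove stop words and apply stemming first):

```python
import hashlib
import re

def simhash(text, f=64):
    """Compute an f-bit simhash fingerprint as described above."""
    tokens = re.findall(r"\w+", text.lower())
    weights = [0] * f
    for token in tokens:
        # Hash each token to an f-bit code.
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << f) - 1)
        for i in range(f):
            # Bit 1 contributes +weight, bit 0 contributes -weight
            # (uniform weight of 1 per token here).
            weights[i] += 1 if (h >> i) & 1 else -1
    # Positive accumulated weight becomes bit 1, otherwise bit 0.
    fingerprint = 0
    for i in range(f):
        if weights[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def is_duplicate(a, b, threshold=3):
    """Pages whose fingerprints differ in at most `threshold` bits
    (a commonly used cut-off) are treated as near-duplicates."""
    return hamming_distance(simhash(a), simhash(b)) <= threshold
```

Deduplication then amounts to dropping any crawled page whose fingerprint is within the threshold of an already stored fingerprint.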
  • The data crawling method provided by the embodiment of the application selects the required code blocks from the database according to the data crawling requirements, sorts the selected code blocks in the order in which the steps are executed to obtain the code block sequence, configures the required crawler according to the code block sequence, and finally uses the configured crawler for data crawling. Because the required code blocks can be selected according to the data crawling requirements and then sorted, this is equivalent to selecting multiple crawling steps according to the requirements and then combining and ordering those steps, so the configured crawler can meet various user needs, for example, whether to download an entire webpage or crawl precisely, and whether to crawl javascript or non-javascript webpages. Moreover, the data crawling method provided in the embodiment of the application is simple and easy to configure, and can crawl different websites and different forms of data.
  • A data crawling device 30 is provided.
  • The device 30 may be integrated into the above-mentioned computer device, and may specifically include:
  • the sequence determination module 32, used to select the required code blocks from the pre-built database according to the data crawling requirements, and to sort the selected code blocks according to their execution order to obtain the corresponding code block sequence;
  • the crawler configuration module 33 is used to configure the required crawler according to the code block sequence
  • the data crawling module 34 is configured to use the configured crawler to crawl data to obtain crawled data;
  • the database construction module 31 is used to construct the database in advance, and the database includes a plurality of code blocks.
  • The database construction module is specifically used to crawl data from multiple preset websites separately, and to treat the computer code corresponding to each crawling step in the data crawling process as a code block.
  • The device further includes: a deduplication filtering module, configured to use a locality-sensitive hashing algorithm to deduplicate and filter the crawled data.
  • The crawler configuration module is specifically configured to: determine the configuration file of the required crawler according to the code block sequence and a preset description document, wherein the description document stores the description information used to generate the configuration file.
  • Performing data crawling on multiple preset websites in the database construction module includes: respectively writing corresponding computer code for the multiple preset websites, and using the computer code corresponding to each website to crawl that website's data.
  • Writing the corresponding computer code for the preset websites in the database construction module includes: using a fine-grained decomposition method to separately write the corresponding computer code for the multiple preset websites.
  • Using the configured required crawler to perform data crawling in the data crawling module includes: using the required crawler to log in to a corresponding website, which specifically includes: sending a login request to the server of the corresponding website through the required crawler, where the login request carries a proxy address, and the proxy address is modified periodically through the required crawler, or modified through the required crawler when access is restricted or an access error is encountered.
  • The crawler configuration module is specifically used to: a1, configure the seed; a2, configure the address of the seed; a3, configure whether it is a full crawl; a4, configure whether to crawl javascript web content or non-javascript web content; a5, configure the keywords required for crawling; a6, configure the area where the seed is located; a7, configure the webpage level at which to start crawling; a8, configure the page turning mode; a9, configure the attributes of the fields to be captured.
  • The sequence determination module selects the required code blocks from the database according to the data crawling requirements and sorts the selected code blocks in the order in which the steps are executed to obtain the code block sequence; the crawler configuration module then configures the required crawler according to the code block sequence; and finally the data crawling module uses the configured crawler to crawl data. Because the required code blocks can be selected according to the data crawling requirements and then sorted, this is equivalent to selecting multiple crawling steps according to the requirements and then combining and ordering those steps, so the configured crawler can meet various user needs, for example, whether to download an entire webpage or crawl precisely, and whether to crawl javascript or non-javascript webpages. Moreover, the data crawling method provided in the embodiments of the application is simple and easy to configure, and can crawl different websites and different forms of data.
  • A computer device includes a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented: according to the data crawling requirements, select the required code blocks from the pre-built database, and sort the selected code blocks according to their execution order to obtain the corresponding code block sequence; configure the required crawler according to the code block sequence; and use the configured required crawler to crawl data to obtain crawled data. The database includes multiple code blocks, and the pre-building process of the database includes: crawling data from multiple preset websites separately, and treating the computer code corresponding to each step in the data crawling process as a code block.
  • The processor further implements the following step when executing the computer-readable instructions: using a locality-sensitive hashing algorithm to deduplicate and filter the crawled data.
  • the configuration of the required crawler according to the code block sequence executed by the processor includes: determining the required crawler according to the code block sequence and a preset description document The configuration file, wherein the description document stores the description information used to generate the configuration file.
  • The crawling of data from multiple preset websites performed by the processor includes: writing corresponding computer code for the multiple preset websites, and using the computer code corresponding to each website to crawl that website's data.
  • The respective writing of the corresponding computer code for the multiple preset websites executed by the processor includes: using a fine-grained decomposition method to separately write the corresponding computer code for the multiple preset websites.
  • The data crawling performed by the processor using the configured required crawler includes: using the required crawler to log in to a corresponding website, which specifically includes: sending a login request to the server of the corresponding website through the required crawler, where the login request carries a proxy address, and the proxy address is modified periodically through the required crawler, or modified through the required crawler when access is restricted or an access error is encountered.
  • The configuration of the required crawler according to the code block sequence executed by the processor includes: a1, configuring the seed; a2, configuring the address of the seed; a3, configuring whether it is a full crawl; a4, configuring whether to crawl javascript web content or non-javascript web content; a5, configuring the keywords required for crawling; a6, configuring the area where the seed is located; a7, configuring the webpage level at which to start crawling; a8, configuring the page turning mode; a9, configuring the attributes of the fields to be captured.
  • a non-volatile readable storage medium storing computer readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors execute the following steps: according to the data crawling requirements, select the required code blocks from the pre-built database, and sort the selected code blocks according to their execution order to obtain the corresponding code block sequence; configure the required crawler according to the code block sequence; and use the configured required crawler to crawl data to obtain crawled data. The database includes multiple code blocks, and the pre-construction process of the database includes: separately crawling data from multiple preset websites, and treating the computer code corresponding to each crawling step in the data crawling process as a code block.
  • The following step is further implemented: using a locality-sensitive hashing algorithm to deduplicate and filter the crawled data.
  • The configuration of the required crawler according to the code block sequence executed by the one or more processors includes: determining the configuration file of the required crawler according to the code block sequence and a preset description document, wherein the description document stores the description information used to generate the configuration file.
  • The crawling of data from multiple preset websites executed by the one or more processors includes: respectively writing the corresponding computer code for the multiple preset websites, and using the computer code corresponding to each website to crawl that website's data.
  • The writing of the corresponding computer code for the multiple preset websites executed by the one or more processors includes: using a fine-grained decomposition method to separately write the corresponding computer code for the multiple preset websites.
  • The data crawling performed by the one or more processors using the configured required crawler includes: using the required crawler to log in to a corresponding website, which specifically includes: sending a login request to the server of the corresponding website through the required crawler, where the login request carries a proxy address, and the proxy address is modified periodically through the required crawler, or modified through the required crawler when access is restricted or an access error is encountered.
  • The configuration of the required crawler according to the code block sequence executed by the one or more processors includes: a1, configuring the seed; a2, configuring the address of the seed; a3, configuring whether it is a full crawl; a4, configuring whether to crawl javascript web content or non-javascript web content; a5, configuring the keywords required for crawling; a6, configuring the area where the seed is located; a7, configuring the webpage level at which to start crawling; a8, configuring the page turning mode; a9, configuring the attributes of the fields to be captured.
  • The computer program can be stored in a computer-readable storage medium; when executed, it may include the processes of the above-mentioned method embodiments.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.

Abstract

This application relates to a data crawling method, apparatus, computer device and storage medium. The method includes: selecting required code blocks from a pre-built database according to a data crawling requirement; sorting the selected code blocks according to their execution order to obtain a corresponding code block sequence; configuring a required crawler according to the code block sequence; and performing data crawling with the configured required crawler to obtain crawled data. The database includes a plurality of code blocks, and the pre-building process of the database includes: performing data crawling on each of a plurality of preset websites, and taking the computer code corresponding to each crawling step in the data crawling process as one code block. This application can meet different user requirements.

Description

Data crawling method, apparatus, computer device and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on April 19, 2019, with application number 201910319429X and entitled "Data crawling method, apparatus, computer device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of crawler technology, and in particular to a data crawling method, apparatus, computer device and storage medium.
Background
The inventor found that at present there are many kinds of open-source crawlers, but each has its own strengths and weaknesses and cannot satisfy every data crawling requirement. For example, with the rapid development of networks, the World Wide Web has become the carrier of a vast amount of information, and extracting and using this information effectively has become a great challenge. Data on the Web takes many forms, such as images, databases, audio, video and other multimedia, as well as different forms of web pages and different anti-crawling techniques, so the various crawlers in today's open-source community are no longer sufficient to support crawling data in all these different forms.
Summary
The embodiments of this application provide a data crawling method, apparatus, computer device and storage medium, which can meet different data crawling requirements.
According to one aspect of this application, an embodiment of this application provides a data crawling method, including:
selecting required code blocks from a pre-built database according to a data crawling requirement, and sorting the selected code blocks according to their execution order to obtain a corresponding code block sequence;
configuring a required crawler according to the code block sequence;
performing data crawling with the configured required crawler to obtain crawled data;
wherein the database includes a plurality of code blocks, and the pre-building process of the database includes:
performing data crawling on each of a plurality of preset websites, and taking the computer code corresponding to each crawling step in the data crawling process as one code block.
According to another aspect of this application, an embodiment of this application further provides a data crawling apparatus, including:
a sequence determination module, configured to select required code blocks from a pre-built database according to a data crawling requirement, and sort the selected code blocks according to their execution order to obtain a corresponding code block sequence;
a crawler configuration module, configured to configure a required crawler according to the code block sequence;
a data crawling module, configured to perform data crawling with the configured required crawler to obtain crawled data;
a database building module, configured to pre-build the database, the database including a plurality of code blocks, the database building module being specifically configured to: perform data crawling on each of a plurality of preset websites, and take the computer code corresponding to each crawling step in the data crawling process as one code block.
According to yet another aspect of this application, an embodiment of this application further provides a computer device, including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the above data crawling method.
According to still another aspect of this application, an embodiment of this application further provides a non-volatile readable storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above data crawling method.
In the data crawling method, apparatus, computer device and storage medium provided by the embodiments of this application, required code blocks are selected from the database according to the data crawling requirement, the selected code blocks are sorted in the execution order of the steps to obtain a code block sequence, the required crawler is configured according to the code block sequence, and finally data crawling is performed with the configured crawler. Since the embodiments of this application can select the required code blocks according to the data crawling requirement and then sort them, which amounts to selecting multiple crawling steps according to the requirement and then combining and ordering those steps, the configured crawler can meet various user needs, for example whether to download an entire web page or capture precisely, and whether to crawl javascript or non-javascript web pages. Moreover, the data crawling method provided by the embodiments of this application is simple and easy to configure, and can crawl data of different forms from different websites.
Brief Description of the Drawings
FIG. 1 is a block diagram of the internal structure of a computer device in one embodiment;
FIG. 2 is a flowchart of a data crawling method in one embodiment;
FIG. 3 is a block diagram of the structure of a data crawling apparatus in one embodiment.
Detailed Description
To make the objects, technical solutions and advantages of this application clearer, this application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain this application and are not intended to limit it.
It can be understood that the terms "first", "second", etc. used in this application may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another.
FIG. 1 is a schematic structural diagram of a computer device in one embodiment of this application. As shown in FIG. 1, the computer device includes a processor, a non-volatile storage medium, a memory and a network interface connected through a system bus. The non-volatile storage medium of the computer device stores an operating system, a database and computer-readable instructions; the database may store a sequence of control information, and when the computer-readable instructions are executed by the processor, the processor may implement a data crawling method. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The memory of the computer device may store computer-readable instructions which, when executed by the processor, cause the processor to perform a data crawling method. The network interface of the computer device is used to connect and communicate with a terminal. Those skilled in the art can understand that the structure shown in FIG. 1 is only a block diagram of part of the structure related to the solution of this application and does not limit the computer device to which the solution of this application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
An embodiment of this application provides a data crawling method, which can be executed by the computer device in FIG. 1. As shown in FIG. 2, the method includes the following steps:
S21: select required code blocks from a pre-built database according to a data crawling requirement, and sort the selected code blocks according to their execution order to obtain a corresponding code block sequence;
wherein the database includes a plurality of code blocks, and the pre-building process of the database includes: performing data crawling on each of a plurality of preset websites, and taking the computer code corresponding to each crawling step in the data crawling process as one code block.
It can be understood that the above computer code is the code corresponding to a crawling step and may be referred to simply as crawling code.
In practical applications, the above preset websites may be, for example, a shopping website, a social networking website, a news website, a database website, and so on. Websites of different kinds can be chosen as the preset websites, so that the code blocks in the built database are relatively comprehensive and can be configured into various crawlers.
It can be understood that during the database building process the code corresponding to each crawling step is taken as one code block, and a code block may also be called a component; that is, one step corresponds to one code block or one component. A step is, for example, the step of logging in when crawling a web page, the step of entering a list, the step of turning a page, or the step of scrolling down. It can be seen that saving the computer code corresponding to each step into the database as one code block amounts to saving each step as a separate component.
In practical applications, performing data crawling on each of the preset websites may include: writing the corresponding computer code for each of the preset websites, and crawling each website with its corresponding computer code.
That is, computer code is first written for each preset website, which yields a crawler suitable for crawling that website; then data crawling is performed with the computer code corresponding to each preset website (i.e., the crawler corresponding to that website), and the code corresponding to each step in the crawling process is saved into the database as one code block (which may also be called a component). Writing computer code for each preset website in this way yields a crawler well suited to that website, so that each step can complete its crawling work effectively during data crawling.
Writing the corresponding computer code for each of the preset websites may include: using fine-grained decomposition to write, for each preset website, the corresponding computer code used for data crawling. In plain terms, the objects in the business model are subdivided to obtain a more scientific and reasonable object model; intuitively, many objects are carved out. The specific process may include: when writing the computer code used for data crawling for each preset website, writing computer code separately for different crawling objects, where the crawling objects include at least one of images, audio, video and text information. For example, when writing computer code for a news website, code is written with the images on the news website as the crawling object, with the audio as the crawling object, with the video as the crawling object, with the text information as the crawling object, and so on. Subdividing each website into many crawling objects makes the code blocks in the database more comprehensive, so that all kinds of data crawling requirements can be met.
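As a minimal illustration of the fine-grained decomposition described above, one could register one function per (website, crawling object) pair in a code-block database. All names here are hypothetical stand-ins, not code from the application:

```python
# Hypothetical code-block database: each crawling object of a preset
# website gets its own code block (component), keyed by (site, object).
CODE_BLOCK_DB = {}

def register_block(site, crawl_object):
    """Store the decorated function as one code block for (site, object)."""
    def decorator(func):
        CODE_BLOCK_DB[(site, crawl_object)] = func
        return func
    return decorator

@register_block("news_site", "image")
def crawl_images(page_urls):
    # Keep only image resources (the "image" crawling object).
    return [u for u in page_urls if u.endswith((".jpg", ".png"))]

@register_block("news_site", "text")
def crawl_text(page_items):
    # Keep only text items (the "text" crawling object).
    return [item for item in page_items if item.startswith("article:")]
```

Each registered function is one reusable component, so a later crawl requirement can pick exactly the objects it needs.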
For example, the multiple steps corresponding to the multiple code blocks in the database built through the above process may include: (1) log in and record the cookie; (2) enter the list page and crawl the network address URL; (3) enter the article page and crawl the article content; (4) click next to turn to the next page and continue; (5) enter the article page and crawl the article content; (6) drag the scroll bar down to reveal the next page of content; (7) enter content in the search box to search.
It can be understood that data crawling requirements can be varied, for example, which website to crawl and what kind of content (images, audio, video, text, etc.) on that website to crawl. Different data crawling requirements need different code blocks.
It can be understood that the embodiments of this application select the required code blocks from the database according to the data crawling requirement. Since different code blocks correspond to different steps, the execution order of the code blocks corresponds to the execution order of the steps, so the code blocks need to be sorted, which amounts to sorting the steps in execution order.
For example, a user wants to crawl content from Sina Weibo. From this data crawling requirement, the crawling steps will include: log in, search for a hot word, crawl the Weibo ID, Weibo content, publication time, etc., and turn the page. Based on the example above, the step order is roughly (1)-(7)-(3)-(4), so the code blocks corresponding to steps (1), (3), (4) and (7) need to be selected from the database, and these four code blocks are then sorted in the execution order (1)-(7)-(3)-(4) to obtain the corresponding code block sequence.
As another example, a user wants to crawl content from NetEase News. From this data crawling requirement, the crawling steps will include: enter the list page and crawl the URL, enter the article page, and scroll down to turn the page. Based on the example above, the step order is roughly (2)-(3)-(6), so the code blocks corresponding to steps (2), (3) and (6) need to be selected from the database, and these three code blocks are then sorted in the execution order (2)-(3)-(6) to obtain the corresponding code block sequence.
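The selection and ordering in these examples can be sketched as follows. The step functions are hypothetical stand-ins for the real code blocks; only the composition mechanism (pick blocks by step id, order them, run them in sequence) is illustrated:

```python
def make_step(name):
    """Build one code block (component) that records its step name."""
    def step(state):
        state["log"].append(name)
        return state
    return step

# Code blocks keyed by the step numbers used in the example above.
BLOCKS = {
    1: make_step("login"),      # (1) log in and record the cookie
    3: make_step("article"),    # (3) enter the article page
    4: make_step("next_page"),  # (4) click next to turn the page
    7: make_step("search"),     # (7) search-box input
}

def build_sequence(step_ids):
    """Select code blocks and order them per the crawl requirement."""
    return [BLOCKS[i] for i in step_ids]

def run(sequence):
    """Execute the code block sequence, threading the crawl state through."""
    state = {"log": []}
    for block in sequence:
        state = block(state)
    return state["log"]
```

For the Sina Weibo requirement, `build_sequence([1, 7, 3, 4])` yields the (1)-(7)-(3)-(4) sequence described in the text.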
S22: configure the required crawler according to the code block sequence.
It can be understood that configuring the required crawler is in fact the process of generating a configuration file; once the configuration file is obtained, the required crawler is configured. Therefore, the specific process of step S22 may include: determining the configuration file of the required crawler according to the code block sequence and a preset instruction document. The instruction document may store instruction information that helps the user generate the configuration file, for example, the flow of steps for generating it and what information is needed at each step.
In practical applications, the configuration can be done in the form of the extensible markup language XML; that is, the code in the configuration file can take the form of XML, which can improve the generality of the required crawler.
For example, for the data crawling requirement above in which the user wants to crawl content from Sina Weibo, the code block sequence is the one corresponding to steps (1)-(7)-(3)-(4), and the configuration file can be generated according to this code block sequence.
It can be understood that a data crawling requirement includes not only which website to crawl and what content to crawl, but also whether the crawl is full or incremental, whether to crawl javascript or non-javascript web content, at which level of web page to start crawling, whether the page-turning mode is pull-down scrolling, what attributes the fields to be captured have, and so on; these also need to be configured.
In a specific implementation, configuring the crawler according to the code block sequence may include configuring at least one of: the seed, the address of the seed, the area where the seed is located, whether it is a full crawl, the keywords required for crawling, the page-turning mode, the attributes of the fields to be captured, the level of web page at which crawling starts, and whether to crawl javascript web content.
The specific process may include the following steps:
a1. Configure the seed: seed, as the name suggests, is the starting point from which content is crawled outward;
a2. Configure the address of the seed: url is the address of the seed; for example, url is configured as http://www.chinanews.com/business/gd.shtml;
a3. Configure whether it is a full crawl: fully indicates whether the crawl is full; fully set to 1 means yes, and fully set to 0 means no;
a4. Configure whether to crawl javascript or non-javascript web content: for example, javascript indicates whether the page is a javascript page; javascript set to 1 means yes, and javascript set to 0 means no;
a5. Configure the keywords: keyword is the keyword; the code may also leave the keyword unset;
a6. Configure the area where the seed is located: seedArea is the area where the seed is located; if it is left empty, all URL addresses of the whole page are taken; in the code snippet above, the seed area is ![CDATA[#content_right>div.content_list]];
a7. Configure the level of web page at which crawling starts: start is the level of web page at which content crawling starts, for example, starting from the second-level page;
a8. Configure the page-turning mode: turning is the page-turning mode; turning configured as slider means the page-turning mode is pull-down scrolling;
a9. Configure the attributes of the fields to be captured: meta contains the attributes of the fields to be captured, for example, field (the domain), site (the address), tag (the label), index (the index), pic (the image), and so on.
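A hedged sketch of how the a1-a9 choices might be serialized into an XML configuration file of the kind described above. The tag names mirror the keywords in the text (seed, url, fully, javascript, keyword, seedArea, start, turning, meta), but the exact schema is an assumption for illustration:

```python
import xml.etree.ElementTree as ET

def build_config(url, fully, javascript, keyword, seed_area, start, turning, meta):
    """Serialize the a1-a9 configuration choices into an XML string."""
    seed = ET.Element("seed")                        # a1: the seed element
    for tag, value in [("url", url),                 # a2: seed address
                       ("fully", fully),             # a3: full crawl? 1/0
                       ("javascript", javascript),   # a4: javascript page? 1/0
                       ("keyword", keyword),         # a5: optional keyword
                       ("seedArea", seed_area),      # a6: seed area selector
                       ("start", start),             # a7: starting page level
                       ("turning", turning)]:        # a8: page-turning mode
        ET.SubElement(seed, tag).text = str(value)
    meta_el = ET.SubElement(seed, "meta")            # a9: field attributes
    for field, value in meta.items():
        ET.SubElement(meta_el, field).text = str(value)
    return ET.tostring(seed, encoding="unicode")
```

Calling it with the example values from the text (the chinanews url, fully=1, turning="slider", and so on) yields one self-contained `<seed>` configuration fragment.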
As can be seen from the above code snippet, a javascript page or a non-javascript page can be chosen; that is, both javascript page crawling and non-javascript page crawling can be implemented. When a javascript page is chosen, the javascript code can be interpreted precisely and converted into normal tagged html code. It can be understood that a javascript page is a dynamically generated page and a non-javascript page is a statically generated page.
Since in the embodiments of this application different code blocks can be combined and ordered according to the data crawling need (i.e., the various steps can be combined and configured arbitrarily) and the crawler is configured according to the resulting code block sequence, the configured crawler can download complete pages or capture precisely, for example capturing only images. Of course, by setting the data crawling requirement, clustered distributed crawling can also be implemented to increase crawling speed.
It can be seen that whatever the data crawling requirement is, the required crawler can be configured in the above way.
Of course, in practical applications, the configuration file can also be uploaded to a server for storage, so that it can be retrieved directly for the same data crawling requirement later; that is, the configuration file is obtained from the server and data crawling is performed according to it, which is more convenient.
S23: perform data crawling with the configured required crawler to obtain crawled data.
During data crawling, the crawler may encounter a website's anti-crawling mechanism: when a proxy IP address accesses a website frequently, the website restricts access from that proxy IP address. This problem can be mitigated in either of the following two ways:
(1) The crawler sends a login request to the server of the website to be logged in to, the login request carrying a proxy address (i.e., a proxy IP address) used to log in to the server of that website, and the proxy address is modified periodically, which avoids being restricted for frequently accessing the website from the same proxy address. For example, the crawler modifies the proxy address every half hour and stores the modified address; when the website needs to be accessed, the modified proxy address is retrieved.
(2) The crawler sends a login request to the server of the website to be logged in to, the login request carrying a proxy address (i.e., a proxy IP address) used to log in to the server of that website, and the proxy address is modified by the crawler when access is restricted or an access error occurs. When the server finds that a proxy address accesses its website frequently, it intercepts the requests and returns an access-restricted or access-error message to the sender of the login request, i.e., the crawler. On receiving this message, the crawler modifies the proxy address and sends the login request again, this time carrying the modified proxy address. Once the proxy address has been modified, the website's server no longer intercepts the requests. For example, when the crawler receives an access-restricted or access-error response after sending a login request to the website's server, the crawler modifies the proxy address in the login request and then sends a login request carrying the modified proxy address, thereby logging in to the website successfully.
Whichever way is used, the proxy address can be modified as needed; for example, if the proxy address used last time was 192.168.1.1, the proxy address used next time can be changed to 192.168.2.1.
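The two proxy-modification strategies can be sketched together as one rotator: rotate when a period elapses (strategy 1) or on demand after an access-restricted response (strategy 2). This is an illustrative sketch, not the application's code; the half-hour period and the 192.168.x.x addresses are the examples from the text:

```python
import itertools
import time

class ProxyRotator:
    """Rotate the proxy address periodically or on access errors."""

    def __init__(self, proxies, period_seconds=1800):  # 1800s = half an hour
        self._cycle = itertools.cycle(proxies)
        self.period = period_seconds
        self.current = next(self._cycle)
        self._last_change = time.monotonic()

    def get(self):
        """Return the proxy to use, rotating first if the period elapsed."""
        if time.monotonic() - self._last_change >= self.period:
            self.rotate()
        return self.current

    def rotate(self):
        """Switch to the next proxy, e.g. after an access-restricted reply."""
        self.current = next(self._cycle)
        self._last_change = time.monotonic()
```

A crawler would call `get()` before each login request, and call `rotate()` when the server replies that access is restricted or in error.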
In practical applications, the crawled data obtained after crawling may contain duplicate pages and/or advertisements, in which case a locality-sensitive hashing algorithm can be used to deduplicate and filter the crawled data.
The locality-sensitive hashing algorithm here is the simhash algorithm, whose principle is roughly as follows. The crawled text is first given basic preprocessing, such as removing stop words (meaningless words such as numerals, measure words and function words), stemming, and chunking, which finally yields multiple vectors. Each vector is converted with a hash algorithm into an f-bit hash code, and the 1-0 value at each bit is then converted into a signed weight; for example, when bit f1 is 1 the weight is set to +weight, and when bit f1 is 0 the weight is set to -weight, so each vector corresponds to an f-bit weight vector. The weight vectors of all the vectors are accumulated bit by bit, finally yielding an f-bit weight array; bits whose accumulated value is positive are set to 1 and bits whose value is negative are set to 0, so the text is converted into a new f-bit 1-0 array, that is, a new hash code, which is the hash fingerprint. The hash fingerprint is then used for deduplication and filtering, removing large numbers of duplicate pages, advertisements, and the like.
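The simhash procedure described above can be sketched compactly: hash each token, accumulate signed weights per bit, then threshold the accumulated weights into a fingerprint. The tokenization, the weights, and the 64-bit hash (here a truncated MD5) are assumptions for illustration:

```python
import hashlib

def simhash(weighted_tokens, f=64):
    """Compute an f-bit simhash fingerprint from (token, weight) pairs."""
    weights = [0] * f
    for token, weight in weighted_tokens:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(f):
            if (h >> i) & 1:
                weights[i] += weight   # bit is 1: add +weight
            else:
                weights[i] -= weight   # bit is 0: add -weight
    # Positive accumulated weight -> 1, otherwise 0: the fingerprint.
    return sum(1 << i for i in range(f) if weights[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits; small distance means near-duplicates."""
    return bin(a ^ b).count("1")
```

Deduplication then drops a page whose fingerprint is within a small Hamming distance of one already kept.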
In the data crawling method provided by the embodiments of this application, required code blocks are selected from the database according to the data crawling requirement, the selected code blocks are sorted in the execution order of the steps to obtain a code block sequence, the required crawler is configured according to the code block sequence, and finally data crawling is performed with the configured crawler. Since the embodiments of this application can select the required code blocks according to the data crawling requirement and then sort them, which amounts to selecting multiple crawling steps according to the requirement and then combining and ordering those steps, the configured crawler can meet various user needs, for example whether to download an entire web page or capture precisely, and whether to crawl javascript or non-javascript web pages. Moreover, the data crawling method provided by the embodiments of this application is simple and easy to configure, and can crawl data of different forms from different websites.
As shown in FIG. 3, in one embodiment a data crawling apparatus 30 is provided; the apparatus 30 can be integrated into the computer device described above, and may specifically include:
a sequence determination module 32, configured to select required code blocks from a pre-built database according to a data crawling requirement, and sort the selected code blocks according to their execution order to obtain a corresponding code block sequence;
a crawler configuration module 33, configured to configure a required crawler according to the code block sequence;
a data crawling module 34, configured to perform data crawling with the configured required crawler to obtain crawled data;
a database building module 31, configured to pre-build the database, the database including a plurality of code blocks, the database building module being specifically configured to: perform data crawling on each of a plurality of preset websites, and take the computer code corresponding to each crawling step in the data crawling process as one code block.
In some embodiments, the apparatus further includes a deduplication filtering module, configured to deduplicate and filter the crawled data using a locality-sensitive hashing algorithm.
In some embodiments, the crawler configuration module is specifically configured to determine the configuration file of the required crawler according to the code block sequence and a preset instruction document, the instruction document storing instruction information used to generate the configuration file.
In some embodiments, performing data crawling on each of the preset websites in the database building module includes: writing the corresponding computer code for each of the preset websites, and crawling each website with its corresponding computer code.
In some embodiments, writing the corresponding computer code for each of the preset websites in the database building module includes: using fine-grained decomposition to write the corresponding computer code for each of the preset websites.
In some embodiments, performing data crawling with the configured required crawler in the data crawling module includes: using the required crawler to log in to the corresponding website, specifically: sending a login request to the server of the corresponding website through the required crawler, the login request carrying a proxy address, and the proxy address being modified by the required crawler periodically or when access is restricted or an access error is encountered.
In some embodiments, the crawler configuration module is specifically configured to: a1, configure the seed; a2, configure the address of the seed; a3, configure whether it is a full crawl; a4, configure whether to crawl javascript or non-javascript web content; a5, configure the keywords required for crawling; a6, configure the area where the seed is located; a7, configure the level of web page at which crawling starts; a8, configure the page-turning mode; a9, configure the attributes of the fields to be captured.
In the data crawling apparatus provided by the embodiments of this application, the sequence determination module selects the required code blocks from the database according to the data crawling requirement and sorts the selected code blocks in the execution order of the steps to obtain a code block sequence; the crawler configuration module then configures the required crawler according to the code block sequence; and finally the data crawling module performs data crawling with the configured crawler. Since the embodiments of this application can select the required code blocks according to the data crawling requirement and then sort them, which amounts to selecting multiple crawling steps according to the requirement and then combining and ordering those steps, the configured crawler can meet various user needs, for example whether to download an entire web page or capture precisely, and whether to crawl javascript or non-javascript web pages. Moreover, the data crawling method provided by the embodiments of this application is simple and easy to configure, and can crawl data of different forms from different websites.
In some embodiments, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor, when executing the computer-readable instructions, implements the following steps: selecting required code blocks from a pre-built database according to a data crawling requirement, and sorting the selected code blocks according to their execution order to obtain a corresponding code block sequence; configuring a required crawler according to the code block sequence; performing data crawling with the configured required crawler to obtain crawled data; wherein the database includes a plurality of code blocks, and the pre-building process of the database includes: performing data crawling on each of a plurality of preset websites, and taking the computer code corresponding to each step in the data crawling process as one code block.
In some embodiments, the processor, when executing the computer program, further implements the following step: deduplicating and filtering the crawled data using a locality-sensitive hashing algorithm.
In some embodiments, configuring the required crawler according to the code block sequence, as executed by the processor, includes: determining the configuration file of the required crawler according to the code block sequence and a preset instruction document, the instruction document storing instruction information used to generate the configuration file.
In some embodiments, performing data crawling on each of the preset websites, as executed by the processor, includes: writing the corresponding computer code for each of the preset websites, and crawling each website with its corresponding computer code.
In some embodiments, writing the corresponding computer code for each of the preset websites, as executed by the processor, includes: using fine-grained decomposition to write the corresponding computer code for each of the preset websites.
In some embodiments, performing data crawling with the configured required crawler, as executed by the processor, includes: using the required crawler to log in to the corresponding website, specifically: sending a login request to the server of the corresponding website through the required crawler, the login request carrying a proxy address, and the proxy address being modified by the required crawler periodically or when access is restricted or an access error is encountered.
In some embodiments, configuring the required crawler according to the code block sequence, as executed by the processor, includes: a1, configure the seed; a2, configure the address of the seed; a3, configure whether it is a full crawl; a4, configure whether to crawl javascript or non-javascript web content; a5, configure the keywords required for crawling; a6, configure the area where the seed is located; a7, configure the level of web page at which crawling starts; a8, configure the page-turning mode; a9, configure the attributes of the fields to be captured. The beneficial effects of the computer device provided by this application are the same as those of the data crawling method and apparatus described above and are not repeated here.
In one embodiment, a non-volatile readable storage medium storing computer-readable instructions is provided; when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps: selecting required code blocks from a pre-built database according to a data crawling requirement, and sorting the selected code blocks according to their execution order to obtain a corresponding code block sequence; configuring a required crawler according to the code block sequence; performing data crawling with the configured required crawler to obtain crawled data; wherein the database includes a plurality of code blocks, and the pre-building process of the database includes: performing data crawling on each of a plurality of preset websites, and taking the computer code corresponding to each crawling step in the data crawling process as one code block.
In some embodiments, the one or more processors, when executing the computer-readable instructions, further implement the following step: deduplicating and filtering the crawled data using a locality-sensitive hashing algorithm.
In some embodiments, configuring the required crawler according to the code block sequence, as executed by the one or more processors, includes: determining the configuration file of the required crawler according to the code block sequence and a preset instruction document, the instruction document storing instruction information used to generate the configuration file.
In some embodiments, performing data crawling on each of the preset websites, as executed by the one or more processors, includes: writing the corresponding computer code for each of the preset websites, and crawling each website with its corresponding computer code.
In some embodiments, writing the corresponding computer code for each of the preset websites, as executed by the one or more processors, includes: using fine-grained decomposition to write the corresponding computer code for each of the preset websites.
In some embodiments, performing data crawling with the configured required crawler, as executed by the one or more processors, includes: using the required crawler to log in to the corresponding website, specifically: sending a login request to the server of the corresponding website through the required crawler, the login request carrying a proxy address, and the proxy address being modified by the required crawler periodically or when access is restricted or an access error is encountered.
In some embodiments, configuring the required crawler according to the code block sequence, as executed by the one or more processors, includes: a1, configure the seed; a2, configure the address of the seed; a3, configure whether it is a full crawl; a4, configure whether to crawl javascript or non-javascript web content; a5, configure the keywords required for crawling; a6, configure the area where the seed is located; a7, configure the level of web page at which crawling starts; a8, configure the page-turning mode; a9, configure the attributes of the fields to be captured.
The beneficial effects of the storage medium provided by this application are the same as those of the data crawling method and apparatus and are not repeated here.
Those of ordinary skill in the art can understand that all or part of the flow of the methods in the above embodiments can be completed by a computer program instructing the relevant hardware; the computer program can be stored in a computer-readable storage medium, and when executed, the program may include the flows of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in the combination of these technical features, the combinations should be considered within the scope of this specification. The above embodiments express only several implementations of this application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the patent scope of this application. It should be pointed out that those of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, all of which fall within the scope of protection of this application. Therefore, the scope of protection of this application patent shall be subject to the appended claims.

Claims (20)

  1. A data crawling method, comprising:
    selecting required code blocks from a pre-built database according to a data crawling requirement, and sorting the selected code blocks according to their execution order to obtain a corresponding code block sequence;
    configuring a required crawler according to the code block sequence;
    performing data crawling with the configured required crawler to obtain crawled data;
    wherein the database comprises a plurality of code blocks, and the pre-building process of the database comprises:
    performing data crawling on each of a plurality of preset websites, and taking the computer code corresponding to each crawling step in the data crawling process as one code block.
  2. The method according to claim 1, further comprising: deduplicating and filtering the crawled data using a locality-sensitive hashing algorithm.
  3. The method according to claim 1, wherein configuring the required crawler according to the code block sequence comprises: determining a configuration file of the required crawler according to the code block sequence and a preset instruction document, wherein the instruction document stores instruction information used to generate the configuration file.
  4. The method according to claim 1, wherein performing data crawling on each of the preset websites comprises: writing the corresponding computer code for each of the preset websites, and crawling each website with its corresponding computer code.
  5. The method according to claim 4, wherein writing the corresponding computer code for each of the preset websites comprises: using fine-grained decomposition to write the corresponding computer code for each of the preset websites.
  6. The method according to claim 1, wherein performing data crawling with the configured required crawler comprises: using the required crawler to log in to a corresponding website, specifically: sending a login request to a server of the corresponding website through the required crawler, the login request carrying a proxy address, and the proxy address being modified by the required crawler periodically or when access is restricted or an access error is encountered.
  7. The method according to claim 1, wherein configuring the required crawler according to the code block sequence comprises:
    a1. configuring the seed;
    a2. configuring the address of the seed;
    a3. configuring whether it is a full crawl;
    a4. configuring whether to crawl javascript or non-javascript web content;
    a5. configuring the keywords required for crawling;
    a6. configuring the area where the seed is located;
    a7. configuring the level of web page at which crawling starts;
    a8. configuring the page-turning mode;
    a9. configuring the attributes of the fields to be captured.
  8. A data crawling apparatus, comprising:
    a sequence determination module, configured to select required code blocks from a pre-built database according to a data crawling requirement, and sort the selected code blocks according to their execution order to obtain a corresponding code block sequence;
    a crawler configuration module, configured to configure a required crawler according to the code block sequence;
    a data crawling module, configured to perform data crawling with the configured required crawler to obtain crawled data;
    a database building module, configured to pre-build the database, the database comprising a plurality of code blocks, the database building module being specifically configured to: perform data crawling on each of a plurality of preset websites, and take the computer code corresponding to each crawling step in the data crawling process as one code block.
  9. The apparatus according to claim 8, further comprising a deduplication filtering module, configured to deduplicate and filter the crawled data using a locality-sensitive hashing algorithm.
  10. The apparatus according to claim 8, wherein the crawler configuration module is specifically configured to determine a configuration file of the required crawler according to the code block sequence and a preset instruction document, wherein the instruction document stores instruction information used to generate the configuration file.
  11. The apparatus according to claim 8, wherein the database building module writes the corresponding computer code for each of the preset websites and crawls each website with its corresponding computer code.
  12. The apparatus according to claim 11, wherein writing the corresponding computer code for each of the preset websites in the database building module comprises: using fine-grained decomposition to write the corresponding computer code for each of the preset websites.
  13. The apparatus according to claim 8, wherein performing data crawling with the configured required crawler in the data crawling module comprises: using the required crawler to log in to a corresponding website, specifically: sending a login request to a server of the corresponding website through the required crawler, the login request carrying a proxy address, and the proxy address being modified by the required crawler periodically or when access is restricted or an access error is encountered.
  14. The apparatus according to claim 8, wherein the crawler configuration module is specifically configured to: a1, configure the seed; a2, configure the address of the seed; a3, configure whether it is a full crawl; a4, configure whether to crawl javascript or non-javascript web content; a5, configure the keywords required for crawling; a6, configure the area where the seed is located; a7, configure the level of web page at which crawling starts; a8, configure the page-turning mode; a9, configure the attributes of the fields to be captured.
  15. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of a data crawling method, comprising: selecting required code blocks from a pre-built database according to a data crawling requirement, and sorting the selected code blocks according to their execution order to obtain a corresponding code block sequence; configuring a required crawler according to the code block sequence; performing data crawling with the configured required crawler to obtain crawled data; wherein the database comprises a plurality of code blocks, and the pre-building process of the database comprises: performing data crawling on each of a plurality of preset websites, and taking the computer code corresponding to each crawling step in the data crawling process as one code block.
  16. The computer device according to claim 15, wherein the computer-readable instructions, when executed by the processor, further cause the processor to perform: deduplicating and filtering the crawled data using a locality-sensitive hashing algorithm.
  17. The computer device according to claim 15, wherein the computer-readable instructions, when executed by the processor, cause the processor to perform the configuring of the required crawler according to the code block sequence by: determining a configuration file of the required crawler according to the code block sequence and a preset instruction document, wherein the instruction document stores instruction information used to generate the configuration file.
  18. A non-volatile readable storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of a data crawling method, comprising: selecting required code blocks from a pre-built database according to a data crawling requirement, and sorting the selected code blocks according to their execution order to obtain a corresponding code block sequence; configuring a required crawler according to the code block sequence; performing data crawling with the configured required crawler to obtain crawled data; wherein the database comprises a plurality of code blocks, and the pre-building process of the database comprises: performing data crawling on each of a plurality of preset websites, and taking the computer code corresponding to each crawling step in the data crawling process as one code block.
  19. The storage medium according to claim 18, wherein the computer-readable instructions, when executed by the processor, further cause the processor to perform: deduplicating and filtering the crawled data using a locality-sensitive hashing algorithm.
  20. The storage medium according to claim 18, wherein the computer-readable instructions, when executed by the processor, cause the processor to perform the configuring of the required crawler according to the code block sequence by: determining a configuration file of the required crawler according to the code block sequence and a preset instruction document, wherein the instruction document stores instruction information used to generate the configuration file.
PCT/CN2019/118419 2019-04-19 2019-11-14 数据爬取方法、装置、计算机设备和存储介质 WO2020211367A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910319429.XA CN110209909A (zh) 2019-04-19 2019-04-19 数据爬取方法、装置、计算机设备和存储介质
CN201910319429.X 2019-04-19

Publications (1)

Publication Number Publication Date
WO2020211367A1 true WO2020211367A1 (zh) 2020-10-22

Family

ID=67786028

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118419 WO2020211367A1 (zh) 2019-04-19 2019-11-14 数据爬取方法、装置、计算机设备和存储介质

Country Status (2)

Country Link
CN (1) CN110209909A (zh)
WO (1) WO2020211367A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209909A (zh) * 2019-04-19 2019-09-06 平安科技(深圳)有限公司 数据爬取方法、装置、计算机设备和存储介质
CN112541104A (zh) * 2019-09-20 2021-03-23 浙江大搜车软件技术有限公司 一种数据抓取方法及装置
CN111597421B (zh) * 2020-04-30 2022-08-30 武汉思普崚技术有限公司 一种实现网站图片爬虫的方法、装置、设备及存储介质
CN112732996A (zh) * 2021-01-11 2021-04-30 深圳市洪堡智慧餐饮科技有限公司 一种基于异步aiohttp多平台分布式数据爬取方法
CN113542223A (zh) * 2021-06-16 2021-10-22 杭州拼便宜网络科技有限公司 基于设备指纹的反爬虫方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567513A (zh) * 2011-12-27 2012-07-11 北京神州绿盟信息安全科技股份有限公司 钓鱼网站收集方法和钓鱼网站收集设备
US20150039419A1 (en) * 2013-08-05 2015-02-05 Yahoo! Inc. Keyword price recommendation
CN107729508A (zh) * 2017-10-23 2018-02-23 北京京东金融科技控股有限公司 信息爬取方法与装置
CN110189189A (zh) * 2019-04-19 2019-08-30 平安科技(深圳)有限公司 一站式网络购物引导方法、装置、计算机设备和存储介质
CN110209909A (zh) * 2019-04-19 2019-09-06 平安科技(深圳)有限公司 数据爬取方法、装置、计算机设备和存储介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766014B (zh) * 2015-04-30 2017-12-01 安一恒通(北京)科技有限公司 用于检测恶意网址的方法和系统
CN108153880A (zh) * 2017-12-26 2018-06-12 北京非斗数据科技发展有限公司 一种关于网络图片的多策略自适应爬取技术
CN109063144A (zh) * 2018-08-07 2018-12-21 广州金猫信息技术服务有限公司 可视化网络爬虫方法及装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567513A (zh) * 2011-12-27 2012-07-11 北京神州绿盟信息安全科技股份有限公司 钓鱼网站收集方法和钓鱼网站收集设备
US20150039419A1 (en) * 2013-08-05 2015-02-05 Yahoo! Inc. Keyword price recommendation
CN107729508A (zh) * 2017-10-23 2018-02-23 北京京东金融科技控股有限公司 信息爬取方法与装置
CN110189189A (zh) * 2019-04-19 2019-08-30 平安科技(深圳)有限公司 一站式网络购物引导方法、装置、计算机设备和存储介质
CN110209909A (zh) * 2019-04-19 2019-09-06 平安科技(深圳)有限公司 数据爬取方法、装置、计算机设备和存储介质

Also Published As

Publication number Publication date
CN110209909A (zh) 2019-09-06

Similar Documents

Publication Publication Date Title
WO2020211367A1 (zh) 数据爬取方法、装置、计算机设备和存储介质
US11074560B2 (en) Tracking processed machine data
US8554800B2 (en) System, methods and applications for structured document indexing
KR101623223B1 (ko) 하나의 인터넷 호스팅 시스템 집합에 의해 제공되는 다수의 병렬 사용자 경험
US9077681B2 (en) Page loading optimization using page-maintained cache
US8660976B2 (en) Web content rewriting, including responses
US9122769B2 (en) Method and system for processing information of a stream of information
TWI410812B (zh) 網站之定做的、私人化的與整合的客戶端搜尋索引
JP2017532621A (ja) 動的コンテンツと古いコンテンツとを含んでいるウェブサイトの高速レンダリング
JP6203374B2 (ja) ウェブページ・スタイルアドレスの統合
JP6720626B2 (ja) キュレートされたコンテンツ内の古くなったアイテムの除去
US20120016857A1 (en) System and method for providing search engine optimization analysis
US20140208198A1 (en) Representation of an element in a page via an identifier
Bhoedjang et al. Engineering an online computer forensic service
US20080133460A1 (en) Searching descendant pages of a root page for keywords
WO2020207022A1 (zh) 基于Scrapy的数据爬取方法、系统、终端设备及存储介质
KR102284761B1 (ko) 내장가능형 미디어 콘텐츠 검색 위젯
Mendoza et al. BrowStEx: A tool to aggregate browser storage artifacts for forensic analysis
Mehta et al. A comparative study of various approaches to adaptive web scraping
US8909632B2 (en) System and method for maintaining persistent links to information on the Internet
CN109558123A (zh) 网页转化电子书的方法、电子设备、存储介质
Bojinov RESTful Web API Design with Node. js
US20120310893A1 (en) Systems and methods for manipulating and archiving web content
Jobst et al. Efficient github crawling using the graphql api
US20190095538A1 (en) Method and system for generating content from search results rendered by a search engine

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19924860

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19924860

Country of ref document: EP

Kind code of ref document: A1