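WO2020238131A1 - Testing method and device for web crawler system, storage medium, and electronic device - Google Patents
Testing method and device for web crawler system, storage medium, and electronic device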
- Publication number
- WO2020238131A1 (PCT/CN2019/123059)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- crawler
- machine
- task
- working time
- network
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Definitions
- the present disclosure relates to the technical field of testing tools, and in particular to a testing method of a web crawler system, a testing device of a web crawler system, a computer-readable storage medium and electronic equipment.
- the Internet has become a carrier of a large amount of information.
- search engines have become the entrance and guide for users to access the Internet.
- the web crawler system is a system that automatically extracts web pages.
- the web crawler system includes a crawler task distribution machine and multiple crawler machines.
- the crawler task distributor is used to distribute tasks to the crawler machines. After receiving a crawler task, a crawler machine starts from the URL (Uniform Resource Locator) of one or several initial web pages, continuously extracts new URLs from the current page, and puts them in the queue for searching until the system's stopping conditions are met. Since the web crawler system needs to crawl a huge number of websites every day, its performance must be tested in order to understand its working efficiency.
- the embodiments of the present disclosure provide a testing method of a web crawler system, a testing device of a web crawler system, a computer-readable storage medium, and electronic equipment.
- a method for testing a web crawler system including:
- when a test request signal is received, the crawler task is obtained from the system task database and sent to the crawler task distributor;
- when the crawler task distributor distributes tasks to the network crawler machine cluster, the total working time of each crawler machine in the network crawler machine cluster is obtained;
- according to the total working time of each crawler machine, the judgment result of whether the workload of the crawler machines in the network crawler machine cluster is balanced is obtained.
- test device for a web crawler system including:
- the task acquisition module is configured to acquire the crawler task from the system task database when the test request signal is received, and send the crawler task to the crawler task distributor;
- the time recording module is configured to obtain the total working time of each crawler machine in the network crawler machine cluster when the crawler task distributor distributes tasks to the network crawler machine cluster;
- the judgment module is configured to obtain a judgment result of whether the workload of the crawler machines in the network crawler machine cluster is balanced according to the total working time of each crawler machine.
- a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed by a processor, the method for testing a web crawler system as described in any one of the above is implemented.
- the computer-readable storage medium may be a non-volatile computer-readable storage medium.
- an electronic device including:
- the processor is configured to implement the testing method of the web crawler system as described in any one of the above by executing the computer program.
- the present disclosure calculates the total working time of each crawler machine during the period when the crawler task distribution machine distributes tasks to the network crawler machine cluster to obtain the judgment result of whether the workload of the crawler machines in the network crawler machine cluster is balanced.
- the test process is simple and easy to implement, and it improves the user's efficiency in testing the web crawler system.
- Fig. 1 shows a schematic flowchart of a method for testing a web crawler system according to an exemplary embodiment of the present disclosure.
- Fig. 2 shows a schematic flowchart of step S130 in the testing method of the web crawler system of Fig. 1 according to an exemplary embodiment of the present disclosure.
- Fig. 3 shows a schematic flow chart of establishing a system task database further included in a testing method of a web crawler system according to an exemplary embodiment of the present disclosure.
- Fig. 4 shows a schematic block diagram of a test device of a web crawler system according to an exemplary embodiment of the present disclosure.
- Fig. 5 shows a schematic block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
- Fig. 6 shows a schematic diagram of a computer-readable storage medium according to an exemplary embodiment of the present disclosure.
- FIG. 1 is a schematic flowchart of a method for testing a web crawler system according to an exemplary embodiment of the present disclosure (this application).
- a method for testing a web crawler system is provided.
- the test method of the web crawler system can be run on any computing device, for example on a terminal or a server, or on a server cluster or cloud server. Those skilled in the art can also run the method of this application on other platforms as required, which is not specifically limited in this disclosure.
- the test method of the web crawler system includes:
- step S110 when the test request signal is received, the crawler task is acquired from the system task database, and the crawler task is sent to the crawler task distributor.
- a web crawler system refers to a system that automatically grabs information on the World Wide Web in accordance with predetermined rules.
- the web crawler system includes a crawler task distribution machine and a network crawler machine cluster.
- the crawler task distributor is used to distribute crawling tasks to the network crawler machine cluster.
- the crawler machine cluster includes multiple crawler machines, and when the network crawler machine cluster receives the crawler task distributed by the crawler task distributor, the crawler machine crawls the crawler task.
- the test request signal refers to a signal used to request the start of the test.
- the test request signal may be sent by the user clicking a specific area of the interface, for example, the user clicking the test request button.
- alternatively, the test request signal may be sent at predetermined intervals, such as every 8 hours, 12 hours, or 24 hours, which is not specifically limited in this example. For example, the test request signal may be configured to be sent at 18:00 every day to request the start of the test.
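The two scheduling options mentioned here (a fixed interval, or a fixed time of day such as 18:00) can be sketched as follows. This is only an illustration: the function names and default values are hypothetical, and the source does not describe how the signal is actually scheduled.

```python
from datetime import datetime, time, timedelta


def next_daily_trigger(now: datetime, at: time = time(18, 0)) -> datetime:
    """Return the next moment a once-a-day test request would fire (assumed 18:00)."""
    candidate = datetime.combine(now.date(), at)
    # If today's trigger time has already been reached, schedule tomorrow's.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def next_interval_trigger(last_sent: datetime, hours: int = 8) -> datetime:
    """Return the next moment an interval-based test request would fire."""
    return last_sent + timedelta(hours=hours)
```

With these assumptions, a request evaluated at 17:00 fires the same day at 18:00, while one evaluated at 18:30 is deferred to 18:00 on the next day.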
- the system task database refers to the database used to store the crawler tasks for testing the web crawler system.
- the crawler task is obtained from the system task database and sent to the crawler task distributor, and the crawler task distributor distributes the crawler tasks to the network crawler machine cluster.
- the number of crawler tasks is multiple, and those skilled in the art can set according to actual needs. For example, 1000 crawler tasks, 2000 crawler tasks, or 5000 crawler tasks can be obtained. This example does not specifically limit this.
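Step S110 can be sketched as below, assuming the system task database is a SQLite table and the crawler task distributor is fed through an in-process queue. The table name `crawler_tasks`, its columns, and the function names are hypothetical; the source does not specify a storage engine or a transport.

```python
import sqlite3
from queue import Queue


def fetch_crawler_tasks(conn: sqlite3.Connection, limit: int) -> list[str]:
    """Read up to `limit` crawler tasks (stored here as URLs) from the task database."""
    rows = conn.execute(
        "SELECT url FROM crawler_tasks ORDER BY id LIMIT ?", (limit,)
    ).fetchall()
    return [url for (url,) in rows]


def on_test_request(conn: sqlite3.Connection, distributor: Queue, limit: int = 1000) -> int:
    """Handle a test request signal: load the tasks and hand them to the distributor."""
    tasks = fetch_crawler_tasks(conn, limit)
    for task in tasks:
        distributor.put(task)
    return len(tasks)


# Demo with an in-memory database standing in for the system task database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawler_tasks (id INTEGER PRIMARY KEY, url TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO crawler_tasks (url) VALUES (?)",
    [(f"https://example.com/page/{i}",) for i in range(5)],
)
distributor_queue: Queue = Queue()
sent = on_test_request(conn, distributor_queue, limit=3)
```

The `limit` parameter plays the role of the configurable number of crawler tasks (1000, 2000, 5000, and so on).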
- FIG. 3 is a schematic diagram of the process of establishing a system task database in a method for testing a web crawler system according to an exemplary embodiment of the present disclosure.
- the test method of the web crawler system also includes:
- Step S310 Obtain multiple uniform resource locators.
- the Uniform Resource Locator is the address of a standard resource on the Internet.
- when the crawler machine performs the crawling task, it starts from the URL of one or several initial web pages, continuously extracts new URLs from the current page, and puts them in the queue for searching until the system's stopping conditions are met.
- a random search may be performed on the Internet to obtain the uniform resource locator.
- step S320 the multiple uniform resource locators are sent to the network crawler machine cluster, and the crawler machines in the network crawler machine cluster crawl each uniform resource locator, and the crawling result is recorded.
- multiple uniform resource locators are sent to the web crawler machine cluster, the crawler machines in the cluster crawl each uniform resource locator, and the crawling results are recorded, so that a sufficient number of URLs is obtained and stored as crawler tasks.
- Step S330 When the number of crawling results meets the predetermined number, all the crawling results are stored as crawling tasks in the system task database.
- the predetermined number is pre-configured, for example, the predetermined number may be 1000, 2000, 5000, etc.
- when the number of crawling results reaches the predetermined number, the crawling is stopped, and the recorded crawling results are stored as crawler tasks in the system task database for subsequent testing.
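Steps S310 to S330 can be sketched as a loop that crawls seed URLs and stops once the predetermined number of results has been recorded. The `crawl` callable, the record layout, and the choice to return an empty list when too few results were collected are assumptions for illustration; the source does not say what happens in that case.

```python
from typing import Callable, Iterable


def collect_crawler_tasks(
    urls: Iterable[str],
    crawl: Callable[[str], str],
    predetermined_number: int,
) -> list[dict]:
    """Crawl URLs until `predetermined_number` results are recorded, then return them."""
    results = []
    for url in urls:
        results.append({"url": url, "result": crawl(url)})
        if len(results) >= predetermined_number:
            # Enough results: stop crawling and hand everything over for storage.
            return results
    # Assumption: with too few results, nothing is stored yet.
    return []


# Demo with a fake crawler that records which URLs were visited.
visited = []


def fake_crawl(url: str) -> str:
    visited.append(url)
    return f"<html>{url}</html>"


seeds = [f"https://example.com/{i}" for i in range(5)]
tasks = collect_crawler_tasks(seeds, fake_crawl, predetermined_number=3)
```

In the demo only the first three seeds are crawled, because crawling stops as soon as the predetermined number is reached.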
- Step S120 when the crawler task distribution machine distributes tasks to the network crawler machine cluster, obtain the total working time of each crawler machine in the network crawler machine cluster.
- the crawler task distributor distributes crawler tasks to crawler machines in the network crawler machine cluster.
- after a crawler machine completes a crawler task, the crawler task distributor continues to distribute the next crawler task to that crawler machine. The working time required for each crawler machine to complete each crawler task is recorded, and the working times of each crawler machine are added together to obtain the total working time of that crawler machine.
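One way to read this step is that a crawler machine is handed its next task as soon as it finishes the current one. The simulation below sketches that reading with made-up task durations; the distribution policy and the function name are assumptions, not something the source specifies.

```python
import heapq


def simulate_distribution(task_durations: list[float], machine_count: int) -> list[float]:
    """Return each machine's total working time when tasks go to whichever machine frees up first."""
    totals = [0.0] * machine_count
    # Heap of (time at which the machine becomes free, machine index).
    free_at = [(0.0, m) for m in range(machine_count)]
    heapq.heapify(free_at)
    for duration in task_durations:
        now, machine = heapq.heappop(free_at)  # earliest-free machine takes the task
        totals[machine] += duration
        heapq.heappush(free_at, (now + duration, machine))
    return totals


totals = simulate_distribution([70, 98, 82, 30], machine_count=2)
```

The returned list is the per-machine total working time that the later balance judgment takes as input.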
- the obtaining the total working time of each crawler machine in the network crawler machine cluster includes:
- when each crawler machine receives the crawler task distributed by the crawler task distributor, the working time required for the crawler machine to complete the crawler task is recorded.
- after each crawler machine receives the crawler task distributed by the crawler task distributor, the working time required to complete the crawler task is recorded from the moment the crawler machine starts crawling to the moment it stops crawling. For example, if the crawler machine starts crawling at 15:30 and stops crawling at 15:35, having completed the crawler task, the working time required for the crawler machine to complete the crawler task is 5 minutes.
- the recording the working time required by the crawler machine to complete the crawler task includes:
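- when the crawler machine receives the crawler task, timing starts when the crawler machine begins its first crawl;
- timing ends after the crawler machine has completed a predetermined number of crawls for the crawler task, which gives the working time required by the crawler machine to complete the crawler task, and the working time is stored in association with the crawler machine.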
- the working time required by the crawler machine to complete the crawler task is acquired by means of timing, so that the acquired working time is more intuitive, unnecessary calculations are not required, and unnecessary power consumption is reduced.
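Measuring the working time by timing can be sketched with a monotonic clock wrapped around the crawl call. The `crawl` callable, the machine identifier, and the `records` dictionary are hypothetical stand-ins for whatever the crawler machine and the time recording step actually use.

```python
import time
from typing import Callable


def timed_crawl(
    machine_id: str,
    task: str,
    crawl: Callable[[str], None],
    records: dict[str, list[float]],
) -> float:
    """Run one crawler task and store its working time (in seconds) under the machine's id."""
    start = time.perf_counter()  # timing starts when crawling begins
    crawl(task)
    elapsed = time.perf_counter() - start  # timing ends when crawling stops
    records.setdefault(machine_id, []).append(elapsed)
    return elapsed
```

Using `time.perf_counter()` rather than wall-clock timestamps avoids errors if the system clock is adjusted while a task is running.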
- the total working time of each crawler machine is calculated based on the working time required for each crawler machine to complete each crawler task.
- the working time required by each crawler machine to complete each crawler task is added to obtain the total working time of the crawler machine.
- for example, if a crawler machine completes three crawler tasks and the working times for the three tasks are 70 s, 98 s, and 82 s, the total working time of the crawler machine is 250 s.
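The summation can be sketched as follows, assuming the recorded working times are available as (machine id, seconds) pairs. The record format and names are illustrative only.

```python
from collections import defaultdict
from typing import Iterable


def total_working_times(records: Iterable[tuple[str, float]]) -> dict[str, float]:
    """Add up the per-task working times of each crawler machine."""
    totals: dict[str, float] = defaultdict(float)
    for machine_id, seconds in records:
        totals[machine_id] += seconds
    return dict(totals)


# The three tasks from the example: 70 s + 98 s + 82 s = 250 s.
example_records = [("crawler-1", 70), ("crawler-1", 98), ("crawler-1", 82)]
example_totals = total_working_times(example_records)
```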
- Step S130 according to the total working time of each crawler machine, to obtain a judgment result of whether the workload of the crawler machines in the network crawler machine cluster is balanced.
- the longer the total working time of the crawler machine the greater the workload of the crawler machine.
- the workload of the crawler machines in the network crawler machine cluster can be obtained.
- the user can debug the web crawler system according to the judgment result, so as to make full use of the performance of the web crawler system and improve crawler efficiency.
- FIG. 2 is a flowchart of step S130 in the test method of the web crawler system of FIG. 1 according to an exemplary embodiment of the present disclosure.
- obtaining, according to the total working time of each crawler machine, the judgment result of whether the workload of the crawler machines in the web crawler machine cluster is balanced includes:
- Step S210 sort the total working time of each crawler machine in ascending order to obtain a working time sequence;
- Step S220 based on the obtained working time sequence, subtract the first total working time from the last total working time in the working time sequence to obtain a time difference;
- Step S230 divide the time difference by the first total working time in the working time sequence to obtain the equilibrium rate of the workload of the crawler machines in the network crawler machine cluster;
- Step S240 based on the equilibrium rate, determine whether the workload of the crawler machines in the web crawler machine cluster is balanced.
- the total working time of each crawler machine is sorted from small to large.
- for example, a web crawler machine cluster includes 4 crawler machines whose total working times are 125 s, 113 s, 98 s, and 136 s. Sorting the total working times in ascending order gives the working time sequence (98, 113, 125, 136).
- the total working time in the first position of the working time sequence is subtracted from the total working time in the last position, that is, the minimum value is subtracted from the maximum value, to get the time difference. In this example the time difference is 136 s - 98 s = 38 s.
- the time difference is then divided by the total working time in the first position of the working time sequence, and the resulting ratio is the equilibrium rate of the workload of the crawler machines in the network crawler machine cluster. In this example the equilibrium rate is 38/98 ≈ 38.78%.
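The calculation in steps S210 to S230 amounts to (longest total time - shortest total time) / shortest total time. A minimal sketch, with a hypothetical function name and with input checks that the source does not mention:

```python
from typing import Sequence


def equilibrium_rate(total_times: Sequence[float]) -> float:
    """Compute (max - min) / min over the machines' total working times."""
    if not total_times:
        raise ValueError("at least one total working time is required")
    ordered = sorted(total_times)  # S210: ascending working time sequence
    shortest, longest = ordered[0], ordered[-1]
    if shortest <= 0:
        # Assumption: a non-positive total time makes the ratio meaningless.
        raise ValueError("total working times must be positive")
    difference = longest - shortest  # S220: time difference
    return difference / shortest  # S230: equilibrium rate


# The four machines from the example: (136 - 98) / 98.
rate = equilibrium_rate([125, 113, 98, 136])
```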
- the relationship between the workload of the crawler machine with the longest total working time and the workload of the crawler machine with the shortest total working time can be intuitively obtained.
- the greater the equilibrium rate, the more the workload of the crawler machine with the longest total working time exceeds the workload of the crawler machine with the shortest total working time, that is, the less balanced the workload of the crawler machines in the web crawler machine cluster.
- the smaller the equilibrium rate, the less the workload of the crawler machine with the longest total working time exceeds the workload of the crawler machine with the shortest total working time, that is, the more balanced the workload of the crawler machines in the web crawler machine cluster.
- judging whether the workload of the crawler machines in the web crawler machine cluster is balanced based on the equilibrium rate includes:
- when the equilibrium rate is less than or equal to a predetermined threshold, it is determined that the workload of the crawler machines in the network crawler machine cluster is balanced;
- when the equilibrium rate is greater than the predetermined threshold, it is determined that the workload of the crawler machines in the network crawler machine cluster is not balanced.
- the predetermined threshold is configured in advance, and the predetermined threshold may be 10%, 20%, or 25%, etc., which is not specifically limited in this example.
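The threshold comparison is small enough to sketch directly. The default of 0.20 is just one of the example thresholds listed above, and the function name is hypothetical.

```python
def is_balanced(equilibrium_rate: float, threshold: float = 0.20) -> bool:
    """Balanced when the equilibrium rate is less than or equal to the predetermined threshold."""
    return equilibrium_rate <= threshold
```

With an equilibrium rate of 38/98 (about 38.78%), the cluster would be judged unbalanced under any of the example thresholds of 10%, 20%, or 25%.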
- the predetermined threshold value can be acquired by a user equipment, such as a mobile phone or a computer.
- the user equipment displays a specific acquisition interface to the user, and the user triggers a specific function on the acquisition interface to provide the threshold. For example, when the user clicks the "predetermined threshold input" button on the acquisition interface, an input box appears, and the user enters the predetermined threshold in the input box through an input device such as a keyboard or a touch screen.
- the task crawl success rate of each crawler machine in the network crawler machine cluster (such as the ratio of the number of successful task crawls to the total number of crawls) can also be obtained, and the judgment result of whether the workload of the crawler machines in the network crawler machine cluster is balanced can then be obtained based on both the total working time and the task crawl success rate of each crawler machine. This can further improve the reliability of the judgment result.
- when the equilibrium rate is less than or equal to the predetermined threshold and the task crawl success rates of all crawler machines are higher than a preset first success rate threshold, it is determined that the workload of the crawler machines in the web crawler machine cluster is balanced;
- when the equilibrium rate is greater than the predetermined threshold and the task crawl success rate of any crawler machine is lower than or equal to the preset first success rate threshold, it is determined that the workload of the crawler machines in the web crawler machine cluster is unbalanced.
- the average value of the task crawling success rate of all crawling machines can also be calculated.
- when the equilibrium rate is less than or equal to the predetermined threshold and the average value is higher than a preset second success rate threshold, it is determined that the workload of the crawler machines in the network crawler machine cluster is balanced;
- when the equilibrium rate is greater than the predetermined threshold and the average value is lower than or equal to the preset second success rate threshold, it is determined that the workload of the crawler machines in the web crawler machine cluster is not balanced.
- the above-mentioned threshold may be preset or determined in other ways, which is not limited in this application.
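The two combined rules (per-machine success rates, and the average success rate) can be sketched as below. The text only covers two of the four possible combinations in each rule, so this sketch returns `None` for the combinations it leaves open; that return value and the function names are assumptions.

```python
from typing import Optional, Sequence


def judge_with_success_rates(
    rate: float,
    success_rates: Sequence[float],
    rate_threshold: float,
    first_success_threshold: float,
) -> Optional[bool]:
    """True = balanced, False = unbalanced, None = combination not covered by the text."""
    all_above = all(s > first_success_threshold for s in success_rates)
    any_at_or_below = any(s <= first_success_threshold for s in success_rates)
    if rate <= rate_threshold and all_above:
        return True
    if rate > rate_threshold and any_at_or_below:
        return False
    return None


def judge_with_average(
    rate: float,
    success_rates: Sequence[float],
    rate_threshold: float,
    second_success_threshold: float,
) -> Optional[bool]:
    """Same judgment, using the average success rate of all crawler machines."""
    average = sum(success_rates) / len(success_rates)
    if rate <= rate_threshold and average > second_success_threshold:
        return True
    if rate > rate_threshold and average <= second_success_threshold:
        return False
    return None
```

A caller would need its own policy for the `None` case, for example falling back to the equilibrium rate alone.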
- the embodiment of the present disclosure also provides a test device for the web crawler system.
- the testing device of this exemplary web crawler system may include a task acquisition module 410, a time recording module 420, and a judgment module 430, wherein:
- the task acquisition module 410 is configured to: when the test request signal is received, acquire the crawler task from the system task database, and send the crawler task to the crawler task distributor;
- the time recording module 420 is configured to obtain the total working time of each crawler machine in the network crawler machine cluster when the crawler task distributor distributes tasks to the network crawler machine cluster;
- the judgment module 430 is configured to obtain a judgment result of whether the workload of the crawler machines in the network crawler machine cluster is balanced according to the total working time of each crawler machine.
- the judgment module 430 further includes a sorting unit 431, a first calculation unit 432, a second calculation unit 433, and a judgment unit 434, wherein:
- the sorting unit 431 is used to sort the total working hours of each crawler machine in ascending order to obtain a time sequence
- the first calculation unit 432 is configured to subtract the first total work time from the last total work time in the work time series based on the obtained work time series to obtain the time difference;
- the second calculation unit 433 is configured to divide the time difference by the first total working time in the time series to obtain the equilibrium rate of the workload of the crawler machines in the network crawler machine cluster;
- the judging unit 434 is configured to judge whether the workload of the crawler machines in the web crawler machine cluster is balanced based on the equilibrium rate.
- although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory.
- the features and functions of two or more modules or units described above may be embodied in one module or unit.
- the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
- the exemplary embodiments described herein can be implemented by software, or by combining software with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or mobile hard disk) or on the network, and which includes several instructions to make a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) execute the method according to the embodiments of the present disclosure.
- the apparatus may be implemented as an electronic device that includes a memory and a processor. A computer program is stored in the memory, and when the computer program is executed by the processor, it causes the processor to execute any one of the above-mentioned method embodiments, or causes the electronic device to realize the functions of the constituent units/modules of the above-mentioned apparatus embodiments.
- the processor described in the above embodiments may refer to a single processing unit, such as a central processing unit CPU, or it may be a distributed processor system including multiple dispersed processing units.
- the memory described in the above embodiments may include one or more memories, which may be internal memories of the computing device, such as transient or non-transitory memories, or may be external storage devices connected to the computing device through a memory interface.
- the electronic device 500 according to this embodiment of the present application will be described below with reference to FIG. 5.
- the electronic device 500 shown in FIG. 5 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present application.
- the electronic device 500 is represented in the form of a general-purpose computing device.
- the components of the electronic device 500 may include, but are not limited to: the aforementioned at least one processing unit 510, the aforementioned at least one storage unit 520, and a bus 530 connecting different system components (including the storage unit 520 and the processing unit 510).
- the storage unit stores program code, and the program code can be executed by the processing unit 510, so that the processing unit 510 executes the various exemplary methods described in the “exemplary method” section of this specification.
- for example, the processing unit 510 may perform the steps shown in FIG. 1: step S110, when the test request signal is received, the crawler task is acquired from the system task database and sent to the crawler task distributor; step S120, when the crawler task distributor distributes tasks to the network crawler machine cluster, the total working time of each crawler machine in the network crawler machine cluster is obtained; step S130, according to the total working time of each crawler machine, the judgment result of whether the workload of the crawler machines in the network crawler machine cluster is balanced is obtained.
- the storage unit 520 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 5201 and/or a cache storage unit 5202, and may further include a read-only storage unit (ROM) 5203.
- the storage unit 520 may also include a program/utility tool 5204 having a set of (at least one) program module 5205.
- the program module 5205 includes but is not limited to: an operating system, one or more application programs, other program modules, and program data. Each of these examples, or some combination of them, may include the implementation of a network environment.
- the bus 530 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of multiple bus structures.
- the electronic device 500 may also communicate with one or more external devices 700 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any device (such as a router, modem, etc.) that enables the electronic device 500 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 550.
- the electronic device 500 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 560.
- the network adapter 560 communicates with other modules of the electronic device 500 through the bus 530.
- other hardware and/or software modules can be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
- the exemplary embodiments described herein can be implemented by software, or by combining software with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or mobile hard disk) or on the network, and which includes several instructions to make a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiments of the present disclosure.
- each aspect of the present application can also be implemented in the form of a program product, which includes program code.
- when the program product runs on a terminal device, the program code is used to make the terminal device execute the steps according to various exemplary embodiments of the present application described in the above-mentioned "Exemplary Method" section of this specification.
- a program product 600 for implementing the above method according to an embodiment of the present application is described. It can adopt a portable compact disk read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer.
- the program product of this application is not limited to this.
- the readable storage medium can be any tangible medium that contains or stores a program.
- the program can be used by or in combination with an instruction execution system, apparatus, or device.
- the program product can use any combination of one or more readable media.
- the readable medium may be a readable signal medium or a readable storage medium.
- the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the above. More specific examples (a non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
- the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
- the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
- the program code used to perform the operations of this application can be written in any combination of one or more programming languages.
- the programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
- the program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
- the remote computing device can be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computing device (for example, via the Internet using an Internet service provider).
- when a test request signal is received, the crawler task is acquired from the system task database, and the acquired crawler task is sent to the crawler task distributor for distribution.
- when the crawler task distributor distributes tasks to the crawler machines in the network crawler machine cluster, the total working time of each crawler machine up to the completion of all crawler tasks is obtained, and according to the total working time of each crawler machine, the judgment result of whether the workload of the crawler machines in the web crawler machine cluster is balanced is obtained.
- the test process is simple and easy to implement. If the workload is balanced, the resources of the web crawler system are fully utilized and the efficiency is high. If it is unbalanced, the resources of the web crawler system are not fully utilized and the efficiency is low.
- the user can choose whether to debug the web crawler system according to the judgment result, which improves the user's test efficiency on the web crawler system.
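The bookkeeping behind "the total working time of each crawler machine" can be sketched in Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: the record format (a machine identifier paired with the working time of one completed task, in seconds) and the name `total_working_times` are hypothetical and do not come from the source.

```python
from collections import defaultdict


def total_working_times(task_records):
    """Sum per-task working times into one total per crawler machine.

    task_records: iterable of (machine_id, seconds) pairs, one pair per
    completed crawler task (hypothetical record format).
    """
    totals = defaultdict(float)
    for machine_id, seconds in task_records:
        totals[machine_id] += seconds
    return dict(totals)


# Example with made-up timings for two crawler machines.
records = [("crawler-1", 12.0), ("crawler-2", 9.5), ("crawler-1", 8.0)]
totals = total_working_times(records)
```

The totals would only be meaningful once the distributor has handed out every task and all tasks have finished, which is the condition the description gives for computing them.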
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Quality & Reliability (AREA)
- Debugging And Monitoring (AREA)
Claims (20)
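- A testing method for a web crawler system, characterized by comprising: when a test request signal is received, acquiring crawler tasks from a system task database and sending the crawler tasks to a crawler task distributor; when the crawler task distributor distributes tasks to a web crawler machine cluster, acquiring the total working time of each crawler machine in the web crawler machine cluster; and obtaining, according to the total working time of each crawler machine, a judgment result of whether the workload of the crawler machines in the web crawler machine cluster is balanced.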
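- The testing method for a web crawler system according to claim 1, characterized in that acquiring the total working time of each crawler machine in the web crawler machine cluster comprises: when each crawler machine receives a crawler task distributed by the crawler task distributor, recording the working time required by the crawler machine to complete the crawler task; and when all tasks in the crawler task distributor have been distributed and all crawler tasks have been completed, calculating the total working time of each crawler machine based on the working time required by each crawler machine to complete each crawler task.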
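- The testing method for a web crawler system according to claim 2, characterized in that recording the working time required by the crawler machine to complete the crawler task comprises: when the crawler machine receives the crawler task, starting timing when the crawler machine begins its first crawl; and stopping timing after the crawler machine has completed a predetermined number of crawls for the crawler task, so as to obtain the working time required by the crawler machine to complete the crawler task, and storing the working time in association with the crawler machine.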
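- The testing method for a web crawler system according to claim 1, characterized in that obtaining, according to the total working time of each crawler machine, the judgment result of whether the workload of the crawler machines in the web crawler machine cluster is balanced comprises: sorting the total working times of the crawler machines in ascending order to obtain a working time sequence; based on the obtained working time sequence, subtracting the first total working time in the working time sequence from the last total working time to obtain a time difference; dividing the time difference by the first total working time in the working time sequence to obtain a balance rate of the workload of the crawler machines in the web crawler machine cluster; and judging, based on the balance rate, whether the workload of the crawler machines in the web crawler machine cluster is balanced.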
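- The testing method for a web crawler system according to claim 4, characterized in that judging, based on the balance rate, whether the workload of the crawler machines in the web crawler machine cluster is balanced comprises: when the balance rate is less than or equal to a predetermined threshold, determining that the workload of the crawler machines in the web crawler machine cluster is balanced; and when the balance rate is greater than the predetermined threshold, determining that the workload of the crawler machines in the web crawler machine cluster is unbalanced.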
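- The testing method for a web crawler system according to claim 1, characterized in that, before acquiring crawler tasks from the system task database, the method further comprises: acquiring a plurality of uniform resource locators; sending the plurality of uniform resource locators to the web crawler machine cluster, crawling each uniform resource locator by the crawler machines in the web crawler machine cluster, and recording the crawl results; and when the number of crawl results reaches a predetermined number, storing all crawl results as crawler tasks in the system task database.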
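- The testing method for a web crawler system according to any one of claims 1-6, characterized in that the method further comprises: acquiring the task crawling success rate of each crawler machine in the web crawler machine cluster; and obtaining, according to the total working time of each crawler machine, the judgment result of whether the workload of the crawler machines in the web crawler machine cluster is balanced comprises: obtaining, according to the total working time and the task crawling success rate of each crawler machine, the judgment result of whether the workload of the crawler machines in the web crawler machine cluster is balanced.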
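- A testing apparatus for a web crawler system, characterized by comprising: a task acquisition module, configured to, when a test request signal is received, acquire crawler tasks from a system task database and send the crawler tasks to a crawler task distributor; a time recording module, configured to, when the crawler task distributor distributes tasks to a web crawler machine cluster, acquire the total working time of each crawler machine in the web crawler machine cluster; and a judgment module, configured to obtain, according to the total working time of each crawler machine, a judgment result of whether the workload of the crawler machines in the web crawler machine cluster is balanced.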
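- The testing apparatus for a web crawler system according to claim 8, characterized in that, when acquiring the total working time of each crawler machine in the web crawler machine cluster, the time recording module is specifically configured to: when each crawler machine receives a crawler task distributed by the crawler task distributor, record the working time required by the crawler machine to complete the crawler task; and when all tasks in the crawler task distributor have been distributed and all crawler tasks have been completed, calculate the total working time of each crawler machine based on the working time required by each crawler machine to complete each crawler task.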
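- The testing apparatus for a web crawler system according to claim 9, characterized in that, when recording the working time required by the crawler machine to complete the crawler task, the time recording module is specifically configured to: when the crawler machine receives the crawler task, start timing when the crawler machine begins its first crawl; and stop timing after the crawler machine has completed a predetermined number of crawls for the crawler task, so as to obtain the working time required by the crawler machine to complete the crawler task, and store the working time in association with the crawler machine.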
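- The testing apparatus for a web crawler system according to claim 8, characterized in that the judgment module comprises: a sorting unit, configured to sort the total working times of the crawler machines in ascending order to obtain a working time sequence; a first calculation unit, configured to, based on the obtained working time sequence, subtract the first total working time in the working time sequence from the last total working time to obtain a time difference; a second calculation unit, configured to divide the time difference by the first total working time in the working time sequence to obtain a balance rate of the workload of the crawler machines in the web crawler machine cluster; and a judgment unit, configured to judge, based on the balance rate, whether the workload of the crawler machines in the web crawler machine cluster is balanced.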
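- The testing apparatus for a web crawler system according to claim 11, characterized in that the judgment unit is specifically configured to: when the balance rate is less than or equal to a predetermined threshold, determine that the workload of the crawler machines in the web crawler machine cluster is balanced; and when the balance rate is greater than the predetermined threshold, determine that the workload of the crawler machines in the web crawler machine cluster is unbalanced.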
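- The testing apparatus for a web crawler system according to claim 8, characterized in that the task acquisition module is further configured to, before acquiring crawler tasks from the system task database: acquire a plurality of uniform resource locators; send the plurality of uniform resource locators to the web crawler machine cluster, so that the crawler machines in the web crawler machine cluster crawl each uniform resource locator, and record the crawl results; and when the number of crawl results reaches a predetermined number, store all crawl results as crawler tasks in the system task database.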
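- The testing apparatus for a web crawler system according to any one of claims 8-13, characterized in that the task acquisition module is further configured to acquire the task crawling success rate of each crawler machine in the web crawler machine cluster; and the judgment module is specifically configured to obtain, according to the total working time and the task crawling success rate of each crawler machine, the judgment result of whether the workload of the crawler machines in the web crawler machine cluster is balanced.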
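- A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the testing method for a web crawler system according to any one of claims 1-7.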
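- An electronic device, characterized by comprising: a processor; and a memory storing a computer program; wherein the processor is configured to implement the following steps by executing the computer program: when a test request signal is received, acquiring crawler tasks from a system task database and sending the crawler tasks to a crawler task distributor; when the crawler task distributor distributes tasks to a web crawler machine cluster, acquiring the total working time of each crawler machine in the web crawler machine cluster; and obtaining, according to the total working time of each crawler machine, a judgment result of whether the workload of the crawler machines in the web crawler machine cluster is balanced.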
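- The electronic device according to claim 16, characterized in that, when acquiring the total working time of each crawler machine in the web crawler machine cluster, the processor specifically performs the following steps: when each crawler machine receives a crawler task distributed by the crawler task distributor, recording the working time required by the crawler machine to complete the crawler task; and when all tasks in the crawler task distributor have been distributed and all crawler tasks have been completed, calculating the total working time of each crawler machine based on the working time required by each crawler machine to complete each crawler task.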
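- The electronic device according to claim 17, characterized in that, when recording the working time required by the crawler machine to complete the crawler task, the processor specifically performs the following steps: when the crawler machine receives the crawler task, starting timing when the crawler machine begins its first crawl; and stopping timing after the crawler machine has completed a predetermined number of crawls for the crawler task, so as to obtain the working time required by the crawler machine to complete the crawler task, and storing the working time in association with the crawler machine.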
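- The electronic device according to claim 16, characterized in that, when obtaining, according to the total working time of each crawler machine, the judgment result of whether the workload of the crawler machines in the web crawler machine cluster is balanced, the processor specifically performs the following steps: sorting the total working times of the crawler machines in ascending order to obtain a working time sequence; based on the obtained working time sequence, subtracting the first total working time in the working time sequence from the last total working time to obtain a time difference; dividing the time difference by the first total working time in the working time sequence to obtain a balance rate of the workload of the crawler machines in the web crawler machine cluster; and judging, based on the balance rate, whether the workload of the crawler machines in the web crawler machine cluster is balanced.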
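- The electronic device according to claim 19, characterized in that, when judging, based on the balance rate, whether the workload of the crawler machines in the web crawler machine cluster is balanced, the processor specifically performs the following steps: when the balance rate is less than or equal to a predetermined threshold, determining that the workload of the crawler machines in the web crawler machine cluster is balanced; and when the balance rate is greater than the predetermined threshold, determining that the workload of the crawler machines in the web crawler machine cluster is unbalanced.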
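The balance judgment recited in the claims (sort the total working times in ascending order, subtract the first from the last, divide the difference by the first, then compare with a threshold) can be sketched in Python as follows. This is an illustrative sketch only: the function names, the guard against a non-positive shortest time, and the threshold of 0.3 used in the example are assumptions, since the source does not give a concrete threshold value.

```python
def balance_rate(total_times):
    """Return (longest - shortest) / shortest for per-machine total working times."""
    if not total_times:
        raise ValueError("need at least one crawler machine")
    ordered = sorted(total_times)  # ascending working time sequence
    shortest, longest = ordered[0], ordered[-1]
    if shortest <= 0:
        # Assumed guard: the ratio is undefined for a zero or negative divisor.
        raise ValueError("shortest total working time must be positive")
    return (longest - shortest) / shortest


def is_balanced(total_times, threshold):
    """Balanced when the balance rate is less than or equal to the threshold."""
    return balance_rate(total_times) <= threshold


# Made-up totals in seconds; 0.3 is a hypothetical threshold.
rate = balance_rate([110.0, 100.0, 125.0])  # (125 - 100) / 100 = 0.25
balanced = is_balanced([110.0, 100.0, 125.0], threshold=0.3)
```

A smaller rate means the slowest and fastest machines finished closer together; a rate of 0 means every machine accumulated the same total working time.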
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910444805.8 | 2019-05-24 | ||
CN201910444805.8A CN110333980A (zh) | 2019-05-24 | 2019-05-24 | Testing method and apparatus for web crawler system, storage medium, and electronic device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020238131A1 (zh) | 2020-12-03 |
Family
ID=68140378
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/123059 WO2020238131A1 (zh) | 2019-05-24 | 2019-12-04 | Testing method and apparatus for web crawler system, storage medium, and electronic device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110333980A (zh) |
WO (1) | WO2020238131A1 (zh) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
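CN110333980A (zh) * | 2019-05-24 | 2019-10-15 | 深圳壹账通智能科技有限公司 | Testing method and apparatus for web crawler system, storage medium, and electronic device |
CN115328812B (zh) * | 2022-10-11 | 2023-02-28 | 深圳华锐分布式技术股份有限公司 | Web-crawler-based UI testing method, apparatus, device, and medium |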
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130246377A1 (en) * | 2008-08-12 | 2013-09-19 | Jitendra B. Gaitonde | Configuration management for a capture/registration system |
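CN106202108A (zh) * | 2015-05-06 | 2016-12-07 | 阿里巴巴集团控股有限公司 | Web crawler crawling task allocation method and apparatus, and data crawling method and apparatus |
CN107071009A (zh) * | 2017-03-28 | 2017-08-18 | 江苏飞搏软件股份有限公司 | Load-balanced distributed big data crawler system |
CN107562541A (zh) * | 2017-09-05 | 2018-01-09 | 广东科杰通信息科技有限公司 | Load-balanced distributed crawler method and crawler system |
CN108205541A (zh) * | 2016-12-16 | 2018-06-26 | 北大方正集团有限公司 | Scheduling method and apparatus for distributed web crawler tasks |
CN110333980A (zh) * | 2019-05-24 | 2019-10-15 | 深圳壹账通智能科技有限公司 | Testing method and apparatus for web crawler system, storage medium, and electronic device |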
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040225644A1 (en) * | 2003-05-09 | 2004-11-11 | International Business Machines Corporation | Method and apparatus for search engine World Wide Web crawling |
CN106648445B (zh) * | 2015-10-30 | 2020-07-03 | 北京国双科技有限公司 | Data storage method and apparatus for web crawlers |
CN107203623B (zh) * | 2017-05-26 | 2020-09-22 | 山东省科学院情报研究所 | Load balancing adjustment method for web crawler system |
2019
- 2019-05-24 CN CN201910444805.8A patent/CN110333980A/zh active Pending
- 2019-12-04 WO PCT/CN2019/123059 patent/WO2020238131A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN110333980A (zh) | 2019-10-15 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19931003 Country of ref document: EP Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19931003 Country of ref document: EP Kind code of ref document: A1 |
| | 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 18/03/2022) |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19931003 Country of ref document: EP Kind code of ref document: A1 |