CN106776934B - Mobile terminal and implementation method of web crawler - Google Patents

Publication number
CN106776934B
Authority
CN
China
Prior art keywords
node
seed
current
balanced tree
child node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611092280.9A
Other languages
Chinese (zh)
Other versions
CN106776934A (en)
Inventor
张琪
郭凤阁
张淑燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nubia Technology Co Ltd
Original Assignee
Nubia Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nubia Technology Co Ltd
Priority to CN201611092280.9A
Publication of CN106776934A
Application granted
Publication of CN106776934B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Telephone Function (AREA)

Abstract

The embodiments of the invention disclose a mobile terminal and a web crawler implementation method. The mobile terminal comprises: a determining unit, configured to determine a balanced tree corresponding to a predetermined seed node sequence; and a grabbing unit, configured to grab, according to the balanced tree, all seed nodes in the seed node sequence and all nodes generated by those seed nodes.

Description

Mobile terminal and implementation method of web crawler
Technical Field
The invention relates to a computer network technology, in particular to a mobile terminal and a web crawler implementation method.
Background
With the explosive growth of internet information, search engines play an increasingly important role. In search engine technology, web crawlers are important components. The web crawler can automatically capture page information according to a certain rule. The basic steps of web crawler work include: putting a URL (Uniform Resource Locator) to be grabbed into a queue to be grabbed; taking out a URL from a queue to be captured; capturing related page information from the website pointed by the URL; storing the captured page information into a page library; and putting the URL which is already grabbed into the grabbed URL queue. In the process of capturing the webpage information, according to the capturing strategy of the webpage, new URLs are continuously extracted from the current webpage and placed into a queue until certain stopping conditions are met. And then storing the captured webpage information in a server of a search engine, so that the search speed of a user can be increased.
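The basic steps listed above can be sketched as a short loop. This is only an illustrative sketch of a generic crawler, not the method claimed by the invention: the `FAKE_WEB` table, the `fetch` helper, and the `max_pages` stop condition are all hypothetical stand-ins for a real download step and a real stopping rule.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to (page text, outgoing links).
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", ["http://example.com/"]),
}


def fetch(url):
    """Stand-in for a real HTTP download; returns (content, links)."""
    return FAKE_WEB.get(url, ("", []))


def crawl(seed_urls, max_pages=100):
    to_grab = deque(seed_urls)   # queue of URLs to be grabbed
    grabbed = []                 # queue of URLs already grabbed
    seen = set(seed_urls)        # every URL that has ever been enqueued
    page_library = {}            # captured page information
    # Assumed stop condition: the queue is empty or max_pages is reached.
    while to_grab and len(grabbed) < max_pages:
        url = to_grab.popleft()          # take one URL out of the queue
        content, links = fetch(url)      # capture the page it points to
        page_library[url] = content      # store it in the page library
        grabbed.append(url)              # move the URL to the grabbed queue
        for link in links:               # extract new URLs from the page
            if link not in seen:
                seen.add(link)
                to_grab.append(link)
    return grabbed, page_library
```

Calling `crawl(["http://example.com/"])` visits the three hypothetical pages once each, because the `seen` set prevents a URL from being enqueued twice.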
When the web crawler captures information, a configuration file needs to be customized according to the information to be obtained. The configuration file contains the entry link of the information to be crawled, defines the area of the page where that information is located, and indicates how page turning is to be followed, so that each specific item of information can be accurately acquired from a webpage. That is, the configuration file defines the capturing flow of the web crawler and the result to be obtained.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
In existing web crawler implementation methods, content is downloaded at the same time as crawling proceeds, cooperative crawling by multiple people is not supported, and the execution efficiency is low.
Disclosure of Invention
The invention mainly aims to provide a mobile terminal and a method for realizing a web crawler, which can support simultaneous crawling by multiple people and can improve the execution efficiency.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
an embodiment of the present invention provides a mobile terminal, including: a determining unit and a grabbing unit;
the determining unit is used for determining a balanced tree corresponding to the seed node sequence according to the predetermined seed node sequence;
and the grabbing unit is used for grabbing all the seed nodes in the seed node sequence and all the nodes generated by all the seed nodes according to the balanced tree.
In the above embodiment, the mobile terminal further includes: a conversion unit and a storage unit;
the conversion unit is used for converting the balanced tree into a binary tree;
and the storage unit is used for storing the binary tree in a local file.
In the above embodiment, the determining unit includes: a selection subunit and an addition subunit;
the selecting subunit is configured to select one seed node in the seed node sequence as a current seed node when the seed node sequence is not empty;
and the adding subunit is configured to add all the nodes generated by the current seed node to the balanced tree.
In the foregoing embodiment, the adding subunit is specifically configured to use the current seed node as a parent node, and when the child nodes pointed to by the parent node are not empty, select one child node from all the child nodes as a current child node; judge whether the current child node is in the balanced tree; and when the current child node is not in the balanced tree, add the current child node to the balanced tree, take the current child node as the current parent node, and return to execute the above operations.
In the above embodiment, the adding subunit is further configured to add the current child node to the directed cyclic graph corresponding to the sequence of seed nodes when the current child node is not in the balanced tree; and when the current child node is in the balanced tree, adding 1 to the number of times of collision of the current child node, and adding the current child node to a directed cyclic graph corresponding to the seed node sequence.
The embodiment of the invention also provides a method for realizing the web crawler, which comprises the following steps:
determining a balanced tree corresponding to the seed node sequence according to the predetermined seed node sequence;
and capturing all seed nodes in the seed node sequence and all nodes generated by all the seed nodes according to the balanced tree.
In the above embodiment, the method further comprises:
converting the balanced tree into a binary tree;
the binary tree is saved in a local file.
In the above embodiment, the determining a balanced tree corresponding to a seed node sequence according to a predetermined seed node sequence includes:
when the seed node sequence is not empty, selecting a seed node in the seed node sequence as a current seed node;
adding all nodes generated by the current seed node to the balanced tree.
In the above embodiment, the adding all nodes generated by the current seed node to the balanced tree includes:
taking the current seed node as a father node, and selecting one child node from all child nodes as a current child node when the child node pointed by the father node is not empty;
judging whether the current child node is in the balanced tree or not;
and when the current child node is not in the balanced tree, adding the current child node into the balanced tree, taking the current child node as the current parent node, and returning to execute the operation.
In the above embodiment, the method further comprises:
when the current child node is not in the balanced tree, adding the current child node to a directed cyclic graph corresponding to the sequence of seed nodes;
and when the current child node is in the balanced tree, adding 1 to the number of times of collision of the current child node, and adding the current child node to a directed cyclic graph corresponding to the seed node sequence.
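The node-adding steps above, including the collision count and the directed graph, can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: a sorted list searched with `bisect` stands in for the balanced tree, the names `SortedKeySet`, `add_generated_nodes`, and `children_of` are hypothetical, and inserting the seed itself into the tree is an assumption the text does not spell out.

```python
import bisect
from collections import defaultdict


class SortedKeySet:
    """Stand-in for the balanced tree (assumption): a sorted list.

    Lookup is O(log n) via bisect; insertion is O(n), unlike a real
    balanced tree, which would also insert in O(log n).
    """

    def __init__(self):
        self._keys = []

    def __contains__(self, key):
        i = bisect.bisect_left(self._keys, key)
        return i < len(self._keys) and self._keys[i] == key

    def add(self, key):
        bisect.insort(self._keys, key)

    def keys(self):
        return list(self._keys)


def add_generated_nodes(seed, children_of, tree, graph, collisions):
    """Add every node reachable from `seed` to `tree`.

    children_of: hypothetical callable returning the child nodes of a node.
    graph:       dict mapping parent -> list of children (directed graph).
    collisions:  dict mapping node -> number of times it was seen again.
    """
    if seed in tree:          # assumption: a known seed is not expanded again
        return
    tree.add(seed)            # assumption: the seed itself is stored too
    stack = [seed]            # explicit stack replaces "return and repeat"
    while stack:
        parent = stack.pop()
        for child in children_of(parent):
            graph[parent].append(child)   # record the edge in both cases
            if child in tree:
                collisions[child] += 1    # node already known: count a collision
            else:
                tree.add(child)
                stack.append(child)       # child becomes a parent later
```

With `graph = defaultdict(list)` and `collisions = defaultdict(int)`, a link structure in which two pages point at the same third page leaves that page in the tree once and raises its collision count for each repeated sighting.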
According to the mobile terminal and the web crawler implementation method provided by the embodiments of the invention, a balanced tree corresponding to a predetermined seed node sequence is first determined; all seed nodes in the seed node sequence and all nodes generated by those seed nodes are then grabbed according to the balanced tree. The prior art mostly adopts a directed cyclic graph and grabs pages depth-first, breadth-first, or with a combination of the two. Compared with the prior art, the mobile terminal and the web crawler implementation method provided by the embodiments of the invention can support simultaneous crawling by multiple people and can improve the execution efficiency; moreover, the technical solution of the embodiments of the invention is simple to implement, easy to popularize, and widely applicable.
Drawings
Fig. 1 is a schematic diagram of a hardware structure of a mobile terminal according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a communication system in which a mobile terminal according to an embodiment of the present invention can operate;
Fig. 3 is a schematic structural diagram of a mobile terminal according to a first embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a mobile terminal according to a second embodiment of the present invention;
Fig. 5 is a schematic diagram of an implementation flow of a web crawler implementation method according to an embodiment of the present invention;
Fig. 6 is a schematic flow chart of an implementation method for determining a balanced tree corresponding to a seed node sequence according to an embodiment of the present invention;
Fig. 7 is a schematic flow chart of an implementation method for adding all nodes generated by a current seed node to a balanced tree in the embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
It should be understood that the embodiments described herein are only for explaining the technical solutions of the present invention, and are not intended to limit the scope of the present invention.
A mobile terminal implementing various embodiments of the present invention will now be described with reference to the accompanying drawings. In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only for facilitating the explanation of the present invention, and have no specific meaning in themselves. Thus, "module" and "component" may be used in a mixture.
The mobile terminal may be implemented in various forms. For example, the terminal described in the present invention may include a mobile terminal such as a mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a Personal Digital Assistant (PDA), a tablet computer (PAD), a Portable Multimedia Player (PMP), a navigation device, and the like.
Fig. 1 is a schematic hardware structure of a mobile terminal 100 for implementing various embodiments of the present invention, and as shown in fig. 1, the mobile terminal 100 may include: a wireless communication unit 110, a user input unit 120, a sensing unit 130, an output unit 140, a memory 150, an interface unit 160, a controller 170, and a power supply unit 180, etc. Fig. 1 illustrates the mobile terminal 100 having various components, but it is to be understood that not all illustrated components are required to be implemented. More or fewer components may alternatively be implemented. The elements of the mobile terminal 100 will be described in detail below.
The wireless communication unit 110 typically includes one or more components that allow radio communication between the mobile terminal 100 and a wireless communication system or network. For example, the wireless communication unit 110 may include: at least one of a mobile communication module 111, a wireless internet module 112, and a short-range communication module 113.
The mobile communication module 111 transmits and/or receives radio signals to and/or from at least one of a base station (e.g., access point, node B, etc.), an external terminal, and a server. Such radio signals may include voice call signals, video call signals, or various types of data transmitted and/or received according to text and/or multimedia messages.
The wireless internet module 112 supports wireless internet access of the mobile terminal 100. The wireless internet module 112 may be internally or externally coupled to the terminal. The wireless internet access technology referred to by the wireless internet module 112 may include Wireless Local Area Network (WLAN), Wireless Fidelity (Wi-Fi), Wireless Broadband (WiBro), Worldwide Interoperability for Microwave Access (WiMAX), High Speed Downlink Packet Access (HSDPA), and the like.
The short-range communication module 113 is a module for supporting short-range communication. Some examples of short-range communication technologies include Bluetooth™, Radio Frequency Identification (RFID), Infrared Data Association (IrDA), Ultra Wideband (UWB), ZigBee™, and the like.
The user input unit 120 may generate key input data to control various operations of the mobile terminal 100 according to a command input by a user. The user input unit 120 allows a user to input various types of information, and may include a keyboard, dome sheet, touch pad (e.g., a touch-sensitive member that detects a change in resistance, pressure, capacitance, etc. due to being touched), a jog wheel, a jog stick, etc. In particular, when the touch pad is superimposed on the display unit 141 in the form of a layer, a touch screen may be formed.
The sensing unit 130 detects a current state of the mobile terminal 100 (e.g., an open or closed state of the mobile terminal 100), a position of the mobile terminal 100, presence or absence of contact (i.e., touch input) by a user with the mobile terminal 100, an orientation of the mobile terminal 100, acceleration or deceleration movement and direction of the mobile terminal 100, and the like, and generates a command or signal for controlling an operation of the mobile terminal 100. For example, when the mobile terminal 100 is implemented as a slide-type mobile phone, the sensing unit 130 may sense whether the slide-type phone is opened or closed. In addition, the sensing unit 130 can detect whether the power supply unit 180 supplies power or whether the interface unit 160 is coupled with an external device.
The interface unit 160 serves as an interface through which at least one external device is connected to the mobile terminal 100. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port (a typical example is a Universal Serial Bus (USB) port), a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like.
The interface unit 160 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the mobile terminal 100 or may be used to transmit data between the mobile terminal 100 and the external device.
In addition, when the mobile terminal 100 is connected with an external cradle, the interface unit 160 may serve as a path through which power is supplied from the cradle to the mobile terminal 100 or may serve as a path through which various command signals input from the cradle are transmitted to the mobile terminal 100. Various command signals or power input from the cradle may be used as signals for recognizing whether the mobile terminal 100 is accurately mounted on the cradle.
The output unit 140 is configured to provide output signals (e.g., audio signals, video signals, alarm signals, vibration signals, etc.) in a visual, audio, and/or tactile manner. The output unit 140 may include a display unit 141, an audio output module 142, and the like.
The display unit 141 may display information processed in the mobile terminal 100. For example, when the mobile terminal 100 is in a phone call mode, the display unit 141 may display a User Interface (UI) or a Graphical User Interface (GUI) related to a call or other communication (e.g., text messaging, multimedia file downloading, etc.). When the mobile terminal 100 is in a video call mode or an image capturing mode, the display unit 141 may display a captured image and/or a received image, a UI or GUI showing a video or an image and related functions, or the like.
Meanwhile, when the display unit 141 and the touch pad are stacked on each other in the form of layers to form a touch screen, the display unit 141 may function as an input device and an output device. The display unit 141 may include at least one of a Liquid Crystal Display (LCD), a thin film transistor LCD (TFT-LCD), an Organic Light Emitting Diode (OLED) display, a flexible display, a three-dimensional (3D) display, and the like. Some of these displays may be configured to be transparent to allow a user to view from the outside, which may be referred to as transparent displays, and a typical transparent display may be, for example, a TOLED (transparent organic light emitting diode) display or the like. Depending on the particular desired implementation, mobile terminal 100 may include two or more display units (or other display devices), for example, mobile terminal 100 may include an external display unit (not shown) and an internal display unit (not shown). The touch screen may be used to detect a touch input pressure as well as a touch input position and a touch input area.
The audio output module 142 may convert audio data received by the wireless communication unit 110 or stored in the memory 150 into an audio signal and output as sound when the mobile terminal 100 is in a call signal reception mode, a call mode, a recording mode, a voice recognition mode, a broadcast reception mode, or the like. Also, the audio output module 142 may provide audio output related to a specific function performed by the mobile terminal 100 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output module 142 may include a speaker, a buzzer, and the like.
The memory 150 may store software programs or the like for processing and controlling operations performed by the controller 170, or may temporarily store data (e.g., a phonebook, messages, still images, videos, etc.) that has been output or is to be output. Also, the memory 150 may store data regarding various ways of vibration and audio signals output when a touch is applied to the touch screen.
The memory 150 may include at least one type of storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. Also, the mobile terminal 100 may cooperate with a network storage device that performs a storage function of the memory 150 through a network connection.
The controller 170 generally controls the overall operation of the mobile terminal 100. For example, the controller 170 performs control and processing related to voice calls, data communications, video calls, and the like. In addition, the controller 170 may include a multimedia module 171 for reproducing or playing back multimedia data, and the multimedia module 171 may be constructed within the controller 170 or may be constructed separately from the controller 170. The controller 170 may perform a pattern recognition process to recognize a handwriting input or a picture drawing input performed on the touch screen as a character or an image.
The power supply unit 180 receives external power or internal power and provides appropriate power required to operate the respective elements and components under the control of the controller 170.
The various embodiments described herein may be implemented in a computer-readable medium using, for example, computer software, hardware, or any combination thereof. For a hardware implementation, the embodiments described herein may be implemented using at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a processor, a controller, a microcontroller, a microprocessor, an electronic unit designed to perform the functions described herein, and in some cases, such embodiments may be implemented in the controller 170. For a software implementation, the implementation such as a process or a function may be implemented with a separate software module that allows performing at least one function or operation. The software codes may be implemented by software applications (or programs) written in any suitable programming language, which may be stored in memory 150 and executed by controller 170.
Up to this point, the mobile terminal 100 has been described in terms of its functions. Hereinafter, the slide-type mobile terminal 100 among various types of mobile terminals 100, such as the folder-type, bar-type, swing-type, slide-type mobile terminal 100, etc., will be described as an example for the sake of brevity. Accordingly, the present invention can be applied to any type of mobile terminal 100, and is not limited to the slide type mobile terminal 100.
The mobile terminal 100 as shown in fig. 1 may be configured to operate with communication systems such as wired and wireless communication systems and satellite-based communication systems that transmit data via frames or packets.
A communication system in which the mobile terminal 100 according to the present invention is capable of operating will now be described with reference to fig. 2.
Such communication systems may use different air interfaces and/or physical layers. For example, the air interface used by the communication system includes, for example, Frequency Division Multiple Access (FDMA), Time Division Multiple Access (TDMA), Code Division Multiple Access (CDMA), and Universal Mobile Telecommunications System (UMTS) (in particular, Long Term Evolution (LTE)), global system for mobile communications (GSM), and the like. By way of non-limiting example, the following description relates to a CDMA communication system, but such teachings are equally applicable to other types of systems.
Referring to Fig. 2, the CDMA wireless communication system may include a plurality of mobile terminals 100, a plurality of Base Stations (BSs) 270, Base Station Controllers (BSCs) 275, and a Mobile Switching Center (MSC) 280. The MSC 280 is configured to interface with a Public Switched Telephone Network (PSTN) 290. The MSC 280 is also configured to interface with a BSC 275, which may be coupled to the base station 270 via a backhaul. The backhaul may be constructed according to any of several known interfaces including, for example, E1/T1, ATM, IP, PPP, Frame Relay, HDSL, ADSL, or xDSL. It will be understood that a system as shown in Fig. 2 may include multiple BSCs 275.
Each BS 270 may serve one or more sectors (or regions), each sector being covered by an omnidirectional antenna or by an antenna pointing in a particular direction radially away from the BS 270. Alternatively, each sector may be covered by two or more antennas for diversity reception. Each BS 270 may be configured to support multiple frequency allocations, with each frequency allocation having a particular frequency spectrum (e.g., 1.25MHz, 5MHz, etc.).
The intersection of a sector and a frequency allocation may be referred to as a CDMA channel. The BS 270 may also be referred to as a Base Transceiver Subsystem (BTS) or other equivalent terminology. In such a case, the term "base station" may be used to generically refer to a single BSC 275 and at least one BS 270. The base stations may also be referred to as "cells". Alternatively, each sector of a particular BS 270 may be referred to as a cell site.
As a typical operation of the wireless communication system, the BS270 receives reverse link signals from various mobile terminals 100. The mobile terminal 100 is generally engaged in conversations, messaging, and other types of communications. Each reverse link signal received by a particular BS270 is processed within the particular BS 270. The obtained data is forwarded to the associated BSC 275. The BSC provides call resource allocation and mobility management functions including coordination of soft handoff procedures between BSs 270. The BSCs 275 also route the received data to the MSC280, which provides additional routing services for interfacing with the PSTN 290. Similarly, the PSTN290 interfaces with the MSC280, the MSC interfaces with the BSCs 275, and the BSCs 275 accordingly control the BS270 to transmit forward link signals to the mobile terminal 100.
The mobile communication module 111 of the wireless communication unit 110 in the mobile terminal accesses the mobile communication network based on the necessary data (including the user identification information and the authentication information) of the mobile communication network (such as a 2G/3G/4G mobile communication network) built into the mobile terminal, so as to transmit mobile communication data (including uplink mobile communication data and downlink mobile communication data) for services such as web browsing and network multimedia playing used by the mobile terminal user.
The wireless internet module 112 of the wireless communication unit 110 implements a wireless hotspot by running the related protocol functions of the wireless hotspot. The wireless hotspot supports access by a plurality of other mobile terminals and, by multiplexing the mobile communication connection between the mobile communication module 111 and the mobile communication network, transmits mobile communication data (including uplink mobile communication data and downlink mobile communication data) for services such as web browsing and network multimedia playing used by the users of those terminals. Because the mobile terminal essentially multiplexes its own mobile communication connection to the communication network to transmit this data, the traffic consumed is charged to the communication tariff of the mobile terminal by a charging entity on the communication network side, thereby consuming the data traffic included in the communication tariff contracted for use by the mobile terminal.
Based on the above hardware structure of the mobile terminal 100 and the communication system, various embodiments of the method of the present invention are proposed.
Embodiment One
Fig. 3 is a schematic structural diagram of a mobile terminal according to the first embodiment of the present invention. As shown in Fig. 3, the mobile terminal includes: a determining unit 301 and a grabbing unit 302;
the determining unit 301 is configured to determine a balanced tree corresponding to the seed node sequence according to the predetermined seed node sequence.
Generally, a web server has a lot of Uniform Resource Locators (URLs), and the relationships between URLs are also complicated, and in order to clearly obtain and represent the relationships between URLs, a tree structure of URLs can be established.
In a specific embodiment of the present invention, a seed node sequence may be predetermined, and the balanced tree corresponding to the predetermined seed node sequence is then determined. For example, assuming that the predetermined seed node sequence is { A1, A2, A3, A4}, in this step, a balanced tree corresponding to the seed node sequence { A1, A2, A3, A4} may be determined.
In a specific embodiment of the present invention, when the seed node sequence is not empty, the determining unit 301 may select one seed node in the seed node sequence as a current seed node; all nodes generated by the current seed node are then added to the balanced tree. Specifically, when determining the balanced tree corresponding to the seed node sequence, the determining unit 301 may first determine whether the seed node sequence is empty, and when the seed node sequence is not empty, select a seed node in the seed node sequence as the current seed node; the determination unit 301 may then add all the nodes generated by the current seed node to the balanced tree. When the seed node sequence is empty, the determining unit 301 may end the process of determining the balanced tree corresponding to the seed node sequence.
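The outer loop over the seed node sequence described above can be sketched as follows. This is a simplified illustration under stated assumptions: a plain Python `set` stands in for the balanced tree (only membership tests and insertion are needed here), and `determine_tree` and `children_of` are hypothetical names, not part of the patent.

```python
from collections import deque


def determine_tree(seed_sequence, children_of):
    """Process seeds one at a time until the seed sequence is empty.

    children_of: hypothetical callable returning the nodes a node links to.
    Returns the set of all seeds and all nodes generated by them.
    """
    pending = deque(seed_sequence)   # copy, so the caller's list is untouched
    tree = set()                     # ASSUMPTION: set stands in for the balanced tree
    while pending:                   # "the seed node sequence is not empty"
        current_seed = pending.popleft()   # select the current seed node
        tree.add(current_seed)             # assumption: seeds are stored too
        frontier = [current_seed]
        while frontier:                    # add all nodes generated by the seed
            parent = frontier.pop()
            for child in children_of(parent):
                if child not in tree:
                    tree.add(child)
                    frontier.append(child)
    return tree                      # sequence exhausted: determination ends
```

For the seed sequence { A1, A2, A3, A4} from the example, the function would be called as `determine_tree(["A1", "A2", "A3", "A4"], children_of)` with whatever link-extraction callable the crawler provides.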
A grabbing unit 302, configured to grab all seed nodes in the seed node sequence and all nodes generated by all seed nodes according to the balanced tree.
In an embodiment of the present invention, after the determining unit 301 determines the balanced tree corresponding to the seed node sequence, the grabbing unit 302 may grab all the seed nodes in the seed node sequence and all the nodes generated by all the seed nodes according to the balanced tree. Specifically, the grabbing unit 302 may traverse the balanced tree using a known balanced-tree traversal method, and in doing so grab all seed nodes in the seed node sequence and all nodes generated by all seed nodes.
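Grabbing by traversal can be illustrated with an in-order walk over a binary search tree of URLs. The text does not say which traversal is used, so in-order is an assumption here; the `Node` class, `in_order`, `grab_all`, and the `fetch` callback are likewise hypothetical names introduced only for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, Optional


@dataclass
class Node:
    """One tree node holding a URL (hypothetical layout)."""
    url: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def in_order(root: Optional[Node]) -> Iterator[str]:
    """Yield every URL in sorted order, using an explicit stack."""
    stack = []
    node = root
    while stack or node is not None:
        while node is not None:      # descend as far left as possible
            stack.append(node)
            node = node.left
        node = stack.pop()
        yield node.url               # visit the node
        node = node.right            # then walk its right subtree


def grab_all(root: Optional[Node], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Call `fetch` once for each URL stored in the tree."""
    return {url: fetch(url) for url in in_order(root)}
```

Because every node is stored in the tree exactly once, each URL is fetched exactly once, however many pages link to it.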
Preferably, in an embodiment of the present invention, the mobile terminal further includes a conversion unit 303 and a storage unit 304.
The conversion unit 303 is configured to convert the balanced tree into a binary tree.
The storage unit 304 is configured to save the binary tree in a local file.
In an embodiment of the present invention, the conversion unit 303 may further convert the balanced tree into a binary tree; the storage unit 304 then saves the binary tree in a local file.
The mobile terminal provided by the embodiment of the invention first determines a balanced tree corresponding to a predetermined seed node sequence, and then grabs all seed nodes in the seed node sequence and all nodes generated by all the seed nodes according to the balanced tree. The prior art mostly adopts a directed cyclic graph and crawls it depth-first, breadth-first, or with a combination of the two. Compared with the prior art, the mobile terminal provided by the embodiment of the invention can support simultaneous crawling by multiple persons and can improve execution efficiency; moreover, the technical solution of the embodiment of the invention is simple to implement, easy to popularize, and has a wider range of application.
Example two
Fig. 4 is a schematic structural diagram of a mobile terminal according to a second embodiment of the present invention. As shown in fig. 4, the determining unit 301 includes a selection subunit 3011 and an adding subunit 3012.
The selection subunit 3011 is configured to select one seed node in the seed node sequence as the current seed node when the seed node sequence is not empty.
In a specific embodiment of the present invention, when determining the balanced tree corresponding to the seed node sequence, the selection subunit 3011 may first determine whether the seed node sequence is empty; when the seed node sequence is not empty, the selection subunit 3011 may select one seed node in the seed node sequence as the current seed node; when the seed node sequence is empty, the selection subunit 3011 may end the process of determining the balanced tree corresponding to the seed node sequence.
The adding subunit 3012 is configured to add all the nodes generated by the current seed node to the balanced tree.
In a specific embodiment of the present invention, after the selection subunit 3011 selects the current seed node in the seed node sequence, the adding subunit 3012 may add all the nodes generated by the current seed node to the balanced tree.
In a specific embodiment of the present invention, the adding subunit 3012 is specifically configured to: take the current seed node as the parent node; when the child nodes pointed to by the parent node are not empty, select one child node from all the child nodes as the current child node; determine whether the current child node is in the balanced tree; and, when the current child node is not in the balanced tree, add the current child node to the balanced tree, take the current child node as the current parent node, and return to the child-selection operation. Specifically, the adding subunit 3012 may add the current child node to the balanced tree by using a balanced tree generation method in the prior art.
In an embodiment of the present invention, the adding subunit 3012 is further configured to: when the current child node is not in the balanced tree, add the current child node to the directed cyclic graph corresponding to the seed node sequence; and when the current child node is in the balanced tree, add 1 to the number of collisions of the current child node and add the current child node to the directed cyclic graph corresponding to the seed node sequence.
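The text leaves the "balanced tree generation method" to the prior art and does not name a specific one. As a non-authoritative illustration, the following Python sketch uses an AVL tree, one well-known self-balancing binary search tree, keyed by URL string. The choice of AVL and the names `Node`, `insert`, and `contains` are assumptions made for this example and do not come from the original disclosure.

```python
class Node:
    """One URL stored in the (assumed) AVL tree."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1


def _height(node):
    return node.height if node else 0


def _update(node):
    node.height = 1 + max(_height(node.left), _height(node.right))


def _rotate_right(y):
    x = y.left
    y.left = x.right
    x.right = y
    _update(y)
    _update(x)
    return x


def _rotate_left(x):
    y = x.right
    x.right = y.left
    y.left = x
    _update(x)
    _update(y)
    return y


def insert(root, key):
    """Insert key; return (new_root, inserted). inserted is False for a duplicate."""
    if root is None:
        return Node(key), True
    if key == root.key:
        return root, False
    if key < root.key:
        root.left, inserted = insert(root.left, key)
    else:
        root.right, inserted = insert(root.right, key)
    if not inserted:
        return root, False
    _update(root)
    balance = _height(root.left) - _height(root.right)
    if balance > 1:                      # left subtree too tall
        if key > root.left.key:          # left-right case
            root.left = _rotate_left(root.left)
        return _rotate_right(root), True
    if balance < -1:                     # right subtree too tall
        if key < root.right.key:         # right-left case
            root.right = _rotate_right(root.right)
        return _rotate_left(root), True
    return root, True


def contains(root, key):
    """Membership test, as needed when judging whether a child node is in the tree."""
    while root:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False
```

In such a sketch, the `inserted` flag returned by `insert` could double as the "is the current child node already in the balanced tree" check, so that a single descent both tests and adds the node.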
The mobile terminal provided by the embodiment of the invention first determines a balanced tree corresponding to a predetermined seed node sequence, and then grabs all seed nodes in the seed node sequence and all nodes generated by all the seed nodes according to the balanced tree. The prior art mostly adopts a directed cyclic graph and crawls it depth-first, breadth-first, or with a combination of the two. Compared with the prior art, the mobile terminal provided by the embodiment of the invention can support simultaneous crawling by multiple persons and can improve execution efficiency; moreover, the technical solution of the embodiment of the invention is simple to implement, easy to popularize, and has a wider range of application.
Example three
Fig. 5 is a schematic flow chart illustrating an implementation method of a web crawler according to an embodiment of the present invention. As shown in fig. 5, the method includes:
Step 501, determining a balanced tree corresponding to the seed node sequence according to the predetermined seed node sequence.
In a specific embodiment of the present invention, a seed node sequence may be predetermined, and the balanced tree corresponding to that predetermined seed node sequence is then determined. For example, assuming that the predetermined seed node sequence is {A1, A2, A3, A4}, the balanced tree corresponding to the seed node sequence {A1, A2, A3, A4} may be determined in this step.
Fig. 6 is a flowchart illustrating an implementation method for determining a balanced tree corresponding to a seed node sequence in the embodiment of the present invention. As shown in fig. 6, the method includes:
Step 601, when the seed node sequence is not empty, selecting a seed node in the seed node sequence as the current seed node.
In a specific embodiment of the present invention, when determining the balanced tree corresponding to the seed node sequence, it may first be determined whether the seed node sequence is empty; when the seed node sequence is not empty, one seed node is selected from the seed node sequence as the current seed node; when the seed node sequence is empty, the flow of determining the balanced tree corresponding to the seed node sequence ends.
Step 602, adding all nodes generated by the current seed node into the balanced tree.
In a specific embodiment of the present invention, after the current seed node is selected in the sequence of seed nodes, all nodes generated by the current seed node are added to the balanced tree.
As described above, the balanced tree corresponding to the seed node sequence can be determined through steps 601 to 602, so that all seed nodes in the seed node sequence and all nodes generated by all seed nodes can be captured according to the balanced tree.
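Steps 601 to 602 can be illustrated with a minimal Python sketch. It is not part of the original disclosure: the function name `build_balanced_tree`, the `expand` callback that returns the nodes generated by a seed, and the use of a plain `set` as a stand-in for the balanced tree are all assumptions made for illustration, as is storing the seed itself alongside the nodes it generates.

```python
from collections import deque


def build_balanced_tree(seed_sequence, expand):
    """Consume the seed node sequence until it is empty (steps 601-602).

    seed_sequence: iterable of seed URLs, e.g. ["A1", "A2", "A3", "A4"].
    expand: hypothetical callback returning the nodes generated by a seed.
    """
    pending = deque(seed_sequence)
    tree = set()  # stand-in for the balanced tree (assumption)
    while pending:                     # step 601: the sequence is not empty
        current_seed = pending.popleft()
        tree.add(current_seed)         # assumption: the seed is stored as well
        tree.update(expand(current_seed))  # step 602: add generated nodes
    return tree                        # empty sequence: the flow ends
```

A call such as `build_balanced_tree(["A1", "A2"], expand)` would process `A1` first and `A2` second; the order in which seeds are selected is not fixed by the text, so first-in-first-out is only one possible choice.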
Fig. 7 is a schematic flow chart of an implementation method for adding all nodes generated by a current seed node to a balanced tree in the embodiment of the present invention. As shown in fig. 7, the method of adding all nodes generated by the current seed node to the balanced tree may include the steps of:
Step 701, taking the current seed node as the parent node.
In a specific embodiment of the present invention, when all nodes generated by the current seed node are to be added to the balanced tree, the current seed node may be taken as the parent node after it is selected from the seed node sequence.
Step 702, when the child nodes pointed to by the parent node are not empty, selecting one child node from all the child nodes as the current child node.
In a specific embodiment of the present invention, when the child nodes pointed to by the parent node are not empty, one child node may be selected from all the child nodes as the current child node; when the child nodes pointed to by the parent node are empty, the process of adding all the nodes generated by the current seed node to the balanced tree may be ended.
Step 703, judging whether the current child node is in the balanced tree.
In an embodiment of the present invention, after the current child node is selected from all the child nodes, it may be determined whether the current child node is in the balanced tree. When the current child node is not in the balanced tree, step 704 may be executed; when the current child node is in the balanced tree, step 705 may be executed.
Step 704, adding the current child node into the balanced tree, taking the current child node as the current parent node, and returning to execute step 702.
In particular embodiments of the present invention, the current child node may be added to the balanced tree when the current child node is not in the balanced tree. Specifically, the current child node may be added to the balanced tree by using a balanced tree generation method in the prior art. The current child node is then taken as the current parent node and the process returns to step 702.
Preferably, in an embodiment of the present invention, when the current child node is not in the balanced tree, the current child node may be further added to the directed cyclic graph corresponding to the seed node sequence.
Step 705, taking the current child node as the current parent node, and returning to execute step 702.
In a specific embodiment of the present invention, when the current child node is in the balanced tree, 1 may be added to the number of collisions of the current child node, the current child node may be added to the directed cyclic graph corresponding to the seed node sequence, and the process returns to step 702.
As described above, all the nodes generated by the current seed node can be added to the balanced tree through steps 701 to 705, so that all the seed nodes in the seed node sequence and all the nodes generated by all the seed nodes can be captured according to the balanced tree.
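Steps 701 to 705 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the method of the disclosure: `links` is a hypothetical mapping from a URL to the URLs it points to, a `set` stands in for the balanced tree, the directed cyclic graph is recorded as a list of (parent, child) edges, and an explicit stack is used so that every child of every parent is eventually examined. Unlike the literal wording of step 705, the sketch does not expand a child that is already in the tree, so that it terminates when links form a cycle; that choice is also an assumption.

```python
def add_generated_nodes(seed, links, tree, collisions, graph_edges):
    """Add every node reachable from seed to the tree (steps 701-705).

    links: hypothetical dict mapping a URL to the list of URLs it points to.
    tree: set standing in for the balanced tree (assumption).
    collisions: dict counting how often an already-known node was met again.
    graph_edges: list of (parent, child) edges of the directed cyclic graph.
    """
    stack = [seed]                                # step 701: seed is the parent
    while stack:
        parent = stack.pop()
        for child in links.get(parent, []):       # step 702: pick a child
            graph_edges.append((parent, child))   # recorded in both branches
            if child not in tree:                 # step 703: membership test
                tree.add(child)                   # step 704: add to the tree
                stack.append(child)               # child becomes a later parent
            else:                                 # step 705: collision
                collisions[child] = collisions.get(child, 0) + 1
```

With links such as A to B and C, and B to C and A, starting from a tree that already holds A, the nodes B and C would each be added once, while the repeated encounters of C and A would be counted as collisions.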
Step 502, capturing all seed nodes in the seed node sequence and all nodes generated by all the seed nodes according to the balanced tree.
In a specific embodiment of the present invention, after determining the balanced tree corresponding to the seed node sequence, all seed nodes in the seed node sequence and all nodes generated by all seed nodes may be captured according to the balanced tree. Specifically, a traversal method of a balanced tree in the prior art may be adopted, and all seed nodes in the seed node sequence and all nodes generated by all seed nodes are captured by traversing the balanced tree.
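The traversal itself is left to the prior art. One common choice for a binary search tree is an in-order traversal, sketched below in Python. The representation of a tree node as a `(left, url, right)` tuple and the `fetch` callback that downloads a page are hypothetical and serve only to keep the example self-contained.

```python
def in_order(node):
    """Yield the URLs of a tree given as nested (left, url, right) tuples."""
    if node is None:
        return
    left, url, right = node
    yield from in_order(left)
    yield url
    yield from in_order(right)


def grab_all(root, fetch):
    """Capture every node by traversing the tree; fetch is a hypothetical
    callback that takes a URL and returns the captured content."""
    return {url: fetch(url) for url in in_order(root)}
```

Because each URL appears exactly once in the tree, a traversal of this kind visits every node once, regardless of how many links pointed to it in the original graph.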
Preferably, in the embodiment of the present invention, the balanced tree can be further converted into a binary tree; the binary tree is then saved in a local file. Specifically, a binary tree generation method in the prior art can be adopted to convert the balanced tree into a binary tree.
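The text does not specify the file format used to save the binary tree. As a hedged sketch only, the following Python code assumes the binary tree is held as nested `(left, url, right)` tuples and stores it as JSON; the function names and the JSON layout are assumptions and are not taken from the disclosure.

```python
import json


def to_dict(node):
    """Turn a (left, url, right) tuple tree into JSON-friendly dicts."""
    if node is None:
        return None
    left, url, right = node
    return {"url": url, "left": to_dict(left), "right": to_dict(right)}


def from_dict(data):
    """Inverse of to_dict."""
    if data is None:
        return None
    return (from_dict(data["left"]), data["url"], from_dict(data["right"]))


def save_tree(root, path):
    """Save the binary tree in a local file (format is an assumption)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(to_dict(root), fh)


def load_tree(path):
    """Read back a tree written by save_tree."""
    with open(path, encoding="utf-8") as fh:
        return from_dict(json.load(fh))
```

Saving the tree this way would let a later crawl reload the set of known nodes with `load_tree` instead of rebuilding it from the seed node sequence.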
Specifically, in a specific embodiment of the present invention, the pseudo code of the storage scheme of the directed cyclic graph is as follows:
[Figures BDA0001166503520000151 and BDA0001166503520000161: pseudo code images, not reproduced in this text]
The implementation method of the web crawler provided by the embodiment of the invention first determines a balanced tree corresponding to a predetermined seed node sequence, and then captures all seed nodes in the seed node sequence and all nodes generated by all the seed nodes according to the balanced tree. The prior art mostly adopts a directed cyclic graph and crawls it depth-first, breadth-first, or with a combination of the two. Compared with the prior art, the implementation method of the web crawler provided by the embodiment of the invention can support simultaneous crawling by multiple persons and can improve execution efficiency; moreover, the technical solution of the embodiment of the invention is simple to implement, easy to popularize, and has a wider range of application.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method described in the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (6)

1. A mobile terminal supporting simultaneous crawling by multiple persons, the mobile terminal comprising: a selection unit, an adding unit, a conversion unit, a storage unit and a grabbing unit;
the selection unit is used for selecting one seed node from the seed node sequence as the current seed node when the seed node sequence is not empty;
the adding unit is used for adding all the nodes generated by the current seed node into a balanced tree;
the conversion unit is used for converting the balanced tree into a binary tree;
the storage unit is used for storing the binary tree in a local file;
and the grabbing unit is used for grabbing all the seed nodes in the seed node sequence and all the nodes generated by all the seed nodes according to the balanced tree.
2. The mobile terminal according to claim 1, wherein the adding unit is specifically configured to use the current seed node as a parent node, and when a child node pointed by the parent node is not empty, select one child node from all child nodes as the current child node; judging whether the current child node is in the balanced tree or not; and when the current child node is not in the balanced tree, adding the current child node into the balanced tree, taking the current child node as the current parent node, and returning to execute the operation.
3. The mobile terminal according to claim 2, wherein the adding unit is further configured to add the current child node to the directed cyclic graph corresponding to the sequence of seed nodes when the current child node is not in the balanced tree; and when the current child node is in the balanced tree, adding 1 to the number of times of collision of the current child node, and adding the current child node to a directed cyclic graph corresponding to the seed node sequence.
4. A web crawler implementation method supporting simultaneous crawling by multiple persons, the method comprising the following steps:
when the seed node sequence is not empty, selecting a seed node from the seed node sequence as a current seed node;
adding all nodes generated by the current seed node into a balanced tree;
converting the balanced tree into a binary tree;
storing the binary tree in a local file;
and capturing all seed nodes in the seed node sequence and all nodes generated by all the seed nodes according to the balanced tree.
5. The method of claim 4, wherein adding all nodes generated by the current seed node to the balanced tree comprises:
taking the current seed node as a parent node, and selecting one child node from all child nodes as a current child node when the child node pointed to by the parent node is not empty;
judging whether the current child node is in the balanced tree or not;
and when the current child node is not in the balanced tree, adding the current child node into the balanced tree, taking the current child node as the current parent node, and returning to execute the operation.
6. The method of claim 5, further comprising:
when the current child node is not in the balanced tree, adding the current child node to a directed cyclic graph corresponding to the sequence of seed nodes;
and when the current child node is in the balanced tree, adding 1 to the number of times of collision of the current child node, and adding the current child node to a directed cyclic graph corresponding to the seed node sequence.
CN201611092280.9A 2016-11-30 2016-11-30 Mobile terminal and implementation method of web crawler Active CN106776934B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611092280.9A CN106776934B (en) 2016-11-30 2016-11-30 Mobile terminal and implementation method of web crawler

Publications (2)

Publication Number Publication Date
CN106776934A CN106776934A (en) 2017-05-31
CN106776934B true CN106776934B (en) 2021-03-26

Family

ID=58915656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611092280.9A Active CN106776934B (en) 2016-11-30 2016-11-30 Mobile terminal and implementation method of web crawler

Country Status (1)

Country Link
CN (1) CN106776934B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102480524A (en) * 2010-11-26 2012-05-30 中国科学院声学研究所 Web page crawler cooperating method
CN103310012A (en) * 2013-07-02 2013-09-18 北京航空航天大学 Distributed web crawler system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060190561A1 (en) * 2002-06-19 2006-08-24 Watchfire Corporation Method and system for obtaining script related information for website crawling
CN100520778C (en) * 2006-07-25 2009-07-29 腾讯科技(深圳)有限公司 Internet topics file searching method, reptile system and search engine
CN104376063B (en) * 2014-11-11 2019-02-19 南京邮电大学 Multi-threaded network crawler method and information real-time update system based on Classification Management
CN106156104A (en) * 2015-04-02 2016-11-23 北京奇虎科技有限公司 Crawl the method and device of corporate intranet information
CN105868258A (en) * 2015-12-28 2016-08-17 乐视网信息技术(北京)股份有限公司 Crawler system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
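Research and Analysis of Cooperation Strategies for Multi-Agent Topic Crawlers (多Agent主题爬虫协作策略的研究与分析); Du Yajun et al.; Journal of Xihua University; 2013-01-31; Vol. 32, No. 1; pp. 31-38 *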

Also Published As

Publication number Publication date
CN106776934A (en) 2017-05-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant