CN105930482A - Method and apparatus for matching keyword with network data - Google Patents

Method and apparatus for matching keyword with network data

Info

Publication number
CN105930482A
Authority
CN
China
Prior art keywords
data
network data
matching
thread
queue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610282294.0A
Other languages
Chinese (zh)
Inventor
张旭华
刘硕
邹易兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd
Priority to CN201610282294.0A
Publication of CN105930482A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present disclosure relates to a method and an apparatus for matching a keyword with network data. The method comprises: adding network data obtained by a web crawler to a data queue; acquiring a matching keyword set by a user; and matching the network data in the data queue against the matching keyword by means of at least one matching thread. When the data crawled by the web crawler needs to be matched against different matching keywords, developers only need to input a new matching keyword and do not need to change the code of the web crawler, which makes the matching keyword easier to adjust and lowers development costs.

Description

Method and apparatus for matching keywords with network data
Technical field
The present disclosure relates to the field of network technology, and in particular to a method and an apparatus for matching keywords with network data.
Background
The Internet holds massive amounts of data. Many enterprise users need to use web crawlers to crawl network data and extract the data they want from it.
In the related art, when developing a web crawler, a developer adds matching code to the crawler's code. After crawling network data, the web crawler runs this matching code to match the crawled data against the matching keywords contained in the matching code, so as to extract the required data. When the developer wants to extract different data from the crawled data, the developer can modify the matching keywords in the matching code.
Summary
Embodiments of the present disclosure provide a method and an apparatus for matching keywords with network data. The technical solutions are as follows:
According to a first aspect of the embodiments of the present disclosure, a method for matching keywords with network data is provided. The method includes:
adding network data crawled by a web crawler to a data queue;
obtaining matching keywords set by a user;
matching the network data in the data queue against the matching keywords by means of at least one matching thread.
Optionally, the method further includes:
obtaining the data volume of the network data to be matched in the data queue;
determining a target thread count according to the data volume of the network data;
adjusting the number of the at least one matching thread to the target thread count by creating or closing threads.
Optionally, determining the target thread count according to the data volume of the network data includes:
when the data volume of the network data is not greater than a first data volume threshold, determining that the target thread count is a first thread count;
when the data volume of the network data is not less than a second data volume threshold, determining that the target thread count is a second thread count;
when the data volume of the network data is between the first data volume threshold and the second data volume threshold, calculating the target thread count according to the data volume of the network data.
Optionally, matching the network data in the data queue against the matching keywords by means of at least one matching thread includes:
loading the matching keywords into a specified location in memory;
sending an indication message to the at least one matching thread, the indication message instructing the at least one matching thread to read the matching keywords from the specified location.
Optionally, the method further includes:
before adding the network data crawled by the web crawler to the data queue, receiving the network data sent by a crawler server.
Optionally, obtaining the matching keywords set by the user includes:
receiving the matching keywords sent by the crawler server, the matching keywords being keywords set by the user in the crawler server.
According to a second aspect of the embodiments of the present disclosure, an apparatus for matching keywords with network data is provided. The apparatus includes:
an adding module, configured to add network data crawled by a web crawler to a data queue;
a first obtaining module, configured to obtain matching keywords set by a user;
a matching module, configured to match the network data in the data queue against the matching keywords by means of at least one matching thread.
Optionally, the apparatus further includes:
a second obtaining module, configured to obtain the data volume of the network data to be matched in the data queue;
a determining module, configured to determine a target thread count according to the data volume of the network data;
an adjusting module, configured to adjust the number of the at least one matching thread to the target thread count by creating or closing threads.
Optionally, the determining module includes:
a first determining submodule, configured to determine that the target thread count is a first thread count when the data volume of the network data is not greater than a first data volume threshold;
a second determining submodule, configured to determine that the target thread count is a second thread count when the data volume of the network data is not less than a second data volume threshold;
a calculating submodule, configured to calculate the target thread count according to the data volume of the network data when the data volume of the network data is between the first data volume threshold and the second data volume threshold.
Optionally, the matching module includes:
a loading submodule, configured to load the matching keywords into a specified location in memory;
a sending submodule, configured to send an indication message to the at least one matching thread, the indication message instructing the at least one matching thread to read the matching keywords from the specified location.
Optionally, the apparatus further includes:
a receiving module, configured to receive the network data sent by a crawler server before the network data crawled by the web crawler is added to the data queue.
Optionally, the first obtaining module is configured to receive the matching keywords sent by the crawler server, the matching keywords being keywords set by the user in the crawler server.
According to a third aspect of the embodiments of the present disclosure, an apparatus for matching keywords with network data is provided. The apparatus includes:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
add network data crawled by a web crawler to a data queue;
obtain matching keywords set by a user;
match the network data in the data queue against the matching keywords by means of at least one matching thread.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects:
Network data crawled by a web crawler is added to a data queue, the matching keywords set by a user are obtained, and the network data in the data queue is matched against the matching keywords by at least one matching thread. When the data crawled by the web crawler needs to be matched against different matching keywords, the developer only has to enter the new matching keywords and does not need to change the code of the web crawler, which makes the matching keywords easier to adjust and lowers development cost.
It should be understood that the foregoing general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings are incorporated into and constitute a part of this specification. They show embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the present disclosure.
Fig. 1 is a schematic diagram of the implementation environment involved in the method for matching keywords with network data according to the embodiments of the present disclosure;
Fig. 2 is a flow chart of a method for matching keywords with network data according to an exemplary embodiment;
Fig. 3 is a flow chart of a method for matching keywords with network data according to another exemplary embodiment;
Fig. 4 is a flow chart of a method for matching keywords with network data according to yet another exemplary embodiment;
Fig. 5 is a block diagram of an apparatus for matching keywords with network data according to an exemplary embodiment;
Fig. 6 is a block diagram of an apparatus for matching keywords with network data according to another exemplary embodiment;
Fig. 7 is a block diagram of an apparatus according to an exemplary embodiment.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
Fig. 1 is a schematic diagram of the implementation environment involved in the method for matching keywords with network data according to the present disclosure. The implementation environment may include a crawler server 110 and a queue server 120.
The crawler server 110 may be a server that runs a crawler program in a network. The crawler program executes using the hardware resources of the crawler server 110. It may start crawling network data from an initial URL (Uniform Resource Locator) address and then crawl the new URLs found in that network data, until there are no new URLs left to crawl or a predetermined depth has been reached. The crawler server 110 then passes the obtained network data to the queue server 120.
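The crawl loop described above (start at an initial URL, follow newly discovered URLs, stop when none remain or a predetermined depth is reached) can be sketched as a breadth-first traversal. This is only an illustration: `crawl`, `fetch_links`, and the in-memory `LINKS` graph are hypothetical names standing in for real page fetching, which the patent does not specify.

```python
from collections import deque


def crawl(seed_url, fetch_links, max_depth):
    """Breadth-first crawl from seed_url.

    Stops when no new URLs remain or when max_depth is reached.
    fetch_links(url) is a hypothetical callback returning the URLs found on a page.
    """
    seen = {seed_url}
    frontier = deque([(seed_url, 0)])
    visited = []
    while frontier:
        url, depth = frontier.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # predetermined depth reached: do not follow links further
        for link in fetch_links(url):
            if link not in seen:  # only new URLs are queued
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited


# Hypothetical link graph used instead of real network access.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c", "http://example.com/"],
    "http://example.com/b": [],
    "http://example.com/c": ["http://example.com/d"],
    "http://example.com/d": [],
}

pages = crawl("http://example.com/", lambda u: LINKS.get(u, []), max_depth=2)
```

With `max_depth=2`, the page at depth 3 is never visited, and the back-link to the seed is ignored because it has already been seen.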
The queue server 120 is connected to the crawler server 110 over a wired or wireless network. One crawler server may correspond to several queue servers 120. The queue server processes the data and instructions transmitted by the crawler server 110.
The crawler server 110 and the queue server 120 may be of different application levels and architectures, or of the same level and architecture. In terms of application level, the crawler server 110 and the queue server 120 may each be an entry-level server, a workgroup server, a department-level server, or an enterprise-level server. In terms of architecture, each may be a CISC (Complex Instruction Set Computing) architecture server, a RISC (Reduced Instruction Set Computing) architecture server, or a VLIW (Very Long Instruction Word) server. In terms of purpose, each may be a general-purpose server or a special-purpose server.
Optionally, the crawler server 110 and the queue server 120 may also be virtual servers.
The technical solutions provided by the embodiments of the present disclosure are introduced and explained below, taking the implementation environment shown in Fig. 1 as an example.
Fig. 2 is a flow chart of a method for matching keywords with network data according to an exemplary embodiment. The method is applied in the queue server 120 in the implementation environment shown in Fig. 1. The method may include the following steps.
In step 201, network data crawled by a web crawler is added to a data queue.
In step 202, the matching keywords set by a user are obtained.
In step 203, the network data in the data queue is matched against the matching keywords by means of at least one matching thread.
In summary, in the method for matching keywords with network data provided by the embodiments of the present disclosure, network data crawled by a web crawler is added to a data queue, the matching keywords set by a user are obtained, and the network data in the data queue is matched against the matching keywords by at least one matching thread. When the data crawled by the web crawler needs to be matched against different matching keywords, the developer only has to enter the new matching keywords and does not need to change the code of the web crawler, which makes the matching keywords easier to adjust and lowers development cost.
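As a rough illustration of steps 201 to 203, the sketch below puts records into a queue and lets a few threads match them against a keyword list. It is a minimal sketch under stated assumptions, not the patented implementation: `run_matching` is a hypothetical name, records are plain strings, and a simple substring test stands in for whatever matching the queue server actually performs.

```python
import queue
import threading


def run_matching(records, keywords, num_threads=2):
    """Queue the records, then match them against keywords with num_threads threads."""
    data_queue = queue.Queue()
    for record in records:          # step 201: add crawled data to the data queue
        data_queue.put(record)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                record = data_queue.get_nowait()
            except queue.Empty:
                return              # queue drained: this matching thread exits
            hits = [k for k in keywords if k in record]   # step 203: match
            if hits:
                with results_lock:
                    results.append((record, hits))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# The keywords are ordinary data (step 202), so changing them needs no code change.
matched = run_matching(
    ["new phone released", "weather report", "phone and tablet sale"],
    ["phone", "tablet"],
)
```

Because the keywords are passed in as data, swapping in a different keyword list leaves both the crawler and the matching code untouched, which is the benefit the paragraph above describes.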
Fig. 3 is a flow chart of a method for matching keywords with network data according to another exemplary embodiment. The method is applied in the queue server 120 in the implementation environment shown in Fig. 1. The method may include the following steps.
In step 301, network data crawled by a web crawler is added to a data queue.
In this step, after the web crawler crawls network data from the network, the queue server adds the network data to the data queue to be matched. The network data in the data queue is matched in the order in which it was added to the queue.
In step 302, the data volume of the network data to be matched in the data queue is obtained.
The data volume of the network data may be the number of pieces of network data; that is, when crawling network data, the web crawler may treat the data crawled from one web page as one piece of data. Alternatively, the data volume of the network data may refer to the amount of storage space the network data occupies in the queue server, such as 96 KB, 320 MB, or 2.4 GB.
In step 303, a target thread count is determined according to the data volume of the network data.
In the method of this embodiment, several matching threads may be set up to match simultaneously so that the network data crawled by the web crawler is processed in time. If there are few matching threads and much network data waiting in the data queue, processing efficiency may suffer; conversely, if there are many matching threads and little network data waiting, processing resources are wasted. Therefore, in the embodiments of the present disclosure, the queue server may adjust the number of matching threads according to the data volume of the pending network data in the data queue, saving the processing resources of the server while maintaining the processing efficiency for the network data in the data queue.
Optionally, the queue server may determine the target thread count as follows:
1) When the data volume of the network data is not greater than the first data volume threshold, the target thread count is determined to be the first thread count.
In case 1), the first thread count is a preset minimum number of matching threads in the queue server. When setting it, the data processing capability of the queue server and possible bursts of network data can be taken into account; the first thread count should not waste the computing capacity of the queue server while still providing a certain matching capacity. For example, the first thread count in a certain queue server is 5. The processing resources occupied by 5 matching threads are only a small fraction of the total processing resources of the queue server and do not significantly affect its handling of other tasks. Meanwhile, when a large amount of network data is added to the data queue within a short time, the 5 matching threads can provide a certain processing capacity in time during that short period, so that the network data does not pile up excessively.
When the network data in the data queue is not greater than the first data volume threshold, the queue server keeps the number of matching threads at the first thread count. For example, when the number of pieces of network data in the data queue is not greater than 1,000, or the data volume is not greater than 500 MB, the number of matching threads is kept at 5.
2) When the data volume of the network data is not less than the second data volume threshold, the target thread count is determined to be the second thread count.
In case 2), the second thread count is a preset maximum number of matching threads in the queue server. When setting it, the total data processing capability and the concurrent processing capability of the queue server need to be considered; the second thread count should not exceed the computing capacity that the queue server reserves for keyword matching. For example, when a certain queue server runs no more than 15 matching threads, other types of task threads in the queue server are not significantly affected, but once it runs more than 15 matching threads, the other types of task threads are significantly affected. In this case, the second thread count can be set to 15, that is, the queue server runs at most 15 matching threads at the same time.
When the network data in the data queue is not less than the second data volume threshold, the queue server keeps the number of matching threads at the second thread count. For example, when the number of pieces of network data in the data queue is not less than 10,000, or the data volume is not less than 5,000 MB, the number of matching threads is kept at 15.
3) When the data volume of the network data is between the first data volume threshold and the second data volume threshold, the target thread count is calculated according to the data volume of the network data.
In case 3), when the data volume of the network data is between the first data volume threshold and the second data volume threshold, the queue server may dynamically adjust the number of matching threads according to the data volume of the network data, in order to keep matching efficient and prevent network data from piling up in the data queue while saving as many processing resources as possible. For example, based on the examples given for cases 1) and 2), when the data volume of the network data is between the first data volume threshold of 500 MB and the second data volume threshold of 5,000 MB, the target thread count is calculated according to the data volume of the network data. In practice, many calculation formulas are possible; for each of them, the input is the data volume of the network data and the output is a thread count. For example, the data volume of the network data may map to the target thread count by interval: when the data volume is in the interval (500 MB, 1,000 MB], the corresponding target thread count is 6; when the data volume is in the interval (1,000 MB, 1,500 MB], the corresponding target thread count is 7; and so on. Alternatively, the number of pieces of network data may map to the target thread count by interval: when the number of pieces is in the interval (1,000, 2,000], the corresponding target thread count is 6; when the number of pieces is in the interval (2,000, 3,000], the corresponding target thread count is 7; and so on.
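The three cases above can be written as one small function. The sketch below uses the example figures from the text (5 threads at or below 500 MB, 15 threads at or above 5,000 MB, one extra thread per 500 MB interval in between); the constant names and the function name `target_thread_count` are illustrative assumptions, and the text itself notes that many other formulas are possible.

```python
import math

# Example figures from the text; the names themselves are hypothetical.
MIN_THREADS = 5       # first thread count
MAX_THREADS = 15      # second thread count
LOW_MB = 500          # first data volume threshold
HIGH_MB = 5000        # second data volume threshold
STEP_MB = 500         # width of each interval between the thresholds


def target_thread_count(volume_mb):
    """Map the pending data volume (in MB) to a thread count."""
    if volume_mb <= LOW_MB:           # case 1: not greater than the first threshold
        return MIN_THREADS
    if volume_mb >= HIGH_MB:          # case 2: not less than the second threshold
        return MAX_THREADS
    # case 3: (500, 1000] -> 6, (1000, 1500] -> 7, and so on
    steps = math.ceil((volume_mb - LOW_MB) / STEP_MB)
    return min(MAX_THREADS, MIN_THREADS + steps)
```

Using `math.ceil` makes each interval closed on the right, which matches the half-open intervals `(500 MB, 1,000 MB]` and `(1,000 MB, 1,500 MB]` given in the example.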
In step 304, the number of currently running matching threads is adjusted to the target thread count by creating or closing threads.
The queue server adjusts the number of currently running matching threads in real time according to the target thread count determined in step 303. For example, if 8 matching threads are currently running and the preceding steps calculate that the currently required target thread count is 7, the queue server may close 1 of the matching threads. Similarly, if 8 matching threads are currently running and the preceding steps calculate that the currently required target thread count is 10, the queue server may create 2 new matching threads.
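One way to adjust the number of running matching threads is a small pool that starts new threads or asks surplus ones to stop. The sketch below is an assumption about how this could look, not the patent's implementation: `MatchingPool`, `resize`, and the per-thread stop event are hypothetical, and `handle` stands in for the actual matching work.

```python
import queue
import threading


class MatchingPool:
    """Keeps a resizable set of worker threads consuming one data queue."""

    def __init__(self, data_queue, handle):
        self._queue = data_queue
        self._handle = handle      # hypothetical callback doing the matching
        self._workers = []         # list of (thread, stop_event) pairs

    def _run(self, stop_event):
        while not stop_event.is_set():
            try:
                item = self._queue.get(timeout=0.05)
            except queue.Empty:
                continue           # nothing to do yet; re-check the stop flag
            self._handle(item)
            self._queue.task_done()

    def resize(self, target):
        """Create or close threads until exactly `target` are running."""
        while len(self._workers) < target:       # too few: create threads
            stop_event = threading.Event()
            thread = threading.Thread(target=self._run, args=(stop_event,), daemon=True)
            thread.start()
            self._workers.append((thread, stop_event))
        while len(self._workers) > target:       # too many: close threads
            thread, stop_event = self._workers.pop()
            stop_event.set()
            thread.join()
        return len(self._workers)
```

A surplus thread finishes the item it is working on before it exits, so resizing down does not drop data from the queue.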
In step 305, the matching keywords set by the user are obtained.
The user may be an operations and maintenance (O&M) engineer of the crawler system, and the matching keywords may be keywords that the O&M engineer sets in a settings interface provided by the crawler system.
In step 306, the network data in the data queue is matched against the matching keywords by means of at least one matching thread.
In this step, the queue server may load the matching keywords into a specified location in memory and send an indication message to the at least one matching thread. The indication message instructs the at least one matching thread to read the matching keywords from the specified location.
For example, after obtaining new matching keywords, the queue server may load them into memory, back them up to disk periodically, and notify all currently running matching threads so that they dynamically load the new matching keywords. The matching threads then take network data from the data queue, match it against the new matching keywords using a multi-pattern matching algorithm, and output the matching result.
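The load-then-notify pattern described above can be sketched with a shared store and a per-thread reload flag. Everything here is an illustrative assumption: `KeywordStore` plays the role of the specified location in memory, `Matcher.notify` plays the role of the indication message, a substring test replaces the multi-pattern matching algorithm, and the periodic backup to disk is omitted.

```python
import threading


class KeywordStore:
    """Holds the current keywords at one shared location in memory."""

    def __init__(self, keywords=()):
        self._lock = threading.Lock()
        self._keywords = frozenset(keywords)

    def read(self):
        with self._lock:
            return self._keywords

    def update(self, keywords, matchers):
        """Load new keywords, then notify every running matcher."""
        with self._lock:
            self._keywords = frozenset(keywords)
        for matcher in matchers:
            matcher.notify()


class Matcher:
    """State owned by one matching thread."""

    def __init__(self, store):
        self._store = store
        self._keywords = store.read()
        self._reload = threading.Event()

    def notify(self):
        # Stand-in for the indication message: ask this matcher to re-read.
        self._reload.set()

    def match(self, record):
        if self._reload.is_set():
            self._reload.clear()
            self._keywords = self._store.read()   # read from the shared location
        return sorted(k for k in self._keywords if k in record)
```

For example, a matcher created with the keyword `phone` reports `["phone"]` for the record `"new phone and tablet"`; after `store.update({"tablet"}, [matcher])` the same call reports `["tablet"]`, without the thread being restarted.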
In summary, in the method for matching keywords with network data provided by the embodiments of the present disclosure, network data crawled by a web crawler is added to a data queue, the matching keywords set by a user are obtained, and the network data in the data queue is matched against the matching keywords by at least one matching thread. When the data crawled by the web crawler needs to be matched against different matching keywords, the developer only has to enter the new matching keywords and does not need to change the code of the web crawler, which makes the matching keywords easier to adjust and lowers development cost.
In addition, the method provided by the embodiments of the present disclosure obtains the data volume of the network data to be matched in the data queue, determines the target thread count according to that data volume, and adjusts the number of matching threads to the target thread count by creating or closing threads. This saves processing resources while keeping matching efficient and preventing network data from piling up in the data queue.
Fig. 4 is a flow chart of a method for matching keywords with network data according to yet another exemplary embodiment. The method is applied in the queue server 120 in the implementation environment shown in Fig. 1. The method may include the following steps.
In step 401, the network data sent by the crawler server is received.
The crawler server may be the crawler server 110 shown in Fig. 1. In this step, after the web crawler in the crawler server crawls network data from the network, the crawler server performs predetermined processing (such as deduplication) on the crawled network data and then sends the network data to the queue server. The network data may be in text format or another network data format.
In step 402, the received network data is added to a data queue.
In step 403, the data volume of the network data to be matched in the data queue is obtained.
In step 404, a target thread count is determined according to the data volume of the network data.
In step 405, the number of currently running matching threads is adjusted to the target thread count by creating or closing threads.
In steps 402 to 405, a storage threshold may be set for the data queue, for example 1,000 pieces of network data. When the number of pieces accumulated in the data queue exceeds 1,000, the queue server automatically adds new matching threads, which load the latest matching keywords and process the current network data in time to prevent data from piling up. An idle thread count may also be set for the matching threads: when the data volume in the data queue is small, the number of threads is reduced appropriately to save system resources. For the implementation, refer to the description of steps 301 to 304 in the embodiment shown in Fig. 3, which is not repeated here.
In step 406, the matching keywords sent by the crawler server are received. The matching keywords are keywords set by the user in the crawler server.
The matching keywords may be keywords that an O&M engineer sets in a settings interface provided by the crawler system. For example, the O&M engineer may enter or set matching keywords through an input interface provided by the crawler server, and the crawler server dynamically stores the matching keywords entered or set by the O&M engineer in a specified set data structure. The queue server provides a web interface to the crawler server. After the O&M engineer finishes reconfiguring the matching keywords and notifies the crawler server, the crawler server may call the web interface provided by the queue server to pass message data containing the new matching keywords (that is, the set data structure storing the matching keywords) to the queue server. Optionally, the message data may also contain an identifier of the crawler server.
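The message data passed from the crawler server to the queue server could, for instance, be a small JSON document. The patent does not define a wire format, so the sketch below is purely an assumption: the field names `keywords` and `crawler_id` and both function names are hypothetical.

```python
import json


def build_keyword_message(keywords, crawler_id=None):
    """Crawler-server side: serialise the keyword set (hypothetical format)."""
    message = {"keywords": sorted(set(keywords))}
    if crawler_id is not None:
        message["crawler_id"] = crawler_id      # optional crawler server identifier
    return json.dumps(message, ensure_ascii=False)


def parse_keyword_message(payload):
    """Queue-server side: recover the keyword set and the optional identifier."""
    message = json.loads(payload)
    return set(message["keywords"]), message.get("crawler_id")


payload = build_keyword_message(["手机", "tablet", "tablet"], crawler_id="crawler-110")
keywords, crawler_id = parse_keyword_message(payload)
```

Sending the keywords as a set removes duplicates before they reach the queue server, and `ensure_ascii=False` keeps non-ASCII keywords readable in the payload.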
Because one crawler server may correspond to multiple queue servers, when the multiple queue servers use the same matching keywords to perform multi-pattern matching on the network data crawled by the crawler server, the O&M engineer only needs to set the matching keywords in the crawler server. Each queue server then obtains the matching keywords from the crawler server automatically, so they do not have to be set separately in each queue server, which further simplifies setting the matching keywords.
In step 407, the matching keywords are loaded into a specified location in memory.
In this step, the queue server loads the matching keywords into a specified location in the memory of the queue server. The matching keywords are stored at the specified location in the form of corresponding character strings; they may also be stored in the cache, ROM, or RAM of the queue server.
In step 408, an indication message is sent to the at least one matching thread. The indication message instructs the at least one matching thread to read the matching keywords from the specified location.
For example, after obtaining new matching keywords, the queue server may load them into memory, back them up to disk periodically, and notify all currently running matching threads so that they dynamically load the new matching keywords. The matching threads then take network data from the data queue, match it against the new matching keywords using a multi-pattern matching algorithm, and output the matching result.
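The text does not say which multi-pattern matching algorithm is used. As one well-known possibility, the sketch below uses the Aho-Corasick automaton, which finds all keywords in a single pass over the text. The choice of algorithm and the names `build_automaton` and `find_keywords` are assumptions made for illustration.

```python
from collections import deque


def build_automaton(keywords):
    """Build an Aho-Corasick automaton as (goto, fail, out) tables."""
    goto = [{}]       # goto[state][char] -> next state
    fail = [0]        # fail[state] -> state to fall back to on a mismatch
    out = [set()]     # out[state] -> keywords that end at this state
    for word in keywords:
        if not word:
            continue  # ignore empty keywords
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first pass to fill in failure links (depth-1 states fail to the root).
    pending = deque(goto[0].values())
    while pending:
        state = pending.popleft()
        for ch, nxt in goto[state].items():
            pending.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]   # inherit keywords ending at the fallback
    return goto, fail, out


def find_keywords(text, automaton):
    """Return the set of keywords that occur anywhere in text."""
    goto, fail, out = automaton
    state = 0
    found = set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found
```

The automaton is built once per keyword update and can then be shared read-only by all matching threads, so the cost of scanning each record does not grow with the number of keywords in the way a keyword-by-keyword search does.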
In summary, in the method for matching keywords with network data provided by the embodiments of the present disclosure, network data crawled by a web crawler is added to a data queue, the matching keywords set by a user are obtained, and the network data in the data queue is matched against the matching keywords by at least one matching thread. When the data crawled by the web crawler needs to be matched against different matching keywords, the developer only has to enter the new matching keywords and does not need to change the code of the web crawler, which makes the matching keywords easier to adjust and lowers development cost.
In addition, the method provided by the embodiments of the present disclosure obtains the data volume of the network data to be matched in the data queue, determines the target thread count according to that data volume, and adjusts the number of matching threads to the target thread count by creating or closing threads. This saves processing resources while keeping matching efficient and preventing network data from piling up in the data queue.
Furthermore, in the method provided by the embodiments of the present disclosure, the queue server receives the network data and the matching keywords sent by the crawler server and performs multi-pattern matching on the network data against the matching keywords by means of at least one matching thread. Data crawling and data matching are thus performed separately, which avoids congestion at the crawling end caused by matching large amounts of network data there and improves the crawling performance of the crawler system.
Fig. 5 is a block diagram of a device for performing keyword matching on network data according to an exemplary embodiment. The device may be implemented as all or part of the queue server 120 by a hardware circuit or by a combination of software and hardware. The device may include an adding module 501, a first acquisition module 502, and a matching module 503.
The adding module 501 is configured to add network data crawled by a web crawler to a data queue.
The first acquisition module 502 is configured to obtain a matching keyword set by a user.
The matching module 503 is configured to match the network data in the data queue against the matching keyword through at least one matching thread.
In summary, in the device for performing keyword matching on network data provided by the embodiments of the present disclosure, network data crawled by a web crawler is added to a data queue, a matching keyword set by a user is obtained, and the network data in the data queue is matched against the matching keyword by at least one matching thread. When the data crawled by the web crawler needs to be matched against different matching keywords, the developer only has to input the new matching keywords and does not need to change the code of the web crawler, which reduces the difficulty of adjusting the matching keywords and lowers development cost.
Fig. 6 is a block diagram of a device for performing keyword matching on network data according to another exemplary embodiment. The device may be implemented as all or part of the queue server 120 by a hardware circuit or by a combination of software and hardware. The device may include an adding module 601, a first acquisition module 602, a matching module 603, a second acquisition module 604, a determining module 605, an adjusting module 606, and a receiving module 607.
The adding module 601 is configured to add network data crawled by a web crawler to a data queue.
The first acquisition module 602 is configured to obtain a matching keyword set by a user.
The matching module 603 is configured to match the network data in the data queue against the matching keyword through at least one matching thread.
The second acquisition module 604 is configured to obtain the data volume of the network data to be matched in the data queue.
The determining module 605 is configured to determine a target thread count according to the data volume of the network data.
The adjusting module 606 is configured to adjust the number of the at least one matching thread to the target thread count by creating or closing threads.
The determining module 605 includes a first determining submodule 605a, a second determining submodule 605b, and a calculation submodule 605c.
The first determining submodule 605a is configured to determine that the target thread count is a first thread count when the data volume of the network data is not greater than a first data volume threshold.
The second determining submodule 605b is configured to determine that the target thread count is a second thread count when the data volume of the network data is not less than a second data volume threshold.
The calculation submodule 605c is configured to calculate the target thread count according to the data volume of the network data when the data volume is between the first data volume threshold and the second data volume threshold.
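The three cases handled by the submodules 605a to 605c can be expressed as a single function. In the sketch below, the first thread count is treated as a minimum and the second as a maximum, and the value between the two thresholds is obtained by linear interpolation. Both of these are assumptions for illustration, since this passage does not give the calculation; the function and parameter names are hypothetical.

```python
def target_thread_count(data_volume, first_threshold, second_threshold,
                        first_count, second_count):
    """Map the queued data volume to a target number of matching threads.

    Assumes first_threshold < second_threshold.
    """
    if data_volume <= first_threshold:
        return first_count             # not greater than the first threshold
    if data_volume >= second_threshold:
        return second_count            # not less than the second threshold
    # Between the thresholds: linear interpolation (an assumed formula).
    fraction = (data_volume - first_threshold) / (second_threshold - first_threshold)
    return first_count + round(fraction * (second_count - first_count))
```

For example, with thresholds of 100 and 1000 items and thread counts of 2 and 20, a backlog of 550 items lies halfway between the thresholds and yields 11 threads under this assumed rule.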
The matching module 603 includes a loading submodule 603a and a sending submodule 603b.
The loading submodule 603a is configured to load the matching keyword into a designated location in memory.
The sending submodule 603b is configured to send an instruction message to the at least one matching thread, the instruction message instructing the at least one matching thread to read the matching keyword from the designated location.
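The load-then-notify behaviour of the submodules 603a and 603b could look roughly like the following. The `KeywordStore` class stands in for the designated memory location, and each matching thread is given its own inbox queue for instruction messages. These names, the `RELOAD` and `STOP` message values, and the per-thread inbox design are assumptions made for this sketch.

```python
import queue
import threading

RELOAD = "reload-keywords"   # hypothetical instruction message
STOP = "stop"                # hypothetical shutdown message


class KeywordStore:
    """Stands in for the designated memory location holding the keywords."""

    def __init__(self):
        self._lock = threading.Lock()
        self._keywords = ()

    def load(self, keywords):
        with self._lock:
            self._keywords = tuple(keywords)

    def read(self):
        with self._lock:
            return self._keywords


def matching_thread(store, inbox, seen):
    """Wait for instruction messages; on RELOAD, read keywords from the store."""
    while True:
        message = inbox.get()
        if message == STOP:
            return
        if message == RELOAD:
            keywords = store.read()
            seen.append(keywords)      # record what this thread now matches on


def publish_keywords(store, inboxes, keywords):
    store.load(keywords)               # loading step: write to the shared location
    for inbox in inboxes:              # sending step: notify every matching thread
        inbox.put(RELOAD)
```

Loading the keywords once and sending only a short message means the keyword set itself is not copied into every message, which is one plausible motivation for this arrangement.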
The receiving module 607 is configured to receive the network data sent by a crawler server before the network data crawled by the web crawler is added to the data queue.
The first acquisition module 602 is configured to receive the matching keyword sent by the crawler server, the matching keyword being a keyword set by the user in the crawler server.
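As a rough illustration of the queue server accepting both network data and matching keywords from the crawler server, the sketch below dispatches on a message type. The JSON message format, the `type` and `payload` field names, and the `QueueServer` class are hypothetical; the disclosure does not specify a wire format or transport.

```python
import json
import queue


class QueueServer:
    """Receives messages from a crawler server (transport not modelled here)."""

    def __init__(self):
        self.data_queue = queue.Queue()
        self.keywords = []

    def handle_message(self, raw):
        message = json.loads(raw)
        if message["type"] == "data":
            # Received network data is added to the data queue.
            self.data_queue.put(message["payload"])
        elif message["type"] == "keywords":
            # Keywords set by the user on the crawler server replace the old set.
            self.keywords = list(message["payload"])
        else:
            raise ValueError(f"unknown message type: {message['type']}")
```

In this sketch a crawler server would serialize each crawled page as `{"type": "data", "payload": ...}` and send the user's keywords as `{"type": "keywords", "payload": [...]}`.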
In summary, in the device for performing keyword matching on network data provided by the embodiments of the present disclosure, network data crawled by a web crawler is added to a data queue, a matching keyword set by a user is obtained, and the network data in the data queue is matched against the matching keyword by at least one matching thread. When the data crawled by the web crawler needs to be matched against different matching keywords, the developer only has to input the new matching keywords and does not need to change the code of the web crawler, which reduces the difficulty of adjusting the matching keywords and lowers development cost.
In addition, in the device provided by the embodiments of the present disclosure, the data volume of the network data to be matched in the data queue is obtained, a target thread count is determined according to that data volume, and the number of matching threads is adjusted to the target thread count by creating or closing threads. This maintains matching efficiency and avoids a backlog of network data in the data queue while saving processing resources.
Furthermore, in the device provided by the embodiments of the present disclosure, the queue server receives the network data and the matching keyword sent by the crawler server and performs multi-pattern matching of the network data against the matching keyword through at least one matching thread. Data crawling and data matching are thus performed in separate stages, which avoids congestion at the crawling end when a large amount of network data is being matched and improves the crawling performance of the crawler system.
An exemplary embodiment of the present disclosure further provides a device for performing keyword matching on network data, which can implement the method for performing keyword matching on network data provided by the present disclosure. The device includes a processor and a memory for storing instructions executable by the processor. The processor is configured to:
add network data crawled by a web crawler to a data queue;
obtain a matching keyword set by a user;
match the network data in the data queue against the matching keyword through at least one matching thread.
Optionally, the method further includes:
obtaining the data volume of the network data to be matched in the data queue;
determining a target thread count according to the data volume of the network data;
adjusting the number of the at least one matching thread to the target thread count by creating or closing threads.
Optionally, determining the target thread count according to the data volume of the network data includes:
when the data volume of the network data is not greater than a first data volume threshold, determining that the target thread count is a first thread count;
when the data volume of the network data is not less than a second data volume threshold, determining that the target thread count is a second thread count;
when the data volume of the network data is between the first data volume threshold and the second data volume threshold, calculating the target thread count according to the data volume of the network data.
Optionally, matching the network data in the data queue against the matching keyword through at least one matching thread includes:
loading the matching keyword into a designated location in memory;
sending an instruction message to the at least one matching thread, the instruction message instructing the at least one matching thread to read the matching keyword from the designated location.
Optionally, the method further includes:
before adding the network data crawled by the web crawler to the data queue, receiving the network data sent by a crawler server.
Optionally, obtaining the matching keyword set by the user includes:
receiving the matching keyword sent by the crawler server, the matching keyword being a keyword set by the user in the crawler server.
It should be noted that, when the device provided by the above embodiments implements its functions, the division into the functional modules described above is used only as an example. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above.
With regard to the device in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.
Fig. 7 is a block diagram of a device 700 according to an exemplary embodiment. For example, the device 700 may be provided as a server. Referring to Fig. 7, the device 700 includes a processing component 722, which further includes one or more processors, and memory resources represented by a memory 732 for storing instructions executable by the processing component 722, such as application programs. The application programs stored in the memory 732 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 722 is configured to execute the instructions so as to perform the above method for performing keyword matching on network data, as performed by the queue server.
The device 700 may also include a power component 726 configured to perform power management of the device 700, a wired or wireless network interface 750 configured to connect the device 700 to a network, and an input/output (I/O) interface 758. The device 700 may operate based on an operating system stored in the memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and illustrated in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. A method for performing keyword matching on network data, characterized in that the method comprises:
adding network data crawled by a web crawler to a data queue;
obtaining a matching keyword set by a user;
matching the network data in the data queue against the matching keyword through at least one matching thread.
2. The method according to claim 1, characterized in that the method further comprises:
obtaining the data volume of the network data to be matched in the data queue;
determining a target thread count according to the data volume of the network data;
adjusting the number of the at least one matching thread to the target thread count by creating or closing threads.
3. The method according to claim 2, characterized in that determining the target thread count according to the data volume of the network data comprises:
when the data volume of the network data is not greater than a first data volume threshold, determining that the target thread count is a first thread count;
when the data volume of the network data is not less than a second data volume threshold, determining that the target thread count is a second thread count;
when the data volume of the network data is between the first data volume threshold and the second data volume threshold, calculating the target thread count according to the data volume of the network data.
4. The method according to claim 1, characterized in that matching the network data in the data queue against the matching keyword through at least one matching thread comprises:
loading the matching keyword into a designated location in memory;
sending an instruction message to the at least one matching thread, the instruction message instructing the at least one matching thread to read the matching keyword from the designated location.
5. The method according to claim 1, characterized in that the method further comprises:
before adding the network data crawled by the web crawler to the data queue, receiving the network data sent by a crawler server.
6. The method according to claim 5, characterized in that obtaining the matching keyword set by the user comprises:
receiving the matching keyword sent by the crawler server, the matching keyword being a keyword set by the user in the crawler server.
7. A device for performing keyword matching on network data, characterized in that the device comprises:
an adding module, configured to add network data crawled by a web crawler to a data queue;
a first acquisition module, configured to obtain a matching keyword set by a user;
a matching module, configured to match the network data in the data queue against the matching keyword through at least one matching thread.
8. The device according to claim 7, characterized in that the device further comprises:
a second acquisition module, configured to obtain the data volume of the network data to be matched in the data queue;
a determining module, configured to determine a target thread count according to the data volume of the network data;
an adjusting module, configured to adjust the number of the at least one matching thread to the target thread count by creating or closing threads.
9. The device according to claim 8, characterized in that the determining module comprises:
a first determining submodule, configured to determine that the target thread count is a first thread count when the data volume of the network data is not greater than a first data volume threshold;
a second determining submodule, configured to determine that the target thread count is a second thread count when the data volume of the network data is not less than a second data volume threshold;
a calculation submodule, configured to calculate the target thread count according to the data volume of the network data when the data volume of the network data is between the first data volume threshold and the second data volume threshold.
10. The device according to claim 7, characterized in that the matching module comprises:
a loading submodule, configured to load the matching keyword into a designated location in memory;
a sending submodule, configured to send an instruction message to the at least one matching thread, the instruction message instructing the at least one matching thread to read the matching keyword from the designated location.
11. The device according to claim 7, characterized in that the device further comprises:
a receiving module, configured to receive the network data sent by a crawler server before the network data crawled by the web crawler is added to the data queue.
12. The device according to claim 11, characterized in that
the first acquisition module is configured to receive the matching keyword sent by the crawler server, the matching keyword being a keyword set by the user in the crawler server.
13. A device for performing keyword matching on network data, characterized in that the device comprises:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
add network data crawled by a web crawler to a data queue;
obtain a matching keyword set by a user;
match the network data in the data queue against the matching keyword through at least one matching thread.
CN201610282294.0A 2016-04-29 2016-04-29 Method and apparatus for matching keyword with network data Pending CN105930482A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610282294.0A CN105930482A (en) 2016-04-29 2016-04-29 Method and apparatus for matching keyword with network data

Publications (1)

Publication Number Publication Date
CN105930482A true CN105930482A (en) 2016-09-07

Family

ID=56837626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610282294.0A Pending CN105930482A (en) 2016-04-29 2016-04-29 Method and apparatus for matching keyword with network data

Country Status (1)

Country Link
CN (1) CN105930482A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050037A (en) * 2014-06-13 2014-09-17 淮阴工学院 Implementation method for directional crawler based on assigned e-commerce website
CN104572901A (en) * 2014-12-25 2015-04-29 小米科技有限责任公司 Method and device for downloading webpage data
CN105045838A (en) * 2015-07-01 2015-11-11 华东师范大学 Network crawler system based on distributed storage system
CN105138547A (en) * 2015-07-10 2015-12-09 无锡天脉聚源传媒科技有限公司 Data search method and apparatus

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193676A (en) * 2017-05-19 2017-09-22 成都奇鲁科技有限公司 Hardware analysis method and device
CN112396536A (en) * 2019-08-12 2021-02-23 北京国双科技有限公司 Method and device for realizing intelligent service
CN113157722A (en) * 2021-04-01 2021-07-23 北京达佳互联信息技术有限公司 Data processing method, device, server, system and storage medium
CN113157722B (en) * 2021-04-01 2023-12-26 北京达佳互联信息技术有限公司 Data processing method, device, server, system and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160907