CN105930482A - Method and apparatus for matching keyword with network data - Google Patents
Method and apparatus for matching keyword with network data
- Publication number: CN105930482A
- Application number: CN201610282294.0A
- Authority: CN (China)
- Prior art keywords: data, network data, matching, thread, queue
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present disclosure relates to a method and an apparatus for matching a keyword with network data. The method comprises: adding network data obtained by a web crawler to a data queue; acquiring a matching keyword set by a user; and matching the network data in the data queue against the matching keyword by means of at least one matching thread. When the data crawled by the web crawler needs to be matched against different matching keywords, developers only need to input a new matching keyword, without changing the code of the web crawler, which reduces the difficulty of adjusting the matching keyword and lowers development costs.
Description
Technical field
The present disclosure relates to the field of network technology, and in particular to a method and an apparatus for performing keyword matching on network data.
Background
The Internet contains massive amounts of data. Many enterprise users need to use web crawlers to crawl network data and extract the data they want from it.
In the related art, when developing a web crawler, a developer adds matching code to the crawler's code. After crawling network data, the web crawler runs this matching code to match the crawled data against the matching keywords in the matching code, thereby extracting the required data. When the developer wants to extract different data from the crawled data, the matching keywords in the matching code can be modified.
Summary of the invention
Embodiments of the present disclosure provide a method and an apparatus for performing keyword matching on network data. The technical solutions are as follows:
According to a first aspect of the embodiments of the present disclosure, a method for performing keyword matching on network data is provided. The method includes:
adding network data crawled by a web crawler to a data queue;
obtaining a matching keyword set by a user;
matching, by at least one matching thread, the network data in the data queue against the matching keyword.
Optionally, the method further includes:
obtaining the data volume of the network data to be matched in the data queue;
determining a target thread count according to the data volume of the network data;
adjusting the thread count of the at least one matching thread to the target thread count by creating or closing threads.
Optionally, determining the target thread count according to the data volume of the network data includes:
when the data volume of the network data is not greater than a first data-volume threshold, determining that the target thread count is a first thread count;
when the data volume of the network data is not less than a second data-volume threshold, determining that the target thread count is a second thread count;
when the data volume of the network data is between the first data-volume threshold and the second data-volume threshold, calculating the target thread count according to the data volume of the network data.
Optionally, matching, by at least one matching thread, the network data in the data queue against the matching keyword includes:
loading the matching keyword into a designated location in memory;
sending an instruction message to the at least one matching thread, the instruction message instructing the at least one matching thread to read the matching keyword from the designated location.
Optionally, the method further includes:
before adding the network data crawled by the web crawler to the data queue, receiving the network data sent by a crawler server.
Optionally, obtaining the matching keyword set by the user includes:
receiving the matching keyword sent by the crawler server, the matching keyword being a keyword set by the user in the crawler server.
According to a second aspect of the embodiments of the present disclosure, an apparatus for performing keyword matching on network data is provided. The apparatus includes:
an adding module, configured to add network data crawled by a web crawler to a data queue;
a first obtaining module, configured to obtain a matching keyword set by a user;
a matching module, configured to match, by at least one matching thread, the network data in the data queue against the matching keyword.
Optionally, the apparatus further includes:
a second obtaining module, configured to obtain the data volume of the network data to be matched in the data queue;
a determining module, configured to determine a target thread count according to the data volume of the network data;
an adjusting module, configured to adjust the thread count of the at least one matching thread to the target thread count by creating or closing threads.
Optionally, the determining module includes:
a first determining submodule, configured to determine that the target thread count is a first thread count when the data volume of the network data is not greater than a first data-volume threshold;
a second determining submodule, configured to determine that the target thread count is a second thread count when the data volume of the network data is not less than a second data-volume threshold;
a calculating submodule, configured to calculate the target thread count according to the data volume of the network data when the data volume of the network data is between the first data-volume threshold and the second data-volume threshold.
Optionally, the matching module includes:
a loading submodule, configured to load the matching keyword into a designated location in memory;
a sending submodule, configured to send an instruction message to the at least one matching thread, the instruction message instructing the at least one matching thread to read the matching keyword from the designated location.
Optionally, the apparatus further includes:
a receiving module, configured to receive the network data sent by a crawler server before the network data crawled by the web crawler is added to the data queue.
Optionally, the first obtaining module is configured to receive the matching keyword sent by the crawler server, the matching keyword being a keyword set by the user in the crawler server.
According to a third aspect of the embodiments of the present disclosure, an apparatus for performing keyword matching on network data is provided. The apparatus includes:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
add network data crawled by a web crawler to a data queue;
obtain a matching keyword set by a user;
match, by at least one matching thread, the network data in the data queue against the matching keyword.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects:
Network data crawled by a web crawler is added to a data queue, the matching keyword set by a user is obtained, and the network data in the data queue is matched against the matching keyword by at least one matching thread. When the data crawled by the web crawler needs to be matched against different matching keywords, the developer or user only needs to input new matching keywords without changing the web crawler's code, which reduces the difficulty of adjusting the matching keywords and lowers development costs.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
Fig. 1 is a schematic diagram of an implementation environment involved in the method for performing keyword matching on network data according to the embodiments of the present disclosure;
Fig. 2 is a flowchart of a method for performing keyword matching on network data according to an exemplary embodiment;
Fig. 3 is a flowchart of a method for performing keyword matching on network data according to another exemplary embodiment;
Fig. 4 is a flowchart of a method for performing keyword matching on network data according to yet another exemplary embodiment;
Fig. 5 is a block diagram of an apparatus for performing keyword matching on network data according to an exemplary embodiment;
Fig. 6 is a block diagram of an apparatus for performing keyword matching on network data according to another exemplary embodiment;
Fig. 7 is a block diagram of an apparatus according to an exemplary embodiment.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
Fig. 1 is a schematic diagram of an implementation environment involved in the method for performing keyword matching on network data according to the present disclosure. The implementation environment may include a crawler server 110 and a queue server 120.
The crawler server 110 may be a server that runs a crawler program in a network. The crawler program executes using the hardware resources of the crawler server 110. It may start crawling network data from an initial URL (Uniform Resource Locator) and then crawl the new URLs found in that network data, until no new URLs remain to be crawled or a predetermined depth level is reached. The crawler server 110 then passes the acquired network data to the queue server 120.
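The crawl loop described above can be sketched as a breadth-first traversal. This is only an illustrative sketch, not the crawler program of the disclosure: the function name `crawl`, the `fetch_links` callback, and the in-memory link graph in the usage note are all assumptions introduced for the example.

```python
from collections import deque

def crawl(seed_url, fetch_links, max_depth):
    """Breadth-first crawl starting at seed_url.

    fetch_links(url) is a hypothetical callback returning the URLs found
    in the data crawled from url. The crawl stops when no new URL remains
    or when max_depth (the predetermined level) is reached.
    """
    seen = {seed_url}
    crawled = []
    frontier = deque([(seed_url, 0)])
    while frontier:
        url, depth = frontier.popleft()
        crawled.append(url)
        if depth == max_depth:
            continue  # predetermined level reached: do not follow links further
        for link in fetch_links(url):
            if link not in seen:  # only new URLs are queued for crawling
                seen.add(link)
                frontier.append((link, depth + 1))
    return crawled
```

With a small hypothetical link graph such as `{"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": ["e"], "e": []}`, calling `crawl("a", lambda u: graph.get(u, []), 2)` visits `a`, `b`, `c`, and `d` and stops before `e`, because `d` already sits at the depth limit.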
The queue server 120 is connected to the crawler server 110 through a wired or wireless network. One crawler server may correspond to several queue servers 120. The queue server processes the data and instructions transmitted by the crawler server 110.
The crawler server 110 and the queue server 120 may be of different application levels and architectures, or of the same level and architecture. In terms of application level, the crawler server 110 and the queue server 120 may each be an entry-level server, a workgroup-level server, a department-level server, or an enterprise-level server. In terms of architecture, each may be a CISC (Complex Instruction Set Computing) architecture server, a RISC (Reduced Instruction Set Computing) architecture server, or a VLIW (Very Long Instruction Word) architecture server. In terms of purpose, each may be a general-purpose server or a special-purpose server.
Optionally, the crawler server 110 and the queue server 120 may also be virtual servers.
The technical solutions provided by the embodiments of the present disclosure are introduced and explained below, taking the implementation environment shown in Fig. 1 as an example.
Fig. 2 is a flowchart of a method for performing keyword matching on network data according to an exemplary embodiment. The method is applied in the queue server 120 in the implementation environment shown in Fig. 1 and may include the following steps.
In step 201, network data crawled by a web crawler is added to a data queue.
In step 202, the matching keyword set by a user is obtained.
In step 203, the network data in the data queue is matched against the matching keyword by at least one matching thread.
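Steps 201 to 203 can be sketched with Python's standard `queue` and `threading` modules. This is a minimal sketch under stated assumptions: the function name `run_matching`, the fixed `num_threads` parameter, and the plain substring test used for matching are illustrative choices, not details given in the disclosure.

```python
import queue
import threading

def run_matching(network_data, keywords, num_threads=2):
    """Queue crawled items, then match them against keywords with worker threads."""
    data_queue = queue.Queue()
    for item in network_data:      # step 201: add crawled data to the data queue
        data_queue.put(item)

    results = []                   # (item, matched keywords) pairs
    lock = threading.Lock()

    def matching_thread():
        while True:
            try:
                item = data_queue.get_nowait()
            except queue.Empty:
                return             # queue drained, so this thread exits
            # step 203: naive substring matching (an assumption of this sketch)
            hits = [kw for kw in keywords if kw in item]
            if hits:
                with lock:
                    results.append((item, hits))

    threads = [threading.Thread(target=matching_thread) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the keywords arrive as an argument (step 202) rather than being hard-coded in the crawler, changing them does not require touching any crawling code, which is the point the embodiment makes.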
In summary, in the method for performing keyword matching on network data provided by this embodiment of the present disclosure, network data crawled by a web crawler is added to a data queue, the matching keyword set by a user is obtained, and the network data in the data queue is matched against the matching keyword by at least one matching thread. When the data crawled by the web crawler needs to be matched against different matching keywords, the developer or user only needs to input new matching keywords without changing the web crawler's code, which reduces the difficulty of adjusting the matching keywords and lowers development costs.
Fig. 3 is a flowchart of a method for performing keyword matching on network data according to another exemplary embodiment. The method is applied in the queue server 120 in the implementation environment shown in Fig. 1 and may include the following steps.
In step 301, network data crawled by a web crawler is added to a data queue.
In this step, after the web crawler crawls network data from the network, the queue server adds the network data to the data queue to be matched. The network data in the data queue is matched in the order in which it was added to the queue.
In step 302, the data volume of the network data to be matched in the data queue is obtained.
The data volume of the network data may be the number of items of network data; that is, when the web crawler crawls network data, the data crawled from one web page may count as one item. Alternatively, the data volume may refer to the amount of queue server storage space that the network data occupies, for example 96 KB, 320 MB, or 2.4 GB.
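Both measures of data volume can be computed in a few lines. The sketch below assumes, purely for illustration, that each queued item is a text string and that its size is measured as its UTF-8 encoded length; the function name `data_volume` is hypothetical.

```python
def data_volume(items):
    """Return (item count, total size in bytes) for a list of text items.

    Size is taken as the UTF-8 encoded length, which is an assumption of
    this sketch; the disclosure only says the volume may be a count or a size.
    """
    count = len(items)
    size_bytes = sum(len(item.encode("utf-8")) for item in items)
    return count, size_bytes
```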
In step 303, a target thread count is determined according to the data volume of the network data.
In the method of this embodiment, multiple matching threads may be set up to perform matching simultaneously so that the network data crawled by the web crawler is processed in time. If the number of matching threads is small while a large amount of network data is waiting in the data queue, processing efficiency may suffer; conversely, if the number of matching threads is large while little network data is waiting in the data queue, processing resources are wasted. Therefore, in this embodiment the queue server may adjust the number of matching threads according to the data volume of the pending network data in the data queue, saving the server's processing resources while ensuring that the network data in the data queue is processed efficiently.
Optionally, the queue server may determine the target thread count as follows:
1) When the data volume of the network data is not greater than the first data-volume threshold, the target thread count is determined to be the first thread count.
In case 1), the first thread count is the preset minimum number of matching threads in the queue server. When setting the first thread count, the data-processing capacity of the queue server and possible bursts of network data may both be considered: the first thread count should not waste the computing capacity of the queue server, while still providing a certain matching capacity. For example, suppose the first thread count in a queue server is 5. The processing resources occupied by 5 matching threads are only a small part of the queue server's total processing resources and do not significantly affect the queue server's handling of other tasks. Meanwhile, when a large amount of network data is added to the data queue within a short time, the 5 matching threads can promptly provide a certain processing capacity during that time, so that the network data does not pile up excessively.
When the network data in the data queue is not greater than the first data-volume threshold, the queue server keeps the number of matching threads at the first thread count. For example, when the number of items of network data in the data queue is not more than 1,000, or when the data volume is not more than 500 MB, the number of matching threads is kept at 5.
2) When the data volume of the network data is not less than the second data-volume threshold, the target thread count is determined to be the second thread count.
In case 2), the second thread count is the preset maximum number of matching threads in the queue server. When setting the second thread count, the total data-processing capacity and the concurrent-processing capacity of the queue server need to be considered: the second thread count should not exceed the computing capacity that the queue server reserves for keyword matching. For example, suppose that when the queue server runs fewer than 15 matching threads, its other types of task threads are not significantly affected, but once it runs more than 15 matching threads, its other types of task threads are significantly affected. In this case the second thread count can be set to 15; that is, the queue server runs at most 15 matching threads at the same time.
When the network data in the data queue is not less than the second data-volume threshold, the queue server keeps the number of matching threads at the second thread count. For example, when the number of items of network data in the data queue is not less than 10,000, or when the data volume is not less than 5,000 MB, the number of matching threads is kept at 15.
3) When the data volume of the network data is between the first data-volume threshold and the second data-volume threshold, the target thread count is calculated according to the data volume of the network data.
In case 3), when the data volume of the network data is between the first data-volume threshold and the second data-volume threshold, the queue server may dynamically adjust the number of matching threads according to the data volume, in order to keep matching efficient and prevent network data from piling up in the data queue while saving processing resources as far as possible. For example, following the examples in cases 1) and 2), when the data volume of the network data is between the first threshold of 500 MB and the second threshold of 5,000 MB, the target thread count is calculated from the data volume. Many formulas can be used in practice; for each of them the input is the data volume of the network data and the output is a thread count. For example, the data volume may map to the target thread count by interval: when the data volume is in the interval (500 MB, 1,000 MB], the target thread count is 6; when it is in the interval (1,000 MB, 1,500 MB], the target thread count is 7; and so on. Alternatively, the number of items may map to the target thread count by interval: when the number of items is in the interval (1,000, 2,000], the target thread count is 6; when it is in the interval (2,000, 3,000], the target thread count is 7; and so on.
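The three cases can be combined into one function. The sketch below uses the example figures from the text (5 and 15 threads, thresholds of 500 MB and 5,000 MB) and assumes a uniform interval width of 500 MB, which is one possible formula consistent with the intervals listed above; the constant and function names are illustrative.

```python
import math

MIN_THREADS = 5     # first thread count (example value from the text)
MAX_THREADS = 15    # second thread count (example value from the text)
LOW_MB = 500        # first data-volume threshold
HIGH_MB = 5000      # second data-volume threshold
STEP_MB = 500       # interval width, assumed uniform in this sketch

def target_thread_count(volume_mb):
    """Map the pending data volume (in MB) to a target number of matching threads."""
    if volume_mb <= LOW_MB:        # case 1: not greater than the first threshold
        return MIN_THREADS
    if volume_mb >= HIGH_MB:       # case 2: not less than the second threshold
        return MAX_THREADS
    # case 3: (500, 1000] -> 6, (1000, 1500] -> 7, ..., (4500, 5000) -> 14
    return MIN_THREADS + math.ceil((volume_mb - LOW_MB) / STEP_MB)
```

The same shape works for the item-count variant by replacing the three volume constants with 1,000, 10,000, and 1,000 respectively.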
In step 304, the thread count of the at least one matching thread currently running is adjusted to the target thread count by creating or closing threads.
The queue server adjusts the number of currently running matching threads in real time according to the target thread count determined in step 303. For example, if 8 matching threads are currently running and the target thread count calculated in the preceding steps is 7, the queue server can close 1 of the matching threads. Similarly, if 8 matching threads are currently running and the target thread count calculated in the preceding steps is 10, the queue server can create 2 new matching threads.
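One way to create and close threads on demand is to give each matching thread its own stop flag. The class below is a hypothetical sketch: `MatchingPool`, `resize`, and the `work` callback are names invented for illustration, and cooperative shutdown through `threading.Event` is an assumption, since the disclosure does not say how threads are closed.

```python
import threading

class MatchingPool:
    """Hypothetical pool of matching threads whose size can change at run time."""

    def __init__(self, work):
        self._work = work          # callable(stop_event), run repeatedly by each thread
        self._threads = []         # list of (thread, stop_event) pairs

    def resize(self, target):
        while len(self._threads) < target:      # create new matching threads
            stop = threading.Event()
            thread = threading.Thread(target=self._loop, args=(stop,), daemon=True)
            thread.start()
            self._threads.append((thread, stop))
        while len(self._threads) > target:      # close surplus matching threads
            thread, stop = self._threads.pop()
            stop.set()
            thread.join()

    def _loop(self, stop):
        while not stop.is_set():
            self._work(stop)

    def size(self):
        return len(self._threads)
```

A caller would periodically compute the target thread count from the queue's data volume and pass it to `resize`. The `work` callback should return promptly (for example by waiting on the stop event with a short timeout) so that a closing thread can exit.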
In step 305, the matching keyword set by the user is obtained.
The user may be an operations and maintenance (O&M) engineer of the crawler system, and the matching keyword may be a keyword that the O&M engineer sets in a settings interface provided by the crawler system.
Within step 306, by least one mate thread, by the network data in this data queue with should
Matching keywords mates.
In this step, the appointment position that this matching keywords can be carried in internal memory by queue server,
And at least one coupling thread sends instruction message to this, this instruction message is used for indicating this at least one coupling
Thread reads this matching keywords from this appointment position.
Such as, after queue server gets new matching keywords, can load it in internal memory, and fixed
Time back up in disk, and notify all coupling threads being currently running, make coupling thread dynamic
Load new matching keywords, from data queue, extract network data, will be extracted by multimode matching algorithm
The network data matching keywords new with this mate, and output matching result.
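The reload mechanism can be sketched as a shared keyword store plus a per-thread view of it. Two simplifications are assumed here and should be read as such: a version counter stands in for the instruction message, and a plain substring scan stands in for a real multi-pattern matching algorithm such as Aho-Corasick. The class names `KeywordStore` and `Matcher` are hypothetical.

```python
import threading

class KeywordStore:
    """Holds the current keyword set at one shared location in memory."""

    def __init__(self, keywords):
        self._lock = threading.Lock()
        self._keywords = frozenset(keywords)
        self._version = 0          # bumped on each update; stands in for the notification

    def update(self, keywords):
        with self._lock:
            self._keywords = frozenset(keywords)
            self._version += 1

    def snapshot(self):
        with self._lock:
            return self._version, self._keywords

class Matcher:
    """Per-thread view that reloads keywords when the store's version changes."""

    def __init__(self, store):
        self._store = store
        self._version, self._keywords = store.snapshot()

    def match(self, text):
        version, keywords = self._store.snapshot()
        if version != self._version:    # new keywords were published: reload them
            self._version, self._keywords = version, keywords
        # naive scan over all keywords, not a true multi-pattern algorithm
        return sorted(kw for kw in self._keywords if kw in text)
```

Each matching thread would own one `Matcher`; calling `store.update(...)` from the thread that receives new keywords makes every matcher pick them up on its next item, without restarting any thread.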
In summary, in the method for performing keyword matching on network data provided by this embodiment of the present disclosure, network data crawled by a web crawler is added to a data queue, the matching keyword set by a user is obtained, and the network data in the data queue is matched against the matching keyword by at least one matching thread. When the data crawled by the web crawler needs to be matched against different matching keywords, the developer or user only needs to input new matching keywords without changing the web crawler's code, which reduces the difficulty of adjusting the matching keywords and lowers development costs.
In addition, the method provided by this embodiment obtains the data volume of the network data to be matched in the data queue, determines a target thread count according to that data volume, and adjusts the thread count of the at least one matching thread to the target thread count by creating or closing threads. This saves processing resources while keeping matching efficient and preventing network data from piling up in the data queue.
Fig. 4 is a flowchart of a method for performing keyword matching on network data according to yet another exemplary embodiment. The method is applied in the queue server 120 in the implementation environment shown in Fig. 1 and may include the following steps.
In step 401, the network data sent by a crawler server is received.
The crawler server may be the crawler server 110 shown in Fig. 1. In this step, after the web crawler in the crawler server crawls network data from the network, it performs predetermined processing (such as deduplication) on the crawled network data and then sends the network data to the queue server. The network data may be in text format or in another network data format.
In step 402, the received network data is added to a data queue.
In step 403, the data volume of the network data to be matched in the data queue is obtained.
In step 404, a target thread count is determined according to the data volume of the network data.
In step 405, the thread count of the at least one matching thread currently running is adjusted to the target thread count by creating or closing threads.
In steps 402 to 405 above, a storage threshold may be set for the data queue, for example 1,000 items of network data. When the number of items piled up in the data queue exceeds 1,000, the queue server automatically adds new matching threads, which load the latest matching keywords and process the current network data in time to prevent data from piling up. An idle thread count may also be set for the matching threads, so that when the data volume in the data queue is small, the thread count is reduced appropriately to save system resources. For the implementation, refer to the description of steps 301 to 304 in the embodiment shown in Fig. 3, which is not repeated here.
In step 406, the matching keyword sent by the crawler server is received. The matching keyword is a keyword set by the user in the crawler server.
The matching keyword may be one that an O&M engineer sets in a settings interface provided by the crawler system. For example, the O&M engineer may enter or set matching keywords through an input interface provided by the crawler server, and the crawler server dynamically stores the entered or configured keywords in a designated set data structure. The queue server provides a web interface to the crawler server. After the O&M engineer finishes reconfiguring the matching keywords and notifies the crawler server, the crawler server can call the web interface provided by the queue server to pass message data containing the new matching keywords (that is, the set data structure storing the matching keywords) to the queue server. Optionally, the message data may also contain an identifier of the crawler server.
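The message data passed over the web interface could take many forms. The sketch below assumes a JSON encoding with the hypothetical field names `keywords` and `crawler_id`; neither the encoding nor the field names come from the disclosure, which only states that the message carries the keyword set and, optionally, the crawler server's identifier.

```python
import json

def build_keyword_message(keywords, crawler_id=None):
    """Crawler-server side: serialize the keyword set, optionally with the server's ID."""
    message = {"keywords": sorted(set(keywords))}   # the set structure, as a JSON list
    if crawler_id is not None:
        message["crawler_id"] = crawler_id          # optional identifier of the crawler server
    return json.dumps(message)

def parse_keyword_message(raw):
    """Queue-server side: recover the keyword set and the optional crawler server ID."""
    message = json.loads(raw)
    return set(message["keywords"]), message.get("crawler_id")
```

The identifier lets a queue server that is fed by more than one crawler server tell which keyword set belongs to which data source.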
Because one crawler server may correspond to multiple queue servers, when the multiple queue servers perform multi-pattern matching on the network data crawled by the crawler server using the same matching keywords, the O&M engineer only needs to set the matching keywords in the crawler server. Each queue server then obtains the matching keywords from the crawler server automatically, so the keywords need not be set separately in each queue server, which further simplifies the O&M engineer's work of setting matching keywords.
In step 407, the matching keyword is loaded into a designated location in memory.
In this step, the queue server loads the matching keyword into a designated location in the memory of the queue server. The matching keywords are stored at the designated location in the form of their respective character strings, and may also be stored in the cache, ROM, or RAM of the queue server.
In step 408, an instruction message is sent to the at least one matching thread. The instruction message instructs the at least one matching thread to read the matching keyword from the designated location.
For example, after the queue server obtains new matching keywords, it can load them into memory, periodically back them up to disk, and notify all matching threads that are currently running. Each matching thread then dynamically loads the new matching keywords, extracts network data from the data queue, matches the extracted network data against the new matching keywords using a multi-pattern matching algorithm, and outputs the matching result.
In summary, in the method for performing keyword matching on network data provided by this embodiment of the present disclosure, network data crawled by a web crawler is added to a data queue, the matching keyword set by a user is obtained, and the network data in the data queue is matched against the matching keyword by at least one matching thread. When the data crawled by the web crawler needs to be matched against different matching keywords, the developer or user only needs to input new matching keywords without changing the web crawler's code, which reduces the difficulty of adjusting the matching keywords and lowers development costs.
In addition, the method provided by this embodiment obtains the data volume of the network data to be matched in the data queue, determines a target thread count according to that data volume, and adjusts the thread count of the at least one matching thread to the target thread count by creating or closing threads. This saves processing resources while keeping matching efficient and preventing network data from piling up in the data queue.
Furthermore, in the method provided by this embodiment, the queue server receives the network data and the matching keywords sent by the crawler server and performs multi-pattern matching on them with at least one matching thread. Data crawling and data matching are thus performed in separate stages, which avoids the congestion at the crawling end that matching large volumes of network data would otherwise cause, and improves the crawling performance of the crawler system.
Fig. 5 is a block diagram of an apparatus for performing keyword matching on network data according to an exemplary embodiment. The apparatus may be implemented as all or part of the queue server 120 by a hardware circuit or by a combination of software and hardware. The apparatus may include an adding module 501, a first obtaining module 502, and a matching module 503.
The adding module 501 is configured to add network data crawled by a web crawler to a data queue.
The first obtaining module 502 is configured to obtain the matching keyword set by a user.
The matching module 503 is configured to match, by at least one matching thread, the network data in the data queue against the matching keyword.
In summary, the apparatus for performing keyword matching on network data provided by the embodiments of the present disclosure adds the network data crawled by the web crawler to a data queue, acquires the matching keyword set by the user, and matches the network data in the data queue against the matching keyword by means of at least one matching thread. When the data crawled by the web crawler needs to be matched against different matching keywords, the developer only has to enter the new matching keywords and does not need to change the code of the web crawler, which reduces the difficulty of adjusting the matching keywords and lowers development cost.
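The three modules above can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the function name `run_matching`, the substring test used for matching, and the default of two threads are all assumptions made for illustration.

```python
import queue
import threading


def run_matching(pages, keywords, num_threads=2):
    """Match queued network data against user-set keywords (illustrative only)."""
    # Adding module: put the crawled data into a data queue.
    data_queue = queue.Queue()
    for page in pages:
        data_queue.put(page)

    results = []              # (page, matched keywords) pairs
    lock = threading.Lock()   # protects `results`

    def worker():
        # Matching thread: drain the queue and test each item.
        while True:
            try:
                page = data_queue.get_nowait()
            except queue.Empty:
                return
            # Assumption: a plain substring test stands in for the matcher.
            hits = [kw for kw in keywords if kw in page]
            if hits:
                with lock:
                    results.append((page, hits))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the keywords are passed in as data, calling `run_matching(pages, ["weather"])` instead of `run_matching(pages, ["apple"])` changes what is matched without touching the crawling code.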
Fig. 6 is a block diagram of an apparatus for performing keyword matching on network data according to another exemplary embodiment. The apparatus may be implemented as all or part of the queue server 120 by a hardware circuit or by a combination of software and hardware. The apparatus may include: an adding module 601, a first acquisition module 602, a matching module 603, a second acquisition module 604, a determining module 605, an adjusting module 606 and a receiving module 607.
The adding module 601 is configured to add network data crawled by a web crawler to a data queue.
The first acquisition module 602 is configured to acquire a matching keyword set by a user.
The matching module 603 is configured to match, by means of at least one matching thread, the network data in the data queue against the matching keyword.
The second acquisition module 604 is configured to acquire the data volume of the network data to be matched in the data queue.
The determining module 605 is configured to determine a target thread count according to the data volume of the network data.
The adjusting module 606 is configured to adjust the thread count of the at least one matching thread to the target thread count by creating or closing threads.
The determining module 605 includes: a first determining submodule 605a, a second determining submodule 605b and a calculating submodule 605c.
The first determining submodule 605a is configured to determine that the target thread count is a first thread count when the data volume of the network data is not greater than a first data volume threshold.
The second determining submodule 605b is configured to determine that the target thread count is a second thread count when the data volume of the network data is not less than a second data volume threshold.
The calculating submodule 605c is configured to calculate the target thread count according to the data volume of the network data when the data volume lies between the first data volume threshold and the second data volume threshold.
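The three cases handled by the determining module 605 can be sketched as a single function. This passage does not state how the target thread count is calculated between the two thresholds, so the linear interpolation below is an assumption, as are the function and parameter names and the premise that the first thread count is the smaller one.

```python
import math


def target_thread_count(data_volume, first_threshold, second_threshold,
                        first_count, second_count):
    """Pick a target thread count from the queue's pending data volume."""
    if data_volume <= first_threshold:
        # Not greater than the first threshold: use the first thread count.
        return first_count
    if data_volume >= second_threshold:
        # Not less than the second threshold: use the second thread count.
        return second_count
    # Between the thresholds: linear interpolation (an assumption, not
    # a formula given in this passage), rounded up to a whole thread.
    fraction = (data_volume - first_threshold) / (second_threshold - first_threshold)
    return first_count + math.ceil(fraction * (second_count - first_count))
```

With thresholds of 100 and 1000 items and thread counts of 2 and 20, a queue holding 550 items sits halfway between the thresholds, so the function returns 11.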
The matching module 603 includes: a loading submodule 603a and a sending submodule 603b.
The loading submodule 603a is configured to load the matching keyword into a designated location in memory.
The sending submodule 603b is configured to send an indication message to the at least one matching thread, the indication message instructing the at least one matching thread to read the matching keyword from the designated location.
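One way to picture the loading and sending submodules is the sketch below. It is an illustration under assumptions: `KeywordStore` stands in for the designated location in memory, each thread's `inbox` queue carries the indication message, and the message strings `"RELOAD_KEYWORDS"` and `"STOP"` are hypothetical names.

```python
import queue
import threading


class KeywordStore:
    """Stand-in for the designated location in memory (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._keywords = ()

    def load(self, keywords):
        with self._lock:
            self._keywords = tuple(keywords)

    def read(self):
        with self._lock:
            return self._keywords


class MatchingThread(threading.Thread):
    """Matching thread that re-reads keywords when told to."""

    def __init__(self, store):
        super().__init__(daemon=True)
        self.store = store
        self.inbox = queue.Queue()          # receives indication messages
        self.keywords = ()
        self.reloaded = threading.Event()   # set once keywords were re-read

    def run(self):
        while True:
            message = self.inbox.get()
            if message == "STOP":
                return
            if message == "RELOAD_KEYWORDS":
                # Read the keywords from the designated location.
                self.keywords = self.store.read()
                self.reloaded.set()


def update_keywords(store, threads, keywords):
    store.load(keywords)                    # loading submodule
    for thread in threads:                  # sending submodule
        thread.inbox.put("RELOAD_KEYWORDS")
```

Keeping a single copy of the keywords in the store and only notifying the threads means a keyword change reaches every matching thread without restarting any of them.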
The receiving module 607 is configured to receive the network data sent by a crawler server before the network data crawled by the web crawler is added to the data queue.
The first acquisition module 602 is configured to receive the matching keyword sent by the crawler server, the matching keyword being a keyword set by the user on the crawler server.
In summary, the apparatus for performing keyword matching on network data provided by the embodiments of the present disclosure adds the network data crawled by the web crawler to a data queue, acquires the matching keyword set by the user, and matches the network data in the data queue against the matching keyword by means of at least one matching thread. When the data crawled by the web crawler needs to be matched against different matching keywords, the developer only has to enter the new matching keywords and does not need to change the code of the web crawler, which reduces the difficulty of adjusting the matching keywords and lowers development cost.
Furthermore, the apparatus provided by the embodiments of the present disclosure acquires the data volume of the network data to be matched in the data queue, determines a target thread count according to that data volume, and adjusts the thread count of the at least one matching thread to the target thread count by creating or closing threads. This maintains matching efficiency and keeps network data from piling up in the data queue, while saving processing resources.
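Adjusting the thread count by creating or closing threads might look like the following sketch. The class `MatchingPool`, its `resize` method, and the use of a per-thread stop event are assumptions for illustration; the disclosure does not prescribe how a thread is closed.

```python
import queue
import threading


class MatchingPool:
    """Pool of matching threads whose size can be raised or lowered."""

    def __init__(self, data_queue, handle):
        self.data_queue = data_queue
        self.handle = handle      # callable applied to each queued item
        self.workers = []         # list of (thread, stop_event) pairs

    def _run(self, stop_event):
        while not stop_event.is_set():
            try:
                item = self.data_queue.get(timeout=0.05)
            except queue.Empty:
                continue          # nothing queued; re-check the stop flag
            self.handle(item)
            self.data_queue.task_done()

    def resize(self, target):
        # Create threads until the pool reaches the target thread count.
        while len(self.workers) < target:
            stop = threading.Event()
            thread = threading.Thread(target=self._run, args=(stop,), daemon=True)
            thread.start()
            self.workers.append((thread, stop))
        # Close surplus threads once the target is lower than the pool size.
        while len(self.workers) > target:
            thread, stop = self.workers.pop()
            stop.set()
            thread.join()

    @property
    def size(self):
        return len(self.workers)
```

A monitoring loop could periodically read `data_queue.qsize()`, derive a target thread count from it, and call `resize` with the result, so that the pool grows when data accumulates and shrinks when the queue is nearly empty.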
In addition, in the apparatus provided by the embodiments of the present disclosure, the queue server receives the network data and the matching keyword sent by the crawler server, and performs multi-pattern matching of the network data against the matching keyword by means of at least one matching thread. Data crawling and data matching are thereby performed separately, which avoids congestion at the crawling end when a large amount of network data has to be matched, and improves the crawling performance of the crawler system.
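The text names multi-pattern matching but does not say which algorithm is used. As an illustration only, the sketch below uses the Aho-Corasick automaton, a classic multi-pattern technique chosen here as an assumption; the function names `build_automaton` and `find_keywords` are hypothetical.

```python
from collections import deque


def build_automaton(keywords):
    """Build an Aho-Corasick automaton: (goto, fail, out) tables."""
    goto, fail, out = [{}], [0], [[]]
    for kw in keywords:
        if not kw:
            continue                      # skip empty keywords
        node = 0
        for ch in kw:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(kw)
    # Breadth-first pass to fill in failure links and inherited outputs.
    pending = deque(goto[0].values())
    while pending:
        node = pending.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            pending.append(child)
    return goto, fail, out


def find_keywords(text, automaton):
    """Return the set of keywords that occur anywhere in `text`."""
    goto, fail, out = automaton
    node, found = 0, set()
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        found.update(out[node])
    return found
```

The automaton is built once per keyword set and then scans each piece of network data in a single pass, regardless of how many keywords there are, which is the property that makes multi-pattern matching attractive when many matching threads share one keyword list.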
An exemplary embodiment of the present disclosure further provides an apparatus for performing keyword matching on network data, which can implement the method for performing keyword matching on network data provided by the present disclosure. The apparatus includes: a processor, and a memory for storing instructions executable by the processor. The processor is configured to:
add network data crawled by a web crawler to a data queue;
acquire a matching keyword set by a user;
match, by means of at least one matching thread, the network data in the data queue against the matching keyword.
Optionally, the method further includes:
acquiring the data volume of the network data to be matched in the data queue;
determining a target thread count according to the data volume of the network data;
adjusting the thread count of the at least one matching thread to the target thread count by creating or closing threads.
Optionally, determining the target thread count according to the data volume of the network data includes:
when the data volume of the network data is not greater than a first data volume threshold, determining that the target thread count is a first thread count;
when the data volume of the network data is not less than a second data volume threshold, determining that the target thread count is a second thread count;
when the data volume of the network data lies between the first data volume threshold and the second data volume threshold, calculating the target thread count according to the data volume of the network data.
Optionally, described by least one coupling thread, by the network data in described data queue and institute
State matching keywords to mate, including:
The appointment position that described matching keywords is carried in internal memory;
Send instruction message at least one coupling thread described, described instruction message be used for indicating described at least
One coupling thread reads described matching keywords from described appointment position.
Optionally, the method further includes:
before the network data crawled by the web crawler is added to the data queue, receiving the network data sent by a crawler server.
Optionally, acquiring the matching keyword set by the user includes:
receiving the matching keyword sent by the crawler server, the matching keyword being a keyword set by the user on the crawler server.
It should be noted that, when the apparatus provided in the above embodiments implements its functions, the division into the functional modules described above is used only as an example. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
With regard to the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method, and is not elaborated here.
Fig. 7 is a block diagram of an apparatus 700 according to an exemplary embodiment. For example, the apparatus 700 may be provided as a server. Referring to Fig. 7, the apparatus 700 includes a processing component 722, which further includes one or more processors, and memory resources represented by a memory 732 for storing instructions executable by the processing component 722, such as application programs. The application programs stored in the memory 732 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 722 is configured to execute the instructions so as to perform the above method for performing keyword matching on network data, as performed by the queue server.
The apparatus 700 may also include a power supply component 726 configured to perform power management of the apparatus 700, a wired or wireless network interface 750 configured to connect the apparatus 700 to a network, and an input/output (I/O) interface 758. The apparatus 700 may operate based on an operating system stored in the memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the disclosure indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.
Claims (13)
1. A method for performing keyword matching on network data, characterized in that the method comprises:
adding network data crawled by a web crawler to a data queue;
acquiring a matching keyword set by a user;
matching, by means of at least one matching thread, the network data in the data queue against the matching keyword.
2. The method according to claim 1, characterized in that the method further comprises:
acquiring the data volume of the network data to be matched in the data queue;
determining a target thread count according to the data volume of the network data;
adjusting the thread count of the at least one matching thread to the target thread count by creating or closing threads.
3. The method according to claim 2, characterized in that determining the target thread count according to the data volume of the network data comprises:
when the data volume of the network data is not greater than a first data volume threshold, determining that the target thread count is a first thread count;
when the data volume of the network data is not less than a second data volume threshold, determining that the target thread count is a second thread count;
when the data volume of the network data lies between the first data volume threshold and the second data volume threshold, calculating the target thread count according to the data volume of the network data.
4. The method according to claim 1, characterized in that matching, by means of at least one matching thread, the network data in the data queue against the matching keyword comprises:
loading the matching keyword into a designated location in memory;
sending an indication message to the at least one matching thread, the indication message instructing the at least one matching thread to read the matching keyword from the designated location.
5. The method according to claim 1, characterized in that the method further comprises:
before the network data crawled by the web crawler is added to the data queue, receiving the network data sent by a crawler server.
6. The method according to claim 5, characterized in that acquiring the matching keyword set by the user comprises:
receiving the matching keyword sent by the crawler server, the matching keyword being a keyword set by the user on the crawler server.
7. An apparatus for performing keyword matching on network data, characterized in that the apparatus comprises:
an adding module, configured to add network data crawled by a web crawler to a data queue;
a first acquisition module, configured to acquire a matching keyword set by a user;
a matching module, configured to match, by means of at least one matching thread, the network data in the data queue against the matching keyword.
8. The apparatus according to claim 7, characterized in that the apparatus further comprises:
a second acquisition module, configured to acquire the data volume of the network data to be matched in the data queue;
a determining module, configured to determine a target thread count according to the data volume of the network data;
an adjusting module, configured to adjust the thread count of the at least one matching thread to the target thread count by creating or closing threads.
9. The apparatus according to claim 8, characterized in that the determining module comprises:
a first determining submodule, configured to determine that the target thread count is a first thread count when the data volume of the network data is not greater than a first data volume threshold;
a second determining submodule, configured to determine that the target thread count is a second thread count when the data volume of the network data is not less than a second data volume threshold;
a calculating submodule, configured to calculate the target thread count according to the data volume of the network data when the data volume lies between the first data volume threshold and the second data volume threshold.
10. The apparatus according to claim 7, characterized in that the matching module comprises:
a loading submodule, configured to load the matching keyword into a designated location in memory;
a sending submodule, configured to send an indication message to the at least one matching thread, the indication message instructing the at least one matching thread to read the matching keyword from the designated location.
11. The apparatus according to claim 7, characterized in that the apparatus further comprises:
a receiving module, configured to receive the network data sent by a crawler server before the network data crawled by the web crawler is added to the data queue.
12. The apparatus according to claim 11, characterized in that
the first acquisition module is configured to receive the matching keyword sent by the crawler server, the matching keyword being a keyword set by the user on the crawler server.
13. An apparatus for performing keyword matching on network data, characterized in that the apparatus comprises:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
add network data crawled by a web crawler to a data queue;
acquire a matching keyword set by a user;
match, by means of at least one matching thread, the network data in the data queue against the matching keyword.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610282294.0A CN105930482A (en) | 2016-04-29 | 2016-04-29 | Method and apparatus for matching keyword with network data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610282294.0A CN105930482A (en) | 2016-04-29 | 2016-04-29 | Method and apparatus for matching keyword with network data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105930482A true CN105930482A (en) | 2016-09-07 |
Family
ID=56837626
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610282294.0A Pending CN105930482A (en) | 2016-04-29 | 2016-04-29 | Method and apparatus for matching keyword with network data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105930482A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104050037A (en) * | 2014-06-13 | 2014-09-17 | 淮阴工学院 | Implementation method for directional crawler based on assigned e-commerce website |
CN104572901A (en) * | 2014-12-25 | 2015-04-29 | 小米科技有限责任公司 | Method and device for downloading webpage data |
CN105045838A (en) * | 2015-07-01 | 2015-11-11 | 华东师范大学 | Network crawler system based on distributed storage system |
CN105138547A (en) * | 2015-07-10 | 2015-12-09 | 无锡天脉聚源传媒科技有限公司 | Data search method and apparatus |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107193676A (en) * | 2017-05-19 | 2017-09-22 | 成都奇鲁科技有限公司 | Hardware analysis method and device |
CN112396536A (en) * | 2019-08-12 | 2021-02-23 | 北京国双科技有限公司 | Method and device for realizing intelligent service |
CN113157722A (en) * | 2021-04-01 | 2021-07-23 | 北京达佳互联信息技术有限公司 | Data processing method, device, server, system and storage medium |
CN113157722B (en) * | 2021-04-01 | 2023-12-26 | 北京达佳互联信息技术有限公司 | Data processing method, device, server, system and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160907 |
Application publication date: 20160907 |