CN108170744A - Data acquisition method and device - Google Patents

Publication number
CN108170744A
Authority
CN
China
Prior art keywords
keyword
data
retrieval result
dimension
determined
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711375381.1A
Other languages
Chinese (zh)
Inventor
邢荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Cloud Service Information Technology Co Ltd
Original Assignee
Shandong Inspur Cloud Service Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Shandong Inspur Cloud Service Information Technology Co Ltd
Priority to CN201711375381.1A
Publication of CN108170744A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The present invention provides a data acquisition method and device. The method includes: setting at least one dimension and at least one descriptor corresponding to each dimension; determining, among the at least one dimension, at least one target dimension corresponding to the data to be collected; determining at least one keyword according to the at least one descriptor corresponding to each target dimension; retrieving the data to be collected using the at least one keyword to obtain a retrieval result; and judging, according to the retrieval result, whether the at least one keyword is reasonable and, if so, performing data acquisition on the retrieval result. The scheme provided by the invention can therefore improve the accuracy of data acquisition.

Description

Data acquisition method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a data acquisition method and device.
Background
With the arrival of the big data era, people increasingly recognize the importance of data and therefore obtain valuable data by collecting it from the Internet.
At present, data acquisition usually works as follows: a crawler program parses a website progressively deeper according to the data distribution of its pages, and traverses all the data in the parsed links or pages in order to collect the required data. However, the content of major websites keeps expanding and the number of categories keeps growing. When the required data is collected from the parsed links or pages in this way, the acquisition range is too large, so the collected data suffers from redundancy, omissions, and inconsistent classification of the collection results. The existing approach therefore has low data acquisition accuracy.
Summary of the invention
Embodiments of the present invention provide a data acquisition method and device that can improve the accuracy of data acquisition.
In a first aspect, an embodiment of the present invention provides a data acquisition method, the method including:
setting at least one dimension and at least one descriptor corresponding to each dimension;
determining, among the at least one dimension, at least one target dimension corresponding to the data to be collected;
determining at least one keyword according to the at least one descriptor corresponding to each target dimension;
retrieving the data to be collected using the at least one keyword to obtain a retrieval result;
judging, according to the retrieval result, whether the at least one keyword is reasonable and, if so, performing data acquisition on the retrieval result.
Preferably,
the method further comprises:
when the at least one keyword is judged unreasonable, performing:
A1: determining at least one new keyword, again according to the at least one descriptor corresponding to each target dimension;
A2: retrieving the data to be collected using the at least one new keyword to obtain a new retrieval result;
A3: judging, according to the new retrieval result, whether the at least one new keyword is reasonable; if so, performing data acquisition on the new retrieval result; otherwise, performing step A1.
Preferably,
determining at least one keyword according to the at least one descriptor corresponding to each target dimension includes:
aggregating the at least one descriptor corresponding to each target dimension;
forming at least one keyword to be determined from the aggregated descriptors;
for each keyword to be determined: judging whether the keyword to be determined can characterize the feature of at least one target dimension and, if so, determining it as a keyword.
Preferably,
retrieving the data to be collected using the at least one keyword to obtain a retrieval result includes:
forming at least one keyword combination, where each keyword combination includes at least one keyword;
for each keyword combination: using a preset crawler program to retrieve the data to be collected with the at least one keyword in the keyword combination, to obtain the retrieval result corresponding to that keyword combination.
Preferably,
judging, according to the retrieval result, whether the at least one keyword is reasonable includes:
determining at least one character string included in the retrieval result;
counting the number of occurrences of each character string;
for each character string: judging whether the character string matches the feature of at least one target dimension; if it does not match, further judging whether the number of occurrences of the character string reaches a preset count threshold and, if the count threshold is not reached, judging that the at least one keyword is reasonable.
Preferably,
judging, according to the retrieval result, whether the at least one keyword is reasonable includes:
counting the data volume of the retrieval result;
judging whether the data volume exceeds a preset data volume threshold and, if not, determining that the at least one keyword is reasonable.
In a second aspect, an embodiment of the present invention provides a data acquisition device, the device including:
a setup module, configured to set at least one dimension and at least one descriptor corresponding to each dimension;
a dimension determining module, configured to determine, among the at least one dimension set by the setup module, at least one target dimension corresponding to the data to be collected;
a keyword determining module, configured to determine at least one keyword according to the at least one descriptor corresponding to each target dimension determined by the dimension determining module;
a retrieval module, configured to retrieve the data to be collected using the at least one keyword determined by the keyword determining module, to obtain a retrieval result;
an acquisition module, configured to judge, according to the retrieval result obtained by the retrieval module, whether the at least one keyword is reasonable and, if so, to perform data acquisition on the retrieval result.
Preferably,
the keyword determining module is further configured to, on being triggered by the acquisition module, determine at least one new keyword, again according to the at least one descriptor corresponding to each target dimension;
the retrieval module is further configured to retrieve the data to be collected using the at least one new keyword determined by the keyword determining module, to obtain a new retrieval result;
the acquisition module is further configured to judge, according to the new retrieval result obtained by the retrieval module, whether the at least one new keyword is reasonable; if so, to perform data acquisition on the new retrieval result; otherwise, to trigger the keyword determining module.
Preferably,
the keyword determining module includes a forming submodule and a determining submodule;
the forming submodule is configured to aggregate the at least one descriptor corresponding to each target dimension, and to form at least one keyword to be determined from the aggregated descriptors;
the determining submodule is configured to, for each keyword to be determined, judge whether the keyword to be determined can characterize the feature of at least one target dimension and, if so, determine it as a keyword.
Preferably,
the retrieval module is configured to form at least one keyword combination, where each keyword combination includes at least one keyword, and, for each keyword combination, to use a preset crawler program to retrieve the data to be collected with the at least one keyword in the keyword combination, to obtain the retrieval result corresponding to that keyword combination.
Preferably,
the acquisition module includes a first statistic submodule and a first judging submodule;
the first statistic submodule is configured to determine at least one character string included in the retrieval result, and to count the number of occurrences of each character string;
the first judging submodule is configured to, for each character string, judge whether the character string matches the feature of at least one target dimension; if it does not match, to further judge whether the number of occurrences of the character string reaches a preset count threshold and, if the count threshold is not reached, to judge that the at least one keyword is reasonable.
Preferably,
the acquisition module includes a second statistic submodule and a second judging submodule;
the second statistic submodule is configured to count the data volume of the retrieval result;
the second judging submodule is configured to judge whether the data volume exceeds a preset data volume threshold and, if not, to determine that the at least one keyword is reasonable.
Embodiments of the present invention provide a data acquisition method and device. First, a set number of dimensions, and at least one descriptor corresponding to each dimension, can be set according to business requirements. When the data to be collected is determined, at least one target dimension corresponding to that data can be determined among the set dimensions, and at least one keyword is determined according to the descriptors corresponding to each target dimension. The data to be collected is then retrieved using the determined keywords to obtain a retrieval result. When the determined keywords are judged reasonable according to the retrieval result, data acquisition is performed on the retrieval result. In this scheme, therefore, keywords are determined from the target dimensions corresponding to the data to be collected, the data is retrieved in a targeted manner using those keywords, and data is acquired from the targeted retrieval result. Because the retrieval result is obtained by targeted, keyword-driven retrieval, the scheme provided by the embodiments of the present invention can improve the accuracy of data acquisition.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a data acquisition method provided by one embodiment of the present invention;
Fig. 2 is a flow chart of a data acquisition method provided by another embodiment of the present invention;
Fig. 3 is a hardware architecture diagram of the equipment hosting a data acquisition device provided by one embodiment of the present invention;
Fig. 4 is a structural diagram of a data acquisition device provided by one embodiment of the present invention;
Fig. 5 is a structural diagram of a data acquisition device including a forming submodule and a determining submodule, provided by one embodiment of the present invention;
Fig. 6 is a structural diagram of a data acquisition device including a first statistic submodule and a first judging submodule, provided by one embodiment of the present invention;
Fig. 7 is a structural diagram of a data acquisition device including a second statistic submodule and a second judging submodule, provided by one embodiment of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments that those of ordinary skill in the art obtain from these embodiments without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a data acquisition method, which may include the following steps:
Step 101: set at least one dimension and at least one descriptor corresponding to each dimension;
Step 102: determine, among the at least one dimension, at least one target dimension corresponding to the data to be collected;
Step 103: determine at least one keyword according to the at least one descriptor corresponding to each target dimension;
Step 104: retrieve the data to be collected using the at least one keyword, to obtain a retrieval result;
Step 105: judge, according to the retrieval result, whether the at least one keyword is reasonable and, if so, perform data acquisition on the retrieval result.
According to the embodiment shown in Fig. 1, a set number of dimensions, and at least one descriptor corresponding to each dimension, can first be set according to business requirements. When the data to be collected is determined, at least one target dimension corresponding to that data can be determined among the set dimensions, and at least one keyword is determined according to the descriptors corresponding to each target dimension. The data to be collected is then retrieved using the determined keywords to obtain a retrieval result. When the determined keywords are judged reasonable according to the retrieval result, data acquisition is performed on the retrieval result. In this scheme, therefore, keywords are determined from the target dimensions corresponding to the data to be collected, the data is retrieved in a targeted manner using those keywords, and data is acquired from the targeted retrieval result. Because the retrieval result is obtained by targeted, keyword-driven retrieval, the scheme provided by the embodiments of the present invention can improve the accuracy of data acquisition.
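The flow of steps 101 to 105 can be sketched as a small Python function. This is only an illustrative outline, not the implementation described by the patent: the function name acquire and the three callables (make_keywords, search, is_reasonable) are hypothetical placeholders for steps 103, 104, and 105.

```python
from typing import Callable, Dict, List, Optional


def acquire(
    dimensions: Dict[str, List[str]],            # step 101: dimension -> descriptors
    target_dimensions: List[str],                # step 102: names of the target dimensions
    make_keywords: Callable[[Dict[str, List[str]]], List[str]],
    search: Callable[[List[str]], List[str]],
    is_reasonable: Callable[[List[str], List[str]], bool],
) -> Optional[List[str]]:
    """Run steps 102-105 once; return the retrieval result or None."""
    # Step 102: keep only the dimensions relevant to the data to be collected.
    targets = {name: dimensions[name] for name in target_dimensions}
    # Step 103: derive keywords from the descriptors of the target dimensions.
    keywords = make_keywords(targets)
    # Step 104: retrieve the data to be collected with those keywords.
    results = search(keywords)
    # Step 105: acquire data only when the keywords are judged reasonable.
    if is_reasonable(keywords, results):
        return results
    return None
```

Passing the three steps in as callables keeps the outline independent of any particular crawler or judging rule.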
In one embodiment of the present invention, the specific form of the data to be collected can be determined according to business requirements. For example, the data to be collected can be the data of individual websites.
In one embodiment of the present invention, the number of dimensions involved in step 101 of the flow chart shown in Fig. 1, and the form of each dimension, can be determined according to business requirements. For example, the at least one dimension can include, but is not limited to, one or more of: a time dimension (year, month, season, week, day, hour), a region dimension (province, city, district, county, village), a type dimension (clothing, digital products, food, toys, household appliances, automobiles, real estate, etc.), and an industry-specific dimension (for example, tourism data can be divided into domestic travel, outbound travel, local travel, etc., and clothing data can be divided into men's clothing, women's clothing, underwear, etc.).
In this embodiment, the number and form of the descriptors corresponding to each dimension can also be determined according to business requirements. Taking the region dimension (province) as an example, the descriptors corresponding to the region dimension include Hebei Province, Henan Province, Shanxi Province, and Shandong Province.
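One plausible way to hold the dimensions and their descriptors is a plain mapping from dimension name to descriptor list. The sketch below assumes that representation; the names DIMENSIONS and select_target_dimensions are hypothetical, and the descriptor values are the examples given in the text.

```python
# Hypothetical configuration: each dimension maps to its descriptors.
DIMENSIONS = {
    "time": ["2016", "2017"],
    "region": ["Hebei Province", "Henan Province",
               "Shanxi Province", "Shandong Province"],
    "industry": ["men's clothing", "women's clothing"],
}


def select_target_dimensions(dimensions, names):
    """Return the sub-mapping for the dimensions chosen as target dimensions."""
    missing = [name for name in names if name not in dimensions]
    if missing:
        # Fail loudly rather than silently dropping an unknown dimension.
        raise KeyError(f"unknown dimensions: {missing}")
    return {name: dimensions[name] for name in names}


targets = select_target_dimensions(DIMENSIONS, ["time", "industry"])
```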
In one embodiment of the present invention, step 103 of the flow chart shown in Fig. 1, determining at least one keyword according to the at least one descriptor corresponding to each target dimension, can include:
aggregating the at least one descriptor corresponding to each target dimension;
forming at least one keyword to be determined from the aggregated descriptors;
for each keyword to be determined: judging whether the keyword to be determined can characterize the feature of at least one target dimension and, if so, determining it as a keyword.
In this embodiment, at least one keyword to be determined can be formed from the aggregated descriptors in either of the following two ways:
Method one: each descriptor is listed individually, at least one target descriptor is selected among the descriptors, and each target descriptor is taken as one keyword to be determined;
Method two: at least one target descriptor is selected among the descriptors to form at least one keyword to be determined, where each keyword to be determined includes at least one target descriptor.
It should be noted that, whichever method is used, a keyword to be determined should semantically describe the feature of at least one target dimension.
The following example assumes that the determined target dimensions are the time dimension (year), the region dimension (province), and the industry-specific dimension (men's clothing, women's clothing, underwear). The descriptors corresponding to the time dimension (year) include 2016 and 2017; the descriptors corresponding to the region dimension include Hebei Province and Shandong Province; the descriptors corresponding to the industry-specific dimension include men's clothing and women's clothing. The keywords to be determined can then include "2016 men's clothing" and "2017 Hebei Province women's clothing". Here, "2016 men's clothing" characterizes the features of the time dimension and the industry-specific dimension, and "2017 Hebei Province women's clothing" characterizes the features of the time dimension, the region dimension, and the industry-specific dimension.
According to the above embodiment, keywords are determined from the target dimensions corresponding to the data to be collected and the descriptors corresponding to each target dimension, so each determined keyword can semantically describe part of the content of the data to be collected. The keywords therefore match the data to be collected closely.
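The keyword-forming step (method two) and the follow-up check can be sketched as below. The text does not say how "can characterize the feature of a target dimension" is tested; this sketch assumes a simple substring test against the dimension's descriptors, and all names (TARGETS, candidate_keywords, characterised_dimensions, confirm_keywords) are hypothetical.

```python
from itertools import product

# Example target dimensions and descriptors taken from the text.
TARGETS = {
    "time": ["2016", "2017"],
    "region": ["Hebei Province", "Shandong Province"],
    "industry": ["men's clothing", "women's clothing"],
}


def candidate_keywords(targets):
    """Form keywords to be determined: pick at most one descriptor per
    target dimension, and at least one descriptor overall."""
    options = [[None] + list(descs) for descs in targets.values()]
    candidates = []
    for choice in product(*options):
        words = [w for w in choice if w is not None]
        if words:
            candidates.append(" ".join(words))
    return candidates


def characterised_dimensions(keyword, targets):
    """Assumed test: a keyword characterises a dimension when it contains
    one of that dimension's descriptors."""
    return [name for name, descs in targets.items()
            if any(d in keyword for d in descs)]


def confirm_keywords(candidates, targets):
    """Keep the candidates that characterise at least one target dimension."""
    return [k for k in candidates if characterised_dimensions(k, targets)]
```

With these assumptions, characterised_dimensions("2016 men's clothing", TARGETS) reports the time and industry dimensions, matching the worked example. A production check would likely need real semantic matching rather than substrings.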
In one embodiment of the present invention, step 104 of the flow chart shown in Fig. 1, retrieving the data to be collected using the at least one keyword to obtain a retrieval result, can include:
forming at least one keyword combination, where each keyword combination includes at least one keyword;
for each keyword combination: using a preset crawler program to retrieve the data to be collected with the at least one keyword in the keyword combination, to obtain the retrieval result corresponding to that keyword combination.
In this embodiment, keyword combinations can be formed in two ways: first, each keyword is used on its own as a keyword combination; second, each keyword combination includes at least two keywords, and no two keyword combinations are the same.
In this embodiment, when the data to be collected is website data, the following is performed for each keyword combination: the crawler program submits to the website an access request that includes the keyword combination; then, on obtaining the website's response message (for example, a seed address), it accesses the web page corresponding to the seed address through a browser. Each keyword in the keyword combination is entered in the search box, and the retrieval result corresponding to the keyword combination is obtained.
According to the above embodiment, at least one keyword combination is formed from the determined keywords, and the preset crawler program then retrieves the data to be collected with each keyword combination. Because the keyword combinations can cover the various ways of combining the keywords, retrieving with them searches the data to be collected comprehensively, which reduces the probability of missing data.
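Enumerating the keyword combinations can be sketched with itertools.combinations. The sketch assumes that both ways of forming combinations are used together (single keywords plus every distinct group of two or more); the function names and the search callable, which stands in for the crawler program, are hypothetical.

```python
from itertools import combinations


def keyword_combinations(keywords):
    """Return every non-empty combination of the keywords:
    singles first, then pairs, and so on."""
    combos = []
    for size in range(1, len(keywords) + 1):
        combos.extend(combinations(keywords, size))
    return combos


def search_all(combos, search):
    """Run the (hypothetical) search once per combination and keep the
    retrieval result of each combination separately."""
    return {combo: search(list(combo)) for combo in combos}


combos = keyword_combinations(
    ["2016 men's clothing", "2017 Hebei Province women's clothing"]
)
# Two keywords give three combinations: each keyword alone, then both together.
```

Note that n keywords give 2**n - 1 combinations, so a real system would probably cap the combination size.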
In one embodiment of the present invention, after the at least one keyword combination is formed, the method can further include:
judging whether the data volume of the data to be collected exceeds a preset total threshold and, if so, splitting the data to be collected into at least one piece of sub-data to be collected, where the quantity of each piece of sub-data to be collected is less than the total threshold;
in which case retrieving the data to be collected with the at least one keyword in the keyword combination includes:
retrieving each piece of sub-data to be collected with the at least one keyword in the keyword combination.
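The splitting rule can be sketched as follows, assuming the data to be collected is a list of items (for example, page URLs) and that data volume means the item count. The function name split_into_subsets is hypothetical.

```python
def split_into_subsets(items, total_threshold):
    """Split the data to be collected when it exceeds the total threshold.

    Every returned piece of sub-data is strictly smaller than the threshold,
    as the embodiment requires."""
    if total_threshold < 2:
        raise ValueError("total_threshold must be at least 2")
    if len(items) <= total_threshold:
        # Not over the threshold: retrieve the data as a whole.
        return [list(items)]
    size = total_threshold - 1  # keeps each piece below the threshold
    return [list(items[i:i + size]) for i in range(0, len(items), size)]
```

Each piece can then be retrieved with the same keyword combination and the results merged.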
In one embodiment of the present invention, step 105 of the flow chart shown in Fig. 1, judging according to the retrieval result whether the at least one keyword is reasonable, can be implemented in at least the following two ways:
Method one:
In one embodiment of the present invention, step 105 of the flow chart shown in Fig. 1, judging according to the retrieval result whether the at least one keyword is reasonable, includes:
determining at least one character string included in the retrieval result;
counting the number of occurrences of each character string;
for each character string: judging whether the character string matches the feature of at least one target dimension; if it does not match, further judging whether the number of occurrences of the character string reaches a preset count threshold and, if the count threshold is not reached, judging that the at least one keyword is reasonable.
In this embodiment, the at least one character string included in the retrieval result can be determined in either of two ways: first, by randomly extracting at least one character string from the retrieval result; second, by determining every character string that the retrieval result includes.
The following example uses the first way, randomly extracting at least one character string from the retrieval result. The target dimensions are the time dimension (year), the region dimension (province), and the industry-specific dimension (men's clothing, women's clothing, underwear). Character string 1 and character string 2 are randomly extracted from the retrieval result. Character string 1 (for example, 2017) is judged to match the feature, year, of the time dimension, so character string 1 is reasonable. Character string 2 (for example, biscuit) is judged not to match the feature of any target dimension, and its 50 occurrences exceed the preset count threshold, so character string 2 is unreasonable. Because character string 2 is unreasonable, the determined keywords are judged to contain an unreasonable keyword, and the keywords are therefore judged unreasonable.
According to the above embodiment, at least one character string included in the retrieval result is determined, and whether the determined keywords are reasonable is judged from how well each character string matches the target dimensions and how often each character string occurs. Because the character strings are taken from the retrieval result, they faithfully reflect its content, so whether the keywords are reasonable can be judged accurately from them.
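Method one can be sketched with collections.Counter. The sketch assumes the retrieval result has already been reduced to a list of extracted character strings, and that a string matches a target dimension when it contains one of that dimension's descriptors; both the matching rule and the function name are assumptions made for illustration.

```python
from collections import Counter


def keywords_reasonable_by_strings(strings, targets, count_threshold):
    """Judge the keywords unreasonable when any string that matches no
    target dimension occurs at least count_threshold times."""
    counts = Counter(strings)
    for text, occurrences in counts.items():
        matches = any(
            descriptor in text
            for descriptors in targets.values()
            for descriptor in descriptors
        )
        if not matches and occurrences >= count_threshold:
            return False  # frequent off-topic string: keywords unreasonable
    return True


targets = {"time": ["2016", "2017"], "region": ["Hebei Province"]}
sample = ["2017"] * 3 + ["biscuit"] * 50
verdict = keywords_reasonable_by_strings(sample, targets, count_threshold=10)
```

In this sample, "biscuit" matches no target dimension and occurs 50 times, so the keywords are judged unreasonable, as in the worked example. The threshold of 10 is an arbitrary illustrative value.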
Method two:
In one embodiment of the present invention, step 105 of the flow chart shown in Fig. 1, judging according to the retrieval result whether the at least one keyword is reasonable, includes:
counting the data volume of the retrieval result;
judging whether the data volume exceeds a preset data volume threshold and, if not, determining that the at least one keyword is reasonable.
In this embodiment, the specific form of the data volume can be determined according to business requirements. For example, it can be any one of a numerical value, a ratio, or a number of pages.
In this embodiment, when the data volume of the retrieval result is judged not to exceed the data volume threshold, the keywords are relatively precise and the required data can be retrieved accurately. When the data volume of the retrieval result is judged to exceed the data volume threshold, the keywords are unreasonable and the retrieved data contains unneeded redundant data.
The following example takes the data volume to be a number of pages. The preset data volume threshold is 100 pages, and the counted number of pages in the retrieval result is 110. The 110 pages of the retrieval result are therefore judged to exceed the preset threshold of 100 pages. This indicates that the retrieved data contains unneeded redundant data, that the determined keywords are unreasonable, and that the keywords need to be redetermined.
According to the above embodiment, whether the keywords are reasonable is determined from the relationship between the data volume of the retrieval result and the preset data volume threshold. Because the data volume of the retrieval result reflects whether the retrieval result contains redundant data, whether the keywords are reasonable can be judged accurately from it.
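Method two reduces to a single comparison. A minimal sketch, with a hypothetical function name and the page counts from the example above:

```python
def keywords_reasonable_by_volume(volume, volume_threshold):
    """Keywords are reasonable when the retrieval result's data volume
    (a count, a ratio, or a page count) does not exceed the threshold."""
    return volume <= volume_threshold


# Worked example from the text: 110 pages against a 100-page threshold.
needs_new_keywords = not keywords_reasonable_by_volume(110, 100)
```

The unit of volume only needs to be the same as the unit of the threshold.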
In one embodiment of the present invention, method one and method two above can be combined to implement the data acquisition method.
In one embodiment of the present invention, the data acquisition method further comprises:
when the at least one keyword is judged unreasonable, performing:
A1: determining at least one new keyword, again according to the at least one descriptor corresponding to each target dimension;
A2: retrieving the data to be collected using the at least one new keyword to obtain a new retrieval result;
A3: judging, according to the new retrieval result, whether the at least one new keyword is reasonable; if so, performing data acquisition on the new retrieval result; otherwise, performing step A1.
In this embodiment, when the at least one keyword is judged unreasonable, the retrieval result has abnormalities such as data redundancy. To obtain a more accurate retrieval result, new keywords need to be determined again from the descriptors corresponding to each target dimension.
After the new keywords are determined, the data to be collected is retrieved again with them to obtain a new retrieval result, and whether the new keywords are reasonable is judged again from the new retrieval result. If they are judged reasonable, data acquisition can be performed on the new retrieval result. If they are judged unreasonable, the above process is repeated until the determined keywords are judged reasonable.
According to the above embodiment, when the keywords are judged unreasonable, new keywords are determined again from the descriptors corresponding to each target dimension, and data acquisition is performed again on the data to be collected according to the new keywords. Because the keywords can be redetermined according to the retrieval result, the retrieval result can be made more accurate.
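The A1 to A3 loop can be sketched as below. The text repeats the loop until the keywords are judged reasonable; the max_rounds cap is an added assumption to guarantee termination, and the function and parameter names are hypothetical.

```python
def acquire_with_retry(propose_keywords, search, is_reasonable, max_rounds=5):
    """Repeat A1 (new keywords), A2 (retrieve), A3 (judge) until the
    keywords are judged reasonable or max_rounds is exhausted."""
    for round_index in range(max_rounds):
        keywords = propose_keywords(round_index)   # A1
        results = search(keywords)                 # A2
        if is_reasonable(keywords, results):       # A3
            return keywords, results
    return None  # no reasonable keywords found within the cap
```

propose_keywords receives the round index so that each round can select a different set of descriptors.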
In one embodiment of the present invention, performing data acquisition on the retrieval result in step 105 of the flow chart shown in Fig. 1 can include:
determining at least one retrieval parameter;
acquiring, from the retrieval result, at least one piece of target data corresponding to each retrieval parameter.
In this embodiment, the retrieval parameter can be determined according to business requirements. For example, it can be the clothing sales information of a certain region in a certain year.
According to the above embodiment, data is acquired from the retrieval result. Because the retrieval result narrows the range of data acquisition, data acquisition is more efficient.
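Acquiring target data per retrieval parameter can be sketched as a filter over the retrieval result. The sketch assumes the retrieval result is a list of records (dictionaries) and that a retrieval parameter is a set of field constraints; the record fields and the function name collect are hypothetical.

```python
def collect(results, parameters):
    """For each retrieval parameter, gather the records in the retrieval
    result whose fields equal every constraint in that parameter."""
    collected = []
    for parameter in parameters:
        matched = [
            record for record in results
            if all(record.get(key) == value for key, value in parameter.items())
        ]
        collected.append(matched)
    return collected


results = [
    {"year": "2017", "region": "Hebei Province", "sales": 120},
    {"year": "2016", "region": "Hebei Province", "sales": 95},
    {"year": "2017", "region": "Shandong Province", "sales": 80},
]
# Retrieval parameter: clothing sales for one region in one year.
hebei_2017 = collect(results, [{"year": "2017", "region": "Hebei Province"}])[0]
```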
The data acquisition method is explained below, taking as an example data acquisition on the data to be collected A provided by website A. As shown in Fig. 2, the data acquisition method includes:
Step 201: set at least one dimension and at least one descriptor corresponding to each dimension.
In this step, the dimensions set include the time dimension, the region dimension, the type dimension, and the industry-specific dimension. The descriptors corresponding to the time dimension include 2016 and 2017; the descriptors corresponding to the region dimension include Hebei Province, Henan Province, Shanxi Province, and Shandong Province; the descriptors corresponding to the type dimension include clothing, digital products, and food; the descriptors corresponding to the industry-specific dimension include men's clothing and women's clothing.
Step 202: determine, among the at least one dimension, at least one target dimension corresponding to the data to be collected.
In this step, the target dimensions corresponding to the data to be collected A are determined to include the time dimension, the region dimension, and the industry-specific dimension.
Step 203: aggregate the at least one descriptor corresponding to each target dimension.
In this step, the descriptors of the time dimension, the region dimension, and the industry-specific dimension are aggregated.
Step 204: form at least one keyword to be determined from the aggregated descriptors.
In this step, the keywords to be determined that are formed include "2016 men's clothing" and "2017 Hebei Province women's clothing".
Step 205:In at least one keyword to be determined, a keyword to be determined is selected to be treated really as current successively Determine keyword.
Step 206:Judge whether current keyword to be determined can characterize the feature of at least one target dimension, if It is to perform step 207;Otherwise, step 208 is performed.
In this step, when "2016 men's clothing" is the current keyword to be determined, it can characterize the features of the time dimension and the industry-specific dimension, so step 207 is performed. When "2017 Hebei Province women's clothing" is the current keyword to be determined, it can characterize the features of the time dimension, the region dimension, and the industry-specific dimension, so step 207 is performed.
Step 207: Determine the current keyword to be determined as a keyword.
Step 208: Judge whether the current keyword to be determined is the last keyword to be determined; if so, perform step 209; otherwise, perform step 205.
In this step, "2016 men's clothing" and "2017 Hebei Province women's clothing" are determined as keywords.
Step 209: Form at least one keyword combination, where each keyword combination includes at least one keyword.
In this step, the keyword combinations that are formed include: "2016 men's clothing"; "2017 Hebei Province women's clothing"; and "2016 men's clothing 2017 Hebei Province women's clothing".
Step 210: For each keyword combination, perform the following: using a preset crawler program, retrieve the data to be collected by at least one keyword in the keyword combination, and obtain the retrieval result corresponding to the keyword combination.
The keyword combination "2016 men's clothing 2017 Hebei Province women's clothing" is taken as an example: using the preset crawler program, the data A to be collected is retrieved by the keywords included in this combination ("2016 men's clothing" and "2017 Hebei Province women's clothing"), and the retrieval result corresponding to this combination is obtained.
Step 211: Count the data volume of the currently obtained retrieval result.
In this step, the page count of the retrieval result is found to be 90 pages.
Step 212: Judge whether the data volume is greater than a preset data volume threshold; if not, perform step 213; otherwise, perform step 214.
In this step, 90 pages is judged to be less than the data volume threshold of 100 pages, so step 213 is performed.
Step 213: Determine that the at least one keyword is reasonable, and perform data acquisition on the retrieval result.
In this step, the keywords "2016 men's clothing" and "2017 Hebei Province women's clothing" are determined to be reasonable, and data acquisition is performed on the retrieval result.
For example, the clothing sales information for 2017 is obtained from the retrieval result.
Step 214: Determine at least one new keyword again according to the at least one descriptor corresponding to each target dimension.
Step 215: Retrieve the data to be collected using the at least one new keyword to obtain a new retrieval result, take the new retrieval result as the currently obtained retrieval result, and perform step 211.
As shown in Fig. 3 and Fig. 4, an embodiment of the present invention provides a data acquisition device. The device embodiment can be implemented in software, or in hardware or a combination of software and hardware. At the hardware level, Fig. 3 is a hardware structure diagram of the equipment where the data acquisition device provided by the embodiment of the present invention is located. In addition to the processor, memory, network interface, and non-volatile memory shown in Fig. 3, the equipment where the device is located may usually also include other hardware, such as a forwarding chip responsible for processing messages. Taking software implementation as an example, as shown in Fig. 4, the device in the logical sense is formed by the CPU of the equipment where it is located reading the corresponding computer program instructions from the non-volatile memory into memory and running them. The data acquisition device provided in this embodiment includes:
a setup module 401, configured to set at least one dimension and at least one descriptor corresponding to each dimension;
a dimension determining module 402, configured to determine, among the at least one dimension set by the setup module 401, at least one target dimension corresponding to the data to be collected;
a keyword determining module 403, configured to determine at least one keyword according to the at least one descriptor corresponding to each target dimension determined by the dimension determining module 402;
a retrieval module 404, configured to retrieve the data to be collected using the at least one keyword determined by the keyword determining module 403, and obtain a retrieval result;
an acquisition module 405, configured to judge, according to the retrieval result obtained by the retrieval module 404, whether the at least one keyword is reasonable, and if so, to perform data acquisition on the retrieval result.
According to the embodiment shown in Fig. 4, in this solution the keyword determining module can determine keywords through the target dimensions corresponding to the data to be collected. The retrieval module uses the determined keywords to perform directed retrieval on the data to be collected, so that the acquisition module can perform data acquisition on the retrieval result obtained by directed retrieval. Because the retrieval result is obtained by directed retrieval according to the keywords, the solution provided by the embodiment of the present invention can improve the accuracy of data acquisition.
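The way modules 401 to 405 hand data to one another can be sketched as a single pass through a small class. This is a rough sketch of the data flow only: the class name `DataAcquisitionDevice`, its field names, and the use of plain callables for each module are assumptions, and the sample documents are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class DataAcquisitionDevice:
    # Setup module 401: dimensions and their descriptors.
    dimensions: Dict[str, List[str]]
    # Dimension determining module 402: picks the target dimensions.
    determine_target_dimensions: Callable[[Dict[str, List[str]]], List[str]]
    # Keyword determining module 403: descriptors of target dimensions -> keywords.
    determine_keywords: Callable[[Dict[str, List[str]]], List[str]]
    # Retrieval module 404: keywords -> retrieval result.
    retrieve: Callable[[List[str]], List[str]]
    # Acquisition module 405 (judgment part): is the keyword set reasonable?
    is_reasonable: Callable[[List[str]], bool]

    def run(self) -> Optional[List[str]]:
        """One pass through modules 402-405; returns None if keywords are unreasonable."""
        targets = self.determine_target_dimensions(self.dimensions)
        descriptors = {name: self.dimensions[name] for name in targets}
        keywords = self.determine_keywords(descriptors)
        result = self.retrieve(keywords)
        return result if self.is_reasonable(result) else None


documents = ["2017 sales", "2015 sales"]  # hypothetical data to be collected
device = DataAcquisitionDevice(
    dimensions={"time": ["2016", "2017"], "type": ["clothing", "food"]},
    determine_target_dimensions=lambda dims: ["time"],
    determine_keywords=lambda descs: [d for ds in descs.values() for d in ds],
    retrieve=lambda kws: [doc for doc in documents if any(k in doc for k in kws)],
    is_reasonable=lambda res: len(res) <= 100,
)
print(device.run())
```

Passing each module in as a callable keeps the sketch independent of any particular crawler or judgment rule.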
In an embodiment of the invention, the keyword determining module 403 is further configured to, upon receiving a trigger from the acquisition module 405, determine at least one new keyword again according to the at least one descriptor corresponding to each target dimension;
the retrieval module 404 is further configured to retrieve the data to be collected using the at least one new keyword determined by the keyword determining module 403, and obtain a new retrieval result;
the acquisition module 405 is further configured to judge, according to the new retrieval result obtained by the retrieval module 404, whether the at least one new keyword is reasonable, and if so, to perform data acquisition on the new retrieval result; otherwise, to trigger the keyword determining module 403.
In an embodiment of the invention, as shown in Fig. 5, the keyword determining module 403 may include a forming submodule 4031 and a determining submodule 4032;
the forming submodule 4031 is configured to summarize the at least one descriptor corresponding to each target dimension, and to form at least one keyword to be determined using the summarized descriptors;
the determining submodule 4032 is configured to perform the following for each keyword to be determined: judge whether the keyword to be determined can characterize the feature of at least one target dimension, and if so, determine the keyword to be determined as a keyword.
In an embodiment of the invention, the retrieval module 404 is configured to form at least one keyword combination, where each keyword combination includes at least one keyword, and to perform the following for each keyword combination: using a preset crawler program, retrieve the data to be collected by at least one keyword in the keyword combination, and obtain the retrieval result corresponding to the keyword combination.
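Retrieval per keyword combination might look like the sketch below. The `search` callable stands in for the preset crawler program, which the text does not specify, and merging the hits of the individual keywords in a combination into one de-duplicated list is an assumed interpretation of "retrieving by at least one keyword in the combination". The sample pages are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def retrieve_by_combinations(
    combos: List[Tuple[str, ...]],
    search: Callable[[str], List[str]],
) -> Dict[Tuple[str, ...], List[str]]:
    """Return one retrieval result per keyword combination."""
    results: Dict[Tuple[str, ...], List[str]] = {}
    for combo in combos:
        hits: List[str] = []
        for keyword in combo:
            for item in search(keyword):  # placeholder for the crawler call
                if item not in hits:      # keep first-seen order, drop duplicates
                    hits.append(item)
        results[combo] = hits
    return results


# Hypothetical pages standing in for the data to be collected.
pages = [
    "2016 men's clothing sales report",
    "2017 Hebei Province women's clothing sales report",
    "2017 Henan Province food guide",
]


def search(keyword: str) -> List[str]:
    """In-memory stand-in for a crawler: pages whose text contains the keyword."""
    return [page for page in pages if keyword in page]


combos = [
    ("2016 men's clothing",),
    ("2017 Hebei Province women's clothing",),
    ("2016 men's clothing", "2017 Hebei Province women's clothing"),
]
results = retrieve_by_combinations(combos, search)
for combo, hits in results.items():
    print(combo, len(hits))
```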
In an embodiment of the invention, as shown in Fig. 6, the acquisition module 405 may include a first statistic submodule 4051 and a first judging submodule 4052;
the first statistic submodule 4051 is configured to determine at least one character string included in the retrieval result, and to count the number of occurrences of each character string;
the first judging submodule 4052 is configured to perform the following for each character string: judge whether the character string matches the feature of at least one target dimension; if it does not match, further judge whether the number of occurrences of the character string reaches a preset frequency threshold, and if the frequency threshold is not reached, judge that the at least one keyword is reasonable.
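The character-string judgment can be illustrated as follows. Several details are assumptions, since the text leaves them open: character strings are obtained by whitespace splitting, a string "matches" a target dimension when it overlaps one of that dimension's descriptors as a substring, and the keywords are treated as unreasonable as soon as any non-matching string reaches the threshold.

```python
from collections import Counter
from typing import Dict, List


def keywords_reasonable_by_strings(
    retrieval_result: List[str],
    target_descriptors: Dict[str, List[str]],
    count_threshold: int,
) -> bool:
    """Judge keywords by how often off-target strings occur in the retrieval result."""
    # Count occurrences of each character string (assumed: whitespace tokens).
    counts = Counter(token for text in retrieval_result for token in text.split())
    descriptors = [d for group in target_descriptors.values() for d in group]
    for string, count in counts.items():
        # Assumed matching rule: substring overlap with any target descriptor.
        matches = any(string in d or d in string for d in descriptors)
        if not matches and count >= count_threshold:
            return False  # an off-target string is too frequent
    return True


# Hypothetical retrieval result and target-dimension descriptors.
retrieval_result = [
    "2017 Hebei women's clothing",
    "2017 Hebei women's clothing",
    "2017 Hebei food",
    "2016 food",
    "2017 food",
]
target_descriptors = {
    "time": ["2016", "2017"],
    "region": ["Hebei"],
    "industry": ["men's clothing", "women's clothing"],
}

print(keywords_reasonable_by_strings(retrieval_result, target_descriptors, 3))
print(keywords_reasonable_by_strings(retrieval_result, target_descriptors, 4))
```

Here the only off-target string is "food", which occurs three times, so the outcome depends on whether the threshold is set at or above that count.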
In an embodiment of the invention, as shown in Fig. 7, the acquisition module 405 may include a second statistic submodule 4053 and a second judging submodule 4054;
the second statistic submodule 4053 is configured to count the data volume of the retrieval result;
the second judging submodule 4054 is configured to judge whether the data volume is greater than a preset data volume threshold, and if not, to determine that the at least one keyword is reasonable.
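The data volume judgment reduces to a single comparison, sketched below with the figures from the worked example (90 pages against a threshold of 100 pages). Measuring volume in pages and the page size of 10 items are assumptions for illustration; the text does not fix a unit.

```python
import math
from typing import Sequence


def page_count(retrieval_result: Sequence[str], page_size: int) -> int:
    """Data volume expressed as a number of result pages (assumed unit)."""
    return math.ceil(len(retrieval_result) / page_size)


def keywords_reasonable_by_volume(data_volume: int, volume_threshold: int) -> bool:
    """Keywords are reasonable when the data volume is not greater than the threshold."""
    return data_volume <= volume_threshold


# 900 placeholder items at an assumed 10 items per page.
pages = page_count(["item"] * 900, page_size=10)
print(pages, keywords_reasonable_by_volume(pages, volume_threshold=100))
```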
An embodiment of the invention provides a readable medium that includes execution instructions. When a processor of a storage controller executes the execution instructions, the storage controller performs the data acquisition method described in any of the above embodiments.
In an embodiment of the invention, the structures shown in Fig. 6 and Fig. 7 above can be combined to implement the data acquisition device.
An embodiment of the invention provides a storage controller that includes a processor, a memory, and a bus. The memory is configured to store execution instructions, and the processor is connected to the memory through the bus. When the storage controller runs, the processor executes the execution instructions stored in the memory, so that the storage controller performs the data acquisition method described in any of the above embodiments.
Because the information exchange between the units in the above device and their execution processes are based on the same concept as the method embodiments of the present invention, the details can be found in the description of the method embodiments and are not repeated here.
In summary, the embodiments of the present invention can achieve at least the following advantageous effects:
1. In the embodiments of the present invention, a set number of dimensions and at least one descriptor corresponding to each dimension can first be set according to business requirements. When the data to be collected is determined, at least one target dimension corresponding to the data to be collected can be determined among the set dimensions, and at least one keyword is determined according to the descriptors corresponding to each target dimension. The data to be collected is retrieved using the determined keywords to obtain a retrieval result. Then, when the determined keywords are judged reasonable according to the retrieval result, data acquisition is performed on the retrieval result. It follows that, in this solution, keywords can be determined through the target dimensions corresponding to the data to be collected, and the determined keywords are used to perform directed retrieval on the data to be collected, so that data acquisition is performed on the retrieval result obtained by directed retrieval. Because the retrieval result is obtained by directed retrieval according to the keywords, the solution provided by the embodiments of the present invention can improve the accuracy of data acquisition.
2. In the embodiments of the present invention, because the keywords are determined according to the target dimensions corresponding to the data to be collected and the descriptors corresponding to each target dimension, each determined keyword can semantically describe part of the content of the data to be collected. Therefore, the keywords match the data to be collected more closely.
3. In the embodiments of the present invention, at least one keyword combination is formed using the determined keywords. A preset crawler program then retrieves the data to be collected using each keyword combination. Because the at least one keyword combination can cover the various combinations of the keywords, the formed keyword combinations allow a complete search of the data to be retrieved, reducing the probability of data being omitted.
4. In the embodiments of the present invention, at least one character string included in the retrieval result is determined, and whether the determined keywords are reasonable is judged according to the degree of matching between each character string and each target dimension and the number of occurrences of each character string. Because the character strings are determined from the retrieval result, they truly reflect the content included in the retrieval result, so whether the keywords are reasonable can be judged accurately according to the character strings.
5. In the embodiments of the present invention, whether the keywords are reasonable is determined according to the relationship between the data volume of the retrieval result and the preset data volume threshold. Because the data volume of the retrieval result truly reflects whether the retrieval result contains redundant data, whether the keywords are reasonable can be judged accurately according to the data volume of the retrieval result.
6. In the embodiments of the present invention, when the keywords are judged unreasonable, new keywords are determined again according to the descriptors corresponding to each target dimension, so that data acquisition is performed again on the data to be collected according to the new keywords. Because the keywords can be redetermined based on the retrieval result, the retrieval result can be made more accurate.
7. In the embodiments of the present invention, data acquisition is performed on the retrieval result. Because the retrieval result narrows the scope of data acquisition, the efficiency of data acquisition is higher.
It should be noted that herein, such as first and second etc relational terms are used merely to an entity Or operation is distinguished with another entity or operation, is existed without necessarily requiring or implying between these entities or operation Any actual relationship or order.Moreover, term " comprising ", "comprising" or its any other variant be intended to it is non- It is exclusive to include, so that process, method, article or equipment including a series of elements not only include those elements, But also it including other elements that are not explicitly listed or further includes solid by this process, method, article or equipment Some elements.In the absence of more restrictions, the element limited by sentence " including one ", is not arranged Except in the process, method, article or apparatus that includes the element also in the presence of other identical factor.
One of ordinary skill in the art will appreciate that:Realizing all or part of step of above method embodiment can pass through The relevant hardware of program instruction is completed, and aforementioned program can be stored in computer-readable storage medium, the program When being executed, step including the steps of the foregoing method embodiments is performed;And aforementioned storage medium includes:ROM, RAM, magnetic disc or light In the various media that can store program code such as disk.
It is last it should be noted that:The foregoing is merely presently preferred embodiments of the present invention, is merely to illustrate the skill of the present invention Art scheme, is not intended to limit the scope of the present invention.Any modification for being made all within the spirits and principles of the present invention, Equivalent replacement, improvement etc., are all contained in protection scope of the present invention.

Claims (10)

1. A data acquisition method, characterized in that
at least one dimension and at least one descriptor corresponding to each dimension are set;
the method further comprises:
determining, among the at least one dimension, at least one target dimension corresponding to data to be collected;
determining at least one keyword according to the at least one descriptor corresponding to each target dimension;
retrieving the data to be collected using the at least one keyword to obtain a retrieval result;
judging, according to the retrieval result, whether the at least one keyword is reasonable, and if so, performing data acquisition on the retrieval result.
2. The method according to claim 1, characterized in that
it further comprises:
when the at least one keyword is judged unreasonable, performing:
A1: determining at least one new keyword again according to the at least one descriptor corresponding to each target dimension;
A2: retrieving the data to be collected using the at least one new keyword to obtain a new retrieval result;
A3: judging, according to the new retrieval result, whether the at least one new keyword is reasonable, and if so, performing data acquisition on the new retrieval result; otherwise, performing step A1.
3. The method according to claim 1, characterized in that
the determining at least one keyword according to the at least one descriptor corresponding to each target dimension comprises:
summarizing the at least one descriptor corresponding to each target dimension;
forming at least one keyword to be determined using the summarized descriptors;
performing, for each keyword to be determined: judging whether the keyword to be determined can characterize the feature of at least one target dimension, and if so, determining the keyword to be determined as a keyword;
and/or
the retrieving the data to be collected using the at least one keyword to obtain a retrieval result comprises:
forming at least one keyword combination, wherein each keyword combination includes at least one keyword;
performing, for each keyword combination: using a preset crawler program, retrieving the data to be collected by at least one keyword in the keyword combination, and obtaining the retrieval result corresponding to the keyword combination.
4. The method according to any one of claims 1 to 3, characterized in that
the judging, according to the retrieval result, whether the at least one keyword is reasonable comprises:
determining at least one character string included in the retrieval result;
counting the number of occurrences of each character string;
performing, for each character string: judging whether the character string matches the feature of at least one target dimension; if it does not match, further judging whether the number of occurrences of the character string reaches a preset frequency threshold, and if the frequency threshold is not reached, judging that the at least one keyword is reasonable.
5. The method according to any one of claims 1 to 3, characterized in that
the judging, according to the retrieval result, whether the at least one keyword is reasonable comprises:
counting the data volume of the retrieval result;
judging whether the data volume is greater than a preset data volume threshold, and if not, determining that the at least one keyword is reasonable.
6. A data acquisition device, characterized by comprising:
a setup module, configured to set at least one dimension and at least one descriptor corresponding to each dimension;
a dimension determining module, configured to determine, among the at least one dimension set by the setup module, at least one target dimension corresponding to data to be collected;
a keyword determining module, configured to determine at least one keyword according to the at least one descriptor corresponding to each target dimension determined by the dimension determining module;
a retrieval module, configured to retrieve the data to be collected using the at least one keyword determined by the keyword determining module, and obtain a retrieval result;
an acquisition module, configured to judge, according to the retrieval result obtained by the retrieval module, whether the at least one keyword is reasonable, and if so, to perform data acquisition on the retrieval result.
7. The device according to claim 6, characterized in that
the keyword determining module is further configured to, upon receiving a trigger from the acquisition module, determine at least one new keyword again according to the at least one descriptor corresponding to each target dimension;
the retrieval module is further configured to retrieve the data to be collected using the at least one new keyword determined by the keyword determining module, and obtain a new retrieval result;
the acquisition module is further configured to judge, according to the new retrieval result obtained by the retrieval module, whether the at least one new keyword is reasonable, and if so, to perform data acquisition on the new retrieval result; otherwise, to trigger the keyword determining module.
8. The device according to claim 6, characterized in that
the keyword determining module comprises a forming submodule and a determining submodule;
the forming submodule is configured to summarize the at least one descriptor corresponding to each target dimension, and to form at least one keyword to be determined using the summarized descriptors;
the determining submodule is configured to perform, for each keyword to be determined: judging whether the keyword to be determined can characterize the feature of at least one target dimension, and if so, determining the keyword to be determined as a keyword;
and/or
the retrieval module is configured to form at least one keyword combination, wherein each keyword combination includes at least one keyword, and to perform, for each keyword combination: using a preset crawler program, retrieving the data to be collected by at least one keyword in the keyword combination, and obtaining the retrieval result corresponding to the keyword combination.
9. The device according to any one of claims 6 to 8, characterized in that
the acquisition module comprises a first statistic submodule and a first judging submodule;
the first statistic submodule is configured to determine at least one character string included in the retrieval result, and to count the number of occurrences of each character string;
the first judging submodule is configured to perform, for each character string: judging whether the character string matches the feature of at least one target dimension; if it does not match, further judging whether the number of occurrences of the character string reaches a preset frequency threshold, and if the frequency threshold is not reached, judging that the at least one keyword is reasonable.
10. The device according to any one of claims 6 to 8, characterized in that
the acquisition module comprises a second statistic submodule and a second judging submodule;
the second statistic submodule is configured to count the data volume of the retrieval result;
the second judging submodule is configured to judge whether the data volume is greater than a preset data volume threshold, and if not, to determine that the at least one keyword is reasonable.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711375381.1A CN108170744A (en) 2017-12-19 2017-12-19 A kind of collecting method and device

Publications (1)

Publication Number Publication Date
CN108170744A true CN108170744A (en) 2018-06-15

Family

ID=62522452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711375381.1A Pending CN108170744A (en) 2017-12-19 2017-12-19 A kind of collecting method and device

Country Status (1)

Country Link
CN (1) CN108170744A (en)

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180615)