CN107895042A - Data acquisition method and device - Google Patents


Info

Publication number
CN107895042A
Authority
CN
China
Prior art keywords
data
client
parameter information
instruction
sent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711237292.0A
Other languages
Chinese (zh)
Inventor
陈晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sohu New Media Information Technology Co Ltd
Original Assignee
Beijing Sohu New Media Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sohu New Media Information Technology Co Ltd
Priority to CN201711237292.0A
Publication of CN107895042A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a data acquisition method and device. When data needs to be acquired, the client corresponding to the data to be acquired is first determined, and a control instruction containing the identifier of the data to be acquired is sent to that client, so that the client sends the server a simulated data read instruction containing the identifier, thereby simulating a real user's operation on the client. The server thus receives the simulated data read instruction directly from the client, and an instruction sent directly by the client carries the request key. This avoids the problem that arises when data is acquired directly by a web crawler: the request sent by the crawler carries no request key, so the data cannot be read.

Description

Data acquisition method and device
Technical field
The present invention relates to the data field, and in particular to a data acquisition method and device.
Background
A web crawler is the core of a search engine, and its crawling efficiency directly affects the quality of a vertical search engine.
However, traditional focused web crawlers adapt poorly and are often stretched thin when they encounter websites with strict anti-crawling policies. This shows up in particular for content pages that are displayed only in a specific client: every access requires the client to generate a corresponding request key, and the server verifies that key and refuses to provide the content if verification fails. For example, a content page of a WeChat subscription account is displayed only in WeChat. To view the page, the WeChat client must generate the request instruction, and a request instruction generated by the client carries the corresponding request key, which a focused web crawler cannot produce.
Summary of the invention
In view of this, the present invention provides a data acquisition method and device to solve the problem that, in the prior art, a web crawler cannot acquire content pages that are displayed in a specific client and whose every access requires a request key generated by that client. The specific scheme is as follows:
A data acquisition method, including:
Determining, according to a data acquisition instruction, the identifier of the data to be acquired and the client corresponding to the data to be acquired;
Sending a control instruction containing the identifier of the data to be acquired to the client, so that the client, upon receiving the control instruction, sends the server a simulated data read instruction containing the identifier of the data to be acquired;
Obtaining the parameter information returned by the server in response to the simulated data read instruction;
Analyzing the parameter information to obtain the data to be acquired.
Further, obtaining the parameter information returned by the server in response to the simulated data read instruction includes:
Intercepting, by means of a man-in-the-middle attack, the parameter information returned by the server in response to the simulated data read instruction.
Further, the method also includes:
Filtering out the parameters other than the necessary parameters from the obtained parameter information, and analyzing the filtered parameter information to obtain the data to be acquired.
Further, the method also includes:
When a construction request is received, constructing the obtained parameter information in batches, so as to obtain the data to be acquired in batches.
Further, sending the control instruction containing the identifier of the data to be acquired to the client, so that the client, upon receiving the control instruction, sends the server a simulated data read instruction containing the identifier of the data to be acquired, includes:
Determining the timing for sending the simulated instruction;
When that timing is reached, sending the control instruction containing the identifier of the data to be acquired to the client, so that the client, upon receiving the control instruction, sends the server a simulated data read instruction containing the identifier of the data to be acquired.
A data acquisition device, including a determining unit, a sending unit, a parameter obtaining unit and a data obtaining unit, wherein:
The determining unit is configured to determine, according to a data acquisition instruction, the identifier of the data to be acquired and the client corresponding to the data to be acquired;
The sending unit is configured to send a control instruction containing the identifier of the data to be acquired to the client, so that the client, upon receiving the control instruction, sends the server a simulated data read instruction containing the identifier of the data to be acquired;
The parameter obtaining unit is configured to obtain the parameter information returned by the server in response to the simulated data read instruction;
The data obtaining unit is configured to analyze the parameter information to obtain the data to be acquired.
Further, the parameter obtaining unit is configured to:
Intercept, by means of a man-in-the-middle attack, the parameter information returned by the server in response to the simulated data read instruction.
Further, the device also includes a parameter filtering unit, wherein:
The parameter filtering unit is configured to filter out the parameters other than the necessary parameters from the obtained parameter information and to send the filtered parameter information to the data obtaining unit.
Further, the device also includes a construction unit, wherein:
The construction unit is configured to construct the obtained parameter information in batches when a construction request is received, and to send the batch-constructed parameter information to different data obtaining units respectively.
A data acquisition device, including a crawler scheduling module, a client simulation operation module, a man-in-the-middle module and a standard crawler module, wherein:
The crawler scheduling module is configured to send a data acquisition instruction;
The client simulation operation module is configured to determine, according to the data acquisition instruction, the identifier of the data to be acquired and the client corresponding to the data to be acquired, and to send a control instruction containing the identifier of the data to be acquired to the client, so that the client, upon receiving the control instruction, sends the server a simulated data read instruction containing the identifier of the data to be acquired;
The man-in-the-middle module is configured to obtain the parameter information returned by the server in response to the simulated data read instruction;
The standard crawler module is configured to analyze the parameter information to obtain the data to be acquired.
As can be seen from the above technical solution, with the data acquisition method and device disclosed in the present application, when data needs to be acquired, the client corresponding to the data to be acquired is first determined, and a control instruction containing the identifier of the data to be acquired is sent to that client, so that the client sends the server a simulated data read instruction containing the identifier, thereby simulating a real user's operation on the client. The server thus receives the simulated data read instruction directly from the client, and because an instruction sent directly by the client carries the request key, this avoids the problem that a request sent directly by a web crawler carries no request key and therefore cannot read the data.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a data acquisition method disclosed in an embodiment of the present invention;
Fig. 2 is a flow chart of a data acquisition method disclosed in an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a data acquisition device disclosed in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a data acquisition device disclosed in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
The invention discloses a data acquisition method, whose flow chart is shown in Fig. 1, including:
Step S11: determine, according to a data acquisition instruction, the identifier of the data to be acquired and the client corresponding to the data to be acquired;
The identifier of the data to be acquired indicates the data to be acquired; the data to be acquired can be uniquely determined from its identifier.
When the processor determines that data needs to be acquired, it generates a data acquisition instruction. The instruction includes the data that needs to be acquired, that is, the data to be acquired; alternatively, it includes the identifier of the data to be acquired rather than the data itself. It may also include the client corresponding to the data to be acquired.
When the data acquisition instruction does not include the client corresponding to the data to be acquired, the client can be determined from the data to be acquired named in the instruction. Specifically, the source of the data to be acquired is determined first, and the client corresponding to that source is the client corresponding to the data to be acquired.
For example, if the data to be acquired is an article of some WeChat official account in the WeChat client, then to acquire the article it must first be determined that the client on which the article is based is the WeChat client.
The processor may specifically be a crawler scheduling module, and the device on which the data acquisition method of this embodiment is based is a web crawler device.
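The client lookup in step S11 can be sketched as follows. This is a minimal illustration only: the `AcquisitionInstruction` type, the `SOURCE_TO_CLIENT` table and the source and client names are hypothetical and do not come from the patent, which does not specify how the source of the data is mapped to a client.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical table mapping a data source to the client that displays it.
SOURCE_TO_CLIENT = {"wechat_official_account": "wechat"}


@dataclass
class AcquisitionInstruction:
    data_id: str                  # identifier of the data to be acquired
    source: str                   # where the data is published
    client: Optional[str] = None  # the instruction may or may not name the client


def resolve_client(instr: AcquisitionInstruction) -> str:
    """Return the client named in the instruction, or infer it from the source."""
    if instr.client is not None:
        return instr.client
    try:
        return SOURCE_TO_CLIENT[instr.source]
    except KeyError:
        raise ValueError(f"no known client for source {instr.source!r}")
```

An instruction that already names its client is returned unchanged; otherwise the source decides, which mirrors the WeChat article example above.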
Step S12: send a control instruction containing the identifier of the data to be acquired to the client, so that the client, upon receiving the control instruction, sends the server a simulated data read instruction containing the identifier of the data to be acquired;
When a user wants to obtain client-based data, the user performs an operation in the client, that is, sends a data request to the server in order to obtain the data in the client for reading or storage. The server returns the requested data only when it receives a data request sent by the client. This is because a data request sent by the client to the server contains a request key, and the server returns the data only when it receives a data request containing that key.
Therefore, in this scheme the simulated data read instruction for the data to be acquired is sent from the client: a user operation is simulated in the client and a simulated data read instruction is sent, which the server receives. Because the simulated data read instruction is sent to the server from within the client, exactly as if a user had issued it, the instruction received by the server likewise contains the request key. After verifying the key, the server treats the instruction as having been sent by a user through the client and returns the requested parameter information.
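The role of the request key can be illustrated with a toy model. The patent does not say how the key is generated or verified, so the HMAC scheme, the shared secret and the URL below are assumptions chosen only to show why a request built outside the client is refused.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical secret embedded in the client and known to the server.
CLIENT_SECRET = b"demo-secret"


def make_request_key(data_id: str, secret: bytes = CLIENT_SECRET) -> str:
    """Client side: derive a request key for one data identifier (assumed scheme)."""
    return hmac.new(secret, data_id.encode(), hashlib.sha256).hexdigest()


def server_handle(data_id: str, request_key: Optional[str]) -> dict:
    """Server side: return parameter information only for a valid request key."""
    expected = make_request_key(data_id)
    if request_key is None or not hmac.compare_digest(expected, request_key):
        return {"status": 403}  # verification failed, content refused
    # Placeholder URL; a real server would return its own parameter information.
    return {"status": 200, "url": f"https://example.invalid/content?id={data_id}"}
```

A plain crawler corresponds to calling `server_handle("a1", None)`, which is refused, whereas a request issued from inside the client carries `make_request_key("a1")` and is served.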
Step S13: obtain the parameter information returned by the server in response to the simulated data read instruction;
The server feeds parameter information back to the client in response to the simulated data read instruction, and this parameter information is obtained. It may be obtained by the processor or by the web crawler device on which this embodiment is based. In this process, the client may or may not itself obtain the parameter information.
Further, when the client obtains the parameter information, the client may forward it directly to the web crawler device on which this embodiment is based after receiving it; alternatively, the web crawler device disclosed in this embodiment may obtain the parameter information while the server is sending it to the client.
When the client does not obtain the parameter information, the parameter information may be intercepted by means of a man-in-the-middle attack while the server is sending it to the client. That is, the parameter information sent by the server to the client is intercepted directly, so that the client does not receive it and the man-in-the-middle module obtains it instead.
In a man-in-the-middle attack, the attacker establishes independent connections with both ends of the communication and relays the data received from each, so that both ends believe they are talking to each other directly over a private connection, whereas the whole session is in fact completely controlled by the attacker. In such an attack, the attacker can intercept the communication between the two parties and insert new content.
In this scheme, the man-in-the-middle module only needs to intercept the parameter information sent by the server to the client, so that the client-based data to be acquired can be obtained through the client's simulated data read instruction.
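The two variants described for step S13 (the client also receives the parameter information, or it does not) can be modeled with an in-process relay. This is a sketch under simplifying assumptions: the `InterceptingRelay` class is hypothetical, the server is a plain Python callable, and no real network traffic or TLS interception is involved.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


class InterceptingRelay:
    """Stands between client and server and keeps a copy of every response."""

    def __init__(self, server: Callable[[Message], Message],
                 forward_to_client: bool = True) -> None:
        self._server = server
        self._forward = forward_to_client
        self.captured: List[Message] = []  # parameter information seen so far

    def handle(self, request: Message) -> Optional[Message]:
        response = self._server(request)      # relay the client's request
        self.captured.append(dict(response))  # keep a copy for the crawler
        # Either pass the response on to the client or withhold it.
        return response if self._forward else None
```

With `forward_to_client=False` the client receives nothing and only the relay holds the parameter information, which corresponds to the interception case above.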
Step S14: analyze the parameter information to obtain the data to be acquired.
For the web crawler device on which the scheme of this embodiment is based, the parameter information it obtains is the decrypted request URL. It therefore only needs to analyze the parameter information and then download and save the content, which realizes crawling of client-based data by a web crawler.
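Since the parameter information is described as a decrypted request URL, the analysis step can be sketched with the Python standard library. The function name, the example URL and its query parameter names are hypothetical; the download and save step is omitted.

```python
from urllib.parse import parse_qs, urlsplit


def parse_parameter_info(request_url: str) -> dict:
    """Split an intercepted request URL into host, path and query parameters."""
    parts = urlsplit(request_url)
    # parse_qs returns a list per key; keep the first value of each parameter.
    params = {name: values[0] for name, values in parse_qs(parts.query).items()}
    return {"host": parts.netloc, "path": parts.path, "params": params}
```

For instance, `parse_parameter_info("https://example.invalid/s?id=a1&key=abc")` yields the host `example.invalid`, the path `/s` and the parameters `id` and `key`, which a standard crawler could then use to download the page.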
With the data acquisition method disclosed in this embodiment, when data needs to be acquired, the client corresponding to the data to be acquired is first determined, and a control instruction containing the identifier of the data to be acquired is sent to that client, so that the client sends the server a simulated data read instruction containing the identifier, thereby simulating a real user's operation on the client. The server thus receives the simulated data read instruction directly from the client, and because an instruction sent directly by the client carries the request key, this avoids the problem that a request sent directly by a web crawler carries no request key and therefore cannot read the data.
This embodiment discloses a data acquisition method, whose flow chart is shown in Fig. 2, including:
Step S21: determine, according to a data acquisition instruction, the identifier of the data to be acquired and the client corresponding to the data to be acquired;
Step S22: send a control instruction containing the identifier of the data to be acquired to the client, so that the client, upon receiving the control instruction, sends the server a simulated data read instruction containing the identifier of the data to be acquired;
The timing for sending the simulated instruction is determined. When that timing is reached, the control instruction containing the identifier of the data to be acquired is sent to the client, so that the client, upon receiving the control instruction, sends the server a simulated data read instruction containing the identifier of the data to be acquired.
The processor determines the timing for sending the simulated instruction, and the simulated instruction is sent only when that timing is reached, so that the crawling process can be monitored.
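The patent does not state how the sending timing is chosen. One simple possibility, shown here purely as an assumption, is a minimum interval between consecutive control instructions; the `SendGate` class and its parameters are hypothetical.

```python
import time
from typing import Callable


class SendGate:
    """Allows a control instruction only once a minimum interval has elapsed."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._min_interval = min_interval
        self._clock = clock            # injectable so the gate can be tested
        self._next_allowed = clock()   # the first send is allowed immediately

    def ready(self) -> bool:
        """True when the sending timing has been reached."""
        return self._clock() >= self._next_allowed

    def mark_sent(self) -> None:
        """Record a send and schedule the next permitted sending time."""
        self._next_allowed = self._clock() + self._min_interval
```

A scheduler would check `ready()` before sending each control instruction and call `mark_sent()` afterwards, which also gives it a single place to log or throttle the crawl.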
Step S23: obtain the parameter information that the server returns to the client in response to the simulated data read instruction;
The method may also include: filtering out the parameters other than the necessary parameters from the obtained parameter information, and analyzing the filtered parameter information to obtain the data to be acquired.
The obtained parameter information is filtered in order to remove irrelevant or non-essential parameters. Such parameters may include advertisements or content unrelated to the data to be acquired, and may also be other content, which is not specifically limited here.
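The filtering step amounts to keeping an allow-list of parameters. In this sketch the set of necessary parameter names is hypothetical; the patent does not enumerate which parameters are necessary.

```python
from typing import Dict, FrozenSet

# Hypothetical set of parameters the crawler actually needs.
REQUIRED_PARAMS: FrozenSet[str] = frozenset({"id", "key"})


def filter_parameters(params: Dict[str, str],
                      required: FrozenSet[str] = REQUIRED_PARAMS) -> Dict[str, str]:
    """Drop every parameter that is not in the set of necessary parameters."""
    return {name: value for name, value in params.items() if name in required}
```

An allow-list is used rather than a block-list so that unknown parameters, such as new advertising fields, are dropped by default.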
Step S24: when a construction request is received, construct the obtained parameter information in batches, so as to obtain the data to be acquired in batches;
The processor determines whether parameter construction is to be performed. If not, the method proceeds directly to the next step and analyzes the parameter information. If so, the timing of the parameter construction first needs to be determined; when that timing is reached, a construction request is generated and the obtained parameter information is constructed in batches, so that the data to be acquired can be obtained in batches.
Specifically, the man-in-the-middle module obtains the parameter information and, upon receiving a batch construction request, constructs the obtained parameter information in batches; correspondingly, the standard crawler module receives the parameter information from the man-in-the-middle module.
After the man-in-the-middle module constructs the parameter information in batches, there may be multiple pieces of parameter information. The man-in-the-middle module and the standard crawler modules are in a one-to-many relationship, that is, the man-in-the-middle module sends the multiple pieces of parameter information to different standard crawler modules respectively. For example, the man-in-the-middle module sends the first piece of parameter information to the first standard crawler module, the second piece to the second standard crawler module, and so on.
There may also be more than one man-in-the-middle module, and each man-in-the-middle module may correspond to multiple standard crawler modules.
The multiple standard crawler modules analyze the multiple pieces of parameter information, obtain the data and save it, thereby realizing large-scale batch crawling of the data to be acquired.
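Batch construction and the one-to-many distribution can be sketched as follows. The patent does not define how parameter information is constructed in batches; this example assumes that one intercepted parameter set can serve as a template for several data identifiers, and it distributes the results round-robin. All function and worker names are hypothetical.

```python
from itertools import cycle
from typing import Dict, List


def construct_batch(template: Dict[str, str], data_ids: List[str]) -> List[Dict[str, str]]:
    """Build one parameter set per data identifier from an intercepted template."""
    return [{**template, "id": data_id} for data_id in data_ids]


def distribute(batch: List[Dict[str, str]],
               crawlers: List[str]) -> Dict[str, List[Dict[str, str]]]:
    """Assign parameter sets to standard crawler modules in round-robin order."""
    if not crawlers:
        raise ValueError("at least one standard crawler module is required")
    assignment: Dict[str, List[Dict[str, str]]] = {name: [] for name in crawlers}
    for name, item in zip(cycle(crawlers), batch):
        assignment[name].append(item)
    return assignment
```

Round-robin is only one possible policy; it matches the description that the first piece goes to the first standard crawler module, the second piece to the second, and so on.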
Step S25: analyze the parameter information to obtain the data to be acquired.
With the data acquisition method disclosed in this embodiment, when data needs to be acquired, the client corresponding to the data to be acquired is first determined, and a control instruction containing the identifier of the data to be acquired is sent to that client, so that the client sends the server a simulated data read instruction containing the identifier, thereby simulating a real user's operation on the client. The server thus receives the simulated data read instruction directly from the client, and because an instruction sent directly by the client carries the request key, this avoids the problem that a request sent directly by a web crawler carries no request key and therefore cannot read the data.
This embodiment discloses a data acquisition device, whose schematic structural diagram is shown in Fig. 3, including:
A determining unit 31, a sending unit 32, a parameter obtaining unit 33 and a data obtaining unit 34.
The determining unit 31 is configured to determine, according to a data acquisition instruction, the identifier of the data to be acquired and the client corresponding to the data to be acquired;
The identifier of the data to be acquired indicates the data to be acquired; the data to be acquired can be uniquely determined from its identifier.
When the processor determines that data needs to be acquired, it generates a data acquisition instruction. The instruction includes the data that needs to be acquired, that is, the data to be acquired; alternatively, it includes the identifier of the data to be acquired rather than the data itself. It may also include the client corresponding to the data to be acquired.
When the data acquisition instruction does not include the client corresponding to the data to be acquired, the client can be determined from the data to be acquired named in the instruction. Specifically, the source of the data to be acquired is determined first, and the client corresponding to that source is the client corresponding to the data to be acquired.
For example, if the data to be acquired is an article of some WeChat official account in the WeChat client, then to acquire the article it must first be determined that the client on which the article is based is the WeChat client.
The processor may specifically be a crawler scheduling module, and the system on which the data acquisition device of this embodiment is based is a web crawler device.
The sending unit 32 is configured to send a control instruction containing the identifier of the data to be acquired to the client, so that the client, upon receiving the control instruction, sends the server a simulated data read instruction containing the identifier of the data to be acquired;
When a user wants to obtain client-based data, the user performs an operation in the client, that is, sends a data request to the server in order to obtain the data in the client for reading or storage. The server returns the requested data only when it receives a data request sent by the client. This is because a data request sent by the client to the server contains a request key, and the server returns the data only when it receives a data request containing that key.
Therefore, in this scheme the simulated data read instruction for the data to be acquired is sent from the client: a user operation is simulated in the client and a simulated data read instruction is sent, which the server receives. Because the simulated data read instruction is sent to the server from within the client, exactly as if a user had issued it, the instruction received by the server likewise contains the request key. After verifying the key, the server treats the instruction as having been sent by a user through the client and returns the requested parameter information.
The timing for sending the simulated instruction is determined. When that timing is reached, the control instruction containing the identifier of the data to be acquired is sent to the client, so that the client, upon receiving the control instruction, sends the server a simulated data read instruction containing the identifier of the data to be acquired.
The processor determines the timing for sending the simulated instruction, and the simulated instruction is sent only when that timing is reached, so that the crawling process can be monitored.
The parameter obtaining unit 33 is configured to obtain the parameter information that the server returns to the client in response to the simulated data read instruction;
The server feeds parameter information back to the client in response to the simulated data read instruction, and this parameter information is obtained. It may be obtained by the processor or by the web crawler device on which this embodiment is based. In this process, the client may or may not itself obtain the parameter information.
Further, when the client obtains the parameter information, the client may forward it directly to the web crawler device on which this embodiment is based after receiving it; alternatively, the web crawler device disclosed in this embodiment may obtain the parameter information while the server is sending it to the client.
When the client does not obtain the parameter information, the parameter information may be intercepted by means of a man-in-the-middle attack while the server is sending it to the client. That is, the parameter information sent by the server to the client is intercepted directly, so that the client does not receive it and the man-in-the-middle module obtains it instead.
In a man-in-the-middle attack, the attacker establishes independent connections with both ends of the communication and relays the data received from each, so that both ends believe they are talking to each other directly over a private connection, whereas the whole session is in fact completely controlled by the attacker. In such an attack, the attacker can intercept the communication between the two parties and insert new content.
In this scheme, the man-in-the-middle module only needs to intercept the parameter information sent by the server to the client, so that the client-based data to be acquired can be obtained through the client's simulated data read instruction.
The data acquisition device disclosed in this embodiment may also include a parameter filtering unit.
The parameter filtering unit is configured to filter out the parameters other than the necessary parameters from the obtained parameter information; the filtered parameter information is then analyzed to obtain the data to be acquired.
The obtained parameter information is filtered in order to remove irrelevant or non-essential parameters. Such parameters may include advertisements or content unrelated to the data to be acquired, and may also be other content, which is not specifically limited here.
The data acquisition device disclosed in this embodiment may also include a construction unit.
The construction unit is configured to construct the obtained parameter information in batches when a construction request is received, and to send the batch-constructed parameter information to different data obtaining units respectively.
The processor determines whether parameter construction is to be performed. If not, the device proceeds directly to the next step and analyzes the parameter information. If so, the timing of the parameter construction first needs to be determined; when that timing is reached, a construction request is generated and the obtained parameter information is constructed in batches, so that the data to be acquired can be obtained in batches.
Specifically, the man-in-the-middle module obtains the parameter information and, upon receiving a batch construction request, constructs the obtained parameter information in batches; correspondingly, the standard crawler module receives the parameter information from the man-in-the-middle module.
After the man-in-the-middle module constructs the parameter information in batches, there may be multiple pieces of parameter information. The man-in-the-middle module and the standard crawler modules are in a one-to-many relationship, that is, the man-in-the-middle module sends the multiple pieces of parameter information to different standard crawler modules respectively. For example, the man-in-the-middle module sends the first piece of parameter information to the first standard crawler module, the second piece to the second standard crawler module, and so on.
There may also be more than one man-in-the-middle module, and each man-in-the-middle module may correspond to multiple standard crawler modules.
The multiple standard crawler modules analyze the multiple pieces of parameter information, obtain the data and save it, thereby realizing large-scale batch crawling of the data to be acquired.
The data obtaining unit 34 is configured to analyze the parameter information to obtain the data to be acquired.
For the web crawler device on which the scheme of this embodiment is based, the parameter information it obtains is the decrypted request URL. It therefore only needs to analyze the parameter information and then download and save the content, which realizes crawling of client-based data by a web crawler.
With the data acquisition device disclosed in this embodiment, when data needs to be acquired, the client corresponding to the data to be acquired is first determined, and a control instruction containing the identifier of the data to be acquired is sent to that client, so that the client sends the server a simulated data read instruction containing the identifier, thereby simulating a real user's operation on the client. The server thus receives the simulated data read instruction directly from the client, and because an instruction sent directly by the client carries the request key, this avoids the problem that a request sent directly by a web crawler carries no request key and therefore cannot read the data.
This embodiment discloses a data acquisition device, whose structure is shown schematically in Figure 4, comprising:
a crawler scheduling module 41, a client simulation operation module 42, a man-in-the-middle module 43 and a standard crawler module 44.
The crawler scheduling module 41 is configured to send a data acquisition instruction.
When the crawler scheduling module 41 determines that data is to be acquired, it generates a data acquisition instruction. The data acquisition instruction includes the data that needs to be acquired, i.e. the data to be obtained, or alternatively the identifier of the data to be obtained rather than the data itself; it may also include the client corresponding to the data to be obtained.
When the data acquisition instruction does not include the client corresponding to the data to be obtained, that client can be determined from the data to be obtained named in the instruction. Specifically, the source of the data to be obtained is determined first, and the client corresponding to that source is the client corresponding to the data to be obtained.
For example, if the data to be obtained is an article from a WeChat official account in the WeChat client, then to obtain this article it must first be determined that the client on which the article is based is the WeChat client.
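Determining the client from a data acquisition instruction can be sketched as follows. This is only an illustration under stated assumptions: the instruction is modelled as a dictionary, and the `source` field, the `SOURCE_TO_CLIENT` table and the function name `resolve_client` are hypothetical and do not come from the patent text.

```python
# Hypothetical mapping from a data source to the client that serves it.
SOURCE_TO_CLIENT = {
    "wechat_official_account": "wechat",
}


def resolve_client(instruction: dict) -> str:
    """Return the client named in the instruction, or derive it from the source."""
    client = instruction.get("client")
    if client:
        # The instruction already names the client, so nothing to derive.
        return client
    source = instruction.get("source")
    if source not in SOURCE_TO_CLIENT:
        raise ValueError(f"no client known for source {source!r}")
    return SOURCE_TO_CLIENT[source]


explicit = resolve_client({"data_id": "article-42", "client": "wechat"})
derived = resolve_client({"data_id": "article-42", "source": "wechat_official_account"})
```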
The client simulation operation module 42 is configured to determine, according to the data acquisition instruction, the identifier of the data to be obtained and the client corresponding to the data to be obtained, and to send to the client a control instruction containing the identifier, so that the client, on receiving the control instruction, sends to the server a simulated data read instruction containing the identifier.
When a user wants to obtain client-based data through the client, the user performs an operation in the client, that is, a data request is sent to the server in order to obtain the data in the client for reading or storage. The server returns the requested data only when it receives a data request sent by the client. This is because a data request sent by the client to the server includes a request key, and the server returns the data only on receiving a data request that contains the request key.
Therefore, in this scheme the simulated data read instruction for the data to be obtained is sent from the client: user behaviour is simulated in the client and a simulated data read instruction is sent, which the server receives. Because the simulated data read instruction is sent to the server from within the client, directly simulating a user, the instruction received by the server likewise includes the request key. After verifying the request key, the server treats the simulated data read instruction as having been sent by a user through the client and therefore returns the parameter information to be read.
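The role of the request key can be illustrated with a small in-process simulation. The patent does not describe how the request key is derived or verified, so the HMAC scheme below is purely an assumption chosen for illustration; `CLIENT_SECRET`, `make_request_key`, `server_handle` and the `example.invalid` URL are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held by the genuine client; the patent does not
# describe how the request key is actually derived.
CLIENT_SECRET = b"demo-secret"


def make_request_key(data_id: str) -> str:
    """Derive a request key for one data identifier (assumed HMAC scheme)."""
    return hmac.new(CLIENT_SECRET, data_id.encode("utf-8"), hashlib.sha256).hexdigest()


def server_handle(request: dict):
    """Return parameter information only when the request key verifies."""
    key = request.get("request_key", "")
    expected = make_request_key(request["data_id"])
    if not hmac.compare_digest(key, expected):
        return None  # rejected: missing or wrong request key
    return {
        "data_id": request["data_id"],
        "url": "https://example.invalid/item/" + request["data_id"],
    }


def client_simulated_read(data_id: str):
    """Simulated read sent from inside the client, so it carries the key."""
    return server_handle({"data_id": data_id, "request_key": make_request_key(data_id)})


def bare_crawler_read(data_id: str):
    """Request sent by a plain crawler, which has no request key."""
    return server_handle({"data_id": data_id})


via_client = client_simulated_read("article-42")
via_crawler = bare_crawler_read("article-42")
```

In this sketch only the request that originates from the client receives parameter information, which mirrors the reason the scheme routes reads through the client.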
The timing for sending the simulated instruction is determined. When that timing is reached, the control instruction containing the identifier of the data to be obtained is sent to the client, so that the client, on receiving the control instruction, sends to the server the simulated data read instruction containing the identifier.
The crawler scheduling module determines the sending timing of the simulated instruction, and the simulated instruction is sent only when that timing is reached, so that the crawling process can be monitored.
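One possible way to gate sending on a timing is sketched below. The text does not say how the sending timing is chosen, so a fixed minimum interval between sends is assumed here; the class `SendTimer`, the function `run_schedule` and the explicit list of clock ticks are hypothetical and used only to keep the example deterministic.

```python
class SendTimer:
    """Tracks when the next control instruction may be sent (assumed fixed interval)."""

    def __init__(self, interval_seconds: float, start: float = 0.0):
        self.interval = interval_seconds
        self.next_send = start

    def due(self, now: float) -> bool:
        return now >= self.next_send

    def mark_sent(self, now: float) -> None:
        self.next_send = now + self.interval


def run_schedule(timer: SendTimer, data_ids, ticks):
    """Send at most one control instruction per tick, and only when the timing is reached."""
    pending = list(data_ids)
    sent = []
    for now in ticks:
        if pending and timer.due(now):
            sent.append((now, pending.pop(0)))  # stands in for sending to the client
            timer.mark_sent(now)
    return sent


sent_log = run_schedule(SendTimer(interval_seconds=2), ["a", "b"], ticks=[0, 1, 2, 3])
```

The returned log records when each instruction was sent, which is one simple way to make the crawling process observable.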
The man-in-the-middle module 43 is configured to obtain the parameter information that the server returns to the client in response to the simulated data read instruction.
The server feeds parameter information back to the client according to the simulated data read instruction, and this parameter information is obtained. It may be obtained by a processor or by the web crawler device on which this embodiment is based. In this process, the client may or may not obtain the parameter information.
Further, when the client obtains the parameter information, the client may, after obtaining it, forward it directly to the web crawler device on which this embodiment is based; alternatively, the web crawler device disclosed in this embodiment may obtain the parameter information while the server is sending it to the client.
When the client does not obtain the parameter information, the parameter information may be intercepted by way of a man-in-the-middle attack while the server is sending it to the client. That is, the parameter information sent directly by the server to the client is intercepted, so that the client does not receive it and the man-in-the-middle module obtains it by way of the man-in-the-middle attack.
In a man-in-the-middle attack, the attacker establishes independent connections with each of the two ends of the communication and relays the data it receives, so that both ends believe they are talking directly to each other over a private connection, when in fact the whole session is completely controlled by the attacker. In a man-in-the-middle attack, the attacker can intercept the communication between the two parties and insert new content.
In this scheme, the man-in-the-middle module only needs to intercept the parameter information sent by the server to the client, so that the client-based data to be obtained can be acquired by means of the client's simulated data read instruction.
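The two cases described above, in which the client either does or does not receive the parameter information, can be modelled with an in-process relay. This is a simulation with plain Python objects rather than a network-level implementation; the classes `Server` and `InterceptingRelay`, the `forward_to_client` flag and the response fields are hypothetical.

```python
class Server:
    """Stand-in for the server; returns parameter information for a request."""

    def handle(self, request: dict) -> dict:
        return {
            "data_id": request["data_id"],
            "url": "https://example.invalid/item/" + request["data_id"],
        }


class InterceptingRelay:
    """Sits between client and server and keeps a copy of every response."""

    def __init__(self, upstream: Server, forward_to_client: bool = True):
        self.upstream = upstream
        self.forward_to_client = forward_to_client
        self.captured = []

    def handle(self, request: dict):
        response = self.upstream.handle(request)
        self.captured.append(response)  # the relay always obtains the parameter information
        # Case 1: pass the response on. Case 2: withhold it from the client.
        return response if self.forward_to_client else None


passing = InterceptingRelay(Server(), forward_to_client=True)
seen_by_client = passing.handle({"data_id": "a1"})

blocking = InterceptingRelay(Server(), forward_to_client=False)
withheld = blocking.handle({"data_id": "a1"})
```

In both configurations the relay holds the parameter information in `captured`; only the value returned to the client differs.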
The man-in-the-middle module may also be configured to filter out the parameters in the obtained parameter information other than the necessary parameters; the filtered parameter information is then analyzed to obtain the data to be obtained.
The obtained parameter information is filtered in order to remove irrelevant or non-essential parameters. Irrelevant or non-essential parameters may include advertisements, content unrelated to the data to be obtained, or other content, which is not specifically limited here.
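Filtering can be sketched as keeping only an allow-list of keys. The patent does not enumerate which parameters are necessary, so the key names in `ESSENTIAL_KEYS` and the sample fields such as `ad_banner` are assumptions made for illustration.

```python
# Hypothetical set of parameters treated as necessary.
ESSENTIAL_KEYS = frozenset({"data_id", "url"})


def filter_parameters(parameter_info: dict, essential=ESSENTIAL_KEYS) -> dict:
    """Drop every parameter that is not in the allow-list of necessary keys."""
    return {key: value for key, value in parameter_info.items() if key in essential}


raw = {
    "data_id": "article-42",
    "url": "https://example.invalid/item/article-42",
    "ad_banner": "buy-now",   # advertisement, not needed
    "tracking": "xyz",        # unrelated to the data to be obtained
}
filtered = filter_parameters(raw)
```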
The man-in-the-middle module may also be configured, on receiving a construction request, to construct the obtained parameter information in batches and to send the batch-constructed parameter information to different data acquisition units.
The crawler scheduling module determines whether to perform parameter construction. If not, it proceeds directly to the next step and analyzes the parameter information. If parameter construction is to be performed, the timing of the construction must first be determined; when that timing is reached, a construction request is generated and the obtained parameter information is constructed in batches, so that the data to be obtained can be acquired in batches.
Specifically, the man-in-the-middle module obtains the parameter information and, on receiving a batch construction request, constructs the obtained parameter information in batches; correspondingly, the standard crawler module receives the parameter information from the man-in-the-middle module.
After the man-in-the-middle module has batch-constructed the parameter information, there may be multiple pieces of parameter information. The man-in-the-middle module and the standard crawler modules are connected one-to-many, that is, the man-in-the-middle module sends the multiple pieces of parameter information to different standard crawler modules. For example, the man-in-the-middle module sends the first parameter information to the first standard crawler module, the second parameter information to the second standard crawler module, and so on.
There may also be more than one man-in-the-middle module, and each man-in-the-middle module may correspond to multiple standard crawler modules.
The multiple standard crawler modules analyze the multiple pieces of parameter information, obtain the data and save it, thereby achieving large-scale batch crawling of the data to be obtained.
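The one-to-many distribution from one man-in-the-middle module to several standard crawler modules can be sketched as a round-robin assignment. The text only gives the example of sending the first item to the first module and the second to the second; wrapping around when there are more items than modules is an assumption, and the function name `dispatch` and the crawler identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import cycle


def dispatch(parameter_infos, crawler_ids):
    """Assign each piece of parameter information to a crawler in round-robin order."""
    assignments = defaultdict(list)
    for info, crawler_id in zip(parameter_infos, cycle(crawler_ids)):
        assignments[crawler_id].append(info)
    return dict(assignments)


plan = dispatch(["p1", "p2", "p3"], ["crawler-1", "crawler-2"])
```

Here `plan` maps each crawler identifier to the list of parameter information it should analyze.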
The standard crawler module 44 is configured to analyze the parameter information and obtain the data to be obtained.
In the web crawler device on which the scheme disclosed in this embodiment is based, the parameter information obtained is the decrypted request URL. It is therefore only necessary to analyze the parameter information and then download and save the data, which realizes crawling of client-based data by the web crawler.
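The analyze, download and save step can be sketched as follows. To keep the example self-contained, the download is delegated to a caller-supplied `fetch` function and replaced here by a stub; the function name `analyse_and_save`, the `url` field and the file-naming rule are assumptions and not part of the patent text.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlsplit


def analyse_and_save(parameter_info: dict, fetch, out_dir) -> Path:
    """Parse the request URL, download its content via fetch, and save it to out_dir."""
    url = parameter_info["url"]
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme in {url!r}")
    # Assumed naming rule: use the last path segment as the file name.
    name = Path(parts.path).name or "index"
    target = Path(out_dir) / name
    target.write_bytes(fetch(url))
    return target


def fake_fetch(url: str) -> bytes:
    """Stub download used for illustration; a real crawler would request the URL."""
    return ("content of " + url).encode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    saved = analyse_and_save({"url": "https://example.invalid/item/42"}, fake_fetch, tmp)
    saved_name = saved.name
    saved_bytes = saved.read_bytes()
```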
The data acquisition device disclosed in this embodiment, when data acquisition is required, first determines the client corresponding to the data to be obtained and sends to that client a control instruction containing the identifier of the data to be obtained, so that the client sends to the server a simulated data read instruction containing that identifier, simulating the operation of a real user on the client. The server thus directly receives a simulated data read instruction sent by the client. Because an instruction sent directly by the client carries the request key, this avoids the problem that arises when data is acquired directly by a web crawler: the request instruction sent carries no request key, and the data cannot be read.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar the embodiments may be referred to one another. Because the device disclosed in an embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief; for the relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate clearly the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled persons may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. A data acquisition method, characterized by comprising:
    determining, according to a data acquisition instruction, an identifier of data to be obtained and a client corresponding to the data to be obtained;
    sending to the client a control instruction containing the identifier of the data to be obtained, so that the client, on receiving the control instruction, sends to a server a simulated data read instruction containing the identifier of the data to be obtained;
    obtaining parameter information returned by the server according to the simulated data read instruction;
    analyzing the parameter information to obtain the data to be obtained.
  2. The method according to claim 1, characterized in that obtaining the parameter information returned by the server according to the simulated data read instruction comprises:
    intercepting, by way of a man-in-the-middle attack, the parameter information returned by the server according to the simulated data read instruction.
  3. The method according to claim 1, characterized by further comprising:
    filtering out the parameters in the obtained parameter information other than the necessary parameters, and analyzing the filtered parameter information to obtain the data to be obtained.
  4. The method according to claim 1, characterized by further comprising:
    on receiving a construction request, constructing the obtained parameter information in batches, so as to obtain the data to be obtained in batches.
  5. The method according to claim 1, characterized in that sending to the client the control instruction containing the identifier of the data to be obtained, so that the client, on receiving the control instruction, sends to the server the simulated data read instruction containing the identifier of the data to be obtained, comprises:
    determining a sending timing for the simulated instruction;
    when the sending timing for the simulated instruction is reached, sending to the client the control instruction containing the identifier of the data to be obtained, so that the client, on receiving the control instruction, sends to the server the simulated data read instruction containing the identifier of the data to be obtained.
  6. A data acquisition device, characterized by comprising: a determining unit, a sending unit, a parameter acquisition unit and a data acquisition unit, wherein:
    the determining unit is configured to determine, according to a data acquisition instruction, an identifier of data to be obtained and a client corresponding to the data to be obtained;
    the sending unit is configured to send to the client a control instruction containing the identifier of the data to be obtained, so that the client, on receiving the control instruction, sends to a server a simulated data read instruction containing the identifier of the data to be obtained;
    the parameter acquisition unit is configured to obtain parameter information returned by the server according to the simulated data read instruction;
    the data acquisition unit is configured to analyze the parameter information to obtain the data to be obtained.
  7. The device according to claim 6, characterized in that the parameter acquisition unit is configured to:
    intercept, by way of a man-in-the-middle attack, the parameter information returned by the server according to the simulated data read instruction.
  8. The device according to claim 6, characterized by further comprising a parameter filtering unit, wherein:
    the parameter filtering unit is configured to filter out the parameters in the obtained parameter information other than the necessary parameters, and to send the filtered parameter information to the data acquisition unit.
  9. The device according to claim 6, characterized by further comprising a construction unit, wherein:
    the construction unit is configured, on receiving a construction request, to construct the obtained parameter information in batches and to send the batch-constructed parameter information to different data acquisition units.
  10. A data acquisition device, characterized by comprising: a crawler scheduling module, a client simulation operation module, a man-in-the-middle module and a standard crawler module, wherein:
    the crawler scheduling module is configured to send a data acquisition instruction;
    the client simulation operation module is configured to determine, according to the data acquisition instruction, an identifier of data to be obtained and a client corresponding to the data to be obtained, and to send to the client a control instruction containing the identifier of the data to be obtained, so that the client, on receiving the control instruction, sends to a server a simulated data read instruction containing the identifier of the data to be obtained;
    the man-in-the-middle module is configured to obtain parameter information returned by the server according to the simulated data read instruction;
    the standard crawler module is configured to analyze the parameter information to obtain the data to be obtained.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711237292.0A CN107895042A (en) 2017-11-30 2017-11-30 A kind of data capture method and device

Publications (1)

Publication Number Publication Date
CN107895042A true CN107895042A (en) 2018-04-10

Family

ID=61807055

Country Status (1)

Country Link
CN (1) CN107895042A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110289576A1 (en) * 2009-11-23 2011-11-24 Fred Cheng Rubbing encryption algorithm and security attack safe otp token
CN105743868A (en) * 2014-12-11 2016-07-06 中国科学院声学研究所 Data acquisition system supporting encrypted and non-encrypted protocols and method
CN106789601A (en) * 2017-02-08 2017-05-31 奥秘智能科技(洛阳)有限公司 Universal data collection and supervisor control and method based on wechat public platform

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
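卢杨 (Lu Yang): "A distributed microblog crawler system based on P2P technology", China Master's Theses Full-text Database, Information Science and Technology series *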

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180410
