CN111259224B - Data crawling method and device - Google Patents
Data crawling method and device
- Publication number
- CN111259224B (application CN202010105861.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- code
- error
- key
- error data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/362—Software debugging
- G06F11/3624—Software debugging by performing operations on the source code, e.g. via a compiler
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Storage Device Security (AREA)
Abstract
A data crawling method and device. The method comprises the following steps: crawling data; locating, in the crawled data, the position of the error data and the position of the code corresponding to the error data, and finding the encrypted code; and obtaining a key from the encrypted code and correcting the data in combination with the code corresponding to the error data to obtain the correct data. The method and device recover the original data by analyzing the code corresponding to the data, obtain the key by analyzing the code from which the error data originates, and thereby obtain the correct data.
Description
Technical Field
The application relates to a data crawling method and device.
Background
A web crawler (also called a web spider or web robot) is a program or script that automatically captures information from a network according to certain rules.
At present, many websites block crawlers from acquiring information in order to avoid slow response times and excessive resource consumption. This can be achieved directly by blocking requests. Sometimes, however, complete blocking is not desired and only certain key information is to be kept from crawlers. In that case the key information can be rendered as images, but images are relatively easy to crack with OCR (optical character recognition). Methods that protect important data from crawling by means of encryption have therefore appeared, and these methods greatly reduce the efficiency and accuracy of data crawling.
Disclosure of Invention
To solve the above problem, one aspect of the present application provides a data crawling method comprising the following steps: crawling data; locating, in the crawled data, the position of the error data and the position of the code corresponding to the error data, and finding the encrypted code; and obtaining a key from the encrypted code and correcting the data in combination with the code corresponding to the error data to obtain the correct data. The method recovers the original data by analyzing the code corresponding to the data, obtains the key by analyzing the code from which the error data originates, and thereby obtains the correct data. It should be noted that the present application covers two technical solutions: one finds an error in the final data and then traces back to the position where the error was introduced; the other identifies possible error positions in the original code and corrects them while the data is being acquired, so that the data finally obtained is correct.
Preferably, the encrypted code is a front-end code. The front-end code is the code that precedes the content code. It is generally used for functions such as identification and indexing, and in anti-crawler schemes it is often used to carry out the encryption.
Preferably, the error data is located as follows: during data crawling, the front-end code of the data is located, and the data that has a front-end code is thereby located.
Preferably, the data having a front-end code is the error data. The method screens for encrypted data by screening for front-end codes, and then reverse-decodes the code attached to that data during processing to obtain the correct data.
Preferably, the key is a correct arrangement of the crawled data.
Preferably, the data correction is performed as follows: slicing the error data and then reordering the slices with the key to obtain the correct data.
Preferably, the correct data is placed at the position determined by the front-end code of the error data.
Preferably, the front-end code is a number in the class value of the tag.
Preferably, the encryption method is an obfuscation-based encryption mode of multiple JS types.
In another aspect, the application provides a data crawling device comprising the following modules:
a data acquisition module, which crawls data;
a positioning module, which locates, in the crawled data, the position of the error data and the position of the code corresponding to the error data, and finds the encrypted code;
and a correction module, which obtains the key from the encrypted code and corrects the data in combination with the code corresponding to the error data to obtain the correct data.
The application can bring the following beneficial effects: the original data is recovered by analyzing the code corresponding to the data, the key is obtained by analyzing the code from which the error data originates, and the correct data is thereby obtained. It should be noted that the present application covers two technical solutions: one finds an error in the final data and then traces back to the position where the error was introduced; the other identifies possible error positions in the original code and corrects them while the data is being acquired, so that the data finally obtained is correct.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of an embodiment;
FIG. 2 is a diagram of the key;
FIG. 3 is a diagram of the unordered, scattered data;
FIG. 4 is a diagram of the ordered, correctly combined data;
FIG. 5 is a schematic diagram of another embodiment;
FIG. 6 shows the specific cracking process of the second embodiment.
Detailed Description
In order to clearly explain the technical features of the present invention, the present application will be explained in detail by the following embodiments and the accompanying drawings.
In a first embodiment, as shown in FIG. 1, the method proceeds as follows:
S101, crawling data;
S201, determining which of the crawled data are error data;
S301, tracing the error data to the erroneous code, and finding the encrypted code in the front-end code of the erroneous code;
S401, obtaining the key and correcting the error data accordingly to obtain the final correct data.
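Step S201 can be illustrated with a minimal sketch. It assumes, following the claims, that a record counts as error data when it carries a front-end code, and it assumes the placeholder has the shape of the empty span quoted in this description. The function name `split_records` and the record names are hypothetical.

```python
import re

# Assumed placeholder shape: an empty <span> whose class value embeds a number,
# modelled on the span quoted in the description (hs_kw49_configLn).
PLACEHOLDER = re.compile(r"<span\s+class=['\"]hs_kw\d+_\w+['\"]>\s*</span>")


def split_records(records):
    """Partition crawled records into (clean, suspect).

    A record is treated as suspect (error data) when its markup
    contains at least one placeholder span.
    """
    clean, suspect = {}, {}
    for name, markup in records.items():
        target = suspect if PLACEHOLDER.search(markup) else clean
        target[name] = markup
    return clean, suspect


records = {
    "model": "<span class='hs_kw49_configLn'></span>A8 2019 Plus A8L 50TFSI quattro",
    "note": "plain text without a placeholder",  # hypothetical clean record
}
clean, suspect = split_records(records)
print(sorted(suspect))  # ['model']
```

Only the suspect records need to go through the later decoding steps; the clean ones can be used as crawled.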
As shown in FIG. 1, the referenced web page is https://car.autohome.com.cn/config/series/146.html. As can be seen from FIG. 1, the directly identified result is the model name (A8 2019 Plus A8L 50TFSI quattro, Comfort trim) with the brand name "Audi" missing. Examining the front-end code yields span class="hs_kw49_configfe"; hs_kw49_configfe is then deciphered as "Audi" and placed before "A8", giving "Audi A8 2019 Plus A8L 50TFSI quattro".
The specific decoding process is as follows. The data is crawled, and regular-expression matching is used to extract the required data and filter out irrelevant data, giving <span class='hs_kw49_configLn'></span>A8 2019 Plus A8L 50TFSI quattro. Regular-expression matching on the class value, with a pattern built around hs_kw(\d+)_…, is then used to extract the front-end code <span class='hs_kw49_configLn'></span>, and the number 49 is extracted from it with a regular expression. The sequence number of the encrypted word in front of "A8" is therefore 49.
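The extraction just described can be sketched in a few lines. The sample string is the one quoted above; the regular expressions are assumptions written to fit that sample, since the full pattern is not recoverable from the text.

```python
import re

raw = "<span class='hs_kw49_configLn'></span>A8 2019 Plus A8L 50TFSI quattro"

# Step 1: isolate the front-end code, i.e. the empty placeholder span.
front_end = re.search(r"<span\s+class=['\"]hs_kw\d+_\w+['\"]>\s*</span>", raw)

# Step 2: pull the digits out of the class value to get the sequence number.
number = int(re.search(r"hs_kw(\d+)_", front_end.group(0)).group(1))

# Step 3: everything after the placeholder is the plain content code.
content = raw[front_end.end():]

print(number)   # 49
print(content)  # A8 2019 Plus A8L 50TFSI quattro
```

Keeping the number and the content separate makes it straightforward to put the decoded word back at the position the placeholder occupied.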
According to the coding rule of the encrypted part, regular-expression matching is used to extract the relevant front-end code from the crawled data; irrelevant code is filtered out and the encrypted code is retained. The encrypted code is then reverse-decrypted to obtain a key (shown in FIG. 2) and unordered, scattered data (shown in FIG. 3), and the unordered data is corrected with the key to obtain ordered, correctly combined data (shown in FIG. 4).
Sequence numbers start at 0, and entry 49 of the ordered correct data is exactly "Audi". Substituting "Audi" gives the corrected data: Audi A8 2019 Plus A8L 50TFSI quattro.
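The reordering and substitution can be sketched as follows. The format of the key is not specified in the text, so this sketch assumes it is a list in which `key[i]` gives the position, within the scattered slices, of the i-th word of the ordered table. The three-entry table and the function names are hypothetical stand-ins for the data shown in FIGS. 2 to 4, so the sequence number here is 2 rather than 49.

```python
def reorder(scattered, key):
    """Rebuild the ordered word table from scattered slices.

    Assumed key format: key[i] is the index in `scattered`
    of the word that belongs at ordered position i.
    """
    if sorted(key) != list(range(len(scattered))):
        raise ValueError("key must be a permutation of the slice indices")
    return [scattered[k] for k in key]


def fill_in(sequence_number, content, ordered):
    """Prepend the decoded word; sequence numbers are 0-based."""
    return f"{ordered[sequence_number]} {content}"


scattered = ["quattro", "Audi", "Plus"]  # hypothetical decrypted slices
key = [2, 0, 1]                          # hypothetical arrangement
ordered = reorder(scattered, key)
print(ordered)  # ['Plus', 'quattro', 'Audi']
print(fill_in(2, "A8 2019 Plus A8L 50TFSI quattro", ordered))
# Audi A8 2019 Plus A8L 50TFSI quattro
```

The permutation check is a design choice of the sketch: a key that skips or repeats an index would silently produce a wrong table, so it is rejected early.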
In a second embodiment, as shown in FIG. 5, the method proceeds as follows:
S201, crawling data;
S202, finding the front-end code in the crawled data and assuming by default that the front-end code affects the content code that follows it;
S203, cracking the code to obtain the key and, with it, the corresponding data;
S204, combining the data obtained in step S203 with the data subsequently extracted from the content code to obtain the correct data.
As shown in FIG. 6, the referenced web page is https://car.autohome.com.cn/config/series/146.html. When the front-end code <span class='hs_kw49_configLn'></span> is detected, it is extracted, and the number 49 is obtained from it with a regular expression; the sequence number of the encrypted word in front of "A8" is therefore 49.
According to the coding rule of the encrypted part, regular-expression matching is used to extract the relevant front-end code from the crawled data; irrelevant code is filtered out and the encrypted code is retained. The encrypted code is then reverse-decrypted to obtain a key (shown in FIG. 2) and unordered, scattered data (shown in FIG. 3), and the unordered data is corrected with the key to obtain ordered, correctly combined data (shown in FIG. 4).
Sequence numbers start at 0, and entry 49 of the ordered correct data is exactly "Audi". Substituting "Audi" gives the corrected data: Audi A8 2019 Plus A8L 50TFSI quattro.
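The second embodiment corrects the data while it is being extracted rather than afterwards. One way to sketch this is a single substitution pass in which every placeholder is replaced as soon as it is matched. The word table is assumed to have been decoded already and is represented as a mapping from sequence number to word; the function name, the placeholder pattern, and the trailing space after the inserted word are assumptions of this sketch.

```python
import re

PLACEHOLDER = re.compile(r"<span\s+class=['\"]hs_kw(\d+)_\w+['\"]>\s*</span>")


def extract_with_inline_correction(markup, words):
    """Replace each placeholder with its decoded word in one pass.

    `words` maps sequence numbers to decoded words. Placeholders whose
    number is not in the table are left untouched rather than guessed.
    """
    def substitute(match):
        word = words.get(int(match.group(1)))
        return match.group(0) if word is None else word + " "

    return PLACEHOLDER.sub(substitute, markup)


markup = "<span class='hs_kw49_configLn'></span>A8 2019 Plus A8L 50TFSI quattro"
print(extract_with_inline_correction(markup, {49: "Audi"}))
# Audi A8 2019 Plus A8L 50TFSI quattro
```

Leaving unknown placeholders in place keeps them visible for later inspection instead of silently dropping a word.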
In a third embodiment, a data crawling device comprises the following modules: a data acquisition module, which crawls data; a positioning module, which locates, in the crawled data, the position of the error data and the position of the code corresponding to the error data, and finds the encrypted code; and a correction module, which obtains the key from the encrypted code and corrects the data in combination with the code corresponding to the error data to obtain the correct data.
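The three modules could be arranged roughly as in the sketch below. The class names, the injected `fetch` callable, and the placeholder pattern are all assumptions made for illustration. Deriving the key from the encrypted code is left out: the correction module is simply given a word table that is assumed to have been decoded already.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Located:
    start: int   # offset where the placeholder begins
    end: int     # offset just past the placeholder
    number: int  # sequence number taken from the class value


class AcquisitionModule:
    """Crawls data; `fetch` is injected so the sketch needs no network access."""

    def __init__(self, fetch: Callable[[str], str]):
        self.fetch = fetch

    def crawl(self, url: str) -> str:
        return self.fetch(url)


class PositioningModule:
    """Finds the error positions and the front-end code attached to them."""

    PATTERN = re.compile(r"<span\s+class=['\"]hs_kw(\d+)_\w+['\"]>\s*</span>")

    def locate(self, markup: str) -> List[Located]:
        return [Located(m.start(), m.end(), int(m.group(1)))
                for m in self.PATTERN.finditer(markup)]


class CorrectionModule:
    """Applies an already-decoded word table (number -> word) to the markup."""

    def __init__(self, words: Dict[int, str]):
        self.words = words

    def correct(self, markup: str, located: List[Located]) -> str:
        # Work from the end so earlier offsets stay valid after each splice.
        for loc in reversed(located):
            word = self.words.get(loc.number)
            if word is not None:
                markup = markup[:loc.start] + word + " " + markup[loc.end:]
        return markup


page = "<span class='hs_kw49_configLn'></span>A8 2019 Plus A8L 50TFSI quattro"
acquisition = AcquisitionModule(fetch=lambda url: page)  # stand-in for a real request
markup = acquisition.crawl("https://car.autohome.com.cn/config/series/146.html")
located = PositioningModule().locate(markup)
result = CorrectionModule({49: "Audi"}).correct(markup, located)
print(result)  # Audi A8 2019 Plus A8L 50TFSI quattro
```

Passing the positions from the positioning module to the correction module mirrors the data flow described for the device: one module finds where the errors are, the other repairs them in place.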
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (4)
1. A data crawling method is characterized in that: the method comprises the following steps:
crawling data;
finding and positioning the position of error data and a code position corresponding to the error data from the crawled data and finding an encrypted code; the encrypted code is a front-end code; the front-end code is a number of a class value in the label;
obtaining a key from the encrypted code and correcting the data by combining a code corresponding to the error data to obtain corrected correct data;
the method for locating error data is carried out as follows: in the data crawling process, data extraction is carried out by using regular matching so as to locate the front-end code of the data; filtering the extracted intermediate data to locate error data; the error data is data with a front end code;
the key is obtained from the encrypted code and data correction is carried out by combining with the code corresponding to the error data according to the following modes: reversely decrypting the encrypted code to obtain a key and unordered data corresponding to the encrypted code; the secret key is a correct arrangement mode of crawled data;
slicing the error data to obtain an encryption serial number corresponding to the error data; the encryption sequence number and the unordered data are then reordered using a key to obtain the correct data.
2. The data crawling method according to claim 1, wherein: the correct data is put into the designated position by using the position determined by the front end code of the error data.
3. A data crawling method according to claim 1, characterized in that: the encryption method is an obfuscated encryption mode of multiple JS types.
4. A data crawling apparatus, characterized in that: the system comprises the following modules:
the data acquisition module crawls data;
the positioning module finds and positions the position of the error data and the code position corresponding to the error data from the crawled data and finds the encrypted code; the encrypted code is a front-end code; the front-end code is a number of a class value in the label;
the correction module is used for obtaining a secret key from the encrypted code and correcting data by combining a code corresponding to the error data to obtain corrected correct data;
the method for locating error data is carried out as follows: in the data crawling process, data extraction is carried out by using regular matching so as to locate the front-end code of the data; filtering the extracted intermediate data to locate error data; the error data is data with a front end code;
the key is obtained from the encrypted code and data correction is carried out by combining with the code corresponding to the error data according to the following modes: reversely decrypting the encrypted code to obtain a key and unordered data corresponding to the encrypted code; the secret key is a correct arrangement mode of crawled data;
slicing the error data to obtain an encryption serial number corresponding to the error data; the encryption sequence number and the unordered data are then reordered using a key to obtain the correct data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010105861.1A CN111259224B (en) | 2020-02-20 | 2020-02-20 | Data crawling method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010105861.1A CN111259224B (en) | 2020-02-20 | 2020-02-20 | Data crawling method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111259224A (en) | 2020-06-09 |
CN111259224B (en) | 2023-02-21 |
Family
ID=70954521
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010105861.1A Active CN111259224B (en) | 2020-02-20 | 2020-02-20 | Data crawling method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111259224B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102609666A (en) * | 2012-01-20 | 2012-07-25 | 飞天诚信科技股份有限公司 | Protecting method for packing executable program |
CN103309809A (en) * | 2013-06-21 | 2013-09-18 | 宁夏新航信息科技有限公司 | Intelligent debugging method of computer software |
CN105260193A (en) * | 2015-11-03 | 2016-01-20 | 国云科技股份有限公司 | Self healing frame and healing method of large software |
CN110262784A (en) * | 2019-06-06 | 2019-09-20 | 秒针信息技术有限公司 | A kind of cloud notes implementation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111259224A (en) | 2020-06-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||