CN111259224B - Data crawling method and device - Google Patents

Data crawling method and device

Info

Publication number
CN111259224B
Authority
CN
China
Prior art keywords
data
code
error
key
error data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010105861.1A
Other languages
Chinese (zh)
Other versions
CN111259224A (en)
Inventor
钟琴隆
杜志诚
杜明本
于文才
马强
刘霞
王冬冬
李春勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Banner Information Co ltd
Original Assignee
Shandong Banner Information Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Banner Information Co ltd
Priority to CN202010105861.1A
Publication of CN111259224A
Application granted
Publication of CN111259224B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/362 Software debugging
    • G06F11/3624 Software debugging by performing operations on the source code, e.g. via a compiler

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Storage Device Security (AREA)

Abstract

A data crawling method and device. The method comprises the following steps: crawling data; locating, in the crawled data, the position of the error data and the position of the code corresponding to the error data, and finding the encrypted code; and obtaining a key from the encrypted code and correcting the data in combination with the code corresponding to the error data to obtain the correct data. The original data is recovered by analyzing the code corresponding to the data, and the key is obtained by analyzing the code from which the error data originates, which in turn yields the correct data.

Description

Data crawling method and device
Technical Field
The application relates to a data crawling method and device.
Background
A web crawler (also called a web spider or web robot) is a program or script that automatically fetches information from the web according to defined rules.
At present, many websites prevent crawlers from acquiring information in order to avoid slow responses and excessive resource consumption. Blocking requests outright achieves this directly. Sometimes, however, a site does not want to block crawlers completely and only wants to keep certain key information from being crawled. One option is to render the key information as images, but images are relatively easy to crack with OCR (optical character recognition). Methods that protect important data by encryption have therefore appeared, and these greatly reduce the efficiency and accuracy of data crawling.
Disclosure of Invention
To solve the above problem, one aspect of the present application provides a data crawling method comprising the following steps: crawling data; locating, in the crawled data, the position of the error data and the position of the code corresponding to the error data, and finding the encrypted code; and obtaining a key from the encrypted code and correcting the data in combination with the code corresponding to the error data to obtain the correct data. In the present application, the original data is recovered by analyzing the code corresponding to the data, and the key is obtained by analyzing the code from which the error data originates, which in turn yields the correct data. It should be noted that the present application includes two technical solutions. The first finds an error in the final data and then traces backwards to the position where the error was introduced. The second identifies possible error positions in the original code and corrects them while the data is being acquired, so that the data finally obtained is correct.
Preferably, the encrypted code is a front-end code. The front-end code is the code that precedes the content code. It is generally used for functions such as identification and indexing, and in anti-crawler schemes it is often used to carry the encryption.
Preferably, error data is located as follows: during crawling, the front-end code of the data is located, and the data that carries a front-end code is thereby located.
Preferably, the data having the front-end code is the error data. The method identifies encrypted data by screening for front-end codes and then reverse-decodes the encrypted data while processing the code associated with that data, so as to obtain the correct data.
Preferably, the key is the correct arrangement order of the crawled data.
Preferably, the data correction is performed as follows: the error data is sliced, and the slices are then reordered with the key to obtain the correct data.
Preferably, the correct data is placed at the designated position, which is determined by the front-end code of the error data.
Preferably, the front-end code is the number in the class value of a tag.
Preferably, the encryption method is an obfuscated encryption mode of multiple JS (JavaScript) types.
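The locating rule above (data that carries a front-end code is treated as error data) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the pattern is modelled on the `hs_kw49_configLn` class value quoted later in the description, and the function name `partition_fields` and the second sample field are hypothetical.

```python
import re

# Assumed shape of a front-end code: an empty <span> whose class value
# embeds a number, modelled on 'hs_kw49_configLn' from the description.
FRONT_END_CODE = re.compile(r"<span\s+class=['\"]hs_kw\d+_\w+['\"]>\s*</span>")


def partition_fields(fields):
    """Split crawled fields into (error, clean) by presence of a front-end code."""
    error, clean = [], []
    for text in fields:
        # A field that contains a front-end code is treated as error data.
        (error if FRONT_END_CODE.search(text) else clean).append(text)
    return error, clean


fields = [
    "<span class='hs_kw49_configLn'></span>A8 2019 Plus A8L 50TFSI quattro",
    "plain text without a front-end code",  # hypothetical clean field
]
error, clean = partition_fields(fields)
print(len(error), len(clean))
```

Only the fields in `error` need to go through key recovery and correction; the clean fields can be used as crawled.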
In another aspect, the application provides a data crawling device comprising the following modules:
a data acquisition module, which crawls data;
a positioning module, which locates, in the crawled data, the position of the error data and the position of the code corresponding to the error data, and finds the encrypted code;
and a correction module, which obtains the key from the encrypted code and corrects the data in combination with the code corresponding to the error data to obtain the correct data.
The application can bring the following beneficial effects: the original data is recovered by analyzing the code corresponding to the data, and the key is obtained by analyzing the code from which the error data originates, which in turn yields the correct data. It should be noted that the present application includes two technical solutions. The first finds an error in the final data and then traces backwards to the position where the error was introduced. The second identifies possible error positions in the original code and corrects them while the data is being acquired, so that the data finally obtained is correct.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of an embodiment;
FIG. 2 is a diagram of the key;
FIG. 3 is a diagram of the unordered, scattered data;
FIG. 4 is a diagram of the ordered, correctly combined data;
FIG. 5 is a schematic diagram of another embodiment;
FIG. 6 shows the specific cracking process of the second embodiment.
Detailed Description
In order to clearly explain the technical features of the present invention, the present application will be explained in detail by the following embodiments and the accompanying drawings.
In a first embodiment, as shown in FIG. 1, the method proceeds as follows:
S101, crawling data;
S201, determining which of the crawled data are error data;
S301, tracing the error data to the erroneous code, and finding the encrypted code from the front-end code of the erroneous code;
S401, obtaining the key and making corrections according to the error data to obtain the final correct data.
As shown in FIG. 1, the referenced web page is https://car.autohome.com.cn/config/series/146.html. As can be seen from FIG. 1, the directly identified result is "A8 2019 Plus A8L 50TFSI quattro Comfort", from which the word "Audi" is missing. Examining the front-end code yields <span class="hs_kw49_configfe">; hs_kw49_configfe is then deciphered as "Audi" and placed before "A8", giving "Audi A8 2019 Plus A8L 50TFSI quattro".
The specific decoding process is as follows. The data is crawled, and regular-expression matching is used to extract the required data and filter out irrelevant data, giving <span class='hs_kw49_configLn'></span>A8 2019 Plus A8L 50TFSI quattro. A regular expression that captures the digits following hs_kw, of the form <span\s class=[\'…hs_kw(\d+)_…, is then used to extract the front-end code <span class='hs_kw49_configLn'></span>. Extracting the number from it with a regular expression gives 49, so the serial number of the encrypted word in front of "A8" is 49.
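The two-stage extraction described above (first isolate the front-end code, then pull out its number) might look like the following minimal sketch. The two patterns are assumptions written for this one sample string; they are not the expressions used in the patent, whose full pattern is not legible in the text.

```python
import re

# Sample taken from the description: an empty span followed by the visible text.
html = "<span class='hs_kw49_configLn'></span>A8 2019 Plus A8L 50TFSI quattro"

# Stage 1: isolate the front-end code (the empty <span> element).
span = re.search(r"<span\s+class='[^']*'>\s*</span>", html).group(0)

# Stage 2: capture the digits that follow "hs_kw" inside the class value.
serial = int(re.search(r"hs_kw(\d+)_", span).group(1))

# The text after the span is the part that was crawled correctly.
visible_text = html[len(span):]

print(span)
print(serial)
print(visible_text)
```

Keeping the two stages separate mirrors the description: the span marks where the missing word belongs, and the number says which word it is.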
According to the coding rule of the encrypted part, regular-expression matching is used to extract the relevant front-end code from the crawled data; irrelevant code is filtered out and the encrypted code is retained. Reverse decryption then yields the key (shown in FIG. 2) and the unordered, scattered data (shown in FIG. 3), and the unordered data is corrected with the key to obtain ordered, correctly combined data (shown in FIG. 4).
Sequence numbers start at 0, and entry 49 of the ordered correct data is exactly "Audi". Substituting "Audi" yields the corrected data: Audi A8 2019 Plus A8L 50TFSI quattro.
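The reordering and zero-based lookup can be illustrated with a toy example. The patent does not specify the key's concrete format, so the sketch below assumes the key is a list of positions (`key[i]` is the index in the scattered data of the i-th correct word). The three-word table is a made-up stand-in for FIGS. 2 to 4, so the serial number is 0 here rather than 49.

```python
def reorder(scattered, key):
    """Assumption: key[i] is the position in `scattered` of the i-th correct word."""
    return [scattered[pos] for pos in key]


scattered = ["quattro", "Audi", "TFSI"]  # toy stand-in for the unordered data
key = [1, 2, 0]                          # toy stand-in for the key
ordered = reorder(scattered, key)        # ordered, correctly combined data

serial = 0             # sequence numbers start at 0
word = ordered[serial]

# Put the recovered word back in front of the text that was crawled correctly.
corrected = word + " " + "A8 2019 Plus A8L 50TFSI quattro"
print(ordered)
print(corrected)
```

With the real table, the same lookup with serial number 49 would return the word the description reports for that entry.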
In a second embodiment, as shown in FIG. 5, the method proceeds as follows:
S201, crawling data;
S202, finding a front-end code in the crawled data, and assuming by default that the front-end code affects the content code that follows it;
S203, cracking the code to obtain the key, and from it the corresponding data;
S204, combining the data obtained in step S203 with the data extracted from the subsequent content code to obtain the correct data.
As shown in FIG. 6, the referenced web page is https://car.autohome.com.cn/config/series/146.html. When a front-end code <span class='hs_kw49_configLn'></span> is detected, that front-end code is extracted, and extracting the number from it with a regular expression gives 49, so the serial number of the encrypted word in front of "A8" is 49.
According to the coding rule of the encrypted part, regular-expression matching is used to extract the relevant front-end code from the crawled data; irrelevant code is filtered out and the encrypted code is retained. Reverse decryption then yields the key (shown in FIG. 2) and the unordered, scattered data (shown in FIG. 3), and the unordered data is corrected with the key to obtain ordered, correctly combined data (shown in FIG. 4).
Sequence numbers start at 0, and entry 49 of the ordered correct data is exactly "Audi". Substituting "Audi" yields the corrected data: Audi A8 2019 Plus A8L 50TFSI quattro.
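The second embodiment corrects the data while it is being extracted, instead of tracing back from a wrong result. One way to sketch that in Python is a single substitution pass with a callback. The names `decode_inline` and `table`, the `sep` parameter, and the one-entry table are hypothetical; the table stands in for the ordered data recovered with the key.

```python
import re

# Assumed placeholder shape, modelled on the span quoted in the description.
PLACEHOLDER = re.compile(r"<span\s+class='hs_kw(\d+)_\w+'>\s*</span>")


def decode_inline(html, table, sep=" "):
    """Replace each front-end code with the word its serial number points to."""
    def substitute(match):
        word = table.get(int(match.group(1)))
        if word is None:
            return match.group(0)  # unknown serial number: leave the code untouched
        return word + sep          # separator added only for the English rendering
    return PLACEHOLDER.sub(substitute, html)


table = {49: "Audi"}  # hypothetical excerpt of the ordered data
html = "<span class='hs_kw49_configLn'></span>A8 2019 Plus A8L 50TFSI quattro"
print(decode_inline(html, table))
```

Because the substitution happens in place, the recovered word lands at the position marked by the front-end code, which matches step S204.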
In a third embodiment, a data crawling device comprises the following modules: a data acquisition module, which crawls data; a positioning module, which locates, in the crawled data, the position of the error data and the position of the code corresponding to the error data, and finds the encrypted code; and a correction module, which obtains the key from the encrypted code and corrects the data in combination with the code corresponding to the error data to obtain the correct data.
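The three modules could be wired together along the following lines. This is a structural sketch only: the class names, the injected `fetch` and `recover_table` callables, and the stub values are all assumptions. In particular, the key-recovery step is stubbed out because the description does not give the decryption procedure in code form.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed placeholder shape, modelled on the span quoted in the description.
PLACEHOLDER = re.compile(r"<span\s+class='hs_kw(\d+)_\w+'>\s*</span>")


@dataclass
class DataAcquisitionModule:
    fetch: Callable[[str], str]  # injected so the sketch needs no network access

    def crawl(self, url: str) -> str:
        return self.fetch(url)


class PositioningModule:
    def locate(self, page: str) -> List[int]:
        """Return the serial numbers of all front-end codes found in the page."""
        return [int(m.group(1)) for m in PLACEHOLDER.finditer(page)]


@dataclass
class CorrectionModule:
    recover_table: Callable[[str], Dict[int, str]]  # hypothetical key-recovery step

    def correct(self, page: str) -> str:
        table = self.recover_table(page)

        def substitute(m):
            word = table.get(int(m.group(1)))
            return m.group(0) if word is None else word + " "

        return PLACEHOLDER.sub(substitute, page)


def run_pipeline(url, acquisition, positioning, correction):
    page = acquisition.crawl(url)
    if not positioning.locate(page):
        return page  # no front-end code found, nothing to correct
    return correction.correct(page)


sample = "<span class='hs_kw49_configLn'></span>A8 2019 Plus A8L 50TFSI quattro"
result = run_pipeline(
    "https://car.autohome.com.cn/config/series/146.html",
    DataAcquisitionModule(fetch=lambda url: sample),            # stub fetcher
    PositioningModule(),
    CorrectionModule(recover_table=lambda page: {49: "Audi"}),  # stub table
)
print(result)
```

Injecting the fetcher and the key-recovery function keeps each module testable on its own and leaves the site-specific decryption logic outside the pipeline.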
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (4)

1. A data crawling method is characterized in that: the method comprises the following steps:
crawling data;
finding and positioning the position of error data and a code position corresponding to the error data from the crawled data and finding an encrypted code; the encrypted code is a front-end code; the front-end code is a number of a class value in the tag;
obtaining a key from the encrypted code and correcting the data by combining a code corresponding to the error data to obtain corrected correct data;
the method for locating error data is carried out as follows: in the data crawling process, data extraction is carried out by using regular matching so as to locate the front-end code of the data; filtering the extracted intermediate data to locate error data; the error data is data with a front end code;
the key is obtained from the encrypted code and data correction is carried out by combining with the code corresponding to the error data according to the following modes: reversely decrypting the encrypted code to obtain a key and unordered data corresponding to the encrypted code; the secret key is a correct arrangement mode of crawled data;
slicing the error data to obtain an encryption serial number corresponding to the error data; the encryption sequence number and the unordered data are then reordered using a key to obtain the correct data.
2. The data crawling method according to claim 1, wherein: the correct data is put into the designated position by using the position determined by the front end code of the error data.
3. A data crawling method according to claim 1, characterized in that: the encryption method is an obfuscated encryption mode of multiple JS types.
4. A data crawling apparatus, characterized in that: the system comprises the following modules:
the data acquisition module crawls data;
the positioning module finds and positions the position of the error data and the code position corresponding to the error data from the crawled data and finds the encrypted code; the encrypted code is a front-end code; the front-end code is a number of a class value in the tag;
the correction module is used for obtaining a secret key from the encrypted code and correcting data by combining a code corresponding to the error data to obtain corrected correct data;
the method for locating error data is carried out as follows: in the data crawling process, data extraction is carried out by using regular matching so as to locate the front-end code of the data; filtering the extracted intermediate data to locate error data; the error data is data with a front end code;
the key is obtained from the encrypted code and data correction is carried out by combining with the code corresponding to the error data according to the following modes: reversely decrypting the encrypted code to obtain a key and unordered data corresponding to the encrypted code; the secret key is a correct arrangement mode of crawled data;
slicing the error data to obtain an encryption serial number corresponding to the error data; the encryption sequence number and the unordered data are then reordered using a key to obtain the correct data.
CN202010105861.1A 2020-02-20 2020-02-20 Data crawling method and device Active CN111259224B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010105861.1A CN111259224B (en) 2020-02-20 2020-02-20 Data crawling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010105861.1A CN111259224B (en) 2020-02-20 2020-02-20 Data crawling method and device

Publications (2)

Publication Number Publication Date
CN111259224A CN111259224A (en) 2020-06-09
CN111259224B CN111259224B (en) 2023-02-21

Family

ID=70954521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010105861.1A Active CN111259224B (en) 2020-02-20 2020-02-20 Data crawling method and device

Country Status (1)

Country Link
CN (1) CN111259224B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609666A (en) * 2012-01-20 2012-07-25 飞天诚信科技股份有限公司 Protecting method for packing executable program
CN103309809A (en) * 2013-06-21 2013-09-18 宁夏新航信息科技有限公司 Intelligent debugging method of computer software
CN105260193A (en) * 2015-11-03 2016-01-20 国云科技股份有限公司 Self healing frame and healing method of large software
CN110262784A (en) * 2019-06-06 2019-09-20 秒针信息技术有限公司 A kind of cloud notes implementation method and device

Also Published As

Publication number Publication date
CN111259224A (en) 2020-06-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant