CN111523014A

CN111523014A - Open source data processing method and system based on countermeasure sample

Info

Publication number: CN111523014A
Application number: CN202010337835.1A
Authority: CN
Inventors: 顾钊铨; 廖续鑫; 方滨兴; 王乐; 王新刚; 张川京; 王玥天
Original assignee: Guangzhou University
Current assignee: Guangzhou University
Priority date: 2020-04-24
Filing date: 2020-04-24
Publication date: 2020-08-11

Abstract

The invention discloses an open source data processing method and system based on adversarial samples. The method first decomposes open source data information X into a plurality of indivisible minimum units to form a picture set A. It then generates an adversarial sample picture set D from the picture set A and a recognition model set B, such that the difference between each picture d in D and the corresponding picture a in A is smaller than a preset threshold. Finally, it splices the pictures in D to generate open source data information X' for display by a network front end. With this technical scheme, normal reading and use of the open source data by ordinary users is not affected, yet even if a web crawler captures the data, it has difficulty parsing the information correctly, which raises the difficulty and cost of cracking.

Description

Open source data processing method and system based on adversarial samples

Technical Field

The invention relates to the technical field of computers, and in particular to an open source data processing method and system based on adversarial samples.

Background

Today, with the rapid development of information technology, it is difficult for users to ensure that their data is not easily acquired and used by others. Companies and individuals generally display their data publicly, for example on web pages, but an attacker can easily acquire this open source data with a web crawler.

A web crawler is a program or script that automatically crawls open source data on the World Wide Web according to certain rules. With a web crawler, people can acquire the data they need in batches without manual collection, which saves manpower and material resources. For some important data, such as data that is valuable to a company or private to an individual, the data provider does not want the data to be crawled in batches, which is why anti-crawler strategies are used to defend against web crawlers.

Existing anti-crawler strategies include the following:

(1) Blocking IP addresses, User-Agents, or cookies: the website's operations staff analyze the logs to find recently abnormal IP addresses, User-Agents, or cookies, and block them through a blacklist mechanism.

(2) Verification codes: when a user makes too many requests, the request is automatically redirected to a verification code page, and the user can continue to access the website only after entering the correct code.

(3) JavaScript rendering or asynchronous Ajax transmission: the web developer places important information in the page without writing it into HTML tags. A browser automatically executes the JS code in the <script> tag and displays the information, whereas a crawler that cannot execute JS code cannot read information generated by JS events. Alternatively, when the page is accessed, the server returns only a page frame to the client, and the data is transmitted to the client asynchronously via Ajax during interaction and then displayed on the page, so the content a crawler captures directly is empty.

(4) Raising the difficulty of data recognition: the representation of important, frequently updated data can be changed, for example by splitting text information into a series of pictures, so that even if a web crawler acquires the data, subsequent analysis and processing are difficult.

Although the above anti-crawler strategies can reduce the probability that a crawler obtains data to some extent, they also have a considerable impact on normal users. For example, if the blacklist in the first scheme is poorly configured, normal users are easily blocked by mistake. In addition, a web crawler equipped with machine learning can further analyze the data it captures and thereby bypass the anti-crawler strategy to obtain important data.

Disclosure of Invention

The invention provides an open source data processing method and system based on adversarial samples, so that even when a crawler captures open source data it has difficulty recognizing the data content, which raises the difficulty and cost of cracking.

In order to solve the above technical problem, an embodiment of the present invention provides an open source data processing method based on adversarial samples, including:

decomposing open source data information X to be processed into a plurality of indivisible minimum units to form a picture set A;

generating an adversarial sample picture set D according to the picture set A and a recognition model set B; the recognition model set B comprises a plurality of recognition models used by web crawlers to recognize captured data; the difference between each picture d in the adversarial sample picture set D and the corresponding picture a in the picture set A is smaller than a preset threshold;

and splicing the pictures in the adversarial sample picture set D according to the disassembly order of the open source data information X to generate open source data information X' for display by a network front end.

Further, generating the adversarial sample picture set D according to the picture set A and the recognition model set B specifically comprises:

taking the picture set A and the recognition model set B as input, adding interference noise to the input, and generating the adversarial sample picture set D;

wherein each picture a ∈ A yields a corresponding adversarial sample picture d, with |d - a| ≤ ε, where ε is the preset threshold.
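The constraint above can be written out more explicitly. The following is only one possible reading: it assumes the difference is measured as the largest per-pixel change (suggested by the per-pixel range [-16, 16] given in the detailed description), and the notation f_b for the output of a recognition model b ∈ B is introduced here for illustration and does not appear in the source.

```latex
\text{for each } a \in A:\quad
\text{find } d \ \text{such that}\quad
\lVert d - a \rVert_\infty = \max_i \lvert d_i - a_i \rvert \le \varepsilon
\quad\text{and}\quad
f_b(d) \ne f_b(a)\ \text{ for the models } b \in B .
```

Under this reading, ε bounds how visible the noise is to a human reader, while the second condition expresses the goal that the crawler's recognition models misread d.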

Further, the open source data information X includes: numbers, English text, Chinese text, or pictures;

when the open source data information X is a number, the minimum unit is each digit;

when the open source data information X is English, the minimum unit is each letter;

when the open-source data information X is Chinese, the minimum unit is each Chinese character;

when the open source data information X is a picture, the minimum unit is a picture of a fixed size.

Accordingly, the present invention provides an open source data processing system based on adversarial samples, comprising a disassembling module, an adversarial sample generation module, and a splicing module;

the disassembling module is used for decomposing the open source data information X to be processed into a plurality of indivisible minimum units to form a picture set A;

the adversarial sample generation module is used for generating an adversarial sample picture set D according to the picture set A and a recognition model set B; the recognition model set B comprises a plurality of recognition models used by web crawlers to recognize captured data; the difference between each picture d in the adversarial sample picture set D and the corresponding picture a in the picture set A is smaller than a preset threshold;

and the splicing module is used for splicing the pictures in the adversarial sample picture set D according to the disassembly order of the open source data information X to generate open source data information X' for display by a network front end.

Further, the adversarial sample generation module generating the adversarial sample picture set D according to the picture set A and the recognition model set B specifically comprises:

the adversarial sample generation module takes the picture set A and the recognition model set B as input, adds interference noise to the input, and generates the adversarial sample picture set D; each picture a ∈ A yields a corresponding adversarial sample picture d, with |d - a| ≤ ε, where ε is the preset threshold.

The embodiments of the invention have the following beneficial effects:

The invention provides an open source data processing method and system based on adversarial samples. The method first decomposes open source data information X into a plurality of indivisible minimum units to form a picture set A, then generates an adversarial sample picture set D from the picture set A and a recognition model set B such that the difference between each picture d and the corresponding picture a is smaller than a preset threshold, and finally splices the pictures in D to generate open source data information X' for display by a network front end. In the prior art, a web crawler can process captured data with recognition models and thereby recover the original data. By contrast, the technical scheme of the invention does not affect normal reading and use of the open source data by ordinary users, yet makes it difficult for a web crawler to parse the information correctly even if it captures the data, which raises the difficulty and cost of cracking.

Drawings

FIG. 1 is a schematic flowchart of an embodiment of the open source data processing method based on adversarial samples according to the present invention;

FIG. 2 is a schematic structural diagram of an embodiment of the open source data processing system based on adversarial samples according to the present invention.

Detailed Description

The technical solutions in the present invention will be described clearly and completely with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that, the step numbers in the text are only for convenience of explanation of the specific embodiments, and do not serve to limit the execution sequence of the steps. The method provided by the embodiment can be executed by the relevant server, and the server is taken as an example for explanation below.

Referring to FIG. 1, FIG. 1 is a schematic flowchart of an embodiment of the open source data processing method based on adversarial samples according to the present invention. As shown in FIG. 1, the method includes steps 101 to 103, as follows:

Step 101: Decompose the open source data information X to be processed into a plurality of indivisible minimum units to form a picture set A.

In this embodiment, the open source data information X includes: numbers, English text, Chinese text, or pictures. When X is a number, the minimum unit of disassembly is each digit; when X is English, the minimum unit is each letter; when X is Chinese, the minimum unit is each Chinese character; when X is a picture, the minimum unit is a picture of a fixed size.

In this embodiment, each minimum unit after disassembly corresponds to one picture, and each picture may be set to a fixed size to simplify subsequent data processing. The pictures obtained by disassembly form the picture set A. For an ordinary user, the content can be read normally by the naked eye regardless of whether X or A is displayed at the network front end.
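The disassembly rule can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `split_text` and `tile_picture` are hypothetical, pictures are represented as plain lists of pixel rows, and rendering each character to a fixed-size picture (which would need a font or image library) is left out.

```python
from typing import List

Picture = List[List[int]]  # grayscale picture as rows of pixel values 0..255


def split_text(x: str) -> List[str]:
    """Split text into minimum units: one digit, letter, or Chinese character each."""
    return list(x)


def tile_picture(img: Picture, tile_h: int, tile_w: int) -> List[Picture]:
    """Split a picture into fixed-size tiles, left to right, top to bottom.

    Assumes the picture height and width are multiples of the tile size.
    """
    height, width = len(img), len(img[0])
    if height % tile_h or width % tile_w:
        raise ValueError("picture size must be a multiple of the tile size")
    tiles = []
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            tiles.append([row[left:left + tile_w] for row in img[top:top + tile_h]])
    return tiles


units = split_text("9236")                                  # one unit per digit
image = [[r * 4 + c for c in range(4)] for r in range(2)]   # 2x4 test picture
tiles = tile_picture(image, 2, 2)                           # two 2x2 tiles
```

Keeping the units in their original order matters, because the later splicing step relies on the disassembly order to rebuild X'.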

Step 102: Generate an adversarial sample picture set D according to the picture set A and a recognition model set B. The recognition model set B comprises a plurality of recognition models used by web crawlers to recognize captured data, and the difference between each picture d in the adversarial sample picture set D and the corresponding picture a in the picture set A is smaller than a preset threshold.

In this embodiment, step 102 specifically includes: taking the picture set A and the recognition model set B as input, adding interference noise to the input, and generating the adversarial sample picture set D. Each picture a ∈ A yields a corresponding adversarial sample picture d, with |d - a| ≤ ε, where ε is the preset threshold.

The recognition model set B of this embodiment includes a plurality of recognition models that web crawlers use to recognize captured data. For the web crawler, regardless of whether X or A is displayed, the data it captures can be reduced to elements of the picture set A, which common recognition models convert back into data the crawler can read, so that subsequent processing can be performed. Models that can recognize the picture set A with high accuracy are taken as the model set B.

In this embodiment, each picture d in the generated adversarial sample picture set D and its original picture a must satisfy |d - a| ≤ ε, that is, the added interference noise does not exceed ε, to ensure that an ordinary user can still recognize the information corresponding to d by the naked eye. The value of ε can be adjusted according to the actual situation, thereby adjusting the strength of the adversarial perturbation. Preferably, ε is set so that the perturbation of a single pixel lies within [-16, 16].
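The source does not say which attack generates the noise. As a hedged sketch, the following uses a single signed-gradient step (in the style of the fast gradient sign method) against a set of toy linear classifiers standing in for the model set B. The names `Model`, `predict`, and `perturb` are hypothetical, and reading |d - a| ≤ ε as a per-pixel bound is an assumption consistent with the [-16, 16] range.

```python
from typing import List, Sequence

Vector = List[float]
Model = List[Vector]  # hypothetical linear classifier: one weight vector per class


def scores(model: Model, x: Sequence[float]) -> List[float]:
    """Class scores of a linear model: the dot product of each weight vector with x."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in model]


def predict(model: Model, x: Sequence[float]) -> int:
    s = scores(model, x)
    return max(range(len(s)), key=s.__getitem__)


def perturb(a: Sequence[float], label: int, models: List[Model], eps: float = 16.0) -> Vector:
    """One signed-gradient step that lowers the margin of `label` across all models."""
    grad = [0.0] * len(a)
    for model in models:
        s = scores(model, a)
        # Strongest competing class for this model.
        rival = max((k for k in range(len(s)) if k != label), key=s.__getitem__)
        for i in range(len(a)):
            grad[i] += model[label][i] - model[rival][i]
    d = []
    for a_i, g_i in zip(a, grad):
        step = -eps if g_i > 0 else (eps if g_i < 0 else 0.0)
        # Clipping to the valid pixel range never moves a pixel further than eps from a_i.
        d.append(min(255.0, max(0.0, a_i + step)))
    return d


a = [120.0, 130.0, 125.0, 128.0]                     # flattened toy picture
models = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],    # toy model b1, classes 0 and 1
    [[0.9, 0.1, 0.0, 0.0], [0.0, 1.0, 0.05, 0.0]],   # toy model b2
]
d = perturb(a, label=1, models=models, eps=16.0)
```

In this toy case both models classify `a` as class 1 and `d` as class 0, while no pixel changes by more than 16. Real recognition models are typically neural networks, for which a single step may not be enough and iterative attacks are commonly used instead.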

Step 103: Splice the pictures in the adversarial sample picture set D according to the disassembly order of the open source data information X to generate open source data information X' for display by the network front end.

In this embodiment, the splicing may be performed before the network front end receives the data, or the adversarial sample picture set D may be sent to the network front end, which then splices and displays it. Because X' contains adversarial samples, the web crawler cannot recognize it correctly with the recognition model set B and cannot parse the information in X', whereas an ordinary user can read the content of X' by the naked eye as usual, so normal reading and use are not affected.
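A minimal sketch of the splicing step, assuming the pictures are laid out on a single line from left to right (suitable for a string of digits; picture tiles would need a grid layout instead). The function name `splice` and the list-of-rows picture representation are illustrative choices, not taken from the source.

```python
from typing import List

Picture = List[List[int]]  # grayscale picture as rows of pixel values


def splice(pictures: List[Picture]) -> Picture:
    """Concatenate same-height pictures left to right, preserving list order."""
    if not pictures:
        return []
    height = len(pictures[0])
    if any(len(p) != height for p in pictures):
        raise ValueError("all pictures must have the same height")
    return [[px for p in pictures for px in p[r]] for r in range(height)]


d1 = [[1, 2], [3, 4]]   # 2x2 picture
d2 = [[5], [6]]         # 2x1 picture
x_prime = splice([d1, d2])
```

Whether this runs on the server or in the front end is a deployment choice; the source allows both.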

To better illustrate the flow of steps and principles of the present invention, the following specific examples are presented.

Suppose a company measures the popularity of commodities in a certain industry in a certain city with a specially calculated index and displays it on a web page; each commodity has its own index, which is updated every day. The company wants normal users to be able to read the daily commodity indexes, but does not want a crawler system to crawl the daily data for processing and analysis, so the commodity index data needs protection.

The index data here is the important data information X. The index is numeric information composed of the digits 0 to 9; for example, the index of commodity 1 is 9236, the index of commodity 2 is 1045, and the index of commodity 3 is 5678. Splitting this data into minimum units yields four pictures for commodity 1, namely the pictures for the digits 9, 2, 3, and 6. The pictures for commodities 2 and 3 are obtained in the same way, and together these pictures form the picture set A.

Crawler systems commonly use several digit-image recognition models. These models take the pictures in set A as input and output a classification, and they recognize the pictures in set A accurately with high probability. These models form the model set B.

For the picture set A and the model set B, interference noise not exceeding ε is added to each picture in A to generate the adversarial sample set D. These adversarial samples have little effect on the user, who can still accurately read the corresponding numbers 9236, 1045, and 5678, whereas the models in set B misrecognize them, for example as 5148, 7876, and 6810.
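The two properties claimed in this example (small change for the reader, wrong output from the models) can be checked with a small harness. This is only a sketch: the recognizers below are hypothetical brightness-threshold stand-ins, not real digit classifiers, and the pixel values are made up for illustration.

```python
from typing import Callable, List, Sequence

Picture = Sequence[float]
Recognizer = Callable[[Picture], str]  # hypothetical stand-in for a model in set B


def max_pixel_change(a: Picture, d: Picture) -> float:
    """Largest absolute per-pixel difference between a picture and its perturbed version."""
    return max(abs(d_i - a_i) for a_i, d_i in zip(a, d))


def misread_rate(models: List[Recognizer], pictures: List[Picture], truth: str) -> float:
    """Fraction of (model, picture) pairs whose output differs from the true symbol."""
    wrong = sum(m(p) != t for m in models for p, t in zip(pictures, truth))
    return wrong / (len(models) * len(pictures))


# Toy recognizers: they only look at mean brightness, purely for illustration.
models = [
    lambda p: "9" if sum(p) / len(p) > 128 else "5",
    lambda p: "9" if sum(p) / len(p) > 120 else "5",
]
clean = [[130.0, 130.0, 130.0]]         # picture for the digit "9"
adversarial = [[114.0, 114.0, 114.0]]   # same picture with each pixel lowered by 16

clean_rate = misread_rate(models, clean, "9")
adv_rate = misread_rate(models, adversarial, "9")
within_budget = all(max_pixel_change(a, d) <= 16 for a, d in zip(clean, adversarial))
```

A high misread rate on D together with a bounded pixel change is what the scheme aims for; in practice the rate would be measured against the actual recognition models a crawler is expected to use.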

When the web page displays the index data, for example for commodity 1, the four adversarial sample pictures for '9', '2', '3', and '6' in the picture set D are spliced together and displayed on the web page. This way of displaying information does not affect normal users reading the daily index. For a crawler system, however, even if it bypasses the existing anti-crawler strategies and obtains the data, it still has to perform image recognition on the index pictures. Because the pictures are adversarial samples, they cannot be recognized accurately with the image recognition model set B, so recognition is very difficult, further data processing and analysis are hard to carry out, and the content of the index data is effectively protected.

Therefore, compared with traditional anti-crawler strategies, the open source data processing method based on adversarial samples is more difficult and costly to crack: even if a crawler system acquires the open source data, it has difficulty recognizing the data content correctly. The invention is a general open source data protection method. For data in different forms, such as numbers, English text, Chinese text, and pictures, the processing flow is essentially the same, so the method can protect several different types of data. Furthermore, compared with existing anti-crawler strategies, the scheme interferes little with users and provides a good user experience.

Accordingly, referring to FIG. 2, FIG. 2 is a schematic structural diagram of an embodiment of the open source data processing system based on adversarial samples according to the present invention. As shown in FIG. 2, the system includes a disassembling module 201, an adversarial sample generation module 202, and a splicing module 203.

The disassembling module 201 is configured to decompose the open source data information X to be processed into a plurality of indivisible minimum units to form a picture set A.

The adversarial sample generation module 202 is configured to generate an adversarial sample picture set D according to the picture set A and a recognition model set B. The recognition model set B comprises a plurality of recognition models used by web crawlers to recognize captured data, and the difference between each picture d in the adversarial sample picture set D and the corresponding picture a in the picture set A is smaller than a preset threshold.

In this embodiment, the adversarial sample generation module 202 generates the adversarial sample picture set D as follows: it takes the picture set A and the recognition model set B as input, adds interference noise to the input, and generates the adversarial sample picture set D. Each picture a ∈ A yields a corresponding adversarial sample picture d, with |d - a| ≤ ε, where ε is the preset threshold.
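The way modules 201, 202, and 203 fit together can be sketched as a small pipeline. The class name `OpenSourceDataProcessor` and the stub modules below are hypothetical placeholders chosen to keep the example runnable; a real system would plug in an actual renderer, attack, and layout routine.

```python
from dataclasses import dataclass
from typing import Callable, List

Picture = List[List[int]]


@dataclass
class OpenSourceDataProcessor:
    disassemble: Callable[[str], List[Picture]]                      # module 201
    generate_adversarial: Callable[[List[Picture]], List[Picture]]   # module 202
    splice: Callable[[List[Picture]], Picture]                       # module 203

    def process(self, x: str) -> Picture:
        a = self.disassemble(x)
        d = self.generate_adversarial(a)
        if len(d) != len(a):
            raise ValueError("expected one adversarial picture per original picture")
        return self.splice(d)


# Stub modules: a 1x1 "picture" per character, +1 noise, horizontal splicing.
processor = OpenSourceDataProcessor(
    disassemble=lambda x: [[[ord(ch)]] for ch in x],
    generate_adversarial=lambda a: [[[px + 1 for px in row] for row in p] for p in a],
    splice=lambda d: [[px for p in d for px in p[0]]],
)
x_prime = processor.process("92")
```

Passing the modules in as callables keeps the three stages independent, so the recognition model set B used by module 202 can be updated without touching disassembly or splicing.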

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

It will be understood by those skilled in the art that all or part of the processes of the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

Claims

1. An open source data processing method based on countermeasure samples, comprising:

2. The method for processing open-source data based on countermeasure samples according to claim 1, wherein a countermeasure sample picture set D is generated according to the picture set a and the recognition model set B, specifically:

3. The method for processing open-source data based on countermeasure samples according to claim 1 or 2, wherein the open-source data information X includes: numeric, English, Chinese or pictorial;

4. An open source data processing system based on countermeasure samples, comprising: the system comprises a disassembling module, a countermeasure sample generating module and a splicing module;

5. The system according to claim 4, wherein the confrontation sample generation module is configured to generate a confrontation sample picture set D from the picture set a and the recognition model set B, specifically:

6. The countermeasure-sample-based open-source data processing system of claim 4 or 5, wherein the open-source data information X includes: numeric, English, Chinese or pictorial;