CN111523014A - Open source data processing method and system based on countermeasure sample - Google Patents

Open source data processing method and system based on countermeasure sample

Info

Publication number
CN111523014A
Authority
CN
China
Prior art keywords
source data
picture
data information
open source
picture set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010337835.1A
Other languages
Chinese (zh)
Inventor
顾钊铨
廖续鑫
方滨兴
王乐
王新刚
张川京
王玥天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN202010337835.1A
Publication of CN111523014A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Abstract

The invention discloses an open source data processing method and system based on countermeasure (adversarial) samples. The method first disassembles open source data information X into a plurality of inseparable minimum units to form a picture set A. It then generates a countermeasure sample picture set D from the picture set A and a recognition model set B, such that the difference between each picture d in D and the corresponding picture a in A is smaller than a preset threshold. Finally, it splices the pictures in the countermeasure sample picture set D to generate open source data information X' for display by a network front end. With this technical scheme, ordinary users can still read and use the open source data normally, while a web crawler that captures the data has difficulty parsing the information in it correctly, which raises the difficulty and cost of cracking.

Description

Open source data processing method and system based on countermeasure sample
Technical Field
The invention relates to the technical field of computers, in particular to an open source data processing method and system based on countermeasure samples.
Background
Today, with the rapid development of information technology, it is difficult for users to ensure that their own data is not easily acquired and used by others. Companies and individuals generally display their own data in an open source manner, such as a web page, but due to the existence of a web crawler, an attacker can easily acquire the open source data of the companies and individuals by using the web crawler.
A web crawler is a program or script that automatically crawls open source data on the World Wide Web according to certain rules. With a web crawler, people can acquire the data they need in batches instead of collecting it manually, which saves manpower and material resources. For some important data, such as data valuable to a company or an individual's private data, the data provider does not want the data to be crawled in batches, so anti-crawler strategies that block web crawlers are indispensable.
The existing anti-crawler strategies include the following:
(1) Banning IP addresses, User-Agents or cookies: the operation and maintenance personnel of the web page analyze the logs to find recently abnormal IP addresses, User-Agents or cookies, and block their access through a blacklist mechanism.
(2) Verification codes: when a user makes too many requests, the request is automatically redirected to a verification code page, and the website can be accessed again only after the correct verification code is entered.
(3) JavaScript rendering or asynchronous Ajax transmission: the web developer does not write important information directly into the HTML tags. Instead, the browser executes the JavaScript code in a <script> tag to display the information, so a crawler that cannot execute JavaScript cannot read information generated by JavaScript events. Alternatively, the server returns only a page frame when the page is accessed and transmits the data packets to the client asynchronously through Ajax during the interaction, so the information captured directly by the crawler is empty.
(4) Raising the difficulty of data recognition: the representation of some important, frequently updated data can be changed, for example by splitting text information into a series of pictures, so that even if a web crawler acquires the data, it is difficult to analyze and process afterwards.
Although the above anti-crawler strategies can reduce the probability of a crawler obtaining data to some extent, they also affect normal users considerably. For example, if the blacklist of the first scheme is poorly configured, normal users are easily blocked by mistake. In addition, a web crawler equipped with machine learning can further analyze the data after capturing it and thereby bypass the anti-crawler strategy to obtain important data.
Disclosure of Invention
The invention provides an open source data processing method and system based on countermeasure samples, so that even if a crawler captures open source data, it has difficulty identifying the data content correctly, which raises the difficulty and cost of cracking.
In order to solve the above technical problem, an embodiment of the present invention provides an open source data processing method based on countermeasure samples, including:
decomposing open source data information X to be processed into a plurality of inseparable minimum units to form a picture set A;
generating a countermeasure sample picture set D according to the picture set A and a recognition model set B; the recognition model set B comprises a plurality of recognition models used by web crawlers to recognize captured data; the difference between each picture d in the countermeasure sample picture set D and the corresponding picture a in the picture set A is smaller than a preset threshold;
and splicing the pictures in the countermeasure sample picture set D according to the disassembly order of the open source data information X to generate open source data information X' for display by a network front end.
Further, generating the countermeasure sample picture set D according to the picture set A and the recognition model set B specifically comprises:
taking the picture set A and the recognition model set B as input, and adding interference noise to the input to generate the countermeasure sample picture set D;
wherein for each picture a ∈ A a corresponding countermeasure sample picture d is generated, and |d − a| ≤ ε, where ε is the preset threshold on the added interference noise.
Further, the open source data information X includes: numeric, English, Chinese or pictorial;
when the open source data information X is a number, the minimum unit is each number;
when the open source data information X is English, the minimum unit is each letter;
when the open-source data information X is Chinese, the minimum unit is each Chinese character;
when the open source data information X is a picture, the minimum unit is a picture of a fixed size.
Accordingly, the present invention provides an open source data processing system based on countermeasure samples, comprising: the system comprises a disassembling module, a countermeasure sample generating module and a splicing module;
the disassembling module is used for disassembling the open source data information X to be processed into a plurality of inseparable minimum units to form a picture set A;
the countermeasure sample generation module is used for generating a countermeasure sample picture set D according to the picture set A and a recognition model set B; the recognition model set B comprises a plurality of recognition models used by web crawlers to recognize captured data; the difference between each picture d in the countermeasure sample picture set D and the corresponding picture a in the picture set A is smaller than a preset threshold;
and the splicing module is used for splicing the pictures in the countermeasure sample picture set D according to the disassembly order of the open source data information X to generate open source data information X' for display by the network front end.
Further, the countermeasure sample generation module generates the countermeasure sample picture set D according to the picture set A and the recognition model set B specifically as follows:
the countermeasure sample generation module takes the picture set A and the recognition model set B as input and adds interference noise to the input to generate the countermeasure sample picture set D;
wherein for each picture a ∈ A a corresponding countermeasure sample picture d is generated, and |d − a| ≤ ε, where ε is the preset threshold on the added interference noise.
Further, the open source data information X includes: numeric, English, Chinese or pictorial;
when the open source data information X is a number, the minimum unit is each number;
when the open source data information X is English, the minimum unit is each letter;
when the open-source data information X is Chinese, the minimum unit is each Chinese character;
when the open source data information X is a picture, the minimum unit is a picture of a fixed size.
The embodiment of the invention has the following beneficial effects:
The invention provides an open source data processing method and system based on countermeasure samples. The method first disassembles open source data information X into a plurality of inseparable minimum units to form a picture set A. It then generates a countermeasure sample picture set D from the picture set A and a recognition model set B, such that the difference between each picture d in D and the corresponding picture a in A is smaller than a preset threshold. Finally, it splices the pictures in the countermeasure sample picture set D to generate open source data information X' for display by a network front end. In the prior art, a web crawler can process captured data through a recognition model and thereby recover the original data. By contrast, under the technical scheme of the invention, ordinary users can still read and use the open source data normally, while a web crawler that captures the data has difficulty parsing the information correctly, which raises the difficulty and cost of cracking.
Drawings
FIG. 1 is a flow chart illustrating an embodiment of an open source data processing method based on countermeasure samples according to the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of an open source data processing system based on countermeasure samples according to the present invention.
Detailed Description
The technical solutions in the present invention will be described clearly and completely with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, the step numbers in the text are only for convenience of explanation of the specific embodiments, and do not serve to limit the execution sequence of the steps. The method provided by the embodiment can be executed by the relevant server, and the server is taken as an example for explanation below.
Referring to fig. 1, fig. 1 is a schematic flowchart of an embodiment of an open-source data processing method based on countermeasure samples according to the present invention. As shown in fig. 1, the method includes steps 101 to 103, and each step is as follows:
Step 101: disassemble the open source data information X to be processed into a plurality of inseparable minimum units to form a picture set A.
In this embodiment, the open source data information X includes numbers, English text, Chinese text or pictures. When X is a number, the minimum unit of disassembly is each digit; when X is English text, each letter; when X is Chinese text, each Chinese character; when X is a picture, a picture of a fixed size.
In this embodiment, each minimum unit obtained by disassembly corresponds to one picture, and every picture may be set to a fixed size to simplify subsequent data processing. The pictures obtained by disassembly form the picture set A. For an ordinary user, the content can be read normally by eye whether X or A is displayed at the network front end.
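The disassembly in step 101 can be sketched for the numeric case as follows. This is a minimal illustration, not the patented implementation: the 3x5 bitmap font `FONT`, the nested-list picture representation, and the names `split_units`, `render_unit` and `disassemble` are all assumptions introduced here.

```python
from typing import Dict, List

Picture = List[List[int]]  # grayscale rows: 0 = ink, 255 = background

# Hypothetical 3x5 bitmap font: one 15-character string per digit ('1' = ink).
FONT: Dict[str, str] = {
    "0": "111101101101111", "1": "010110010010111",
    "2": "111001111100111", "3": "111001111001111",
    "4": "101101111001001", "5": "111100111001111",
    "6": "111100111101111", "7": "111001001001001",
    "8": "111101111101111", "9": "111101111001111",
}
WIDTH, HEIGHT = 3, 5  # every picture has the same fixed size


def split_units(x: str) -> List[str]:
    """Split numeric information into its minimum units (single digits)."""
    return list(x)


def render_unit(unit: str) -> Picture:
    """Render one minimum unit as a fixed-size grayscale picture."""
    bits = FONT[unit]
    return [
        [0 if bits[r * WIDTH + c] == "1" else 255 for c in range(WIDTH)]
        for r in range(HEIGHT)
    ]


def disassemble(x: str) -> List[Picture]:
    """Build picture set A, preserving the disassembly order."""
    return [render_unit(u) for u in split_units(x)]


picture_set_a = disassemble("9236")
```

A production system would presumably render units with a real font or crop fixed-size tiles from an image; the order of `picture_set_a` is what later allows the pictures to be spliced back together.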
Step 102: generate a countermeasure sample picture set D according to the picture set A and the recognition model set B; the recognition model set B comprises a plurality of recognition models used by web crawlers to recognize captured data; and the difference between each picture d in the countermeasure sample picture set D and the corresponding picture a in the picture set A is smaller than a preset threshold.
In this embodiment, step 102 specifically includes: taking the picture set A and the recognition model set B as input, and adding interference noise to the input to generate the countermeasure sample picture set D; for each picture a ∈ A a corresponding countermeasure sample picture d is generated, and |d − a| ≤ ε.
The recognition model set B of this embodiment includes a plurality of recognition models that web crawlers use to recognize captured data. For a web crawler, whether X or A is displayed, the captured data are elements of the picture set A, and the crawler restores them through common recognition models into data it can read and process further. These models can recognize the picture set A with high accuracy, and together they form the model set B.
In this embodiment, each picture d in the generated countermeasure sample picture set D and its original picture a must satisfy |d − a| ≤ ε, that is, the added interference noise does not exceed ε, so that an ordinary user can still recognize the information in d by eye. The value of ε can be adjusted according to the actual situation, thereby adjusting the strength of the countermeasure. Preferably, ε is set so that the perturbation of a single pixel lies within [-16, 16].
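The text does not name a specific algorithm for adding the interference noise. As one possible illustration, the sketch below takes a single signed-gradient step (in the style of the fast gradient sign method) against every model in B and keeps each pixel within ±ε of the original. The toy linear models, the per-pixel (L∞) reading of the bound, and the names `scores`, `predict` and `adversarial_picture` are assumptions made here, not part of the source.

```python
from typing import List, Sequence

Picture = List[float]               # flattened grayscale pixels in [0, 255]
Model = Sequence[Sequence[float]]   # toy linear model: one weight row per class


def scores(model: Model, picture: Sequence[float]) -> List[float]:
    """Class scores of a toy linear recognition model."""
    return [sum(w * p for w, p in zip(row, picture)) for row in model]


def predict(model: Model, picture: Sequence[float]) -> int:
    """Index of the highest-scoring class."""
    s = scores(model, picture)
    return max(range(len(s)), key=lambda k: s[k])


def adversarial_picture(a: Picture, model_set_b: Sequence[Model],
                        label: int, epsilon: float = 16.0) -> Picture:
    """One signed-gradient step against all models, bounded by epsilon per pixel."""
    grad = [0.0] * len(a)
    for model in model_set_b:
        s = scores(model, a)
        # Strongest wrong class for this model.
        rival = max((k for k in range(len(s)) if k != label), key=lambda k: s[k])
        # For a linear model, d(score_rival - score_label)/d(pixel) is a weight difference.
        for i in range(len(a)):
            grad[i] += model[rival][i] - model[label][i]
    d = []
    for pixel, g in zip(a, grad):
        step = epsilon * ((g > 0) - (g < 0))          # epsilon * sign(g)
        d.append(min(255.0, max(0.0, pixel + step)))  # keep a valid pixel value
    return d


# Toy data: one 4-pixel picture whose true class is 1, and two models in B.
a = [100.0, 120.0, 90.0, 110.0]
model_set_b = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
]
d = adversarial_picture(a, model_set_b, label=1)
```

Because the original pixels already lie in [0, 255], clipping to that range can only shrink the perturbation, so the bound of ε per pixel still holds after clipping. Real recognition models are nonlinear, and an iterative attack with gradients from an automatic differentiation library would normally replace the closed-form gradient used here.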
Step 103: splice the pictures in the countermeasure sample picture set D according to the disassembly order of the open source data information X to generate open source data information X' for display by the network front end.
In this embodiment, the splicing may be performed before the network front end receives the data, or the countermeasure sample picture set D may be sent to the network front end, which then splices and displays it. Because X' consists of countermeasure samples, a web crawler cannot recognize it correctly through the recognition model set B and therefore cannot parse the information in X', while an ordinary user can still read the content of X' by eye, so normal reading and use are not affected.
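Splicing in step 103 amounts to concatenating the fixed-size pictures in their disassembly order. The following sketch assumes a left-to-right layout and pictures stored as nested lists of pixel rows; the function name `splice` is introduced here for illustration only.

```python
from typing import List

Picture = List[List[int]]  # rows of grayscale pixels


def splice(pictures: List[Picture]) -> Picture:
    """Concatenate equally tall pictures left to right, in list order."""
    if not pictures:
        return []
    height = len(pictures[0])
    if any(len(p) != height for p in pictures):
        raise ValueError("all pictures must share the same height")
    # Row r of the result is row r of every picture, joined in order.
    return [[px for p in pictures for px in p[r]] for r in range(height)]


left = [[1, 2], [3, 4]]
right = [[5, 6], [7, 8]]
spliced = splice([left, right])
```

The same function can run on the server before the data is sent or in the front end after it receives picture set D, matching the two options described above.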
To better illustrate the flow of steps and principles of the present invention, the following specific examples are presented.
Suppose a company measures the popularity of commodities in a certain industry of a certain city by a specially calculated index and displays it on a web page; each commodity has its own index, which is updated daily. The company wants normal users to be able to read the daily commodity indexes, but does not want a crawler system to crawl the company's daily data for processing and analysis, so the commodity index data must be protected.
Here the index data is the important data information X. Each index is numeric information composed of the digits 0 to 9; for example, the index of commodity 1 is 9236, the index of commodity 2 is 1045, and the index of commodity 3 is 5678. Splitting the data information into minimum units yields four pictures for commodity 1, corresponding to the digits 9, 2, 3 and 6; the pictures for commodities 2 and 3 are obtained in the same way. Together these pictures form the picture set A.
Crawler systems commonly use a number of digit image recognition models. These models take the picture set A as input and output a classification, and they recognize the picture set A correctly with high probability. These models form the model set B.
For the picture set A and the model set B, interference noise not exceeding ε is added to each picture in A to generate the countermeasure sample set D. These countermeasure samples have little effect on the user, who can still read the numbers 9236, 1045 and 5678 accurately, whereas the models in B misrecognize them, for example as 5148, 7876 and 6810.
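The two properties described in this example, bounded noise and misrecognition by every model in B, can be expressed as a simple acceptance check. The sketch below is an assumption-laden illustration: recognition models are represented as plain callables returning a class label, the bound is read per pixel, and the names `within_bound`, `fools_all` and `accept` are hypothetical.

```python
from typing import Callable, Sequence

Picture = Sequence[float]
Classifier = Callable[[Picture], int]


def within_bound(a: Picture, d: Picture, epsilon: float) -> bool:
    """True if no pixel of d differs from a by more than epsilon."""
    return max(abs(x - y) for x, y in zip(a, d)) <= epsilon


def fools_all(d: Picture, label: int, model_set_b: Sequence[Classifier]) -> bool:
    """True if every model in B assigns d a label other than the true one."""
    return all(model(d) != label for model in model_set_b)


def accept(a: Picture, d: Picture, label: int,
           model_set_b: Sequence[Classifier], epsilon: float = 16.0) -> bool:
    """A candidate d is usable only if both conditions hold."""
    return within_bound(a, d, epsilon) and fools_all(d, label, model_set_b)


# Stand-in classifier: label 1 if the mean pixel value exceeds 100, else 0.
def bright(picture: Picture) -> int:
    return int(sum(picture) / len(picture) > 100)


original = [110.0, 110.0]
ok = accept(original, [94.0, 94.0], label=1, model_set_b=[bright])
too_noisy = accept(original, [80.0, 80.0], label=1, model_set_b=[bright])
not_fooled = accept(original, [105.0, 105.0], label=1, model_set_b=[bright])
```

Readability for human users is only approximated here by the noise bound; the source relies on that bound for the same purpose.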
When the web page displays the index data of commodity 1, the four countermeasure sample pictures of '9', '2', '3' and '6' in the picture set D are spliced into one picture and displayed on the web page. This way of displaying information does not prevent normal users from reading the daily index, but a crawler system, even if it bypasses the existing anti-crawler strategies and obtains the data, still has to perform image recognition on the index pictures. Because the pictures are countermeasure samples that the image recognition model set B cannot recognize accurately, recognition is very difficult and further processing and analysis are hard to carry out, so the content of the index data is effectively protected.
Therefore, compared with traditional anti-crawler strategies, the open source data processing method based on countermeasure samples is harder and costlier to crack: even if a crawler system can acquire the open source data, it has difficulty identifying the data content correctly. The invention is a general open source data protection method. For data in different forms, such as numbers, English text, Chinese text and pictures, the processing flow is essentially the same, so the method can protect several different types of data. Furthermore, compared with existing anti-crawler strategies, the proposed scheme interferes little with users and provides a good user experience.
Accordingly, referring to fig. 2, fig. 2 is a schematic structural diagram of an embodiment of an open source data processing system based on countermeasure samples according to the present invention. As shown in fig. 2, the system includes a disassembling module 201, a countermeasure sample generation module 202, and a splicing module 203.
The disassembling module 201 is configured to disassemble the open source data information X to be processed into a plurality of inseparable minimum units to form a picture set A.
The countermeasure sample generation module 202 is configured to generate a countermeasure sample picture set D according to the picture set A and the recognition model set B; the recognition model set B comprises a plurality of recognition models used by web crawlers to recognize captured data; and the difference between each picture d in the countermeasure sample picture set D and the corresponding picture a in the picture set A is smaller than a preset threshold.
The splicing module 203 is configured to splice the pictures in the countermeasure sample picture set D according to the disassembly order of the open source data information X to generate open source data information X' for display by the network front end.
In this embodiment, the countermeasure sample generation module 202 generates the countermeasure sample picture set D according to the picture set A and the recognition model set B specifically as follows: the module takes the picture set A and the recognition model set B as input and adds interference noise to the input to generate the countermeasure sample picture set D; for each picture a ∈ A a corresponding countermeasure sample picture d is generated, and |d − a| ≤ ε.
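The three-module structure can be sketched as a small pipeline in which each module is a pluggable callable. The class name `OpenSourceDataProcessor`, its field names, and the stand-in callables in the demo are assumptions for illustration; in particular, the demo's `generate` step only shifts pixel values and is not a real countermeasure sample generator.

```python
from dataclasses import dataclass
from typing import Callable, List

Picture = List[List[int]]


@dataclass
class OpenSourceDataProcessor:
    disassemble: Callable[[str], List[Picture]]          # disassembling module 201
    generate: Callable[[List[Picture]], List[Picture]]   # countermeasure sample generation module 202
    splice: Callable[[List[Picture]], Picture]           # splicing module 203

    def process(self, x: str) -> Picture:
        """Run X through the three modules in order and return X'."""
        picture_set_a = self.disassemble(x)
        picture_set_d = self.generate(picture_set_a)
        return self.splice(picture_set_d)


# Stand-in modules, for wiring only: one 1x1 "picture" per character.
processor = OpenSourceDataProcessor(
    disassemble=lambda x: [[[ord(ch)]] for ch in x],
    generate=lambda pics: [[[px + 1 for px in row] for row in p] for p in pics],
    splice=lambda pics: [[px for p in pics for px in p[0]]],
)
x_prime = processor.process("ab")
```

Keeping the modules as separate callables mirrors the description above, where splicing may happen either on the server or in the network front end.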
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.
It will be understood by those skilled in the art that all or part of the processes of the above embodiments may be implemented by hardware related to instructions of a computer program, and the computer program may be stored in a computer readable storage medium, and when executed, may include the processes of the above embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

Claims (6)

1. An open source data processing method based on countermeasure samples, comprising:
decomposing open source data information X to be processed into a plurality of inseparable minimum units to form a picture set A;
generating a countermeasure sample picture set D according to the picture set A and a recognition model set B; the recognition model set B comprises a plurality of recognition models used by web crawlers to recognize captured data; the difference between each picture d in the countermeasure sample picture set D and the corresponding picture a in the picture set A is smaller than a preset threshold;
and splicing the pictures in the countermeasure sample picture set D according to the disassembly order of the open source data information X to generate open source data information X' for display by a network front end.
2. The method for processing open source data based on countermeasure samples according to claim 1, wherein generating the countermeasure sample picture set D according to the picture set A and the recognition model set B specifically comprises:
taking the picture set A and the recognition model set B as input, and adding interference noise to the input to generate the countermeasure sample picture set D;
wherein for each picture a ∈ A a corresponding countermeasure sample picture d is generated, and |d − a| ≤ ε, where ε is the preset threshold on the added interference noise.
3. The method for processing open-source data based on countermeasure samples according to claim 1 or 2, wherein the open-source data information X includes: numeric, English, Chinese or pictorial;
when the open source data information X is a number, the minimum unit is each number;
when the open source data information X is English, the minimum unit is each letter;
when the open-source data information X is Chinese, the minimum unit is each Chinese character;
when the open source data information X is a picture, the minimum unit is a picture of a fixed size.
4. An open source data processing system based on countermeasure samples, comprising: the system comprises a disassembling module, a countermeasure sample generating module and a splicing module;
the disassembling module is used for disassembling the open source data information X to be processed into a plurality of inseparable minimum units to form a picture set A;
the countermeasure sample generation module is used for generating a countermeasure sample picture set D according to the picture set A and a recognition model set B; the recognition model set B comprises a plurality of recognition models used by web crawlers to recognize captured data; the difference between each picture d in the countermeasure sample picture set D and the corresponding picture a in the picture set A is smaller than a preset threshold;
and the splicing module is used for splicing the pictures in the countermeasure sample picture set D according to the disassembly order of the open source data information X to generate open source data information X' for display by the network front end.
5. The system according to claim 4, wherein the countermeasure sample generation module generates the countermeasure sample picture set D according to the picture set A and the recognition model set B specifically as follows:
the countermeasure sample generation module takes the picture set A and the recognition model set B as input and adds interference noise to the input to generate the countermeasure sample picture set D;
wherein for each picture a ∈ A a corresponding countermeasure sample picture d is generated, and |d − a| ≤ ε, where ε is the preset threshold on the added interference noise.
6. The countermeasure-sample-based open-source data processing system of claim 4 or 5, wherein the open-source data information X includes: numeric, English, Chinese or pictorial;
when the open source data information X is a number, the minimum unit is each number;
when the open source data information X is English, the minimum unit is each letter;
when the open-source data information X is Chinese, the minimum unit is each Chinese character;
when the open source data information X is a picture, the minimum unit is a picture of a fixed size.
CN202010337835.1A 2020-04-24 2020-04-24 Open source data processing method and system based on countermeasure sample Pending CN111523014A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010337835.1A CN111523014A (en) 2020-04-24 2020-04-24 Open source data processing method and system based on countermeasure sample

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010337835.1A CN111523014A (en) 2020-04-24 2020-04-24 Open source data processing method and system based on countermeasure sample

Publications (1)

Publication Number Publication Date
CN111523014A true CN111523014A (en) 2020-08-11

Family

ID=71910897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010337835.1A Pending CN111523014A (en) 2020-04-24 2020-04-24 Open source data processing method and system based on countermeasure sample

Country Status (1)

Country Link
CN (1) CN111523014A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105743901A (en) * 2016-03-07 2016-07-06 携程计算机技术(上海)有限公司 Server, anti-crawler system and anti-crawler verification method
CN110008680A (en) * 2019-04-03 2019-07-12 华南师范大学 System and method is generated based on the identifying code to resisting sample
CN110727934A (en) * 2019-10-22 2020-01-24 成都知道创宇信息技术有限公司 Anti-crawler method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200811