CN113032655A - Method for extracting and fixing darknet electronic data - Google Patents

Method for extracting and fixing darknet electronic data

Info

Publication number
CN113032655A
CN113032655A (application CN202110399112.9A)
Authority
CN
China
Prior art keywords
website
electronic data
webpage
darknet
crawling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110399112.9A
Other languages
Chinese (zh)
Inventor
汤艳君
安俊霖
明泰龙
刘丛睿
张一鸣
刘俊泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Criminal Police University
Original Assignee
China Criminal Police University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Criminal Police University
Priority to CN202110399112.9A
Publication of CN113032655A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/955 - Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

The invention provides a method for extracting and fixing darknet electronic data. Darknet webpage data and transaction data are extracted and fixed with web crawler technology, an integrity check value is calculated for the electronic data, and the results are finally presented in visual form on the front end of a darknet online forensics system. The method crawls the webpages of Tor-based illegal darknet marketplaces and extracts and fixes the text of the illegal transactions on those sites; the crawled clues or evidence files are stored in a database and can be presented to case investigators in visual form. The whole crawling and fixing process does not disturb the normal operation of the darknet website, offers high accuracy and concealment, and provides an effective forensics method for illegal darknet marketplaces. It can therefore greatly reduce the manual work of investigators and improve their day-to-day efficiency.

Description

Method for extracting and fixing darknet electronic data
Technical Field
The invention belongs to the technical field of electronic data forensics and relates to a method for extracting and fixing darknet electronic data.
Background
The anonymity of the darknet makes users' personal information difficult to trace and their communications difficult to monitor, so the darknet has become a gathering place for criminals engaged in activities endangering social and national security. Illegal transactions are numerous and span many fields; common goods include citizens' personal information and narcotics. They cause serious harm to public security, and darknet crime is difficult to crack down on.
With the development of the darknet, the number of cases involving illegal darknet websites keeps rising. The criminal facts are no longer limited to trading citizens' information, and the cases are complex and hard to analyze; faced with illegal websites containing many pages, tedious screen-capture and video evidence collection has become a major problem for police officers handling cases. Investigators face the following main difficulties in darknet forensics: (1) a large number of illegal darknet websites have very short life cycles; (2) after committing a crime, criminals can quickly delete their online traces; (3) the complex, massive data of darknet websites is hard to store and manage; (4) with the crackdown on darknet crime only beginning, marketable forensics products aimed specifically at the darknet are lacking.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a method for crawling darknet websites and fixing the text of transaction information.
The invention provides a method for extracting and fixing darknet electronic data, which comprises the following steps:
step 1, writing a crawler for the darknet website and crawling the darknet website with a browser testing framework;
step 2, fixing the transaction text information of the darknet website;
step 3, visually presenting the forensics results.
In the method for extracting and fixing darknet electronic data, the domain name address of the darknet website is obtained as follows:
(1) Google syntax search: search with Google advanced syntax, obtain the Baidu search result links related to darknet domain names, and save them to a local txt document;
(2) regular expression: according to the form of Tor network domain names, collect domain name addresses with a regular expression, matching URLs with the following pattern:
^https?:\/\/(([a-z0-9_-]{1,64}\.){0,4}[a-z0-9=]{16}\.onion)(:|\/|$)
The Tor domain name addresses collected in this way are saved to a url.txt document.
In the method for extracting and fixing darknet electronic data, crawling the darknet website with a browser testing framework in step 1 is specifically: for the structure of darknet websites, the darknet crawler crawls with the Selenium automated testing framework; Selenium tests run directly in the browser and support recording actions and automatically generating test scripts in languages such as .NET, Java and Perl.
In the method for extracting and fixing darknet electronic data, the darknet webpage crawling in step 1 specifically comprises:
step 1.1: webpage search: first access the target website entered by the user with Selenium, then search webpages from the address the user provides; the search traverses with a breadth-first algorithm, taking the entered domain name of the main website as the entrance and first-level node, then crawling all nodes on the second level, and so on level by level until the un-crawled queue is empty or the configured depth is reached, extracting the list of unvisited URLs to be crawled;
step 1.2: webpage analysis: analyze the webpage content, take the downloaded webpage document string as input, extract the required data and the unvisited URL list to be crawled, and store them in a MySQL database;
step 1.3: webpage download: use a web crawler built on Python's Selenium automated testing framework to capture and fix screenshots of the darknet website, taking full-page screenshots of the webpage data with Selenium's save_screenshot() function;
step 1.4: electronic data preservation: run an integrity check on the extracted screenshots and calculate the electronic data integrity check value;
step 1.5: data storage: store the downloaded screenshots and integrity check values locally or on a server, as the user requires;
step 1.6: extract new URLs from the extracted-and-fixed website table in the database.
In the method for extracting and fixing darknet electronic data, step 1.4 calculates the electronic data integrity check value with a hash algorithm, specifically:
(1) the electronic data file is opened in binary mode, i.e. with the 'b' flag;
(2) SHA-1 is used to compute the hash value; a SHA-1 digest is 20 bytes long and is usually represented as a 40-character hexadecimal string.
In the method for extracting and fixing darknet electronic data, step 2 specifically comprises:
step 2.1: automatic registration on the darknet website: configure according to the website's account registration rules, register multiple accounts to send cookie-bearing requests, and defeat per-cookie request frequency limits;
step 2.2: automatic login to the darknet website: simulate login with Python's Requests module, create a Session() through the Requests library to manage cookies automatically, submit the cookie to the server together with the user's verification code, send the login request with session.post(), and complete the automatic login;
step 2.3: IP ban prevention: because the target site may impose access restrictions such as UA checks and access-frequency limits, write a Tor auto-switching script in Python so that Tor switches automatically once the IP is judged to be banned;
step 2.4: crawling transaction data: browse the website homepage categories and the sub-categories under each one with the Tor browser, obtain and iterate over all category types, substitute each category's qeaid, page number, and other parameters into a template URL to capture the maximum page number and the detail list of the column, and then keep turning pages and iterating until all pages and categories have been crawled.
In the method for extracting and fixing darknet electronic data, step 3 is specifically: visually display the extracted darknet webpage screenshots, electronic data check values, and darknet webpage addresses in a Django backend, after installing and deploying a Django REST backend. The key deployment steps are:
(1) Create a screen_notes project, install the PyMySQL database driver, and change the database parameters; after configuration, create a screen_db database in MySQL, and after running the database migration command, inspect the created tables in MySQL.
(2) Create a worker app, define the model in models.py, modify the model file, and run the migration command.
(3) After configuring the URLs, start the local server.
The method of the invention for crawling darknet websites and fixing transaction information text has the following beneficial effects:
1) Darknet websites can change domain names at any time, and their servers can be moved or shut down at any time; the storage states of the electronic data are complex and its forms are diverse, so handling darknet websites differs greatly from handling traditional evidence. The darknet electronic data extraction-and-fixing method of the invention can extract and fix illegal darknet websites.
2) When public security authorities use traditional webpage screen capture to fix clues or evidence such as the amounts involved in illegal transactions or the hierarchy of a criminal group, the sentences relevant to the illegal transactions usually have to be filtered out manually, which severely affects the timeliness of case detection. The darknet electronic data extraction-and-fixing method of the invention can extract and fix the transaction data in a darknet case according to the needs of the public security authorities and display it in visual form, improving the efficiency of darknet investigation and case handling.
Drawings
FIG. 1 is the registration interface of the darknet Chinese forum;
FIG. 2 shows the obtained registration verification code of the darknet Chinese forum;
FIG. 3 shows the automatic registration result of the darknet Chinese forum;
FIG. 4 is the table saving the registered-account test results;
FIG. 5 shows an actual registered-account test result;
FIG. 6 shows the obtained login verification code of the darknet Chinese forum;
FIG. 7 shows the automatic login result of the darknet Chinese forum;
FIG. 8 shows the test result of the Tor auto-switching script;
FIG. 9 shows the display of the extracted and fixed darknet webpage data;
FIG. 10 shows the MySQL data source configuration;
FIG. 11 shows the prices of items sold by darknet Chinese forum sellers.
Detailed Description
To better explain the technical scheme of the invention, the related prior art and its shortcomings are briefly introduced.
1) Tor anonymous communication system: Tor (The Second-Generation Onion Router) is a second-generation onion routing system composed of a set of onion routers (also called Tor nodes). It not only provides anonymous client communication but also helps users surf the Internet anonymously and protects their privacy. Tor offers hidden services: when a user engages in activities such as instant messaging, Tor helps hide the user's physical address. Even when only the domain name of a hidden service is known, other users can still access that hidden service smoothly through a rendezvous point (RP).
2) Web crawler technology: a web crawler is a program or script that automatically crawls network information according to certain rules. Depending on the user's needs, it can crawl from specified URLs and all URLs associated with them, and finally extract the required data from the pages at those URLs. A web crawler system consists mainly of three parts: webpage access, webpage information crawling, and storage of the acquired data.
3) Webpage search algorithms
For a web crawler facing the huge Internet, there are many ways to crawl, and traversing all nodes quickly and without repetition depends on the crawling strategy. The URL a web crawler fetches next is decided by its crawling algorithm, so the algorithm used for node crawling is one of the important research topics of web crawlers. The main current search strategies are depth-first search and breadth-first search.
Depth-first crawling gives priority to depth. Because of the hierarchical design of websites, a website can be compared to a binary tree: its child links are the nodes and the start URL is the root. The depth-first algorithm takes the entered domain name of the main website as the entrance, checks that each link belongs to the same website (to keep the crawl from wandering off-site into unlimited attempts), crawls all matched sub-domains (sub-domain_1, sub-domain_2, ...), then takes the crawled sub-domain_1 as a new entrance and continues crawling all sub-domains matched under it until the end is reached. Starting from the root node, the crawler moves down along one path; if a node has unvisited children it keeps descending until every node on that path has been traversed, then returns to the start URL and continues along another path, repeating until the un-crawled queue is empty. For the binary tree structure assumed here, the depth-first crawling route is A-B-D-G-E-H-C-F-I. Depth-first crawling is simple and can analyze deep inside a website's pages. However, recursing endlessly or too deeply can trap the crawler, and because the breadth of the website structure is ignored, the crawled content may be too narrow.
The breadth-first algorithm crawls level by level. It is implemented essentially with a queue: the root node is the first level, then all nodes on the second level are crawled, and so on until the un-crawled queue is empty or the configured number of levels is reached. Taking the same structure as an example, and assuming the leftmost node is crawled first each time, the breadth-first route is A-B-C-D-E-F-G-H-I. Compared with depth-first, breadth-first is simpler: as long as the website structure is understood, level-by-level crawling suffices, which makes it well suited to websites with a simple hierarchy. For a large website with a deep, complex structure, however, this strategy struggles to reach deeper pages while fetching many irrelevant ones, lowering crawl efficiency.
In the crawler design for darknet websites, the breadth-first algorithm has the following advantages over the depth-first algorithm (a minimal sketch follows this list):
First, websites are structured so that the more important pages are placed closer to the start URL, while pages buried deeper are less important.
Second, website designs keep page depth modest, so some short path to any page always exists, and breadth-first guarantees the node holding that page is reached as quickly as possible.
Third, to improve efficiency, crawlers usually fetch with multiple threads or processes, and breadth-first makes cooperation among the crawler threads easier.
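As a concrete illustration of the breadth-first strategy adopted here, the following is a minimal Python sketch of a level-by-level crawl. It is a sketch under stated assumptions, not the exact implementation of the invention: the depth limit, the caller-supplied fetch function, and the use of PyQuery for link extraction are illustrative choices.

from collections import deque
from urllib.parse import urljoin, urlparse
from pyquery import PyQuery as pq

def bfs_crawl(start_url, fetch, max_depth=3):
    # Traverse the site level by level until the un-crawled queue is
    # empty or the configured depth is reached.
    seen = {start_url}                   # de-duplication set
    queue = deque([(start_url, 0)])      # (url, depth) pairs
    pages = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)                # caller-supplied downloader
        pages[url] = html
        if depth >= max_depth:
            continue
        for a in pq(html)('a').items():  # extract child links
            link = urljoin(url, a.attr('href') or '')
            # stay on the same website to avoid crawling outward
            if urlparse(link).netloc == urlparse(start_url).netloc and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages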
The method of the invention for extracting and fixing darknet electronic data suits the following computer system environment:
The logical architecture is a B/S architecture. A system administrator logs in through a browser to examine the extraction-and-fixing results for the darknet webpage data, the electronic data integrity check values, and the darknet transaction data.
The front end is visualized mainly with the Grafana front-end framework, and the back end is developed with the Django REST Framework. The extracted and fixed data of the darknet online system is stored in the MySQL database, providing data support for the business logic. Grafana handles the interface layout and style design; Webpack packages the files, compressing the many CSS and JS files involved in the front end and simplifying development through modularization; Vue updates and renders the pages by operating on the data.
The back end adopts the Django REST Framework, whose advantage is that Python code runs directly inside the Django program, which is convenient for acquiring network data. It also guarantees the data transfer between the front and back ends and the dependencies among managed objects. After development is complete, the project is packaged and deployed to a server, with Caddy providing load balancing, reverse proxying, static file handling, and similar services.
Data storage mainly uses a MySQL database to persist user information, darknet webpage data, darknet transaction data, electronic data integrity check values, and other information.
In the method of the invention for extracting and fixing darknet electronic data, a Tor environment is deployed so that the crawler program can crawl the darknet effectively. Tor (The Second-Generation Onion Router) is deployed in the system; it provides anonymous client communication and hides the user's physical address during activities such as instant messaging. Tor uses the Socks5 protocol, and Python's Requests and Selenium modules issue the GET and POST requests. Because the Requests module cannot resolve .onion addresses through a conventional DNS server, the addresses are resolved through the Tor network; the proxy is configured in the program's .py file.
Install the command-line proxy tool Proxychains. Since Tor and Proxychains can build a Socks proxy chain of IP stepping stones, anonymity is achieved during crawler tests. Because version 3 of the software does not support excluding specified addresses, Proxychains version 4.0 needs to be installed. After installation, edit /etc/proxychains.conf, change the socks5 address to socks5 127.0.0.1 9150, and add a localnet entry of the proxy IP with mask 255.255.255.255.
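A minimal sketch of the Requests-side proxy configuration just described (the exact .py file of the program is not reproduced here): the socks5h scheme makes Tor itself resolve the .onion name, since a conventional DNS server cannot. The port and target address are illustrative assumptions, and requests[socks] (PySocks) must be installed.

import requests

TOR_PROXIES = {
    # "socks5h" (not "socks5") delegates DNS resolution to Tor,
    # which is what allows .onion addresses to resolve
    "http":  "socks5h://127.0.0.1:9150",
    "https": "socks5h://127.0.0.1:9150",
}

# hypothetical onion address, for illustration only
resp = requests.get("http://exampleonionsite.onion/", proxies=TOR_PROXIES, timeout=60)
print(resp.status_code)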
The extraction of darknet webpage data uses an active acquisition mode: acquisition starts on a command entered by the user, which reduces resource usage and adds flexibility. The extraction and fixing of darknet transaction data uses a passive acquisition mode: after a darknet website is added, the data the user requires is acquired and sent to the server at a specified interval.
The method of the invention for extracting and fixing darknet electronic data comprises the following steps:
Step 1, crawling darknet websites: to collect darknet data, a darknet website crawler is written and crawls with a browser testing framework, implementing the download and analysis of the webpages of illegal darknet transaction websites;
The domain name addresses of darknet websites are obtained mainly through the following two schemes:
(1) Google syntax search
To continuously acquire and monitor the data published on clearnet websites, a collection program for the Baidu result links was developed: searches are made with Google advanced syntax, and the Baidu search result links related to darknet domain names are acquired and saved to a local txt document.
(2) Regular expression
Based on the form of Tor network domain names, domain name addresses can be collected with a regular expression; URLs are matched with the following pattern:
^https?:\/\/(([a-z0-9_-]{1,64}\.){0,4}[a-z0-9=]{16}\.onion)(:|\/|$)
The Tor domain name addresses collected in this way are saved to a url.txt document.
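A minimal sketch of this collection step, assuming the candidate links have already been saved line by line to a local text file as described above; the input and output file names are illustrative:

import re

# the URL pattern given above, compiled for line-by-line matching
ONION_RE = re.compile(
    r"^https?:\/\/(([a-z0-9_-]{1,64}\.){0,4}[a-z0-9=]{16}\.onion)(:|\/|$)",
    re.MULTILINE,
)

def collect_onion_domains(in_path="links.txt", out_path="url.txt"):
    with open(in_path, encoding="utf-8") as f:
        text = f.read()
    # group(1) is the bare .onion domain name
    domains = sorted({m.group(1) for m in ONION_RE.finditer(text)})
    with open(out_path, "a", encoding="utf-8") as f:
        for d in domains:
            f.write(d + "\n")
    return domains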
Crawling the darknet website with a browser testing framework in step 1 of the invention is specifically: for the structure of darknet websites, the darknet crawler crawls with the Selenium browser automated testing framework; Selenium tests run directly in the browser and support recording actions and automatically generating test scripts in languages such as .NET, Java and Perl.
Crawling a single darknet page and crawling darknet pages in batches in step 1 of the invention specifically comprises the following steps:
Step 1.1: webpage search. First access the target website entered by the user with Selenium, then search webpages from the address the user provides. To extract all sub-sites and webpages under the darknet website, the search traverses with a breadth-first algorithm: the entered domain name of the main website serves as the entrance and the first-level node, then all nodes on the second level are crawled level by level, and the URL list to be crawled is extracted until the un-crawled queue is empty or the configured depth is reached. In this step, Python's PyQuery library builds a DOM tree from the HTML page, and the nodes are searched and traversed in tree form. The DOM is a standard API for handling HTML and XML documents: it models the whole document as a tree in which every node represents an HTML tag or a text item inside a tag. The DOM tree structure accurately describes the relations among the tags of an HTML document and makes it convenient to locate each element through its parent-child relations.
The generated URL information is stored in the extracted-and-fixed website table.
Then URL deduplication is performed by checking the database: each crawled page is stored in the database, and before each store the database is traversed to see whether the page already exists (i.e. whether it has been crawled). If it exists, it is not stored; otherwise it is stored, and the process continues with the next page until the end.
Step 1.2: webpage analysis. The data in the webpage content is analyzed: the webpage document string downloaded by the download module is the input, and the required content plus the unvisited URL list to be crawled is extracted. The main job of the webpage storage module is writing files; it is really a part of webpage processing: after a page's data is crawled, the analysis module extracts the required data, which is then stored in the MySQL database. The webpage analysis module is the tool that analyzes the webpage content: Python's PyQuery library, whose syntax strictly follows the jQuery specification, builds a DOM tree from the HTML page and finally locates each element by searching and traversing the nodes.
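A minimal sketch of this analysis step with PyQuery, taking the downloaded HTML string as input; the selectors are generic placeholders, since a real extraction would use site-specific ones:

from pyquery import PyQuery as pq

def parse_page(html):
    doc = pq(html)                # DOM tree built from the HTML string
    return {
        "title": doc("title").text(),
        "text":  doc("body").text(),
        # unvisited candidate links for the crawl queue
        "links": [a.attrib.get("href") for a in doc("a") if a.attrib.get("href")],
    }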
Step 1.3: webpage download. For the extraction and fixing of darknet webpage data, a web crawler built on Python's Selenium automated testing framework captures and fixes screenshots of the illegal darknet website, taking full-page screenshots of the webpage data with Selenium's save_screenshot() function.
Downloading the illegal darknet webpages to a local computer is the core method of darknet electronic data forensics. The invention selects the PhantomJS headless browser, which has no graphical interface; the static resources it cannot download, such as images and CSS, can instead be crawled with Python's Requests library.
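A minimal sketch of this screenshot-fixing step. Headless Firefox is used here in place of PhantomJS (which is no longer maintained); the Firefox preferences route traffic through the local Tor Socks5 port, and the URL and window width are illustrative assumptions:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")
# route the browser through the local Tor Socks5 port
options.set_preference("network.proxy.type", 1)
options.set_preference("network.proxy.socks", "127.0.0.1")
options.set_preference("network.proxy.socks_port", 9150)
options.set_preference("network.proxy.socks_remote_dns", True)

driver = webdriver.Firefox(options=options)
try:
    driver.get("http://exampleonionsite.onion/")   # hypothetical address
    # stretch the window to the full document height for a full-page capture
    height = driver.execute_script("return document.body.scrollHeight")
    driver.set_window_size(1366, height)
    driver.save_screenshot("page.png")
finally:
    driver.quit()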
Step 1.4: electronic data preservation. An integrity check is run on the extracted screenshots, and the electronic data integrity check value is calculated. To prevent the electronic data from being tampered with or destroyed, it is processed with a specific algorithm such as a hash algorithm, yielding a data value that verifies the data's integrity.
Electronic data preservation is a very important functional step of the invention. Because electronic data is easily tampered with and its storage carries security risks, an integrity check value is calculated for the extracted and fixed darknet webpage data; this guarantees the integrity of the evidence and helps investigators fill in the electronic data extraction-and-fixing list.
The invention chooses SHA-1 as the hash calculation; a SHA-1 digest is 20 bytes long, usually expressed as a 40-character hexadecimal string. A hash algorithm is one-way: from the target information it generates a unique hash value of a certain length, but the target information cannot be recovered from the hash value, i.e. it is a one-way mapping. The hash algorithm uses a hash function as its calculation method; a hash function maps data of any size to data of a fixed size.
The hash function adopted by the invention has the following characteristics:
(1) Easy compression: for an input M of any size, the hash value is short; in practice, the length of the hash value produced by the function H is fixed.
(2) Easy computation: for any given message, computing its hash value is easy.
(3) One-wayness: given a hash function H and a hash value H(M), it is computationally difficult to derive M; even for the hash algorithms that have been broken, breaking them takes a very long time, so the hash function remains relatively secure.
(4) High sensitivity: viewed bit by bit, a 1-bit change of the input flips about half of the output bits. Any change to the message M changes the hash value H(M); in other words, even slightly different inputs certainly produce different outputs after hashing.
When the method calculates the integrity check value of a darknet webpage screenshot, the following points must be noted:
(1) The electronic data file must be opened in binary mode, i.e. with the 'b' flag; otherwise Python opens the file in text mode when calculating the integrity check value and ultimately produces a wrong value.
(2) The hash object's digest() and hexdigest() methods return the 20-byte binary digest and the 40-character hexadecimal string, respectively.
Finally, each calculated integrity check value is matched one-to-one with its webpage screenshot.
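A minimal sketch of the check-value calculation under the two rules above: the screenshot file is opened in binary ('rb') mode and hashed with SHA-1 in chunks; the file name is illustrative:

import hashlib

def sha1_of_file(path, chunk_size=8192):
    h = hashlib.sha1()
    with open(path, "rb") as f:          # binary mode, never text mode
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()                 # 40-character hex string (20-byte digest)

print(sha1_of_file("page.png"))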
Step 1.5: data storage. The downloaded screenshots and integrity check values are stored locally or on a server, as the user requires.
If the user chooses to store the screenshot information and integrity check value on the server, they are saved in the configured server-side MySQL database.
If local saving is chosen, a local folder is created and the screenshots and integrity check values are saved on the user's computer.
Meanwhile, to record the capture time of the webpage, the datetime library obtains the current local system time, which is stored in the database.
Step 1.6: extract new URLs from the extracted-and-fixed website table in the database. First it is judged whether a URL has already been crawled: the crawler removes repeatedly crawled URLs, since re-crawling the same page hurts crawler efficiency and produces redundant data. Crawlers usually place the URLs to be crawled in a queue; new URLs extracted from a crawled page are first checked against the already-crawled set, and only URLs that have not been crawled are put into the queue. If a URL has been crawled, the next URL is checked; if not, the cycle returns to the webpage-crawling step.
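A minimal sketch of this de-duplication check against the database, assuming a PyMySQL connection; the extracted_websites table and its url column are names assumed here for illustration, not the invention's actual schema:

import pymysql

def is_new_url(conn, url):
    with conn.cursor() as cur:
        cur.execute("SELECT 1 FROM extracted_websites WHERE url = %s LIMIT 1", (url,))
        return cur.fetchone() is None

def enqueue_if_new(conn, queue, url):
    if is_new_url(conn, url):            # only un-crawled URLs enter the queue
        with conn.cursor() as cur:
            cur.execute("INSERT INTO extracted_websites (url) VALUES (%s)", (url,))
        conn.commit()
        queue.append(url)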
Step 2, fixing the transaction text information of the darknet website: for illegal darknet transaction websites that require login, the crawler is customized to the forensics personnel's needs, implementing automatic registration, automatic login, IP ban prevention, crawling of the transaction data text, and fixing of that text into the database;
The text information is specifically:
(1) Webpage data extraction-and-fixing information
The webpage data extraction-and-fixing information table mainly comprises fields such as the hash value, the darknet webpage address, the darknet webpage screenshot, and the data acquisition time; the details of each field are shown in Table 1.
TABLE 1
(Table 1 is reproduced as an image in the original publication.)
(2) Transaction data extraction-and-fixing information
The transaction data extraction-and-fixing information table mainly includes fields such as the extracted website ID, the listed item's serial number, the seller ID, the listed item's title, the transaction unit price, the item's sales volume, the item picture, and the darknet website address; the details of each field are shown in Table 2.
TABLE 2
(Table 2 is reproduced as an image in the original publication.)
(3) Transaction data extraction-and-fixing task information
The transaction data extraction-and-fixing task table stores the extraction tasks created for the extracted websites during the darknet transaction data extraction-and-fixing process; it mainly includes fields such as the task ID of the extracted website, the extracted website's address and name, and the scan start and end times. The details of the fields are shown in Table 3.
TABLE 3
(Table 3 is reproduced as an image in the original publication.)
(4) Webpage data extraction-and-fixing task information
The webpage data extraction-and-fixing task table stores the extraction tasks created for the extracted websites during the darknet webpage data extraction-and-fixing process; it mainly comprises fields such as the extracted website name, the extraction-and-fixing status, the proxy's last working time, and error information. The details of the fields are shown in Table 4.
TABLE 4
(Table 4 is reproduced as an image in the original publication.)
Step 2.1: automatic registration on the darknet website: configure according to the website's account registration rules, e.g. whether the user name and password require upper and lower case and what the character and digit requirements are. Meanwhile, register multiple accounts to send cookie-bearing requests, defeating per-cookie request frequency limits.
After entry, the website uses one PHPSESSID cookie as the user's identity for its whole life cycle; its expiration time is 1 hour and it is not extended automatically. A Session therefore needs to be set up to handle the cookie automatically, and since the Tor network combines P2P and Socks5 technology, a Socks5 proxy must be configured. On the registration interface, note that the account name is a user number randomly generated by the darknet Chinese forum and the password must contain 6-20 characters with both upper and lower case letters and digits, as shown in FIG. 1.
This is implemented with a Python random password generation script; the registration verification code is then fetched through the imgcat command-line picture viewer, written to a local file named code.png, and entered at the console after inspection, as shown in FIGS. 2 and 3.
The registration result is saved to the User table of the MySQL database, as shown in FIG. 4.
Step 2.2: automatic login to the darknet website: simulate the login with Python's Requests module; create a Session() through the Requests library to manage cookies automatically, submit the cookie to the server together with the user's verification code, send the login request with session.post(), and complete the automatic login. A random password generation script is written in Python; the darknet Chinese forum login interface is shown in FIG. 5.
The login verification code is fetched through Python's imgcat command-line picture viewer and written to a local file named code.png. Finally, the payload of the user registration data is assembled and the login request is sent with session.post(), completing the automatic login, as shown in FIGS. 6 and 7.
After a browser logs in to a website, the server automatically creates a Session object whose variables persist across the whole user session while the session is valid. Therefore a Session() is created through the Requests library to manage cookies automatically. If the login interface has a verification code, then after the request to the server completes, the server sends only the picture's identifier to the client as a cookie, and that same cookie must be submitted together with the verification code. The test result of logging in with the registered account is shown in FIG. 8.
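A minimal sketch of this login flow: one Session object carries the session cookie across the captcha fetch and the login POST. The URLs, form field names, and credentials are illustrative assumptions, not the forum's real interface:

import requests

session = requests.Session()
session.proxies = {
    "http":  "socks5h://127.0.0.1:9150",
    "https": "socks5h://127.0.0.1:9150",
}

BASE = "http://exampleonionsite.onion"          # hypothetical forum address

# fetch the captcha; the Session keeps the cookie that identifies it
img = session.get(BASE + "/captcha.php")
with open("code.png", "wb") as f:
    f.write(img.content)

code = input("enter the captcha shown in code.png: ")
payload = {"username": "user123", "password": "Passw0rd6a", "captcha": code}
resp = session.post(BASE + "/login.php", data=payload)   # session.post() sends the login request
print("login ok" if resp.ok else "login failed")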
Step 2.3: IP ban prevention: because the target site may impose access restrictions such as UA checks and access-frequency limits, a Tor auto-switching script is written in Python so that Tor switches automatically once the IP is judged to be banned, as sketched below;
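A minimal sketch of such a Tor auto-switching script using the stem library, assuming the tor ControlPort is enabled on 9051; the control password and the use of HTTP 403 as the ban signal are illustrative assumptions:

from stem import Signal
from stem.control import Controller

def switch_tor_identity(password="controlpass"):
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)   # ask Tor for a fresh circuit and exit IP

def fetch_with_retry(session, url, max_retries=3):
    for _ in range(max_retries):
        resp = session.get(url, timeout=60)
        if resp.status_code != 403:        # assume 403 signals a ban
            return resp
        switch_tor_identity()              # banned: rotate the exit node
    raise RuntimeError("still banned after retries")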
step 2.4: crawling transaction data: the website homepage category and the sub-categories below each category are first viewed by the Tor browser. And acquiring and iteratively classified all types, substituting qeaid, page number page and other parameters into a template URL to capture the maximum page number and the detailed list of the column, and continuously turning pages to iterate the above process to crawl all page numbers and types after the arrow indicates the maximum page number is acquired.
The main page classification is as follows: data resources, service businesses, virtual goods, physical goods, technical skills, film and television pornography, other categories, basic knowledge, private photographs.
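A minimal sketch of this category-by-category, page-by-page iteration; the template URL with its qeaid parameter follows the description above, but the concrete address, the category ids, and the two parser callbacks are assumptions for illustration:

# hypothetical template URL; {cat} and {page} stand in for the qeaid and page parameters
TEMPLATE = "http://exampleonionsite.onion/list.php?qeaid={cat}&page={page}"

def crawl_category(session, cat_id, get_max_page, parse_rows):
    first = session.get(TEMPLATE.format(cat=cat_id, page=1)).text
    max_page = get_max_page(first)         # read the maximum page number from page 1
    rows = parse_rows(first)
    for page in range(2, max_page + 1):    # keep turning pages to the end
        html = session.get(TEMPLATE.format(cat=cat_id, page=page)).text
        rows.extend(parse_rows(html))      # seller, title, price, sales ...
    return rows

def crawl_all(session, category_ids, get_max_page, parse_rows):
    return {cat: crawl_category(session, cat, get_max_page, parse_rows)
            for cat in category_ids}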
Step 3, visually presenting the forensics results. Visualization is implemented with the existing Django integrated framework; to help police officers check illegal darknet data in time, the results are displayed visually. Specifically: the extracted darknet webpage screenshots, electronic data check values, and darknet webpage addresses are displayed visually in a Django backend, after installing and deploying the Django REST backend. The key deployment steps are:
(1) Create a screen_notes project, install the PyMySQL database driver, and change the database parameters; after configuration, create a screen_db database in MySQL, and after running the database migration command, inspect the created tables in MySQL.
(2) Create a worker app, define the model in models.py, modify the model file, and run the migration command.
(3) After configuring the URLs, start the local server. The display of the extracted and fixed darknet webpage data is shown in FIG. 9.
The visual presentation proceeds as follows:
1) Add a data source.
Create a Dashboard and add the MySQL data source. The data source configuration is shown in FIG. 10.
2) Configure the dashboard.
First, create a Table. Write the SQL query according to the entries of the MySQL details table. The concrete query is:
SELECT uptime,title,sold,price_btc FROM details ORDER BY uptime desc;
second, a Pie Chart (Pie Chart) is created. And writing a specific SQL query statement according to the uptime field in the details table.
3) And displaying the price of the commodity sold by the hidden network Chinese forum seller, and compiling a specific SQL query statement according to the price _ usdt field in the details table. The results are shown in FIG. 11.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the scope of the present invention, which is defined by the appended claims.

Claims (7)

1. A method for extracting and fixing darknet electronic data, characterized by comprising the following steps:
step 1, writing a crawler for the darknet website and crawling the darknet website with a browser testing framework;
step 2, fixing the transaction text information of the darknet website;
step 3, visually presenting the forensics results.
2. The darknet electronic data extraction-and-fixing method of claim 1, wherein the domain name address of the darknet website is obtained by:
(1) Google syntax search: searching with Google advanced syntax, obtaining the Baidu search result links related to darknet domain names, and saving them to a local txt document;
(2) regular expression: according to the form of Tor network domain names, collecting domain name addresses with a regular expression, matching URLs with the following pattern:
^https?:\/\/(([a-z0-9_-]{1,64}\.){0,4}[a-z0-9=]{16}\.onion)(:|\/|$)
and saving the Tor domain name addresses collected in this way to a url.txt document.
3. The darknet electronic data extraction-and-fixing method of claim 1, wherein crawling the darknet website with a browser testing framework in step 1 is specifically: for the structure of darknet websites, the darknet crawler crawls with the Selenium automated testing framework; Selenium tests run directly in the browser and support recording actions and automatically generating test scripts in languages such as .NET, Java and Perl.
4. The darknet electronic data extraction-and-fixing method of claim 1, wherein the darknet webpage crawling in step 1 specifically comprises:
step 1.1: webpage search: first accessing the target website entered by the user with Selenium, then searching webpages from the address the user provides; the search traverses with a breadth-first algorithm, taking the entered domain name of the main website as the entrance and first-level node, then crawling all nodes on the second level, and so on level by level until the un-crawled queue is empty or the configured depth is reached, extracting the list of unvisited URLs to be crawled;
step 1.2: webpage analysis: analyzing the webpage content, taking the downloaded webpage document string as input, extracting the required data and the unvisited URL list to be crawled, and storing them in a MySQL database;
step 1.3: webpage download: using a web crawler built on Python's Selenium automated testing framework to capture and fix screenshots of the darknet website, taking full-page screenshots of the webpage data with Selenium's save_screenshot() function;
step 1.4: electronic data preservation: running an integrity check on the extracted screenshots and calculating the electronic data integrity check value;
step 1.5: data storage: storing the downloaded screenshots and integrity check values locally or on a server, as the user requires;
step 1.6: extracting new URLs from the extracted-and-fixed website table in the database.
5. The darknet electronic data extraction-and-fixing method of claim 4, wherein a hash algorithm is used to calculate the electronic data integrity check value in step 1.4, specifically:
(1) the electronic data file is opened in binary mode, i.e. with the 'b' flag;
(2) SHA-1 is used to compute the hash value; a SHA-1 digest is 20 bytes long and is usually represented as a 40-character hexadecimal string.
6. The darknet electronic data extraction-and-fixing method of claim 1, wherein step 2 specifically comprises:
step 2.1: automatic registration on the darknet website: configuring according to the website's account registration rules, registering multiple accounts to send cookie-bearing requests, and defeating per-cookie request frequency limits;
step 2.2: automatic login to the darknet website: simulating login with Python's Requests module, creating a Session() through the Requests library to manage cookies automatically, submitting the cookie to the server together with the user's verification code, sending the login request with session.post(), and completing the automatic login;
step 2.3: IP ban prevention: because the target site may impose access restrictions such as UA checks and access-frequency limits, writing a Tor auto-switching script in Python so that Tor switches automatically once the IP is judged to be banned;
step 2.4: crawling transaction data: browsing the website homepage categories and the sub-categories under each one with the Tor browser, obtaining and iterating over all category types, substituting each category's qeaid, page number, and other parameters into a template URL to capture the maximum page number and the detail list of the column, and then continuing to turn pages and iterate until all pages and categories have been crawled.
7. The darknet electronic data extraction-and-fixing method of claim 1, wherein step 3 is specifically: visually displaying the extracted darknet webpage screenshots, electronic data check values, and darknet webpage addresses in a Django backend, after installing and deploying a Django REST backend, the key deployment steps being:
(1) creating a screen_notes project, installing the PyMySQL database driver, and changing the database parameters; after configuration, creating a screen_db database in MySQL, and after running the database migration command, inspecting the created tables in MySQL;
(2) creating a worker app, defining the model in models.py, modifying the model file, and running the migration command;
(3) after configuring the URLs, starting the local server.
CN202110399112.9A 2021-04-14 2021-04-14 Method for extracting and fixing darknet electronic data Pending CN113032655A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110399112.9A CN (en) Method for extracting and fixing darknet electronic data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110399112.9A CN (en) Method for extracting and fixing darknet electronic data

Publications (1)

Publication Number Publication Date
CN113032655A true CN113032655A (en) 2021-06-25

Family

ID=76456542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110399112.9A Pending CN113032655A (en) 2021-04-14 2021-04-14 Method for extracting and fixing dark network electronic data

Country Status (1)

Country Link
CN (1) CN113032655A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114692593A (en) * 2022-03-21 2022-07-01 中国刑事警察学院 Network information safety monitoring and early warning method
CN114692050A (en) * 2022-03-30 2022-07-01 北京金堤科技有限公司 Page parsing method and device, computer readable medium and electronic device
CN115909019A (en) * 2022-10-26 2023-04-04 吉林省吉林祥云信息技术有限公司 Scheduling method in multi-model node scene of identifying code image
CN116074322A (en) * 2023-04-06 2023-05-05 中国人民解放军国防科技大学 High-throughput task scheduling method, system and medium based on intelligent message segmentation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105631030A (en) * 2015-12-30 2016-06-01 福建亿榕信息技术有限公司 Universal web crawler login simulation method and system
CN111191097A (en) * 2019-12-20 2020-05-22 天阳宏业科技股份有限公司 Method, device and system for automatically acquiring webpage information by web crawler
CN111797355A (en) * 2020-07-06 2020-10-20 上海弘连网络科技有限公司 Webpage fixed evidence storing method based on customized browser

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105631030A (en) * 2015-12-30 2016-06-01 福建亿榕信息技术有限公司 Universal web crawler login simulation method and system
CN111191097A (en) * 2019-12-20 2020-05-22 天阳宏业科技股份有限公司 Method, device and system for automatically acquiring webpage information by web crawler
CN111797355A (en) * 2020-07-06 2020-10-20 上海弘连网络科技有限公司 Webpage fixed evidence storing method based on customized browser

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WEIXIN_34095889: "Django + Django REST framework: a step-by-step summary", Retrieved from the Internet <URL:https://blog.csdn.net/weixin_34095889/article/details/88704658> *
小生听雨园: "A super-detailed explanation of Python simulated login: cookies and sessions", Retrieved from the Internet <URL:https://blog.csdn.net/weixin_44154094/article/details/114073845> *
汤艳君 et al.: "Design and implementation of a Tor-based darknet data crawler", Journal of Information Security Research (信息安全研究), vol. 5, no. 9

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114692593A (en) * 2022-03-21 2022-07-01 中国刑事警察学院 Network information safety monitoring and early warning method
CN114692593B (en) * 2022-03-21 2023-04-07 中国刑事警察学院 Network information safety monitoring and early warning method
CN114692050A (en) * 2022-03-30 2022-07-01 北京金堤科技有限公司 Page parsing method and device, computer readable medium and electronic device
CN115909019A (en) * 2022-10-26 2023-04-04 吉林省吉林祥云信息技术有限公司 Scheduling method in multi-model node scene of identifying code image
CN115909019B (en) * 2022-10-26 2024-02-09 吉林省吉林祥云信息技术有限公司 Scheduling method in multi-model node scene for identifying verification code image
CN116074322A (en) * 2023-04-06 2023-05-05 中国人民解放军国防科技大学 High-throughput task scheduling method, system and medium based on intelligent message segmentation
CN116074322B (en) * 2023-04-06 2023-06-02 中国人民解放军国防科技大学 High-throughput task scheduling method, system and medium based on intelligent message segmentation

Similar Documents

Publication Publication Date Title
US20210382949A1 (en) Systems and methods for web content inspection
CN103888490B (en) A kind of man-machine knowledge method for distinguishing of full automatic WEB client side
CN113032655A (en) Method for extracting and fixing dark network electronic data
CN104766014B (en) For detecting the method and system of malice network address
CN112131882A (en) Multi-source heterogeneous network security knowledge graph construction method and device
CN103559235B (en) A kind of online social networks malicious web pages detection recognition methods
Li et al. Block: a black-box approach for detection of state violation attacks towards web applications
US20050278540A1 (en) System, method, and computer program product for validating an identity claimed by a subject
CN109922052A (en) A kind of malice URL detection method of combination multiple characteristics
EP3343870A1 (en) System and method for detecting phishing web pages field of technology
US20060069671A1 (en) Methodology, system and computer readable medium for analyzing target web-based applications
CN103297394B (en) Website security detection method and device
CN106446123A (en) Webpage verification code element identification method
Singh et al. A survey on different phases of web usage mining for anomaly user behavior investigation
Shyni et al. Phishing detection in websites using parse tree validation
CN110719344B (en) Domain name acquisition method and device, electronic equipment and storage medium
Piñeiro et al. Web architecture for URL-based phishing detection based on Random Forest, Classification Trees, and Support Vector Machine
Qu Research on password detection technology of iot equipment based on wide area network
CN104077353B (en) A kind of method and device of detecting black chain
Allison et al. Building a wide reach corpus for secure parser development
Bergman et al. The Digital Detective's Discourse-A toolset for forensically sound collaborative dark web content annotation and collection
Guo et al. A web crawler detection algorithm based on web page member list
CN114282097A (en) Information identification method and device
CN104063491B (en) A kind of method and device that the detection page is distorted
CN112257100A (en) Method and device for detecting sensitive data protection effect and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination