CN111814068A - ZeroNet blog and forum text grabbing and analyzing method - Google Patents

Info

Publication number
CN111814068A
CN111814068A
Authority
CN
China
Prior art keywords
text
zeronet
modeling
texts
forum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010716026.1A
Other languages
Chinese (zh)
Inventor
过小宇
丁建伟
孙恩博
陈周国
黎艺泉
谢相菊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 30 Research Institute
Original Assignee
CETC 30 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 30 Research Institute
Publication of CN111814068A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9536Search customisation based on social or collaborative filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing

Abstract

The invention relates to the technical field of information security and discloses a method for capturing and analyzing ZeroNet blog and forum texts. The invention acquires the text data of blog and forum websites by parsing the local database, overcoming the inability of a traditional crawler to obtain all text content of ZeroNet websites. The invention uses a semi-supervised LDA topic model for modeling analysis, which can be manually adjusted for different application scenarios and offers higher accuracy and flexibility.

Description

ZeroNet blog and forum text grabbing and analyzing method
Technical Field
The invention relates to the technical field of information security, and in particular to a method for capturing and analyzing ZeroNet blog and forum texts.
Background
As network information moves deeper into hidden and dark networks, lawless persons hide their identities and use network technology to anonymously publish and spread illegal information. ZeroNet, a new type of hidden network, adopts a fully distributed network architecture and uses blockchain technology to perform signature verification on site content and user information. Any user node in the network can act as a client and also provide network services to other nodes. Once a site built by a user has been accessed or shared by other online nodes, it can still be accessed by other users even if its creator goes offline, and in theory the website content can never be completely deleted. ZeroNet also supports access over the Tor network, making its users even more hidden. As a result, ZeroNet is full of illegal content, and its text content is concentrated mainly in forum and blog websites.
Little research has so far been carried out, in either academia or industry, on capturing and analyzing ZeroNet text content and implementing a corresponding analysis system.
ZeroNet is a censorship-resistant P2P network whose websites work differently from ordinary websites on the Internet. In ZeroNet, when visiting a website, each user downloads or updates the website data from peer nodes that hold it and stores it locally; the stored local data are then parsed to generate the web page, and the user may also choose to act as a provider of the website service. Because of ZeroNet's special working mode and the differences in web-page elements between websites, it is difficult for a generic crawler to obtain the text data of all websites.
Disclosure of Invention
To solve these problems, the invention provides a method, tailored to ZeroNet's specific working principle, for capturing and analyzing ZeroNet blog and forum texts: the text content of blog and forum websites in ZeroNet is captured, and the captured texts are modeled and classified. The category of a new text can also be predicted based on the modeling result, thereby realizing monitoring of ZeroNet blog and forum texts. Because ZeroNet updates incrementally, the invention supports acquiring, according to the publishing time of the content, only content newly published after a given time, thereby realizing incremental updating and collection of the content.
The method for capturing and analyzing ZeroNet blog and forum texts comprises: first calling a browser to perform simulated login and obtain website data; then parsing the local database to obtain the text content; and, once the texts are obtained, using a semi-supervised LDA topic model to model, analyze, and classify them, with the category of a new text predicted from the modeling result so as to realize supervision of ZeroNet blog and forum texts.
Further, before the website data is acquired, an initialization process is performed:
Data of the main navigation websites in ZeroNet are parsed, the site address data therein are extracted, and a website database is constructed.
Further, acquiring the website data includes the following steps:
Step 11: open the ZeroNet application and a browser, and establish a network connection with the initial Tracker nodes using the ZeroNet Tracker-node communication protocol to initialize the network environment;
Step 12: read the blog and forum websites from the ZeroNet website database, simulate the access process of a real environment to traverse and access them, save the local data storage paths of the successfully accessed websites, and mark these websites as accessed; after a delay, whose length can be set manually, loop over the websites that were not successfully accessed in the previous traversal; the process ends after a fixed number of cycles or upon manual termination.
Further, parsing the local database to obtain the text content includes the following steps:
Step 21: read the local data storage path of a successfully accessed website, parse the website signature set information file contained in that path, and read from it the path of the SQL database configuration file;
Step 22: parse the SQL configuration file at that path, use the parsed SQL statements to query the local SQL databases of the blog and forum websites and obtain the websites' text content, and perform language identification; divide the texts into Chinese, English, and other languages and store them under separate paths;
Step 23: preprocess the Chinese and English texts separately, and after preprocessing remove low-information texts whose word count is below a set threshold, obtaining the processed plain texts.
Further, the modeling analysis and classification of the texts using the semi-supervised LDA topic model comprises the following steps:
Step 31: compute the perplexity of the processed Chinese and English texts respectively, and return a perplexity curve;
Step 32: the user sets an appropriate number of topics based on the perplexity curve, or sets the number of topics he or she expects; the user may set other parameters or use the defaults;
Step 33: after the parameters are set, read the texts to be modeled to initialize the model, then start modeling and generate an initial modeling result;
Step 34: the user may choose to manually modify the twords file in the modeling result, removing noise words under each topic or adding feature words to the corresponding topics as prior knowledge, and rerun the modeling program; the previous modeling result and the prior knowledge are read to initialize the model, so that the feature words under each topic in the prior knowledge have a higher probability of falling into that topic; this is repeated in every loop, the user freely choosing the number of loops, and the current modeling result is saved after each loop finishes; when modeling is finished, proceed to step 35;
Step 35: apply Bayesian inference to the texts based on the modeling result, repeating each inference three times and taking the average, and return the topic matrix of the texts; the user may select the topic with the highest probability as the classification result to realize text classification, or take the topics whose probability exceeds a specified threshold as the labels of the text to realize automatic multi-label marking.
Further, step 35 can be applied to infer existing texts, or to infer newly acquired texts and predict their categories, thereby realizing supervision of ZeroNet blog and forum texts; when a modeling result already exists, the user can start inference directly from step 35.
The invention has the beneficial effects that:
Tailored to ZeroNet's specific working principle, the method captures the text content of blog and forum websites, and models and classifies the captured texts. The category of a new text can also be predicted based on the modeling result, realizing monitoring of ZeroNet blog and forum texts. Because ZeroNet updates incrementally, the invention supports acquiring content newly published after a given time according to its publishing time, realizing incremental updating and collection of the content. In addition:
(1) the invention acquires the text data of blog and forum websites by parsing the local database, overcoming the inability of a traditional crawler to obtain all text content of ZeroNet websites;
(2) the invention uses a semi-supervised LDA topic model for modeling analysis, which can be manually adjusted for different application scenarios and offers higher accuracy and flexibility.
Drawings
FIG. 1 is a block diagram of a text capture and analysis method of the present invention;
FIG. 2 is a text data parsing process flow diagram of the present invention;
FIG. 3 is a text modeling analysis process flow diagram of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, specific embodiments of the present invention will now be described. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Based on an analysis of ZeroNet's working principle, the invention captures and parses the texts of blog and forum websites and performs modeling analysis on the texts with a semi-supervised LDA topic model, as shown in FIG. 1.
The terms of the present invention are defined as follows:
(1) Tracker node: a service node in the ZeroNet network that provides network access and is used to obtain the peer nodes caching a particular site.
(2) Website: an application site created by a user in the ZeroNet network that provides Web services in the form of files.
(3) Website signature set information file: this file contains the site address (public key), the sizes and hash values of the site files, the site owner's digital signature, a timestamp, and the like, and provides download indexing and content verification for the site data.
(4) Website text: the title of each blog post or forum post on a blog or forum website, together with the comments that follow it.
(5) SQL configuration file: the configuration file used by ZeroNet to parse the local database; it includes the database name, the table structures, and the SQL statements used to parse each table.
(6) Perplexity: perplexity is generally used in natural language processing to measure how well a trained language model fits the data. When LDA is used to cluster topics and words, the number of topics is usually chosen near the inflection point of the perplexity curve. The formula is as follows:
$$\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left(-\frac{\sum_{w}\log p(w)}{N}\right)$$
where p(w) is the probability of each word appearing in the test set; in the LDA model specifically, p(w) = Σ_z p(z|d) × p(w|z), where z and d denote a trained topic and an individual document of the test set, respectively. The denominator N is the number of all words appearing in the test set, i.e. the total length of the test set.
(7) Semi-supervised LDA topic model: the LDA topic model is a generative document-topic model, a three-layer Bayesian probability model of words, topics, and documents. Its basic idea is to assume that K hidden topics exist across all documents; each word of a document selects a topic with a certain probability and then selects a word from that topic with a certain probability, and hidden topics and their feature words are extracted continuously until every word in the documents has been traversed.
In LDA, the generation process of a document is defined by the following steps:
(1) Assume a latent topic; the distribution probability vector φ_{z_{i,j}} of the feature words on this topic obeys a Dirichlet distribution with parameter β.
(2) Determine the document length (i.e. the total number of terms); the text length N obeys a Poisson distribution.
(3) Sample a Dirichlet distribution with parameter α to generate the topic distribution θ_i of document i.
The following procedure is then repeated N times:
(a) Sample the topic z_{i,j} of the j-th word of document i from the multinomial distribution over topics.
(b) Sample from the multinomial word distribution P(ω_{i,j} | z_{i,j}; β) of the selected topic to finally generate the word ω_{i,j}.
The above is the process of generating the text by the LDA model, and when the process is finished, a text containing N words is generated.
The word distribution of the final document is:
$$P(\omega_{i,j}\mid\alpha,\beta)=\int_{\theta_i}\int_{\Phi}\sum_{z_{i,j}}P(\omega_{i,j}\mid\phi_{z_{i,j}})\,P(z_{i,j}\mid\theta_i)\,P(\theta_i\mid\alpha)\,P(\Phi\mid\beta)\,d\Phi\,d\theta_i$$
where α and β are the parameters of the Dirichlet distributions, θ_i is the topic distribution of document i, z_{i,j} is the topic of the j-th word in document i, φ_{z_{i,j}} is the word distribution of topic z_{i,j}, ω_{i,j} is the word sampled from φ_{z_{i,j}}, and Φ is the topic-word matrix.
The semi-supervised LDA topic model differs from the unsupervised LDA topic model in step (a): if the program detects that a word appears in the prior knowledge, the word is given a higher weight, so that it is more likely to fall into the corresponding topic of the prior knowledge. The prior knowledge thus constrains topics and some of their feature words without requiring a manually constructed constraint set.
(8) twords file: the twords data of the model, i.e. the prior data; it records, for each topic, the most heavily weighted feature words from the topic-word matrix.
(9) param file: records the parameters of the modeling program, the number of texts, the number of words, and other information.
(10) wordmap file: records all words in the texts to be modeled, one word per line together with its number.
(11) worddict file: each line represents a word and records the texts in which that word appears and its occurrence count in each of those texts.
(12) Feature words: words that characterize a text or a topic.
(13) Modeling initialization: for the first run, modeling initialization loads the texts to be modeled, computes the total number of words, the total number of documents, the occurrence frequency of each word in each text, and other information, passes this information into the program as parameters, and then begins modeling the texts. For subsequent runs, modeling initialization reads the param, wordmap, and worddict files from the previous run's modeling result to load the model data, and at the same time reads the twords file to initialize the prior knowledge.
(14) Text-topic matrix: each row of the matrix corresponds to a text and each column to a topic; it records the probability that each text belongs to each topic.
(15) Topic-word matrix: each row of the matrix corresponds to a word and each column to a topic; it records the probability that each word belongs to each topic.
(16) Inference: once the trained model has been obtained, texts can be inferred on that basis to obtain their topic distributions. The LDA inference process using the Gibbs sampling algorithm is as follows:
1) Traverse each word of the current document and randomly assign a topic z to it.
2) Rescan the current document and, for each word, resample and update its topic assignment according to Gibbs sampling.
3) Iterate step 2) until the Gibbs sampling converges.
4) After the iterations finish, count the topic assignments of all words in the document to obtain the text-topic distribution of the text.
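To make the inference procedure of definition (16) and the seed-word weighting of definition (7) concrete, the following Python sketch runs collapsed Gibbs sampling for a single new document against a fixed topic-word matrix. It is an illustration only: the function name, the boost factor, and the seed_words structure are assumptions, not part of the patent.

```python
import numpy as np

def infer_document(word_ids, phi, alpha=0.1, n_iter=100, seed_words=None, boost=5.0):
    """Collapsed Gibbs inference for one document.

    word_ids   -- list of word indices for the document
    phi        -- topic-word matrix, shape (K, V), rows sum to 1
    seed_words -- optional {topic_id: set(word_id)} prior knowledge; matching words
                  get their sampling weight multiplied by `boost`, the semi-supervised
                  weighting described in definition (7)
    """
    K = phi.shape[0]
    # step 1): randomly assign a topic to every word
    z = np.random.randint(K, size=len(word_ids))
    counts = np.bincount(z, minlength=K).astype(float)

    for _ in range(n_iter):                       # step 3): iterate until convergence
        for n, w in enumerate(word_ids):          # step 2): rescan and resample each word
            counts[z[n]] -= 1
            weights = (counts + alpha) * phi[:, w]
            if seed_words:                        # prior knowledge boosts seeded topics
                for k, words in seed_words.items():
                    if w in words:
                        weights[k] *= boost
            weights /= weights.sum()
            z[n] = np.random.choice(K, p=weights)
            counts[z[n]] += 1

    # step 4): count topic assignments to get the document-topic distribution
    theta = (counts + alpha) / (counts.sum() + K * alpha)
    return theta
```

Calling the function several times and averaging the returned vectors mirrors the "repeat each inference three times and take the average" rule used in step 35 below.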
The method for capturing and analyzing ZeroNet blog and forum texts comprises: first calling a browser to perform simulated login and obtain website data; then parsing the local database to obtain the text content; and, once the texts are obtained, using a semi-supervised LDA topic model to model, analyze, and classify them, with the category of a new text predicted from the modeling result so as to realize supervision of ZeroNet blog and forum texts.
In a preferred embodiment of the present invention, before acquiring the website data, an initialization process is performed:
Data of the main navigation websites in ZeroNet are parsed, the site address data therein are extracted, and a website database is constructed.
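A minimal sketch of the website database built in this initialization step, assuming a simple SQLite table with a site address, a category, an accessed flag, and a data path; the patent does not prescribe a schema, so all table and column names are illustrative.

```python
import sqlite3

def build_site_db(db_path, site_addresses):
    """Create the ZeroNet website database from addresses extracted
    from the navigation sites (the extraction itself is site-specific)."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS sites (
                        address   TEXT PRIMARY KEY,
                        category  TEXT,             -- e.g. 'blog' or 'forum'
                        accessed  INTEGER DEFAULT 0,
                        data_path TEXT)""")
    conn.executemany("INSERT OR IGNORE INTO sites (address) VALUES (?)",
                     [(a,) for a in site_addresses])
    conn.commit()
    conn.close()
```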
In a preferred embodiment of the present invention, the acquiring the website data comprises the steps of:
Step 11: open the ZeroNet application and a browser, and establish a network connection with the initial Tracker nodes using the ZeroNet Tracker-node communication protocols (including HTTP, UDP, and the Zero protocol) to initialize the network environment;
Step 12: read the blog and forum websites from the ZeroNet website database, simulate the access process of a real environment to traverse and access them, save the local data storage paths of the successfully accessed websites, and mark these websites as accessed; after a delay, whose length can be set manually, loop over the websites that were not successfully accessed in the previous traversal; the process ends after a fixed number of cycles or upon manual termination.
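Steps 11 and 12 might be sketched as follows, assuming ZeroNet's default local web gateway at 127.0.0.1:43110 and the site table from the initialization sketch above; the delay, the number of cycles, and the data-path layout are the manually set or assumed parameters.

```python
import sqlite3
import time
import requests

ZERONET_PROXY = "http://127.0.0.1:43110"   # default local ZeroNet gateway (assumed)

def traverse_sites(db_path, delay=60, max_rounds=5):
    conn = sqlite3.connect(db_path)
    for _ in range(max_rounds):                              # fixed number of cycles
        pending = [row[0] for row in conn.execute(
            "SELECT address FROM sites WHERE accessed = 0")]
        if not pending:
            break
        for address in pending:                              # traversal access
            try:
                r = requests.get(f"{ZERONET_PROXY}/{address}/", timeout=30)
                if r.ok:
                    # mark as accessed and record the local data path (assumed layout)
                    conn.execute(
                        "UPDATE sites SET accessed = 1, data_path = ? WHERE address = ?",
                        (f"ZeroNet/data/{address}", address))
                    conn.commit()
            except requests.RequestException:
                pass                                          # retry in the next round
        time.sleep(delay)                                     # manually set delay
    conn.close()
```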
In a preferred embodiment of the present invention, as shown in fig. 2, parsing the local database to obtain the text content includes the following steps:
Step 21: read the local data storage path of a successfully accessed website, parse the website signature set information file contained in that path, and read from it the path of the SQL database configuration file;
Step 22: parse the SQL configuration file at that path, use the parsed SQL statements to query the local SQL databases of the blog and forum websites and obtain the websites' text content, and perform language identification; divide the texts into Chinese, English, and other languages and store them under separate paths;
Step 23: since the data-set texts are mainly Chinese and English, this embodiment analyzes content mainly for these two languages. The Chinese and English texts are preprocessed separately, and low-information texts whose word count is below a set threshold are removed after preprocessing, yielding the processed plain texts.
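A sketch of steps 21-23, assuming ZeroNet's usual site layout (a content.json signature set information file listing a dbschema.json SQL configuration file that names a SQLite database) and using the third-party langdetect package for language identification; the table and column names in the query are purely illustrative, since in the described method the real SQL statements come from the configuration file.

```python
import json
import os
import re
import sqlite3
from langdetect import detect   # third-party: pip install langdetect

def extract_texts(site_data_path):
    """Steps 21-22: locate the SQL configuration file via content.json,
    then query the site's local SQLite database for its text content."""
    with open(os.path.join(site_data_path, "content.json"), encoding="utf-8") as f:
        content = json.load(f)
    # locate the SQL configuration file among the files listed in content.json (assumed name)
    schema_name = next(n for n in content.get("files", {}) if n.endswith("dbschema.json"))
    with open(os.path.join(site_data_path, schema_name), encoding="utf-8") as f:
        schema = json.load(f)
    db_file = os.path.join(site_data_path, schema["db_file"])

    conn = sqlite3.connect(db_file)
    texts = []
    # illustrative query only -- real statements are parsed from the configuration file
    for title, body in conn.execute("SELECT title, body FROM post"):
        texts.append(f"{title}\n{body or ''}")
    conn.close()
    return texts

def bucket_by_language(texts, min_words=5):
    """Step 22 language identification and step 23 low-information filtering.
    Word counting is simplified; Chinese would normally be segmented first (e.g. jieba)."""
    buckets = {"zh": [], "en": [], "other": []}
    for t in texts:
        clean = re.sub(r"\s+", " ", t).strip()
        if len(clean.split()) < min_words:          # drop texts below the set word count
            continue
        try:
            lang = detect(clean)
        except Exception:
            lang = "other"
        key = "zh" if lang.startswith("zh") else ("en" if lang == "en" else "other")
        buckets[key].append(clean)
    return buckets
```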
In a preferred embodiment of the present invention, as shown in fig. 3, modeling analysis and classification of text using a semi-supervised LDA topic model comprises the steps of:
Step 31: compute the perplexity of the processed Chinese and English texts respectively, and return a perplexity curve;
Step 32: the user sets an appropriate number of topics based on the perplexity curve, or sets the number of topics he or she expects; the user may set other parameters or use the defaults;
Step 33: after the parameters are set, read the texts to be modeled to initialize the model, then start modeling and generate an initial modeling result;
Step 34: the user may choose to manually modify the twords file in the modeling result, removing noise words under each topic or adding feature words to the corresponding topics as prior knowledge, and rerun the modeling program; the previous modeling result and the prior knowledge are read to initialize the model, so that the feature words under each topic in the prior knowledge have a higher probability of falling into that topic; this is repeated in every loop, the user freely choosing the number of loops, and the current modeling result is saved after each loop finishes; when modeling is finished, proceed to step 35;
Step 35: apply Bayesian inference to the texts based on the modeling result, repeating each inference three times and taking the average, and return the topic matrix of the texts; the user may select the topic with the highest probability as the classification result to realize text classification, or take the topics whose probability exceeds a specified threshold as the labels of the text to realize automatic multi-label marking.
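A minimal sketch of steps 31-35 using the open-source gensim library as a stand-in for the patent's own modeling program, which is not published. The semi-supervised twords adjustment is not reproduced here (gensim's eta parameter can be given per-topic word priors for a similar effect); the candidate topic numbers, threshold, and function names are illustrative assumptions.

```python
import numpy as np
from gensim import corpora, models

def choose_num_topics(tokenized_docs, candidate_ks=(5, 10, 15, 20, 25)):
    """Step 31: compute a perplexity value per candidate topic number,
    so the user can pick a value near the inflection point (step 32)."""
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(d) for d in tokenized_docs]
    curve = {}
    for k in candidate_ks:
        lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                              num_topics=k, passes=10, random_state=0)
        # gensim reports a per-word likelihood bound; perplexity = 2 ** (-bound)
        curve[k] = float(2 ** (-lda.log_perplexity(corpus)))
    return dictionary, corpus, curve

def train_and_classify(dictionary, corpus, num_topics, new_doc_tokens, threshold=0.3):
    """Steps 33-35: train the model, infer a new text three times,
    average the topic vector, then classify / multi-label it."""
    lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                          num_topics=num_topics, passes=20, random_state=0)
    bow = dictionary.doc2bow(new_doc_tokens)
    runs = []
    for _ in range(3):                                   # repeat three times, take the mean
        vec = np.zeros(num_topics)
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            vec[topic_id] = prob
        runs.append(vec)
    theta = np.mean(runs, axis=0)
    best_topic = int(np.argmax(theta))                   # single-label classification
    labels = [k for k, p in enumerate(theta) if p > threshold]   # multi-label marking
    return theta, best_topic, labels
```

choose_num_topics covers the perplexity curve of steps 31-32, and train_and_classify covers training, three-fold averaged inference, highest-probability classification, and threshold-based multi-label marking of steps 33-35.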
In a preferred embodiment of the present invention, step 35 can be applied to infer existing texts, or to infer newly acquired texts and predict their categories, thereby realizing supervision of ZeroNet blog and forum texts; when a modeling result already exists, the user can start inference directly from step 35.
The foregoing is illustrative of the preferred embodiments of this invention, and it is to be understood that the invention is not limited to the precise forms disclosed herein. Various other combinations, modifications, and environments may be resorted to within the scope of the inventive concept described above or apparent to those skilled in the relevant art, and modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A method for capturing and analyzing ZeroNet blog and forum texts, characterized in that a browser is called to perform simulated login and obtain website data, a local database is parsed to obtain the text content, a semi-supervised LDA topic model is used to perform modeling analysis and classification on the obtained texts, and the category of a new text can be predicted based on the modeling result, so as to realize supervision of ZeroNet blog and forum texts.
2. The method of claim 1, wherein before the website data is obtained, an initialization process is performed:
Data of the main navigation websites in ZeroNet are parsed, the site address data therein are extracted, and a website database is constructed.
3. The method of claim 2, wherein the step of obtaining website data comprises the steps of:
Step 11: open the ZeroNet application and a browser, and establish a network connection with the initial Tracker nodes using the ZeroNet Tracker-node communication protocol to initialize the network environment;
Step 12: read the blog and forum websites from the ZeroNet website database, simulate the access process of a real environment to traverse and access them, save the local data storage paths of the successfully accessed websites, and mark these websites as accessed; after a delay, whose length can be set manually, loop over the websites that were not successfully accessed in the previous traversal; the process ends after a fixed number of cycles or upon manual termination.
4. The method of claim 3, wherein parsing the local database to obtain text content comprises:
Step 21: read the local data storage path of a successfully accessed website, parse the website signature set information file contained in that path, and read from it the path of the SQL database configuration file;
Step 22: parse the SQL configuration file at that path, use the parsed SQL statements to query the local SQL databases of the blog and forum websites and obtain the websites' text content, and perform language identification; divide the texts into Chinese, English, and other languages and store them under separate paths;
Step 23: preprocess the Chinese and English texts separately, and after preprocessing remove low-information texts whose word count is below a set threshold, obtaining the processed plain texts.
5. The method of claim 4, wherein the modeling, analyzing and classifying the text by using the semi-supervised LDA topic model comprises the following steps:
Step 31: compute the perplexity of the processed Chinese and English texts respectively, and return a perplexity curve;
Step 32: the user sets an appropriate number of topics based on the perplexity curve, or sets the number of topics he or she expects; the user may set other parameters or use the defaults;
Step 33: after the parameters are set, read the texts to be modeled to initialize the model, then start modeling and generate an initial modeling result;
Step 34: the user may choose to manually modify the twords file in the modeling result, removing noise words under each topic or adding feature words to the corresponding topics as prior knowledge, and rerun the modeling program; the previous modeling result and the prior knowledge are read to initialize the model, so that the feature words under each topic in the prior knowledge have a higher probability of falling into that topic; this is repeated in every loop, the user freely choosing the number of loops, and the current modeling result is saved after each loop finishes; when modeling is finished, proceed to step 35;
Step 35: apply Bayesian inference to the texts based on the modeling result, repeating each inference three times and taking the average, and return the topic matrix of the texts; the user may select the topic with the highest probability as the classification result to realize text classification, or take the topics whose probability exceeds a specified threshold as the labels of the text to realize automatic multi-label marking.
6. The method of claim 5, wherein step 35 can infer existing texts, or infer newly acquired texts and predict their categories so as to realize supervision of ZeroNet blog and forum texts; when a modeling result already exists, the user can start inference directly from step 35.
CN202010716026.1A 2020-05-19 2020-07-23 ZeroNet blog and forum text grabbing and analyzing method Pending CN111814068A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010426038 2020-05-19
CN2020104260380 2020-05-19

Publications (1)

Publication Number Publication Date
CN111814068A true CN111814068A (en) 2020-10-23

Family

ID=72862368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010716026.1A Pending CN111814068A (en) 2020-05-19 2020-07-23 ZeroNet blog and forum text grabbing and analyzing method

Country Status (1)

Country Link
CN (1) CN111814068A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114915599A (en) * 2022-07-19 2022-08-16 中国电子科技集团公司第三十研究所 Darknet site session identification method and system based on semi-supervised clustering learning

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744981A (en) * 2014-01-14 2014-04-23 南京汇吉递特网络科技有限公司 System for automatic classification analysis for website based on website content

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744981A (en) * 2014-01-14 2014-04-23 南京汇吉递特网络科技有限公司 System for automatic classification analysis for website based on website content

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
过小宇 et al.: "ZeroNet text content analysis based on a semi-supervised LDA topic model" (基于半监督LDA主题模型的ZeroNet文本内容分析), Information Technology (《信息技术》) *
郭晗 et al.: "Darknet space resource detection based on Freenet" (基于Freenet的暗网空间资源探测), Communications Technology (《通信技术》) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114915599A (en) * 2022-07-19 2022-08-16 中国电子科技集团公司第三十研究所 Darknet site session identification method and system based on semi-supervised clustering learning
CN114915599B (en) * 2022-07-19 2022-11-11 中国电子科技集团公司第三十研究所 Darknet site session identification method and system based on semi-supervised clustering learning

Similar Documents

Publication Publication Date Title
Olmezogullari et al. Representation of click-stream datasequences for learning user navigational behavior by using embeddings
Gleich et al. Tracking the random surfer: empirically measured teleportation parameters in PageRank
CN104133877B (en) The generation method and device of software label
CN111831802B (en) Urban domain knowledge detection system and method based on LDA topic model
CN108717408A (en) A kind of sensitive word method for real-time monitoring, electronic equipment, storage medium and system
CN106354818B (en) Social media-based dynamic user attribute extraction method
KR101923780B1 (en) Consistent topic text generation method and text generation apparatus performing the same
WO2023108980A1 (en) Information push method and device based on text adversarial sample
JP6776310B2 (en) User-Real-time feedback information provision methods and systems associated with input content
CN102222098A (en) Method and system for pre-fetching webpage
CN111259220B (en) Data acquisition method and system based on big data
Libovický et al. Multimodal abstractive summarization for open-domain videos
CN108536868A (en) The data processing method of short text data and application on social networks
CN109508448A (en) Short information method, medium, device are generated based on long article and calculate equipment
Bykau et al. Fine-grained controversy detection in Wikipedia
Han et al. Linking social network accounts by modeling user spatiotemporal habits
CN107862039A (en) Web data acquisition methods, system and Data Matching method for pushing
CN110019763B (en) Text filtering method, system, equipment and computer readable storage medium
CN111814068A (en) ZeroNet blog and forum text grabbing and analyzing method
CN110069686A (en) User behavior analysis method, apparatus, computer installation and storage medium
Aziz et al. Social network analytics: natural disaster analysis through twitter
Shu et al. Automatic extraction of web page text information based on network topology coincidence degree
Bhat et al. Browser simulation-based crawler for online social network profile extraction
WO2023048807A1 (en) Hierarchical representation learning of user interest
CN115470328A (en) Open field question-answering method based on knowledge graph and related equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201023

RJ01 Rejection of invention patent application after publication