CN111814068A - ZeroNet blog and forum text grabbing and analyzing method - Google Patents

Info

Publication number
CN111814068A
CN111814068A
Authority
CN
China
Prior art keywords
text
zeronet
modeling
texts
forum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010716026.1A
Other languages
Chinese (zh)
Inventor
过小宇
丁建伟
孙恩博
陈周国
黎艺泉
谢相菊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 30 Research Institute
Original Assignee
CETC 30 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 30 Research Institute
Publication of CN111814068A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9536Search customisation based on social or collaborative filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing

Abstract

The invention relates to the technical field of information security and discloses a method for capturing and analyzing ZeroNet blog and forum texts. The invention acquires the text data of blog and forum websites by parsing the local database, overcoming the inability of a traditional crawler to obtain all text content of ZeroNet websites. The invention uses a semi-supervised LDA topic model for modeling analysis, which can be manually adjusted for different application scenarios and offers higher accuracy and flexibility.

Description

ZeroNet blog and forum text grabbing and analyzing method
Technical Field
The invention relates to the technical field of information security, and in particular to a method for capturing and analyzing ZeroNet blog and forum texts.
Background
As network information moves deeper into hidden and dark networks, lawless persons hide their identities and use network technology to anonymously publish and spread illegal information. ZeroNet, a new type of hidden network, adopts a fully distributed network architecture and uses blockchain technology to perform signature verification on site content and user information. Any user node in the network can act as a client and also provide network services to other nodes. Once a site built by a user has been accessed or shared by other online nodes, it can still be accessed by other users even if its creator goes offline, and in theory the website content can never be completely deleted. ZeroNet also supports access over the Tor network, making its users even more hidden. As a result, ZeroNet is full of illegal content, and its text content is concentrated mainly in forum and blog websites.
Little research has so far been carried out, in either academia or industry, on capturing and analyzing ZeroNet text content and implementing a corresponding analysis system.
ZeroNet is a censorship-resistant P2P network whose websites work differently from ordinary websites on the Internet. In ZeroNet, when visiting a website, each user downloads or updates the website data from peer nodes that hold it and stores it locally; the stored local data are then parsed to generate the web page, and the user may also choose to act as a provider of the website service. Because of ZeroNet's special working mode and the differences in web-page elements between websites, it is difficult for a generic crawler to obtain the text data of all websites.
Disclosure of Invention
To solve these problems, the invention provides a method, tailored to ZeroNet's specific working principle, for capturing and analyzing ZeroNet blog and forum texts: the text content of blog and forum websites in ZeroNet is captured, and the captured texts are modeled and classified. The category of a new text can also be predicted based on the modeling result, thereby realizing monitoring of ZeroNet blog and forum texts. Because ZeroNet updates incrementally, the invention supports acquiring, according to the publishing time of the content, only content newly published after a given time, thereby realizing incremental updating and collection of the content.
The method for capturing and analyzing ZeroNet blog and forum texts comprises: first calling a browser to perform simulated login and obtain website data; then parsing the local database to obtain the text content; and, once the texts are obtained, using a semi-supervised LDA topic model to model, analyze, and classify them, with the category of a new text predicted from the modeling result so as to realize supervision of ZeroNet blog and forum texts.
Further, before the website data is acquired, an initialization process is performed:
Data of the main navigation websites in ZeroNet are parsed, the site address data therein are extracted, and a website database is constructed.
Further, acquiring the website data includes the following steps:
Step 11: open the ZeroNet application and a browser, and establish a network connection with the initial Tracker nodes using the ZeroNet Tracker-node communication protocol to initialize the network environment;
Step 12: read the blog and forum websites from the ZeroNet website database, simulate the access process of a real environment to traverse and access them, save the local data storage paths of the successfully accessed websites, and mark these websites as accessed; after a delay, whose length can be set manually, loop over the websites that were not successfully accessed in the previous traversal; the process ends after a fixed number of cycles or upon manual termination.
Further, parsing the local database to obtain the text content includes the following steps:
Step 21: read the local data storage path of a successfully accessed website, parse the website signature set information file contained in that path, and read from it the path of the SQL database configuration file;
Step 22: parse the SQL configuration file at that path, use the parsed SQL statements to query the local SQL databases of the blog and forum websites and obtain the websites' text content, and perform language identification; divide the texts into Chinese, English, and other languages and store them under separate paths;
Step 23: preprocess the Chinese and English texts separately, and after preprocessing remove low-information texts whose word count is below a set threshold, obtaining the processed plain texts.
Further, the modeling analysis and classification of the texts using the semi-supervised LDA topic model comprises the following steps:
Step 31: compute the perplexity of the processed Chinese and English texts respectively, and return a perplexity curve;
Step 32: the user sets an appropriate number of topics based on the perplexity curve, or sets the number of topics he or she expects; the user may set other parameters or use the defaults;
Step 33: after the parameters are set, read the texts to be modeled to initialize the model, then start modeling and generate an initial modeling result;
Step 34: the user may choose to manually modify the twords file in the modeling result, removing noise words under each topic or adding feature words to the corresponding topics as prior knowledge, and rerun the modeling program; the previous modeling result and the prior knowledge are read to initialize the model, so that the feature words under each topic in the prior knowledge have a higher probability of falling into that topic; this is repeated in every loop, the user freely choosing the number of loops, and the current modeling result is saved after each loop finishes; when modeling is finished, proceed to step 35;
Step 35: apply Bayesian inference to the texts based on the modeling result, repeating each inference three times and taking the average, and return the topic matrix of the texts; the user may select the topic with the highest probability as the classification result to realize text classification, or take the topics whose probability exceeds a specified threshold as the labels of the text to realize automatic multi-label marking.
Further, step 35 can be applied to infer existing texts, or to infer newly acquired texts and predict their categories, thereby realizing supervision of ZeroNet blog and forum texts; when a modeling result already exists, the user can start inference directly from step 35.
The invention has the beneficial effects that:
Tailored to ZeroNet's specific working principle, the method captures the text content of blog and forum websites, and models and classifies the captured texts. The category of a new text can also be predicted based on the modeling result, realizing monitoring of ZeroNet blog and forum texts. Because ZeroNet updates incrementally, the invention supports acquiring content newly published after a given time according to its publishing time, realizing incremental updating and collection of the content. In addition:
(1) the invention acquires the text data of blog and forum websites by parsing the local database, overcoming the inability of a traditional crawler to obtain all text content of ZeroNet websites;
(2) the invention uses a semi-supervised LDA topic model for modeling analysis, which can be manually adjusted for different application scenarios and offers higher accuracy and flexibility.
Drawings
FIG. 1 is a block diagram of a text capture and analysis method of the present invention;
FIG. 2 is a text data parsing process flow diagram of the present invention;
FIG. 3 is a text modeling analysis process flow diagram of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, specific embodiments of the present invention will now be described. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Based on an analysis of ZeroNet's working principle, the invention captures and parses the texts of blog and forum websites and performs modeling analysis on the texts with a semi-supervised LDA topic model, as shown in FIG. 1.
The terms of the present invention are defined as follows:
(1) Tracker node: a service node in the ZeroNet network that provides network access and is used to obtain the peer nodes caching a particular site.
(2) Website: an application site created by a user in the ZeroNet network that provides Web services in the form of files.
(3) Website signature set information file: this file contains the site address (public key), the sizes and hash values of the site files, the site owner's digital signature, a timestamp, and the like, and provides download indexing and content verification for the site data.
(4) Website text: the title of each blog post or forum post on a blog or forum website, together with the comments that follow it.
(5) SQL configuration file: the configuration file used by ZeroNet to parse the local database; it includes the database name, the table structures, and the SQL statements used to parse each table.
(6) Perplexity: perplexity is generally used in natural language processing to measure how well a trained language model fits the data. When LDA is used to cluster topics and words, the number of topics is usually chosen near the inflection point of the perplexity curve. The formula is as follows:
$$\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left(-\frac{\sum_{w}\log p(w)}{N}\right)$$
where p(w) is the probability of each word appearing in the test set; in the LDA model specifically, p(w) = Σ_z p(z|d) × p(w|z), where z and d denote a trained topic and an individual document of the test set, respectively. The denominator N is the number of all words appearing in the test set, i.e. the total length of the test set.
(7) Semi-supervised LDA topic model: the LDA topic model is a generative document-topic model, a three-layer Bayesian probability model of words, topics, and documents. Its basic idea is to assume that K hidden topics exist across all documents; each word of a document selects a topic with a certain probability and then selects a word from that topic with a certain probability, and hidden topics and their feature words are extracted continuously until every word in the documents has been traversed.
In LDA, the generation process of a document is defined by the following steps:
(1) Assume a latent topic; the distribution probability vector φ_{z_{i,j}} of the feature words on this topic obeys a Dirichlet distribution with parameter β.
(2) Determine the document length (i.e. the total number of terms); the text length N obeys a Poisson distribution.
(3) Sample a Dirichlet distribution with parameter α to generate the topic distribution θ_i of document i.
The following procedure is then repeated N times:
(a) Sample the topic z_{i,j} of the j-th word of document i from the multinomial distribution over topics.
(b) Sample from the multinomial word distribution P(ω_{i,j} | z_{i,j}; β) of the selected topic to finally generate the word ω_{i,j}.
The above is the process of generating the text by the LDA model, and when the process is finished, a text containing N words is generated.
The word distribution of the final document is:
$$P(\omega_{i,j}\mid\alpha,\beta)=\int_{\theta_i}\int_{\Phi}\sum_{z_{i,j}}P(\omega_{i,j}\mid\phi_{z_{i,j}})\,P(z_{i,j}\mid\theta_i)\,P(\theta_i\mid\alpha)\,P(\Phi\mid\beta)\,d\Phi\,d\theta_i$$
where α and β are the parameters of the Dirichlet distributions, θ_i is the topic distribution of document i, z_{i,j} is the topic of the j-th word in document i, φ_{z_{i,j}} is the word distribution of topic z_{i,j}, ω_{i,j} is the word sampled from φ_{z_{i,j}}, and Φ is the topic-word matrix.
The semi-supervised LDA topic model differs from the unsupervised LDA topic model in step (a): if the program detects that a word appears in the prior knowledge, the word is given a higher weight, so that it is more likely to fall into the corresponding topic of the prior knowledge. The prior knowledge thus constrains topics and some of their feature words without requiring a manually constructed constraint set.
(8) twords file: the twords data of the model, i.e. the prior data; it records, for each topic, the most heavily weighted feature words from the topic-word matrix.
(9) param file: records the parameters of the modeling program, the number of texts, the number of words, and other information.
(10) wordmap file: records all words in the texts to be modeled, one word per line together with its number.
(11) worddict file: each line represents a word and records the texts in which that word appears and its occurrence count in each of those texts.
(12) Feature words: words that characterize a text or a topic.
(13) Modeling initialization: for the first run, modeling initialization loads the texts to be modeled, computes the total number of words, the total number of documents, the occurrence frequency of each word in each text, and other information, passes this information into the program as parameters, and then begins modeling the texts. For subsequent runs, modeling initialization reads the param, wordmap, and worddict files from the previous run's modeling result to load the model data, and at the same time reads the twords file to initialize the prior knowledge.
(14) Text-topic matrix: each row of the matrix corresponds to a text and each column to a topic; it records the probability that each text belongs to each topic.
(15) Topic-word matrix: each row of the matrix corresponds to a word and each column to a topic; it records the probability that each word belongs to each topic.
(16) Inference: once the trained model has been obtained, texts can be inferred on that basis to obtain their topic distributions. The LDA inference process using the Gibbs sampling algorithm is as follows:
1) Traverse each word of the current document and randomly assign a topic z to it.
2) Rescan the current document and, for each word, resample and update its topic assignment according to Gibbs sampling.
3) Iterate step 2) until the Gibbs sampling converges.
4) After the iterations finish, count the topic assignments of all words in the document to obtain the text-topic distribution of the text.
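To make the inference procedure of definition (16) and the seed-word weighting of definition (7) concrete, the following Python sketch runs collapsed Gibbs sampling for a single new document against a fixed topic-word matrix. It is an illustration only: the function name, the boost factor, and the seed_words structure are assumptions, not part of the patent.

```python
import numpy as np

def infer_document(word_ids, phi, alpha=0.1, n_iter=100, seed_words=None, boost=5.0):
    """Collapsed Gibbs inference for one document.

    word_ids   -- list of word indices for the document
    phi        -- topic-word matrix, shape (K, V), rows sum to 1
    seed_words -- optional {topic_id: set(word_id)} prior knowledge; matching words
                  get their sampling weight multiplied by `boost`, the semi-supervised
                  weighting described in definition (7)
    """
    K = phi.shape[0]
    # step 1): randomly assign a topic to every word
    z = np.random.randint(K, size=len(word_ids))
    counts = np.bincount(z, minlength=K).astype(float)

    for _ in range(n_iter):                       # step 3): iterate until convergence
        for n, w in enumerate(word_ids):          # step 2): rescan and resample each word
            counts[z[n]] -= 1
            weights = (counts + alpha) * phi[:, w]
            if seed_words:                        # prior knowledge boosts seeded topics
                for k, words in seed_words.items():
                    if w in words:
                        weights[k] *= boost
            weights /= weights.sum()
            z[n] = np.random.choice(K, p=weights)
            counts[z[n]] += 1

    # step 4): count topic assignments to get the document-topic distribution
    theta = (counts + alpha) / (counts.sum() + K * alpha)
    return theta
```

Calling the function several times and averaging the returned vectors mirrors the "repeat each inference three times and take the average" rule used in step 35 below.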
The method for capturing and analyzing ZeroNet blog and forum texts comprises: first calling a browser to perform simulated login and obtain website data; then parsing the local database to obtain the text content; and, once the texts are obtained, using a semi-supervised LDA topic model to model, analyze, and classify them, with the category of a new text predicted from the modeling result so as to realize supervision of ZeroNet blog and forum texts.
In a preferred embodiment of the present invention, before acquiring the website data, an initialization process is performed:
Data of the main navigation websites in ZeroNet are parsed, the site address data therein are extracted, and a website database is constructed.
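A minimal sketch of the website database built in this initialization step, assuming a simple SQLite table with a site address, a category, an accessed flag, and a data path; the patent does not prescribe a schema, so all table and column names are illustrative.

```python
import sqlite3

def build_site_db(db_path, site_addresses):
    """Create the ZeroNet website database from addresses extracted
    from the navigation sites (the extraction itself is site-specific)."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS sites (
                        address   TEXT PRIMARY KEY,
                        category  TEXT,             -- e.g. 'blog' or 'forum'
                        accessed  INTEGER DEFAULT 0,
                        data_path TEXT)""")
    conn.executemany("INSERT OR IGNORE INTO sites (address) VALUES (?)",
                     [(a,) for a in site_addresses])
    conn.commit()
    conn.close()
```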
In a preferred embodiment of the present invention, the acquiring the website data comprises the steps of:
Step 11: open the ZeroNet application and a browser, and establish a network connection with the initial Tracker nodes using the ZeroNet Tracker-node communication protocols (including HTTP, UDP, and the Zero protocol) to initialize the network environment;
Step 12: read the blog and forum websites from the ZeroNet website database, simulate the access process of a real environment to traverse and access them, save the local data storage paths of the successfully accessed websites, and mark these websites as accessed; after a delay, whose length can be set manually, loop over the websites that were not successfully accessed in the previous traversal; the process ends after a fixed number of cycles or upon manual termination.
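Steps 11 and 12 might be sketched as follows, assuming ZeroNet's default local web gateway at 127.0.0.1:43110 and the site table from the initialization sketch above; the delay, the number of cycles, and the data-path layout are the manually set or assumed parameters.

```python
import sqlite3
import time
import requests

ZERONET_PROXY = "http://127.0.0.1:43110"   # default local ZeroNet gateway (assumed)

def traverse_sites(db_path, delay=60, max_rounds=5):
    conn = sqlite3.connect(db_path)
    for _ in range(max_rounds):                              # fixed number of cycles
        pending = [row[0] for row in conn.execute(
            "SELECT address FROM sites WHERE accessed = 0")]
        if not pending:
            break
        for address in pending:                              # traversal access
            try:
                r = requests.get(f"{ZERONET_PROXY}/{address}/", timeout=30)
                if r.ok:
                    # mark as accessed and record the local data path (assumed layout)
                    conn.execute(
                        "UPDATE sites SET accessed = 1, data_path = ? WHERE address = ?",
                        (f"ZeroNet/data/{address}", address))
                    conn.commit()
            except requests.RequestException:
                pass                                          # retry in the next round
        time.sleep(delay)                                     # manually set delay
    conn.close()
```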
In a preferred embodiment of the present invention, as shown in fig. 2, parsing the local database to obtain the text content includes the following steps:
Step 21: read the local data storage path of a successfully accessed website, parse the website signature set information file contained in that path, and read from it the path of the SQL database configuration file;
Step 22: parse the SQL configuration file at that path, use the parsed SQL statements to query the local SQL databases of the blog and forum websites and obtain the websites' text content, and perform language identification; divide the texts into Chinese, English, and other languages and store them under separate paths;
Step 23: since the data-set texts are mainly Chinese and English, this embodiment analyzes content mainly for these two languages. The Chinese and English texts are preprocessed separately, and low-information texts whose word count is below a set threshold are removed after preprocessing, yielding the processed plain texts.
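A sketch of steps 21-23, assuming ZeroNet's usual site layout (a content.json signature set information file listing a dbschema.json SQL configuration file that names a SQLite database) and using the third-party langdetect package for language identification; the table and column names in the query are purely illustrative, since in the described method the real SQL statements come from the configuration file.

```python
import json
import os
import re
import sqlite3
from langdetect import detect   # third-party: pip install langdetect

def extract_texts(site_data_path):
    """Steps 21-22: locate the SQL configuration file via content.json,
    then query the site's local SQLite database for its text content."""
    with open(os.path.join(site_data_path, "content.json"), encoding="utf-8") as f:
        content = json.load(f)
    # locate the SQL configuration file among the files listed in content.json (assumed name)
    schema_name = next(n for n in content.get("files", {}) if n.endswith("dbschema.json"))
    with open(os.path.join(site_data_path, schema_name), encoding="utf-8") as f:
        schema = json.load(f)
    db_file = os.path.join(site_data_path, schema["db_file"])

    conn = sqlite3.connect(db_file)
    texts = []
    # illustrative query only -- real statements are parsed from the configuration file
    for title, body in conn.execute("SELECT title, body FROM post"):
        texts.append(f"{title}\n{body or ''}")
    conn.close()
    return texts

def bucket_by_language(texts, min_words=5):
    """Step 22 language identification and step 23 low-information filtering.
    Word counting is simplified; Chinese would normally be segmented first (e.g. jieba)."""
    buckets = {"zh": [], "en": [], "other": []}
    for t in texts:
        clean = re.sub(r"\s+", " ", t).strip()
        if len(clean.split()) < min_words:          # drop texts below the set word count
            continue
        try:
            lang = detect(clean)
        except Exception:
            lang = "other"
        key = "zh" if lang.startswith("zh") else ("en" if lang == "en" else "other")
        buckets[key].append(clean)
    return buckets
```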
In a preferred embodiment of the present invention, as shown in fig. 3, modeling analysis and classification of text using a semi-supervised LDA topic model comprises the steps of:
Step 31: compute the perplexity of the processed Chinese and English texts respectively, and return a perplexity curve;
Step 32: the user sets an appropriate number of topics based on the perplexity curve, or sets the number of topics he or she expects; the user may set other parameters or use the defaults;
Step 33: after the parameters are set, read the texts to be modeled to initialize the model, then start modeling and generate an initial modeling result;
Step 34: the user may choose to manually modify the twords file in the modeling result, removing noise words under each topic or adding feature words to the corresponding topics as prior knowledge, and rerun the modeling program; the previous modeling result and the prior knowledge are read to initialize the model, so that the feature words under each topic in the prior knowledge have a higher probability of falling into that topic; this is repeated in every loop, the user freely choosing the number of loops, and the current modeling result is saved after each loop finishes; when modeling is finished, proceed to step 35;
Step 35: apply Bayesian inference to the texts based on the modeling result, repeating each inference three times and taking the average, and return the topic matrix of the texts; the user may select the topic with the highest probability as the classification result to realize text classification, or take the topics whose probability exceeds a specified threshold as the labels of the text to realize automatic multi-label marking.
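A minimal sketch of steps 31-35 using the open-source gensim library as a stand-in for the patent's own modeling program, which is not published. The semi-supervised twords adjustment is not reproduced here (gensim's eta parameter can be given per-topic word priors for a similar effect); the candidate topic numbers, threshold, and function names are illustrative assumptions.

```python
import numpy as np
from gensim import corpora, models

def choose_num_topics(tokenized_docs, candidate_ks=(5, 10, 15, 20, 25)):
    """Step 31: compute a perplexity value per candidate topic number,
    so the user can pick a value near the inflection point (step 32)."""
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(d) for d in tokenized_docs]
    curve = {}
    for k in candidate_ks:
        lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                              num_topics=k, passes=10, random_state=0)
        # gensim reports a per-word likelihood bound; perplexity = 2 ** (-bound)
        curve[k] = float(2 ** (-lda.log_perplexity(corpus)))
    return dictionary, corpus, curve

def train_and_classify(dictionary, corpus, num_topics, new_doc_tokens, threshold=0.3):
    """Steps 33-35: train the model, infer a new text three times,
    average the topic vector, then classify / multi-label it."""
    lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                          num_topics=num_topics, passes=20, random_state=0)
    bow = dictionary.doc2bow(new_doc_tokens)
    runs = []
    for _ in range(3):                                   # repeat three times, take the mean
        vec = np.zeros(num_topics)
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            vec[topic_id] = prob
        runs.append(vec)
    theta = np.mean(runs, axis=0)
    best_topic = int(np.argmax(theta))                   # single-label classification
    labels = [k for k, p in enumerate(theta) if p > threshold]   # multi-label marking
    return theta, best_topic, labels
```

choose_num_topics covers the perplexity curve of steps 31-32, and train_and_classify covers training, three-fold averaged inference, highest-probability classification, and threshold-based multi-label marking of steps 33-35.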
In a preferred embodiment of the present invention, step 35 can be applied to infer existing texts, or to infer newly acquired texts and predict their categories, thereby realizing supervision of ZeroNet blog and forum texts; when a modeling result already exists, the user can start inference directly from step 35.
The foregoing is illustrative of the preferred embodiments of this invention, and it is to be understood that the invention is not limited to the precise forms disclosed herein. Various other combinations, modifications, and environments may be resorted to within the scope of the inventive concept described above or apparent to those skilled in the relevant art, and modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A method for capturing and analyzing ZeroNet blog and forum texts, characterized in that a browser is called to perform simulated login and obtain website data, a local database is parsed to obtain the text content, a semi-supervised LDA topic model is used to perform modeling analysis and classification on the obtained texts, and the category of a new text can be predicted based on the modeling result, so as to realize supervision of ZeroNet blog and forum texts.
2. The method of claim 1, wherein before the website data is obtained, an initialization process is performed:
Data of the main navigation websites in ZeroNet are parsed, the site address data therein are extracted, and a website database is constructed.
3. The method of claim 2, wherein the step of obtaining website data comprises the steps of:
Step 11: open the ZeroNet application and a browser, and establish a network connection with the initial Tracker nodes using the ZeroNet Tracker-node communication protocol to initialize the network environment;
Step 12: read the blog and forum websites from the ZeroNet website database, simulate the access process of a real environment to traverse and access them, save the local data storage paths of the successfully accessed websites, and mark these websites as accessed; after a delay, whose length can be set manually, loop over the websites that were not successfully accessed in the previous traversal; the process ends after a fixed number of cycles or upon manual termination.
4. The method of claim 3, wherein parsing the local database to obtain text content comprises:
Step 21: read the local data storage path of a successfully accessed website, parse the website signature set information file contained in that path, and read from it the path of the SQL database configuration file;
Step 22: parse the SQL configuration file at that path, use the parsed SQL statements to query the local SQL databases of the blog and forum websites and obtain the websites' text content, and perform language identification; divide the texts into Chinese, English, and other languages and store them under separate paths;
Step 23: preprocess the Chinese and English texts separately, and after preprocessing remove low-information texts whose word count is below a set threshold, obtaining the processed plain texts.
5. The method of claim 4, wherein the modeling, analyzing and classifying the text by using the semi-supervised LDA topic model comprises the following steps:
Step 31: compute the perplexity of the processed Chinese and English texts respectively, and return a perplexity curve;
Step 32: the user sets an appropriate number of topics based on the perplexity curve, or sets the number of topics he or she expects; the user may set other parameters or use the defaults;
Step 33: after the parameters are set, read the texts to be modeled to initialize the model, then start modeling and generate an initial modeling result;
Step 34: the user may choose to manually modify the twords file in the modeling result, removing noise words under each topic or adding feature words to the corresponding topics as prior knowledge, and rerun the modeling program; the previous modeling result and the prior knowledge are read to initialize the model, so that the feature words under each topic in the prior knowledge have a higher probability of falling into that topic; this is repeated in every loop, the user freely choosing the number of loops, and the current modeling result is saved after each loop finishes; when modeling is finished, proceed to step 35;
Step 35: apply Bayesian inference to the texts based on the modeling result, repeating each inference three times and taking the average, and return the topic matrix of the texts; the user may select the topic with the highest probability as the classification result to realize text classification, or take the topics whose probability exceeds a specified threshold as the labels of the text to realize automatic multi-label marking.
6. The method of claim 5, wherein step 35 can infer existing texts, or infer newly acquired texts and predict their categories so as to realize supervision of ZeroNet blog and forum texts; when a modeling result already exists, the user can start inference directly from step 35.
CN202010716026.1A 2020-05-19 2020-07-23 ZeroNet blog and forum text grabbing and analyzing method Pending CN111814068A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010426038 2020-05-19
CN2020104260380 2020-05-19

Publications (1)

Publication Number Publication Date
CN111814068A true CN111814068A (en) 2020-10-23

Family

ID=72862368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010716026.1A Pending CN111814068A (en) 2020-05-19 2020-07-23 ZeroNet blog and forum text grabbing and analyzing method

Country Status (1)

Country Link
CN (1) CN111814068A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114915599A (en) * 2022-07-19 2022-08-16 中国电子科技集团公司第三十研究所 Darknet site session identification method and system based on semi-supervised clustering learning

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744981A (en) * 2014-01-14 2014-04-23 南京汇吉递特网络科技有限公司 System for automatic classification analysis for website based on website content

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744981A (en) * 2014-01-14 2014-04-23 南京汇吉递特网络科技有限公司 System for automatic classification analysis for website based on website content

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
过小宇 et al.: "ZeroNet text content analysis based on a semi-supervised LDA topic model" (基于半监督LDA主题模型的ZeroNet文本内容分析), Information Technology (《信息技术》) *
郭晗 et al.: "Darknet space resource detection based on Freenet" (基于Freenet的暗网空间资源探测), Communications Technology (《通信技术》) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114915599A (en) * 2022-07-19 2022-08-16 中国电子科技集团公司第三十研究所 Darknet site session identification method and system based on semi-supervised clustering learning
CN114915599B (en) * 2022-07-19 2022-11-11 中国电子科技集团公司第三十研究所 Darknet site session identification method and system based on semi-supervised clustering learning

Similar Documents

Publication Publication Date Title
Olmezogullari et al. Representation of click-stream datasequences for learning user navigational behavior by using embeddings
Gleich et al. Tracking the random surfer: empirically measured teleportation parameters in PageRank
CN104133877B (en) The generation method and device of software label
CN111831802B (en) Urban domain knowledge detection system and method based on LDA topic model
CN108717408A (en) A kind of sensitive word method for real-time monitoring, electronic equipment, storage medium and system
CN106354818B (en) Social media-based dynamic user attribute extraction method
KR101923780B1 (en) Consistent topic text generation method and text generation apparatus performing the same
WO2023108980A1 (en) Information push method and device based on text adversarial sample
JP6776310B2 (en) User-Real-time feedback information provision methods and systems associated with input content
CN102222098A (en) Method and system for pre-fetching webpage
CN111259220B (en) Data acquisition method and system based on big data
Libovický et al. Multimodal abstractive summarization for open-domain videos
CN108536868A (en) The data processing method of short text data and application on social networks
CN109508448A (en) Short information method, medium, device are generated based on long article and calculate equipment
Bykau et al. Fine-grained controversy detection in Wikipedia
Han et al. Linking social network accounts by modeling user spatiotemporal habits
CN107862039A (en) Web data acquisition methods, system and Data Matching method for pushing
CN110019763B (en) Text filtering method, system, equipment and computer readable storage medium
CN111814068A (en) ZeroNet blog and forum text grabbing and analyzing method
CN110069686A (en) User behavior analysis method, apparatus, computer installation and storage medium
Aziz et al. Social network analytics: natural disaster analysis through twitter
Shu et al. Automatic extraction of web page text information based on network topology coincidence degree
Bhat et al. Browser simulation-based crawler for online social network profile extraction
WO2023048807A1 (en) Hierarchical representation learning of user interest
CN115470328A (en) Open field question-answering method based on knowledge graph and related equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201023

RJ01 Rejection of invention patent application after publication