CN112149416A

CN112149416A - Method for detecting hot spot academic research topic in distributed academic data warehouse

Info

Publication number: CN112149416A
Application number: CN202010938852.0A
Authority: CN
Inventors: 戴海鹏; 陈贵海; 李猛; 汪笑宇; 夏瑞; 谢榕彪; 于俊
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2020-09-09
Filing date: 2020-09-09
Publication date: 2020-12-29
Anticipated expiration: 2040-09-09
Also published as: CN112149416B

Abstract

A method for detecting hot-spot academic research topics in a distributed academic data warehouse comprises a data sampling, compression, and encoding stage and a transmission stage in each distributed data warehouse, and a data recovery and detection stage on a central server. In the data sampling stage, each academic word extracted from an academic document is sampled multiple times to determine whether it enters each coding cuckoo filter in a group; successfully sampled words enter the data encoding stage. The data compression and encoding stage scans all documents in each distributed data warehouse and extracts academic research vocabulary from them using a word segmenter. The data transmission stage transmits the coding cuckoo filter, which records the compressed data of each distributed data warehouse, to the central server. The data recovery and detection stage decodes and recovers, on the central server, the original vocabulary from the coding cuckoo filters constructed from each distributed data set, and estimates the heat of that vocabulary.

Description

Method for detecting hot spot academic research topic in distributed academic data warehouse

Technical Field

The invention relates to data mining, and more particularly to a method framework for detecting hot-spot academic research topics in a distributed academic data warehouse.

Background

With the steady growth in the level and scale of domestic academic research in recent years, the number of published academic papers is increasing day by day. For example, the Science and Engineering Indicators report issued by the U.S. National Science Foundation shows that more than 426,000 academic papers were published in China in 2016, corresponding to 18.6% of the international total, which exceeds the United States and makes China the largest producer of academic papers. However, as the number of published papers grows and research directions continue to diverge, it becomes increasingly difficult to grasp current academic hot spots and track the corresponding research progress, which makes it harder for novice researchers to follow the research frontier; it also makes it difficult for research management institutions to allocate research projects and funds reasonably.

In recent years, some research has begun to focus on detecting academic hot spots, but it has the following limitations: (1) hot topics can only be detected in a centralized academic repository; (2) continuous updating of research documents is not supported; (3) a large amount of network bandwidth and memory is required to support the detection process. Considering that existing academic repositories are deployed in a distributed manner and that the requirements for hot research topics must be defined in practice, existing work cannot be used directly to detect hot academic research topics in a distributed data warehouse, and therefore cannot achieve the goal set by the present invention.

Therefore, providing a method and a system that detect hot-spot research topics in a distributed data warehouse, effectively reduce the amount of data to be transmitted in a distributed environment, and ensure the accuracy of hot-spot topic detection is an urgent problem in the art.

Disclosure of Invention

The invention aims to detect hot-spot research topics in a distributed data warehouse while keeping communication traffic low.

To achieve this purpose, the technical scheme of the invention is as follows: a method for detecting hot-spot academic research topics in a distributed academic data warehouse, characterized in that the method comprises a data sampling, compression, encoding, and transmission stage in the distributed data warehouse and a data recovery and detection stage on a central server;

wherein:

the data sampling, compression, and encoding stage maintains a group of coding cuckoo filters; each academic word extracted from an academic document is sampled multiple times to determine whether it enters each coding cuckoo filter in the group, and successfully sampled words enter the data encoding stage;

the data encoding stage scans all documents in each distributed data warehouse, extracts academic research vocabulary from the documents using a word segmenter, compresses and encodes the extracted vocabulary and its frequency, and records them in the storage structure of a Coding Cuckoo Filter;

the data transmission stage transmits the coding cuckoo filter, which records the compressed data of each distributed data warehouse, to the central server;

the data recovery and detection stage decodes and recovers, on the central server, the original words from the coding cuckoo filters constructed from the distributed data sets, estimates the heat (frequency) of the words, and outputs hot research topics according to the heat (frequency) requirement for academic topics given by the user. Based on the coding cuckoo filters sent by the distributed servers, the potential hot research topic words and their heat are recovered from the compressed data stored on those servers, the total heat across all distributed data warehouses is calculated, and the hot research topics are finally output according to the total heat.

When hot-spot academic research topics are detected in a distributed academic data warehouse: (1) in the data storage stage, a group of coding cuckoo filters is maintained, and multiple sampling determines whether each academic word is stored in each filter; (2) in the data encoding and transmission stage, the original data is not stored; instead, its code, fingerprint information, and frequency information are stored, the frequency of each academic word being sampled and then recorded together with the word's code and fingerprint; (3) in the data recovery and detection stage, the codes belonging to the same element (the same research word) are gathered according to the fingerprint information and then decoded to recover the original data.

In the data encoding and transmission stage, the data is first multi-sampled, then compression-encoded, and then stored in the coding cuckoo filter.

In the data recovery and detection stage, the heat of the academic vocabulary is recovered by maximum likelihood estimation.

The invention provides a method for detecting hot research topics in a distributed academic data warehouse, which comprises: designing a system model for distributed detection; compressing the storage footprint of academic topic words using an encoding technique; further reducing data storage and traffic using a multi-sampling technique; and increasing data processing speed by storing data in coding cuckoo filters. Specifically, the invention: 1. designs a system model for hot academic topic detection; 2. proposes scanning academic documents in a distributed manner, extracting hot words, and compressing, encoding, and storing the academic words and their heat; 3. stores the encoded data in a coding cuckoo filter to speed up data processing.

A hot academic topic refers to a problem studied by a large body of academic research, and it is expressed in the form of words. First, a system model for detecting hot academic topics in a distributed academic data warehouse is provided. Second, in the data sampling stage, a multi-sampling technique reduces the amount of stored data while maintaining high accuracy. In the data encoding stage, each distributed data warehouse uses an encoding technique to compress the topic words contained in the academic documents it holds, and then stores the encoded data in a coding cuckoo filter. In the data transmission stage, the compressed data is transmitted to a designated central server, where the hot topics and their occurrence frequencies are recovered; finally, all hot academic topics are output according to the topic heat requirement provided by the user. The invention provides, for the first time, a method for detecting hot academic research topics in a distributed academic data warehouse; it effectively reduces the large volume of communication traffic otherwise generated when detecting academic research hot spots in a distributed environment, provides corresponding theoretical performance guarantees, and can be used to detect hot academic topics and calculate topic heat.

The beneficial effects of the invention are: 1. encoding and compressing the data effectively reduces the volume of data that must be transmitted to detect hot academic topics in a distributed data environment; 2. the multi-sampling technique further reduces the volume of stored and transmitted data; 3. the encoded data is stored in the coding cuckoo filter and transmitted as a whole, which greatly speeds up data processing and transmission.

Drawings

FIG. 1 is a system architecture diagram of the present invention;

FIG. 2 is a flow chart of a data sampling phase;

FIG. 3 is a flow chart of data decoding, recovery, and detection.

Detailed Description

The system architecture of the invention is shown in FIG. 1 and includes a central server and distributed academic data warehouses. The invention has two stages: (1) the data sampling, compression and encoding, and transmission stage completed in the distributed data warehouses; (2) the data decoding and recovery and hot topic detection stage on the central server. The first stage can be further subdivided into 3 steps: data sampling; data encoding and fingerprint acquisition; and data storage and transmission. The second stage can be divided into 2 steps: data decoding and recovery, and hot word detection.

Stage 1.1: data sampling phase

In the data sampling stage, a group of coding cuckoo filters is maintained for each distributed data warehouse. The academic documents in the data warehouse are scanned, a word segmenter extracts the academic research words they contain, and multiple sampling then determines how each word is stored. The multi-sampling process is as follows: (1) the sampling probability of the coding cuckoo filters in each group decays geometrically as the filter's sequence number increases, for example: the sampling probability of the first filter is 1%, the second 0.2%, the third 0.04%, and so on; (2) each academic word is sampled independently on every coding cuckoo filter in the group according to that filter's preset sampling probability, and the sampling across the multiple filters constitutes multiple sampling. Successfully sampled academic words enter the subsequent encoding stage.
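The multi-sampling step can be illustrated with a minimal Python sketch. The function names, the decay ratio of 0.2 (inferred from the 1%, 0.2%, 0.04% example), and the use of a seeded pseudo-random generator are assumptions made for illustration, not details given in the description.

```python
import random


def sampling_probabilities(first=0.01, ratio=0.2, num_filters=3):
    """Geometrically decaying probabilities, e.g. 1%, 0.2%, 0.04%.

    `first`, `ratio` and `num_filters` are illustrative defaults.
    """
    return [first * ratio ** k for k in range(num_filters)]


def sample_occurrence(probabilities, rng):
    """Return the indices of the filters that accept one word occurrence.

    Every filter draws independently, so a single occurrence may enter
    zero, one, or several filters of the group.
    """
    return [k for k, p in enumerate(probabilities) if rng.random() < p]


rng = random.Random(42)  # fixed seed so the run is repeatable
probs = sampling_probabilities()
hits = [0] * len(probs)

# Simulate 100,000 occurrences of one word and count acceptances per filter.
for _ in range(100_000):
    for k in sample_occurrence(probs, rng):
        hits[k] += 1

print(probs)
print(hits)
```

Only the occurrences accepted by a filter would go on to be encoded and inserted into that filter, which is what reduces storage and traffic.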

Stage 1.2: data encoding phase

In the data encoding stage, the academic word is first compressed, its fingerprint information is then obtained, and the resulting code and fingerprint are inserted into the coding cuckoo filter while the counter at the insertion position is incremented by 1.

Compression process: each academic word has an identification number (ID), which can be obtained directly from its English characters or from the binary representation of its Chinese encoding. Since this number is usually long, transmitting it directly causes excessive traffic. To solve the traffic problem, we first apply a lossy compression code (Raptor code) to the data, as follows:

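Given the Raptor code coding matrix $[a_{ij}]$ with $1 \le j \le l$, the corresponding length of the vocabulary ID is [expression missing from the source text].

The coded result of the bit is [expression missing from the source text].

The calculation process is as follows: [equation missing from the source text]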

Fingerprint acquisition: given a vocabulary ID and a hash function $h_f(\cdot)$, the fingerprint information $f$ (of length $p$) is obtained as follows:

$f = h_f(\mathrm{ID}) \bmod 2^p$, where $\bmod$ denotes the modulo operation.
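As a small illustration of the fingerprint formula, the sketch below reduces a hash of the vocabulary ID modulo 2^p. The description does not say which hash function h_f is used; SHA-256 and the default p = 12 are assumptions chosen only to make the example runnable.

```python
import hashlib


def fingerprint(word_id: bytes, p: int = 12) -> int:
    """Compute f = h_f(ID) mod 2^p.

    SHA-256 stands in for h_f here (an assumption); any well-mixed
    hash function could play the same role.
    """
    digest = hashlib.sha256(word_id).digest()
    h = int.from_bytes(digest, "big")
    return h % (1 << p)  # keep only the lowest p bits


f = fingerprint("cuckoo filter".encode("utf-8"))
print(f)
```

A larger p lowers the chance that two different words share a fingerprint, at the cost of more bits stored and transmitted per entry.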

After the code and the fingerprint are obtained, they are inserted in the manner of an ordinary cuckoo filter: (1) two hash functions are used to compute the two candidate buckets; (2) if either of the two buckets has a free slot, the element is inserted there directly; (3) if neither bucket has a free slot, an existing element is evicted to free a slot for the new element, and the evicted element is then re-inserted by repeating the above process.
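The insertion procedure can be sketched as follows. This is a simplified illustration, not the patented structure itself: the class name `CodingCuckooFilter`, the entry layout `[fingerprint, code, counter]`, the parameter defaults, and the use of partial-key hashing (deriving the second bucket from the first bucket and the fingerprint, as in the standard cuckoo filter) are all assumptions.

```python
import hashlib
import random


def _h(data: bytes) -> int:
    """64-bit hash helper; SHA-256 is an illustrative choice."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class CodingCuckooFilter:
    """Toy filter whose entries are [fingerprint, code, counter]."""

    def __init__(self, num_buckets=64, bucket_size=4, max_kicks=500, seed=0):
        # A power-of-two bucket count keeps the XOR step below in range.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    def _alt(self, index: int, fp: int) -> int:
        """Other candidate bucket; applying it twice returns `index`."""
        return (index ^ _h(fp.to_bytes(4, "big"))) % self.num_buckets

    def insert(self, word_id: bytes, fp: int, code: int) -> bool:
        i1 = _h(word_id) % self.num_buckets
        i2 = self._alt(i1, fp)
        # Already present: increase the counter at the insertion position.
        for i in (i1, i2):
            for entry in self.buckets[i]:
                if entry[0] == fp and entry[1] == code:
                    entry[2] += 1
                    return True
        # (2) A free slot in either candidate bucket: insert directly.
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append([fp, code, 1])
                return True
        # (3) Both full: evict an entry and re-insert it elsewhere.
        i = self.rng.choice((i1, i2))
        entry = [fp, code, 1]
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            entry, self.buckets[i][slot] = self.buckets[i][slot], entry
            i = self._alt(i, entry[0])
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(entry)
                return True
        return False  # table too full; the last evicted entry is dropped


cf = CodingCuckooFilter()
for _ in range(3):
    cf.insert(b"cuckoo filter", fp=0x2A, code=0b1011)
print([e for bucket in cf.buckets for e in bucket])
```

Deriving the second bucket from the fingerprint lets an evicted entry be relocated without access to the original word, which matters here because only the code and fingerprint are stored.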

Stage 1.3: data transmission

When the encoding stage of each distributed data warehouse is complete, the coding cuckoo filter storing the compressed data is sent to a designated central server.

Stage 2.1: data recovery and detection

After all data sets have been sent to the central server, the compressed data must be extracted from the coding cuckoo filters received from the different distributed academic data warehouses, and then decoded to recover the original data and estimate the heat.

Extracting compressed data: after the coding cuckoo filters sent by the servers are obtained, they are arranged and aligned for processing. All buckets of all cuckoo filters are traversed: the current bucket is selected, an element in it is taken out, and then, according to that element's fingerprint information, the elements from all distributed data warehouses that occupy the same insertion position and carry the same fingerprint are extracted to form one code group. As shown in FIG. 3, when the traversal encounters element 1, all remaining elements in the same code group are extracted according to element 1.
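The grouping step described above can be sketched in a few lines of Python. The data layout (one list of buckets per warehouse, each bucket holding `(fingerprint, code, counter)` tuples) and the function name `group_codes` are assumptions for illustration; the description does not specify the in-memory representation.

```python
from collections import defaultdict


def group_codes(filters):
    """Group entries from all warehouses by (bucket index, fingerprint).

    `filters` holds one element per warehouse; each element is a list of
    buckets, and each bucket is a list of (fingerprint, code, counter).
    """
    groups = defaultdict(list)
    for warehouse_id, buckets in enumerate(filters):
        for bucket_index, bucket in enumerate(buckets):
            for fp, code, counter in bucket:
                # Same position and same fingerprint => same code group.
                groups[(bucket_index, fp)].append((warehouse_id, code, counter))
    return dict(groups)


# Two warehouses, two buckets each (toy values).
warehouse_a = [[(7, 0b101, 3)], [(9, 0b110, 1)]]
warehouse_b = [[(7, 0b011, 2)], []]

groups = group_codes([warehouse_a, warehouse_b])
for key, members in sorted(groups.items()):
    print(key, members)
```

Each resulting group collects the codes that are presumed to describe the same word, which is the input the decoding step needs.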

Decoding: the extracted codes are substituted into equation 1 to decode the original vocabulary ID.

Heat estimation: for each decoded vocabulary ID, an estimated heat value is calculated by maximum likelihood estimation from the value in its counter and the corresponding sampling probability, and the decoded word and its heat are then output.
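The description does not state the likelihood model, so the following is one possible reading rather than the patented estimator. Suppose a word occurs n times and filter k, with sampling probability p_k, records c_k of those occurrences. If each c_k is modelled as Poisson with mean n·p_k (an approximation to binomial sampling that is reasonable for small p_k), the log-likelihood is Σ_k [c_k ln(n·p_k) − n·p_k] plus a constant, and setting its derivative with respect to n to zero gives the estimate n̂ = Σ_k c_k / Σ_k p_k. The function names and the threshold helper below are hypothetical.

```python
def estimate_heat(counters, probabilities):
    """Maximum likelihood estimate of a word's frequency.

    Assumes counters[k] ~ Poisson(n * probabilities[k]); under that
    model the estimate is sum(counters) / sum(probabilities).
    """
    total_p = sum(probabilities)
    if total_p <= 0:
        raise ValueError("at least one sampling probability must be positive")
    return sum(counters) / total_p


def hot_words(estimates, threshold):
    """Keep the words whose estimated heat reaches the user's threshold."""
    return sorted(w for w, heat in estimates.items() if heat >= threshold)


# Counters observed for one word in filters sampled at 1%, 0.2%, 0.04%.
estimate = estimate_heat([100, 20, 4], [0.01, 0.002, 0.0004])
print(round(estimate))

estimates = {"cuckoo filter": estimate, "rare term": estimate_heat([1, 0, 0], [0.01, 0.002, 0.0004])}
print(hot_words(estimates, threshold=5000))
```

With a single filter the estimate reduces to c / p, that is, the counter value scaled up by the inverse of the sampling probability.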

Claims

1. A method for detecting hot-spot academic research topics in a distributed academic data warehouse, characterized by comprising a data sampling, compression, and encoding stage and a transmission stage in the distributed data warehouse, and a data recovery and detection stage on a central server;

wherein: the data sampling, compression, and encoding stage maintains a group of coding cuckoo filters; each academic word extracted from an academic document is sampled multiple times to determine whether it enters each coding cuckoo filter in the group, and successfully sampled words enter the data encoding stage;

the data compression and encoding stage scans all documents in each distributed data warehouse, extracts academic research vocabulary from the documents using a word segmenter, compresses and encodes the extracted vocabulary and its frequency, and records them in the storage structure of a Coding Cuckoo Filter;

the data recovery and detection stage decodes and recovers, on the central server, the original words from the coding cuckoo filters constructed from the distributed data sets, estimates the heat (frequency) of the words, and outputs hot research topics according to the heat (frequency) requirement for academic topics given by the user.

2. The method of detecting hot-spot academic research topics as claimed in claim 1, wherein, when hot-spot academic research topics are detected in a distributed academic data warehouse: (1) in the data storage stage, a group of coding cuckoo filters is maintained, and multiple sampling determines whether each academic word is stored in each filter; (2) in the data encoding and transmission stage, the original data is not stored; instead, its code, fingerprint information, and frequency information are stored, the frequency of each academic word being sampled and then recorded together with the word's code and fingerprint; (3) in the data recovery and detection stage, the codes belonging to the same element (the same research word) are gathered according to the fingerprint information and then decoded to recover the original data.

3. The method for detecting hot academic research topics as claimed in claim 1, wherein, in the data encoding and transmission stage, the data is first multi-sampled, then compression-encoded, and then stored in the coding cuckoo filter.

4. The method of detecting hot academic research topics as claimed in claim 1, wherein during the data recovery and detection phase, the heat of the academic vocabulary is recovered by means of maximum likelihood estimation.

5. The method of claim 1, wherein, in the data sampling stage, a group of coding cuckoo filters is maintained for each distributed data warehouse, the academic documents in the data warehouse are scanned, a word segmenter extracts the academic research words they contain, and multiple sampling then determines how each word is stored; the multi-sampling process is as follows: (1) the sampling probability of the coding cuckoo filters in each group decays geometrically as the filter's sequence number increases, for example: the sampling probability of the first filter is 1%, the second 0.2%, the third 0.04%, and so on; (2) each academic word is sampled independently on every coding cuckoo filter in the group according to that filter's preset sampling probability, and the sampling across the multiple filters constitutes multiple sampling; successfully sampled academic words enter the subsequent encoding stage.

6. The method of claim 1, wherein in the data encoding stage, the academic vocabulary is compressed first, then the vocabulary fingerprint information is acquired, and then the acquired code and fingerprint information are inserted into the encoded cuckoo filter, and the counter of the insertion position is increased by 1.

7. The method of detecting hot academic research topics as claimed in claim 1, wherein the compression encoding process is as follows: each academic word has an identification number (ID), which can be obtained directly from its English characters or from the binary representation of its Chinese encoding; since this number is usually long, transmitting it directly causes excessive traffic; to solve the traffic problem, a lossy compression code (Raptor code) is first applied to the data, as follows:

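given the Raptor code coding matrix $[a_{ij}]$ with $1 \le j \le l$, the corresponding length of the vocabulary ID is [expression missing from the source text];

the coded result of the bit is [expression missing from the source text];

the calculation process is as follows: [equation missing from the source text]

$f = h_f(\mathrm{ID}) \bmod 2^p$, where $\bmod$ denotes the modulo operation;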

after the code and the fingerprint are obtained, they are inserted in the manner of an ordinary cuckoo filter: (1) two hash functions are used to compute the two candidate buckets; (2) if either of the two buckets has a free slot, the element is inserted there directly; (3) if neither bucket has a free slot, an existing element is evicted to free a slot for the new element, and the evicted element is then re-inserted by repeating the above process.

8. The method of detecting a hot academic research topic according to claim 1, wherein the compressed data is extracted as follows: after the coding cuckoo filters sent by the servers are obtained, they are arranged and aligned for processing; all buckets of all cuckoo filters are traversed: the current bucket is selected, an element in it is taken out, and then, according to that element's fingerprint information, the elements from all distributed data warehouses that occupy the same insertion position and carry the same fingerprint are extracted to form one code group; when the traversal encounters element 1, all other elements in the same code group are extracted according to element 1.

9. The method of detecting a hot academic research topic of claim 1, wherein, in decoding, the extracted codes are substituted into equation 1 to decode the original vocabulary ID; and, in heat estimation, for each decoded vocabulary ID, an estimated heat value is calculated by maximum likelihood estimation from the value in its counter and the corresponding sampling probability, and the decoded word and its heat are then output.