CN111460046A

CN111460046A - Scientific and technological information clustering method based on big data

Info

Publication number: CN111460046A
Application number: CN202010150066.4A
Authority: CN
Inventors: 丁荣荣
Original assignee: Hefei Haice Science And Technology Information Service Co ltd
Current assignee: Hefei Haice Science And Technology Information Service Co ltd
Priority date: 2020-03-06
Filing date: 2020-03-06
Publication date: 2020-07-28

Abstract

The invention discloses a scientific and technological information clustering method based on big data, comprising the following steps: collecting historical user behavior data; analyzing and processing user behavior characteristics; establishing a user behavior feature set; establishing a big data clustering model; clustering the data set with the clustering model; and pushing the clustered information resources to the user. The method uses the parallel computing capability of a high-performance cloud computing cluster system to address the big data processing problem faced by scientific and technological information clustering. It facilitates the development of cloud-based big data mining, takes parallel clustering as its target, hides the underlying layer from developers, improves the capacity and speed of large-scale data processing, and applies cloud computing to cluster analysis in data mining. It avoids recommending scientific and technological information purely on the basis of text similarity, so that scientific and technological personnel can obtain more comprehensive and accurate information.

Description

Scientific and technological information clustering method based on big data

Technical Field

The invention relates to the technical field of big data processing, in particular to a scientific and technological information clustering method based on big data.

Background

Scientific and technological information is an information carrier that records scientific and technological activities and knowledge. It is a main means of recording and transmitting scientific and technical information, and an important tool that helps people understand objective things, inspires ideas and supports the search for technical solutions. Scientific and technological information includes intellectual property, scientific and technological articles, scientific and technological achievements, technical standards, scientific data, information, new products and the like. As the scientific and technological level of society advances, the volume of scientific and technological information data is growing explosively. Neither the development nor the use of this data can do without the support of network technology. However, the scientific and technological information currently available on the network is cluttered and has low comprehensiveness and accuracy, so that scientific and technological enterprises and personnel cannot easily obtain real and valuable information directly. There is a sharp conflict between the fragmentation of scientific and technological information on the one hand and the fragmented time and the individualized, diversified information needs of scientific and technological personnel on the other. In addition, terminal devices are shifting from PCs to handheld smart terminals, which further raises the demand of scientific and technological personnel for intelligent presentation and recommendation of scientific and technological information. Filtering out useless information, effectively classifying and refining the various kinds of scientific and technological information, and thereby recommending accurate, high-quality information to scientific and technological enterprises and personnel is becoming increasingly important.

In the prior art, patent CN201310173534.X provides a method for automatically classifying and screening scientific and technological information. It mainly addresses the problem that existing search technology is based on single words rather than on a summary of the whole page, improves information retrieval efficiency, and ensures the completeness and reliability of data capture. Patent CN201410150100.2 provides a heterogeneous data analysis method for vertical search of scientific and technological information. It mainly addresses the accuracy of vertical search, so that users can more easily obtain information that meets their actual needs. Although these technologies belong to closely related fields and each has its own design merits, both are designed for scientific and technological information search rather than for big data processing of scientific and technological information, and they do not meet the requirements of intelligent clustering-based recommendation. At present, the recommendation of scientific and technological information is still based on text similarity, whereas scientific and technological personnel want more comprehensive and accurate information, so the recommendation results are unsatisfactory.

Therefore, it is necessary to invent a scientific and technological information clustering method based on big data to solve the above problems.

Disclosure of Invention

The invention aims to provide a scientific and technological information clustering method based on big data to solve the problems in the background technology.

In order to achieve the purpose, the invention provides the following technical scheme:

A scientific and technological information clustering method based on big data, characterized by comprising the following steps:

S1, collecting historical user behavior data: the client collects user data and uploads it to the cloud server; the collected data comprises the keywords entered by the user, the user's browsing behavior and the user's basic personal information;

S2, analyzing and processing user behavior characteristics: the user data is preprocessed and aggregated, incomplete and useless data is filtered out, and useful data with complete behavior characteristics is stored in the big data store;

S3, establishing a user behavior feature set: the system analysis module analyzes the user behavior, extracts the information the user browses frequently, integrates the user's basic information and establishes a user behavior feature set;

S4, establishing a big data clustering model: in-depth analysis is performed on the user data center using a deep learning algorithm, a machine learning algorithm and a semantic analysis algorithm, and a big data clustering model is established;

S5, clustering the data set with the clustering model: the scientific and technological information resources related to the user behavior analyzed by the analysis module are retrieved from the big data store, and the sub-data is clustered locally;

S6, pushing scientific and technological information: the data pushing module pushes the locally clustered information resources to the user.
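
Steps S2 and S3 can be illustrated with a minimal sketch. The patent does not specify a record layout or how "frequently browsed" is measured, so the field names (user_id, keywords, browsed), the completeness rule and the top_n cut-off below are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical record layout: the text only states that keywords, browsing
# behavior and basic personal information are collected in S1.
REQUIRED_FIELDS = ("user_id", "keywords", "browsed")


def is_complete(record):
    """S2 (assumed rule): a record is kept only if every required field is non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)


def build_feature_set(records, top_n=3):
    """S3 (assumed rule): per user, keep the top_n most frequently browsed categories."""
    counts = {}
    for record in filter(is_complete, records):
        counts.setdefault(record["user_id"], Counter()).update(record["browsed"])
    return {
        user: [item for item, _ in counter.most_common(top_n)]
        for user, counter in counts.items()
    }


records = [
    {"user_id": "u1", "keywords": ["graphene"], "browsed": ["patent", "paper", "patent"]},
    {"user_id": "u1", "keywords": ["battery"], "browsed": ["patent"]},
    {"user_id": "u2", "keywords": [], "browsed": ["standard"]},      # incomplete: no keywords
    {"user_id": None, "keywords": ["laser"], "browsed": ["paper"]},  # incomplete: no user id
]
features = build_feature_set(records)
print(features)
```

In this sketch the two incomplete records are discarded before aggregation, so only user u1 receives a feature set.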

Preferably, the scientific and technological information can be intellectual property, scientific and technological articles, scientific and technological projects, scientific and technological achievements, technical standards, scientific data, information and new products.

Preferably, the big data clustering model can be one of a k-means model and a MapReduce model.

Preferably, the step S5 includes the following sub-steps:

S51, preprocessing the original scientific and technological information data set;

S52, dividing the data set U into M sub-data sets and distributing the M sub-data sets to M Map functions;

S53, locally clustering the sub-data in the Map processing stage;

S54, merging the classes that have the same key/value in the Reduce processing stage;

S55, if the actual cluster number R is smaller than the cluster number k, adjusting the contraction factor parameter and clustering again until the actual cluster number R is equal to the cluster number k;

S56, if N_new > N_old or K_new > K_old, the two data sets are re-partitioned and K = [(K_new + K_old)/2]; otherwise, the center points of the K clusters obtained from the non-updated data set are taken as K points and combined with the new data source to form a new data set for segmentation, and K = K_old; where N_new and N_old respectively denote the number of data points in the new data source and in the data source before the update, and K_new and K_old respectively denote the number of center points of the new data source and of the data source before the update;

S57, repeating stages S53, S54 and S55 until the actual cluster number R is equal to the cluster number k.
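
The split, Map and Reduce stages of S52 to S54 can be sketched in plain Python as follows. The patent does not state what the Map stage emits or how keys are formed, so this sketch assumes a single k-means-style pass in which the key is the index of the nearest current center and the value is a partial (count, per-dimension sum). The function names, the round-robin split and the sample data are hypothetical, and no real MapReduce framework is used.

```python
from collections import defaultdict


def split(data, m):
    """S52: divide the data set U into m sub-data sets (round-robin, an assumption)."""
    return [data[i::m] for i in range(m)]


def map_local_cluster(chunk, centers):
    """S53 (assumed form): assign each point to its nearest center and emit
    partial statistics {center_index: (count, per-dimension sum)}."""
    partial = {}
    for point in chunk:
        key = min(
            range(len(centers)),
            key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
        )
        count, total = partial.get(key, (0, [0.0] * len(point)))
        partial[key] = (count + 1, [t + p for t, p in zip(total, point)])
    return partial


def reduce_merge(partials, centers):
    """S54 (assumed form): merge partial statistics that share a key and
    recompute each center as sum / count."""
    merged = defaultdict(lambda: [0, None])
    for partial in partials:
        for key, (count, total) in partial.items():
            entry = merged[key]
            entry[0] += count
            entry[1] = total if entry[1] is None else [a + b for a, b in zip(entry[1], total)]
    # A center that received no points keeps its previous position.
    return [
        [s / merged[i][0] for s in merged[i][1]] if i in merged else list(centers[i])
        for i in range(len(centers))
    ]


U = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(1.0, 1.0), (9.0, 9.0)]
partials = [map_local_cluster(chunk, centers) for chunk in split(U, 2)]
new_centers = reduce_merge(partials, centers)
print(new_centers)
```

Because each Map call only returns counts and sums, the Reduce stage never needs the raw points, which is what makes the per-chunk work independent and parallelizable.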

Compared with the prior art, the scientific and technological information clustering method based on big data has the following beneficial effects:

The parallel computing capability of a high-performance cloud computing cluster system is used to solve the big data processing problem faced by clustering. Taking parallel clustering as the target, a new clustering approach and an improved method are provided. The data processing cost of an enterprise is greatly reduced, and the development of cloud-based big data mining is made convenient because the underlying layer is hidden. With parallelization, cloud computing can improve the capacity and speed of large-scale data processing using existing equipment, which ensures fault tolerance and also allows nodes to be added. Cloud computing is thereby applied to cluster analysis in data mining. The recommendation of scientific and technological information no longer relies purely on text similarity, so that scientific and technological personnel can obtain more comprehensive and accurate information.

Drawings

FIG. 1 is a flow chart of a scientific and technological information clustering method based on big data according to the present invention;

FIG. 2 is a flow chart of the step of clustering a data set with the clustering model in the scientific and technological information clustering method based on big data according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

A scientific and technical information clustering method based on big data according to an embodiment of the present invention is described below with reference to fig. 1.

Further, in the above technical solution, the scientific and technological information may be intellectual property, scientific and technological papers, scientific and technological projects, scientific and technological achievements, technical standards, scientific data, information, and new products.

Further, in the above technical solution, the big data clustering model may be one of a k-means model and a MapReduce model.

Further, in the above technical solution, the step S5 includes the following sub-steps:

S51, preprocessing the original scientific and technological information data set. The basic idea is as follows. First, the whole data source is scanned for null values and missing values are filled in; each missing value is replaced by the mean of the dimension in which the null value occurs. Second, the data set is vectorized and segmented; after segmentation the data blocks are distributed to the nodes, and each node distributes them to M Map functions. A threshold T (the distance between points) and M (the minimum number of points allowed in each cluster) are set, and the c points that are farthest apart are selected as representative points for clustering. Points that satisfy the T requirement are grouped into one class and placed in a cluster; this process is repeated until no point satisfies the T requirement, and the remaining points are then grouped into one class to form a cluster. The center of each cluster is represented by the triple (N, SUM, SUMSQ), where N is the number of points in the cluster, SUM is the per-dimension sum of the vectors of all points, and SUMSQ is the per-dimension sum of squares of the components of all points. Finally, the number of points in each resulting cluster is checked: if a cluster contains fewer than M points, all points in that cluster are deleted; otherwise the clusters form the data set U and the cluster number K is obtained.

S53, locally clustering the sub-data in the Map processing stage;

S55, if the actual cluster number R is smaller than the cluster number k, adjusting the number c of representative points and the contraction factor, and clustering again until the actual cluster number R is equal to the cluster number k;
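
Three pieces of the S51 preprocessing lend themselves to a short sketch: mean imputation of missing values, the (N, SUM, SUMSQ) cluster summary, and the removal of clusters below the minimum size. The function names and sample rows are hypothetical, missing values are assumed to be encoded as None, and every dimension is assumed to contain at least one known value. The threshold-based grouping and the selection of representative points are not shown.

```python
def fill_missing_with_mean(rows):
    """Replace None in each dimension by the mean of that dimension's known values.
    Assumes every dimension has at least one known value."""
    dims = len(rows[0])
    means = []
    for d in range(dims):
        known = [row[d] for row in rows if row[d] is not None]
        means.append(sum(known) / len(known))
    return [[means[d] if row[d] is None else row[d] for d in range(dims)] for row in rows]


def cluster_feature(points):
    """Summarize a cluster as (N, SUM, SUMSQ), with SUM and SUMSQ taken per dimension."""
    n = len(points)
    total = [sum(column) for column in zip(*points)]
    total_sq = [sum(v * v for v in column) for column in zip(*points)]
    return n, total, total_sq


def drop_small_clusters(clusters, min_points):
    """Discard every cluster that holds fewer than min_points points."""
    return [cluster for cluster in clusters if len(cluster) >= min_points]


rows = [[1.0, None], [3.0, 4.0], [None, 8.0]]
filled = fill_missing_with_mean(rows)
n, total, total_sq = cluster_feature(filled)
centroid = [s / n for s in total]  # the center follows directly from SUM / N
kept = drop_small_clusters([[filled[0]], [filled[1], filled[2]]], min_points=2)
print(filled, (n, total, total_sq), centroid, kept)
```

A summary of this form is additive: the triples of two clusters can be merged by adding them component by component, which suits the Reduce-stage merging in S54.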

Big data is characterized not only by high dimensionality and massive volume but also by rapid data generation and updating. To account for these characteristics, the algorithm handles updates as follows. First, the new scientific and technological information data source is preprocessed to obtain its data set U, the number K_new of cluster center points and the number N_new of data points. Second, if the number of center points K_new of the new data source is greater than the number of clusters K_old obtained before the update, or the number of points in the new data source is greater than the number of points in the data source before the update, the new data source and the non-updated data source are segmented again as data sets; otherwise, the center points of the K clusters obtained from the non-updated data set are taken as K points and combined with the new data source to form a new data set for segmentation. Then the subsets are distributed to the child nodes, which distribute them to a plurality of Map functions for local clustering. In the first case, K is chosen as [(K_new + K_old)/2]; otherwise, K keeps its value from before the update. The preprocessing stage is then repeated.
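
The rule for choosing K after a data update can be written as a small function. The text does not say what the square brackets in [(K_new + K_old)/2] denote; this sketch assumes they mean the integer part (floor). The function name and the strategy labels are hypothetical.

```python
def choose_k(n_new, n_old, k_new, k_old):
    """Return (K, strategy) for re-clustering after an update.

    Assumption: the square brackets in [(K_new + K_old)/2] denote the
    integer part, implemented here with floor division.
    """
    if n_new > n_old or k_new > k_old:
        # The new source is larger or has more centers: re-partition both data sets.
        return (k_new + k_old) // 2, "repartition_both"
    # Otherwise reuse the old cluster centers as K points alongside the new data.
    return k_old, "old_centers_plus_new_data"


print(choose_k(n_new=1200, n_old=1000, k_new=7, k_old=4))
print(choose_k(n_new=500, n_old=1000, k_new=3, k_old=4))
```

In the second call neither condition holds, so the old cluster count is kept and only the old centers, rather than all old points, are carried into the next segmentation.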

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.

Claims

1. A scientific and technological information clustering method based on big data is characterized by comprising the following steps:

2. The scientific and technological information clustering method based on big data according to claim 1, characterized in that: the scientific and technological information can be intellectual property, scientific and technological articles, scientific and technological achievements, technical standards, scientific data, information and new products.

3. The scientific and technological information clustering method based on big data according to claim 1, characterized in that: the big data clustering model can be one of a k-means model and a MapReduce model.

4. The scientific and technological information clustering method based on big data according to claim 1, characterized in that: the step S5 includes the following sub-steps:

S53, locally clustering the sub-data in the Map processing stage;