CN111460046A - Scientific and technological information clustering method based on big data

Info

Publication number
CN111460046A
Authority
CN
China
Prior art keywords
data
clustering
scientific
information
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010150066.4A
Other languages
Chinese (zh)
Inventor
丁荣荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Haice Science And Technology Information Service Co ltd
Original Assignee
Hefei Haice Science And Technology Information Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-03-06
Filing date
2020-03-06
Publication date
2020-07-28
Application filed by Hefei Haice Science And Technology Information Service Co ltd
Priority to CN202010150066.4A
Publication of CN111460046A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a scientific and technological information clustering method based on big data, which comprises the following steps: collecting user behavior historical data; analyzing and processing user behavior characteristics; establishing a user behavior feature set; establishing a big data clustering model; clustering the data set by using the clustering model; and pushing the clustered information resources to the user. The method uses the parallel computing capability of a high-performance cloud computing cluster to solve the big data processing problem faced in clustering scientific and technological information. It facilitates cloud-based big data mining and development, takes parallel clustering as its target, hides the underlying layer from the developer, improves the capacity and speed of large-scale data processing, and applies cloud computing to cluster analysis in data mining. It avoids recommending scientific and technological information purely on the basis of text similarity, so that scientific and technical personnel can obtain more comprehensive and accurate information.

Description

Scientific and technological information clustering method based on big data
Technical Field
The invention relates to the technical field of big data processing, in particular to a scientific and technological information clustering method based on big data.
Background
Scientific and technological information is an information carrier that records scientific and technological activities and knowledge. It is a main means of recording and transmitting scientific and technical knowledge, and an important tool that helps people understand objective things, inspires ideas and provides technical support. It includes intellectual property, scientific and technological articles, scientific and technological achievements, technical standards, scientific data, information, new products and the like. As the scientific and technological level of society advances, the volume of scientific and technological information data is growing explosively. Neither the development nor the use of this data can do without the support of network technology. However, scientific and technological information on the current network is disorganized and is neither comprehensive nor accurate, so technology enterprises and scientific and technical personnel cannot easily obtain real and valuable information directly. There is a sharp contradiction between the fragmentation of scientific and technological information on the one hand and the fragmented time and the individualized, diversified information needs of scientific and technical personnel on the other. In addition, terminal equipment is shifting from the PC to handheld intelligent terminals, so these personnel demand ever more intelligent presentation and recommendation of scientific and technological information. It is therefore increasingly important to filter out useless information and to classify and refine the various kinds of scientific and technological information effectively, so that accurate, high-quality information can be recommended to technology enterprises and scientific and technical personnel.
In the prior art, patent CN201310173534.X provides a method for automatically classifying and screening scientific and technological information. It mainly addresses the problem that existing search technology is based on single words rather than on a summary of the whole page, improves information retrieval efficiency, and ensures the completeness and reliability of data capture. Patent CN201410150100.2 provides a heterogeneous data analysis method for vertical search of scientific and technological information, which mainly improves the accuracy of vertical search so that users can more easily obtain information meeting their actual needs. Although these technologies belong to closely related fields and each design has its own characteristics, both methods are designed for searching scientific and technological information. Neither is designed for big data processing of scientific and technological information, and neither meets the requirements of intelligent clustering-based recommendation. At present, recommendation of scientific and technological information is still based on text similarity, whereas scientific and technical personnel hope to obtain more comprehensive and accurate information, so the recommendation results are unsatisfactory.
Therefore, it is necessary to invent a scientific and technological information clustering method based on big data to solve the above problems.
Disclosure of Invention
The invention aims to provide a scientific and technological information clustering method based on big data to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme:
A scientific and technological information clustering method based on big data, characterized by comprising the following steps:
S1, collecting user behavior historical data: the client collects user data and uploads it to the cloud server; the collected data comprises the keywords input by the user, the user's browsing behavior, and basic personal information;
S2, analyzing and processing user behavior characteristics: preprocessing and aggregating the user data, filtering out incomplete data and useless junk data, and storing the useful data with complete behavior characteristics in the big database;
S3, establishing a user behavior feature set: the system analysis module analyzes the user behavior, extracts the information that the user browses frequently, integrates the user's basic information, and establishes the user behavior feature set;
S4, establishing a big data clustering model: performing in-depth analysis of the data in the user data center by using a deep learning algorithm, a machine learning algorithm and a semantic analysis algorithm, and establishing the big data clustering model;
S5, clustering the data set by using the clustering model: retrieving from the big database the scientific and technological information resources related to the user behavior analyzed by the analysis module, and locally clustering the sub-data;
S6, pushing scientific and technological information: the data pushing module pushes the locally clustered information resources to the user.
Preferably, the scientific and technological information can be intellectual property, scientific and technological articles, scientific and technological projects, scientific and technological achievements, technical standards, scientific data, information and new products.
Preferably, the big data clustering model can be one of a k-means model and a MapReduce model.
Preferably, the step S5 includes the following sub-steps:
S51, preprocessing the original scientific and technological information data set;
S52, dividing the data set U into M sub-data sets and distributing them to M Map functions;
S53, locally clustering the sub-data during Map processing;
S54, merging the classes that have the same key/value during Reduce processing;
S55, if the actual number of clusters R is smaller than the number of clusters k, adjusting the contraction factor parameter and clustering again until R equals k;
S56, if $N_{\text{new}} > N_{\text{old}}$ or $K_{\text{new}} > K_{\text{old}}$, re-partitioning the two data sets with $K = [(K_{\text{new}} + K_{\text{old}})/2]$; otherwise, using the center points of the K clusters obtained from the non-updated data set as K points that, together with the new data source, form a new data set for segmentation, with $K = K_{\text{old}}$; wherein $N_{\text{new}}$ and $N_{\text{old}}$ respectively denote the number of data points in the new data source and in the data source before updating, and $K_{\text{new}}$ and $K_{\text{old}}$ respectively denote the number of center points of the new data source and of the data source before updating;
S57, repeating steps S53, S54 and S55 until the actual number of clusters R equals the number of clusters k.
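Sub-steps S52 to S54 can be illustrated with a minimal single-process sketch. It is not the patented implementation: the function names, the round-robin split, and the use of a nearest-center assignment as the "local clustering" are all assumptions made for illustration, and a real deployment would run the map and reduce phases on a MapReduce framework.

```python
from collections import defaultdict

def split(data, m):
    """S52: divide the data set U into m sub-data sets (round-robin, an assumption)."""
    return [data[i::m] for i in range(m)]

def map_local_cluster(subset, centers):
    """S53: assign each point to its nearest center and emit (key, (count, sum))."""
    partial = {}
    for point in subset:
        # The key is the index of the nearest center (squared Euclidean distance).
        key = min(range(len(centers)),
                  key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))
        count, total = partial.get(key, (0, [0.0] * len(point)))
        partial[key] = (count + 1, [t + p for t, p in zip(total, point)])
    return list(partial.items())

def reduce_merge(emitted):
    """S54: merge partial clusters that share the same key and return new centers."""
    merged = defaultdict(lambda: [0, None])
    for key, (count, total) in emitted:
        entry = merged[key]
        entry[0] += count
        entry[1] = total if entry[1] is None else [a + b for a, b in zip(entry[1], total)]
    return {key: [s / n for s in total] for key, (n, total) in merged.items()}

# Hypothetical toy data: two well-separated groups and two starting centers.
U = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
emitted = [pair for subset in split(U, 2) for pair in map_local_cluster(subset, centers)]
new_centers = reduce_merge(emitted)
```

Because each Map call emits only a count and a per-dimension sum for each key, the Reduce phase can merge partial clusters without seeing the individual points again.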
Compared with the prior art, the scientific and technological information clustering method based on big data has the following beneficial effects:
The parallel computing capability of a high-performance cloud computing cluster is used to solve the big data processing problem faced by clustering. Taking parallel clustering as the target, a new clustering approach and an improved method are provided. The data processing cost of an enterprise is greatly reduced, and cloud-based big data mining and development is made convenient because the underlying layer is hidden from the developer. Under parallelization, cloud computing can improve the capacity and speed of large-scale data processing on existing equipment, which preserves fault tolerance while allowing nodes to be added. Cloud computing is thus applied to cluster analysis in data mining. The recommendation of scientific and technological information no longer relies solely on text similarity, so scientific and technical personnel can obtain more comprehensive and accurate information.
Drawings
FIG. 1 is a flow chart of a scientific and technological information clustering method based on big data according to the present invention;
FIG. 2 is a flow chart of the step of clustering a data set by using the clustering model in the scientific and technological information clustering method based on big data according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A scientific and technical information clustering method based on big data according to an embodiment of the present invention is described below with reference to fig. 1.
A scientific and technological information clustering method based on big data, characterized by comprising the following steps:
S1, collecting user behavior historical data: the client collects user data and uploads it to the cloud server; the collected data comprises the keywords input by the user, the user's browsing behavior, and basic personal information;
S2, analyzing and processing user behavior characteristics: preprocessing and aggregating the user data, filtering out incomplete data and useless junk data, and storing the useful data with complete behavior characteristics in the big database;
S3, establishing a user behavior feature set: the system analysis module analyzes the user behavior, extracts the information that the user browses frequently, integrates the user's basic information, and establishes the user behavior feature set;
S4, establishing a big data clustering model: performing in-depth analysis of the data in the user data center by using a deep learning algorithm, a machine learning algorithm and a semantic analysis algorithm, and establishing the big data clustering model;
S5, clustering the data set by using the clustering model: retrieving from the big database the scientific and technological information resources related to the user behavior analyzed by the analysis module, and locally clustering the sub-data;
S6, pushing scientific and technological information: the data pushing module pushes the locally clustered information resources to the user.
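Steps S2 and S3 can be sketched as follows. This is only an illustration under stated assumptions: the record schema (`user_id`, `keywords`, `browsed`), the function names, and the `min_count` threshold for "frequently browsed" are hypothetical, since the description does not specify them.

```python
from collections import Counter

# Hypothetical record schema; the description does not name concrete fields.
REQUIRED_FIELDS = ("user_id", "keywords", "browsed")

def filter_complete(records):
    """S2: keep only records in which every required field is present and non-empty."""
    return [r for r in records if all(r.get(f) for f in REQUIRED_FIELDS)]

def build_feature_set(records, profile, min_count=2):
    """S3: combine basic user information with items browsed at least min_count times."""
    counts = Counter(item for r in records for item in r["browsed"])
    frequent = sorted(item for item, n in counts.items() if n >= min_count)
    keywords = sorted({k for r in records for k in r["keywords"]})
    return {"profile": profile, "keywords": keywords, "frequent_items": frequent}

records = [
    {"user_id": "u1", "keywords": ["clustering"], "browsed": ["patent", "paper"]},
    {"user_id": "u1", "keywords": ["mapreduce"], "browsed": ["patent"]},
    {"user_id": "u1", "keywords": [], "browsed": ["standard"]},  # incomplete: no keywords
]
clean = filter_complete(records)
features = build_feature_set(clean, {"field": "data mining"})
```

Here the third record is discarded as incomplete, and only items that appear in at least two of the remaining records count as frequently browsed.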
Further, in the above technical solution, the scientific and technological information may be intellectual property, scientific and technological papers, scientific and technological projects, scientific and technological achievements, technical standards, scientific data, information, and new products.
Further, in the above technical solution, the big data clustering model may be one of a k-means model and a MapReduce model.
Further, in the above technical solution, the step S5 includes the following sub-steps:
S51, preprocessing the original scientific and technological information data set. The basic idea is as follows. First, the whole data source is scanned for null values, and each missing value is filled with the mean of the dimension in which the null value occurs. Second, the data set is vectorized and partitioned; the resulting data blocks are distributed to the nodes, and each node distributes them to M Map functions. A threshold T (on the distance between points) and a minimum cluster size M (the smallest number of points allowed in a cluster) are set, and the c points that are farthest apart are selected as representative points for clustering. Points that satisfy the threshold T are grouped into one class and placed in a cluster, and this process is repeated until no remaining point satisfies T; the remaining points are then grouped into one class to form a cluster. The center of each cluster is represented by the triple (N, SUM, SUMSQ), where N is the number of points in the cluster, SUM is the vector sum of all points in each dimension, and SUMSQ is the sum of the squared components of all points in each dimension. Finally, the number of points in each resulting cluster is checked: if it is less than M, all points in that cluster are deleted; otherwise the cluster is retained. The retained clusters form the data set U, and the number of clusters K is obtained;
S52, dividing the data set U into M sub-data sets and distributing them to M Map functions;
S53, locally clustering the sub-data during Map processing;
S54, merging the classes that have the same key/value during Reduce processing;
S55, if the actual number of clusters R is smaller than the number of clusters k, adjusting the number c of representative points and the contraction factor, and clustering again until R equals k;
S56, if $N_{\text{new}} > N_{\text{old}}$ or $K_{\text{new}} > K_{\text{old}}$, re-partitioning the two data sets with $K = [(K_{\text{new}} + K_{\text{old}})/2]$; otherwise, using the center points of the K clusters obtained from the non-updated data set as K points that, together with the new data source, form a new data set for segmentation, with $K = K_{\text{old}}$; wherein $N_{\text{new}}$ and $N_{\text{old}}$ respectively denote the number of data points in the new data source and in the data source before updating, and $K_{\text{new}}$ and $K_{\text{old}}$ respectively denote the number of center points of the new data source and of the data source before updating;
S57, repeating steps S53, S54 and S55 until the actual number of clusters R equals the number of clusters k.
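Two parts of the S51 preprocessing lend themselves to a short sketch: filling missing values with the per-dimension mean, and summarizing a cluster as (N, SUM, SUMSQ). The function names are hypothetical, missing values are assumed to be encoded as `None`, and every dimension is assumed to contain at least one observed value. The threshold-based grouping with T and the selection of c representative points are not reproduced here.

```python
def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that dimension."""
    dims = len(rows[0])
    means = []
    for d in range(dims):
        present = [row[d] for row in rows if row[d] is not None]
        means.append(sum(present) / len(present))  # assumes at least one observed value
    return [[means[d] if row[d] is None else row[d] for d in range(dims)] for row in rows]

def summarize(points):
    """Represent a cluster by (N, SUM, SUMSQ), with SUM and SUMSQ kept per dimension."""
    dims = len(points[0])
    n = len(points)
    total = [sum(p[d] for p in points) for d in range(dims)]
    sumsq = [sum(p[d] ** 2 for p in points) for d in range(dims)]
    return n, total, sumsq

def centroid_and_variance(n, total, sumsq):
    """Recover the centroid and per-dimension variance from the summary alone."""
    centroid = [s / n for s in total]
    variance = [q / n - c ** 2 for q, c in zip(sumsq, centroid)]
    return centroid, variance

def drop_small_clusters(clusters, min_size):
    """Discard clusters holding fewer than min_size points, as in the final check of S51."""
    return [c for c in clusters if len(c) >= min_size]

rows = impute_column_means([[1.0, None], [3.0, 4.0], [None, 8.0]])
n, total, sumsq = summarize(rows)
centroid, variance = centroid_and_variance(n, total, sumsq)
```

One reason to store (N, SUM, SUMSQ) rather than the raw points is that two summaries can be merged by element-wise addition, which suits the Reduce phase, while the centroid SUM/N and the variance SUMSQ/N - (SUM/N)^2 remain recoverable.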
Big data is characterized not only by high dimensionality and massive volume, but also by rapid data generation and updating. The algorithm therefore handles updates as follows. First, the new scientific and technological information data source is preprocessed to obtain its data set U, the number of cluster center points $K_{\text{new}}$, and the total number of data points $N_{\text{new}}$. Second, if the number of centers of the new data source is larger than the number of clusters obtained before updating, or the number of points in the new data source is larger than the number of points in the data source before updating, the new data source and the non-updated data source are partitioned again as data sets; otherwise, the center points of the K clusters obtained from the non-updated data set are used as K points that, together with the new data source, form a new data set for segmentation. The subsets are then distributed to the child nodes and to a plurality of Map functions for local clustering. In the first case, K is set to $[(K_{\text{new}} + K_{\text{old}})/2]$; otherwise K keeps its value from before the update. The preprocessing stage is then repeated.
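The update rule for K can be written as a small function. This is a sketch under one explicit assumption: the square brackets in the formula are read as taking the integer part, which the text does not state. The names `choose_k` and `build_update_set` are hypothetical.

```python
def choose_k(n_new, n_old, k_new, k_old):
    """Return (repartition, k) for an updated data source.

    repartition is True when the old and new data sets must be partitioned again.
    Assumption: [x] in the source formula is read as the integer part of x.
    """
    if n_new > n_old or k_new > k_old:
        return True, (k_new + k_old) // 2
    return False, k_old

def build_update_set(old_centers, new_points):
    """Second case: the old cluster centers stand in for the old data set."""
    return list(old_centers) + list(new_points)

# A new source with more points than the old one triggers re-partitioning.
repartition, k = choose_k(n_new=500, n_old=300, k_new=6, k_old=4)
```

In the second case only the K old center points are carried forward, so the cost of an update depends on the size of the new data source rather than on the full history.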
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.

Claims (4)

1. A scientific and technological information clustering method based on big data, characterized by comprising the following steps:
S1, collecting user behavior historical data: the client collects user data and uploads it to the cloud server; the collected data comprises the keywords input by the user, the user's browsing behavior, and basic personal information;
S2, analyzing and processing user behavior characteristics: preprocessing and aggregating the user data, filtering out incomplete data and useless junk data, and storing the useful data with complete behavior characteristics in the big database;
S3, establishing a user behavior feature set: the system analysis module analyzes the user behavior, extracts the information that the user browses frequently, integrates the user's basic information, and establishes the user behavior feature set;
S4, establishing a big data clustering model: performing in-depth analysis of the data in the user data center by using a deep learning algorithm, a machine learning algorithm and a semantic analysis algorithm, and establishing the big data clustering model;
S5, clustering the data set by using the clustering model: retrieving from the big database the scientific and technological information resources related to the user behavior analyzed by the analysis module, and locally clustering the sub-data;
S6, pushing scientific and technological information: the data pushing module pushes the locally clustered information resources to the user.
2. The scientific and technological information clustering method based on big data according to claim 1, characterized in that: the scientific and technological information can be intellectual property, scientific and technological articles, scientific and technological achievements, technical standards, scientific data, information and new products.
3. The scientific and technological information clustering method based on big data according to claim 1, characterized in that: the big data clustering model can be one of a k-means model and a MapReduce model.
4. The scientific and technological information clustering method based on big data according to claim 1, characterized in that: the step S5 includes the following sub-steps:
S51, preprocessing the original scientific and technological information data set;
S52, dividing the data set U into M sub-data sets and distributing them to M Map functions;
S53, locally clustering the sub-data during Map processing;
S54, merging the classes that have the same key/value during Reduce processing;
S55, if the actual number of clusters R is smaller than the number of clusters k, adjusting the contraction factor parameter and clustering again until R equals k;
S56, if $N_{\text{new}} > N_{\text{old}}$ or $K_{\text{new}} > K_{\text{old}}$, re-partitioning the two data sets with $K = [(K_{\text{new}} + K_{\text{old}})/2]$; otherwise, using the center points of the K clusters obtained from the non-updated data set as K points that, together with the new data source, form a new data set for segmentation, with $K = K_{\text{old}}$; wherein $N_{\text{new}}$ and $N_{\text{old}}$ respectively denote the number of data points in the new data source and in the data source before updating, and $K_{\text{new}}$ and $K_{\text{old}}$ respectively denote the number of center points of the new data source and of the data source before updating;
S57, repeating steps S53, S54 and S55 until the actual number of clusters R equals the number of clusters k.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010150066.4A CN111460046A (en) 2020-03-06 2020-03-06 Scientific and technological information clustering method based on big data

Publications (1)

Publication Number Publication Date
CN111460046A true CN111460046A (en) 2020-07-28

Family

ID=71682677

Country Status (1)

Country Link
CN (1) CN111460046A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113114661A (en) * 2021-04-08 2021-07-13 湘潭大学 Cloud-edge collaborative lightweight data processing method for intelligent building Internet of things equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103235827A (en) * 2013-05-13 2013-08-07 济南政和科技有限公司 Method for automatically classifying and screening scientific and technological information
CN103838863A (en) * 2014-03-14 2014-06-04 内蒙古科技大学 Big-data clustering algorithm based on cloud computing platform
CN103984700A (en) * 2014-04-15 2014-08-13 厦门产业技术研究院 Heterogeneous data analysis method for vertical search of scientific information
WO2018137104A1 (en) * 2017-01-24 2018-08-02 深圳企管加企业服务有限公司 User behavior analysis method and system based on big data mining
CN108363804A (en) * 2018-03-01 2018-08-03 浙江工业大学 Partial model Weighted Fusion Top-N films based on user clustering recommend method
CN109636495A (en) * 2018-09-21 2019-04-16 闽南理工学院 A kind of online recommended method of scientific and technological information based on big data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Pan (张攀): "Research on user behavior of 'Sciencepaper Online' based on historical context mining", China Master's Theses Full-text Database *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200728
