CN110377736A - An information clustering method based on the R language

Info

Publication number
CN110377736A
Authority
CN
China
Prior art keywords
information
language
data
input data
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910587823.1A
Other languages
Chinese (zh)
Inventor
刘家祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Knight Source Information Technology Co Ltd
Original Assignee
Xiamen Knight Source Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Knight Source Information Technology Co Ltd
Priority to CN201910587823.1A
Publication of CN110377736A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Abstract

An information clustering method based on the R language comprises the following steps: S1, compiling statistics on the information and establishing an information type library; S2, obtaining the information data to be clustered to form an input data sample; S3, preprocessing the input data sample to form the feature item set of the input data sample; S4, building an R language server; S5, importing the data in the information type library into the memory of the R language server; S6, inputting the feature item set of the input data sample produced in step S3 into the R language server for cluster analysis; S7, having the R language server analyze the feature items of the input data sample and compare them against the information type library to obtain the information clustering result. The invention clusters information efficiently and with good accuracy.

Description

An information clustering method based on the R language
Technical field
The present invention relates to the technical field of information clustering, and in particular to an information clustering method based on the R language.
Background
R is a complete software system for data processing, computation, and graphics. Its functions include: a data storage and processing system; array operation tools (particularly powerful for vector and matrix operations); a complete and coherent set of statistical analysis tools; excellent statistical graphics; and a simple yet powerful programming language that can manipulate data input and output, supports branching and loops, and lets users define their own functions.
In practice, different pieces of information need to be clustered. The volume of information is huge, and identical or similar information is presented differently in different regions (for example, titles or wording differ from one another), which prevents information work from being carried out accurately and quickly. The information therefore needs to be clustered so that work can proceed in an orderly way. Current information clustering methods are inefficient and their clustering accuracy is low, so computational errors easily disrupt normal work.
To solve the above problems, the present application proposes an information clustering method based on the R language.
Summary of the invention
(1) Object of the invention
To solve the technical problems described in the background, the present invention proposes an information clustering method based on the R language that clusters information efficiently and with good accuracy.
(2) Technical solution
To solve the above problems, the present invention provides an information clustering method based on the R language, the method comprising the following steps:
S1, compiling statistics on the information and establishing an information type library;
S2, obtaining the information data to be clustered to form an input data sample;
S3, preprocessing the input data sample to form the feature item set of the input data sample;
S4, building an R language server;
S5, importing the data in the information type library into the memory of the R language server;
S6, inputting the feature item set of the input data sample produced in step S3 into the R language server for cluster analysis;
S7, having the R language server analyze the feature items of the input data sample and compare them against the information type library to obtain the information clustering result.
Preferably, the information type library established in step S1 is managed.
Preferably, managing the information type library includes adding new information types in real time and deleting obsolete, abandoned information types.
Preferably, the information data to be clustered that is obtained in step S2 is information data from a period of historical time.
Preferably, the preprocessing of the input data sample in step S3 is word segmentation. The word segmentation includes: when a symbol, English word and/or number is detected in the sample information, judging the relevance of that symbol, English word and/or number to the sample information; and when the relevance is judged to be lower than a specified value, deleting the symbol, English word and/or number.
Preferably, preprocessing the input data sample in step S3 to form the feature item set further includes detecting whether the words produced by word segmentation are identical to words in a preset stop-word table; when a word produced by word segmentation is identical to a word in the preset stop-word table, that word is deleted from the segmentation result.
Preferably, in step S5, importing the data in the information type library into the memory of the R language server specifically comprises writing an R script for reading the data and invoking it through the shell to load the specified information type library data into the memory of the R language server.
Preferably, step S5 further includes a data update step, specifically: for data without strict real-time requirements, a scheduled task is set up that triggers a data update operation at a specified time interval and loads the updated data from the information type library into the memory of the R language server; for data with strict real-time requirements, a daemon process is written that monitors in real time the update status of specified tables in the information type library and synchronously loads the updated data into the memory of the R language server.
The above technical solution has the following beneficial effects. Compiling statistics on the information and establishing an information type library provides a basis for lookup and reference during clustering, which improves clustering efficiency and accuracy. Collecting the information data to be clustered into an input data sample makes it convenient to cluster the information with R. Preprocessing the input data sample removes data that is irrelevant to clustering, refining the sample and increasing clustering speed. Building an R language server allows the input data sample to be clustered with R. Importing the data in the information type library into the memory of the R language server, inputting the feature item set of the input data sample into the server, and having the server compare the sample against the information type library yields the clustering result; this supports lookup and reference while clustering with R and improves clustering efficiency and accuracy.
Brief description of the drawings
Fig. 1 is a schematic diagram of the information clustering method based on the R language proposed by the present invention.
Detailed description of embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawing. It should be understood that these descriptions are merely exemplary and are not intended to limit the scope of the invention. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the invention.
As shown in Fig. 1, the information clustering method based on the R language proposed by the present invention comprises the following steps:
S1, compiling statistics on the information and establishing an information type library;
S2, obtaining the information data to be clustered to form an input data sample;
S3, preprocessing the input data sample to form the feature item set of the input data sample;
S4, building an R language server;
S5, importing the data in the information type library into the memory of the R language server;
S6, inputting the feature item set of the input data sample produced in step S3 into the R language server for cluster analysis;
S7, having the R language server analyze the feature items of the input data sample and compare them against the information type library to obtain the information clustering result.
In the present invention, compiling statistics on the information and establishing an information type library provides a basis for lookup and reference during clustering, which improves clustering efficiency and accuracy. Collecting the information data to be clustered into an input data sample makes it convenient to cluster the information with R. Preprocessing the input data sample removes data that is irrelevant to clustering, refining the sample and increasing clustering speed. Building an R language server allows the input data sample to be clustered with R. The data in the information type library is imported into the memory of the R language server, the feature item set of the input data sample is input into the server, and the server compares the sample against the information type library to obtain the clustering result; this supports lookup and reference while clustering with R and improves clustering efficiency and accuracy.
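The comparison in steps S6 and S7 can be illustrated with a minimal sketch. The patent names R but gives no code and does not specify how feature items are compared against the type library, so the example below is written in Python purely to show the logic. The Jaccard similarity measure, the function names (`jaccard`, `assign_type`, `cluster`) and the sample type names are all assumptions made for illustration, not the patented method.

```python
# Illustrative sketch only: the similarity measure and all names are assumptions.

def jaccard(a, b):
    """Jaccard similarity between two sets of feature items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def assign_type(features, type_library):
    """Return (type name, score) for the best-matching information type.

    Returns (None, 0.0) when no type shares any feature item with the sample.
    """
    best_name, best_score = None, 0.0
    for name, keywords in type_library.items():
        score = jaccard(features, keywords)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


def cluster(samples, type_library):
    """Group sample ids by the information type they match best."""
    clusters = {}
    for sample_id, features in samples.items():
        name, _ = assign_type(features, type_library)
        clusters.setdefault(name, []).append(sample_id)
    return clusters


# Hypothetical type library (S1/S5) and feature item sets (S3).
type_library = {
    "traffic": {"road", "accident", "vehicle"},
    "noise": {"noise", "construction", "night"},
}
samples = {
    "a": ["road", "accident"],
    "b": ["night", "noise", "loud"],
    "c": ["weather"],
}
result = cluster(samples, type_library)
```

Samples that match no type are grouped under `None` here, which keeps them visible for later review; a real system might instead route them to a step that proposes new information types.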
In an optional embodiment, the information type library established in step S1 is managed.
It should be noted that managing the information type library effectively ensures the accuracy of its data and thereby improves the accuracy of information clustering.
In an optional embodiment, managing the information type library includes adding new information types in real time and deleting obsolete, abandoned information types.
It should be noted that the information type library is updated promptly as the information itself changes: new information types are added and abandoned information types are removed in time. This ensures the accuracy and timeliness of the information type library data and thereby improves the accuracy of information clustering.
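A minimal sketch of this kind of type library management follows, assuming an in-memory mapping from type name to keyword set. The class name `TypeLibrary` and its methods are hypothetical; the patent does not describe the library's storage format.

```python
# Illustrative sketch only: the class and method names are assumptions.

class TypeLibrary:
    """In-memory information type library keyed by type name."""

    def __init__(self):
        self._types = {}

    def add_type(self, name, keywords):
        """Add a new information type, or replace an existing one."""
        self._types[name] = set(keywords)

    def remove_type(self, name):
        """Delete an obsolete type; return True if it existed."""
        return self._types.pop(name, None) is not None

    def names(self):
        """Return the current type names in sorted order."""
        return sorted(self._types)


library = TypeLibrary()
library.add_type("traffic", ["road", "accident"])
library.add_type("fax service", ["fax"])
removed = library.remove_type("fax service")  # type no longer in use
```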
In an optional embodiment, the information data to be clustered that is obtained in step S2 is information data from a period of historical time.
It should be noted that the information to be clustered is usually information data from a period of historical time. Information data from too long ago has little value for clustering, and collecting information data over a long period causes the data to become stale. The clustering is therefore bounded by a period of historical time, and clustering is performed repeatedly to ensure the timeliness and accuracy of the information data.
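Restricting the input to a period of historical time could look like the sketch below. The record layout (a dict with a `time` field), the function name `within_window` and the 30-day window are assumptions; the patent does not state how long the period is.

```python
# Illustrative sketch only: record layout and window length are assumptions.
from datetime import datetime, timedelta


def within_window(records, end, days):
    """Keep records whose timestamp lies in the `days` before `end` (inclusive)."""
    start = end - timedelta(days=days)
    return [r for r in records if start <= r["time"] <= end]


records = [
    {"id": 1, "time": datetime(2019, 6, 20)},
    {"id": 2, "time": datetime(2019, 1, 5)},   # too old for a 30-day window
    {"id": 3, "time": datetime(2019, 7, 1)},
]
recent = within_window(records, end=datetime(2019, 7, 2), days=30)
```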
In an optional embodiment, the preprocessing of the input data sample in step S3 is word segmentation. The word segmentation includes: when a symbol, English word and/or number is detected in the sample information, judging the relevance of that symbol, English word and/or number to the sample information; and when the relevance is judged to be lower than a specified value, deleting the symbol, English word and/or number.
It should be noted that the input data sample is first segmented into words, and the resulting data feature items are then identified and filtered. This effectively removes data in the input data sample that is irrelevant to the clustering, refining the sample so that the data can be clustered in subsequent steps.
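The filtering rule can be sketched as follows. The patent does not say how relevance is computed, so the example takes the relevance function as a caller-supplied parameter; the names `is_candidate` and `filter_tokens`, the regular expressions, and the threshold of 0.5 are all assumptions for illustration. Word segmentation itself is assumed to have already produced the token list.

```python
# Illustrative sketch only: the relevance function and threshold are assumptions.
import re

_ENGLISH = re.compile(r"[A-Za-z]+")
_NUMBER = re.compile(r"\d+(\.\d+)?")


def is_candidate(token):
    """True if the token is an English word, a number, or made only of symbols."""
    if _ENGLISH.fullmatch(token) or _NUMBER.fullmatch(token):
        return True
    return bool(token) and not any(ch.isalnum() for ch in token)


def filter_tokens(tokens, relevance, threshold):
    """Drop symbols, English words and numbers whose relevance is below threshold."""
    kept = []
    for token in tokens:
        if is_candidate(token) and relevance(token) < threshold:
            continue  # low-relevance symbol, English word or number
        kept.append(token)
    return kept


# Hypothetical relevance scores: only "GPS" is considered relevant to the sample.
scores = {"GPS": 0.8}
tokens = ["道路", "#", "GPS", "2019", "施工"]
filtered = filter_tokens(tokens, lambda t: scores.get(t, 0.0), threshold=0.5)
```

Chinese tokens are never candidates for removal in this sketch, because `str.isalnum` is true for CJK characters and they match neither regular expression.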
In an optional embodiment, preprocessing the input data sample in step S3 to form the feature item set further includes detecting whether the words produced by word segmentation are identical to words in a preset stop-word table; when a word produced by word segmentation is identical to a word in the preset stop-word table, that word is deleted from the segmentation result.
It should be noted that the segmentation result usually contains several meaningless words (common function words, for example). These words do not help the result and also consume a large amount of computing and storage resources, so they need to be filtered out before computation. In addition, in actual operation there are also words that interfere with normal clustering; these words can likewise be collected in the preset stop-word table. When such a word is found in the sample information, it is deleted from that sample information.
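Stop-word filtering after segmentation reduces to a set membership test, as in this minimal sketch. The function name `remove_stop_words` and the example stop words are assumptions; the patent does not list the contents of its stop-word table.

```python
# Illustrative sketch only: the stop-word table contents are assumptions.

def remove_stop_words(tokens, stop_words):
    """Return the tokens that do not appear in the stop-word table."""
    stop = set(stop_words)  # set lookup keeps the filter linear in len(tokens)
    return [t for t in tokens if t not in stop]


stop_table = ["的", "了"]
features = remove_stop_words(["道路", "的", "施工", "了"], stop_table)
```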
In an optional embodiment, importing the data in the information type library into the memory of the R language server in step S5 specifically comprises writing an R script for reading the data and invoking it through the shell to load the specified information type library data into the memory of the R language server.
It should be noted that loading the information type library into the R language server provides a basis for lookup and reference during clustering and improves clustering efficiency and accuracy.
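Invoking a loader script through the shell might be sketched as below. `Rscript` is the standard command-line front end for running R scripts, but the script name `load_types.R`, the data path, and the helper names are hypothetical. The patent also does not explain how the R language server keeps the loaded data resident in memory, so this sketch covers only the invocation step.

```python
# Illustrative sketch only: script name, data path and helpers are assumptions.
import subprocess


def build_load_command(script_path, source_path):
    """Build the argument list that runs the R loader script on one data source."""
    return ["Rscript", script_path, source_path]


def load_type_library(script_path, source_path, runner=subprocess.run):
    """Run the loader script; return True if it exited with status 0."""
    command = build_load_command(script_path, source_path)
    result = runner(command, capture_output=True, text=True)
    return result.returncode == 0


command = build_load_command("load_types.R", "types.csv")
```

Passing the command as a list avoids shell quoting problems, and the injectable `runner` parameter lets the function be exercised without R installed.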
In an optional embodiment, step S5 further includes a data update step, specifically: for data without strict real-time requirements, a scheduled task is set up that triggers a data update operation at a specified time interval and loads the updated data from the information type library into the memory of the R language server; for data with strict real-time requirements, a daemon process is written that monitors in real time the update status of specified tables in the information type library and synchronously loads the updated data into the memory of the R language server.
It should be noted that monitoring the information type library in real time ensures the timeliness and accuracy of its data and thereby improves the accuracy of information clustering.
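Both update modes can share one incremental load routine, as in the sketch below: a scheduled task would call it at a fixed interval, while a daemon would call it in a tight monitoring loop. The class name `TypeCache`, the row layout and the `updated_at` version field are assumptions; the patent does not describe how changes to the specified tables are detected.

```python
# Illustrative sketch only: row layout and change detection are assumptions.

class TypeCache:
    """In-memory copy of type library rows, refreshed incrementally."""

    def __init__(self):
        self.rows = {}
        self.last_sync = 0

    def sync(self, source_rows):
        """Load rows changed since the last sync; return how many were loaded."""
        changed = [r for r in source_rows if r["updated_at"] > self.last_sync]
        for row in changed:
            self.rows[row["id"]] = row
        if changed:
            self.last_sync = max(r["updated_at"] for r in changed)
        return len(changed)


source = [
    {"id": 1, "updated_at": 10, "name": "traffic"},
    {"id": 2, "updated_at": 20, "name": "noise"},
]
cache = TypeCache()
first = cache.sync(source)    # initial load picks up both rows
second = cache.sync(source)   # nothing changed since the first sync
source[0] = {"id": 1, "updated_at": 30, "name": "road traffic"}
third = cache.sync(source)    # only the modified row is reloaded
```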
It should be understood that the above specific embodiments are only used to illustrate or explain the principles of the present invention and do not limit it. Therefore, any modification, equivalent replacement or improvement made without departing from the spirit and scope of the present invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all variations and modifications that fall within the scope and boundaries of the claims, or within equivalents of such scope and boundaries.

Claims (8)

1. An information clustering method based on the R language, characterized in that the method comprises the following steps:
S1, compiling statistics on the information and establishing an information type library;
S2, obtaining the information data to be clustered to form an input data sample;
S3, preprocessing the input data sample to form the feature item set of the input data sample;
S4, building an R language server;
S5, importing the data in the information type library into the memory of the R language server;
S6, inputting the feature item set of the input data sample produced in step S3 into the R language server for cluster analysis;
S7, having the R language server analyze the feature items of the input data sample and compare them against the information type library to obtain the information clustering result.
2. The information clustering method based on the R language according to claim 1, characterized in that the information type library established in step S1 is managed.
3. The information clustering method based on the R language according to claim 2, characterized in that managing the information type library includes adding new information types in real time and deleting obsolete, abandoned information types.
4. The information clustering method based on the R language according to claim 1, characterized in that the information data to be clustered that is obtained in step S2 is information data from a period of historical time.
5. The information clustering method based on the R language according to claim 1, characterized in that the preprocessing of the input data sample in step S3 is word segmentation, the word segmentation including: when a symbol, English word and/or number is detected in the sample information, judging the relevance of that symbol, English word and/or number to the sample information; and when the relevance is judged to be lower than a specified value, deleting the symbol, English word and/or number.
6. The information clustering method based on the R language according to claim 1, characterized in that preprocessing the input data sample in step S3 to form the feature item set further includes detecting whether the words produced by word segmentation are identical to words in a preset stop-word table; when a word produced by word segmentation is identical to a word in the preset stop-word table, that word is deleted from the segmentation result.
7. The information clustering method based on the R language according to claim 1, characterized in that, in step S5, importing the data in the information type library into the memory of the R language server specifically comprises writing an R script for reading the data and invoking it through the shell to load the specified information type library data into the memory of the R language server.
8. The information clustering method based on the R language according to claim 1, characterized in that step S5 further includes a data update step, specifically: for data without strict real-time requirements, a scheduled task is set up that triggers a data update operation at a specified time interval and loads the updated data from the information type library into the memory of the R language server; for data with strict real-time requirements, a daemon process is written that monitors in real time the update status of specified tables in the information type library and synchronously loads the updated data into the memory of the R language server.
CN201910587823.1A, priority date 2019-07-02, filing date 2019-07-02: An information clustering method based on the R language. Status: Pending. Published as CN110377736A (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910587823.1A (CN110377736A, en) | 2019-07-02 | 2019-07-02 | An information clustering method based on the R language

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201910587823.1A (CN110377736A, en) | 2019-07-02 | 2019-07-02 | An information clustering method based on the R language

Publications (1)

Publication Number | Publication Date
CN110377736A (en) | 2019-10-25

Family

ID=68251533

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201910587823.1A (Pending; CN110377736A, en) | An information clustering method based on the R language | 2019-07-02 | 2019-07-02

Country Status (1)

Country | Link
CN (1) | CN110377736A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104573333A (en) * | 2014-12-22 | 2015-04-29 | Yangtze University (长江大学) | Method for optimizing of model selection based on clustering analysis
CN106951498A (en) * | 2017-03-15 | 2017-07-14 | 国信优易数据有限公司 | Text clustering method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LITTLE_ROOKIE: "K-means算法原理" (Principles of the K-means algorithm), 《博客园》 (cnblogs) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2019-10-25)