CN110377736A

CN110377736A - An information clustering method based on the R language

Info

Publication number: CN110377736A
Application number: CN201910587823.1A
Authority: CN
Inventors: 刘家祥
Original assignee: Xiamen Knight Source Information Technology Co Ltd
Current assignee: Xiamen Knight Source Information Technology Co Ltd
Priority date: 2019-07-02
Filing date: 2019-07-02
Publication date: 2019-10-25

Abstract

An information clustering method based on the R language, comprising the following steps: S1, compiling statistics on information and establishing an information type library; S2, obtaining the information data to be clustered to form an input data sample; S3, preprocessing the input data sample to form the feature item set of the input data sample; S4, setting up an R language server; S5, importing the data in the information type library into the memory of the R language server; S6, inputting the feature item set of the input data sample produced in step S3 into the R language server for cluster analysis; S7, analyzing the feature items of the input data sample with the R language server and comparing them against the information type library to obtain the information clustering result. The invention clusters information efficiently and with good accuracy.

Description

An information clustering method based on the R language

Technical field

The present invention relates to the technical field of information clustering, and in particular to an information clustering method based on the R language.

Background art

R is a complete software system for data processing, computation, and graphics. Its features include: a data storage and handling system; array operation tools (particularly powerful for vector and matrix operations); a complete, coherent set of statistical analysis tools; excellent statistical graphics; and a simple yet powerful programming language that can handle data input and output, supports branching and loops, and lets users define their own functions.

In practice, different pieces of information need to be clustered. The volume of information is huge, and the same or similar information is presented differently in different regions (for example, names or wording differ from one another), which makes it hard to carry out information work accurately and quickly. The information therefore needs to be clustered so that normal work can proceed in an orderly way. Existing information clustering methods are inefficient and their discrimination accuracy is low, so calculation errors can easily disrupt normal work.

To solve the above problems, this application proposes an information clustering method based on the R language.

Summary of the invention

(1) Object of the invention

To solve the technical problems described in the background art, the present invention proposes an information clustering method based on the R language that clusters information efficiently and with good accuracy.

(2) Technical solution

To solve the above problems, the present invention provides an information clustering method based on the R language, the method comprising the following steps:

S1, compiling statistics on information and establishing an information type library;

S2, obtaining the information data to be clustered to form an input data sample;

S3, preprocessing the input data sample to form the feature item set of the input data sample;

S4, setting up an R language server;

S5, importing the data in the information type library into the memory of the R language server;

S6, inputting the feature item set of the input data sample produced in step S3 into the R language server for cluster analysis;

S7, analyzing the feature items of the input data sample with the R language server and comparing them against the information type library to obtain the information clustering result.

Preferably, the information type library established in step S1 is managed.

Preferably, managing the information type library includes adding new information types in real time and deleting obsolete, discarded information types.

Preferably, the information data to be clustered obtained in step S2 is information data from a period of historical time.

Preferably, the preprocessing of the input data sample in step S3 is word segmentation, and the word segmentation includes: when a symbol, English word, and/or number is detected in the sample information, judging the degree of correlation between the symbol, English word, and/or number and the sample information;

when the degree of correlation between the symbol, English word, and/or number and the sample information is judged to be lower than a specified value, deleting the symbol, English word, and/or number.

Preferably, preprocessing the input data sample in step S3 to form the feature item set of the input data sample further includes detecting whether the words produced by word segmentation are identical to words in a preset stop-word list; when a word produced by word segmentation is identical to a word in the preset stop-word list, that word is deleted.

Preferably, in step S5, importing the data in the information type library into the memory of the R language server specifically comprises writing an R script for reading the data, and loading the specified information type library data into the memory of the R language server by calling a shell.

Preferably, in step S5, importing the data in the information type library into the memory of the R language server further includes a data update step, specifically: for data with low real-time requirements, a scheduled task is set up that triggers a data update operation at a specified time interval and loads the data updated in the information type library into the memory of the R language server; for data with high real-time requirements, a daemon process is written that monitors in real time the update status of the specified tables in the information type library and synchronously loads the updated data into the memory of the R language server.

The above technical solution of the present invention has the following beneficial technical effects: compiling statistics on information and establishing an information type library gives information clustering a basis for information lookup and reference, which improves clustering efficiency and accuracy; collecting the information data to be clustered into an input data sample makes it convenient to cluster the information with the R language; preprocessing the input data sample removes data unrelated to information clustering, refining the sample and removing noise, which speeds up clustering; setting up an R language server lets the input data sample be clustered with the R language; importing the data in the information type library into the memory of the R language server, inputting the feature item set of the input data sample into the R language server, and having the R language server compare against the information type library yields the information clustering result, which facilitates information lookup and reference when clustering with the R language and improves the efficiency and accuracy of information clustering.

Brief description of the drawings

Fig. 1 is a schematic diagram of the information clustering method based on the R language proposed by the present invention.

Specific embodiment

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawing. It should be understood that these descriptions are merely exemplary and are not intended to limit the scope of the present invention. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the present invention.

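As shown in Fig. 1, the present invention proposes an information clustering method based on the R language, the method comprising the following steps:

S1, compiling statistics on information and establishing an information type library;

S2, obtaining the information data to be clustered to form an input data sample;

S3, preprocessing the input data sample to form the feature item set of the input data sample;

S4, setting up an R language server;

S5, importing the data in the information type library into the memory of the R language server;

S6, inputting the feature item set of the input data sample produced in step S3 into the R language server for cluster analysis;

S7, analyzing the feature items of the input data sample with the R language server and comparing them against the information type library to obtain the information clustering result.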

In the present invention, compiling statistics on information and establishing an information type library gives information clustering a basis for information lookup and reference, which improves clustering efficiency and accuracy. Collecting the information data to be clustered into an input data sample makes it convenient to cluster the information with the R language. Preprocessing the input data sample removes data unrelated to information clustering, refining the sample and removing noise, which speeds up clustering. Setting up an R language server lets the input data sample be clustered with the R language. Importing the data in the information type library into the memory of the R language server, inputting the feature item set of the input data sample into the R language server, and having the R language server compare against the information type library yields the information clustering result, which facilitates information lookup and reference when clustering with the R language and improves the efficiency and accuracy of information clustering.
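The comparison of feature items against the information type library (steps S6 and S7) can be illustrated with a minimal sketch. The source does not state which similarity measure or clustering algorithm is used, so the Jaccard similarity below is an assumption, as are the names `jaccard`, `assign_types`, and the sample data. The sketch is written in Python for illustration only; in the described method this analysis runs on the R language server.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_types(samples: dict, type_library: dict) -> dict:
    """Map each sample id to its best-matching type, or None if nothing overlaps.

    Assumption: each type in the library is described by a keyword set, and a
    sample belongs to the type whose keywords overlap most with its features.
    """
    result = {}
    for sample_id, features in samples.items():
        best_type, best_score = None, 0.0
        for type_name, keywords in type_library.items():
            score = jaccard(set(features), set(keywords))
            if score > best_score:
                best_type, best_score = type_name, score
        result[sample_id] = best_type
    return result


# Hypothetical type library and feature item sets.
type_library = {
    "weather": {"rain", "storm", "temperature"},
    "traffic": {"road", "congestion", "accident"},
}
samples = {
    "s1": ["storm", "rain", "warning"],
    "s2": ["road", "accident"],
    "s3": ["concert"],
}
clusters = assign_types(samples, type_library)
```

Under these assumptions, samples that share no feature with any type stay unassigned, which is one way such records could be flagged as candidates for a new information type.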

In an optional embodiment, the information type library established in step S1 is managed.

It should be noted that managing the information type library effectively ensures the accuracy of its data, which improves the accuracy of information clustering.

In an optional embodiment, managing the information type library includes adding new information types in real time and deleting obsolete, discarded information types.

It should be noted that the information type library is updated promptly as the information itself changes: new information types are added and discarded information types are removed in a timely manner, which ensures the accuracy and timeliness of the information type library data and thereby improves the accuracy of information clustering.
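Managing the type library by adding and deleting types can be sketched as follows. The source does not describe how the library is stored, so the in-memory `TypeLibrary` class, its method names, and the example types are all hypothetical, and Python is used for illustration only.

```python
class TypeLibrary:
    """Hypothetical in-memory information type library: type name -> keyword set."""

    def __init__(self):
        self._types = {}

    def add_type(self, name, keywords):
        """Add a new information type, or replace its keywords if it already exists."""
        self._types[name] = set(keywords)

    def remove_type(self, name):
        """Delete a discarded information type; unknown names are ignored."""
        self._types.pop(name, None)

    def names(self):
        """Return the current type names in sorted order."""
        return sorted(self._types)


library = TypeLibrary()
library.add_type("weather", ["rain", "storm"])
library.add_type("fax notice", ["fax"])
library.add_type("traffic", ["road"])
library.remove_type("fax notice")  # an obsolete type is dropped
```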

In an optional embodiment, the information data to be clustered obtained in step S2 is information data from a period of historical time.

It should be noted that the information being clustered is usually information data from a period of historical time. Information data from too long ago has little value for clustering, and collecting information data over too long a period lets that data go stale. Clustering is therefore bounded by a period of historical time, and clustering repeatedly over such periods ensures the timeliness and accuracy of the information data.
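Restricting the input to a period of historical time can be sketched as a simple window filter. The source does not give a window length or a record layout, so the 30-day window, the `time` field, and the function name `within_window` are assumptions made for illustration, again in Python.

```python
from datetime import datetime, timedelta


def within_window(records, now, window):
    """Keep records whose timestamp falls in the closed interval [now - window, now]."""
    start = now - window
    return [r for r in records if start <= r["time"] <= now]


# Hypothetical records; only the timestamp matters for this step.
now = datetime(2019, 7, 2)
records = [
    {"id": 1, "time": datetime(2019, 6, 25)},
    {"id": 2, "time": datetime(2018, 1, 1)},
    {"id": 3, "time": datetime(2019, 7, 1)},
]
recent = within_window(records, now, timedelta(days=30))
```

Running the same filter with a sliding `now` gives the repeated, bounded clustering runs described above.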

In an optional embodiment, the preprocessing of the input data sample in step S3 is word segmentation, and the word segmentation includes: when a symbol, English word, and/or number is detected in the sample information, judging the degree of correlation between the symbol, English word, and/or number and the sample information; when that degree of correlation is judged to be lower than a specified value, deleting the symbol, English word, and/or number.

It should be noted that the input data sample is segmented, and the data feature items produced by segmentation are then identified and filtered. This effectively removes data in the input data sample that is unrelated to the information being clustered, refining the sample and removing noise so that the data can be clustered subsequently.

In an optional embodiment, preprocessing the input data sample in step S3 to form the feature item set of the input data sample further includes detecting whether the words produced by word segmentation are identical to words in a preset stop-word list; when a word produced by word segmentation is identical to a word in the preset stop-word list, that word is deleted.

It should be noted that the result of word segmentation usually contains a number of meaningless words. These words not only contribute nothing to the result but also consume a large amount of computing and storage resources, so they need to be filtered out before computation. In addition, in actual operation there are also words that interfere with normal clustering; these words can likewise be collected in the preset stop-word list. When such words are found in the sample information, they are deleted from it.
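Stop-word removal amounts to a set-membership check on each segmented word. The sketch below assumes the preset stop-word list is held as a set; the function name `remove_stop_words` and the example words are illustrative and not taken from the source. Python is used for illustration only.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that do not appear in the preset stop-word list."""
    return [t for t in tokens if t not in stop_words]


# Hypothetical stop-word list and segmented sample.
stop_words = {"的", "了"}
tokens = ["台风", "的", "预警", "了"]
features = remove_stop_words(tokens, stop_words)
```

Words found in practice to interfere with clustering could be added to the same set, so one pass handles both kinds of unwanted words.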

In an optional embodiment, in step S5, importing the data in the information type library into the memory of the R language server specifically comprises writing an R script for reading the data, and loading the specified information type library data into the memory of the R language server by calling a shell.

It should be noted that loading the information type library into the R language server gives information clustering a basis for information lookup and reference, improving the efficiency and accuracy of information clustering.
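Invoking an R loader script from a shell can be sketched as below. `Rscript` is R's standard command-line front end for running scripts; the script name `load_types.R`, the file `type_library.csv`, and both function names are hypothetical. The source does not describe how the script hands the data to the long-running R language server, so that part is not shown. Python is used for illustration only.

```python
import shutil
import subprocess


def build_load_command(script_path, library_path):
    """Build the command line that runs the R loader script on one library file."""
    return ["Rscript", script_path, library_path]


def load_type_library(script_path, library_path):
    """Run the loader if Rscript is installed; return True only on a zero exit code."""
    if shutil.which("Rscript") is None:
        return False  # R is not available on this machine
    completed = subprocess.run(
        build_load_command(script_path, library_path), check=False
    )
    return completed.returncode == 0


# Hypothetical script and data file names.
command = build_load_command("load_types.R", "type_library.csv")
```

Passing the command as a list rather than a single string avoids shell quoting problems when a path contains spaces.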

In an optional embodiment, in step S5, importing the data in the information type library into the memory of the R language server further includes a data update step, specifically: for data with low real-time requirements, a scheduled task is set up that triggers a data update operation at a specified time interval and loads the data updated in the information type library into the memory of the R language server; for data with high real-time requirements, a daemon process is written that monitors in real time the update status of the specified tables in the information type library and synchronously loads the updated data into the memory of the R language server.

It should be noted that monitoring the information type library in real time ensures the timeliness and accuracy of its data, thereby improving the accuracy of information clustering.
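Both update paths can share one incremental sync routine and differ only in how often it is called. The sketch below assumes each row in the type library carries a monotonically increasing `version` number; that field, the row layout, and the name `sync_updates` are assumptions, since the source does not say how updates are detected. Python is used for illustration only.

```python
def sync_updates(source_rows, cache, last_seen):
    """Copy rows newer than `last_seen` into the cache.

    Returns the highest version now reflected in the cache, to be passed as
    `last_seen` on the next call.
    """
    newest = last_seen
    for row in source_rows:
        if row["version"] > last_seen:
            cache[row["type"]] = row["keywords"]
            newest = max(newest, row["version"])
    return newest


# Hypothetical type library rows and the in-memory copy held by the server.
cache = {}
rows = [
    {"type": "weather", "keywords": ["rain"], "version": 1},
    {"type": "traffic", "keywords": ["road"], "version": 2},
]
mark = sync_updates(rows, cache, last_seen=0)

# A later change to the library is picked up by the next sync.
rows.append({"type": "weather", "keywords": ["rain", "storm"], "version": 3})
mark = sync_updates(rows, cache, last_seen=mark)
```

A scheduled task would call `sync_updates` once per interval, while a daemon process would call it continuously or whenever it observes a change in the watched tables.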

It should be understood that the above specific embodiments of the present invention are used only to illustrate or explain the principles of the present invention and do not limit it. Therefore, any modification, equivalent replacement, improvement, and the like made without departing from the spirit and scope of the present invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all variations and modifications that fall within the scope and boundaries of the appended claims, or equivalents of such scope and boundaries.

Claims

1. An information clustering method based on the R language, characterized in that the method comprises the following steps:

S1, compiling statistics on information and establishing an information type library;

S2, obtaining the information data to be clustered to form an input data sample;

S3, preprocessing the input data sample to form the feature item set of the input data sample;

S4, setting up an R language server;

S5, importing the data in the information type library into the memory of the R language server;

S6, inputting the feature item set of the input data sample produced in step S3 into the R language server for cluster analysis;

S7, analyzing the feature items of the input data sample with the R language server and comparing them against the information type library to obtain the information clustering result.

2. The information clustering method based on the R language according to claim 1, characterized in that the information type library established in step S1 is managed.

3. The information clustering method based on the R language according to claim 2, characterized in that managing the information type library includes adding new information types in real time and deleting obsolete, discarded information types.

4. The information clustering method based on the R language according to claim 1, characterized in that the information data to be clustered obtained in step S2 is information data from a period of historical time.

5. The information clustering method based on the R language according to claim 1, characterized in that the preprocessing of the input data sample in step S3 is word segmentation, and the word segmentation includes: when a symbol, English word, and/or number is detected in the sample information, judging the degree of correlation between the symbol, English word, and/or number and the sample information;

when the degree of correlation between the symbol, English word, and/or number and the sample information is judged to be lower than a specified value, deleting the symbol, English word, and/or number.

6. The information clustering method based on the R language according to claim 1, characterized in that preprocessing the input data sample in step S3 to form the feature item set of the input data sample further includes detecting whether the words produced by word segmentation are identical to words in a preset stop-word list; when a word produced by word segmentation is identical to a word in the preset stop-word list, that word is deleted.

7. The information clustering method based on the R language according to claim 1, characterized in that, in step S5, importing the data in the information type library into the memory of the R language server specifically comprises writing an R script for reading the data, and loading the specified information type library data into the memory of the R language server by calling a shell.

8. The information clustering method based on the R language according to claim 1, characterized in that, in step S5, importing the data in the information type library into the memory of the R language server further includes a data update step, specifically: for data with low real-time requirements, a scheduled task is set up that triggers a data update operation at a specified time interval and loads the data updated in the information type library into the memory of the R language server; for data with high real-time requirements, a daemon process is written that monitors in real time the update status of the specified tables in the information type library and synchronously loads the updated data into the memory of the R language server.