CN108846142A - Text clustering method, apparatus, device, and readable storage medium - Google Patents

Text clustering method, apparatus, device, and readable storage medium

Info

Publication number
CN108846142A
Authority
CN
China
Prior art keywords
text
target
neural network
cluster
target source
Prior art date
2018-07-12
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810763151.0A
Other languages
Chinese (zh)
Inventor
曾广移
李德华
巩宇
卢勇
丁钊
杨小龙
梁莉雪
黄小凤
王晓翼
杨宗强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peak and Frequency Regulation Power Generation Co of China Southern Power Grid Co Ltd
Original Assignee
Peak and Frequency Regulation Power Generation Co of China Southern Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-07-12
Filing date
2018-07-12
Publication date
2018-11-20
2018-07-12 Application filed by Peak and Frequency Regulation Power Generation Co of China Southern Power Grid Co Ltd
2018-07-12 Priority to CN201810763151.0A
2018-11-20 Publication of CN108846142A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text clustering method applied to a server in a distributed cluster, comprising: obtaining target source text to be clustered; extracting text features from the target source text using a maximum-probability method to obtain target data; reading a preset neural network training model from the server's own cache; and performing cluster analysis on the target data according to the neural network training model and a neural network algorithm, generating a file family corresponding to the target source text. Because the method runs on a distributed cluster, and the intermediate results produced by the neural network training model during clustering are stored in the servers' caches, it increases both the volume of text that can be clustered and the efficiency of clustering; at the same time, the neural network algorithm improves the accuracy of the clustering results. Correspondingly, the text clustering apparatus, device, and readable storage medium disclosed by the invention achieve the same technical effects.

Description

Text clustering method, apparatus, device, and readable storage medium
Technical field
The present invention relates to the field of clustering technology, and more specifically to a text clustering method, apparatus, device, and readable storage medium.
Background technique
With the continuing convergence of computer technology and clustering technology, text clustering has become an important means of effectively organizing, summarizing, and navigating textual information.
At present, existing text clustering is generally implemented on a single machine. Because a single machine is limited, the volume of text it can cover is small. Moreover, because the intermediate results of clustering are stored on a back-end hard disk, every iteration of the computation must read data from the hard disk, which slows the computation rate and in turn reduces the efficiency of text clustering. At the same time, because the cluster-analysis algorithm used is relatively complex, the accuracy of the clustering results cannot be guaranteed when the computation rate is slow.
Therefore, how to improve the efficiency and accuracy of text clustering is a problem to be solved by those skilled in the art.
Summary of the invention
The purpose of the present invention is to provide a text clustering method, apparatus, device, and readable storage medium, so as to improve the efficiency and accuracy of text clustering.
To achieve the above object, the embodiments of the invention provide the following technical solutions:
A text clustering method, applied to a server in a distributed cluster, comprising:
obtaining target source text to be clustered;
extracting text features from the target source text using a maximum-probability method to obtain target data;
reading a preset neural network training model from the server's own cache; and
performing cluster analysis on the target data according to the neural network training model and a neural network algorithm, and generating a file family corresponding to the target source text.
Wherein extracting the text features from the target source text using the maximum-probability method to obtain the target data comprises:
preprocessing the target source text, and extracting word segments from the preprocessed target source text, the word segments including numbers, dates, names, and parts of speech; and
extracting the text features from the word segments, and determining by the maximum-probability method the text features with the highest probability of occurrence, the text features including word weight, word frequency, and inverse document frequency.
Wherein the generation of the neural network training model comprises:
obtaining target training text, and normalizing the target training text;
performing sparse logistic regression on the normalized target training text based on random numbers and preset connection values and thresholds, to obtain a target training set; and
iteratively computing the target training set based on the random numbers, connection values, and thresholds, to generate the neural network training model.
Wherein generating the file family corresponding to the target source text comprises:
generating the file family by means of a vector space model and the cosine of the angle between vectors.
Wherein, after generating the file family corresponding to the target source text, the method further comprises:
displaying the file family visually.
A text clustering apparatus, applied to a server in a distributed cluster, comprising:
an obtaining module, for obtaining target source text to be clustered;
an extraction module, for extracting text features from the target source text using a maximum-probability method to obtain target data;
a reading module, for reading a preset neural network training model from the server's own cache; and
a clustering module, for performing cluster analysis on the target data according to the neural network training model and a neural network algorithm, and generating a file family corresponding to the target source text.
Wherein the extraction module comprises:
a preprocessing unit, for preprocessing the target source text and extracting word segments from the preprocessed target source text, the word segments including numbers, dates, names, and parts of speech; and
an extraction unit, for extracting the text features from the word segments and determining by the maximum-probability method the text features with the highest probability of occurrence, the text features including word weight, word frequency, and inverse document frequency.
Wherein the clustering module is specifically configured to:
generate the file family by means of a vector space model and the cosine of the angle between vectors.
A text clustering device, comprising:
a memory, for storing a computer program; and
a processor which, when executing the computer program, implements the steps of any of the text clustering methods described above.
A readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of any of the text clustering methods described above.
From the above solutions it can be seen that the embodiments of the invention provide a text clustering method applied to a server in a distributed cluster, comprising: obtaining target source text to be clustered; extracting text features from the target source text using a maximum-probability method to obtain target data; reading a preset neural network training model from the server's own cache; and performing cluster analysis on the target data according to the neural network training model and a neural network algorithm to generate a file family corresponding to the target source text.
It can be seen that the method obtains target data by extracting text features from the acquired target source text, and performs cluster analysis on the target data based on the caches of the servers in the distributed cluster, the neural network training model, and the neural network algorithm, thereby generating a file family corresponding to the target source text. Because the method runs on a distributed cluster, the volume of text it can cover is large, which expands the data volume of text clustering. Moreover, because the neural network training model is stored in the server's cache, and the intermediate results produced during clustering are also stored in the server's cache, data are read from the cache throughout the clustering process, which improves the data read rate and in turn the efficiency of text clustering. At the same time, the scheme uses a neural network algorithm, which improves the accuracy of the clustering results.
If this method is used to cluster enterprise texts, not only can the efficiency of text clustering be improved but, because enterprise texts are relatively standardized, the accuracy of the resulting clusters increases significantly; based on such clustering results, staff can also locate files more easily, improving work efficiency.
Correspondingly, the text clustering apparatus, device, and readable storage medium provided by the embodiments of the invention likewise have the above technical effects.
Detailed description of the invention
In order to explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a flowchart of a text clustering method disclosed by an embodiment of the invention;
Fig. 2 is a flowchart of another text clustering method disclosed by an embodiment of the invention;
Fig. 3 is a schematic diagram of a text clustering apparatus disclosed by an embodiment of the invention;
Fig. 4 is a schematic diagram of a text clustering device disclosed by an embodiment of the invention.
Specific embodiment
The technical solutions in the embodiments of the invention will now be described clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the invention.
The embodiments of the invention disclose a text clustering method, apparatus, device, and readable storage medium, so as to improve the efficiency and accuracy of text clustering.
Referring to Fig. 1, a text clustering method provided by an embodiment of the invention, applied to a server in a distributed cluster, comprises:
S101: obtain target source text to be clustered;
Specifically, the target source text includes various enterprise text files and short web texts.
S102: extract text features from the target source text using the maximum-probability method to obtain target data;
In this embodiment, when the target source text to be clustered is obtained, the text features in the target source text are first extracted using the maximum-probability method, so as to obtain the target data.
Note that a character string to be segmented may admit several segmentations. For example, the phrase 有意见分歧 ("there is a difference of opinion") can be segmented as 有/意见/分歧 ("there is / opinion / divergence"), as 有意/见/分歧 ("intentionally / see / divergence"), or in still other ways. In such cases, the segmentation with the highest probability is taken as the final segmentation.
S103: read the preset neural network training model from the server's own cache;
Specifically, the neural network training model is pre-stored in the cache of each server, and the intermediate results produced during text clustering are likewise temporarily stored in the cache of each server. Intermediate results can therefore be read continuously from the cache, which improves the data read rate and in turn the efficiency of text clustering.
S104: perform cluster analysis on the target data according to the neural network training model and the neural network algorithm, and generate a file family corresponding to the target source text.
Preferably, when the number of target source texts to be clustered is one billion, the text clustering method provided in this embodiment can divide the one billion target source texts into multiple file sets and distribute the resulting file sets to the servers in the distributed cluster, so that every server performs cluster analysis on its file set in parallel, thereby increasing both the data throughput and the processing efficiency of text clustering.
For example, when there are 10 nodes in the distributed cluster, i.e. 10 servers, the one billion target source texts are divided into 10 file sets, each containing 100 million target files. Each server then handles only 100 million target source texts, and all servers in the distributed cluster process them in parallel, which greatly improves the processing efficiency of text clustering. Of course, for ease of management, the 10 nodes can be divided into master nodes and slave nodes according to their respective duties; the number of master nodes can be set to 2, so that if one fails suddenly the other can act as a standby for emergencies.
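For illustration only, a minimal sketch of such a partition (not from the patent; the hash-based assignment and synthetic file IDs are assumptions):

```python
from collections import defaultdict

def partition_files(file_ids, num_nodes):
    """Split the corpus into num_nodes file sets by hashing each file ID,
    so every server in the distributed cluster gets a roughly equal share."""
    file_sets = defaultdict(list)
    for fid in file_ids:
        file_sets[hash(fid) % num_nodes].append(fid)
    return file_sets

# 10 nodes, as in the example above; each file set is then clustered in parallel.
sets_by_node = partition_files([f"doc_{i}" for i in range(100_000)], num_nodes=10)
print({node: len(files) for node, files in sorted(sets_by_node.items())})
```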
It can be seen that this embodiment provides a text clustering method that obtains target data by extracting text features from the acquired target source text, and performs cluster analysis on the target data based on the caches of the servers in the distributed cluster, the neural network training model, and the neural network algorithm, so as to generate a file family corresponding to the target source text. Because the method runs on a distributed cluster, the volume of text it can cover is large, which expands the data volume of text clustering. Moreover, because the neural network training model is stored in the server's cache, and the intermediate results produced during clustering are also stored there, data are read continuously from the cache during clustering, improving the data read rate and in turn the efficiency of text clustering; at the same time, the scheme's neural network algorithm improves the accuracy of the clustering results. If this method is used to cluster enterprise texts, not only can clustering efficiency be improved but, because enterprise texts are relatively standardized, the accuracy of the resulting clusters increases significantly; with such clustering results, staff can also locate files more easily, improving work efficiency.
An embodiment of the invention discloses another text clustering method; relative to the previous embodiment, this embodiment further explains and optimizes the technical solution.
Referring to Fig. 2, another text clustering method provided by an embodiment of the invention, applied to a server in a distributed cluster, comprises:
S201: obtain target source text to be clustered;
S202: extract text features from the target source text using the maximum-probability method to obtain target data;
In this embodiment, the target data is a text matrix.
S203: read the preset neural network training model from the server's own cache;
S204: perform cluster analysis on the target data according to the neural network training model and the neural network algorithm;
S205: generate the file family corresponding to the target source text by means of the vector space model and the cosine of the angle between vectors.
In this embodiment, the file family corresponding to the target source text is generated using the vector space model and the cosine of the angle between vectors.
The basic idea of the vector space model is to reduce a text to an N-dimensional vector whose components are the weights of its feature terms (i.e., keywords). The model assumes that words are mutually uncorrelated, so that a text can be represented by a vector. This simplifies the complex relationships among the keywords in a text, representing the text as a very simple vector and making the model computable. It should be noted that in the vector space model, "text" refers to any machine-readable record.
Let D denote a text and T a feature term; then T represents a basic linguistic unit of the content of text D, consisting mainly of words or phrases. Text D can be represented by its set of feature terms D(T1, T2, ..., Tn), where Tk is a feature term and 1 ≤ k ≤ n.
Suppose a text has four feature terms a, b, c, and d; the text can then be represented as D(a, b, c, d), and any other text compared with it must follow the same ordering of feature terms. For a text containing n feature terms, each feature term is usually assigned a weight indicating its importance, i.e., D = D(T1, W1; T2, W2; ...; Tn, Wn), abbreviated as D = D(W1, W2, ..., Wn). This is called the weight-vector representation of text D, where Wk is the weight of Tk and 1 ≤ k ≤ n.
Based on the above assumption, if the weights of a, b, c, and d are 30, 20, 20, and 10 respectively, the vector of the text is D(30, 20, 20, 10). In the vector space model, the content relevance Sim(D1, D2) between two texts D1 and D2 is expressed by the cosine of the angle between their vectors:

$$\mathrm{Sim}(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} W_{1k} W_{2k}}{\sqrt{\sum_{k=1}^{n} W_{1k}^{2}}\,\sqrt{\sum_{k=1}^{n} W_{2k}^{2}}}$$

where W1k and W2k are the weights of the k-th feature term of texts D1 and D2 respectively, θ is the angle between vectors D1 and D2, and 1 ≤ k ≤ n.
It should be noted that when texts are classified during text clustering, the above method is used to calculate the relevance between the text to be classified and a given category.
Suppose the feature terms of text D1 are a, b, c, and d with weights 30, 20, 20, and 10, and the feature terms of category C1 are a, c, d, and e with weights 40, 30, 20, and 10. Over the combined term order (a, b, c, d, e), the vector of D1 is D1(30, 20, 20, 10, 0) and the vector of C1 is C1(40, 0, 30, 20, 10); the cosine of the angle between C1 and D1 is then 0.86, i.e., the relevance between text D1 and category C1 is 0.86.
The specific computation is as follows. The modulus of an n-dimensional vector V(v1, v2, ..., vn) is |V| = sqrt(v1·v1 + v2·v2 + ... + vn·vn); the dot product of two vectors m and n is m·n = m1·n1 + m2·n2 + ... + mn·nn; and the similarity is sim = (m·n)/(|m|·|n|), whose physical meaning is the cosine of the angle between the two vectors.
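A minimal sketch of this computation (the function name is an assumption), reproducing the worked example above:

```python
import math

def cosine_similarity(m, n):
    """Cosine of the angle between two weight vectors: (m . n) / (|m| * |n|)."""
    dot = sum(mi * ni for mi, ni in zip(m, n))
    mod_m = math.sqrt(sum(mi * mi for mi in m))
    mod_n = math.sqrt(sum(ni * ni for ni in n))
    return dot / (mod_m * mod_n)

# Worked example from the description: D1 and category C1 over terms (a, b, c, d, e).
d1 = [30, 20, 20, 10, 0]
c1 = [40, 0, 30, 20, 10]
print(round(cosine_similarity(d1, c1), 2))  # 0.86
```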
It can be seen that this embodiment provides another text clustering method that obtains target data by extracting text features from the acquired target source text, performs cluster analysis on the target data based on the caches of the servers in the distributed cluster, the neural network training model, and the neural network algorithm, and generates the file family corresponding to the target source text by means of the vector space model and the cosine of the angle between vectors. Because the method runs on a distributed cluster, the volume of text it can cover is large, expanding the data volume of text clustering; because the neural network training model and the intermediate results produced during clustering are stored in the server's cache, data are read continuously from the cache, improving the data read rate and in turn the efficiency of text clustering; and the neural network algorithm improves the accuracy of the clustering results.
If this method is used to cluster enterprise texts, not only can the efficiency of text clustering be improved but, because enterprise texts are relatively standardized, the accuracy of the resulting clusters increases significantly; based on such clustering results, staff can also locate files more easily, improving work efficiency.
Based on any of the above embodiments, it should be noted that extracting the text features from the target source text using the maximum-probability method to obtain the target data comprises:
preprocessing the target source text, and extracting word segments from the preprocessed target source text, the word segments including numbers, dates, names, and parts of speech; and
extracting the text features from the word segments, and determining by the maximum-probability method the text features with the highest probability of occurrence, the text features including word weight, word frequency, and inverse document frequency.
Specifically, the feature-extraction (segmentation) process can be expressed by the following Bayesian formula, where S is the character string to be segmented and W a candidate segmentation:

$$P(W \mid S) = \frac{P(S \mid W)\, P(W)}{P(S)}$$

Therefore P(W) = P(W1, W2, ..., Wi) ≈ P(W1) × P(W2) × ... × P(Wi), where P(Wi) equals the quotient n/N of the frequency n with which Wi occurs in the corpus and the total word count N of the corpus. The corpus stores a large volume of sampled and processed text.
Here P(W|S) is the probability of segmentation W given the string S, and P(S|W) is the probability of the string given its words. P(S|W) can be treated as identically equal to 1, because the sentence generated under any hypothetical segmentation always reproduces the segmentation result exactly (only the boundary markers between segments need to be discarded); and P(S) is the same under every segmentation, so it does not affect the comparison. Hence P(W|S) is in effect proportional to P(W), and maximizing it reduces to maximizing P(W).
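As a sketch only, the following applies this maximum-probability model to the earlier example 有意见分歧; the toy probability table P(Wi) = n/N is an assumption, since the patent publishes no corpus counts:

```python
import math

# Hypothetical unigram probabilities P(Wi) = n / N from an assumed toy corpus.
P = {"有": 0.018, "有意": 0.0005, "意见": 0.001, "见": 0.002, "分歧": 0.0008}

def max_prob_segment(s, prob, max_len=4):
    """best[i] holds the highest log P(W) and its segmentation for the prefix s[:i]."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(s)
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_len), i):
            word = s[j:i]
            if word in prob and best[j][1] is not None:
                score = best[j][0] + math.log(prob[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[len(s)][1]

print("/".join(max_prob_segment("有意见分歧", P)))  # 有/意见/分歧 beats 有意/见/分歧
```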
When expressing word weights, a text can be represented as a vector in the vector space model. The word weight indicates the contribution of a word to its sentence. For example, in "Most scientists think that butterflies use the position of the sun in the sky as a kind of compass that allows them to determine which way is north", the important words include butterflies, monarchs, scientists, and compass, while the unimportant words include most, think, kind, and sky; the word weight is precisely a measure of each word's importance.
The word frequency is the number of times a word occurs in a sentence and is used to compute the word weight. The word weight T is therefore computed as

$$T = \frac{t_f}{\mathrm{doc\_length}}$$

where t_f is the word frequency and doc_length is the length of the character string.
Specifically, the method for extracting word weights is: obtain topic-word candidates from the word segments by the Bayesian formula; obtain the word frequency and position of each topic-word candidate; compute the word weight of each topic-word candidate; and take the candidate with the largest word weight as the final topic word. The weight of a topic-word candidate is computed as weight_i = α × fre_i + e × loc_i, where weight_i is the weight of candidate i, fre_i is the word-frequency weight factor, loc_i is the position weight factor, α is the word-frequency adjustment factor, and e is the position adjustment factor.
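A minimal sketch of this scoring (the values of the adjustment factors α and e and the candidate data are assumptions; the patent does not fix them):

```python
def candidate_weight(fre, loc, alpha=0.7, e=0.3):
    """weight_i = alpha * fre_i + e * loc_i, per the formula above."""
    return alpha * fre + e * loc

# Hypothetical topic-word candidates as (word, frequency factor, position factor).
candidates = [("butterflies", 0.6, 0.9), ("sky", 0.4, 0.2), ("compass", 0.5, 0.7)]
topic_word = max(candidates, key=lambda c: candidate_weight(c[1], c[2]))
print(topic_word[0])  # the candidate with the largest weight becomes the topic word
```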
The inverse document frequency reflects the number of texts that contain a given word. In general, the more texts a word occurs in, the smaller the word's contribution to any single text, i.e., the less discriminative the word is for distinguishing different texts. Consistent with the variables named below and the stated range, the inverse document frequency I can be written as

$$I = \frac{\log(N / d_f)}{\log N}$$

where N is the number of texts, d_f is the document frequency (the number of texts in which the word segment occurs), and the formula keeps the inverse document frequency within [0, 1].
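Putting the three features together, a sketch under the assumptions above (the normalized inverse document frequency is reconstructed as log(N/df)/log N, and the toy corpus is invented):

```python
import math

def word_frequency(word, doc):
    """tf: the number of times the word occurs in the document (a list of segments)."""
    return doc.count(word)

def word_weight(word, doc):
    """T = tf / doc_length; document length is counted in segments here for simplicity."""
    return word_frequency(word, doc) / len(doc)

def inverse_document_frequency(word, corpus):
    """Normalized IDF kept within [0, 1]: log(N / df) / log N."""
    n = len(corpus)
    df = sum(1 for doc in corpus if word in doc)
    return math.log(n / df) / math.log(n) if df else 1.0

corpus = [["意见", "分歧"], ["意见", "一致"], ["分歧", "调解"], ["报告", "总结"]]
print(word_weight("意见", corpus[0]))              # 1 / 2 = 0.5
print(inverse_document_frequency("意见", corpus))  # log(4/2)/log(4) = 0.5
```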
Based on any of the above embodiments, it should be noted that during the analysis of text data, the intermediate data are stored in the cache in order to improve the read efficiency of the data.
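A minimal sketch of such cache-backed iteration (illustrative only; an in-process dictionary stands in for the server's cache, which is an assumption):

```python
class IntermediateCache:
    """Holds intermediate clustering results in memory so that each iteration
    reads them back from the cache instead of from a back-end hard disk."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, key, compute):
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]

cache = IntermediateCache()
model = cache.get_or_compute("nn_model", lambda: {"connections": [], "threshold": 0.5})
step0 = cache.get_or_compute("iteration_0", lambda: [[0.1, 0.2], [0.5, 0.4]])
```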
Based on any of the above embodiments, it should be noted that the generation of the neural network training model comprises the following steps (a sketch follows the list):
obtaining target training text, and normalizing the target training text;
performing sparse logistic regression on the normalized target training text based on random numbers and preset connection values and thresholds, to obtain a target training set; and
iteratively computing the target training set based on the random numbers, connection values, and thresholds, to generate the neural network training model.
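As a sketch only: the patent names the steps but no solver, so scikit-learn's L1-penalized LogisticRegression stands in for the "sparse logistic regression", and the synthetic matrix, labels, and hyperparameters are all assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)  # the "random number" of the description

# Hypothetical target training text as a matrix: rows are texts, columns are features.
X = rng.random((200, 20))
y = rng.integers(0, 2, size=200)  # assumed labels for the regression step

# Step 1: normalize the target training text (min-max scaling to [0, 1]).
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)

# Step 2: sparse (L1-penalized) logistic regression on the normalized text.
reg = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
reg.fit(X, y)

# Step 3: the sparse coefficients and intercept play the role of the connection
# values and threshold that iterative calculation refines into the training model.
training_model = {"connections": reg.coef_, "threshold": reg.intercept_}
```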
Based on any of the above embodiments, it should be noted that after the file family corresponding to the target source text is generated, the method further comprises:
displaying the file family visually.
Based on any of the above embodiments, it should be noted that the text clustering method provided in this embodiment can be used to build a future server cluster for text clustering: set up a distributed cluster of more than 20 nodes (servers), and divide the servers in the cluster into a master server and slave servers, with the master server managing the slave servers. Each server uses the neural network algorithm to perform high-performance cluster analysis on the source texts, and the clustering process is realized on the basis of caching. The distributed cluster uses the Hadoop platform, to improve the compatibility of the distributed cluster.
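A minimal sketch of the master/standby arrangement just described (illustrative only; the role-assignment logic and node names are assumptions, and a real deployment would sit on the Hadoop platform):

```python
def assign_roles(nodes, num_masters=2):
    """The first num_masters nodes become masters (one active, one standby);
    the rest are slaves that run the neural-network clustering in parallel."""
    masters, slaves = nodes[:num_masters], nodes[num_masters:]
    return {"active_master": masters[0], "standby_master": masters[1], "slaves": slaves}

cluster = assign_roles([f"node-{i:02d}" for i in range(21)])  # more than 20 nodes
print(cluster["active_master"], cluster["standby_master"], len(cluster["slaves"]))
```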
A text clustering apparatus provided by an embodiment of the invention is introduced below; the text clustering apparatus described below and the text clustering method described above may be referred to each other.
Referring to Fig. 3, a text clustering apparatus provided by an embodiment of the invention, applied to a server in a distributed cluster, comprises:
an obtaining module 301, for obtaining target source text to be clustered;
an extraction module 302, for extracting text features from the target source text using the maximum-probability method to obtain target data;
a reading module 303, for reading a preset neural network training model from the server's own cache; and
a clustering module 304, for performing cluster analysis on the target data according to the neural network training model and the neural network algorithm, and generating a file family corresponding to the target source text.
Wherein the extraction module comprises:
a preprocessing unit, for preprocessing the target source text and extracting word segments from the preprocessed target source text, the word segments including numbers, dates, names, and parts of speech; and
an extraction unit, for extracting the text features from the word segments and determining by the maximum-probability method the text features with the highest probability of occurrence, the text features including word weight, word frequency, and inverse document frequency.
Wherein the clustering module is specifically configured to:
generate the file family by means of the vector space model and the cosine of the angle between vectors.
The apparatus further comprises a generation module for the neural network training model, the generation module comprising:
an acquisition unit, for obtaining target training text and normalizing the target training text;
a logistic-regression unit, for performing sparse logistic regression on the normalized target training text based on random numbers and preset connection values and thresholds, to obtain a target training set; and
a computation unit, for iteratively computing the target training set based on the random numbers, connection values, and thresholds, to generate the neural network training model.
The apparatus further comprises:
a display module, for displaying the file family visually.
A text clustering device provided by an embodiment of the invention is introduced below; the text clustering device described below and the text clustering method and apparatus described above may be referred to each other.
Referring to Fig. 4, a text clustering device provided by an embodiment of the invention comprises:
a memory 401, for storing a computer program; and
a processor 402 which, when executing the computer program, implements the steps of the text clustering method described in any of the above embodiments.
A readable storage medium provided by an embodiment of the invention is introduced below; the readable storage medium described below and the text clustering method, apparatus, and device described above may be referred to each other.
A readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the text clustering method described in any of the above embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to each other.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be realized in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A text clustering method, characterized in that it is applied to a server in a distributed cluster and comprises:
obtaining target source text to be clustered;
extracting text features from the target source text using a maximum-probability method to obtain target data;
reading a preset neural network training model from the server's own cache; and
performing cluster analysis on the target data according to the neural network training model and a neural network algorithm, and generating a file family corresponding to the target source text.
2. The text clustering method according to claim 1, characterized in that extracting the text features from the target source text using the maximum-probability method to obtain the target data comprises:
preprocessing the target source text, and extracting word segments from the preprocessed target source text, the word segments including numbers, dates, names, and parts of speech; and
extracting the text features from the word segments, and determining by the maximum-probability method the text features with the highest probability of occurrence, the text features including word weight, word frequency, and inverse document frequency.
3. The text clustering method according to claim 1, characterized in that the generation of the neural network training model comprises:
obtaining target training text, and normalizing the target training text;
performing sparse logistic regression on the normalized target training text based on random numbers and preset connection values and thresholds, to obtain a target training set; and
iteratively computing the target training set based on the random numbers, connection values, and thresholds, to generate the neural network training model.
4. The text clustering method according to claim 1, characterized in that generating the file family corresponding to the target source text comprises:
generating the file family by means of a vector space model and the cosine of the angle between vectors.
5. The text clustering method according to any one of claims 1-4, characterized in that, after generating the file family corresponding to the target source text, the method further comprises:
displaying the file family visually.
6. A text clustering apparatus, characterized in that it is applied to a server in a distributed cluster and comprises:
an obtaining module, for obtaining target source text to be clustered;
an extraction module, for extracting text features from the target source text using a maximum-probability method to obtain target data;
a reading module, for reading a preset neural network training model from the server's own cache; and
a clustering module, for performing cluster analysis on the target data according to the neural network training model and a neural network algorithm, and generating a file family corresponding to the target source text.
7. The text clustering apparatus according to claim 6, characterized in that the extraction module comprises:
a preprocessing unit, for preprocessing the target source text and extracting word segments from the preprocessed target source text, the word segments including numbers, dates, names, and parts of speech; and
an extraction unit, for extracting the text features from the word segments and determining by the maximum-probability method the text features with the highest probability of occurrence, the text features including word weight, word frequency, and inverse document frequency.
8. The text clustering apparatus according to claim 6, characterized in that the clustering module is specifically configured to:
generate the file family by means of a vector space model and the cosine of the angle between vectors.
9. A text clustering device, characterized by comprising:
a memory, for storing a computer program; and
a processor which, when executing the computer program, implements the steps of the text clustering method according to any one of claims 1-5.
10. A readable storage medium, characterized in that a computer program is stored on the readable storage medium, and the computer program, when executed by a processor, implements the steps of the text clustering method according to any one of claims 1-5.
CN201810763151.0A 2018-07-12 2018-07-12 Text clustering method, apparatus, device, and readable storage medium Pending CN108846142A (en)

Priority Applications (1)

Application Number: CN201810763151.0A; Priority Date: 2018-07-12; Filing Date: 2018-07-12; Title: Text clustering method, apparatus, device, and readable storage medium

Publications (1)

Publication Number: CN108846142A
Publication Date: 2018-11-20

Family

ID=64196999

Family Applications (1)

Application Number: CN201810763151.0A (Pending); Priority Date: 2018-07-12; Filing Date: 2018-07-12; Title: Text clustering method, apparatus, device, and readable storage medium

Country Status (1)

Country: CN
Publication: CN108846142A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136337A (en) * 2013-02-01 2013-06-05 北京邮电大学 Distributed knowledge data mining device and mining method used for complex network
CN104504150A (en) * 2015-01-09 2015-04-08 成都布林特信息技术有限公司 News public opinion monitoring system
CN105550222A (en) * 2015-12-07 2016-05-04 中国电子科技网络信息安全有限公司 Distributed storage-based image service system and method
CN105512723A (en) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 Artificial neural network calculating device and method for sparse connection
KR101877243B1 (en) * 2017-04-25 2018-07-11 한국과학기술원 Ap apparatus clustering method using neural network based on reinforcement learning and cooperative communicatin apparatus using neural network based on reinforcement learning
CN106886613A (en) * 2017-05-03 2017-06-23 成都云数未来信息科学有限公司 A kind of Text Clustering Method of parallelization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘珊珊 (Liu Shanshan): "Research and Implementation of a Clustering Neural Network Algorithm Based on the Hadoop Cloud Computing Platform" (基于云计算平台Hadoop的聚类神经网络算法的研究与实现), China Masters' Theses Full-text Database, Information Science and Technology Series *
宋杰 (Song Jie): "Research on Energy Consumption Optimization Methods for Big Data Processing Platforms" (大数据处理平台能耗优化方法的研究), 30 November 2016 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111324737A (en) * 2020-03-23 2020-06-23 中国电子科技集团公司第三十研究所 Bag-of-words model-based distributed text clustering method, storage medium and computing device
CN111522657A (en) * 2020-04-14 2020-08-11 北京航空航天大学 Distributed equipment collaborative deep learning reasoning method
CN111522657B (en) * 2020-04-14 2022-07-22 北京航空航天大学 Distributed equipment collaborative deep learning reasoning method
CN111857097A (en) * 2020-07-27 2020-10-30 中国南方电网有限责任公司超高压输电公司昆明局 Industrial control system abnormity diagnosis information identification method based on word frequency and inverse document frequency
CN111857097B (en) * 2020-07-27 2023-10-31 中国南方电网有限责任公司超高压输电公司昆明局 Industrial control system abnormality diagnosis information identification method based on word frequency and inverse document frequency


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2018-11-20)