CN108846142A - Text clustering method, apparatus, device, and readable storage medium - Google Patents
- Publication number
- CN108846142A (application number CN201810763151.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- target
- neural network
- cluster
- target source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text clustering method applied to a server in a distributed cluster, comprising: obtaining a target source text to be clustered; extracting text features from the target source text using the maximum probability method to obtain target data; reading a preset neural network training model from the server's own cache; and performing cluster analysis on the target data according to the neural network training model and a neural network algorithm to generate a file family corresponding to the target source text. Because the method runs on a distributed cluster, and the intermediate results produced by the neural network training model during clustering are stored in the server's cache, it increases both the data volume and the efficiency of text clustering; at the same time, the neural network algorithm improves the accuracy of the clustering result. Correspondingly, the text clustering apparatus, device, and readable storage medium disclosed by the invention have the same technical effects.
Description
Technical field
The present invention relates to the field of clustering technology, and more specifically to a text clustering method, apparatus, device, and readable storage medium.
Background technique
With the continuous integration and development of computer technology and clustering technology, text clustering has become an important means of effectively organizing, summarizing, and navigating text information.
At present, existing text clustering is generally implemented on a single machine. Because a single machine's capacity is limited, the amount of text it can cover is small. Moreover, because the intermediate results of clustering are stored on a back-end hard disk during the clustering process, data must be read from the hard disk for every iterative calculation, which lowers the computation rate and in turn reduces the efficiency of text clustering. At the same time, because the cluster-analysis algorithm used is relatively complex, the accuracy of the clustering result cannot be guaranteed when the computation rate is slow.
Therefore, how to improve the efficiency and accuracy of text clustering is a problem to be solved by those skilled in the art.
Summary of the invention
The purpose of the present invention is to provide a text clustering method, apparatus, device, and readable storage medium, so as to improve the efficiency and accuracy of text clustering.
To achieve the above object, the embodiments of the invention provide the following technical solutions:
A text clustering method, applied to a server in a distributed cluster, comprising:
obtaining a target source text to be clustered;
extracting text features from the target source text using the maximum probability method to obtain target data;
reading a preset neural network training model from the server's own cache;
performing cluster analysis on the target data according to the neural network training model and a neural network algorithm, and generating a file family corresponding to the target source text.
Wherein extracting text features from the target source text using the maximum probability method to obtain target data comprises:
preprocessing the target source text, and extracting word segments from the preprocessed target source text, the word segments including: numbers, dates, names, and parts of speech;
extracting the text features from the word segments, and determining by the maximum probability method the text features with the highest probability of occurrence, the text features including: word weight, word frequency, and inverse document frequency.
Wherein generating the neural network training model comprises:
obtaining a target training text, and normalizing the target training text;
performing sparse logistic regression on the normalized target training text based on a random number and a preset connection value and threshold, obtaining a target training set;
iteratively computing the target training set based on the random number, the connection value, and the threshold, generating the neural network training model.
Wherein generating the file family corresponding to the target source text comprises:
generating the file family through the vector space model and the cosine of the spatial angle between vectors.
Wherein, after generating the file family corresponding to the target source text, the method further comprises:
visualizing the file family.
A text clustering apparatus, applied to a server in a distributed cluster, comprising:
an obtaining module for obtaining a target source text to be clustered;
an extraction module for extracting text features from the target source text using the maximum probability method to obtain target data;
a reading module for reading a preset neural network training model from the server's own cache;
a clustering module for performing cluster analysis on the target data according to the neural network training model and a neural network algorithm, and generating a file family corresponding to the target source text.
Wherein the extraction module comprises:
a preprocessing unit for preprocessing the target source text and extracting word segments from the preprocessed target source text, the word segments including: numbers, dates, names, and parts of speech;
an extraction unit for extracting the text features from the word segments and determining by the maximum probability method the text features with the highest probability of occurrence, the text features including: word weight, word frequency, and inverse document frequency.
Wherein the clustering module is specifically configured to:
generate the file family through the vector space model and the cosine of the spatial angle between vectors.
A text clustering device, comprising:
a memory for storing a computer program;
a processor that, when executing the computer program, implements the steps of any one of the text clustering methods described above.
A readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of any one of the text clustering methods described above.
From the above solutions it can be seen that the embodiments of the invention provide a text clustering method applied to a server in a distributed cluster, comprising: obtaining a target source text to be clustered; extracting text features from the target source text using the maximum probability method to obtain target data; reading a preset neural network training model from the server's own cache; performing cluster analysis on the target data according to the neural network training model and a neural network algorithm; and generating a file family corresponding to the target source text.
As can be seen, the method obtains target data by extracting text features from the acquired target source text, and performs cluster analysis on the target data based on the caches of the servers in the distributed cluster, the neural network training model, and the neural network algorithm, thereby generating a file family corresponding to the target source text. Because the method runs on a distributed cluster, the amount of text it can cover is larger, which expands the data volume of text clustering. Moreover, because the neural network training model is stored in the server's cache and the intermediate results generated during clustering are also stored there, data is continually read from the cache during clustering, which improves the data read rate and, in turn, the efficiency of text clustering. At the same time, the scheme uses a neural network algorithm, which improves the accuracy of the clustering result.
If this method is used to cluster enterprise texts, not only can the efficiency of text clustering be improved, but, because enterprise texts are relatively standardized, the accuracy of the resulting clusters will be significantly higher; based on such clustering results, staff can also locate files more easily, improving work efficiency.
Correspondingly, the text clustering apparatus, device, and readable storage medium provided by the embodiments of the invention have the same technical effects.
Detailed description of the invention
To explain the embodiments of the invention or the technical solutions in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a text clustering method disclosed by an embodiment of the invention;
Fig. 2 is a flowchart of another text clustering method disclosed by an embodiment of the invention;
Fig. 3 is a schematic diagram of a text clustering apparatus disclosed by an embodiment of the invention;
Fig. 4 is a schematic diagram of a text clustering device disclosed by an embodiment of the invention.
Specific embodiment
The technical solutions in the embodiments of the invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the invention.
The embodiments of the invention disclose a text clustering method, apparatus, device, and readable storage medium, so as to improve the efficiency and accuracy of text clustering.
Referring to Fig. 1, a text clustering method provided by an embodiment of the invention, applied to a server in a distributed cluster, comprises:
S101: obtaining a target source text to be clustered.
Specifically, the target source text includes various enterprise text files and short network texts.
S102: extracting text features from the target source text using the maximum probability method to obtain target data.
In this embodiment, when the target source text to be clustered is obtained, text features are first extracted from it using the maximum probability method to obtain the target data.
A Chinese character string to be segmented may admit multiple word segmentations. For example, the string "有意见分歧" can be segmented as "有/意见/分歧" ("have / opinion / disagreement") or as "有意/见/分歧" ("intentionally / see / disagreement"), among other possibilities. In such cases, the segmentation with the highest probability is taken as the final word segmentation.
S103: reading a preset neural network training model from the server's own cache.
Specifically, the neural network training model is stored in advance in the cache of each server, and the intermediate results generated during text clustering are likewise temporarily stored in the cache of each server. Intermediate results can therefore be read continuously from the cache, which improves the data read rate and, in turn, the efficiency of text clustering.
S104: performing cluster analysis on the target data according to the neural network training model and the neural network algorithm, and generating a file family corresponding to the target source text.
Preferably, when the number of target source texts to be clustered is one billion, the text clustering method provided in this embodiment can divide the one billion target source texts into multiple file sets and distribute the resulting file sets to the servers in the distributed cluster, so that every server performs cluster analysis on its own file set in parallel, thereby improving the data throughput and processing efficiency of text clustering.
For example, when the distributed cluster has 10 nodes, i.e. 10 servers, the one billion target source texts are divided into 10 file sets of 100 million files each, so that each server handles only 100 million target source texts; since every server in the cluster processes its set in parallel, the processing efficiency of text clustering is greatly improved. Of course, for ease of management, the 10 nodes can be divided into master nodes and slave nodes according to their respective roles; the number of master nodes can be set to 2, so that if one of them fails catastrophically, the other can serve as a standby to cope with any contingency.
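The division into equal file sets can be sketched as follows. This is an illustrative sketch only: the patent specifies just that the files are split evenly across the servers, so the round-robin assignment and the function name `partition_files` are assumptions.

```python
def partition_files(files, num_nodes):
    """Split a list of file identifiers into num_nodes equal-sized sets,
    one per server in the distributed cluster (round-robin assignment)."""
    file_sets = [[] for _ in range(num_nodes)]
    for i, f in enumerate(files):
        file_sets[i % num_nodes].append(f)
    return file_sets

# Stand-in for the one billion target source texts in the example above:
files = ["doc_%d" % i for i in range(1000)]
file_sets = partition_files(files, 10)  # 10 nodes -> 10 sets of 100 files
```

With 10 nodes and an input whose size is a multiple of 10, every server receives exactly one tenth of the files, mirroring the 100-million-per-server figure in the example.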
As can be seen, this embodiment provides a text clustering method. The method obtains target data by extracting text features from the acquired target source text, and performs cluster analysis on the target data based on the caches of the servers in the distributed cluster, the neural network training model, and the neural network algorithm, thereby generating a file family corresponding to the target source text. Because the method runs on a distributed cluster, the amount of text it can cover is larger, which expands the data volume of text clustering. Moreover, because the neural network training model is stored in the server's cache and the intermediate results generated during clustering are also stored there, data is continually read from the cache during clustering, which improves the data read rate and, in turn, the efficiency of text clustering. At the same time, the scheme uses a neural network algorithm, which improves the accuracy of the clustering result. If this method is used to cluster enterprise texts, not only can the efficiency of text clustering be improved, but, because enterprise texts are relatively standardized, the accuracy of the resulting clusters will be significantly higher; based on such clustering results, staff can also locate files more easily, improving work efficiency.
An embodiment of the invention discloses another text clustering method; relative to the previous embodiment, this embodiment further explains and optimizes the technical solution.
Referring to Fig. 2, another text clustering method provided by an embodiment of the invention, applied to a server in a distributed cluster, comprises:
S201: obtaining a target source text to be clustered.
S202: extracting text features from the target source text using the maximum probability method to obtain target data.
In this embodiment, the target data is a text matrix.
S203: reading a preset neural network training model from the server's own cache.
S204: performing cluster analysis on the target data according to the neural network training model and the neural network algorithm.
S205: generating a file family corresponding to the target source text through the vector space model and the cosine of the spatial angle between vectors.
In this embodiment, the file family corresponding to the target source text is generated through the vector space model and the cosine of the spatial angle between vectors.
The basic idea of the vector space model is to reduce a text to an N-dimensional vector whose components are the weights of its feature items (i.e. keywords). The model assumes that words are pairwise uncorrelated and represents a text by a vector, which simplifies the complex relationships between the keywords of a text: the text is expressed as a simple vector, making the model computable. It should be noted that in the vector space model, "text" refers to any machine-readable record.
If a text is denoted by D and a feature item by T, then T represents a basic linguistic unit of the content of text D, consisting mainly of words or phrases. Text D can thus be represented by its set of feature items: D(T1, T2, ..., Tn), where Tk is a feature item and 1 ≤ k ≤ n.
Suppose a text has four feature items a, b, c, d; this text can then be represented as D(a, b, c, d), and any other text compared with it must follow the same ordering of feature items. For a text containing n feature items, each feature item is usually assigned a weight indicating its importance, i.e. D = D(T1, W1; T2, W2; ...; Tn, Wn), abbreviated as D = D(W1, W2, ..., Wn). This is called the weight-vector representation of text D, where Wk is the weight of Tk and 1 ≤ k ≤ n.
Based on the above assumption, if the weights of a, b, c, d are respectively 30, 20, 20, 10, then the vector of the text is D(30, 20, 20, 10). In the vector space model, the content relevance Sim(D1, D2) between two texts D1 and D2 is expressed by the cosine of the angle between their vectors:

Sim(D1, D2) = cos θ = (Σ W1k × W2k) / (sqrt(Σ W1k²) × sqrt(Σ W2k²))

where W1k and W2k respectively denote the weight of the k-th feature item of texts D1 and D2, θ is the spatial angle between vector D1 and vector D2, and the sums run over 1 ≤ k ≤ n.
It should be noted that when text classification is carried out during text clustering, the above method is used to calculate the relevance between the text to be classified and a given category.
Suppose the feature items of text D1 are a, b, c, d with weights 30, 20, 20, 10 respectively, and the feature items of category C1 are a, c, d, e with weights 40, 30, 20, 10 respectively. The vector of D1 is then D1(30, 20, 20, 10, 0) and the vector of C1 is C1(40, 0, 30, 20, 10); the cosine of the angle between vector C1 and vector D1 is 0.86, i.e. the relevance between text D1 and category C1 is 0.86.
The specific calculation is as follows: the modulus of an n-dimensional vector V(v1, v2, v3, ..., vn) is |V| = sqrt(v1·v1 + v2·v2 + ... + vn·vn), and the dot product of two vectors m and n is m·n = m1·n1 + m2·n2 + ... + mn·nn, so the similarity is Sim = (m·n) / (|m| × |n|), whose physical meaning is the cosine of the spatial angle between the two vectors.
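The worked example can be checked with a short script. This is a direct transcription of the modulus and dot-product formulas above, not code from the patent.

```python
import math

def cosine_similarity(m, n):
    """Cosine of the spatial angle between two weight vectors."""
    dot = sum(mi * ni for mi, ni in zip(m, n))
    mod_m = math.sqrt(sum(mi * mi for mi in m))
    mod_n = math.sqrt(sum(ni * ni for ni in n))
    return dot / (mod_m * mod_n)

D1 = [30, 20, 20, 10, 0]   # weights of feature items a, b, c, d, e in text D1
C1 = [40, 0, 30, 20, 10]   # weights of the same feature items in category C1
print(round(cosine_similarity(D1, C1), 2))  # → 0.86
```

The dot product is 2000 and the moduli are sqrt(1800) and sqrt(3000), giving 2000 / 2323.79 ≈ 0.86, matching the relevance value in the example.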
As can be seen, this embodiment provides another text clustering method. The method obtains target data by extracting text features from the acquired target source text, performs cluster analysis on the target data based on the caches of the servers in the distributed cluster, the neural network training model, and the neural network algorithm, and generates a file family corresponding to the target source text through the vector space model and the cosine of the spatial angle between vectors. Because the method runs on a distributed cluster, the amount of text it can cover is larger, which expands the data volume of text clustering. Moreover, because the neural network training model is stored in the server's cache and the intermediate results generated during clustering are also stored there, data is continually read from the cache during clustering, which improves the data read rate and, in turn, the efficiency of text clustering. At the same time, the scheme uses a neural network algorithm, which improves the accuracy of the clustering result.
If this method is used to cluster enterprise texts, not only can the efficiency of text clustering be improved, but, because enterprise texts are relatively standardized, the accuracy of the resulting clusters will be significantly higher; based on such clustering results, staff can also locate files more easily, improving work efficiency.
Based on any of the above embodiments, it should be noted that extracting text features from the target source text using the maximum probability method to obtain target data comprises:
preprocessing the target source text, and extracting word segments from the preprocessed target source text, the word segments including: numbers, dates, names, and parts of speech;
extracting the text features from the word segments, and determining by the maximum probability method the text features with the highest probability of occurrence, the text features including: word weight, word frequency, and inverse document frequency.
Specifically, the extraction of text features can be expressed with the following formula:

P(W|S) = P(S|W) × P(W) / P(S) ≈ P(W)

Therefore P(W) = P(W1, W2, ..., Wi) ≈ P(W1) × P(W2) × ... × P(Wi); that is, P(Wi) equals the frequency n with which Wi appears in the corpus, divided by the total word count N of the corpus. The corpus stores a large volume of sampled and processed text.
Here, P(W|S) is the probability of the word sequence W given the sentence S, and P(S|W) is the probability of the sentence given the word sequence. P(S|W) can be regarded as constantly equal to 1, because under any hypothetical segmentation the generated sentence is always exactly the original (one need only discard the boundary markers between segments), and P(S) is the same under every segmentation, so it does not affect the comparison. Hence maximizing P(W|S) amounts to maximizing P(W).
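The maximum-probability choice between candidate segmentations can be sketched as follows. The corpus counts and the total word count N below are invented for illustration; only the unigram product P(W) ≈ ∏ P(Wi), with P(Wi) = n / N, comes from the text above.

```python
# Invented corpus counts; a real corpus would supply the count n for each word.
corpus_counts = {"有": 180, "意见": 90, "分歧": 40, "有意": 30, "见": 120}
N = 10000  # assumed total word count of the corpus

def segmentation_probability(words):
    """P(W) under the unigram approximation; unseen words get count 1."""
    p = 1.0
    for w in words:
        p *= corpus_counts.get(w, 1) / N
    return p

# The two candidate segmentations from the earlier example:
candidates = [["有", "意见", "分歧"], ["有意", "见", "分歧"]]
best = max(candidates, key=segmentation_probability)
```

Under these counts the first candidate scores 180 × 90 × 40 against 30 × 120 × 40 for the second, so "有/意见/分歧" is selected as the final segmentation.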
When expressing word weights, a text can be represented as a vector in the vector space model. The word weight indicates the contribution of a word to its sentence. For example, in "Most scientists think that butterflies use the position of the sun in the sky as a kind of compass that allows them to determine which way is north", the important words include: butterflies, monarchs, scientists, compass; the unimportant words include: most, think, kind, sky. The word weight is a measure reflecting the importance of each word.
The word frequency is the number of times a word occurs in the sentence and is used in calculating the word weight, so the word weight T is calculated as:

T = tf / doc_length

where tf is the word frequency and doc_length is the length of the character string.
Specifically, the extraction of word weights proceeds as follows: obtain the topic-word candidates among the word segments via the Bayesian formula; obtain the word frequency and position of each topic-word candidate; compute the word weight of each topic-word candidate; and take the topic-word candidate with the largest word weight as the final result. The weight of a topic-word candidate is calculated as weight_i = α × fre_i + e × loc_i, where weight_i is the weight of topic-word candidate i, fre_i is its word-frequency weight factor, loc_i is its position weight factor, α is the word-frequency adjustment factor, and e is the position adjustment factor.
The inverse document frequency reflects the number of texts that contain a given word. In general, the more texts a word appears in, the smaller its contribution to any single text, i.e. the less useful it is for distinguishing between different texts. The inverse document frequency I is calculated as:

I = log(N / df) / log(N)

where N is the number of documents and df is the document frequency of the word; this formula keeps the inverse document frequency in the range [0, 1].
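A minimal sketch of the three features just described: word frequency, a length-normalised word weight, and an inverse document frequency scaled into [0, 1]. Since the source does not reproduce its formulas exactly, T = tf / doc_length and I = log(N/df) / log(N) are assumptions consistent with the surrounding text.

```python
import math

def word_frequency(word, tokens):
    """tf: number of occurrences of word in the token sequence."""
    return tokens.count(word)

def word_weight(word, tokens):
    """Assumed word weight T = tf / doc_length."""
    return word_frequency(word, tokens) / len(tokens)

def inverse_document_frequency(word, documents):
    """Assumed I = log(N/df) / log(N), in [0, 1] for 1 <= df <= N."""
    n_docs = len(documents)
    df = sum(1 for doc in documents if word in doc)
    return math.log(n_docs / df) / math.log(n_docs)

docs = [["word", "a"], ["a", "b"], ["b", "c"], ["word", "d"]]
```

With these documents, a word appearing in 2 of 4 texts gets I = log(2)/log(4) = 0.5, while a word appearing in every text gets 0, matching the intuition that ubiquitous words discriminate least.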
Based on any of the above embodiments, it should be noted that during the analysis of the text data, the intermediate data is stored in the cache so as to improve the efficiency of data reading.
Based on any of the above embodiments, it should be noted that generating the neural network training model comprises:
obtaining a target training text, and normalizing the target training text;
performing sparse logistic regression on the normalized target training text based on a random number and a preset connection value and threshold, obtaining a target training set;
iteratively computing the target training set based on the random number, the connection value, and the threshold, generating the neural network training model.
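The three steps above can be sketched as follows. The patent names "sparse logistic regression" without specifying it, so plain logistic-regression gradient updates from random initial values stand in for it here; the min-max normalisation, learning rate, iteration count, and labels are all assumptions.

```python
import math
import random

def normalize(values):
    """Min-max normalisation of the training values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def train(xs, ys, lr=0.5, iterations=200, seed=0):
    """Iteratively update a connection value w and a threshold b from
    random initial values (gradient steps on the logistic loss)."""
    rng = random.Random(seed)
    w, b = rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)
    for _ in range(iterations):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # logistic unit
            w += lr * (y - p) * x
            b += lr * (y - p)
    return w, b

xs = normalize([2.0, 4.0, 6.0, 8.0])  # normalised training features (assumed)
ys = [0, 0, 1, 1]                     # assumed binary training labels
w, b = train(xs, ys)
```

After training, the learned connection value and threshold separate the low-valued from the high-valued training examples, which is the minimal behaviour the iterative computation step needs to exhibit.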
Based on any of the above embodiments, it should be noted that after generating the file family corresponding to the target source text, the method further comprises:
visualizing the file family.
Based on any of the above embodiments, it should be noted that the text clustering method provided in this embodiment can be used to build a server cluster for text clustering: a distributed cluster of more than 20 nodes (servers) is set up; the servers in the cluster are divided into master servers and slave servers, with the master servers managing the slave servers; every server performs high-performance cluster analysis on the source texts using the neural network algorithm; and the clustering process is realized on the basis of the cache. The distributed cluster uses the Hadoop platform, which improves its compatibility.
A text clustering apparatus provided by an embodiment of the invention is introduced below; the text clustering apparatus described below and the text clustering method described above may be cross-referenced.
Referring to Fig. 3, a text clustering apparatus provided by an embodiment of the invention, applied to a server in a distributed cluster, comprises:
an obtaining module 301 for obtaining a target source text to be clustered;
an extraction module 302 for extracting text features from the target source text using the maximum probability method to obtain target data;
a reading module 303 for reading a preset neural network training model from the server's own cache;
a clustering module 304 for performing cluster analysis on the target data according to the neural network training model and a neural network algorithm, and generating a file family corresponding to the target source text.
Wherein the extraction module comprises:
a preprocessing unit for preprocessing the target source text and extracting word segments from the preprocessed target source text, the word segments including: numbers, dates, names, and parts of speech;
an extraction unit for extracting the text features from the word segments and determining by the maximum probability method the text features with the highest probability of occurrence, the text features including: word weight, word frequency, and inverse document frequency.
Wherein the clustering module is specifically configured to:
generate the file family through the vector space model and the cosine of the spatial angle between vectors.
The apparatus further comprises a neural network training model generation module, the generation module comprising:
an acquisition unit for obtaining a target training text and normalizing the target training text;
a logistic regression unit for performing sparse logistic regression on the normalized target training text based on a random number and a preset connection value and threshold, obtaining a target training set;
a computing unit for iteratively computing the target training set based on the random number, the connection value, and the threshold, generating the neural network training model.
The apparatus further comprises:
a display module for visualizing the file family.
A text clustering device provided by an embodiment of the invention is introduced below; the text clustering device described below and the text clustering method and apparatus described above may be cross-referenced.
Referring to Fig. 4, a text clustering device provided by an embodiment of the invention comprises:
a memory 401 for storing a computer program;
a processor 402 that, when executing the computer program, implements the steps of the text clustering method described in any of the above embodiments.
A readable storage medium provided by an embodiment of the invention is introduced below; the readable storage medium described below and the text clustering method, apparatus, and device described above may be cross-referenced.
A readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the text clustering method described in any of the above embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be realized in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A text clustering method, applied to a server in a distributed cluster, comprising:
obtaining a target source text to be clustered;
extracting text features from the target source text using a maximum probability method to obtain target data;
reading a preset neural network training model from the server's own cache;
performing cluster analysis on the target data according to the neural network training model and a neural network algorithm, and generating a file cluster corresponding to the target source text.
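Outside the claim language, the four claimed steps can be sketched roughly as follows. The term-frequency features, the fixed vocabulary, and the centroid "model" below are illustrative stand-ins invented for the example, not the patented neural network implementation:

```python
import math
from collections import Counter

def extract_features(text, vocab):
    # Step 2 (stand-in): turn a source text into numeric target data
    # using plain term frequencies over a fixed, hypothetical vocabulary.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def nearest_cluster(vector, centroids):
    # Step 4 (stand-in): assign the vector to the closest centroid of a
    # cached "model"; the patent instead uses a trained neural network.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centroids)), key=lambda i: dist(vector, centroids[i]))

vocab = ["cat", "dog", "stock", "bond"]
cached_model = [[2, 2, 0, 0], [0, 0, 2, 2]]  # stands in for the cached model
docs = ["cat and dog", "stock and bond market"]
clusters = [nearest_cluster(extract_features(d, vocab), cached_model) for d in docs]
print(clusters)  # → [0, 1]
```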
2. The text clustering method according to claim 1, wherein extracting the text features from the target source text using the maximum probability method to obtain the target data comprises:
preprocessing the target source text, and extracting text tokens from the preprocessed target source text, the text tokens comprising: numbers, dates, names, and parts of speech;
extracting the text features from the text tokens, and determining, by the maximum probability method, the text features with the highest probability of occurrence, the text features comprising: word weight, term frequency, and inverse document frequency.
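The word weight, term frequency, and inverse document frequency named in the claim are commonly combined as tf-idf. A minimal sketch, with whitespace tokenization standing in for the claimed text segmentation:

```python
import math
from collections import Counter

def tf_idf(docs):
    # tf = term frequency within a document; idf = log(N / document
    # frequency); word weight = tf * idf. Whitespace tokenization here is
    # a stand-in for the claimed segmentation into text tokens.
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({w: (tf[w] / len(toks)) * math.log(n / df[w]) for w in tf})
    return weights

docs = ["the cat sat", "the dog ran", "the cat ran"]
weights = tf_idf(docs)
# "the" appears in every document, so log(3/3) = 0 and its weight is 0;
# "cat" appears in only two documents and keeps a positive weight.
```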
3. The text clustering method according to claim 1, wherein generating the neural network training model comprises:
obtaining a target training text, and normalizing the target training text;
performing sparse logistic regression on the normalized target training text based on random numbers and preset connection weights and thresholds, to obtain a target training set;
iteratively calculating the target training set based on the random numbers, the connection weights, and the thresholds, to generate the neural network training model.
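As a rough illustration of the training steps above (normalization, random initial connection weights and threshold, iterative refinement), here is a plain logistic-regression sketch; the sparsity of the claimed regression is omitted, and the data and hyperparameters are invented for the example:

```python
import math
import random

def normalize(rows):
    # Min-max normalize each feature column to [0, 1].
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(r, lo, hi)] for r in rows]

def train_logistic(rows, labels, epochs=200, lr=0.5, seed=0):
    # Random initial connection weights and threshold (bias), refined by
    # iterative gradient steps over the training set.
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in rows[0]]
    b = rng.uniform(-0.1, 0.1)
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(x, w, b):
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

rows = normalize([[1, 10], [2, 20], [8, 80], [9, 90]])
w, b = train_logistic(rows, labels=[0, 0, 1, 1])
```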
4. The text clustering method according to claim 1, wherein generating the file cluster corresponding to the target source text comprises:
generating the file cluster by means of a vector space model and the cosine of the angle between vectors.
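The cosine referred to in the claim measures the angle between two document vectors in the vector space model; texts whose cosine exceeds a chosen threshold would fall into the same file cluster. A minimal sketch:

```python
import math

def cosine(a, b):
    # Cosine of the angle between two document vectors: 1.0 for vectors
    # pointing the same way, 0.0 for orthogonal ones (no shared terms).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

print(cosine([1, 2, 0], [2, 4, 0]))  # ≈ 1.0: proportional vectors
print(cosine([1, 0], [0, 1]))        # 0.0: no overlap
```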
5. The text clustering method according to any one of claims 1-4, wherein after generating the file cluster corresponding to the target source text, the method further comprises:
visualizing the file cluster.
6. A text clustering apparatus, applied to a server in a distributed cluster, comprising:
an obtaining module, configured to obtain a target source text to be clustered;
an extraction module, configured to extract text features from the target source text using a maximum probability method to obtain target data;
a reading module, configured to read a preset neural network training model from the server's own cache;
a clustering module, configured to perform cluster analysis on the target data according to the neural network training model and a neural network algorithm, and to generate a file cluster corresponding to the target source text.
7. The text clustering apparatus according to claim 6, wherein the extraction module comprises:
a preprocessing unit, configured to preprocess the target source text and extract text tokens from the preprocessed target source text, the text tokens comprising: numbers, dates, names, and parts of speech;
an extraction unit, configured to extract the text features from the text tokens and to determine, by the maximum probability method, the text features with the highest probability of occurrence, the text features comprising: word weight, term frequency, and inverse document frequency.
8. The text clustering apparatus according to claim 6, wherein the clustering module is specifically configured to:
generate the file cluster by means of a vector space model and the cosine of the angle between vectors.
9. A text clustering device, comprising:
a memory, configured to store a computer program;
a processor, configured to implement, when executing the computer program, the steps of the text clustering method according to any one of claims 1-5.
10. A readable storage medium, wherein a computer program is stored on the readable storage medium, and when the computer program is executed by a processor, the steps of the text clustering method according to any one of claims 1-5 are implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810763151.0A CN108846142A (en) | 2018-07-12 | 2018-07-12 | A kind of Text Clustering Method, device, equipment and readable storage medium storing program for executing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108846142A true CN108846142A (en) | 2018-11-20 |
Family
ID=64196999
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810763151.0A Pending CN108846142A (en) | 2018-07-12 | 2018-07-12 | A kind of Text Clustering Method, device, equipment and readable storage medium storing program for executing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108846142A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103136337A (en) * | 2013-02-01 | 2013-06-05 | 北京邮电大学 | Distributed knowledge data mining device and mining method used for complex network |
CN104504150A (en) * | 2015-01-09 | 2015-04-08 | 成都布林特信息技术有限公司 | News public opinion monitoring system |
CN105512723A (en) * | 2016-01-20 | 2016-04-20 | 南京艾溪信息科技有限公司 | Artificial neural network calculating device and method for sparse connection |
CN105550222A (en) * | 2015-12-07 | 2016-05-04 | 中国电子科技网络信息安全有限公司 | Distributed storage-based image service system and method |
CN106886613A (en) * | 2017-05-03 | 2017-06-23 | 成都云数未来信息科学有限公司 | A kind of Text Clustering Method of parallelization |
KR101877243B1 (en) * | 2017-04-25 | 2018-07-11 | 한국과학기술원 | Ap apparatus clustering method using neural network based on reinforcement learning and cooperative communicatin apparatus using neural network based on reinforcement learning |
2018-07-12: CN application CN201810763151.0A filed (published as CN108846142A; status: pending)
Non-Patent Citations (2)
Title |
---|
LIU Shanshan: "Research and Implementation of a Clustering Neural Network Algorithm Based on the Hadoop Cloud Computing Platform", China Masters' Theses Full-text Database, Information Science and Technology Series * |
SONG Jie: "Research on Energy Consumption Optimization Methods for Big Data Processing Platforms", 30 November 2016 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111324737A (en) * | 2020-03-23 | 2020-06-23 | 中国电子科技集团公司第三十研究所 | Bag-of-words model-based distributed text clustering method, storage medium and computing device |
CN111522657A (en) * | 2020-04-14 | 2020-08-11 | 北京航空航天大学 | Distributed equipment collaborative deep learning reasoning method |
CN111522657B (en) * | 2020-04-14 | 2022-07-22 | 北京航空航天大学 | Distributed equipment collaborative deep learning reasoning method |
CN111857097A (en) * | 2020-07-27 | 2020-10-30 | 中国南方电网有限责任公司超高压输电公司昆明局 | Industrial control system abnormity diagnosis information identification method based on word frequency and inverse document frequency |
CN111857097B (en) * | 2020-07-27 | 2023-10-31 | 中国南方电网有限责任公司超高压输电公司昆明局 | Industrial control system abnormality diagnosis information identification method based on word frequency and inverse document frequency |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11227118B2 (en) | Methods, devices, and systems for constructing intelligent knowledge base | |
CN107463605B (en) | Method and device for identifying low-quality news resource, computer equipment and readable medium | |
CN106951422B (en) | Webpage training method and device, and search intention identification method and device | |
CN111191466B (en) | Homonymous author disambiguation method based on network characterization and semantic characterization | |
CN108647322B (en) | Method for identifying similarity of mass Web text information based on word network | |
CN106462604A (en) | Identifying query intent | |
US11886515B2 (en) | Hierarchical clustering on graphs for taxonomy extraction and applications thereof | |
CN108846142A (en) | A kind of Text Clustering Method, device, equipment and readable storage medium storing program for executing | |
Mohotti et al. | Corpus-based augmented media posts with density-based clustering for community detection | |
CN115062621A (en) | Label extraction method and device, electronic equipment and storage medium | |
Perdana et al. | Instance-based deep transfer learning on cross-domain image captioning | |
US11755671B2 (en) | Projecting queries into a content item embedding space | |
CN113128234B (en) | Method and system for establishing entity recognition model, electronic equipment and medium | |
CN114462673A (en) | Methods, systems, computing devices, and readable media for predicting future events | |
CN113962221A (en) | Text abstract extraction method and device, terminal equipment and storage medium | |
CN111625579B (en) | Information processing method, device and system | |
CN113806641A (en) | Deep learning-based recommendation method and device, electronic equipment and storage medium | |
US20170076219A1 (en) | Prediction of future prominence attributes in data set | |
Jo | Automatic text summarization using string vector based K nearest neighbor | |
US20240168999A1 (en) | Hierarchical clustering on graphs for taxonomy extraction and applications thereof | |
CN111476037B (en) | Text processing method and device, computer equipment and storage medium | |
US11893012B1 (en) | Content extraction using related entity group metadata from reference objects | |
Wijaya et al. | Twitter Opinion Mining Analysis of Web-Based Handphone Brand Using Naïve Bayes Classification Method | |
CN102929889B (en) | A kind of method and system for improving community network | |
CN117725555A (en) | Multi-source knowledge tree association fusion method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181120 |