CN108846142A - Text clustering method, apparatus, device, and readable storage medium - Google Patents

Text clustering method, apparatus, device, and readable storage medium

Info

Publication number
CN108846142A
Authority
CN
China
Prior art keywords
text
target
neural network
cluster
target source
Prior art date
2018-07-12
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810763151.0A
Other languages
Chinese (zh)
Inventor
曾广移
李德华
巩宇
卢勇
丁钊
杨小龙
梁莉雪
黄小凤
王晓翼
杨宗强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peak and Frequency Regulation Power Generation Co of China Southern Power Grid Co Ltd
Original Assignee
Peak and Frequency Regulation Power Generation Co of China Southern Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-07-12
Filing date
2018-07-12
Publication date
2018-11-20
2018-07-12 Application filed by Peak and Frequency Regulation Power Generation Co of China Southern Power Grid Co Ltd
2018-07-12 Priority to CN201810763151.0A
2018-11-20 Publication of CN108846142A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text clustering method applied to a server in a distributed cluster, comprising: obtaining target source text to be clustered; extracting text features from the target source text using a maximum-probability method to obtain target data; reading a preset neural network training model from the server's own cache; and performing cluster analysis on the target data according to the neural network training model and a neural network algorithm, generating a file family corresponding to the target source text. Because the method runs on a distributed cluster, and the intermediate results produced by the neural network training model during clustering are stored in the servers' caches, it increases both the volume of text that can be clustered and the efficiency of clustering; at the same time, the neural network algorithm improves the accuracy of the clustering results. Correspondingly, the text clustering apparatus, device, and readable storage medium disclosed by the invention achieve the same technical effects.

Description

Text clustering method, apparatus, device, and readable storage medium
Technical field
The present invention relates to the field of clustering technology, and more specifically to a text clustering method, apparatus, device, and readable storage medium.
Background technique
With the continuing convergence of computer technology and clustering technology, text clustering has become an important means of effectively organizing, summarizing, and navigating textual information.
At present, existing text clustering is generally implemented on a single machine. Because a single machine is limited, the volume of text it can cover is small. Moreover, because the intermediate results of clustering are stored on a back-end hard disk, every iteration of the computation must read data from the hard disk, which slows the computation rate and in turn reduces the efficiency of text clustering. At the same time, because the cluster-analysis algorithm used is relatively complex, the accuracy of the clustering results cannot be guaranteed when the computation rate is slow.
Therefore, how to improve the efficiency and accuracy of text clustering is a problem to be solved by those skilled in the art.
Summary of the invention
The purpose of the present invention is to provide a text clustering method, apparatus, device, and readable storage medium, so as to improve the efficiency and accuracy of text clustering.
To achieve the above object, the embodiments of the invention provide the following technical solutions:
A text clustering method, applied to a server in a distributed cluster, comprising:
obtaining target source text to be clustered;
extracting text features from the target source text using a maximum-probability method to obtain target data;
reading a preset neural network training model from the server's own cache; and
performing cluster analysis on the target data according to the neural network training model and a neural network algorithm, and generating a file family corresponding to the target source text.
Wherein extracting the text features from the target source text using the maximum-probability method to obtain the target data comprises:
preprocessing the target source text, and extracting word segments from the preprocessed target source text, the word segments including numbers, dates, names, and parts of speech; and
extracting the text features from the word segments, and determining by the maximum-probability method the text features with the highest probability of occurrence, the text features including word weight, word frequency, and inverse document frequency.
Wherein the generation of the neural network training model comprises:
obtaining target training text, and normalizing the target training text;
performing sparse logistic regression on the normalized target training text based on random numbers and preset connection values and thresholds, to obtain a target training set; and
iteratively computing the target training set based on the random numbers, connection values, and thresholds, to generate the neural network training model.
Wherein generating the file family corresponding to the target source text comprises:
generating the file family by means of a vector space model and the cosine of the angle between vectors.
Wherein, after generating the file family corresponding to the target source text, the method further comprises:
displaying the file family visually.
A text clustering apparatus, applied to a server in a distributed cluster, comprising:
an obtaining module, for obtaining target source text to be clustered;
an extraction module, for extracting text features from the target source text using a maximum-probability method to obtain target data;
a reading module, for reading a preset neural network training model from the server's own cache; and
a clustering module, for performing cluster analysis on the target data according to the neural network training model and a neural network algorithm, and generating a file family corresponding to the target source text.
Wherein the extraction module comprises:
a preprocessing unit, for preprocessing the target source text and extracting word segments from the preprocessed target source text, the word segments including numbers, dates, names, and parts of speech; and
an extraction unit, for extracting the text features from the word segments and determining by the maximum-probability method the text features with the highest probability of occurrence, the text features including word weight, word frequency, and inverse document frequency.
Wherein the clustering module is specifically configured to:
generate the file family by means of a vector space model and the cosine of the angle between vectors.
A text clustering device, comprising:
a memory, for storing a computer program; and
a processor which, when executing the computer program, implements the steps of any of the text clustering methods described above.
A readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of any of the text clustering methods described above.
From the above solutions it can be seen that the embodiments of the invention provide a text clustering method applied to a server in a distributed cluster, comprising: obtaining target source text to be clustered; extracting text features from the target source text using a maximum-probability method to obtain target data; reading a preset neural network training model from the server's own cache; and performing cluster analysis on the target data according to the neural network training model and a neural network algorithm to generate a file family corresponding to the target source text.
It can be seen that the method obtains target data by extracting text features from the acquired target source text, and performs cluster analysis on the target data based on the caches of the servers in the distributed cluster, the neural network training model, and the neural network algorithm, thereby generating a file family corresponding to the target source text. Because the method runs on a distributed cluster, the volume of text it can cover is large, which expands the data volume of text clustering. Moreover, because the neural network training model is stored in the server's cache, and the intermediate results produced during clustering are also stored in the server's cache, data are read from the cache throughout the clustering process, which improves the data read rate and in turn the efficiency of text clustering. At the same time, the scheme uses a neural network algorithm, which improves the accuracy of the clustering results.
If this method is used to cluster enterprise texts, not only can the efficiency of text clustering be improved but, because enterprise texts are relatively standardized, the accuracy of the resulting clusters increases significantly; based on such clustering results, staff can also locate files more easily, improving work efficiency.
Correspondingly, the text clustering apparatus, device, and readable storage medium provided by the embodiments of the invention likewise have the above technical effects.
Detailed description of the invention
In order to explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a flowchart of a text clustering method disclosed by an embodiment of the invention;
Fig. 2 is a flowchart of another text clustering method disclosed by an embodiment of the invention;
Fig. 3 is a schematic diagram of a text clustering apparatus disclosed by an embodiment of the invention;
Fig. 4 is a schematic diagram of a text clustering device disclosed by an embodiment of the invention.
Specific embodiment
The technical solutions in the embodiments of the invention will now be described clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the invention.
The embodiments of the invention disclose a text clustering method, apparatus, device, and readable storage medium, so as to improve the efficiency and accuracy of text clustering.
Referring to Fig. 1, a text clustering method provided by an embodiment of the invention, applied to a server in a distributed cluster, comprises:
S101: obtain target source text to be clustered;
Specifically, the target source text includes various enterprise text files and short web texts.
S102: extract text features from the target source text using the maximum-probability method to obtain target data;
In this embodiment, when the target source text to be clustered is obtained, the text features in the target source text are first extracted using the maximum-probability method, so as to obtain the target data.
Note that a character string to be segmented may admit several segmentations. For example, the phrase 有意见分歧 ("there is a difference of opinion") can be segmented as 有/意见/分歧 ("there is / opinion / divergence"), as 有意/见/分歧 ("intentionally / see / divergence"), or in still other ways. In such cases, the segmentation with the highest probability is taken as the final segmentation.
S103: read the preset neural network training model from the server's own cache;
Specifically, the neural network training model is pre-stored in the cache of each server, and the intermediate results produced during text clustering are likewise temporarily stored in the cache of each server. Intermediate results can therefore be read continuously from the cache, which improves the data read rate and in turn the efficiency of text clustering.
S104: perform cluster analysis on the target data according to the neural network training model and the neural network algorithm, and generate a file family corresponding to the target source text.
Preferably, when the number of target source texts to be clustered is one billion, the text clustering method provided in this embodiment can divide the one billion target source texts into multiple file sets and distribute the resulting file sets to the servers in the distributed cluster, so that every server performs cluster analysis on its file set in parallel, thereby increasing both the data throughput and the processing efficiency of text clustering.
For example, when there are 10 nodes in the distributed cluster, i.e. 10 servers, the one billion target source texts are divided into 10 file sets, each containing 100 million target files. Each server then handles only 100 million target source texts, and all servers in the distributed cluster process them in parallel, which greatly improves the processing efficiency of text clustering. Of course, for ease of management, the 10 nodes can be divided into master nodes and slave nodes according to their respective duties; the number of master nodes can be set to 2, so that if one fails suddenly the other can act as a standby for emergencies.
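For illustration only, a minimal sketch of such a partition (not from the patent; the hash-based assignment and synthetic file IDs are assumptions):

```python
from collections import defaultdict

def partition_files(file_ids, num_nodes):
    """Split the corpus into num_nodes file sets by hashing each file ID,
    so every server in the distributed cluster gets a roughly equal share."""
    file_sets = defaultdict(list)
    for fid in file_ids:
        file_sets[hash(fid) % num_nodes].append(fid)
    return file_sets

# 10 nodes, as in the example above; each file set is then clustered in parallel.
sets_by_node = partition_files([f"doc_{i}" for i in range(100_000)], num_nodes=10)
print({node: len(files) for node, files in sorted(sets_by_node.items())})
```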
It can be seen that this embodiment provides a text clustering method that obtains target data by extracting text features from the acquired target source text, and performs cluster analysis on the target data based on the caches of the servers in the distributed cluster, the neural network training model, and the neural network algorithm, so as to generate a file family corresponding to the target source text. Because the method runs on a distributed cluster, the volume of text it can cover is large, which expands the data volume of text clustering. Moreover, because the neural network training model is stored in the server's cache, and the intermediate results produced during clustering are also stored there, data are read continuously from the cache during clustering, improving the data read rate and in turn the efficiency of text clustering; at the same time, the scheme's neural network algorithm improves the accuracy of the clustering results. If this method is used to cluster enterprise texts, not only can clustering efficiency be improved but, because enterprise texts are relatively standardized, the accuracy of the resulting clusters increases significantly; with such clustering results, staff can also locate files more easily, improving work efficiency.
An embodiment of the invention discloses another text clustering method; relative to the previous embodiment, this embodiment further explains and optimizes the technical solution.
Referring to Fig. 2, another text clustering method provided by an embodiment of the invention, applied to a server in a distributed cluster, comprises:
S201: obtain target source text to be clustered;
S202: extract text features from the target source text using the maximum-probability method to obtain target data;
In this embodiment, the target data is a text matrix.
S203: read the preset neural network training model from the server's own cache;
S204: perform cluster analysis on the target data according to the neural network training model and the neural network algorithm;
S205: generate the file family corresponding to the target source text by means of the vector space model and the cosine of the angle between vectors.
In this embodiment, the file family corresponding to the target source text is generated using the vector space model and the cosine of the angle between vectors.
The basic idea of the vector space model is to reduce a text to an N-dimensional vector whose components are the weights of its feature terms (i.e., keywords). The model assumes that words are mutually uncorrelated, so that a text can be represented by a vector. This simplifies the complex relationships among the keywords in a text, representing the text as a very simple vector and making the model computable. It should be noted that in the vector space model, "text" refers to any machine-readable record.
Let D denote a text and T a feature term; then T represents a basic linguistic unit of the content of text D, consisting mainly of words or phrases. Text D can be represented by its set of feature terms D(T1, T2, ..., Tn), where Tk is a feature term and 1 ≤ k ≤ n.
Suppose a text has four feature terms a, b, c, and d; the text can then be represented as D(a, b, c, d), and any other text compared with it must follow the same ordering of feature terms. For a text containing n feature terms, each feature term is usually assigned a weight indicating its importance, i.e., D = D(T1, W1; T2, W2; ...; Tn, Wn), abbreviated as D = D(W1, W2, ..., Wn). This is called the weight-vector representation of text D, where Wk is the weight of Tk and 1 ≤ k ≤ n.
Based on the above assumption, if the weights of a, b, c, and d are 30, 20, 20, and 10 respectively, the vector of the text is D(30, 20, 20, 10). In the vector space model, the content relevance Sim(D1, D2) between two texts D1 and D2 is expressed by the cosine of the angle between their vectors:

$$\mathrm{Sim}(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} W_{1k} W_{2k}}{\sqrt{\sum_{k=1}^{n} W_{1k}^{2}}\,\sqrt{\sum_{k=1}^{n} W_{2k}^{2}}}$$

where W1k and W2k are the weights of the k-th feature term of texts D1 and D2 respectively, θ is the angle between vectors D1 and D2, and 1 ≤ k ≤ n.
It should be noted that when texts are classified during text clustering, the above method is used to calculate the relevance between the text to be classified and a given category.
Suppose the feature terms of text D1 are a, b, c, and d with weights 30, 20, 20, and 10, and the feature terms of category C1 are a, c, d, and e with weights 40, 30, 20, and 10. Over the combined term order (a, b, c, d, e), the vector of D1 is D1(30, 20, 20, 10, 0) and the vector of C1 is C1(40, 0, 30, 20, 10); the cosine of the angle between C1 and D1 is then 0.86, i.e., the relevance between text D1 and category C1 is 0.86.
The specific computation is as follows. The modulus of an n-dimensional vector V(v1, v2, ..., vn) is |V| = sqrt(v1·v1 + v2·v2 + ... + vn·vn); the dot product of two vectors m and n is m·n = m1·n1 + m2·n2 + ... + mn·nn; and the similarity is sim = (m·n)/(|m|·|n|), whose physical meaning is the cosine of the angle between the two vectors.
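A minimal sketch of this computation (the function name is an assumption), reproducing the worked example above:

```python
import math

def cosine_similarity(m, n):
    """Cosine of the angle between two weight vectors: (m . n) / (|m| * |n|)."""
    dot = sum(mi * ni for mi, ni in zip(m, n))
    mod_m = math.sqrt(sum(mi * mi for mi in m))
    mod_n = math.sqrt(sum(ni * ni for ni in n))
    return dot / (mod_m * mod_n)

# Worked example from the description: D1 and category C1 over terms (a, b, c, d, e).
d1 = [30, 20, 20, 10, 0]
c1 = [40, 0, 30, 20, 10]
print(round(cosine_similarity(d1, c1), 2))  # 0.86
```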
It can be seen that this embodiment provides another text clustering method that obtains target data by extracting text features from the acquired target source text, performs cluster analysis on the target data based on the caches of the servers in the distributed cluster, the neural network training model, and the neural network algorithm, and generates the file family corresponding to the target source text by means of the vector space model and the cosine of the angle between vectors. Because the method runs on a distributed cluster, the volume of text it can cover is large, expanding the data volume of text clustering; because the neural network training model and the intermediate results produced during clustering are stored in the server's cache, data are read continuously from the cache, improving the data read rate and in turn the efficiency of text clustering; and the neural network algorithm improves the accuracy of the clustering results.
If this method is used to cluster enterprise texts, not only can the efficiency of text clustering be improved but, because enterprise texts are relatively standardized, the accuracy of the resulting clusters increases significantly; based on such clustering results, staff can also locate files more easily, improving work efficiency.
Based on any of the above embodiments, it should be noted that extracting the text features from the target source text using the maximum-probability method to obtain the target data comprises:
preprocessing the target source text, and extracting word segments from the preprocessed target source text, the word segments including numbers, dates, names, and parts of speech; and
extracting the text features from the word segments, and determining by the maximum-probability method the text features with the highest probability of occurrence, the text features including word weight, word frequency, and inverse document frequency.
Specifically, the feature-extraction (segmentation) process can be expressed by the following Bayesian formula, where S is the character string to be segmented and W a candidate segmentation:

$$P(W \mid S) = \frac{P(S \mid W)\, P(W)}{P(S)}$$

Therefore P(W) = P(W1, W2, ..., Wi) ≈ P(W1) × P(W2) × ... × P(Wi), where P(Wi) equals the quotient n/N of the frequency n with which Wi occurs in the corpus and the total word count N of the corpus. The corpus stores a large volume of sampled and processed text.
Here P(W|S) is the probability of segmentation W given the string S, and P(S|W) is the probability of the string given its words. P(S|W) can be treated as identically equal to 1, because the sentence generated under any hypothetical segmentation always reproduces the segmentation result exactly (only the boundary markers between segments need to be discarded); and P(S) is the same under every segmentation, so it does not affect the comparison. Hence P(W|S) is in effect proportional to P(W), and maximizing it reduces to maximizing P(W).
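As a sketch only, the following applies this maximum-probability model to the earlier example 有意见分歧; the toy probability table P(Wi) = n/N is an assumption, since the patent publishes no corpus counts:

```python
import math

# Hypothetical unigram probabilities P(Wi) = n / N from an assumed toy corpus.
P = {"有": 0.018, "有意": 0.0005, "意见": 0.001, "见": 0.002, "分歧": 0.0008}

def max_prob_segment(s, prob, max_len=4):
    """best[i] holds the highest log P(W) and its segmentation for the prefix s[:i]."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(s)
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_len), i):
            word = s[j:i]
            if word in prob and best[j][1] is not None:
                score = best[j][0] + math.log(prob[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[len(s)][1]

print("/".join(max_prob_segment("有意见分歧", P)))  # 有/意见/分歧 beats 有意/见/分歧
```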
When expressing word weights, a text can be represented as a vector in the vector space model. The word weight indicates the contribution of a word to its sentence. For example, in "Most scientists think that butterflies use the position of the sun in the sky as a kind of compass that allows them to determine which way is north", the important words include butterflies, monarchs, scientists, and compass, while the unimportant words include most, think, kind, and sky; the word weight is precisely a measure of each word's importance.
The word frequency is the number of times a word occurs in a sentence and is used to compute the word weight. The word weight T is therefore computed as

$$T = \frac{t_f}{\mathrm{doc\_length}}$$

where t_f is the word frequency and doc_length is the length of the character string.
Specifically, the method for extracting word weights is: obtain topic-word candidates from the word segments by the Bayesian formula; obtain the word frequency and position of each topic-word candidate; compute the word weight of each topic-word candidate; and take the candidate with the largest word weight as the final topic word. The weight of a topic-word candidate is computed as weight_i = α × fre_i + e × loc_i, where weight_i is the weight of candidate i, fre_i is the word-frequency weight factor, loc_i is the position weight factor, α is the word-frequency adjustment factor, and e is the position adjustment factor.
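A minimal sketch of this scoring (the values of the adjustment factors α and e and the candidate data are assumptions; the patent does not fix them):

```python
def candidate_weight(fre, loc, alpha=0.7, e=0.3):
    """weight_i = alpha * fre_i + e * loc_i, per the formula above."""
    return alpha * fre + e * loc

# Hypothetical topic-word candidates as (word, frequency factor, position factor).
candidates = [("butterflies", 0.6, 0.9), ("sky", 0.4, 0.2), ("compass", 0.5, 0.7)]
topic_word = max(candidates, key=lambda c: candidate_weight(c[1], c[2]))
print(topic_word[0])  # the candidate with the largest weight becomes the topic word
```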
The inverse document frequency reflects the number of texts that contain a given word. In general, the more texts a word occurs in, the smaller the word's contribution to any single text, i.e., the less discriminative the word is for distinguishing different texts. Consistent with the variables named below and the stated range, the inverse document frequency I can be written as

$$I = \frac{\log(N / d_f)}{\log N}$$

where N is the number of texts, d_f is the document frequency (the number of texts in which the word segment occurs), and the formula keeps the inverse document frequency within [0, 1].
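Putting the three features together, a sketch under the assumptions above (the normalized inverse document frequency is reconstructed as log(N/df)/log N, and the toy corpus is invented):

```python
import math

def word_frequency(word, doc):
    """tf: the number of times the word occurs in the document (a list of segments)."""
    return doc.count(word)

def word_weight(word, doc):
    """T = tf / doc_length; document length is counted in segments here for simplicity."""
    return word_frequency(word, doc) / len(doc)

def inverse_document_frequency(word, corpus):
    """Normalized IDF kept within [0, 1]: log(N / df) / log N."""
    n = len(corpus)
    df = sum(1 for doc in corpus if word in doc)
    return math.log(n / df) / math.log(n) if df else 1.0

corpus = [["意见", "分歧"], ["意见", "一致"], ["分歧", "调解"], ["报告", "总结"]]
print(word_weight("意见", corpus[0]))              # 1 / 2 = 0.5
print(inverse_document_frequency("意见", corpus))  # log(4/2)/log(4) = 0.5
```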
Based on any of the above embodiments, it should be noted that during the analysis of text data, the intermediate data are stored in the cache in order to improve the read efficiency of the data.
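A minimal sketch of such cache-backed iteration (illustrative only; an in-process dictionary stands in for the server's cache, which is an assumption):

```python
class IntermediateCache:
    """Holds intermediate clustering results in memory so that each iteration
    reads them back from the cache instead of from a back-end hard disk."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, key, compute):
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]

cache = IntermediateCache()
model = cache.get_or_compute("nn_model", lambda: {"connections": [], "threshold": 0.5})
step0 = cache.get_or_compute("iteration_0", lambda: [[0.1, 0.2], [0.5, 0.4]])
```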
Based on any of the above embodiments, it should be noted that the generation of the neural network training model comprises the following steps (a sketch follows the list):
obtaining target training text, and normalizing the target training text;
performing sparse logistic regression on the normalized target training text based on random numbers and preset connection values and thresholds, to obtain a target training set; and
iteratively computing the target training set based on the random numbers, connection values, and thresholds, to generate the neural network training model.
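As a sketch only: the patent names the steps but no solver, so scikit-learn's L1-penalized LogisticRegression stands in for the "sparse logistic regression", and the synthetic matrix, labels, and hyperparameters are all assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)  # the "random number" of the description

# Hypothetical target training text as a matrix: rows are texts, columns are features.
X = rng.random((200, 20))
y = rng.integers(0, 2, size=200)  # assumed labels for the regression step

# Step 1: normalize the target training text (min-max scaling to [0, 1]).
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)

# Step 2: sparse (L1-penalized) logistic regression on the normalized text.
reg = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
reg.fit(X, y)

# Step 3: the sparse coefficients and intercept play the role of the connection
# values and threshold that iterative calculation refines into the training model.
training_model = {"connections": reg.coef_, "threshold": reg.intercept_}
```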
Based on any of the above embodiments, it should be noted that after the file family corresponding to the target source text is generated, the method further comprises:
displaying the file family visually.
Based on any of the above embodiments, it should be noted that the text clustering method provided in this embodiment can be used to build a future server cluster for text clustering: set up a distributed cluster of more than 20 nodes (servers), and divide the servers in the cluster into a master server and slave servers, with the master server managing the slave servers. Each server uses the neural network algorithm to perform high-performance cluster analysis on the source texts, and the clustering process is realized on the basis of caching. The distributed cluster uses the Hadoop platform, to improve the compatibility of the distributed cluster.
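A minimal sketch of the master/standby arrangement just described (illustrative only; the role-assignment logic and node names are assumptions, and a real deployment would sit on the Hadoop platform):

```python
def assign_roles(nodes, num_masters=2):
    """The first num_masters nodes become masters (one active, one standby);
    the rest are slaves that run the neural-network clustering in parallel."""
    masters, slaves = nodes[:num_masters], nodes[num_masters:]
    return {"active_master": masters[0], "standby_master": masters[1], "slaves": slaves}

cluster = assign_roles([f"node-{i:02d}" for i in range(21)])  # more than 20 nodes
print(cluster["active_master"], cluster["standby_master"], len(cluster["slaves"]))
```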
A text clustering apparatus provided by an embodiment of the invention is introduced below; the text clustering apparatus described below and the text clustering method described above may be referred to each other.
Referring to Fig. 3, a text clustering apparatus provided by an embodiment of the invention, applied to a server in a distributed cluster, comprises:
an obtaining module 301, for obtaining target source text to be clustered;
an extraction module 302, for extracting text features from the target source text using the maximum-probability method to obtain target data;
a reading module 303, for reading a preset neural network training model from the server's own cache; and
a clustering module 304, for performing cluster analysis on the target data according to the neural network training model and the neural network algorithm, and generating a file family corresponding to the target source text.
Wherein the extraction module comprises:
a preprocessing unit, for preprocessing the target source text and extracting word segments from the preprocessed target source text, the word segments including numbers, dates, names, and parts of speech; and
an extraction unit, for extracting the text features from the word segments and determining by the maximum-probability method the text features with the highest probability of occurrence, the text features including word weight, word frequency, and inverse document frequency.
Wherein the clustering module is specifically configured to:
generate the file family by means of the vector space model and the cosine of the angle between vectors.
The apparatus further comprises a generation module for the neural network training model, the generation module comprising:
an acquisition unit, for obtaining target training text and normalizing the target training text;
a logistic-regression unit, for performing sparse logistic regression on the normalized target training text based on random numbers and preset connection values and thresholds, to obtain a target training set; and
a computation unit, for iteratively computing the target training set based on the random numbers, connection values, and thresholds, to generate the neural network training model.
The apparatus further comprises:
a display module, for displaying the file family visually.
A text clustering device provided by an embodiment of the invention is introduced below; the text clustering device described below and the text clustering method and apparatus described above may be referred to each other.
Referring to Fig. 4, a text clustering device provided by an embodiment of the invention comprises:
a memory 401, for storing a computer program; and
a processor 402 which, when executing the computer program, implements the steps of the text clustering method described in any of the above embodiments.
A readable storage medium provided by an embodiment of the invention is introduced below; the readable storage medium described below and the text clustering method, apparatus, and device described above may be referred to each other.
A readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the text clustering method described in any of the above embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to each other.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be realized in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A text clustering method, characterized in that it is applied to a server in a distributed cluster and comprises:
obtaining target source text to be clustered;
extracting text features from the target source text using a maximum-probability method to obtain target data;
reading a preset neural network training model from the server's own cache; and
performing cluster analysis on the target data according to the neural network training model and a neural network algorithm, and generating a file family corresponding to the target source text.
2. The text clustering method according to claim 1, characterized in that extracting the text features from the target source text using the maximum-probability method to obtain the target data comprises:
preprocessing the target source text, and extracting word segments from the preprocessed target source text, the word segments including numbers, dates, names, and parts of speech; and
extracting the text features from the word segments, and determining by the maximum-probability method the text features with the highest probability of occurrence, the text features including word weight, word frequency, and inverse document frequency.
3. The text clustering method according to claim 1, characterized in that the generation of the neural network training model comprises:
obtaining target training text, and normalizing the target training text;
performing sparse logistic regression on the normalized target training text based on random numbers and preset connection values and thresholds, to obtain a target training set; and
iteratively computing the target training set based on the random numbers, connection values, and thresholds, to generate the neural network training model.
4. The text clustering method according to claim 1, characterized in that generating the file family corresponding to the target source text comprises:
generating the file family by means of a vector space model and the cosine of the angle between vectors.
5. The text clustering method according to any one of claims 1-4, characterized in that, after generating the file family corresponding to the target source text, the method further comprises:
displaying the file family visually.
6. A text clustering apparatus, characterized in that it is applied to a server in a distributed cluster and comprises:
an obtaining module, for obtaining target source text to be clustered;
an extraction module, for extracting text features from the target source text using a maximum-probability method to obtain target data;
a reading module, for reading a preset neural network training model from the server's own cache; and
a clustering module, for performing cluster analysis on the target data according to the neural network training model and a neural network algorithm, and generating a file family corresponding to the target source text.
7. The text clustering apparatus according to claim 6, characterized in that the extraction module comprises:
a preprocessing unit, for preprocessing the target source text and extracting word segments from the preprocessed target source text, the word segments including numbers, dates, names, and parts of speech; and
an extraction unit, for extracting the text features from the word segments and determining by the maximum-probability method the text features with the highest probability of occurrence, the text features including word weight, word frequency, and inverse document frequency.
8. The text clustering apparatus according to claim 6, characterized in that the clustering module is specifically configured to:
generate the file family by means of a vector space model and the cosine of the angle between vectors.
9. A text clustering device, characterized by comprising:
a memory, for storing a computer program; and
a processor which, when executing the computer program, implements the steps of the text clustering method according to any one of claims 1-5.
10. A readable storage medium, characterized in that a computer program is stored on the readable storage medium, and the computer program, when executed by a processor, implements the steps of the text clustering method according to any one of claims 1-5.
CN201810763151.0A 2018-07-12 2018-07-12 Text clustering method, apparatus, device, and readable storage medium Pending CN108846142A (en)

Priority Applications (1)

Application Number: CN201810763151.0A; Priority Date: 2018-07-12; Filing Date: 2018-07-12; Title: Text clustering method, apparatus, device, and readable storage medium

Publications (1)

Publication Number: CN108846142A
Publication Date: 2018-11-20

Family

ID=64196999

Family Applications (1)

Application Number: CN201810763151.0A (Pending); Priority Date: 2018-07-12; Filing Date: 2018-07-12; Title: Text clustering method, apparatus, device, and readable storage medium

Country Status (1)

Country: CN
Publication: CN108846142A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136337A (en) * 2013-02-01 2013-06-05 北京邮电大学 Distributed knowledge data mining device and mining method used for complex network
CN104504150A (en) * 2015-01-09 2015-04-08 成都布林特信息技术有限公司 News public opinion monitoring system
CN105550222A (en) * 2015-12-07 2016-05-04 中国电子科技网络信息安全有限公司 Distributed storage-based image service system and method
CN105512723A (en) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 Artificial neural network calculating device and method for sparse connection
KR101877243B1 (en) * 2017-04-25 2018-07-11 한국과학기술원 Ap apparatus clustering method using neural network based on reinforcement learning and cooperative communicatin apparatus using neural network based on reinforcement learning
CN106886613A (en) * 2017-05-03 2017-06-23 成都云数未来信息科学有限公司 A kind of Text Clustering Method of parallelization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘珊珊 (Liu Shanshan): "Research and Implementation of a Clustering Neural Network Algorithm Based on the Hadoop Cloud Computing Platform" (基于云计算平台Hadoop的聚类神经网络算法的研究与实现), China Masters' Theses Full-text Database, Information Science and Technology Series *
宋杰 (Song Jie): "Research on Energy Consumption Optimization Methods for Big Data Processing Platforms" (大数据处理平台能耗优化方法的研究), 30 November 2016 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111324737A (en) * 2020-03-23 2020-06-23 中国电子科技集团公司第三十研究所 Bag-of-words model-based distributed text clustering method, storage medium and computing device
CN111522657A (en) * 2020-04-14 2020-08-11 北京航空航天大学 Distributed equipment collaborative deep learning reasoning method
CN111522657B (en) * 2020-04-14 2022-07-22 北京航空航天大学 Distributed equipment collaborative deep learning reasoning method
CN111857097A (en) * 2020-07-27 2020-10-30 中国南方电网有限责任公司超高压输电公司昆明局 Industrial control system abnormity diagnosis information identification method based on word frequency and inverse document frequency
CN111857097B (en) * 2020-07-27 2023-10-31 中国南方电网有限责任公司超高压输电公司昆明局 Industrial control system abnormality diagnosis information identification method based on word frequency and inverse document frequency


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2018-11-20)