CN107609102A - A short text online clustering method - Google Patents
A short text online clustering method
- Publication number
- CN107609102A CN107609102A CN201710816052.XA CN201710816052A CN107609102A CN 107609102 A CN107609102 A CN 107609102A CN 201710816052 A CN201710816052 A CN 201710816052A CN 107609102 A CN107609102 A CN 107609102A
- Authority
- CN
- China
- Prior art keywords
- class
- similarity
- short text
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a short text online clustering method. To address the low accuracy of existing online clustering methods, the application applies an improved incremental short-text clustering procedure in which the similarity threshold varies dynamically with the number of social short texts contained in each class, increasing the flexibility of clustering; it incorporates short-text semantic similarity to further refine the incremental clusters; and it introduces re-clustering, class merging, and class pruning to address the class-center drift inherent to online clustering and the poor aggregation behavior of short texts.
Description
Technical field
The invention belongs to the field of data mining, and in particular to short-text clustering techniques.
Background technology
With the arrival of the Web 2.0 era, microblogging services have attracted growing attention and large numbers of users. Twitter and Sina Weibo are among the most successful examples; statistics indicate that some 400 million short-text messages are generated on Twitter every day. Analyzing this content yields much valuable information, which helps organizations such as companies and government departments make important decisions. Short-text online clustering is an important means of extracting valuable information from these social short texts in real time.
There are many text clustering algorithms, falling mainly into non-incremental and incremental text clustering. Non-incremental clustering must process all data in a single batch; it cannot accept data in installments or produce an overall result incrementally. Incremental clustering allows data to arrive in multiple batches, and an incremental method must be able to continue clustering new data on top of the previous result. Non-incremental text clustering includes hierarchical, partition-based, density-based, grid-based, and model-based clustering, as well as methods such as self-organizing neural networks and ant-colony clustering. All of these can be used for text clustering, but hierarchical and partition-based clustering are the most common in text clustering tasks. Typical incremental text clustering methods fall into two kinds: algorithms obtained by converting a traditional non-incremental algorithm into an incremental one through extra computation, and single-pass greedy algorithms, the kind most often used in modern incremental text clustering, which process each data item only once.
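The single-pass greedy procedure can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the fixed `threshold`, the sparse-dict vectors, and the toy documents are assumptions made for the example, and the application itself later replaces the fixed threshold with a dynamic one.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(texts, threshold=0.5):
    """Assign each text vector to the most similar existing cluster,
    or open a new cluster if no similarity reaches the threshold."""
    centers, members = [], []            # running center vectors and member index lists
    for i, vec in enumerate(texts):
        sims = [cosine(vec, c) for c in centers]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            members[best].append(i)
            for t, w in vec.items():     # center = accumulated word-frequency vector
                centers[best][t] = centers[best].get(t, 0.0) + w
        else:
            centers.append(dict(vec))    # open a new cluster for this text
            members.append([i])
    return members

docs = [{"storm": 1, "city": 1}, {"storm": 1, "rain": 1}, {"football": 1, "match": 1}]
clusters = single_pass(docs, threshold=0.3)
```

Each document is seen exactly once, which is what makes the method suitable for streams but also what causes the order sensitivity and center drift discussed later.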
Hierarchical clustering divides into top-down divisive and bottom-up agglomerative methods. The divisive method initially places all data points in one class and splits classes level by level according to some distance criterion until a termination condition is met or every data point stands alone. The agglomerative method initially treats each data point as its own class and merges classes level by level according to some similarity criterion until all points belong to one class. The advantage of hierarchical clustering is a clear data organization whose granularity can be adjusted via a threshold or cluster entropy; its drawback is comparatively high computational cost, with time and space complexity that grow at least quadratically in the number of data points. Representative algorithms include BIRCH, CURE, and CHAMELEON.
The most typical partition-based algorithm is k-means. Its steps are: randomly select k initial class centers; assign each data point to the nearest center according to its distance to each class center; recompute the centroid of the points in each class as the new class center; and repeat the assignment and center computation until the result is stable or a fixed number of iterations is reached. K-means essentially combines a greedy strategy with the EM algorithm. Its drawbacks are also apparent: the number of classes must be specified but is usually hard to determine, the choice of initial centers affects the clustering result, and both partition-based and hierarchical clustering are highly sensitive to outlier data points.
The Self-Organizing Map (SOM) neural network is an unsupervised neural network algorithm proposed by Kohonen et al. The network consists of an input layer and a competition layer, and each competition-layer neuron is initially assigned a small random weight vector. A document vector, a vector of fixed dimension, is then selected from the training set and fed to the input; the similarity between the input vector and each neuron is computed, the most similar neuron wins, and the weights of the winning neuron and its neighbors are adjusted to make the winner more like the input. Through training on many texts, the output-layer neurons come to recognize distinct patterns in the data. The method is insensitive to noise points and yields high-quality clusters, but its time complexity is higher than that of k-means. The output-layer neurons have a fixed topology, and both how to design that topology and how to determine the number of output neurons are open problems. The output topology should be close to the inherent structure of the data classes, yet what kind of topological relation holds between text classes is itself an open question. The number of output neurons determines the number of classes and, as with k-means, a class count that is too large or too small severely degrades the clustering result, while in many cases the appropriate number of classes cannot be known in advance.
Summary of the invention
To solve the above technical problems, the present application proposes a short text online clustering method that clusters short texts through an improved greedy short-text clustering procedure and introduces re-clustering, class merging, and class pruning, further improving the accuracy of online clustering.
The technical scheme adopted by the application is a short text online clustering method comprising:
S1: preprocess the acquired social short texts and extract text features; the text features include the words in the short text, the part of speech corresponding to each word, and named-entity tags.
S2: compute short-text similarity from the text features using a vector space model.
S3: if the maximum similarity computed in step S2 exceeds the upper bound of a preset first threshold range, add the text to the known class corresponding to the maximum similarity; if the maximum similarity computed in step S2 is below the lower bound of the first threshold range, create a new class; otherwise perform step S4.
S4: compute, by a semantic method, the similarity between the text and the class corresponding to the maximum similarity of step S2; if this similarity exceeds the upper bound of the first threshold range, add the text to the known class corresponding to the maximum similarity; if the computed similarity is below the lower bound of the first threshold range, create a new class.
S5: for each class obtained in step S4, compute the similarity between each short text in the class and the current class center vector; for any short text whose similarity is below the lower bound of the first threshold range, return to step S2 to reclassify it: if a most similar class is found, delete the short text from its current class and add it to the most similar class; if no most similar class is found, create a new class.
S6: perform a merging operation on the classes obtained after step S5.
S7: perform a pruning operation on the classes processed by step S6.
Further, step S2 is specifically: build the short-text vector from the text features, then compute the cosine similarity between the short-text vector and each known class center vector.
Further, step S4 is specifically:
Text similarity is computed based on WordNet, with the expression:
Sim_Li = e^(-αl) · (e^(βh) - e^(-βh)) / (e^(βh) + e^(-βh))
where α is a smoothing factor, β is a scale factor adjusting the contribution of the two measures, distance and depth, to the similarity, h is the depth of the lowest common subsumer node of the two concept nodes, and l is the distance between the two concept nodes;
or:
Text similarity is computed based on Word2vec, specifically: each word is represented as a vector of fixed dimension, the text vector is then computed from the word vectors, and the similarity between texts is computed by cosine similarity.
Further, step S6 is specifically: compute the information entropy of each class produced by step S5 and the similarity of every pair of classes; if the similarity of two classes exceeds a second threshold, compute the information entropy of the class that would result from merging them; if the change in information entropy does not exceed 0.1, merge the two classes; otherwise do not merge.
Further, step S7 is specifically: sort all classes in ascending order of class center time after the processing of step S6, and delete any class beyond the class expiry time limit;
when the total number of classes exceeds the upper limit, first sort the classes in ascending order of the number of social short texts they contain, then sort again in ascending order of class update time, and delete the front-most classes of the second sort in turn until the total number of classes is below the limit.
Beneficial effects of the present invention: the short text online clustering method of the present invention clusters short texts through an improved incremental short-text clustering procedure in which the similarity threshold varies dynamically with the number of social short texts contained in each class, increasing the flexibility of clustering; it incorporates short-text semantic similarity to further refine the incremental clusters; and it introduces re-clustering, class merging, and class pruning to address the class-center drift inherent to online clustering and the poor aggregation behavior of short texts. The method of this application improves the accuracy of online clustering, helps obtain valuable information from social media faster and more accurately, and can yield direct or indirect economic benefit.
Brief description of the drawings
Fig. 1 is a flow chart of the scheme of the present invention.
Embodiments
To help those skilled in the art understand the technical content of the present invention, the present invention is further explained below with reference to the accompanying drawings.
As shown in Fig. 1, the flow chart of the application's scheme, the technical scheme of the application is a short text online clustering method comprising:
S1: preprocess the acquired social short texts and extract text features; the text features include the words in the short text, the part of speech corresponding to each word, and named-entity tags.
S2: compute short-text similarity from the text features using a vector space model.
S3: if the maximum similarity computed in step S2 exceeds the upper bound of the preset similarity threshold range, add the text to the known class corresponding to the maximum similarity; if the maximum similarity computed in step S2 is below the lower bound of the similarity threshold range, create a new class; otherwise perform step S4.
S4: compute, by a semantic method, the similarity between the text and the class corresponding to the maximum similarity of step S2; if this similarity exceeds the upper bound of the similarity threshold range, add the text to the known class corresponding to the maximum similarity; if the computed similarity is below the lower bound, create a new class.
S5: for each class obtained in step S4, compute the similarity between each short text in the class and the current class center vector; for any short text whose similarity is below the threshold, return to step S2 to reclassify it: if a most similar class is found, delete the short text from its current class and add it to that class; if none is found, create a new class.
S6: perform a merging operation on the classes obtained after step S5.
S7: perform a pruning operation on the classes processed by step S6.
Step S1 is specifically: social short texts are read one by one from the data stream; preprocessing converts each raw short text into canonical form, and the short-text features are then extracted, specifically the words in the short text, the part of speech corresponding to each word, and named-entity tags.
Step S2 is specifically: build the short-text vector from the features extracted in step S1. The short-text vector is a weighted word-frequency vector, weighted by each word's part of speech and named-entity type, so that words of different parts of speech, and named entities, carry different importance. Typical weights are set as follows: nouns and verbs are fixed at 1.0; person, place, and organization names among the named entities are weighted 1.2; other words such as adjectives and adverbs are weighted 0.5. The short-text vector is the basis of short-text similarity computation and is also used to build the weighted short-text semantic vector in semantic similarity computation.
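The weighting scheme just described (nouns and verbs 1.0, person/place/organization names 1.2, adjectives, adverbs, and other words 0.5) can be sketched as follows. The `(word, pos, is_named_entity)` token format is an assumption of this sketch; in practice these labels would come from the POS tagger and named-entity recognizer of step S1.

```python
# Illustrative weights, matching the values stated in the text.
POS_WEIGHT = {"noun": 1.0, "verb": 1.0, "adj": 0.5, "adv": 0.5}
NE_WEIGHT = 1.2   # person, place, and organization names

def build_text_vector(tokens):
    """tokens: list of (word, pos, is_named_entity) triples.
    Returns a weighted word-frequency vector as a {word: weight} dict."""
    vec = {}
    for word, pos, is_ne in tokens:
        w = NE_WEIGHT if is_ne else POS_WEIGHT.get(pos, 0.5)
        vec[word] = vec.get(word, 0.0) + w
    return vec

toks = [("storm", "noun", False), ("london", "noun", True), ("hits", "verb", False),
        ("heavy", "adj", False), ("storm", "noun", False)]
vec = build_text_vector(toks)
```

Repeated words accumulate weight, so the vector is a weighted term-frequency vector as the text describes.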
The class center vector is built by summing the vectors of the social short texts gathered in the class, so it too is based on word frequency. When a social short text is removed from a class, its vector must be subtracted from the class center vector. The class center vector is used to build the weighted class-center semantic vector and is the basis of similarity computation. The class center vector is length-limited, because the length of a typical social short text is itself restricted: it is usually capped at 25 words, since the content of one event can be described clearly within 25 words. This also keeps the class center vector length close to the social short-text vector length and prevents the growth of marginal words in the class from gradually increasing the dissimilarity between the social short texts and the class.
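A sketch of the class-center bookkeeping just described: text vectors are accumulated into and subtracted from the center vector, and the center is capped at 25 words. The text does not say which words are dropped when the cap is exceeded; keeping the highest-weight words is an assumption of this sketch.

```python
def add_to_center(center, text_vec, max_len=25):
    """Accumulate a short-text vector into the class center vector, then
    keep only the max_len highest-weight words (the 25-word cap; which
    words to drop is not specified in the text, so top-weight is assumed)."""
    for t, w in text_vec.items():
        center[t] = center.get(t, 0.0) + w
    if len(center) > max_len:
        top = sorted(center.items(), key=lambda kv: kv[1], reverse=True)[:max_len]
        center.clear()
        center.update(top)
    return center

def remove_from_center(center, text_vec):
    """Subtract a removed text's vector from the center (drop zeroed words)."""
    for t, w in text_vec.items():
        if t in center:
            center[t] -= w
            if center[t] <= 0:
                del center[t]
    return center

c = {}
add_to_center(c, {"storm": 1.0, "rain": 1.0})
add_to_center(c, {"storm": 1.0, "wind": 0.5})
remove_from_center(c, {"rain": 1.0})
```

After these calls the center retains only the words still supported by its members.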
The similarity threshold takes a value that varies with the size of each class. When a class contains few social short texts, its theme is generally not yet evident and the class center vector is spread relatively evenly over many words; social short texts with a similar theme then have low similarity to the class center, so lowering the threshold helps cluster them together. As the social short texts in the class increase, the theme becomes clear and the class center vector starts to tilt toward the topic words, so raising the threshold helps exclude social short texts with dissimilar themes. The threshold is computed by the piecewise function of formula (1): it has an upper bound and a lower bound, the upper bound being 0.6 and the lower bound 0.4 in this application, and, taking a class of 20 social short texts as the point of maximum threshold, the similarity threshold is stretched uniformly between 0.4 and 0.6.
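Formula (1) itself is not reproduced in this text; a piecewise function consistent with the description (lower bound 0.4, upper bound 0.6, reached at 20 social short texts, stretched uniformly in between) would be:

```python
def similarity_threshold(n_texts, lo=0.4, hi=0.6, n_max=20):
    """Dynamic per-class similarity threshold: stretched linearly between
    lo and hi as the class grows, capped at hi once n_max texts are reached.
    The exact form of the patent's formula (1) is not shown in the text;
    linear interpolation is one consistent reading, assumed here."""
    if n_texts >= n_max:
        return hi
    return lo + (hi - lo) * n_texts / n_max
```

A small new class is thus judged leniently (near 0.4), while a mature class with a clear theme is judged strictly (0.6).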
Similarity computation in clustering is mainly the computation of similarity between a short text and a class. The VSM method uses the cosine similarity of the social short-text vector and the class center vector, while the semantic method combines the WordNet and Word2vec semantic similarity computations mentioned above. The cosine similarity with each known class is computed first using the VSM method.
Step S3 is specifically: judge whether the maximum similarity computed in step S2 lies within the given similarity threshold range (0.46 to 0.61). If it exceeds the upper bound of the threshold, add the text to the known class corresponding to the maximum similarity; if it is below the lower bound of the threshold, create a new class; otherwise perform step S4.
Step S4 is specifically: computing text similarity based on WordNet is a semantic method, i.e. using WordNet as a tool to compute the similarity of words and thereby obtain text similarity. WordNet is an English lexical database based on cognitive science, developed at Princeton University. It organizes a large English vocabulary into a term network whose basic unit is the synset, a set of semantically equivalent words sharing the same concept. Synsets are connected by various relations to form the term network, with nouns, verbs, adjectives, and adverbs organized into essentially independent networks. Commonly used methods include the following:
Li similarity is computed from the distance between two concept nodes and the depth of the lowest common subsumer of the two concepts, as in formula (2). In the formula, α is a smoothing factor; β is a scale factor adjusting the contribution of the two measures, distance and depth, to the similarity; h is the depth of the lowest common subsumer node of the two concept nodes; and l is the distance between the two concept nodes. Li similarity increases with h and decreases with l: the deeper the lowest common subsumer, the more specific the shared concept, and since equal distances at a specific level imply greater closeness than at an abstract level, the measure is an improvement on shortest-path methods. By combining depth and distance with adjustable weighting, the Li method is among the better measures of word similarity.
Wu & Palmer similarity is computed from the depths in the network of the two concept nodes corresponding to the two words and the depth of their lowest common subsumer, i.e. their last common ancestor node, and therefore also reflects the hierarchical position of the nodes, as in formula (3).
The Word2vec-based method represents each word as a fixed-dimension vector, computes the text vector from the word vectors, and finally computes the similarity between texts by cosine similarity. Word2vec is an efficient open-source tool released by Google in 2013 for representing words as real-valued vectors. It applies ideas from deep learning: after training, the processing of text content is reduced to vector operations in a K-dimensional vector space, and similarity in that space can represent semantic similarity between texts. The word vectors Word2vec produces can be used for many natural-language-processing tasks, such as clustering, finding synonyms, and part-of-speech analysis.
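A sketch of the Word2vec-based text similarity just described: average the word vectors of a text, then compare texts by cosine similarity. The 2-dimensional "embeddings" below are invented toy values standing in for real Word2vec output.

```python
import math

def text_vector(words, wv):
    """Average the (pretrained) word vectors of the words present in wv."""
    vecs = [wv[w] for w in words if w in wv]
    if not vecs:
        return [0.0] * len(next(iter(wv.values())))
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy 2-D "embeddings" standing in for real Word2vec output (hypothetical values).
wv = {"storm": [1.0, 0.1], "rain": [0.9, 0.2],
      "football": [0.1, 1.0], "match": [0.2, 0.9]}
sim_weather = cos(text_vector(["storm", "rain"], wv), text_vector(["storm"], wv))
sim_cross = cos(text_vector(["storm", "rain"], wv), text_vector(["football", "match"], wv))
```

Texts about the same topic land close together in the vector space even when they share no exact words, which is what the semantic step of S4 exploits.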
Step S5 is specifically: the application uses re-clustering to solve the class-center drift problem. Center drift makes the result of greedy incremental clustering depend on the input order of the texts; the phenomenon arises mainly during the incremental process, where, as more and more social short texts accumulate, the class center gradually shifts until short texts gathered earlier are no longer similar to it. Incremental clustering therefore needs to adjust the members of each class after a certain volume of data or a certain interval of time.
The main procedure is: for each class, compute the similarity of each short text in the class to the current class center vector; for any short text below the lower bound 0.46 of the first threshold range, return to step 2 to find a more similar class. If a most similar class is found, delete the target short text from its current class and add it to that class; if no most similar class is found, create a new class.
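The re-clustering pass can be sketched as follows, with a toy shared-word similarity standing in for the VSM cosine. Note that this sketch does not update the center vectors as members move, which a full implementation would.

```python
def overlap(text_vec, center_vec):
    """Toy similarity (shared-word ratio); a stand-in for the VSM cosine."""
    shared = set(text_vec) & set(center_vec)
    return len(shared) / max(len(text_vec), len(center_vec))

def recluster(classes, centers, sim, lower=0.46):
    """Move any member whose similarity to its own class center falls below
    `lower` into the most similar other class (if that beats `lower`),
    otherwise into a new singleton class.  Vectors are {term: weight} dicts.
    Centers are left stale here for brevity; a real pass would update them."""
    for ci in range(len(centers)):
        for text in list(classes[ci]):
            if sim(text, centers[ci]) >= lower:
                continue
            classes[ci].remove(text)                       # drifted: no longer fits
            cands = [(sim(text, c), j) for j, c in enumerate(centers) if j != ci]
            best_sim, best_j = max(cands, default=(0.0, -1))
            if best_sim >= lower:
                classes[best_j].append(text)
            else:
                classes.append([text])                     # becomes a new class
                centers.append(dict(text))
    return classes

classes = [[{"storm": 1, "rain": 1}, {"football": 1, "match": 1}],
           [{"goal": 1, "match": 1}]]
centers = [{"storm": 1, "rain": 1}, {"goal": 1, "match": 1}]
recluster(classes, centers, overlap)
```

The misplaced sports text leaves the weather class and joins the class whose center it actually resembles.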
Step S6 is specifically: the application uses class merging to address the outliers introduced by the re-clustering method. In practice some outliers can be merged back into existing classes, while others cannot be merged into a suitable class no matter how many rounds of re-clustering and merging are performed; handling these outliers promptly helps improve processing speed during clustering.
Because length limits leave a social short text with few word features, a text describing a theme may cover only part of the theme's content, so short texts on the same theme may fail to cluster together owing to small differences in the words they contain. As the number of social short texts in each class gradually increases, however, each class center drifts toward the overall theme information, which may make classes on the same theme more and more similar. In addition, the re-clustering step adjusts dissimilar members within classes, changing the word tendency of some classes and generating some outliers. It is therefore necessary to merge similar classes after re-clustering.
The application uses the information entropy of a class to weigh the similarity of two classes, on the principle that when two classes are similar, merging one into the other will not increase the entropy much. Suppose B and C are the two classes to be merged and the entropy-change threshold is 0.1: if the entropy change of the merged class, computed after merging B and C, is below the threshold, the two classes are similar and may be merged; otherwise they are dissimilar and may not be merged.
The two indicators for merging similar classes are class similarity and entropy change. The entropy is obtained from the class center vector. The computation normalizes the class center vector, dividing each dimension by the vector's sum so that the dimensions total 1, as in formula (4); the normalized class center vector then approximates the probability distribution of word occurrence in the class, and the class entropy is computed from this approximate distribution, as in formula (5). An entropy increase after merging indicates that the theme of the class has become less definite, while a decrease indicates a more definite theme, so classes may be merged when the entropy increase is small.
An exception is that the entropy can also decrease when the center of the merged class becomes dominated by one or two words, but such a merge is generally incorrect, since one or two words typically cannot summarize a theme. To eliminate this error, the similarity of the two classes must be compared first, and merging is allowed only when the classes are similar enough. The merging method is therefore: first compute the similarity of the two classes; only when they are similar enough (greater than the class-similarity threshold 0.51, i.e. the second threshold) compute the entropy change of the merge; and merge the two classes when the entropy increases little or decreases. The entropy condition reduces the mixing of noise when classes are merged by similarity alone, keeping themes sharp rather than blurred, while the similarity condition reduces erroneous over-aggregation when merging by entropy alone; the two conditions complement each other.
Step S7 is specifically: the application removes noise from the clusters by pruning. Social short texts contain a great deal of noise, present both in ordinary classes and in outliers. Preprocessing has already filtered out many social short texts unrelated to any event, but noise unrelated to events still remains, part of which becomes outliers; some event-related social short texts also become outliers because they differ greatly from other related texts, and these outliers can likewise be treated as noise. Outliers are numerous in short-text clustering, and handling them speeds up clustering and increases its accuracy.
The specific pruning method is to prune classes after re-clustering and class merging, in two steps. First, all classes are sorted in ascending order of class center time, and classes beyond the class expiry time limit are deleted. Second, when the total number of classes exceeds the upper limit, the classes are sorted by the number of social short texts in each class and by update time, with ascending count of social short texts as the first ordering rule and ascending class update time as the second, and the front-most classes are deleted in turn until the number of classes falls below the limit.
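The two-step pruning can be sketched as follows; the class record fields (`center_time`, `update_time`, `texts`), the one-day expiry, and the cap value in the demo are illustrative assumptions.

```python
def prune(classes, now, expiry=86_400, max_classes=1000):
    """Two-step pruning: (1) drop classes whose center time is older than the
    expiry window (e.g. one day); (2) if still over the cap, drop classes in
    order of (member count ascending, update time ascending) until under it.
    `classes` is a list of dicts with 'center_time', 'update_time', 'texts'."""
    kept = [c for c in classes if now - c["center_time"] <= expiry]
    if len(kept) > max_classes:
        kept.sort(key=lambda c: (len(c["texts"]), c["update_time"]))
        kept = kept[len(kept) - max_classes:]   # delete the front-most classes
    return kept

now = 1_000_000
cs = [
    {"center_time": now - 90_000, "update_time": now - 90_000, "texts": [1]},        # expired
    {"center_time": now - 100, "update_time": now - 100, "texts": [1, 2, 3]},
    {"center_time": now - 200, "update_time": now - 200, "texts": [1]},              # smallest
    {"center_time": now - 50, "update_time": now - 50, "texts": [1, 2]},
]
survivors = prune(cs, now, expiry=86_400, max_classes=2)
```

The expired class and the smallest surviving class are dropped; the larger, fresher classes remain.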
Pruning is mainly based on the following indicators:
(1) Class expiry time limit. When the class center time exceeds the expiry limit, the event the class describes happened a significant time ago and has lost its value for event detection, so the class can be deleted. The expiry limit is generally set to a suitable value according to the time requirements of event detection; one or two days is appropriate.
(2) Class last-update time. Each class has its own update time, the computer time at which a member of the class last changed. The later the update time, the more likely the class is to be updated again, so classes with earlier update times should be deleted first when classes are deleted.
(3) Upper limit on the total number of classes. The upper limit keeps the total number of classes within range; too many classes may make clustering too slow or exceed the computer's processing and storage capacity. The limit is determined by the actual hardware performance.
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principles of the present invention, and it should be understood that the protection scope of the present invention is not limited to these particular statements and embodiments. Various modifications and variations of the present invention are possible for those skilled in the art; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall be included within the scope of the present claims.
Claims (5)
- 1. A short text online clustering method, characterized by comprising:
S1, preprocessing the acquired social short texts and extracting text features, the text features including the words in the short text, the part of speech corresponding to each word, and named-entity tags;
S2, computing short-text similarity from the text features using a vector space model;
S3, if the maximum similarity computed in step S2 exceeds the upper bound of a preset first threshold range, adding the text to the known class corresponding to the maximum similarity; if the maximum similarity computed in step S2 is below the lower bound of the first threshold range, creating a new class; otherwise performing step S4;
S4, computing, by a semantic method, the similarity between the text and the class corresponding to the maximum similarity computed in step S2; if this similarity exceeds the upper bound of the first threshold range, adding the text to the known class corresponding to the maximum similarity; if the computed similarity is below the lower bound of the first threshold range, creating a new class;
S5, for each class obtained in step S4, computing the similarity between each short text in the class and the current class center vector; for any short text whose similarity is below the lower bound of the first threshold range, returning to step S2 to reclassify it: if a most similar class is found, deleting the short text from the current class and adding it to the most similar class; if no most similar class is found, creating a new class;
S6, performing a merging operation on the classes obtained after step S5;
S7, performing a pruning operation on the classes processed by step S6.
- 2. The short text online clustering method according to claim 1, characterized in that step S2 is specifically: building the short-text vector from the text features; and computing the cosine similarity between the short-text vector and each known class center vector.
- 3. The short text online clustering method according to claim 1, characterized in that step S4 is specifically: computing text similarity based on WordNet, with the expression Sim_Li = e^(-αl) · (e^(βh) - e^(-βh)) / (e^(βh) + e^(-βh)), where α is a smoothing factor, β is a scale factor adjusting the contribution of the two measures, distance and depth, to the similarity, h is the depth of the lowest common subsumer node of the two concept nodes, and l is the distance between the two concept nodes; or computing text similarity based on Word2vec, specifically: representing each word as a vector of fixed dimension, computing the text vector from the word vectors, and computing the similarity between texts by cosine similarity.
- 4. The short text online clustering method according to claim 1, characterized in that step S6 is specifically: computing the information entropy of each class processed by step S5 and the similarity of every pair of classes; if the similarity of two classes exceeds a second threshold, computing the information entropy of the class that would result from merging them; if the change in information entropy does not exceed 0.1, merging the two classes; otherwise not merging.
- 5. The online short-text clustering method according to claim 1, characterized in that step S7 specifically comprises: sorting all classes in ascending order of class-center time after step S6, and deleting the expired classes that exceed the class time limit; when the total number of classes exceeds an upper limit, first sorting the classes in ascending order of the number of social short texts they contain, then sorting them again in ascending order of class update time, and finally deleting the front-most classes after the second sort, one by one, until the total number of classes falls below the upper limit.
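The two-stage pruning of claim 5 can be sketched as below. The `Cluster` record and its field names (`created` for the class-center time, `updated`, `size`) are hypothetical; the two consecutive stable sorts mirror the claim's "first sort by text count, then re-sort by update time" wording.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    created: float   # class-center time (illustrative field name)
    updated: float   # last update time
    size: int        # number of social short texts in the class

def prune(clusters, now, ttl, max_classes):
    """Drop expired classes, then trim to max_classes (claim 5's two stages)."""
    # Stage 1: delete classes whose class-center time has exceeded the limit.
    alive = [c for c in clusters if now - c.created <= ttl]
    # Stage 2: if still over the limit, sort by text count (ascending), then
    # stably re-sort by update time (ascending), so the front of the list holds
    # the least-recently-updated (and, on ties, smallest) classes to delete.
    if len(alive) > max_classes:
        alive.sort(key=lambda c: c.size)
        alive.sort(key=lambda c: c.updated)  # stable: ties keep size order
        alive = alive[len(alive) - max_classes:]
    return alive
```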
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710816052.XA CN107609102A (en) | 2017-09-12 | 2017-09-12 | A kind of short text on-line talking method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107609102A true CN107609102A (en) | 2018-01-19 |
Family
ID=61063065
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710816052.XA Pending CN107609102A (en) | 2017-09-12 | 2017-09-12 | A kind of short text on-line talking method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107609102A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106383877A (en) * | 2016-09-12 | 2017-02-08 | 电子科技大学 | On-line short text clustering and topic detection method of social media |
CN106649853A (en) * | 2016-12-30 | 2017-05-10 | 儒安科技有限公司 | Short text clustering method based on deep learning |
Non-Patent Citations (5)
Title |
---|
YUHUA LI et al.: "An Approach for Measuring Semantic", IEEE Transactions on Knowledge and Data Engineering *
ZHANG MENG et al.: "A Text Clustering Method Based on Automatic Threshold Discovery", Journal of Computer Research and Development *
FANG XINGXING et al.: "Research on Network Public Opinion Topic Detection Based on Improved Single-Pass", Computer and Digital Engineering *
ZHAO XIAONAN et al.: "Design of a Single-Pass-Based Military Network Public Opinion Monitoring System", Electronic Design Engineering *
QIU YUNFEI et al.: "Research on Burst Topic Detection Methods for Microblogs", Development Research and Design Technology *
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109033084A (en) * | 2018-07-26 | 2018-12-18 | 国信优易数据有限公司 | A kind of semantic hierarchies tree constructing method and device |
CN109145114A (en) * | 2018-08-29 | 2019-01-04 | 电子科技大学 | Social networks event detecting method based on Kleinberg presence machine |
CN109145114B (en) * | 2018-08-29 | 2021-08-03 | 电子科技大学 | Social network event detection method based on Kleinberg online state machine |
CN110888978A (en) * | 2018-09-06 | 2020-03-17 | 北京京东金融科技控股有限公司 | Article clustering method and device, electronic equipment and storage medium |
CN109189910A (en) * | 2018-09-18 | 2019-01-11 | 哈尔滨工程大学 | A kind of label auto recommending method towards mobile application problem report |
CN109492098A (en) * | 2018-10-24 | 2019-03-19 | 北京工业大学 | Target corpus base construction method based on Active Learning and semantic density |
CN109492098B (en) * | 2018-10-24 | 2022-05-06 | 北京工业大学 | Target language material library construction method based on active learning and semantic density |
CN109299270A (en) * | 2018-10-30 | 2019-02-01 | 云南电网有限责任公司信息中心 | A kind of text data unsupervised clustering based on convolutional neural networks |
CN109710762A (en) * | 2018-12-26 | 2019-05-03 | 南京云问网络技术有限公司 | A kind of short text clustering method merging various features weight |
CN109710762B (en) * | 2018-12-26 | 2023-08-01 | 南京云问网络技术有限公司 | Short text clustering method integrating multiple feature weights |
CN110362685A (en) * | 2019-07-22 | 2019-10-22 | 腾讯科技(武汉)有限公司 | Clustering method and cluster equipment |
CN110442726A (en) * | 2019-08-15 | 2019-11-12 | 电子科技大学 | Social media short text on-line talking method based on physical constraints |
CN110442726B (en) * | 2019-08-15 | 2022-03-04 | 电子科技大学 | Social media short text online clustering method based on entity constraint |
CN110765329A (en) * | 2019-10-28 | 2020-02-07 | 北京天融信网络安全技术有限公司 | Data clustering method and electronic equipment |
CN111124578A (en) * | 2019-12-23 | 2020-05-08 | 中国银行股份有限公司 | User interface icon generation method and device |
CN111124578B (en) * | 2019-12-23 | 2023-09-29 | 中国银行股份有限公司 | User interface icon generation method and device |
CN111414479A (en) * | 2020-03-16 | 2020-07-14 | 北京智齿博创科技有限公司 | Label extraction method based on short text clustering technology |
CN111414479B (en) * | 2020-03-16 | 2023-03-21 | 北京智齿博创科技有限公司 | Label extraction method based on short text clustering technology |
CN112100986A (en) * | 2020-11-10 | 2020-12-18 | 北京捷通华声科技股份有限公司 | Voice text clustering method and device |
WO2022100071A1 (en) * | 2020-11-10 | 2022-05-19 | 北京捷通华声科技股份有限公司 | Voice text clustering method and apparatus |
CN112579780A (en) * | 2020-12-25 | 2021-03-30 | 青牛智胜(深圳)科技有限公司 | Single-pass based clustering method, system, device and storage medium |
CN112579780B (en) * | 2020-12-25 | 2022-02-15 | 青牛智胜(深圳)科技有限公司 | Single-pass based clustering method, system, device and storage medium |
CN113159802A (en) * | 2021-04-15 | 2021-07-23 | 武汉白虹软件科技有限公司 | Algorithm model and system for realizing fraud-related application collection and feature extraction clustering |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107609102A (en) | A kind of short text on-line talking method | |
Babar et al. | Improving performance of text summarization | |
Schneider | Techniques for improving the performance of naive bayes for text classification | |
Weiss et al. | Text mining: predictive methods for analyzing unstructured information | |
CN110298032A (en) | Text classification corpus labeling training system | |
US20150199333A1 (en) | Automatic extraction of named entities from texts | |
CN105808524A (en) | Patent document abstract-based automatic patent classification method | |
Alhutaish et al. | Arabic text classification using k-nearest neighbour algorithm | |
Raś et al. | From data to classification rules and actions | |
Kumar et al. | Legal document summarization using latent dirichlet allocation | |
Hettinger et al. | Genre classification on German novels | |
Ghalehtaki et al. | A combinational method of fuzzy, particle swarm optimization and cellular learning automata for text summarization | |
Utiu et al. | Learning web content extraction with DOM features | |
Yang et al. | A survey on interpretable clustering | |
Debnath et al. | Extractive single document summarization using an archive-based micro genetic-2 | |
CN111859984A (en) | Intention mining method, device, equipment and storage medium | |
JPH06282587A (en) | Automatic classifying method and device for document and dictionary preparing method and device for classification | |
Rukmi et al. | Using k-means++ algorithm for researchers clustering | |
Thabtah et al. | Comparison of rule based classification techniques for the Arabic textual data | |
Ramakrishnan et al. | Hypergraph based clustering for document similarity using FP growth algorithm | |
Papagiannopoulou et al. | Keywords lie far from the mean of all words in local vector space | |
CN114443820A (en) | Text aggregation method and text recommendation method | |
Moulay Lakhdar et al. | Building an extractive Arabic text summarization using a hybrid approach | |
ElGhazaly | Automatic text classification using neural network and statistical approaches | |
Rajkumar et al. | An efficient feature extraction with subset selection model using machine learning techniques for Tamil documents classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20180119 |
|