CN109871447A - Clustering method, computer program product, and server system for unsupervised learning on Chinese comments
- Publication number: CN109871447A
- Application number: CN201910163711.3A
- Authority: CN (China)
- Prior art keywords: candidate, label, comment, tag, cluster
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Landscapes: Information Retrieval, DB Structures and FS Structures Therefor
Abstract
The present invention provides a clustering method, a computer program product, and a server system for unsupervised learning on Chinese comments. The clustering method includes: acquiring comment data and organizing it into a corpus; preprocessing the comment content in the corpus, then performing word segmentation and word-vector training; extracting candidate labels; deduplicating the candidate label library; applying sentiment-word filtering to the deduplicated candidate labels; performing a DBSCAN-based clustering operation on the candidate labels that remain after invalid labels are removed, obtaining the magnitude (count) of every candidate label, and sorting the clustering results by count in descending order; and finally tallying the magnitude of each cluster and outputting the Top N. The invention proposes a clustering approach based on unsupervised learning that overcomes the difficulty earlier label-clustering methods have in expressing comment results objectively. It can refine and learn autonomously, without supervision, from the actual content of the comments and labels, giving clustering results that are more objective and better reflect what the comments actually say.
Description
Technical field
The present invention relates to the technical field of data mining and processing, and in particular to a clustering method, computer program product, and server system for unsupervised learning on Chinese comments.
Background
On e-commerce platforms and forums, labels are often extracted from product or service reviews by technical means and displayed so that potential users can see the most direct evaluations of the product or service. There are currently two main ways to generate these labels.
The first is extraction: the most frequent words or phrases are extracted statistically to form labels, which are then ranked by frequency. This approach introduces considerable noise into the labels, and purely statistical extraction often produces very odd results (labels) that do not truly reflect the characteristics of the comments or the product.
The second is generation from predefined custom labels: the comments are searched for each custom label and every occurrence adds 1 to its tally. Once all comments have been scanned, the tallies of the custom labels are known and the top N are ranked to give the final labeling result. This approach usually requires considerable manual labor, is inefficient, and can only tally the predefined labels, so it is often ineffective for new comments or new keywords.
Both approaches amount to supervised clustering, and both struggle to reflect the true situation.
Summary of the invention
The purpose of the present invention is to address the problems of supervised clustering in the prior art by proposing a clustering method, computer program product, and server system for unsupervised learning on Chinese comments. The labels obtained through unsupervised clustering can be updated and learned autonomously and reflect the true state of the comments and of the commented object more deeply, making the clustering results more objective.
To achieve the above object, the technical solution adopted in the present invention is as follows:
A clustering method for unsupervised learning on Chinese comments, comprising the following steps:
Step 1: obtain comment data for a product or service and organize it into a corpus, the corpus containing the comment content stored in order;
Step 2: preprocess the comment content in the corpus, then perform word segmentation and word-vector training to obtain the word vector corresponding to each segmented word;
Step 3: extract candidate labels using natural-language-based label extraction rules to form a candidate label library;
Step 4: deduplicate the candidate label library, removing duplicate candidate labels;
Step 5: apply sentiment-word filtering to the deduplicated candidate labels, removing invalid labels;
Step 6: perform a DBSCAN-based clustering operation on the candidate labels that remain after invalid labels are removed, obtain the magnitude (count) of every candidate label, and sort the clustering results by count in descending order;
Step 7: tally the magnitude of each cluster and output the Top N.
According to another aspect of the present disclosure, a computer program product is also proposed, comprising one or more non-transitory machine-readable media encoded with instructions that, when executed by one or more processors, cause a process to be performed. The process performs unsupervised clustering of the acquired Chinese comment data and includes executing the foregoing procedure.
According to a third aspect of the present disclosure, a server system is also proposed, comprising:
An interface, configured to obtain comment data for at least one product or service;
At least one processor;
At least one memory, configured to store encoded instructions executable by the at least one processor, the instructions, when executed by the at least one processor, implementing a process of unsupervised clustering of the acquired comment data, the process comprising:
Step 1: organize the acquired comment data into a corpus, the corpus containing the comment content stored in order;
Step 2: preprocess the comment content in the corpus, then perform word segmentation and word-vector training to obtain the word vector corresponding to each segmented word;
Step 3: extract candidate labels using natural-language-based label extraction rules to form a candidate label library;
Step 4: deduplicate the candidate label library, removing duplicate candidate labels;
Step 5: apply sentiment-word filtering to the deduplicated candidate labels, removing invalid labels;
Step 6: perform a DBSCAN-based clustering operation on the candidate labels that remain after invalid labels are removed, obtain the magnitude (count) of every candidate label, and sort the clustering results by count in descending order;
Step 7: tally the magnitude of each cluster and output the Top N.
In a more preferred example, the process further includes:
In step 5, sentiment-word filtering is applied to each candidate label to generate a filtered candidate label library whose data structure contains the candidate label string and the candidate label string vector;
Then, in step 6, the candidate labels are input to the DBSCAN clustering algorithm for clustering. Starting from the first candidate label, the similarity between the selected candidate label and every other candidate label in the library is computed with the cosine similarity algorithm, each similarity value is compared with a preset similarity threshold, and the set of labels whose similarity exceeds the threshold is determined. If the size of that label set is greater than the configured minimum number of neighbors, the number of labels in the set is recorded as the magnitude of this label; otherwise processing of this label ends;
Then, the above clustering computation is repeated until all candidate labels have been clustered;
Finally, the clustering results are sorted by count in descending order according to all labels obtained and their magnitudes.
Combining the foregoing technical solution and its implementation, the significant beneficial effects of the present invention are:
1. Overall, a clustering approach based on unsupervised learning is proposed. It overcomes the problem that earlier supervised clustering, whether by simple statistics or by predefined labels, cannot learn autonomously, so that the displayed labels fail to express the comment results truthfully and objectively. After word segmentation and rule-based selection of candidate labels, the unsupervised clustering of the invention can refine and learn autonomously and without supervision (no custom labels, nothing specified in advance) from the actual content of the comments and labels. The final clustering process and results better reflect the comments objectively, and no human factors or interference are introduced before or after learning;
2. On the data underlying the clustering, natural-language-based label extraction is performed. Chinese dependency parsing explains syntactic structure through the dependency relations between the components of a linguistic unit. Its principle is that the core verb of a sentence is the central component that governs the other components and is not itself governed by any other component, while every governed component is subordinate to its governor through some relation. Different extraction rules can be built on this basis, such as the five rule classes used in the embodiment: "noun subject + adverbial, noun subject + adverbial + adverbial, adverbial + adverbial, adverbial + adjective, adverbial". These effectively and objectively extract candidate labels from the kinds of content that occur most often in Chinese comments;
3. The basic data processing also includes sentiment-word filtering of the candidate labels. Matching against a preferably combined set of sentiment lexicons filters out many invalid and meaningless labels, which avoids wasted processing and low efficiency in the later clustering and avoids clustering results that cannot objectively reflect the comments;
4. In the later stage of filtering out invalid labels, the word vectors of the split-and-recombined words are also averaged. This facilitates the subsequent cosine-similarity-based clustering that yields the label sets, and the final magnitude determination based on those sets, improving clustering efficiency.
Brief description of the drawings
Fig. 1 is a flow diagram of the clustering method for unsupervised learning on Chinese comments according to the present invention.
Detailed description of the embodiments
To make the technical content of the present invention easier to understand, specific embodiments are described below with reference to the accompanying drawings.
Aspects of the present invention are described in this disclosure with reference to the drawings, which show a number of illustrative embodiments. The embodiments of this disclosure are not intended to cover all aspects of the invention. It should be understood that the concepts and embodiments introduced above, and those described in more detail below, can be implemented in any of many ways.
With reference to Fig. 1, the clustering method according to the disclosed embodiment of the invention is intended to cluster the acquired comments on a company's product or service and obtain the Top N comment labels that best reflect the comments. These serve as a reference that helps users grasp past evaluations of the product or service as quickly as possible, or as a basis for improving follow-up service for the product or service.
The statistical approach (keyword recognition and accumulation) and custom labels (custom keywords) used in the past cannot cover what actually appears in the comments, lack extensibility, and produce heavily homogeneous labels. The solution of the present invention is unsupervised: it can learn and adjust in real time from the actual comment content, continuously updating the labels and the label clustering results, and gives clustering results that are more objective and better reflect what the comments actually say.
As shown in Fig. 1, the clustering method based on unsupervised learning proposed by the present invention generally includes the following process:
Step 1: obtain comment data for a product or service and organize it into a corpus, the corpus containing the comment content stored in order;
Step 2: preprocess the comment content in the corpus, then perform word segmentation and word-vector training to obtain the word vector corresponding to each segmented word;
Step 3: extract candidate labels using natural-language-based label extraction rules to form a candidate label library;
Step 4: deduplicate the candidate label library, removing duplicate candidate labels;
Step 5: apply sentiment-word filtering to the deduplicated candidate labels, removing invalid labels;
Step 6: perform a DBSCAN-based clustering operation on the candidate labels that remain after invalid labels are removed, obtain the magnitude (count) of every candidate label, and sort the clustering results by count in descending order;
Step 7: tally the magnitude of each cluster and output the Top N.
The above process thus relies on unsupervised deep learning and natural language processing. Clustering and a label extraction model automate the extraction of comment labels on customer topics, and can show comprehensively and objectively the latent, deeper information mined from users' comments on a specific topic.
With reference to Fig. 1, an exemplary implementation of the clustering method of the embodiment is described in more detail below.
Step 1: obtain comment data for a product or service and organize it into a corpus, the corpus containing the comment content stored in order.
In some concrete implementations, the raw comment data for a product or service can be obtained through e-commerce, customer service, and other channels. This example uses a seasonal garment on sale, a "small floral print one-piece dress", but those skilled in the art should understand that implementation of the invention is not limited to it.
From the comment data, we organize the Chinese comments into a corpus in which users' comments on the "small floral print one-piece dress", especially the text content, are arranged in a certain order. Other embodiments may also include voice comments, which can be converted to text by speech-to-text conversion.
In some examples, all text content is organized, for example in order of comment time, into a corpus stored row by row for subsequent processing.
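As a minimal sketch of step 1, the following Python orders comments by time and keeps one comment per row. The `(timestamp, text)` record layout, the timestamp format, and the name `build_corpus` are assumptions made for illustration; the patent does not specify them.

```python
from datetime import datetime

def build_corpus(comments):
    """Return comment texts ordered by comment time, one text per row."""
    ordered = sorted(
        comments,
        key=lambda c: datetime.strptime(c[0], "%Y-%m-%d %H:%M:%S"),
    )
    # Skip blank comments and collapse whitespace so each comment fits on one row.
    return [" ".join(text.split()) for _, text in ordered if text.strip()]

# Hypothetical raw comments (English stand-ins for Chinese review text).
raw = [
    ("2019-03-02 10:15:00", "like the clothes very much"),
    ("2019-03-01 09:00:00", "fabric is  soft"),
    ("2019-03-03 18:30:00", "   "),
]
corpus = build_corpus(raw)
```

Each element of `corpus` can then be written as one line of a text file to obtain the row-by-row storage the embodiment describes.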
Step 2: preprocess the comment content in the corpus, then perform word segmentation and word-vector training to obtain the word vector corresponding to each segmented word.
In an embodiment of the present invention, step 2 mainly carries out the following:
Step 2-1: preprocess the text content in the corpus. Preprocessing here refers specifically to removing stop words. For example, stop words such as "very" are removed from the text "like the clothes very much" and the comment content "like clothes" is retained, which reduces the indexing and computation needed for the subsequent segmentation, calculation, and clustering;
Step 2-2: after stop words are removed, segment the text content in the corpus in the order in which it is stored, for example producing the segmentation result "like", "clothes";
Step 2-3: train word vectors on the segmented words to obtain the word vector corresponding to each segmented word.
Optionally, for the segmentation in step 2-2, given the needs of clustering comment content, we segment with HanLP and train word vectors on the segmentation result with word2vec. The trained word vectors are used for the subsequent clustering based on the cosine similarity algorithm.
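Steps 2-1 and 2-2 can be sketched as follows. This is only an illustration under stated assumptions: English text stands in for Chinese, whitespace splitting stands in for HanLP segmentation, and the stop-word list is a toy one. Word-vector training (step 2-3) is left out because it needs a word2vec implementation from a third-party library.

```python
import re

# Toy stop-word list; a real system would load a full Chinese stop-word list.
STOP_WORDS = {"the", "very", "much"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Step 2-1: delete stop words from the raw comment text."""
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in sorted(stop_words)) + r")\b"
    return " ".join(re.sub(pattern, " ", text).split())

def segment(text):
    """Step 2-2: stand-in segmenter; the embodiment uses HanLP instead."""
    return text.split()

cleaned = remove_stop_words("like the clothes very much")
tokens = segment(cleaned)
```

The resulting token lists, one per comment, are the kind of input a word2vec trainer expects for step 2-3.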
Step 3: extract candidate labels using natural-language-based label extraction rules to form a candidate label library.
In the implementation of the invention, natural-language-based label extraction is performed on the data underlying the clustering. Chinese dependency parsing explains syntactic structure through the dependency relations between the components of a linguistic unit. Its principle is that the core verb of a sentence is the central component that governs the other components and is not itself governed by any other component, while every governed component is subordinate to its governor through some relation. Different extraction rules can be built on this basis, such as the five rule classes used in the embodiment: "noun subject + adverbial, noun subject + adverbial + adverbial, adverbial + adverbial, adverbial + adjective, adverbial". These effectively and objectively extract candidate labels from the kinds of content that occur most often in Chinese comments.
Of course, in other embodiments, other extraction rules or combinations of them can be chosen for different clustering scenarios and needs.
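One way to picture the five rule classes is as tag sequences matched against already-parsed comments. The sketch below assumes each comment arrives as `(word, relation)` pairs from a dependency parser; the tag names `nsubj`, `adv`, `adj`, and `cc`, the longest-rule-first matching, and the sample sentence are all hypothetical and not taken from the patent.

```python
# The five rule classes, longest first so longer patterns win over their prefixes.
RULES = [
    ("nsubj", "adv", "adv"),  # noun subject + adverbial + adverbial
    ("nsubj", "adv"),         # noun subject + adverbial
    ("adv", "adv"),           # adverbial + adverbial
    ("adv", "adj"),           # adverbial + adjective
    ("adv",),                 # adverbial
]

def extract_candidates(tagged):
    """Scan (word, relation) pairs and emit the spans that match a rule."""
    candidates = []
    i = 0
    while i < len(tagged):
        for rule in RULES:
            window = tagged[i:i + len(rule)]
            if tuple(tag for _, tag in window) == rule:
                candidates.append(" ".join(word for word, _ in window))
                i += len(rule)
                break
        else:
            i += 1  # no rule starts at this position
    return candidates

# Hypothetical parser output for one comment.
tagged = [
    ("delivery", "nsubj"), ("really", "adv"), ("fast", "adv"),
    ("and", "cc"), ("quite", "adv"), ("comfortable", "adj"),
]
candidate_labels = extract_candidates(tagged)
```

Swapping in other rule tuples is how a different clustering scenario could use different extraction rules.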
Step 4: deduplicate the candidate label library, removing duplicate candidate labels.
In a preferred example, the candidate labels in the library are deduplicated with the simhash algorithm, removing labels whose content is substantially the same.
For example, two comment labels that both render as "like clothes" express essentially the same meaning. To simplify the subsequent unified clustering, one of them is removed and only one is retained. During later clustering, labels that express essentially the same meaning are then grouped under the same label, which improves the efficiency and objectivity of the clustering and avoids confusion and repeated clustering of near-synonyms.
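A simhash-based deduplication pass can be sketched as below. The 64-bit fingerprint, character bigram features, MD5 as the feature hash, and the Hamming-distance cutoff of 3 are illustrative assumptions; the patent names only the simhash algorithm and gives no parameters.

```python
import hashlib

def simhash(text, bits=64, ngram=2):
    """Compute a SimHash fingerprint over character n-grams."""
    features = [text[i:i + ngram] for i in range(max(len(text) - ngram + 1, 1))]
    weights = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def dedupe(labels, max_distance=3):
    """Keep a label only if no already-kept label has a near-identical fingerprint."""
    kept = []
    for label in labels:
        fingerprint = simhash(label)
        if all(hamming(fingerprint, other) > max_distance for _, other in kept):
            kept.append((label, fingerprint))
    return [label for label, _ in kept]

unique_labels = dedupe(["like clothes", "like clothes", "fabric soft"])
```

Raising `max_distance` merges more near-duplicates at the cost of occasionally dropping labels that only look similar.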
Step 5: apply sentiment-word filtering to the deduplicated candidate labels, removing invalid labels.
Although labels have been extracted, preprocessed, and deduplicated according to the steps above, in practice some comment content still yields labels that make no substantive comment on the commented object, the "small floral print one-piece dress". A comment such as "I am not feeling well today" does not reflect the commodity being commented on: it is not a meaningful comment on the commented object and contains no sentiment word. We therefore want to filter out such comments before clustering, so that what is actually clustered is real evaluation content about the commodity.
In a preferred embodiment, we filter using a combined sentiment lexicon.
Specifically, the sentiment-word filtering in step 5 includes the following:
Step 5-1: set up a combined sentiment lexicon;
Step 5-2: load the sentiment lexicon into a set. Starting from the first candidate label, split the candidate label into multiple words with the jieba segmentation algorithm and test each resulting word for an exact match against the sentiment words in the lexicon. If a match succeeds, mark the candidate label as containing a sentiment word; otherwise mark it as not containing one;
Step 5-3: if the candidate label is determined to contain a sentiment word, recombine the split words into the candidate label, look up the word vector of every segmented word of the label in the word-vector library of step 1, and compute the average of those word vectors; if it contains no sentiment word, filter it out directly;
Step 5-4: apply the sentiment-word filtering of steps 5-2 and 5-3 to every candidate label. When processing is complete, generate the filtered candidate label library, whose data structure contains the candidate label string and the candidate label string vector.
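Steps 5-1 to 5-4 can be sketched as follows, assuming the lexicons are plain sets of words and the word vectors are a dictionary of lists. Whitespace splitting stands in for jieba, the toy lexicons and vectors are invented for illustration, and dropping a label that has no known word vector is an added assumption rather than something the patent states.

```python
def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def filter_by_sentiment(candidates, lexicons, word_vectors, segment=str.split):
    """Keep labels containing a sentiment word; pair each with its mean word vector."""
    sentiment_words = set().union(*lexicons)  # step 5-1: combined lexicon
    library = []
    for label in candidates:
        words = segment(label)  # step 5-2: split the label into words
        if not any(word in sentiment_words for word in words):
            continue  # no sentiment word: filter the label out
        vectors = [word_vectors[w] for w in words if w in word_vectors]
        if not vectors:
            continue  # assumption: skip labels with no known word vector
        library.append((label, mean_vector(vectors)))  # step 5-3
    return library  # step 5-4: (label string, label vector) pairs

# Toy lexicons and vectors, invented for illustration.
lexicons = [{"like"}, {"soft", "comfortable"}]
word_vectors = {
    "like": [1.0, 0.0], "clothes": [0.0, 1.0],
    "fabric": [1.0, 1.0], "soft": [3.0, 1.0],
}
library = filter_by_sentiment(
    ["like clothes", "feel unwell today", "fabric soft"], lexicons, word_vectors
)
```

Passing several lexicons in the `lexicons` list mirrors the combined-lexicon idea: a label survives if any one of them recognizes a word in it.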
In particular, in step 5-1 we preferably combine multiple sentiment lexicons so that the lexicon's coverage is broader and more complete. This avoids mistakenly filtering out comments that should be clustered because a single lexicon is insufficient.
For example, Tsinghua University's Li Jun Chinese commendatory and derogatory dictionary and the Dalian University of Technology Chinese affective lexicon ontology (without auxiliary sentiment classification) were added to the original sentiment vocabulary. Under otherwise equal conditions, combining multiple sentiment vocabularies gives better labeling results.
Step 6: perform a DBSCAN-based clustering operation on the candidate labels that remain after invalid labels are removed, obtain the magnitude (count) of every candidate label, and sort the clustering results by count in descending order.
In the embodiment of the present invention, the DBSCAN-based clustering operation specifically includes the following:
Step 6-1: load the candidate labels, obtaining the candidate label library produced in step 5-4;
Step 6-2: input the candidate labels to the DBSCAN clustering algorithm for clustering. Starting from the first candidate label, compute with the cosine similarity algorithm the similarity between the selected candidate label and every other candidate label in the library, compare each similarity value with a preset similarity threshold, and determine the set of labels whose similarity exceeds the threshold;
Step 6-3: if the size of the label set is greater than the configured minimum number of neighbors, record the number of labels in the set as the magnitude of this label; otherwise end processing of this label;
Step 6-4: process every candidate label in the library in turn according to steps 6-2 and 6-3 until all candidate labels have been clustered;
Step 6-5: sort the clustering results by count in descending order according to all labels obtained and their magnitudes.
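The neighbor counting described in steps 6-2 to 6-5, together with the Top N output of step 7, can be sketched as below. It follows the procedure as the text states it, which is simpler than a full DBSCAN (there is no expansion of clusters through density-reachable points). The threshold of 0.9, the minimum neighbor count of 1, and the toy vectors are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def label_magnitudes(library, threshold=0.9, min_neighbours=1):
    """For each label, count the other labels whose similarity exceeds the threshold."""
    results = []
    for i, (label, vector) in enumerate(library):
        neighbours = [
            other for j, (other, other_vector) in enumerate(library)
            if j != i and cosine(vector, other_vector) > threshold
        ]
        if len(neighbours) > min_neighbours:  # step 6-3
            results.append((label, len(neighbours)))
    # Step 6-5: descending by magnitude (ties keep their input order).
    return sorted(results, key=lambda item: item[1], reverse=True)

def top_n(results, n):
    """Step 7: keep the N labels with the largest magnitude."""
    return results[:n]

# Toy (label string, label vector) library, invented for illustration.
library = [
    ("soft fabric", [1.0, 0.0]), ("fabric soft", [1.0, 0.1]),
    ("very soft", [0.9, 0.0]), ("fast delivery", [0.0, 1.0]),
    ("quick delivery", [0.1, 1.0]),
]
ranked = label_magnitudes(library)
best = top_n(ranked, 1)
```

Here the two delivery labels each have only one neighbor, which is not greater than `min_neighbours`, so they receive no magnitude and drop out of the ranking.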
In connection with the implementation of the invention, in some embodiments we also propose a computer program product comprising one or more non-transitory machine-readable media encoded with instructions that, when executed by one or more processors, cause a process to be performed. The process performs unsupervised clustering of the acquired Chinese comment data and includes executing the procedure of the foregoing method, in particular the method shown in Fig. 1 and the processing described above with reference to Fig. 1.
It is worth noting that the processing shown in Fig. 1 and described above with reference to it, that is, the clustering method based on unsupervised learning, can be implemented on a local server, a local computer system, or a cloud server. The following description takes a cloud server implementation as an example.
The server system according to the present disclosure comprises:
An interface, configured to obtain comment data for at least one product or service;
At least one processor;
At least one memory, configured to store encoded instructions executable by the at least one processor, the instructions, when executed by the at least one processor, implementing a process of unsupervised clustering of the acquired comment data, the process comprising:
Step 1: organize the acquired comment data into a corpus, the corpus containing the comment content stored in order;
Step 2: preprocess the comment content in the corpus, then perform word segmentation and word-vector training to obtain the word vector corresponding to each segmented word;
Step 3: extract candidate labels using natural-language-based label extraction rules to form a candidate label library;
Step 4: deduplicate the candidate label library, removing duplicate candidate labels;
Step 5: apply sentiment-word filtering to the deduplicated candidate labels, removing invalid labels;
Step 6: perform a DBSCAN-based clustering operation on the candidate labels that remain after invalid labels are removed, obtain the magnitude (count) of every candidate label, and sort the clustering results by count in descending order;
Step 7: tally the magnitude of each cluster and output the Top N.
In an especially preferred embodiment, the process further includes:
In step 5, sentiment-word filtering is applied to each candidate label to generate a filtered candidate label library whose data structure contains the candidate label string and the candidate label string vector;
Then, in step 6, the candidate labels are input to the DBSCAN clustering algorithm for clustering. Starting from the first candidate label, the similarity between the selected candidate label and every other candidate label in the library is computed with the cosine similarity algorithm, each similarity value is compared with a preset similarity threshold, and the set of labels whose similarity exceeds the threshold is determined. If the size of that label set is greater than the configured minimum number of neighbors, the number of labels in the set is recorded as the magnitude of this label; otherwise processing of this label ends;
Then, the above clustering computation is repeated until all candidate labels have been clustered;
Finally, the clustering results are sorted by count in descending order according to all labels obtained and their magnitudes.
Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Those of ordinary skill in the art to which the invention belongs may make various modifications and variations without departing from its spirit and scope. The scope of protection of the invention is therefore defined by the claims.
Claims (10)
1. A clustering method for unsupervised learning on Chinese comments, characterized by comprising the following steps:
Step 1: obtain comment data for a product or service and organize it into a corpus, the corpus containing the comment content stored in order;
Step 2: preprocess the comment content in the corpus, then perform word segmentation and word-vector training to obtain the word vector corresponding to each segmented word;
Step 3: extract candidate labels using natural-language-based label extraction rules to form a candidate label library;
Step 4: deduplicate the candidate label library, removing duplicate candidate labels;
Step 5: apply sentiment-word filtering to the deduplicated candidate labels, removing invalid labels;
Step 6: perform a DBSCAN-based clustering operation on the candidate labels that remain after invalid labels are removed, obtain the magnitude (count) of every candidate label, and sort the clustering results by count in descending order;
Step 7: tally the magnitude of each cluster and output the Top N.
2. The clustering method for unsupervised learning on Chinese comments according to claim 1, characterized in that the preprocessing in step 2 includes removing stop words.
3. The clustering method for unsupervised learning on Chinese comments according to claim 1, characterized in that in step 2, word segmentation is performed with HanLP and word vectors are trained on the segmentation result with word2vec.
4. The clustering method for unsupervised learning on Chinese comments according to claim 1, characterized in that the label extraction rules used in step 3 include five classes of extraction rule for obtaining candidate labels: noun subject + adverbial, noun subject + adverbial + adverbial, adverbial + adverbial, adverbial + adjective, and adverbial.
5. The clustering method for unsupervised learning on Chinese comments according to claim 1, characterized in that in step 4, the candidate labels in the candidate label library are deduplicated with the simhash algorithm, removing labels whose content is the same.
6. The clustering method for unsupervised learning on Chinese comments according to claim 1, characterized in that the sentiment-word filtering in step 5 specifically includes:
Step 5-1: set up a combined sentiment lexicon;
Step 5-2: load the sentiment lexicon into a set; starting from the first candidate label, split the candidate label into multiple words with the jieba segmentation algorithm and test each resulting word for an exact match against the sentiment words in the lexicon; if a match succeeds, mark the candidate label as containing a sentiment word, otherwise mark it as not containing one;
Step 5-3: if the candidate label is determined to contain a sentiment word, recombine the split words into the candidate label, look up the word vector of every segmented word of the label in the word-vector library of step 1, and compute the average of those word vectors; if it contains no sentiment word, filter it out directly;
Step 5-4: apply the sentiment-word filtering of steps 5-2 and 5-3 to every candidate label; when processing is complete, generate the filtered candidate label library, whose data structure contains the candidate label string and the candidate label string vector.
7. The clustering method for unsupervised learning of Chinese comments according to claim 6, characterized in that the clustering operation in step 6 comprises the following steps:
Step 6-1: loading the candidate labels by obtaining the candidate tag library of step 5-4;
Step 6-2: inputting the candidate labels to the DBSCAN clustering algorithm for the clustering operation; starting from the first candidate label, calculating, with the cosine similarity algorithm, the similarity between the selected candidate label and all other candidate labels in the candidate tag library; comparing each similarity value with a preset similarity threshold, and determining the set of labels whose similarity is greater than the threshold;
Step 6-3: if the label set is determined to be larger than the defined minimum number of neighbors, counting the number of labels in the label set as the magnitude of this label; otherwise ending;
Step 6-4: processing all candidate labels in the candidate tag library in turn according to steps 6-2 and 6-3, until clustering of all candidate labels is finished;
Step 6-5: sorting the clustering results in descending order of quantity according to the obtained labels and their magnitudes.
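The neighbor counting of steps 6-2 to 6-5 can be sketched as below. This covers only what the claim spells out (cosine similarity against a threshold, a minimum-neighbor test, and a descending sort); it does not perform the cluster expansion of a full DBSCAN. The names `cosine` and `label_magnitudes`, the default threshold of 0.8, the default minimum of 1 neighbor, and the toy vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_magnitudes(library, threshold=0.8, min_neighbors=1):
    """library: list of (label, vector). Returns (label, magnitude) pairs, largest first."""
    results = []
    for i, (label, vec) in enumerate(library):
        # Step 6-2: labels whose similarity to the current label exceeds the threshold.
        neighbors = [other for j, (other, other_vec) in enumerate(library)
                     if j != i and cosine(vec, other_vec) > threshold]
        # Step 6-3: keep the label only if its neighbor set is large enough.
        if len(neighbors) > min_neighbors:
            results.append((label, len(neighbors)))
    # Step 6-5: descending order by magnitude (ties keep their input order).
    results.sort(key=lambda item: item[1], reverse=True)
    return results


toy_library = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]),
               ("c", [0.95, 0.05]), ("d", [0.0, 1.0])]
ranked = label_magnitudes(toy_library)
```

A library implementation such as scikit-learn's `DBSCAN` with a cosine metric could be substituted where full cluster assignment is needed; note that its `eps` is a distance rather than a similarity.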
8. A computer program product comprising one or more non-transitory machine-readable media encoded with instructions which,
when executed by one or more processors, cause a process to be carried out, the process being for performing unsupervised clustering
of acquired Chinese comment data, the process comprising the process included in the method of any one of the preceding claims 1-7.
9. A server system, characterized by comprising:
an interface arranged to obtain comment data for at least one product or service;
at least one processor;
at least one memory arranged to store encoded instructions executable by the at least one processor, the instructions, when executed by the at least one processor, implementing an unsupervised clustering process on the obtained comment data,
the process comprising:
Step 1: organizing the acquired comment data to obtain a corpus, the corpus including comment content information stored in order;
Step 2: preprocessing the comment content information in the corpus, and performing word segmentation and term vector training to obtain the term vectors corresponding to the word segmentation results;
Step 3: extracting candidate labels based on natural-language label extraction rules to form a candidate tag library;
Step 4: deduplicating the candidate tag library to remove duplicate candidate labels;
Step 5: performing emotion word filtering on the deduplicated candidate labels to remove invalid labels;
Step 6: performing a DBSCAN-based clustering operation on the candidate labels remaining after the invalid labels are removed, obtaining the magnitudes of all candidate labels, and sorting the clustering results in descending order of quantity;
Step 7: counting the magnitude of each cluster and outputting the TopN.
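Step 7 reduces to selecting the N largest clusters once each cluster's magnitude is known. A minimal sketch, assuming the magnitudes are held in a dictionary keyed by cluster label; the function name `top_n`, the default `n=10`, and the sample counts are hypothetical.

```python
import heapq


def top_n(magnitudes, n=10):
    """magnitudes: dict mapping a cluster label to its magnitude. Returns the n largest."""
    return heapq.nlargest(n, magnitudes.items(), key=lambda item: item[1])


# Invented sample counts for illustration only.
top_two = top_n({"服务好": 12, "物流快": 30, "包装差": 7}, n=2)
```

`heapq.nlargest` returns the selected items already in descending order, so no separate sort is needed.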
10. The server system according to claim 9, characterized in that the process further comprises:
in step 5, performing emotion word filtering on each candidate label to generate the filtered candidate tag library, whose data structure comprises the candidate label character string and the candidate label string vector;
then, in step 6, inputting the candidate labels to the DBSCAN clustering algorithm for the clustering operation: starting from the first candidate label, calculating, with the cosine similarity algorithm, the similarity between the selected candidate label and all other candidate labels in the candidate tag library, comparing each similarity value with a preset similarity threshold, and determining the set of labels whose similarity is greater than the threshold; then, if the label set is determined to be larger than the defined minimum number of neighbors, counting the number of labels in the label set as the magnitude of this label, and otherwise ending;
then, repeating the above clustering calculation until clustering of all candidate labels is finished;
finally, sorting the clustering results in descending order of quantity according to the obtained labels and their magnitudes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910163711.3A CN109871447A (en) | 2019-03-05 | 2019-03-05 | Clustering method, computer program product and the server system of Chinese comment unsupervised learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910163711.3A CN109871447A (en) | 2019-03-05 | 2019-03-05 | Clustering method, computer program product and the server system of Chinese comment unsupervised learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109871447A true CN109871447A (en) | 2019-06-11 |
Family
ID=66919802
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910163711.3A Pending CN109871447A (en) | 2019-03-05 | 2019-03-05 | Clustering method, computer program product and the server system of Chinese comment unsupervised learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109871447A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110750646A (en) * | 2019-10-16 | 2020-02-04 | 乐山师范学院 | Attribute description extracting method for hotel comment text |
CN110928981A (en) * | 2019-11-18 | 2020-03-27 | 佰聆数据股份有限公司 | Method, system and storage medium for establishing and perfecting iteration of text label system |
CN112148881A (en) * | 2020-10-22 | 2020-12-29 | 北京百度网讯科技有限公司 | Method and apparatus for outputting information |
CN112184323A (en) * | 2020-10-13 | 2021-01-05 | 上海风秩科技有限公司 | Evaluation label generation method and device, storage medium and electronic equipment |
CN112579738A (en) * | 2020-12-23 | 2021-03-30 | 广州博冠信息科技有限公司 | Target object label processing method, device, equipment and storage medium |
CN112818660A (en) * | 2021-01-26 | 2021-05-18 | 山西三友和智慧信息技术股份有限公司 | Product description generation method based on user evaluation |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462132A (en) * | 2013-09-23 | 2015-03-25 | 华为技术有限公司 | Comment information display method and device |
CN105550269A (en) * | 2015-12-10 | 2016-05-04 | 复旦大学 | Product comment analyzing method and system with learning supervising function |
WO2016159453A1 (en) * | 2015-03-27 | 2016-10-06 | 주식회사 비주얼다이브 | Method for providing social activity integration service |
CN107633007A (en) * | 2017-08-09 | 2018-01-26 | 五邑大学 | A kind of comment on commodity data label system and method based on stratification AP clusters |
CN108009228A (en) * | 2017-11-27 | 2018-05-08 | 咪咕互动娱乐有限公司 | A kind of method to set up of content tab, device and storage medium |
CN108363725A (en) * | 2018-01-08 | 2018-08-03 | 浙江大学 | A kind of method of the extraction of user comment viewpoint and the generation of viewpoint label |
CN109255027A (en) * | 2018-08-27 | 2019-01-22 | 上海宝尊电子商务有限公司 | A kind of method and apparatus of electric business comment sentiment analysis noise reduction |
-
2019
- 2019-03-05 CN CN201910163711.3A patent/CN109871447A/en active Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462132A (en) * | 2013-09-23 | 2015-03-25 | 华为技术有限公司 | Comment information display method and device |
WO2016159453A1 (en) * | 2015-03-27 | 2016-10-06 | 주식회사 비주얼다이브 | Method for providing social activity integration service |
CN105550269A (en) * | 2015-12-10 | 2016-05-04 | 复旦大学 | Product comment analyzing method and system with learning supervising function |
CN107633007A (en) * | 2017-08-09 | 2018-01-26 | 五邑大学 | A kind of comment on commodity data label system and method based on stratification AP clusters |
CN108009228A (en) * | 2017-11-27 | 2018-05-08 | 咪咕互动娱乐有限公司 | A kind of method to set up of content tab, device and storage medium |
CN108363725A (en) * | 2018-01-08 | 2018-08-03 | 浙江大学 | A kind of method of the extraction of user comment viewpoint and the generation of viewpoint label |
CN109255027A (en) * | 2018-08-27 | 2019-01-22 | 上海宝尊电子商务有限公司 | A kind of method and apparatus of electric business comment sentiment analysis noise reduction |
Non-Patent Citations (3)
Title |
---|
SHIJING888: "CommentsMining", 《HTTPS://GITHUB.COM/SHIJING888/COMMENTSMINING》 * |
李丕绩;马军;张冬梅;韩晓辉: "用户评论中的标签抽取以及排序", 《中文信息学报》 * |
火贪三刀: "用户评论标签的抽取", 《HTTPS://BLOG.CSDN.NET/SHIJING_0214/ARTICLE/DETAILS/71036808/》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110750646A (en) * | 2019-10-16 | 2020-02-04 | 乐山师范学院 | Attribute description extracting method for hotel comment text |
CN110928981A (en) * | 2019-11-18 | 2020-03-27 | 佰聆数据股份有限公司 | Method, system and storage medium for establishing and perfecting iteration of text label system |
CN112184323A (en) * | 2020-10-13 | 2021-01-05 | 上海风秩科技有限公司 | Evaluation label generation method and device, storage medium and electronic equipment |
CN112148881A (en) * | 2020-10-22 | 2020-12-29 | 北京百度网讯科技有限公司 | Method and apparatus for outputting information |
CN112148881B (en) * | 2020-10-22 | 2023-09-22 | 北京百度网讯科技有限公司 | Method and device for outputting information |
CN112579738A (en) * | 2020-12-23 | 2021-03-30 | 广州博冠信息科技有限公司 | Target object label processing method, device, equipment and storage medium |
CN112818660A (en) * | 2021-01-26 | 2021-05-18 | 山西三友和智慧信息技术股份有限公司 | Product description generation method based on user evaluation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109871447A (en) | Clustering method, computer program product and the server system of Chinese comment unsupervised learning | |
Tan et al. | RoBERTa-LSTM: a hybrid model for sentiment analysis with transformer and recurrent neural network | |
CN107992543B (en) | Question-answer interaction method and device, computer equipment and computer readable storage medium | |
Schreuder et al. | Prefix stripping re-revisited | |
CN111125360B (en) | Emotion analysis method and device in game field and model training method and device thereof | |
Wang et al. | Kga: A general machine unlearning framework based on knowledge gap alignment | |
CN112559684A (en) | Keyword extraction and information retrieval method | |
CN107807960A (en) | Intelligent customer service method, electronic installation and computer-readable recording medium | |
CN103049490B (en) | Between knowledge network node, attribute generates system and the method for generation | |
CN109344403A (en) | A kind of document representation method of enhancing semantic feature insertion | |
CN111198946A (en) | Network news hotspot mining method and device | |
CN109635275A (en) | Literature content retrieval and recognition methods and device | |
Pellegrini et al. | Exploiting Food Embeddings for Ingredient Substitution. | |
CN106156005B (en) | Based on visual classic poetry characteristic analysis method | |
CN110399603A (en) | A kind of text-processing technical method and system based on sense-group division | |
CN105488206B (en) | A kind of Android application evolution recommended method based on crowdsourcing | |
CN115994535A (en) | Text processing method and device | |
Ghaddar et al. | Revisiting pre-trained language models and their evaluation for arabic natural language understanding | |
Tran et al. | CovRelex: A COVID-19 retrieval system with relation extraction | |
CN108595426A (en) | Term vector optimization method based on Chinese character pattern structural information | |
CN113204643B (en) | Entity alignment method, device, equipment and medium | |
Golubev et al. | Transfer learning for improving results on Russian sentiment datasets | |
CN109213988A (en) | Barrage subject distillation method, medium, equipment and system based on N-gram model | |
Anantharaman et al. | SSN_MLRG1@ LT-EDI-ACL2022: Multi-class classification using BERT models for detecting depression signs from social media text | |
CN107590163B (en) | The methods, devices and systems of text feature selection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190611 |