CN106844743B - Emotion classification method and device for Uygur language text - Google Patents

Emotion classification method and device for Uygur language text

Publication number
CN106844743B
Authority
CN
China
Prior art keywords
text
emotion
text set
uygur
clusters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710080052.8A
Other languages
Chinese (zh)
Other versions
CN106844743A (en)
Inventor
李响
陈建新
崔力民
马宗达
运凯
景康
赵忠浩
任晴晴
曹进平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Xinjiang Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Xinjiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC and Information and Telecommunication Branch of State Grid Xinjiang Electric Power Co Ltd
Priority to CN201710080052.8A
Publication of CN106844743A
Application granted
Publication of CN106844743B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention discloses an emotion classification method and device for Uygur language texts. The method comprises the following steps: acquiring a plurality of Uygur language texts; splitting the plurality of Uygur language texts to obtain a first text set and a second text set, wherein the first text set comprises a first number of Uygur language texts, the second text set comprises a second number of Uygur language texts, and the first number is less than the second number; generating an emotion classifier based on the first text set and the corresponding labeling information; and performing emotion classification on the second text set by using the emotion classifier to obtain an emotion classification result. The invention solves the technical problem that prior-art emotion classification methods for Uygur language texts require a large number of Uygur language texts to be labeled with emotions manually, resulting in long processing time and low processing efficiency.

Description

Emotion classification method and device for Uygur language text
Technical Field
The invention relates to the field of network public opinion analysis for ethnic minority languages, and in particular to an emotion classification method and device for Uygur language texts.
Background
With the rapid development of the Internet, netizens around the world can obtain or publish information online at any time, and subjective texts spread widely across the network. Texts that carry the public's subjective opinions strongly influence network public opinion and social public opinion, so studying them in depth is of great significance.
The ethnic minorities of China live widely dispersed overall while concentrated in small communities, and many of them have their own languages. To understand the voices of ethnic minorities, research on their languages plays an important role in matters of ethnic unity and social disputes. However, because many ethnic minorities have their own languages, large-scale research on network public opinion is difficult to carry out, and such research for the ethnic minorities in Xinjiang is still at a preliminary stage. Because Uygur language content spreads widely in network public opinion and manual processing is time-consuming and labor-intensive, there is considerable room for automated emotion analysis.
Supervised learning, which is commonly used for emotion classification, requires manually labeled corpus; without a sufficient amount of labeled corpus, classifier performance degrades. Existing ethnic minority languages have very little labeled corpus, which inevitably limits supervised learning.
For the problem that prior-art emotion classification methods for Uygur language texts require a large number of Uygur language texts to be labeled with emotions manually, with long processing time and low processing efficiency, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the invention provide an emotion classification method and device for Uygur language texts, to at least solve the technical problem that prior-art emotion classification methods for Uygur language texts require a large number of Uygur language texts to be labeled with emotions manually, with long processing time and low processing efficiency.
According to an aspect of the embodiments of the present invention, there is provided an emotion classification method for Uygur language texts, including: acquiring a plurality of Uygur language texts; splitting the plurality of Uygur language texts to obtain a first text set and a second text set, wherein the first text set comprises a first number of Uygur language texts, the second text set comprises a second number of Uygur language texts, and the first number is less than the second number; generating an emotion classifier based on the first text set and the corresponding labeling information; and performing emotion classification on the second text set by using the emotion classifier to obtain an emotion classification result.
Further, splitting the plurality of Uygur language texts to obtain a first text set and a second text set comprises: screening a plurality of Uygur language texts based on a preset screening strategy to obtain a first text set; a second text set is obtained based on Uygur language texts other than the first text set.
Further, screening the plurality of Uygur language texts based on the preset screening strategy to obtain the first text set comprises: screening out, from the plurality of Uygur language texts, texts containing emotion words of a preset type to obtain the first text set.
Further, before the emotion classifier is used for performing emotion classification on the second text set to obtain an emotion classification result, the method further includes: aggregating the second text set by using k-means clustering to obtain a plurality of first set clusters; performing hierarchical clustering on the plurality of first set clusters to obtain a plurality of second set clusters; and carrying out emotion classification on the plurality of second set clusters by using an emotion classifier to obtain emotion classification results.
Further, aggregating the second text set using k-means clustering to obtain a plurality of first set clusters includes: acquiring a preset number of initial set clusters from the second text set; and calculating the distance between each sample point in the second text set and the center point of each initial set cluster to obtain a plurality of first set clusters.
Further, calculating the distance between each sample point in the second text set and the center point of each initial set cluster to obtain a plurality of first set clusters comprises: step A1, calculating the distance between the current sample point and the center point of each initial set cluster; step A2, storing the current sample point into the corresponding initial set cluster according to the distance between the current sample point and the central point of each initial set cluster to obtain a plurality of new initial set clusters; step A3, calculating the center point of each new initial set cluster; and step A4, taking the next sample point of the current sample point as the current sample point, and executing the steps A1 to A3 in a circulating way until all the sample points in the second text set are classified to obtain a plurality of first set clusters.
Further, performing hierarchical clustering on the plurality of first set clusters to obtain a plurality of second set clusters includes: step B1, calculating the distance between any two first set clusters to obtain a plurality of first distances; step B2, acquiring two first set clusters corresponding to the minimum distance from the plurality of first distances; step B3, merging the two first set clusters to obtain a plurality of new first set clusters; and step B4, circularly executing the step B1 to the step B3 until the number of the plurality of new first set clusters is the same as the preset layer number, and obtaining a plurality of second set clusters.
Further, performing emotion classification on the second text set by using an emotion classifier, and obtaining an emotion classification result includes: performing emotion classification on the second text set by using an emotion classifier to obtain a first probability; performing emotion classification on the second text set to obtain a second probability; and calculating the product of the first probability and the second probability to obtain the maximum posterior probability of the second text set.
Further, acquiring the plurality of Uygur language texts comprises: crawling the plurality of Uygur language texts with a web crawler.
Further, after obtaining the plurality of pieces of Uygur language text, the method further comprises: preprocessing a plurality of Uygur language texts to obtain a plurality of processed Uygur language texts; and splitting the processed multiple Uygur texts to obtain a first text set and a second text set.
According to another aspect of the embodiments of the present invention, there is also provided an emotion classification device for Uygur language texts, including: an acquisition module for acquiring a plurality of Uygur language texts; a splitting module for splitting the plurality of Uygur language texts to obtain a first text set and a second text set, wherein the first text set comprises a first number of Uygur language texts, the second text set comprises a second number of Uygur language texts, and the first number is less than the second number; a generating module for generating an emotion classifier based on the first text set and the corresponding labeling information; and a classification module for performing emotion classification on the second text set by using the emotion classifier to obtain an emotion classification result.
In the embodiments of the invention, a plurality of Uygur language texts are acquired, the Uygur language texts are split to obtain a first text set and a second text set, an emotion classifier is generated based on the first text set and the corresponding labeling information, and the emotion classifier is used to perform emotion classification on the second text set to obtain an emotion classification result. It is easy to see that a small amount of corpus that nevertheless reflects the classification result can be screened out for manual labeling, so that a better classifier is generated; through unsupervised emotion analysis, unordered corpus can quickly find its own category and the characteristics of each category can be delineated accurately. Therefore, with the scheme provided by the embodiments of the invention, supervised learning can suppress emotion imbalance to a certain extent and lays an effective foundation for semi-supervised learning, thereby improving the accuracy of Uygur emotion analysis, saving manpower and reducing the required corpus size.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of a method for emotion classification of Uygur text according to an embodiment of the present invention;
FIG. 2(a) is a diagram of an alternative emotion classification result according to an embodiment of the present invention;
FIG. 2(b) is a schematic diagram of an alternative emotion classification result according to an embodiment of the present invention;
FIG. 3 is a flow diagram of an alternative method for emotion classification of Uygur text in accordance with embodiments of the present invention; and
FIG. 4 is a diagram of an emotion classification apparatus for Uygur texts according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In accordance with an embodiment of the present invention, an embodiment of an emotion classification method for Uygur language texts is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, such as one executing a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
FIG. 1 is a flowchart of an emotion classification method for Uygur language texts according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
Step S102, acquiring a plurality of Uygur language texts.
Specifically, the Uygur language texts may be corpus that has not been manually labeled.
In an optional scheme, Uygur language content posted by Uygur users can be obtained from the network; for example, Uygur language texts can be obtained from microblogs posted by Uygur users. The Uygur language texts obtained from the network have not been manually labeled.
Step S104, splitting a plurality of Uygur language texts to obtain a first text set and a second text set, wherein the first text set comprises: a first quantity of Uygur language text, the second set of text comprising: a second number of Uygur language texts, the first number being less than the second number.
In an optional scheme, because labeled Uygur corpus is difficult to obtain, a small portion of the large amount of unlabeled corpus obtained can be screened out as the first text set, i.e. the training text set. Only the corpus in the training text set is labeled manually, which reduces manual work, and the remaining majority of the corpus serves as the second text set, i.e. the test text set.
Step S106, generating an emotion classifier based on the first text set and the corresponding labeling information.
Specifically, the labeling information may be a manually labeled emotion type.
In an optional scheme, the small screened-out portion of the corpus can be labeled manually, that is, the emotion type of each sample in the first text set is labeled manually, and the labeled corpus is then used to build an emotion classifier by supervised learning.
Step S108, performing emotion classification on the second text set by using the emotion classifier to obtain an emotion classification result.
In an optional scheme, the emotion classifier can be used for unsupervised learning on the remaining large amount of unlabeled corpus, that is, on the texts in the second text set, so that the large number of Uygur language texts obtained are classified by emotion.
According to the embodiments of the invention, a plurality of Uygur language texts are acquired, the Uygur language texts are split to obtain a first text set and a second text set, an emotion classifier is generated based on the first text set and the corresponding labeling information, and the emotion classifier is used to perform emotion classification on the second text set to obtain an emotion classification result. It is easy to see that a small amount of corpus that nevertheless reflects the classification result can be screened out for manual labeling, so that a better classifier is generated; through unsupervised emotion analysis, unordered corpus can quickly find its own category and the characteristics of each category can be delineated accurately. Therefore, with the scheme provided by the embodiments of the invention, supervised learning can suppress emotion imbalance to a certain extent and lays an effective foundation for semi-supervised learning, thereby improving the accuracy of Uygur emotion analysis, saving manpower and reducing the required corpus size.
Optionally, in the foregoing embodiment of the present invention, step S104 of splitting the plurality of Uygur language texts to obtain the first text set and the second text set includes:
Step S1042, screening the plurality of Uygur language texts based on a preset screening strategy to obtain the first text set.
Specifically, the preset screening strategy may include several active learning strategies, such as classification uncertainty, sample difference and cluster representativeness. Screening by classification uncertainty selects corpus whose emotion type is hard to determine. Screening by sample difference selects corpus whose emotion types differ strongly, for example corpus with a positive emotion type and corpus with a negative emotion type. Screening by cluster representativeness selects corpus that is typical of an emotion type; for example, when screening for corpus with a positive emotion type, corpus containing the word 'like' can be selected, because such corpus is common and representative of the positive emotion type.
In an optional scheme, active learning combined with supervised learning generally iterates several times: in each iteration, training texts to be labeled manually are screened out and added to the training text set, i.e. the first text set, and a strategy is then chosen again for the next round of screening. The strategy may change from one iteration to the next depending on the data in the current first text set, and iteration stops once the first text set reaches a given corpus size.
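The classification-uncertainty strategy can be illustrated with a minimal sketch. It assumes a binary positive/negative setting and a hypothetical scoring function `positive_probability` standing in for the current classifier; neither the function name nor the 0.5 threshold comes from the patent text.

```python
from typing import Callable, List, Sequence


def select_uncertain(
    texts: Sequence[str],
    positive_probability: Callable[[str], float],
    batch_size: int,
) -> List[str]:
    """Return the batch_size texts whose predicted probability is closest to 0.5."""
    # Assumption: a score near 0.5 means the classifier cannot decide the emotion type.
    ranked = sorted(texts, key=lambda t: abs(positive_probability(t) - 0.5))
    return list(ranked[:batch_size])


# Hypothetical scores standing in for a real classifier's output.
scores = {"t1": 0.95, "t2": 0.52, "t3": 0.10, "t4": 0.45}
picked = select_uncertain(list(scores), scores.get, batch_size=2)
print(picked)
```

The selected texts would then be labeled manually and added to the first text set before the next iteration.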
Step S1044, obtaining the second text set from the Uygur language texts other than the first text set.
In an optional scheme, after the corpus in the first text set is determined, the corpus that was not screened out from the large amount of unlabeled corpus obtained may be used as the test text set, i.e. the second text set.
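Steps S1042 and S1044 can be sketched as a single pass over the corpus. The predicate `is_selected` and the cap `max_first` are hypothetical names introduced for illustration; the patent only requires that the first set be smaller than the second.

```python
from typing import Callable, List, Sequence, Tuple


def split_corpus(
    texts: Sequence[str],
    is_selected: Callable[[str], bool],
    max_first: int,
) -> Tuple[List[str], List[str]]:
    """Split texts into a small first set (to be labeled) and a large second set."""
    first: List[str] = []
    second: List[str] = []
    for text in texts:
        # A text joins the first set only if the screening strategy selects it
        # and the first set has not yet reached its size cap.
        if len(first) < max_first and is_selected(text):
            first.append(text)
        else:
            second.append(text)
    return first, second


# Placeholder texts; the marker word "good" stands in for a preset emotion word.
corpus = ["a good", "b", "c good", "d", "e good", "f"]
first_set, second_set = split_corpus(corpus, lambda t: "good" in t, max_first=2)
print(first_set, second_set)
```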
Optionally, in the foregoing embodiment of the present invention, step S1042 of screening the plurality of Uygur language texts based on the preset screening strategy to obtain the first text set includes:
Step S10422, screening out, from the plurality of Uygur language texts, texts containing emotion words of a preset type to obtain the first text set.
Specifically, because Uygur is an agglutinative language, emotion words can change their original meaning when affixes are attached. A change that only slightly alters the degree of the emotion word is tolerable, but a negative affix reverses the meaning entirely. For example, "I do not like" is written as
[Uygur text, image BDA0001225451770000061]
and "I do not want to smile" is written as
[Uygur text, image BDA0001225451770000062]
These two sentences show that a word with positive emotion takes on the opposite emotion once a negative affix is attached. In both cases the negative affix is attached to the original word. Negation can also be expressed by adding a separate negative word, for example
[Uygur text, image BDA0001225451770000063]
where two Uygur words are used to express negation.
In an optional scheme, corpus containing negative affixes or negative words can be screened out of the large amount of Uygur corpus obtained and stored in the first text set. That is, emotion words carrying negative affixes can be extracted before the vector space is constructed, and their vectors constructed separately.
Optionally, in the foregoing embodiment of the present invention, before step S108 of performing emotion classification on the second text set by using the emotion classifier to obtain the emotion classification result, the method further includes:
Step S110, aggregating the second text set by using k-means clustering to obtain a plurality of first set clusters.
Step S112, performing hierarchical clustering on the plurality of first set clusters to obtain a plurality of second set clusters.
Step S114, performing emotion classification on the plurality of second set clusters by using the emotion classifier to obtain the emotion classification result.
In an optional scheme, k-means clustering may first be performed on the data in the second text set to generate k first set clusters. The k first set clusters are then merged by bottom-up agglomerative hierarchical clustering: at each step the two closest (most similar) set clusters are merged into a new set cluster, so that k clusters become k-1, and so on, until the plurality of second set clusters is obtained. Finally, the emotion classifier built by supervised learning is used to classify the second set clusters by emotion to obtain the classification result.
Through steps S110 to S114, the second text set can be clustered by agglomerative hierarchical k-means.
Optionally, in the foregoing embodiment of the present invention, in step S110, aggregating the second text set by using k-means clustering to obtain a plurality of first set clusters includes:
Step S1102, obtaining a preset number of initial set clusters from the second text set.
Specifically, the preset number may be the preset number k of target clusters in the k-means clustering algorithm.
Step S1104, calculating the distance between each sample point in the second text set and the center point of each initial set cluster to obtain a plurality of first set clusters.
In an optional scheme, k-means clustering can be performed over all the corpus in the second text set: k initial points are first chosen from the corpus and used as the initial set clusters of the k-means clustering, each remaining corpus item is assigned to the initial point it is closest to, and k first set clusters are obtained once all the corpus has been assigned.
Optionally, in the foregoing embodiment of the present invention, in step S1104, calculating a distance between each sample point in the second text set and a center point of each initial set cluster, and obtaining a plurality of first set clusters includes:
Step A1, calculating the distance between the current sample point and the center point of each initial set cluster.
Step A2, storing the current sample point into the corresponding initial set cluster according to the distance between the current sample point and the center point of each initial set cluster, to obtain a plurality of new initial set clusters.
Step A3, calculating the center point of each new initial set cluster.
Step A4, taking the next sample point after the current sample point as the current sample point, and repeating steps A1 to A3 until all the sample points in the second text set are classified, to obtain a plurality of first set clusters.
In an optional scheme, after k initial points are found in the corpus, they are used as the initial set clusters of the k-means clustering. The distance between each sample and the k initial points is calculated in turn, and the sample is placed in the set cluster of the closest point, so that the initial point grows into a new set cluster. The center point of the updated set cluster is then recalculated and used in the next distance calculation. While unclassified samples remain, the operation is repeated; once all the corpus has been classified, the final k set clusters are obtained.
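Steps A1 to A4 describe a single-pass assignment with a center update after each sample. The sketch below follows that loop under two assumptions not stated in the patent: texts are already represented as numeric vectors, and the first k samples serve as the initial points. Euclidean distance is also an assumption.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def mean_point(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def sequential_kmeans(samples: Sequence[Point], k: int) -> List[List[Point]]:
    """Assign samples one at a time to the nearest center, updating that center."""
    # Assumption: the first k samples are used as the initial set clusters.
    clusters: List[List[Point]] = [[p] for p in samples[:k]]
    centers: List[Point] = list(samples[:k])
    for point in samples[k:]:
        # Step A1: distance from the current sample to every center.
        nearest = min(range(k), key=lambda i: dist(point, centers[i]))
        # Step A2: store the sample in the closest set cluster.
        clusters[nearest].append(point)
        # Step A3: only the cluster that changed needs its center recomputed.
        centers[nearest] = mean_point(clusters[nearest])
        # Step A4: the loop moves on to the next sample.
    return clusters


samples = [(0.0, 0.0), (10.0, 10.0), (1.0, 0.0), (9.0, 10.0), (0.0, 1.0)]
first_set_clusters = sequential_kmeans(samples, k=2)
print(first_set_clusters)
```

Unlike textbook k-means, this loop visits each sample once and does not reassign earlier samples; that matches the steps as written but may give different clusters from an iterated implementation.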
Optionally, in the foregoing embodiment of the present invention, in step S112, performing hierarchical clustering on the plurality of first set clusters to obtain a plurality of second set clusters includes:
Step B1, calculating the distance between any two first set clusters to obtain a plurality of first distances.
Step B2, obtaining, from the plurality of first distances, the two first set clusters corresponding to the minimum distance.
Step B3, merging the two first set clusters to obtain a plurality of new first set clusters.
Step B4, repeating steps B1 to B3 until the number of new first set clusters equals the preset layer number, to obtain a plurality of second set clusters.
Specifically, the preset layer number may be the preset layer number N of the hierarchical clustering method.
In an optional scheme, hierarchical clustering may be performed on the k set clusters already obtained. That is, each existing set cluster is treated as a single class, the distance between every pair of classes is calculated, and the closest pair, i.e. the most similar pair, is found and merged into a new set cluster, giving a plurality of new first set clusters. Whether the number of new first set clusters equals the layer number N is then checked: if it equals N, the final second set clusters are obtained; if not, the process is repeated.
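Steps B1 to B4 can be sketched as a bottom-up merge loop. The patent does not say how the distance between two set clusters is measured; the sketch assumes Euclidean distance between cluster centroids, and the names `merge_clusters` and `target_count` are hypothetical.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def centroid(cluster: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty cluster."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))


def merge_clusters(
    clusters: Sequence[Sequence[Point]], target_count: int
) -> List[List[Point]]:
    """Merge the two closest clusters repeatedly until target_count remain."""
    if target_count < 1:
        raise ValueError("target_count must be at least 1")
    merged = [list(c) for c in clusters]
    while len(merged) > target_count:
        # Steps B1 and B2: find the pair with the minimum centroid distance.
        i, j = min(
            combinations(range(len(merged)), 2),
            key=lambda pair: dist(centroid(merged[pair[0]]), centroid(merged[pair[1]])),
        )
        # Step B3: merge the pair (i < j, so deleting j leaves index i valid).
        merged[i].extend(merged[j])
        del merged[j]
        # Step B4: the loop condition checks the count against the preset number N.
    return merged


first_clusters = [[(0.0, 0.0), (1.0, 0.0)], [(2.0, 0.0)], [(10.0, 10.0)]]
second_clusters = merge_clusters(first_clusters, target_count=2)
print(second_clusters)
```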
Optionally, in the foregoing embodiment of the present invention, in step S108, performing emotion classification on the second text set by using an emotion classifier, and obtaining an emotion classification result includes:
Step S1082, performing emotion classification on the second text set by using the emotion classifier to obtain a first probability.
Step S1084, performing emotion classification on the second text set to obtain a second probability.
Step S1086, calculating the product of the first probability and the second probability to obtain the maximum posterior probability of the second text set.
It should be noted that in FIG. 2(a) and FIG. 2(b), the '+' sign denotes corpus with a positive emotion type, the '-' sign denotes corpus with a negative emotion type, and a third marker denotes unlabeled corpus that needs emotion classification. As shown in FIG. 2(a), manual labeling places the unlabeled corpus in the same category as the '+' corpus, i.e. it is classified as positive. As shown in FIG. 2(b), the emotion classifier places the unlabeled corpus in the same category as the '-' corpus, i.e. it is classified as negative. The emotion classifier therefore needs to be optimized to improve classification accuracy.
In an optional scheme, semi-supervised processing may combine the supervised classification with the unsupervised analysis. Assuming that the labeled and unlabeled data together follow a mixture distribution with N components, the emotion classifier built from the manually labeled data performs emotion classification on the second text set to obtain a first probability P(y | Nj, x), a second probability P(Nj | x) is estimated from the unlabeled data, and the first probability P(y | Nj, x) is multiplied by the second probability P(Nj | x) to obtain the maximum posterior probability, where Nj denotes the j-th mixture component.
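The product of the two probabilities can be illustrated numerically. The patent states only the product P(y | Nj, x) P(Nj | x); summing that product over the mixture components before taking the most probable label is the usual mixture-model formulation and is an assumption here, as are the function name and the example numbers.

```python
from typing import Dict, Sequence


def map_label(
    class_given_component: Sequence[Dict[str, float]],
    component_given_x: Sequence[float],
) -> str:
    """Pick the label y maximizing sum_j P(y | Nj, x) * P(Nj | x)."""
    totals: Dict[str, float] = {}
    for class_probs, weight in zip(class_given_component, component_given_x):
        for label, prob in class_probs.items():
            # First probability times second probability, accumulated per label.
            totals[label] = totals.get(label, 0.0) + prob * weight
    return max(totals, key=totals.get)


# Hypothetical values for one text x and two mixture components.
p_y_given_component = [
    {"positive": 0.9, "negative": 0.1},  # P(y | N1, x)
    {"positive": 0.2, "negative": 0.8},  # P(y | N2, x)
]
p_component_given_x = [0.25, 0.75]       # P(N1 | x), P(N2 | x)
label = map_label(p_y_given_component, p_component_given_x)
print(label)
```

With these numbers the positive score is 0.9 × 0.25 + 0.2 × 0.75 = 0.375 and the negative score is 0.1 × 0.25 + 0.8 × 0.75 = 0.625, so the text is assigned the negative emotion type.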
Optionally, in the above embodiment of the present invention, in step S102, the obtaining a plurality of uygur texts includes:
in step S1022, a plurality of uygur texts are obtained by crawling by the web crawler.
In an alternative scheme, a plurality of Uygur language texts can be obtained by crawling the Guduk microblog by using crawler software.
Optionally, in the above embodiment of the present invention, after step S102 of obtaining a plurality of Uygur language texts, the method further includes:
Step S116, preprocessing the plurality of Uygur language texts to obtain a plurality of processed Uygur language texts.
Step S118, splitting the plurality of processed Uygur language texts to obtain a first text set and a second text set.
In an alternative scheme, non-text content such as images and domain names can be removed from the Uygur language texts captured by the crawler software, so that only the microblog text published by users, including comments, is retained.
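Steps S116 and S118 can be sketched as follows, assuming that links appear as `http://`, `https://` or `www.` tokens and that images appear as bracketed placeholders such as `[img:42]`. Both patterns, the function names `preprocess` and `split_corpus`, and the placeholder posts are assumptions made for illustration; the embodiment does not specify the crawler's output format or how the split is chosen.

```python
import re

# Assumed patterns for non-text content left in crawled posts.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
IMG_RE = re.compile(r"\[(?:img|image)[^\]]*\]", re.IGNORECASE)


def preprocess(posts):
    """Step S116: strip links and image placeholders, drop empty posts."""
    cleaned = []
    for post in posts:
        text = URL_RE.sub(" ", post)
        text = IMG_RE.sub(" ", text)
        text = " ".join(text.split())  # collapse runs of whitespace
        if text:
            cleaned.append(text)
    return cleaned


def split_corpus(texts, first_count):
    """Step S118: the first set must be smaller than the second set."""
    if not 0 < first_count < len(texts) - first_count:
        raise ValueError("first set must be non-empty and smaller than second set")
    return texts[:first_count], texts[first_count:]


# Placeholder posts standing in for crawled microblog text.
raw_posts = [
    "good film http://example.com/p/1",
    "[img:42]",
    "  bad   service www.example.com ",
    "fine",
    "so so",
]
cleaned = preprocess(raw_posts)
first_set, second_set = split_corpus(cleaned, 1)
print(first_set, second_set)
```

Here the split simply takes the leading texts as the first set; a screening strategy such as the emotion-word filter described elsewhere in the document could replace that slice.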
Fig. 3 is a flowchart of an optional emotion classification method for Uygur language text according to an embodiment of the present invention, and a preferred embodiment of the present invention is described in detail below with reference to fig. 3. As shown in fig. 3, a web crawler first crawls comments in a microblog to collect unlabeled corpus, part of the corpus is screened out for manual labeling, and the emotion types in the labeled corpus are balanced. An emotion classifier with supervised learning, namely the SVM classifier in fig. 3, is then established by using this small labeled part of the corpus, and the remaining unlabeled corpus is subjected to unsupervised learning. By combining the unsupervised learning method with the supervised learning method, emotion classification of Uygur language text can be carried out effectively.
Semi-supervised learning lies between supervised learning and unsupervised learning. It can select, from a large amount of unlabeled corpus, the parts that are most worth labeling manually; for example, an active learning strategy can select the corpus that is most valuable for training and use it to construct the classifier. Supervised learning and unsupervised learning can also be combined, with one assisting the other.
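The text above does not say which active learning strategy is used. As one common possibility, uncertainty sampling sends the texts on which the current classifier is least certain to manual labeling. The sketch below assumes that each unlabeled text already has a signed decision value from the classifier (for an SVM, the distance to the separating hyperplane); the function name `select_for_labeling`, the identifiers, and the values are hypothetical.

```python
def select_for_labeling(scores, k):
    """Pick the k texts whose decision values are closest to zero.

    scores maps a text identifier to a signed decision value; values near
    zero mean the classifier is least certain about that text.
    """
    # Sort by absolute decision value; break ties by identifier.
    ranked = sorted(scores, key=lambda text_id: (abs(scores[text_id]), text_id))
    return ranked[:k]


# Illustrative decision values for five unlabeled texts.
scores = {"t1": 2.1, "t2": -0.05, "t3": 0.4, "t4": -1.7, "t5": 0.1}
picked = select_for_labeling(scores, 2)
print(picked)
```

The selected texts would then be labeled manually and added to the training set before the classifier is retrained.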
Through the scheme provided by the embodiment of the invention, supervised learning can suppress the emotion imbalance phenomenon to a certain extent; through the emotion analysis of unsupervised learning, unordered corpus can quickly be assigned to its own category and the characteristics of each category can be accurately delineated, which lays the groundwork for semi-supervised learning. Therefore, emotion analysis of Uygur language based on semi-supervised learning can accurately classify the emotion of microblogs, effectively save labor, and reduce the scale of the corpus.
Example 2
According to an embodiment of the present invention, an embodiment of an emotion classification apparatus for Uygur language text is provided.
Fig. 4 is a schematic diagram of an emotion classification apparatus for a Uygur language text according to an embodiment of the present invention, as shown in fig. 4, the apparatus including:
an obtaining module 41, configured to obtain multiple pieces of Uygur language texts.
Specifically, the aforementioned Uygur language text may be a corpus that is not labeled manually.
In an alternative scheme, Uygur language information published by Uygur people can be obtained from the network; for example, Uygur language texts can be obtained from microblogs published by Uygur people. The Uygur language texts obtained from the network have not been labeled manually.
A dividing module 43, configured to split the plurality of Uygur language texts to obtain a first text set and a second text set, where the first text set includes a first number of Uygur language texts, the second text set includes a second number of Uygur language texts, and the first number is less than the second number.
In an optional scheme, because labeled Uygur corpus is difficult to obtain, a small part of the large amount of obtained unlabeled corpus can be screened out as the first text set, namely the training text set. Only the corpus in the training text set is labeled manually, which reduces manual work, and most of the remaining corpus is used as the second text set, namely the test text set.
And a generating module 45, configured to generate an emotion classifier based on the first text set and the corresponding labeling information.
Specifically, the annotation information may be an emotion type manually annotated.
In an optional scheme, the small part of the corpus that was screened out can be labeled manually, that is, the texts in the first text set are labeled manually so that the emotion type of each sample in the first text set is marked, and the labeled corpus is then used to establish an emotion classifier with supervised learning.
And the classification module 47 is configured to perform emotion classification on the second text set by using an emotion classifier to obtain an emotion classification result.
In an optional scheme, an emotion classifier can be used for performing unsupervised learning on the remaining large amount of unlabeled corpora, that is, the emotion classifier is used for performing unsupervised learning on the texts in the second text set, so that the obtained large amount of Uygur texts are subjected to emotion classification.
According to the embodiment of the invention, a plurality of Uygur language texts are obtained, the plurality of Uygur language texts are split to obtain a first text set and a second text set, an emotion classifier is generated based on the first text set and the corresponding labeling information, and the emotion classifier is used to perform emotion classification on the second text set to obtain an emotion classification result. It is easy to notice that a small amount of corpus that is nevertheless representative of the classification result can be manually screened out for labeling, so that a better classifier is generated; through the emotion analysis of unsupervised learning, unordered corpus can quickly be assigned to its own category, and the characteristics of each category can be accurately delineated. Therefore, through the scheme provided by the embodiment of the invention, supervised learning can suppress the emotion imbalance phenomenon to a certain extent and the groundwork for semi-supervised learning is laid, thereby improving the accuracy of emotion analysis of Uygur language, effectively saving labor, and reducing the scale of the corpus.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program code.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. An emotion classification method for Uygur language texts is characterized by comprising the following steps:
acquiring a plurality of Uygur language texts;
splitting the plurality of Uygur language texts to obtain a first text set and a second text set, wherein the first text set comprises a first number of Uygur language texts, the second text set comprises a second number of Uygur language texts, and the first number is less than the second number;
generating an emotion classifier based on the first text set and the corresponding labeling information;
carrying out emotion classification on the second text set by using the emotion classifier to obtain an emotion classification result;
before the emotion classifier is used for performing emotion classification on the second text set to obtain an emotion classification result, the method further comprises the following steps: aggregating the second text set by using k-means clustering to obtain a plurality of first set clusters; performing hierarchical clustering on the plurality of first set clusters to obtain a plurality of second set clusters; and carrying out emotion classification on the plurality of second set clusters by using the emotion classifier to obtain the emotion classification result.
2. The method of claim 1, wherein splitting the plurality of Uygur language texts to obtain a first text set and a second text set comprises:
screening the Uygur language texts based on a preset screening strategy to obtain a first text set;
and obtaining the second text set according to other Uygur language texts except the first text set in the plurality of Uygur language texts.
3. The method of claim 2, wherein the step of filtering the plurality of Uygur texts based on a predetermined filtering strategy to obtain the first text set comprises:
and screening texts containing preset types of emotion words from the Uygur language texts to obtain the first text set.
4. The method of claim 1, wherein aggregating the second set of text using k-means clustering to obtain a plurality of first set clusters comprises:
acquiring a preset number of initial set clusters from the second text set;
and calculating the distance between each sample point in the second text set and the center point of each initial set cluster to obtain the plurality of first set clusters.
5. The method of claim 4, wherein calculating a distance between each sample point in the second text set and a center point of each initial set cluster to obtain the plurality of first set clusters comprises:
step A1, calculating the distance between the current sample point and the center point of each initial set cluster;
step A2, storing the current sample point into the corresponding initial set cluster according to the distance between the current sample point and the center point of each initial set cluster to obtain a plurality of new initial set clusters;
step A3, calculating the center point of each new initial set cluster;
and step A4, taking the next sample point of the current sample point as the current sample point, and executing the steps A1 to A3 in a circulating manner until all sample points in the second text set are classified, so as to obtain the plurality of first set clusters.
6. The method of claim 1, wherein hierarchically clustering the first plurality of clusters to obtain a second plurality of clusters comprises:
step B1, calculating the distance between any two first set clusters to obtain a plurality of first distances;
step B2, obtaining two first set clusters corresponding to the minimum distance from the plurality of first distances;
step B3, merging the two first set clusters to obtain a plurality of new first set clusters;
and step B4, executing steps B1 to B3 in a loop until the number of new first set clusters is the same as the preset number of layers, so as to obtain the plurality of second set clusters.
7. The method of claim 1, wherein performing emotion classification on the second text set by using the emotion classifier, and obtaining emotion classification results comprises:
performing emotion classification on the second text set by using the emotion classifier to obtain a first probability;
performing emotion classification on the second text set by manual labeling to obtain a second probability, wherein the second probability is estimated from unlabeled data;
and calculating the product of the first probability and the second probability to obtain the maximum posterior probability of the second text set.
8. The method of any one of claims 1 to 7, wherein obtaining a plurality of Uygur texts comprises:
and crawling the plurality of Uygur language texts by a web crawler.
9. The method of claim 8, wherein after obtaining the plurality of Uygur language texts, the method further comprises:
preprocessing the multiple Uygur language texts to obtain processed multiple Uygur language texts;
splitting the processed plurality of Uygur language texts to obtain the first text set and the second text set.
10. An emotion classification apparatus for Uygur language text, comprising:
the acquisition module is used for acquiring a plurality of Uygur language texts;
the dividing module is used for splitting the plurality of Uygur language texts to obtain a first text set and a second text set, wherein the first text set comprises a first number of Uygur language texts, the second text set comprises a second number of Uygur language texts, and the first number is less than the second number;
the generating module is used for generating an emotion classifier based on the first text set and the corresponding labeling information;
the classification module is used for carrying out emotion classification on the second text set by using the emotion classifier to obtain an emotion classification result;
the device further comprises: a module for aggregating the second text set by using k-means clustering to obtain a plurality of first set clusters before the emotion classifier is used for performing emotion classification on the second text set to obtain an emotion classification result; a module for performing hierarchical clustering on the plurality of first set clusters to obtain a plurality of second set clusters; and a module for performing emotion classification on the plurality of second set clusters by using the emotion classifier to obtain the emotion classification result.
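Steps A1 to A4 of claim 5 and steps B1 to B4 of claim 6 can be sketched in plain Python as follows. This is a sketch under stated assumptions, not the patented implementation: the texts are assumed to have been converted to numeric feature vectors already, the first k samples are assumed to seed the initial set clusters, the distance between two clusters is assumed to be the Euclidean distance between their center points, and the "preset number of layers" is read as the target number of clusters. All function names and sample values are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def center(points):
    """Center point (coordinate-wise mean) of a non-empty cluster."""
    return [sum(p[d] for p in points) / len(points) for d in range(len(points[0]))]


def sequential_kmeans(samples, k):
    """Claim 5 sketch: assign each sample once, updating centers as we go."""
    clusters = [[s] for s in samples[:k]]      # assumed seeding
    centers = [list(s) for s in samples[:k]]
    for sample in samples[k:]:                 # A4: move to the next sample
        dists = [distance(sample, c) for c in centers]   # A1
        nearest = dists.index(min(dists))
        clusters[nearest].append(sample)                 # A2
        # A3: only the cluster that changed needs a new center.
        centers[nearest] = center(clusters[nearest])
    return clusters


def agglomerate(clusters, target):
    """Claim 6 sketch: merge the two closest clusters until target remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > target:              # B4: loop until the preset number
        pairs = [                              # B1: all pairwise distances
            (distance(center(clusters[a]), center(clusters[b])), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        _, a, b = min(pairs)                   # B2: closest pair
        clusters[a].extend(clusters[b])        # B3: merge
        del clusters[b]
    return clusters


# Illustrative 2-D feature vectors standing in for vectorized texts.
samples = [(0.0, 0.0), (10.0, 10.0), (5.0, 0.0),
           (0.5, 0.2), (9.5, 10.1), (5.2, 0.3)]
first_clusters = sequential_kmeans(samples, 3)
second_clusters = agglomerate(first_clusters, 2)
print(second_clusters)
```

Each resulting second set cluster would then be passed to the emotion classifier, as recited in claim 1.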
CN201710080052.8A 2017-02-14 2017-02-14 Emotion classification method and device for Uygur language text Active CN106844743B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710080052.8A CN106844743B (en) 2017-02-14 2017-02-14 Emotion classification method and device for Uygur language text

Publications (2)

Publication Number Publication Date
CN106844743A CN106844743A (en) 2017-06-13
CN106844743B CN106844743B (en) 2020-04-24

Family

ID=59128725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710080052.8A Active CN106844743B (en) 2017-02-14 2017-02-14 Emotion classification method and device for Uygur language text

Country Status (1)

Country Link
CN (1) CN106844743B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241518B (en) * 2017-07-11 2021-01-22 北京交通大学 Network water army detection method based on emotion analysis
CN112101393A (en) * 2019-06-18 2020-12-18 上海电机学院 Wind power plant fan clustering method and device
CN110782876B (en) * 2019-10-21 2022-03-18 华中科技大学 Unsupervised active learning method for voice emotion calculation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682124A (en) * 2012-05-16 2012-09-19 苏州大学 Emotion classifying method and device for text
CN103020249A (en) * 2012-12-19 2013-04-03 苏州大学 Classifier construction method and device as well as Chinese text sentiment classification method and system
CN103530286A (en) * 2013-10-31 2014-01-22 苏州大学 Multi-class sentiment classification method
CN105912576A (en) * 2016-03-31 2016-08-31 北京外国语大学 Emotion classification method and emotion classification system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
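Research on SVM Uygur sentiment analysis based on active learning; Li Xiang et al.; Journal of Xinjiang University (Natural Science Edition); 2015-11-15; Vol. 32, No. 4; pp. 447-452 *
Li Xiang et al. Research on SVM Uygur sentiment analysis based on active learning. Journal of Xinjiang University (Natural Science Edition). 2015, Vol. 32, No. 4, pp. 447-452. *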

Also Published As

Publication number Publication date
CN106844743A (en) 2017-06-13

Similar Documents

Publication Publication Date Title
Kontopoulos et al. Ontology-based sentiment analysis of twitter posts
Kumar et al. Analyzing Twitter sentiments through big data
Venugopalan et al. Exploring sentiment analysis on twitter data
EP2866421B1 (en) Method and apparatus for identifying a same user in multiple social networks
CN108363821A (en) A kind of information-pushing method, device, terminal device and storage medium
WO2019084810A1 (en) Information processing method and terminal, and computer storage medium
Kawade et al. Sentiment analysis: machine learning approach
CN110909164A (en) Text enhancement semantic classification method and system based on convolutional neural network
CN105874753A (en) Systems and methods for behavioral segmentation of users in a social data network
Ghalmane et al. Extracting backbones in weighted modular complex networks
Wagner et al. Semantic stability in social tagging streams
CN107436916B (en) Intelligent answer prompting method and device
CN111177559B (en) Text travel service recommendation method and device, electronic equipment and storage medium
Bahamonde et al. Power structure in Chilean news media
Noel et al. Applicability of Latent Dirichlet Allocation to multi-disk search
CN106844743B (en) Emotion classification method and device for Uygur language text
Prata et al. Social data analysis of Brazilian's mood from Twitter
Tabak et al. Comparison of emotion lexicons
Zhu et al. Identification of opinion leaders in social networks based on sentiment analysis: Evidence from an automotive forum
Abuhay et al. Analysis of computational science papers from iccs 2001-2016 using topic modeling and graph theory
Sha et al. Matching user accounts across social networks based on users message
CN116882414B (en) Automatic comment generation method and related device based on large-scale language model
CN108153818B (en) Big data based clustering method
Kumar et al. Fake news detection of Indian and United States election data using machine learning algorithm
CN109241438B (en) Element-based cross-channel hot event discovery method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant