CN106844743A - Sentiment classification method and device for Uighur text - Google Patents

Sentiment classification method and device for Uighur text

Info

Publication number
CN106844743A
Authority
CN
China
Prior art keywords
text
uighur
collection
text collection
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710080052.8A
Other languages
Chinese (zh)
Other versions
CN106844743B (en)
Inventor
李响
陈建新
崔力民
马宗达
运凯
景康
赵忠浩
任晴晴
曹进平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Xinjiang Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Xinjiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC and Information and Telecommunication Branch of State Grid Xinjiang Electric Power Co Ltd
Priority to CN201710080052.8A
Publication of CN106844743A
Application granted
Publication of CN106844743B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Abstract

The invention discloses a sentiment classification method and device for Uighur text. The method includes: obtaining a plurality of Uighur texts; splitting the plurality of Uighur texts to obtain a first text set and a second text set, where the first text set includes a first quantity of Uighur texts, the second text set includes a second quantity of Uighur texts, and the first quantity is smaller than the second quantity; generating a sentiment classifier based on the first text set and the corresponding annotation information; and performing sentiment classification on the second text set using the sentiment classifier to obtain a sentiment classification result. The invention solves the technical problem that prior-art sentiment classification methods for Uighur text require manual sentiment annotation of a large number of Uighur texts, which makes processing slow and inefficient.

Description

Sentiment classification method and device for Uighur text
Technical field
The present invention relates to the field of Internet public opinion analysis for minority languages, and in particular to a sentiment classification method and device for Uighur text.
Background
The Internet is developing rapidly. Users around the world can obtain or publish information online at any moment, so text with a subjective coloring spreads widely across the network. Such text, which carries the public's subjective opinions, strongly influences online and social public opinion, and studying it in depth would be highly worthwhile.
China's ethnic minorities are distributed in a pattern of mixed habitation over large areas with concentrated communities in small areas, and many ethnic groups have their own languages. Studying minority languages in order to better understand what minority communities want to express matters greatly for national unity and for social disputes. However, because many ethnic groups have their own languages, large-scale research on online public opinion cannot yet be carried out well, and research on the online public opinion of the ethnic groups in the Xinjiang region is still at an early stage. Uighur text circulates widely in online public opinion, and processing it purely by hand is very time-consuming and laborious, which is where sentiment analysis can be put to good use.
Supervised learning, which is commonly used for sentiment classification, requires a manually annotated corpus. Without a sufficient amount of annotated data, classifier performance drops. Annotated corpora for minority languages are still very scarce, which makes supervised learning hard to apply.
No effective solution has yet been proposed for the problem that prior-art sentiment classification methods for Uighur text require manual sentiment annotation of a large number of Uighur texts, which makes processing slow and inefficient.
Summary of the invention
Embodiments of the present invention provide a sentiment classification method and device for Uighur text, to at least solve the technical problem that prior-art sentiment classification methods for Uighur text require manual sentiment annotation of a large number of Uighur texts, which makes processing slow and inefficient.
According to one aspect of the embodiments of the present invention, a sentiment classification method for Uighur text is provided, including: obtaining a plurality of Uighur texts; splitting the plurality of Uighur texts to obtain a first text set and a second text set, where the first text set includes a first quantity of Uighur texts, the second text set includes a second quantity of Uighur texts, and the first quantity is smaller than the second quantity; generating a sentiment classifier based on the first text set and the corresponding annotation information; and performing sentiment classification on the second text set using the sentiment classifier to obtain a sentiment classification result.
Further, splitting the plurality of Uighur texts to obtain the first text set and the second text set includes: screening the plurality of Uighur texts based on a preset screening strategy to obtain the first text set; and obtaining the second text set from the Uighur texts, among the plurality of Uighur texts, that are not in the first text set.
Further, screening the plurality of Uighur texts based on the preset screening strategy to obtain the first text set includes: selecting, from the plurality of Uighur texts, the texts that contain sentiment words of a preset type, to obtain the first text set.
Further, before performing sentiment classification on the second text set using the sentiment classifier to obtain the sentiment classification result, the method also includes: aggregating the second text set using k-means clustering to obtain a plurality of first clusters; performing hierarchical clustering on the plurality of first clusters to obtain a plurality of second clusters; and performing sentiment classification on the plurality of second clusters using the sentiment classifier to obtain the sentiment classification result.
Further, aggregating the second text set using k-means clustering to obtain the plurality of first clusters includes: obtaining a preset number of initial clusters from the second text set; and calculating the distance between each sample point in the second text set and the center point of each initial cluster, to obtain the plurality of first clusters.
Further, calculating the distance between each sample point in the second text set and the center point of each initial cluster to obtain the plurality of first clusters includes: step A1, calculating the distance between the current sample point and the center point of each initial cluster; step A2, according to these distances, storing the current sample point in the corresponding initial cluster, to obtain a plurality of new initial clusters; step A3, calculating the center point of each new initial cluster; step A4, taking the next sample point as the current sample point and repeating steps A1 to A3 until all sample points in the second text set have been assigned, to obtain the plurality of first clusters.
Further, performing hierarchical clustering on the plurality of first clusters to obtain the plurality of second clusters includes: step B1, calculating the distance between every two first clusters, to obtain a plurality of first distances; step B2, obtaining, from the plurality of first distances, the two first clusters that correspond to the minimum distance; step B3, merging these two first clusters, to obtain a plurality of new first clusters; step B4, repeating steps B1 to B3 until the number of new first clusters equals a preset number of levels, to obtain the plurality of second clusters.
Further, performing sentiment classification on the second text set using the sentiment classifier to obtain the sentiment classification result includes: performing sentiment classification on the second text set using the sentiment classifier to obtain a first probability; performing sentiment classification on the second text set to obtain a second probability; and calculating the product of the first probability and the second probability to obtain the maximum a posteriori probability for the second text set.
Further, obtaining the plurality of Uighur texts includes: obtaining the plurality of Uighur texts by crawling with a web crawler.
Further, after obtaining the plurality of Uighur texts, the method also includes: preprocessing the plurality of Uighur texts to obtain a plurality of processed Uighur texts; and splitting the processed Uighur texts to obtain the first text set and the second text set.
According to another aspect of the embodiments of the present invention, a sentiment classification device for Uighur text is also provided, including: an acquisition module for obtaining a plurality of Uighur texts; a division module for splitting the plurality of Uighur texts to obtain a first text set and a second text set, where the first text set includes a first quantity of Uighur texts, the second text set includes a second quantity of Uighur texts, and the first quantity is smaller than the second quantity; a generation module for generating a sentiment classifier based on the first text set and the corresponding annotation information; and a classification module for performing sentiment classification on the second text set using the sentiment classifier to obtain a sentiment classification result.
In the embodiments of the present invention, a plurality of Uighur texts is obtained and split into a first text set and a second text set; a sentiment classifier is generated based on the first text set and the corresponding annotation information; and the sentiment classifier is used to perform sentiment classification on the second text set to obtain a sentiment classification result. It is easy to see that a small corpus that still reflects the classification outcome can be selected and annotated manually, which yields a better classifier, and that sentiment analysis by unsupervised learning lets a disordered corpus quickly find its own classes, so that the characteristics of each class are delineated accurately. This solves the technical problem that prior-art sentiment classification methods for Uighur text require manual sentiment annotation of a large number of Uighur texts, which makes processing slow and inefficient. With the scheme provided by the above embodiments, supervised learning can mitigate sentiment imbalance to some extent and lays effective groundwork for semi-supervised learning, which improves the accuracy of Uighur sentiment analysis, saves labor, and reduces the required corpus size.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow chart of a sentiment classification method for Uighur text according to an embodiment of the present invention;
Fig. 2(a) is a schematic diagram of an optional sentiment classification result according to an embodiment of the present invention;
Fig. 2(b) is a schematic diagram of another optional sentiment classification result according to an embodiment of the present invention;
Fig. 3 is a flow chart of an optional sentiment classification method for Uighur text according to an embodiment of the present invention; and
Fig. 4 is a schematic diagram of a sentiment classification device for Uighur text according to an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the scheme of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative work fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and so on in the description, the claims, and the drawings are used to distinguish similar objects and do not describe a specific order or sequence. Data described in this way can be interchanged where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "including" and "having" and any variants of them are intended to cover non-exclusive inclusion: a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, and may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, an embodiment of a sentiment classification method for Uighur text is provided. It should be noted that the steps shown in the flow charts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in a different order.
Fig. 1 is a flow chart of a sentiment classification method for Uighur text according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102, obtain a plurality of Uighur texts.
Specifically, the Uighur texts may be a corpus that has not been manually annotated.
In an optional scheme, Uighur-language information posted by Uighur users can be obtained from the network. For example, Uighur texts can be collected from microblog posts published by Uighur users. The Uighur texts obtained from the network are all unannotated corpus data.
Step S104, split the plurality of Uighur texts to obtain a first text set and a second text set, where the first text set includes a first quantity of Uighur texts, the second text set includes a second quantity of Uighur texts, and the first quantity is smaller than the second quantity.
In an optional scheme, because annotated Uighur corpora are currently hard to obtain, a small part of the large unannotated corpus that was collected can be selected as the first text set, that is, the training text set, and the entries in the training text set are annotated manually, which reduces manual work. The remaining, larger part of the corpus serves as the second text set, that is, the test text set.
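The split in step S104 can be sketched as follows. This is a minimal illustration, not the patented procedure: the function name `split_corpus`, the `first_count` parameter, and the use of a seeded random draw in place of the screening strategy are all assumptions made for the example. It only enforces the stated constraint that the first set is smaller than the second.

```python
import random


def split_corpus(texts, first_count, seed=0):
    """Split texts into a small set to annotate and a larger remaining set.

    A seeded random draw stands in for the screening strategy (assumption).
    """
    second_count = len(texts) - first_count
    if not 0 < first_count < second_count:
        raise ValueError("first set must be non-empty and smaller than the second")
    rng = random.Random(seed)
    indices = list(range(len(texts)))
    rng.shuffle(indices)
    chosen = set(indices[:first_count])
    # Keep the original order inside each set.
    first = [t for i, t in enumerate(texts) if i in chosen]
    second = [t for i, t in enumerate(texts) if i not in chosen]
    return first, second


corpus = [f"text-{i}" for i in range(10)]  # placeholder strings, not real data
first_set, second_set = split_corpus(corpus, first_count=3)
```

Only `first_set` would then go to human annotators, while `second_set` stays unannotated.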
Step S106, generate a sentiment classifier based on the first text set and the corresponding annotation information.
Specifically, the annotation information may be the manually annotated sentiment type.
In an optional scheme, the small part of the corpus that was selected is annotated manually, that is, each sample in the first text set is manually labeled with its sentiment type, and the annotated corpus is then used to build a supervised sentiment classifier.
Step S108, perform sentiment classification on the second text set using the sentiment classifier to obtain a sentiment classification result.
In an optional scheme, the sentiment classifier can be used to carry out unsupervised learning on the large remaining unannotated corpus, that is, on the texts in the second text set, so that the large number of Uighur texts that were collected are classified by sentiment.
According to the above embodiment of the present invention, a plurality of Uighur texts is obtained and split into a first text set and a second text set; a sentiment classifier is generated based on the first text set and the corresponding annotation information; and the sentiment classifier is used to perform sentiment classification on the second text set to obtain a sentiment classification result. It is easy to see that a small corpus that still reflects the classification outcome can be selected and annotated manually, which yields a better classifier, and that sentiment analysis by unsupervised learning lets a disordered corpus quickly find its own classes, so that the characteristics of each class are delineated accurately. This solves the technical problem that prior-art sentiment classification methods for Uighur text require manual sentiment annotation of a large number of Uighur texts, which makes processing slow and inefficient. With the scheme provided by the above embodiment, supervised learning can mitigate sentiment imbalance to some extent and lays effective groundwork for semi-supervised learning, which improves the accuracy of Uighur sentiment analysis, saves labor, and reduces the required corpus size.
Optionally, in the above embodiment of the present invention, step S104, splitting the plurality of Uighur texts to obtain the first text set and the second text set, includes:
Step S1042, screen the plurality of Uighur texts based on a preset screening strategy to obtain the first text set.
Specifically, the preset screening strategy may include several active learning strategies, for example classification ambiguity, sample diversity, and cluster representativeness. With classification ambiguity, the large Uighur corpus that was collected is screened for entries whose sentiment type is hard to determine. With sample diversity, it is screened for entries whose sentiment types differ strongly, for example entries of positive sentiment type and entries of negative sentiment type. With cluster representativeness, it is likewise screened for entries whose sentiment types differ strongly; for example, when screening for entries of positive sentiment type, entries containing the word "like" can be selected, because such entries are common and representative within the positive sentiment type.
In an optional scheme, supervised learning combined with active learning is usually iterated several times. In each round, the training texts that need manual annotation are screened and added to the training text set, that is, the first text set, and the strategy for the next round of screening is then chosen again. The strategy may change from one iteration to the next depending on the data in the current first text set. Iteration stops once the first text set reaches a certain corpus size.
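One common way to realize a classification-ambiguity strategy is uncertainty sampling: pick the texts whose predicted probability is closest to the decision boundary. The sketch below assumes a binary positive/negative setting and a classifier that already returns a positive-class probability per text; the function name `select_most_ambiguous` and the 0.5 boundary are assumptions for illustration, since the text does not specify how ambiguity is measured.

```python
def select_most_ambiguous(probabilities, batch_size):
    """Return indices of the samples whose positive-class probability
    is closest to 0.5, most ambiguous first."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:batch_size]


# Hypothetical classifier outputs for four unannotated texts.
scores = [0.9, 0.52, 0.1, 0.45]
to_annotate = select_most_ambiguous(scores, batch_size=2)
```

In an iterative setup, the texts at `to_annotate` would be labeled by hand, moved into the first text set, and the classifier retrained before the next round.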
Step S1044, obtain the second text set from the Uighur texts, among the plurality of Uighur texts, that are not in the first text set.
In an optional scheme, once the entries of the first text set have been determined, the entries of the large unannotated corpus that were not selected can be used as the test text set, that is, the second text set.
Optionally, in the above embodiment of the present invention, step S1042, screening the plurality of Uighur texts based on the preset screening strategy to obtain the first text set, includes:
Step S10422, select, from the plurality of Uighur texts, the texts that contain sentiment words of a preset type, to obtain the first text set.
Specifically, because Uighur is an agglutinative language, adding various affixes can change the original meaning of a sentiment word. A slight change in the degree of the sentiment word is acceptable, but adding a negative affix changes its meaning completely. For example, in the Uighur expressions for "I do not like" and "I do not want to laugh", a word that originally carried positive sentiment yields the opposite sentiment once a negative affix is added. In both of these cases a negative affix is added to the original word. Negation can also be expressed by adding a separate negation word directly, in which case two Uighur words are used to express the negation.
In an optional scheme, entries that contain negative affixes or negation words can be selected from the large Uighur corpus that was collected and stored in the first text set. That is, before the vector space is built, sentiment words carrying a negative suffix are extracted and a separate vector space is built for them.
Optionally, in the above embodiment of the present invention, before step S108, performing sentiment classification on the second text set using the sentiment classifier to obtain the sentiment classification result, the method also includes:
Step S110, aggregate the second text set using k-means clustering to obtain a plurality of first clusters.
Step S112, perform hierarchical clustering on the plurality of first clusters to obtain a plurality of second clusters.
Step S114, perform sentiment classification on the plurality of second clusters using the sentiment classifier to obtain the sentiment classification result.
In an optional scheme, k-means clustering is applied to the data in the second text set, producing k first clusters. Agglomerative hierarchical clustering is then applied to the k first clusters in a bottom-up manner: among the clusters produced by k-means, the two nearest (most similar) clusters are merged into a new cluster, so that k becomes k-1 after each merge, until a plurality of second clusters is obtained. Finally, the sentiment classifier built by supervised learning performs sentiment classification on the second clusters to obtain the classification result.
Through steps S110 to S114, the second text set can be clustered using agglomerative hierarchical k-means clustering.
Optionally, in the above embodiment of the present invention, step S110, aggregating the second text set using k-means clustering to obtain the plurality of first clusters, includes:
Step S1102, obtain a preset number of initial clusters from the second text set.
Specifically, the preset number may be the target number of clusters k set in advance for the k-means clustering algorithm.
Step S1104, calculate the distance between each sample point in the second text set and the center point of each initial cluster, to obtain the plurality of first clusters.
In an optional scheme, every entry in the second text set is first treated as its own cluster for k-means clustering. k initial points are found among all entries and serve as the initial clusters for k-means; the other entries are drawn toward these initial points, and once all entries have been assigned, k first clusters are obtained.
Optionally, in the above embodiment of the present invention, step S1104, calculating the distance between each sample point in the second text set and the center point of each initial cluster to obtain the plurality of first clusters, includes:
Step A1, calculate the distance between the current sample point and the center point of each initial cluster.
Step A2, according to the distances between the current sample point and the center points of the initial clusters, store the current sample point in the corresponding initial cluster, to obtain a plurality of new initial clusters.
Step A3, calculate the center point of each new initial cluster.
Step A4, take the next sample point as the current sample point and repeat steps A1 to A3 until all sample points in the second text set have been assigned, to obtain the plurality of first clusters.
In an optional scheme, after k initial points have been found among all entries, these points serve as the initial clusters for k-means clustering. The distance between each sample and the k initial points is calculated in turn, and the sample is placed in the cluster where it fits best, so that the initial point grows into a new cluster. The center point of the resulting cluster is then recalculated for use in the next distance calculation. If samples remain unassigned, the above operations are repeated until all entries have been assigned, which gives the final k clusters.
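Steps A1 to A4 describe a single-pass (sequential) variant of k-means in which the center is updated after each assignment. A minimal sketch follows, under stated assumptions: texts are already represented as numeric feature vectors, distance is Euclidean, and the first k samples are used as the initial points. None of these choices is fixed by the text, and the names `sequential_kmeans`, `euclidean`, and `mean_point` are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def sequential_kmeans(samples, k):
    """Assign samples one at a time, updating the touched center each time."""
    # Assumption: the first k samples act as the k initial points.
    clusters = [[s] for s in samples[:k]]
    centers = [list(s) for s in samples[:k]]
    for sample in samples[k:]:  # step A4: move on to the next sample
        # Step A1: distance from the current sample to every center.
        distances = [euclidean(sample, c) for c in centers]
        # Step A2: store the sample in the nearest cluster.
        nearest = distances.index(min(distances))
        clusters[nearest].append(sample)
        # Step A3: recompute the center of the cluster that changed.
        centers[nearest] = mean_point(clusters[nearest])
    return clusters, centers


points = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 9.0], [1.0, 0.0]]
first_clusters, first_centers = sequential_kmeans(points, k=2)
```

Because each sample is visited once, the result depends on sample order and on the initial points, unlike batch k-means, which iterates until the assignments stop changing.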
Optionally, in the above embodiment of the present invention, step S112, performing hierarchical clustering on the plurality of first clusters to obtain the plurality of second clusters, includes:
Step B1, calculate the distance between every two first clusters, to obtain a plurality of first distances.
Step B2, obtain, from the plurality of first distances, the two first clusters that correspond to the minimum distance.
Step B3, merge these two first clusters, to obtain a plurality of new first clusters.
Step B4, repeat steps B1 to B3 until the number of new first clusters equals a preset number of levels, to obtain the plurality of second clusters.
Specifically, the preset number of levels may be the level count N set in advance for the hierarchical clustering method.
In an optional scheme, hierarchical clustering is performed on the k clusters that have already been formed: each existing cluster is treated as an individual class, the distances between the classes are calculated, and the two classes at the smallest distance, that is, the most similar classes, are found. These two are merged into one new cluster, that is, the two first clusters are merged, which gives a plurality of new first clusters. The number of new first clusters is then compared with the level count N. If the number of first clusters equals N, the final second clusters have been obtained; otherwise the above process is repeated.
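Steps B1 to B4 amount to bottom-up agglomerative merging with a target cluster count. The sketch below assumes numeric feature vectors and measures the distance between two clusters as the Euclidean distance between their centroids; the text does not say which inter-cluster distance is used, so that choice, like the names `agglomerate` and `centroid`, is an assumption.

```python
import math
from itertools import combinations


def centroid(cluster):
    """Component-wise mean of the vectors in a non-empty cluster."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]


def centroid_distance(a, b):
    """Euclidean distance between the centroids of two clusters."""
    return math.dist(centroid(a), centroid(b))


def agglomerate(clusters, target_count):
    """Merge the two closest clusters until target_count clusters remain."""
    if target_count < 1:
        raise ValueError("target_count must be at least 1")
    clusters = [list(c) for c in clusters]
    while len(clusters) > target_count:  # step B4: repeat until N remain
        # Steps B1 and B2: find the pair of clusters at minimum distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: centroid_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        # Step B3: merge that pair into one new cluster.
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


first = [[[0.0, 0.0]], [[0.0, 1.0]], [[10.0, 10.0]], [[10.0, 11.0]]]
second = agglomerate(first, target_count=2)
```

Recomputing all pairwise distances on every pass is quadratic per merge, which is acceptable here because the input is the k clusters from the previous step rather than the raw texts.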
Alternatively, in the above embodiment of the present invention, step S108 is carried out using emotion classifiers to the second text collection Emotional semantic classification, obtaining emotional semantic classification result includes:
Step S1082, emotional semantic classification is carried out using emotion classifiers to the second text collection, obtains the first probability.
Step S1084, emotional semantic classification is carried out to the second text collection, obtains the second probability.
Step S1086, calculates the product of the first probability and the second probability, and the maximum a posteriori for obtaining the second text collection is general Rate.
Herein it should be noted that as shown in Fig. 2 (a) and Fig. 2 (b) ,+number represent the language material of positive emotion type ,-number table Show the language material of negative emotion type, No. * is not mark the language material for needing to carry out emotional semantic classification, shown in such as Fig. 2 (a), by artificial Mark, can by the language material of No. * with+number language material point in same class, will No. * language material for being categorized as positive emotion type;Such as Shown in Fig. 2 (b), by emotion classifiers can by the language material of No. * with-number language material point in same class, will No. * be categorized as bearing The language material of face affective style, accordingly, it would be desirable to be optimized to emotion classifiers, improves the classification degree of accuracy.
In an optional scheme, semi-supervised processing can be carried out by combining supervised classification with the results of unsupervised analysis. Assuming that the labeled data and the unlabeled data together are generally drawn from a mixture of N component distributions, the sentiment classifier can be used to classify the second text set, giving the first probability P(y | Nj, x); the second text set is also classified by manual annotation, and the second probability P(Nj | x) is estimated from the unlabeled data. Multiplying the first probability P(y | Nj, x) by the second probability P(Nj | x) gives the maximum a posteriori probability, where Nj denotes the j-th mixture component.
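As a rough illustration of how the two probabilities might be combined for a single text x, the sketch below multiplies P(y | Nj, x) by P(Nj | x) for each component and picks the label with the largest total. The function name `map_label` and the input layout are hypothetical. The text only states the product; summing the products over the components, as in a standard mixture model, is an assumption made here.

```python
def map_label(p_label_given_component, p_component):
    """Return (best_label, posterior) for one text x.

    p_label_given_component[j][y] plays the role of P(y | Nj, x).
    p_component[j] plays the role of P(Nj | x).
    """
    posterior = {}
    for j, p_nj in enumerate(p_component):
        for label, p_y in p_label_given_component[j].items():
            # Product of the first and second probability for component j,
            # accumulated over components (assumed combination rule).
            posterior[label] = posterior.get(label, 0.0) + p_y * p_nj
    best = max(posterior, key=posterior.get)
    return best, posterior[best]
```

For example, with two components weighted 0.7 and 0.3, a text that the first component rates strongly positive is labeled positive even if the second component leans negative.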
Optionally, in the above embodiment of the present invention, step S102, obtaining multiple Uighur texts, includes:
Step S1022: obtain the multiple Uighur texts by crawling with a web crawler.
In an optional scheme, crawler software can be used to crawl the above-mentioned Guduk microblog to obtain the multiple Uighur texts.
Optionally, in the above embodiment of the present invention, after step S102, obtaining multiple Uighur texts, the method further includes:
Step S116: preprocess the multiple Uighur texts to obtain the processed Uighur texts.
Step S118: split the processed Uighur texts to obtain the first text set and the second text set.
In an optional scheme, non-text content such as images and domain names can be removed from the Uighur texts captured by the crawler software, so that only the microblog text containing comments and user posts is retained.
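The preprocessing of step S116 might look like the sketch below. The names `clean_post` and `preprocess` are hypothetical, and so is the `[img...]` placeholder format for embedded images; a real crawler's output format would determine the actual patterns.

```python
import re

# Links and bare domain names (assumed patterns).
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Hypothetical image placeholder such as "[img:123]" left by a crawler.
IMAGE_RE = re.compile(r"\[(?:img|image)[^\]]*\]", re.IGNORECASE)
WHITESPACE_RE = re.compile(r"\s+")


def clean_post(raw):
    """Strip links and image placeholders, then collapse whitespace."""
    text = URL_RE.sub(" ", raw)
    text = IMAGE_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def preprocess(posts):
    """Clean every post and drop those with no text left."""
    cleaned = (clean_post(p) for p in posts)
    return [t for t in cleaned if t]
```

Posts that consisted only of a link or an image are dropped, so the later split operates on text-bearing posts only.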
Fig. 3 is a flow chart of an optional sentiment classification method for Uighur text according to an embodiment of the present invention. A preferred embodiment of the present invention is described in detail below with reference to Fig. 3. As shown in Fig. 3, a web crawler first crawls comments in microblogs to obtain an unlabeled corpus; part of the corpus is screened out and manually labeled, and the sentiment classes of the corpus are balanced. The small labeled part is then used to build a supervised sentiment classifier, i.e. the SVM classifier in Fig. 3, and unsupervised learning is applied to the remaining unlabeled corpus. Combining unsupervised and supervised learning allows Uighur text to be classified by sentiment effectively. Semi-supervised learning lies between supervised and unsupervised learning. It can select, from a large unlabeled corpus, the part most worth labeling manually; for example, an active learning strategy filters out the items most worth training on, and these are used to build the classifier. Supervised and unsupervised learning can also be combined so that each assists the other.
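The text does not say which active learning strategy is used. One common choice, shown here purely as an assumption, is uncertainty sampling: the items whose classifier score lies closest to the decision boundary are sent for manual labeling first. The function name `select_for_labeling` and the `decision_scores` input (for example, signed distances from an SVM hyperplane) are hypothetical.

```python
def select_for_labeling(texts, decision_scores, budget):
    """Pick the `budget` texts the current classifier is least sure about.

    decision_scores[i] is assumed to be a signed score for texts[i], where
    values near zero mean the item lies close to the decision boundary.
    """
    ranked = sorted(zip(texts, decision_scores), key=lambda pair: abs(pair[1]))
    return [text for text, _ in ranked[:budget]]
```

The selected items would be labeled by hand and added to the training set before the classifier is rebuilt.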
With the scheme provided by the above embodiment of the present invention, supervised learning suppresses sentiment class imbalance to a certain extent. Unsupervised sentiment analysis lets a disordered corpus quickly find its own classes and delineates the characteristics of each class fairly accurately, which lays the groundwork for semi-supervised learning. Uighur sentiment analysis based on semi-supervised learning can therefore classify microblog sentiment accurately while saving manpower and reducing the corpus scale.
Embodiment 2
According to an embodiment of the present invention, an embodiment of a sentiment classification device for Uighur text is provided.
Fig. 4 is a schematic diagram of a sentiment classification device for Uighur text according to an embodiment of the present invention. As shown in Fig. 4, the device includes:
An acquisition module 41, configured to obtain multiple Uighur texts.
Specifically, the above Uighur texts may be corpus items that have not been manually labeled.
In an optional scheme, Uighur-language information posted by Uighur users can be obtained from the network; for example, Uighur texts can be obtained from microblogs posted by Uighur users. The Uighur texts obtained from the network are all corpus items that have not been manually labeled.
A division module 43, configured to split the multiple Uighur texts to obtain a first text set and a second text set, wherein the first text set includes a first quantity of Uighur texts, the second text set includes a second quantity of Uighur texts, and the first quantity is less than the second quantity.
In an optional scheme, because labeled Uighur corpora are difficult to obtain, a small part of the large unlabeled corpus that has been obtained can be screened out as the first text set, i.e. the training text set, and the items in the training text set are manually labeled, which reduces manual work. The remaining majority of the corpus serves as the second text set, i.e. the test text set.
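A minimal sketch of such a split is given below. It screens texts that contain a word from a preset sentiment lexicon, in the spirit of the screening strategy in claim 3. The function name `split_corpus`, the `max_labeled` cap and whitespace tokenization are all assumptions made for illustration.

```python
def split_corpus(texts, sentiment_words, max_labeled):
    """Split texts into a small set to label and a larger remaining set.

    A text goes into the first set if it contains at least one preset
    sentiment word and the cap max_labeled has not been reached yet;
    every other text goes into the second set.
    """
    first, second = [], []
    for text in texts:
        tokens = set(text.split())  # assumed whitespace tokenization
        if len(first) < max_labeled and tokens & sentiment_words:
            first.append(text)
        else:
            second.append(text)
    return first, second
```

The caller is expected to choose `max_labeled` small enough that the first set stays smaller than the second, as the division module requires.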
A generation module 45, configured to generate a sentiment classifier based on the first text set and the corresponding annotation information.
Specifically, the above annotation information may be the manually labeled sentiment type.
In an optional scheme, the small part of the corpus that was screened out can be manually labeled, i.e. the texts in the first text set are manually labeled with the sentiment type of each sample, and the labeled corpus is then used to build a supervised sentiment classifier.
A classification module 47, configured to perform sentiment classification on the second text set using the sentiment classifier to obtain a sentiment classification result.
In an optional scheme, the sentiment classifier can be used to perform unsupervised learning on the remaining large unlabeled corpus, i.e. on the texts in the second text set, so that sentiment classification is carried out on the large number of Uighur texts obtained.
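To make the roles of the generation module and the classification module concrete, the sketch below trains on a few labeled texts and then labels unseen ones. It deliberately uses a crude word-count scorer as a stand-in, not the SVM classifier the description refers to, and the names `train` and `classify` are hypothetical.

```python
from collections import Counter


def train(labeled_texts):
    """Build per-label word counts from (text, label) pairs."""
    model = {}
    for text, label in labeled_texts:
        model.setdefault(label, Counter()).update(text.split())
    return model


def classify(model, text):
    """Return the label whose training texts share the most words with text."""
    tokens = text.split()
    # A Counter returns 0 for unseen words, so unknown tokens add nothing.
    scores = {label: sum(counts[t] for t in tokens) for label, counts in model.items()}
    return max(scores, key=scores.get)
```

In this picture, `train` corresponds to what the generation module does with the first text set, and `classify` is applied to each text of the second text set.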
According to the above embodiment of the present invention, multiple Uighur texts are obtained and split into a first text set and a second text set; a sentiment classifier is generated based on the first text set and the corresponding annotation information; and the sentiment classifier is used to classify the second text set, yielding a sentiment classification result. It is easy to see that a small but representative part of the corpus can be manually screened out and labeled to produce a better classifier, and that unsupervised sentiment analysis lets a disordered corpus quickly find its own classes and delineates the characteristics of each class fairly accurately. This solves the technical problem that prior-art sentiment classification methods for Uighur text require a large number of Uighur texts to be labeled manually, which takes a long time and is inefficient. The scheme provided by the above embodiment can therefore suppress sentiment class imbalance to a certain extent through supervised learning and lay the groundwork for semi-supervised learning, thereby improving the accuracy of Uighur sentiment analysis, saving manpower and reducing the corpus scale.
The above embodiments of the present invention are numbered for description only; the numbering does not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection between units or modules through some interfaces, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (11)

1. A sentiment classification method for Uighur text, characterized by comprising:
obtaining multiple Uighur texts;
splitting the multiple Uighur texts to obtain a first text set and a second text set, wherein the first text set includes a first quantity of Uighur texts, the second text set includes a second quantity of Uighur texts, and the first quantity is less than the second quantity;
generating a sentiment classifier based on the first text set and corresponding annotation information;
performing sentiment classification on the second text set using the sentiment classifier to obtain a sentiment classification result.
2. The method according to claim 1, characterized in that splitting the multiple Uighur texts to obtain the first text set and the second text set comprises:
screening the multiple Uighur texts based on a preset screening strategy to obtain the first text set;
obtaining the second text set from the Uighur texts in the multiple Uighur texts other than the first text set.
3. The method according to claim 2, characterized in that screening the multiple Uighur texts based on the preset screening strategy to obtain the first text set comprises:
screening, from the multiple Uighur texts, the texts that contain sentiment words of a preset type to obtain the first text set.
4. The method according to claim 1, characterized in that, before performing sentiment classification on the second text set using the sentiment classifier to obtain the sentiment classification result, the method further comprises:
aggregating the second text set using k-means clustering to obtain multiple first clusters;
performing hierarchical clustering on the multiple first clusters to obtain multiple second clusters;
performing sentiment classification on the multiple second clusters using the sentiment classifier to obtain the sentiment classification result.
5. The method according to claim 4, characterized in that aggregating the second text set using k-means clustering to obtain the multiple first clusters comprises:
obtaining a preset number of initial clusters from the second text set;
calculating the distance between each sample point in the second text set and the center point of each initial cluster to obtain the multiple first clusters.
6. The method according to claim 5, characterized in that calculating the distance between each sample point in the second text set and the center point of each initial cluster to obtain the multiple first clusters comprises:
Step A1: calculating the distance between a current sample point and the center point of each initial cluster;
Step A2: according to the distance between the current sample point and the center point of each initial cluster, storing the current sample point in the corresponding initial cluster to obtain multiple new initial clusters;
Step A3: calculating the center point of each new initial cluster;
Step A4: taking the next sample point after the current sample point as the current sample point, and repeating step A1 to step A3 until all sample points in the second text set have been assigned, to obtain the multiple first clusters.
7. The method according to claim 4, characterized in that performing hierarchical clustering on the multiple first clusters to obtain the multiple second clusters comprises:
Step B1: calculating the distance between any two first clusters to obtain multiple first distances;
Step B2: obtaining, from the multiple first distances, the two first clusters corresponding to the minimum distance;
Step B3: merging the two first clusters to obtain multiple new first clusters;
Step B4: repeating step B1 to step B3 until the number of the multiple new first clusters equals a preset hierarchy number, to obtain the multiple second clusters.
8. The method according to claim 1, characterized in that performing sentiment classification on the second text set using the sentiment classifier to obtain the sentiment classification result comprises:
performing sentiment classification on the second text set using the sentiment classifier to obtain a first probability;
performing sentiment classification on the second text set to obtain a second probability;
calculating the product of the first probability and the second probability to obtain the maximum a posteriori probability of the second text set.
9. The method according to any one of claims 1 to 8, characterized in that obtaining the multiple Uighur texts comprises:
obtaining the multiple Uighur texts by crawling with a web crawler.
10. The method according to claim 9, characterized in that, after obtaining the multiple Uighur texts, the method further comprises:
preprocessing the multiple Uighur texts to obtain processed Uighur texts;
splitting the processed Uighur texts to obtain the first text set and the second text set.
11. A sentiment classification device for Uighur text, characterized by comprising:
an acquisition module, configured to obtain multiple Uighur texts;
a division module, configured to split the multiple Uighur texts to obtain a first text set and a second text set, wherein the first text set includes a first quantity of Uighur texts, the second text set includes a second quantity of Uighur texts, and the first quantity is less than the second quantity;
a generation module, configured to generate a sentiment classifier based on the first text set and corresponding annotation information;
a classification module, configured to perform sentiment classification on the second text set using the sentiment classifier to obtain a sentiment classification result.
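For illustration only, steps A1 to A4 of claim 6 can be read as a single pass over the sample points in which each point is assigned to its nearest center and that center is then recomputed. The sketch below follows that reading. The function name `sequential_kmeans`, the use of Euclidean distance and the representation of texts as numeric tuples are assumptions; the initial centers here act only as seeds and are not counted as cluster members.

```python
from math import dist


def sequential_kmeans(points, initial_centers):
    """Assign points one at a time, updating the affected center after each."""
    clusters = [[] for _ in initial_centers]
    centers = [tuple(c) for c in initial_centers]
    for point in points:
        # A1: distance from the current point to every cluster center.
        nearest = min(range(len(centers)), key=lambda k: dist(point, centers[k]))
        # A2: store the point in the nearest cluster.
        clusters[nearest].append(point)
        # A3: recompute the center of the cluster that changed.
        members = clusters[nearest]
        centers[nearest] = tuple(
            sum(m[i] for m in members) / len(members) for i in range(len(point))
        )
        # A4: the loop moves on to the next sample point.
    return clusters, centers
```

Because only the cluster that received the point changes in each iteration, recomputing that one center has the same effect as recomputing every center in step A3.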
CN201710080052.8A 2017-02-14 2017-02-14 Emotion classification method and device for Uygur language text Active CN106844743B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710080052.8A CN106844743B (en) 2017-02-14 2017-02-14 Emotion classification method and device for Uygur language text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710080052.8A CN106844743B (en) 2017-02-14 2017-02-14 Emotion classification method and device for Uygur language text

Publications (2)

Publication Number Publication Date
CN106844743A true CN106844743A (en) 2017-06-13
CN106844743B CN106844743B (en) 2020-04-24

Family

ID=59128725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710080052.8A Active CN106844743B (en) 2017-02-14 2017-02-14 Emotion classification method and device for Uygur language text

Country Status (1)

Country Link
CN (1) CN106844743B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241518A (en) * 2017-07-11 2019-01-18 北京交通大学 A kind of detection network navy method based on sentiment analysis
CN110782876A (en) * 2019-10-21 2020-02-11 华中科技大学 Unsupervised active learning method for voice emotion calculation
CN112101393A (en) * 2019-06-18 2020-12-18 上海电机学院 Wind power plant fan clustering method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682124A (en) * 2012-05-16 2012-09-19 苏州大学 Emotion classifying method and device for text
CN103020249A (en) * 2012-12-19 2013-04-03 苏州大学 Classifier construction method and device as well as Chinese text sentiment classification method and system
CN103530286A (en) * 2013-10-31 2014-01-22 苏州大学 Multi-class sentiment classification method
CN105912576A (en) * 2016-03-31 2016-08-31 北京外国语大学 Emotion classification method and emotion classification system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李响 等: "基于主动学习的SVM维吾尔语情感分析研究", 《新疆大学学报(自然科学版)》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241518A (en) * 2017-07-11 2019-01-18 北京交通大学 A kind of detection network navy method based on sentiment analysis
CN109241518B (en) * 2017-07-11 2021-01-22 北京交通大学 Network water army detection method based on emotion analysis
CN112101393A (en) * 2019-06-18 2020-12-18 上海电机学院 Wind power plant fan clustering method and device
CN110782876A (en) * 2019-10-21 2020-02-11 华中科技大学 Unsupervised active learning method for voice emotion calculation

Also Published As

Publication number Publication date
CN106844743B (en) 2020-04-24

Similar Documents

Publication Publication Date Title
CN110750656B (en) Multimedia detection method based on knowledge graph
Cole et al. Document retrieval for e-mail search and discovery using formal concept analysis
Kumar et al. Knowledge discovery from database using an integration of clustering and classification
CN108256104A (en) Internet site compressive classification method based on multidimensional characteristic
CN108776774A (en) A kind of human facial expression recognition method based on complexity categorization of perception algorithm
CN105912684B (en) The cross-media retrieval method of view-based access control model feature and semantic feature
Saraçoğlu et al. A fuzzy clustering approach for finding similar documents using a novel similarity measure
CN108154198A (en) Knowledge base entity normalizing method, system, terminal and computer readable storage medium
CN110196945B (en) Microblog user age prediction method based on LSTM and LeNet fusion
CN107832412B (en) Publication clustering method based on literature citation relation
CN108897778A (en) A kind of image labeling method based on multi-source big data analysis
CN106844743A (en) The sensibility classification method and device of Uighur text
CN108446964A (en) A kind of user's recommendation method based on mobile flow DPI data
CN104008177B (en) Rule base structure optimization and generation method and system towards linguistic indexing of pictures
Kanda et al. A deep learning-based recognition technique for plant leaf classification
Akagi et al. Explainable deep learning reproduces a ‘professional eye’on the diagnosis of internal disorders in persimmon fruit
Chamoso et al. Profile generation system using artificial intelligence for information recovery and analysis
Song et al. Texture analysis by genetic programming
Kuppusamy et al. A personalized web page content filtering model based on segmentation
CN107368610A (en) Big text CRF and rule classification method and system based on full text
Laturnus et al. MorphoPy: A python package for feature extraction of neural morphologies.
CN106775694A (en) A kind of hierarchy classification method of software merit rating code product
CN106446531A (en) Family tree construction method based on prior decision model
Thepade et al. Decision fusion-based approach for content-based image classification
CN106156256A (en) A kind of user profile classification transmitting method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant