CN107977352A - Information processing apparatus and method - Google Patents

Information processing apparatus and method

Info

Publication number
CN107977352A
Authority
CN
China
Prior art keywords
emotion
word
word vector
candidate
seed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610921729.1A
Other languages
Chinese (zh)
Inventor
孟遥
陈大军
张波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201610921729.1A priority Critical patent/CN107977352A/en
Publication of CN107977352A publication Critical patent/CN107977352A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Abstract

The present disclosure relates to an information processing apparatus and method. The information processing apparatus includes: a corpus acquisition unit that obtains a text corpus from the Internet, wherein the text corpus includes a training corpus and an unlabeled corpus; a word vector training unit that trains word vectors on the training corpus, wherein the word vector of each word is k-dimensional; a word vector dimensionality reduction unit that reduces the dimensionality of the matrix formed by the word vectors of all words of each sentence in the training corpus; and a normalization unit that normalizes the reduced matrix to obtain normalized word vector features. The information processing apparatus according to the present disclosure can effectively reduce and normalize word vectors to a fixed dimensionality and thereby obtain normalized word vector features.

Description

Information processing apparatus and method
Technical field
The present disclosure relates to the technical field of information processing, and more particularly to an apparatus and method for emotion word classification.
Background
This section provides background information related to the present disclosure, which is not necessarily prior art.
With the continuous development of artificial intelligence, affective computing plays an increasingly important role in human-computer interaction. Traditional emotion recognition tasks are based mainly on emotion dictionaries and rules and depend heavily on the emotion dictionary, which both limits coverage and consumes considerable time.
Summary of the invention
This section provides a general summary of the present disclosure, rather than a comprehensive disclosure of its full scope or of all of its features.
An object of the present disclosure is to provide an information processing apparatus and an information processing method that effectively reduce and normalize word vectors to a fixed dimensionality and thereby obtain normalized word vector features.
According to one aspect of the present disclosure, there is provided an information processing apparatus, the apparatus comprising: a corpus acquisition unit that obtains a text corpus from the Internet, wherein the text corpus includes a training corpus and an unlabeled corpus; a word vector training unit that trains word vectors on the training corpus, wherein the word vector of each word is k-dimensional; a word vector dimensionality reduction unit that reduces the dimensionality of the matrix formed by the word vectors of all words of each sentence in the training corpus; and a normalization unit that normalizes the reduced matrix to obtain normalized word vector features.
According to another aspect of the present disclosure, there is provided an information processing method, the method comprising: obtaining a text corpus from the Internet, wherein the text corpus includes a training corpus and an unlabeled corpus; training word vectors on the training corpus, wherein the word vector of each word is k-dimensional; reducing the dimensionality of the matrix formed by the word vectors of all words of each sentence in the training corpus; and normalizing the reduced matrix to obtain normalized word vector features.
According to another aspect of the present disclosure, there is provided a program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, causes the computer to perform the information processing method according to the present disclosure.
According to another aspect of the present disclosure, there is provided a machine-readable storage medium carrying the program product according to the present disclosure.
With the information processing apparatus and method according to the present disclosure, the matrix formed by the word vectors of all words of each sentence in the training corpus is reduced in dimensionality, and the reduced matrix is normalized to obtain normalized word vector features. The information processing apparatus and method according to the present disclosure can thus effectively reduce and normalize word vectors to a fixed dimensionality and obtain normalized word vector features, which facilitates emotion word classification.
Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
Brief description of the drawings
The drawings described herein are intended only to illustrate selected embodiments, not all possible implementations, and are not intended to limit the scope of the present disclosure. In the drawings:
Fig. 1 is a block diagram of an information processing apparatus according to an embodiment of the present disclosure;
Figs. 2a to 2c schematically show the construction of word vector features;
Fig. 3 is a block diagram of an information processing apparatus according to another embodiment of the present disclosure;
Fig. 4 schematically shows part of a seed emotion dictionary;
Fig. 5 schematically shows the format of word vector and character vector features;
Fig. 6 is a block diagram of part of an information processing apparatus according to an embodiment of the present disclosure;
Fig. 7 is a block diagram of an information processing apparatus according to yet another embodiment of the present disclosure;
Fig. 8 is a flowchart of an information processing method according to an embodiment of the present disclosure; and
Fig. 9 is a block diagram of an example configuration of a general-purpose personal computer in which the information processing apparatus and method according to embodiments of the present disclosure can be implemented.
Although the present disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail herein. It should be understood, however, that the description of specific embodiments herein is not intended to limit the present disclosure to the particular forms disclosed; on the contrary, the present disclosure is intended to cover all modifications, equivalents and alternatives falling within its spirit and scope. It should be noted that corresponding reference numerals indicate corresponding components throughout the several drawings.
Detailed description of embodiments
Examples of the present disclosure will now be described more fully with reference to the drawings. The following description is merely exemplary in nature and is not intended to limit the present disclosure or its application or uses.
Example embodiments are provided so that this disclosure will be thorough and will fully convey its scope to those skilled in the art. Numerous specific details, such as particular components, apparatuses and methods, are set forth to provide a thorough understanding of embodiments of the present disclosure. It will be apparent to those skilled in the art that example embodiments may be implemented in many different forms without these specific details, and neither the details nor the forms should be construed to limit the scope of the present disclosure. In some example embodiments, well-known processes, well-known structures and well-known technologies are not described in detail.
With the technical solution of the present disclosure, word vectors can be effectively reduced and normalized to a fixed dimensionality to obtain normalized word vector features, which facilitates emotion word classification.
Fig. 1 shows a block diagram of an information processing apparatus 100 according to an embodiment of the present disclosure. As shown in Fig. 1, the information processing apparatus 100 according to the embodiment of the present disclosure may include a corpus acquisition unit 110, a word vector training unit 120, a word vector dimensionality reduction unit 130 and a normalization unit 140.
The corpus acquisition unit 110 may obtain a text corpus from the Internet (for example, from microblogs). The text corpus may include a labeled training corpus and an unlabeled corpus.
Next, the word vector training unit 120 may train word vectors on the training corpus, where the word vector of each word is k-dimensional. For example, Google's word2vec tool may be used to train the word vectors. However, the present disclosure is not limited thereto; those skilled in the art will appreciate that other suitable tools or means may be used to train word vectors.
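As an illustrative sketch of this step, the following Python snippet trains k-dimensional word vectors with gensim's Word2Vec, standing in for Google's word2vec tool; the tiny corpus, the tokenization and k = 100 are placeholder assumptions, not values from the patent.

```python
# Minimal sketch: train k-dimensional word vectors on a tokenized corpus.
# gensim's Word2Vec stands in here for Google's word2vec tool.
from gensim.models import Word2Vec

corpus = [
    ["ran", "into", "a", "primary", "school", "classmate", "today", "</s>"],
    ["my", "mood", "is", "great", "</s>"],
]
model = Word2Vec(corpus, vector_size=100, min_count=1)  # k = 100
vec = model.wv["mood"]                                  # one k-dimensional word vector
```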
However, since each piece of corpus differs in length, the word vectors need to be reduced in dimensionality in order to use word vector features effectively.
Then, the word vector dimensionality reduction unit 130 may reduce the dimensionality of the matrix M formed by the word vectors of all words of each sentence in the training corpus. In the training corpus, the word vectors of all words of each sentence form a matrix M, in which each word vector is k-dimensional. For example, as shown in Fig. 2a, the word vectors of all words of the sentence glossed 'Ran into a primary-school classmate today; mood is great.' (9 words in total, with the end of the sentence replaced by </s>) form a 9 × k matrix. The word vector dimensionality reduction unit 130 may reduce this 9 × k matrix to the 7 × 3k matrix shown in Fig. 2b.
Next, the normalization unit 140 may normalize the reduced matrix M' to obtain normalized word vector features. For example, as shown in Fig. 2c, the normalization unit 140 may normalize the 7 × 3k matrix to obtain a feature vector of fixed dimensionality, e.g. a 1 × 3k feature vector. This fixed-dimensional feature vector is then taken as the word vector feature.
The information processing apparatus 100 according to the embodiment of the present disclosure thus effectively reduces and normalizes word vectors to a fixed dimensionality, obtaining normalized word vector features.
For a better understanding of the technical solution of the present disclosure, the information processing apparatus of the present disclosure is described below in more detail.
In the information processing apparatus according to an embodiment of the present disclosure, the word vector dimensionality reduction unit 130 may further include an extraction unit and a concatenation unit.
Specifically, the extraction unit may extract the n-grams of each sentence in the training corpus, where n is 2, 3 or 4. For example, the tri-grams of each sentence may be extracted, where the i-th word x_i ∈ R^k (1 ≤ i ≤ 3) is a k-dimensional word vector. Assuming the whole sentence consists of m words, m − 2 tri-grams can be extracted.
Next, the concatenation unit may concatenate the word vectors of the words in each n-gram to obtain the reduced matrix M' of each sentence. For example, in the case of n = 3 (extracting the tri-grams of each sentence), the three word vectors in each tri-gram are concatenated, yielding a new 3 × k-dimensional vector. For a whole sentence consisting of m words, an (m − 2) × (3 × k)-dimensional matrix is obtained, as shown in Fig. 2b.
Then, the normalization unit 140 may normalize the reduced matrix M' to obtain normalized word vector features. Those skilled in the art will appreciate that any appropriate means may be used for the normalization. In the information processing apparatus according to an embodiment of the present disclosure, the normalization unit 140 may further include a normalization computing unit.
Specifically, the normalization computing unit may compute the mean of each column of the reduced matrix M' to obtain the normalized word vector features. For example, in the case of n = 3 (extracting the tri-grams of each sentence), after the concatenation unit performs the concatenation to obtain the (m − 2) × (3 × k)-dimensional matrix (as shown in Fig. 2b), the normalization computing unit may compute the mean of each of the 3 × k columns, thereby obtaining a word vector feature of dimensionality 3 × k (as shown in Fig. 2c).
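A minimal NumPy sketch of the tri-gram splicing and column-mean normalization just described; the function name and the random placeholder vectors are assumptions for illustration.

```python
import numpy as np

def sentence_feature(word_vectors: np.ndarray, n: int = 3) -> np.ndarray:
    """word_vectors: (m, k) matrix, one k-dimensional vector per word
    (sentence boundary already replaced by the </s> vector).
    Each n-gram's vectors are concatenated into an n*k-dimensional row,
    giving the reduced matrix M'; the column means then yield the fixed
    (n*k,)-dimensional normalized feature."""
    m, k = word_vectors.shape
    rows = [word_vectors[i:i + n].reshape(-1) for i in range(m - n + 1)]
    M_reduced = np.stack(rows)        # (m - n + 1, n * k), e.g. 7 x 3k for m = 9
    return M_reduced.mean(axis=0)     # (n * k,), e.g. 1 x 3k as in Fig. 2c

k = 100
sentence = np.random.randn(9, k)      # 9 tokens, as in the example above
feature = sentence_feature(sentence)  # shape (300,) == 3k, fixed for any m
```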
In addition, an information processing apparatus 300 according to another embodiment of the present disclosure is provided. Fig. 3 shows the information processing apparatus 300 according to another embodiment of the present disclosure. Apart from a classifier training unit 350 and a corpus classification unit 360, the other components of the information processing apparatus 300 shown in Fig. 3 are identical to those of the information processing apparatus 100 shown in Fig. 1, and their description is not repeated here.
As shown in Fig. 3, in addition to the corpus acquisition unit 110, the word vector training unit 120, the word vector dimensionality reduction unit 130 and the normalization unit 140, the information processing apparatus 300 may further include a classifier training unit 350 and a corpus classification unit 360.
The classifier training unit 350 may train a classifier model using the normalized word vector features as classifier features. For example, the classifier training unit 350 may use the 3 × k-dimensional word vector feature shown in Fig. 2c as the classifier feature and train the classifier model with a support vector machine (SVM). However, the present disclosure is not limited thereto; those skilled in the art will appreciate that the classifier model may be trained with any appropriate tool known in the art.
Next, the corpus classification unit 360 may classify the unlabeled corpus based on the trained classifier model.
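A sketch of these two units under stated assumptions: scikit-learn's SVC stands in for the support vector machine named above, and the feature matrices and labels are random placeholders.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 300))     # placeholder 3k-dim sentence features (k = 100)
y_train = rng.integers(0, 6, size=200)    # placeholder emotion category labels
X_unlabeled = rng.normal(size=(50, 300))  # features of the unlabeled corpus

clf = SVC(kernel="linear")                # classifier training unit 350
clf.fit(X_train, y_train)
predicted = clf.predict(X_unlabeled)      # corpus classification unit 360
```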
The information processing apparatus 300 according to the embodiment of the present disclosure effectively reduces and normalizes word vectors to a fixed dimensionality to obtain normalized word vector features, which facilitates emotion word classification.
In addition, in order to further perform emotion classification, the information processing apparatus according to yet another embodiment of the present disclosure may further include a candidate word set determination unit 610 as shown in Fig. 6.
Specifically, the candidate word set determination unit 610 may obtain a seed emotion dictionary in which each seed emotion word is classified into one of multiple different emotion categories. For example, according to an embodiment of the present disclosure, the candidate word set determination unit 610 may obtain seed emotion words extracted from an emotion ontology lexicon. According to another embodiment of the present disclosure, the candidate word set determination unit 610 may obtain textual versions of emoticons, such as 'giggle' and 'heartily', from e.g. microblogs as seed emotion words. Moreover, according to yet another embodiment of the present disclosure, the candidate word set determination unit 610 may use the emotion words extracted from the emotion ontology lexicon together with the textual emoticon versions as the seed emotion dictionary.
In the seed emotion dictionary, each seed emotion word may be classified into one of multiple different emotion categories based on its emotional tendency. For example, as shown in Fig. 5, each seed emotion word may be classified into one of joy, anger, sorrow, fear, surprise and worry. However, the present disclosure is not limited thereto; those skilled in the art will appreciate that emotion categories other than those shown in Fig. 5 are possible.
Then, the candidate word set determination unit 610 may train the word vector of each seed emotion word. Likewise, the candidate word set determination unit 610 may use Google's word2vec tool to train the word vectors of the seed emotion words, although the present disclosure is not limited thereto.
Next, the candidate word set determination unit 610 may determine a candidate emotion word set based on the cosine distance between the word vector of each word in the text corpus and the word vector of each seed emotion word.
Specifically, the candidate word set determination unit 610 may determine the candidate emotion word set based on the cosine distance d_ij between the word vector of each word W_j in the text corpus and the word vector of each seed emotion word w_i. To determine the candidate emotion word set, a threshold must be set for the cosine distance d_ij; this threshold may be determined empirically. For example, according to an embodiment of the present disclosure, assuming the threshold is set to 0.6, a word W_j in the text corpus may be added to the candidate emotion word set as a candidate word when d_ij ≥ 0.6.
For example, assume that the word vectors of the seed emotion words and of the words in the text corpus are all 300-dimensional, that a seed emotion word w_i = 'excellent' has the word vector v_i = [x_1, x_2, ..., x_300] = [0, 1, ..., 0.5] and belongs to the 'joy' category, and that a word W_j = 'strength is dazzled' in the text corpus has the word vector v_j = [y_1, y_2, ..., y_300] = [0.2, 0.3, ..., 0.6]. The cosine distance between them is then computed using formula (1): d_ij = (v_i · v_j) / (||v_i|| ||v_j||).
Here d_ij > 0.6, so 'strength is dazzled' is added to the candidate emotion word set.
In order to perform emotion classification on the candidate emotion words in the candidate emotion word set, the information processing apparatus according to yet another embodiment of the present disclosure may further include classification units 620, 630 and 640 as shown in Fig. 6.
The classification unit 620 may train the character vectors of the seed emotion words, combine the word vector and the character vectors of each seed emotion word as the classifier feature, and perform emotion classification on the candidate emotion words in the candidate emotion word set.
For example, assume that the word vector and the character vectors of each seed emotion word are 300-dimensional. For a seed emotion word glossed 'happiness', its word vector is [x_1, x_2, ..., x_300], and the character vectors of its two characters (glossed 'beautiful' and 'full') are [cx_1, cx_2, ..., cx_300] and [cy_1, cy_2, ..., cy_300] respectively; the classifier feature can then be represented as shown in Fig. 6. Next, the classification unit 620 may perform emotion classification on the emotion words in the candidate emotion word set; the classification result is denoted category 1.
The classification unit 630 may form, for the seed emotion words in each emotion category, a two-dimensional matrix from their word vectors, yielding multiple two-dimensional matrices. Next, the classification unit 630 may compute the center of each matrix using, for example, a Gaussian mixture model (GMM). Then, the classification unit 630 may compute the probability of a candidate emotion word at each matrix center and take the category with the maximum probability as the category of the candidate emotion word, denoted category 2.
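A sketch of this second mechanism under stated assumptions: scikit-learn's GaussianMixture stands in for the GMM, a single mixture component is fitted per category, and the per-category dictionary layout is illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_category(candidate_vec: np.ndarray, seeds_by_category: dict):
    """seeds_by_category: {category: (n_i, k) array of seed word vectors}.
    Fits one GMM per emotion category and returns the category under which
    the candidate word's vector has the highest likelihood (category 2)."""
    best_category, best_loglik = None, -np.inf
    for category, vectors in seeds_by_category.items():
        gmm = GaussianMixture(n_components=1, random_state=0).fit(vectors)
        loglik = gmm.score_samples(candidate_vec[None, :])[0]  # log-likelihood
        if loglik > best_loglik:
            best_category, best_loglik = category, loglik
    return best_category
```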
For sentences in the text corpus that contain a seed emotion word or a candidate emotion word, the classification unit 640 may extract the words before and after the seed emotion word, which together with the seed emotion word form a seed triple, or extract the words before and after the candidate emotion word, which together with the candidate emotion word form a candidate triple. Then, the classification unit 640 may perform emotion classification on the candidate triples based on the seed triples and take the category of each candidate triple as the category of the corresponding emotion word, denoted category 3.
Specifically, the large-scale text corpus is split into sentences according to punctuation marks. The sentences containing a seed emotion word or a candidate emotion word are then selected, and corpus entries in which a negation word appears before or after the seed emotion word are removed.
Then, for each corpus entry containing a seed emotion word, the words before and after the seed emotion word are extracted and form a triple with that seed emotion word. Assume the seed emotion word is E_i and the words before and after it are E_{i-1} and E_{i+1} respectively; the triple is then T(E_{i-1}, E_i, E_{i+1}). If E_i is at the beginning or end of the sentence, the preceding or following word is replaced by </s>, so the resulting triple is, for example, T('really', 'glad', </s>). For a triple T(E_{i-1}, E_i, E_{i+1}), the emotion category C_i of the seed emotion word E_i is known, so the triple T(E_{i-1}, E_i, E_{i+1}) is labeled with the emotion category C_i.
Next, the classification unit 640 may form a 3 × k-dimensional feature from the word vectors of the three words in each triple T(E_{i-1}, E_i, E_{i+1}), perform emotion classification on the triple of each candidate emotion word with, for example, an SVM classifier, and take the emotion category of that triple as the category of the candidate emotion word.
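A sketch of the triple feature construction; the helper names and the zero-vector fallback for unseen words are assumptions, and the downstream SVM is the same machinery as sketched earlier.

```python
import numpy as np

SENTENCE_BOUNDARY = "</s>"

def triple_feature(tokens: list, i: int, vectors: dict, k: int) -> np.ndarray:
    """tokens: the sentence's word list; i: index of the seed or candidate
    emotion word; vectors: dict mapping words (including </s>) to k-dim arrays.
    Returns the 3k-dimensional concatenation for T(E_{i-1}, E_i, E_{i+1})."""
    def vec_at(j: int) -> np.ndarray:
        word = tokens[j] if 0 <= j < len(tokens) else SENTENCE_BOUNDARY
        return vectors.get(word, np.zeros(k))  # unseen word -> zero vector
    return np.concatenate([vec_at(i - 1), vec_at(i), vec_at(i + 1)])
```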
Fig. 6 shows part of the information processing apparatus according to an embodiment of the present disclosure, including the three classification units 620, 630 and 640. However, the present disclosure is not necessarily limited thereto; the present disclosure may include only one or two of the three classification units 620, 630 and 640.
After the three different classification units 620, 630 and 640 described above perform emotion word classification, in order to classify still more accurately, an information processing apparatus 700 according to yet another embodiment of the present disclosure is provided. Fig. 7 shows the information processing apparatus 700 according to yet another embodiment of the present disclosure. Apart from a category determination unit 650 and a feature dimensionality reduction unit 750, the other components of the information processing apparatus 700 shown in Fig. 7 are identical to those of the information processing apparatus 300 shown in Fig. 3, and their description is not repeated here.
As shown in Fig. 7, in addition to the corpus acquisition unit 110, the word vector training unit 120, the word vector dimensionality reduction unit 130, the normalization unit 140, the classifier training unit 350 and the corpus classification unit 360, the information processing apparatus 700 may further include a category determination unit 650 and a feature dimensionality reduction unit 750.
Specifically, when at least two of the classification results of a candidate emotion word in the candidate emotion word set from the classification unit 620, the classification unit 630 and the classification unit 640 are identical, the category determination unit 650 may determine the emotion category of that candidate emotion word and add it to the seed emotion dictionary, thereby obtaining an emotion word set.
For example, if the candidate emotion word 'strength is dazzled' is labeled 'joy', 'surprise' and 'joy' in category 1, category 2 and category 3 respectively, the category determination unit 650 determines that the emotion category of 'strength is dazzled' is 'joy' and adds 'strength is dazzled' to the seed emotion dictionary.
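The two-out-of-three decision rule can be sketched directly; returning None for the no-agreement case corresponds to the unclassified emotion word set described further below.

```python
from collections import Counter

def decide_category(c1: str, c2: str, c3: str):
    """Returns the majority emotion category when at least two of the three
    classification results agree, otherwise None (left unclassified)."""
    category, votes = Counter([c1, c2, c3]).most_common(1)[0]
    return category if votes >= 2 else None

assert decide_category("joy", "surprise", "joy") == "joy"  # the example above
```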
Then, the classifier training unit 350 may also use the emotion word set together with the word vector features as the classifier features. Those skilled in the art will appreciate that the classifier training unit 350 can train the model in the manner described above, which is not repeated here.
In addition, as shown in Fig. 7, the information processing apparatus according to yet another embodiment of the present disclosure further includes the feature dimensionality reduction unit 750. The feature dimensionality reduction unit 750 may reduce the dimensionality of the N-gram features of the text corpus, where N is 1 or 2.
Specifically, the feature dimensionality reduction unit 750 may extract the N-grams of each text, where N = 1 or N = 2. The feature dimensionality reduction unit 750 may then select the N-grams whose feature weight exceeds a certain threshold; similarly, this threshold may be determined empirically. For example, for an N-gram (denoted t) in a certain emotion category c, the weight is computed from four counts A, B, C and D.
The meanings of A, B, C and D are as follows:

                       Belongs to category c    Not in category c    Total
Contains t             A                        B                    A+B
Does not contain t     C                        D                    C+D
Total                  A+C                      B+D                  A+B+C+D
Here A denotes the number of sentences in the text corpus that belong to category c and contain t; B denotes the number of sentences that do not belong to category c but contain t; C denotes the number of sentences that belong to category c but do not contain t; and D denotes the number of sentences that neither belong to category c nor contain t.
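The weight formula itself is not reproduced in this text. One standard choice in text feature selection that uses exactly these four counts, assumed here for illustration, is the chi-square statistic:

```python
def ngram_weight(A: int, B: int, C: int, D: int) -> float:
    """Assumed chi-square weight over the contingency table above:
    chi2(t, c) = N * (A*D - B*C)**2 / ((A+B) * (C+D) * (A+C) * (B+D)),
    where N = A + B + C + D. N-grams whose weight exceeds the empirically
    chosen threshold are retained."""
    N = A + B + C + D
    denominator = (A + B) * (C + D) * (A + C) * (B + D)
    return N * (A * D - B * C) ** 2 / denominator if denominator else 0.0
```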
Next, the classifier training unit 350 may also use the reduced N-gram features together with the word vector features as the classifier features. Alternatively, the classifier training unit 350 may use the reduced N-gram features, the word vector features and the emotion word set together as the classifier features. Likewise, those skilled in the art will appreciate that the classifier training unit 350 can train the model in the manner described above, which is not repeated here.
The information processing apparatus 700 according to the embodiment of the present disclosure effectively reduces and normalizes word vectors, and automatically builds and updates the emotion dictionary from the seed emotion words. In addition, the information processing apparatus 700 according to the embodiment of the present disclosure further improves the accuracy of emotion word classification based on the different emotion classification mechanisms.
On the other hand, when the classification results of a candidate emotion word in the candidate emotion word set from the classification unit 620, the classification unit 630 and the classification unit 640 are all different, the category determination unit 650 may further add that candidate emotion word to an unclassified emotion word set.
Next, the information processing apparatus according to yet another embodiment of the present disclosure may further include an iterative classification unit. For the candidate emotion words in the unclassified emotion word set, the iterative classification unit may cause the classification units 620, 630 and 640 to repeat the emotion classification until each candidate emotion word in the unclassified emotion word set has been assigned an emotion category or a predetermined number of iterations is reached.
An information processing method according to an embodiment of the present disclosure is now described with reference to Fig. 8. As shown in Fig. 8, the information processing method according to the embodiment of the present disclosure starts at step S810. In step S810, a text corpus is obtained from the Internet, the text corpus including a training corpus and an unlabeled corpus.
Next, in step S820, word vectors are trained on the training corpus, where the word vector of each word is k-dimensional.
Next, in step S830, the matrix formed by the word vectors of all words of each sentence in the training corpus is reduced in dimensionality.
Finally, in step S840, the reduced matrix is normalized to obtain normalized word vector features.
According to an embodiment of the present disclosure, reducing the dimensionality of the matrix formed by the word vectors of all words of each sentence in the training corpus may further include the steps of extracting the n-grams of each sentence in the training corpus, where n is 2, 3 or 4, and concatenating the word vectors of the words in each n-gram to obtain the reduced matrix of each sentence.
According to an embodiment of the present disclosure, normalizing the reduced matrix may further include the step of computing the mean of each column of the reduced matrix to obtain the normalized word vector features.
According to an embodiment of the present disclosure, the method may further include, after normalizing the reduced matrix, the steps of training a classifier model using the normalized word vector features as classifier features, and classifying the unlabeled corpus based on the trained classifier model.
According to an embodiment of the present disclosure, the method may further include the steps of obtaining a seed emotion dictionary, in which each seed emotion word is classified into one of multiple different emotion categories; training the word vectors of the seed emotion words; and determining a candidate emotion word set based on the cosine distances between the word vectors of the training corpus and the word vectors of the seed emotion words.
According to an embodiment of the present disclosure, the method may further include, after determining the candidate emotion word set, the steps of training the character vectors of the seed emotion words, and combining the word vector and the character vectors of each seed emotion word as the classifier feature to perform a first emotion classification on the candidate emotion words in the candidate emotion word set.
According to an embodiment of the present disclosure, the method may further include, after performing the first emotion classification on the candidate emotion words in the candidate emotion word set, the steps of forming, for each emotion category, a two-dimensional matrix from the word vectors of the seed emotion words in that category, yielding multiple two-dimensional matrices; computing the centers of the multiple two-dimensional matrices respectively; computing the probability of a candidate emotion word at each center; and performing a second emotion classification on the candidate emotion word based on the probabilities.
According to an embodiment of the present disclosure, the method may further include, after performing the second emotion classification on the candidate emotion word based on the probabilities, the steps of, for sentences in the training corpus containing a seed emotion word or a candidate emotion word, extracting the words before and after the seed emotion word, which together with the seed emotion word form a seed triple, or extracting the words before and after the candidate emotion word, which together with the candidate emotion word form a candidate triple; and classifying the candidate triple based on the seed triples and taking the category of the candidate triple as a third emotion classification of the candidate emotion word.
According to an embodiment of the present disclosure, the method may further include, after classifying the candidate triple based on the seed triples and taking the category of the candidate triple as the third emotion classification of the candidate emotion word, the steps of, when at least two of the results of the first emotion classification, the second emotion classification and the third emotion classification of a candidate emotion word in the candidate emotion word set are identical, determining the emotion category of that candidate emotion word, adding it to the seed emotion dictionary to obtain an emotion word set, and using the emotion word set together with the word vector features as the classifier features.
According to an embodiment of the present disclosure, the method may further include the steps of reducing the dimensionality of the N-gram features of the text corpus, where N is 1 or 2, and using the reduced N-gram features together with the word vector features, or the reduced N-gram features, the word vector features and the emotion word set together, as the classifier features.
According to an embodiment of the present disclosure, when the results of the first emotion classification, the second emotion classification and the third emotion classification of a candidate emotion word in the candidate emotion word set are all different, that candidate emotion word may be added to an unclassified emotion word set.
According to an embodiment of the present disclosure, the method may further include, for the candidate emotion words in the unclassified emotion word set, repeating the first emotion classification, the second emotion classification and the third emotion classification until each candidate emotion word in the unclassified emotion word set has been assigned an emotion category or a predetermined number of iterations is reached.
According to an embodiment of the present disclosure, the emotion words extracted from an emotion ontology lexicon and the textual versions of emoticons may be used as the seed emotion words.
According to an embodiment of the present disclosure, each seed emotion word in the seed emotion dictionary is classified into one of multiple different emotion categories based on the emotional tendency of that seed emotion word.
According to an embodiment of the present disclosure, the text corpus comes from microblogs on the Internet.
The various embodiments of the above steps of the information processing method according to embodiments of the present disclosure have been described in detail above and are not repeated here.
Obviously, each operational process of the information processing method according to the present disclosure can be implemented as computer-executable programs stored in various machine-readable storage media.
Moreover, the object of the present disclosure can also be achieved in the following manner: a storage medium storing the above executable program code is supplied directly or indirectly to a system or device, and a computer or central processing unit (CPU) in that system or device reads and executes the program code. As long as the system or device has the capability of executing programs, embodiments of the present disclosure are not limited to programs; the program may take any form, for example an object program, a program executed by an interpreter, a script supplied to an operating system, or the like.
The machine-readable storage media mentioned above include, but are not limited to, various memories and storage units; semiconductor devices; disk units such as optical, magnetic and magneto-optical disks; and other media suitable for storing information.
In addition, the technical solution of the present disclosure can also be realized by a computer connecting to a corresponding website on the Internet, downloading and installing the computer program code according to the present disclosure onto the computer, and then executing the program.
Fig. 9 is a block diagram of an example configuration of a general-purpose personal computer in which the information processing apparatus and method according to embodiments of the present disclosure can be implemented.
As shown in Fig. 9, a CPU 1301 performs various processing according to programs stored in a read-only memory (ROM) 1302 or loaded from a storage section 1308 into a random access memory (RAM) 1303. The RAM 1303 also stores, as needed, data required when the CPU 1301 performs the various processing. The CPU 1301, the ROM 1302 and the RAM 1303 are connected to one another via a bus 1304. An input/output interface 1305 is also connected to the bus 1304.
The following components are connected to the input/output interface 1305: an input section 1306 (including a keyboard, a mouse and the like), an output section 1307 (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker and the like), the storage section 1308 (including a hard disk and the like), and a communication section 1309 (including a network interface card such as a LAN card, a modem and the like). The communication section 1309 performs communication processing via a network such as the Internet. A drive 1310 may also be connected to the input/output interface 1305 as needed. A removable medium 1311, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 1310 as needed, so that a computer program read therefrom is installed into the storage section 1308 as needed.
In the case where the above series of processing is realized by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1311.
Those skilled in the art will understand that such a storage medium is not limited to the removable medium 1311 shown in Fig. 9, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 1311 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)) and a semiconductor memory. Alternatively, the storage medium may be the ROM 1302, a hard disk contained in the storage section 1308 or the like, in which the program is stored and which is distributed to the user together with the device containing it.
In the system and method of the present disclosure, it is obvious that the components or steps can be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalents of the present disclosure. Moreover, the steps performing the above series of processing can naturally be performed in the chronological order described, but need not necessarily be performed chronologically; some steps can be performed in parallel or independently of one another.
Although the embodiments of the present disclosure have been described above in detail with reference to the drawings, it should be understood that the embodiments described above are merely illustrative of the present disclosure and do not constitute a limitation thereon. Those skilled in the art can make various changes and modifications to the above embodiments without departing from the spirit and scope of the present disclosure. The scope of the present disclosure is therefore defined only by the appended claims and their equivalents.
With regard to the embodiments including the above examples, the following notes are also disclosed:
Note 1. An information processing apparatus, comprising:
a corpus acquisition unit that obtains a text corpus from the Internet, wherein the text corpus includes a training corpus and an unlabeled corpus;
a word vector training unit that trains word vectors on the training corpus, wherein the word vector of each word is k-dimensional;
a word vector dimensionality reduction unit that reduces the dimensionality of the matrix formed by the word vectors of all words of each sentence in the training corpus; and
a normalization unit that normalizes the reduced matrix to obtain normalized word vector features.
Note 2. The apparatus according to note 1, wherein the word vector dimensionality reduction unit further comprises:
an extraction unit that extracts the n-grams of each sentence in the training corpus, where n is 2, 3 or 4; and
a concatenation unit that concatenates the word vectors of the words in each n-gram to obtain the reduced matrix of each sentence.
Note 3. The apparatus according to note 2, wherein the normalization unit further comprises:
a normalization computing unit that computes the mean of each column of the reduced matrix to obtain the normalized word vector features.
Note 4. The apparatus according to note 1, 2 or 3, further comprising:
a classifier training unit that trains a classifier model using the normalized word vector features as classifier features; and
a corpus classification unit that classifies the unlabeled corpus based on the trained classifier model.
Note 5. The apparatus according to note 4, further comprising a candidate word set determination unit configured to:
obtain a seed emotion dictionary, in which each seed emotion word is classified into one of multiple different emotion categories;
train the word vectors of the seed emotion words; and
determine a candidate emotion word set based on the cosine distances between the word vectors of the training corpus and the word vectors of the seed emotion words.
Note 6. The apparatus according to note 5, further comprising a first classification unit configured to:
train the character vectors of the seed emotion words; and
combine the word vector and the character vectors of each seed emotion word as the classifier feature, and perform a first emotion classification on the candidate emotion words in the candidate emotion word set.
Note 7. The apparatus according to note 6, further comprising a second classification unit configured to:
form, for each emotion category, a two-dimensional matrix from the word vectors of the seed emotion words in that category, yielding multiple two-dimensional matrices;
compute the centers of the multiple two-dimensional matrices respectively;
compute the probability of a candidate emotion word at each center; and
perform a second emotion classification on the candidate emotion word based on the probabilities.
Note 8. The apparatus according to note 7, further comprising a third classification unit configured to:
for sentences in the training corpus containing a seed emotion word or a candidate emotion word, extract the words before and after the seed emotion word, which together with the seed emotion word form a seed triple, or extract the words before and after the candidate emotion word, which together with the candidate emotion word form a candidate triple; and
classify the candidate triple based on the seed triples and take the category of the candidate triple as a third emotion classification of the candidate emotion word.
Note 9. The apparatus according to note 8, further comprising a category determination unit, wherein, when at least two of the results of the first emotion classification, the second emotion classification and the third emotion classification of a candidate emotion word in the candidate emotion word set are identical, the category determination unit determines the emotion category of that candidate emotion word and adds it to the seed emotion dictionary to obtain an emotion word set, and
the classifier training unit further uses the emotion word set together with the word vector features as the classifier features.
Note 10. The apparatus according to note 4 or 9, further comprising:
a feature dimensionality reduction unit that reduces the dimensionality of the N-gram features of the text corpus, where N is 1 or 2, wherein
the classifier training unit further uses the reduced N-gram features together with the word vector features, or the reduced N-gram features, the word vector features and the emotion word set together, as the classifier features.
Note 11. The apparatus according to note 9, wherein, when the results of the first emotion classification, the second emotion classification and the third emotion classification of a candidate emotion word in the candidate emotion word set are all different, the category determination unit further adds that candidate emotion word to an unclassified emotion word set.
Note 12. The apparatus according to note 11, further comprising an iterative classification unit that, for the candidate emotion words in the unclassified emotion word set, repeats the first emotion classification, the second emotion classification and the third emotion classification until each candidate emotion word in the unclassified emotion word set has been assigned an emotion category or a predetermined number of iterations is reached.
Note 13. The apparatus according to note 5, wherein the candidate word set determination unit uses the emotion words extracted from an emotion ontology lexicon and the textual versions of emoticons as the seed emotion words.
Note 14. The apparatus according to note 5, wherein each seed emotion word in the seed emotion dictionary is classified into one of multiple different emotion categories based on the emotional tendency of that seed emotion word.
Note 15. An information processing method, comprising:
obtaining a text corpus from the Internet, wherein the text corpus includes a training corpus and an unlabeled corpus;
training word vectors on the training corpus, wherein the word vector of each word is k-dimensional;
reducing the dimensionality of the matrix formed by the word vectors of all words of each sentence in the training corpus; and
normalizing the reduced matrix to obtain normalized word vector features.
Note 16. The method according to note 15, wherein reducing the dimensionality of the matrix formed by the word vectors of all words of each sentence in the training corpus further comprises:
extracting the n-grams of each sentence in the training corpus, where n is 2, 3 or 4; and
concatenating the word vectors of the words in each n-gram to obtain the reduced matrix of each sentence.
Note 17. The method according to note 16, wherein normalizing the reduced matrix further comprises:
computing the mean of each column of the reduced matrix to obtain the normalized word vector features.
Note 18. The method according to note 15, 16 or 17, further comprising:
training a classifier model using the normalized word vector features as classifier features; and
classifying the unlabeled corpus based on the trained classifier model.
Note 19. The method according to note 18, further comprising:
obtaining a seed emotion dictionary, in which each seed emotion word is classified into one of multiple different emotion categories;
training the word vectors of the seed emotion words; and
determining a candidate emotion word set based on the cosine distances between the word vectors of the training corpus and the word vectors of the seed emotion words.
Note 20. A program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, causes the computer to perform the method according to any one of notes 15 to 19.

Claims (10)

1. An information processing apparatus, comprising:
a corpus acquisition unit that obtains a text corpus from the Internet, wherein the text corpus includes a training corpus and an unlabeled corpus;
a word vector training unit that trains word vectors on the training corpus, wherein the word vector of each word is k-dimensional;
a word vector dimensionality reduction unit that reduces the dimensionality of the matrix formed by the word vectors of all words of each sentence in the training corpus; and
a normalization unit that normalizes the reduced matrix to obtain normalized word vector features.
2. The apparatus according to claim 1, wherein the word vector dimensionality reduction unit further comprises:
an extraction unit that extracts the n-grams of each sentence in the training corpus, where n is 2, 3 or 4; and
a concatenation unit that concatenates the word vectors of the words in each n-gram to obtain the reduced matrix of each sentence.
3. The apparatus according to claim 1 or 2, further comprising:
a classifier training unit that trains a classifier model using the normalized word vector features as classifier features; and
a corpus classification unit that classifies the unlabeled corpus based on the trained classifier model.
4. The apparatus according to claim 3, further comprising a candidate word set determination unit configured to:
obtain a seed emotion dictionary, in which each seed emotion word is classified into one of multiple different emotion categories;
train the word vectors of the seed emotion words; and
determine a candidate emotion word set based on the cosine distances between the word vectors of the training corpus and the word vectors of the seed emotion words.
5. The apparatus according to claim 4, further comprising a first classification unit configured to:
train the character vectors of the seed emotion words; and
combine the word vector and the character vectors of each seed emotion word as the classifier feature, and perform a first emotion classification on the candidate emotion words in the candidate emotion word set.
6. The apparatus according to claim 5, further comprising a second classification unit configured to:
form, for each emotion category, a two-dimensional matrix from the word vectors of the seed emotion words in that category, yielding multiple two-dimensional matrices;
compute the centers of the multiple two-dimensional matrices respectively;
compute the probability of a candidate emotion word at each center; and
perform a second emotion classification on the candidate emotion word based on the probabilities.
7. The apparatus according to claim 6, further comprising a third classification unit configured to:
for sentences in the training corpus containing a seed emotion word or a candidate emotion word, extract the words before and after the seed emotion word, which together with the seed emotion word form a seed triple, or extract the words before and after the candidate emotion word, which together with the candidate emotion word form a candidate triple; and
classify the candidate triple based on the seed triples and take the category of the candidate triple as a third emotion classification of the candidate emotion word.
8. The apparatus according to claim 7, further comprising a category determination unit, wherein, when at least two of the results of the first emotion classification, the second emotion classification and the third emotion classification of a candidate emotion word in the candidate emotion word set are identical, the category determination unit determines the emotion category of that candidate emotion word and adds it to the seed emotion dictionary to obtain an emotion word set, and
the classifier training unit further uses the emotion word set together with the word vector features as the classifier features.
9. The apparatus according to claim 3 or 8, further comprising:
a feature dimensionality reduction unit that reduces the dimensionality of the N-gram features of the text corpus, where N is 1 or 2, wherein
the classifier training unit further uses the reduced N-gram features together with the word vector features, or the reduced N-gram features, the word vector features and the emotion word set together, as the classifier features.
10. An information processing method, comprising:
obtaining a text corpus from the Internet, wherein the text corpus includes a training corpus and an unlabeled corpus;
training word vectors on the training corpus, wherein the word vector of each word is k-dimensional;
reducing the dimensionality of the matrix formed by the word vectors of all words of each sentence in the training corpus; and
normalizing the reduced matrix to obtain normalized word vector features.
CN201610921729.1A 2016-10-21 2016-10-21 Information processor and method Pending CN107977352A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610921729.1A CN107977352A (en) 2016-10-21 2016-10-21 Information processor and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610921729.1A CN107977352A (en) 2016-10-21 2016-10-21 Information processor and method

Publications (1)

Publication Number Publication Date
CN107977352A true CN107977352A (en) 2018-05-01

Family

ID=62004764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610921729.1A Pending CN107977352A (en) 2016-10-21 2016-10-21 Information processor and method

Country Status (1)

Country Link
CN (1) CN107977352A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804423A (en) * 2018-05-30 2018-11-13 平安医疗健康管理股份有限公司 Medical Text character extraction and automatic matching method and system
CN109933793A (en) * 2019-03-15 2019-06-25 腾讯科技(深圳)有限公司 Text polarity identification method, apparatus, equipment and readable storage medium storing program for executing
CN111696674A (en) * 2020-06-12 2020-09-22 电子科技大学 Deep learning method and system for electronic medical record

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100332287A1 (en) * 2009-06-24 2010-12-30 International Business Machines Corporation System and method for real-time prediction of customer satisfaction
CN103678318A (en) * 2012-08-31 2014-03-26 富士通株式会社 Multi-word unit extraction method and equipment and artificial neural network training method and equipment
CN103885933A (en) * 2012-12-21 2014-06-25 富士通株式会社 Method and equipment for evaluating text sentiment
CN103927529A (en) * 2014-05-05 2014-07-16 苏州大学 Acquiring method, application method and application system of final classifier
CN105930411A (en) * 2016-04-18 2016-09-07 苏州大学 Classifier training method, classifier and sentiment classification system
US9460076B1 (en) * 2014-11-18 2016-10-04 Lexalytics, Inc. Method for unsupervised learning of grammatical parsers

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100332287A1 (en) * 2009-06-24 2010-12-30 International Business Machines Corporation System and method for real-time prediction of customer satisfaction
CN103678318A (en) * 2012-08-31 2014-03-26 富士通株式会社 Multi-word unit extraction method and equipment and artificial neural network training method and equipment
CN103885933A (en) * 2012-12-21 2014-06-25 富士通株式会社 Method and equipment for evaluating text sentiment
CN103927529A (en) * 2014-05-05 2014-07-16 苏州大学 Acquiring method, application method and application system of final classifier
US9460076B1 (en) * 2014-11-18 2016-10-04 Lexalytics, Inc. Method for unsupervised learning of grammatical parsers
CN105930411A (en) * 2016-04-18 2016-09-07 苏州大学 Classifier training method, classifier and sentiment classification system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804423A (en) * 2018-05-30 2018-11-13 平安医疗健康管理股份有限公司 Medical Text character extraction and automatic matching method and system
CN108804423B (en) * 2018-05-30 2023-09-08 深圳平安医疗健康科技服务有限公司 Medical text feature extraction and automatic matching method and system
CN109933793A (en) * 2019-03-15 2019-06-25 腾讯科技(深圳)有限公司 Text polarity identification method, apparatus, equipment and readable storage medium storing program for executing
CN109933793B (en) * 2019-03-15 2023-01-06 腾讯科技(深圳)有限公司 Text polarity identification method, device and equipment and readable storage medium
CN111696674A (en) * 2020-06-12 2020-09-22 电子科技大学 Deep learning method and system for electronic medical record
CN111696674B (en) * 2020-06-12 2023-09-08 电子科技大学 Deep learning method and system for electronic medical records

Similar Documents

Publication Publication Date Title
Al-Haija et al. Breast cancer diagnosis in histopathological images using ResNet-50 convolutional neural network
Janowczyk et al. Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases
CN113707235B (en) Drug micromolecule property prediction method, device and equipment based on self-supervision learning
Bi et al. Improving image-based plant disease classification with generative adversarial network under limited training set
JP2022541199A (en) A system and method for inserting data into a structured database based on image representations of data tables.
CN109740154A (en) A kind of online comment fine granularity sentiment analysis method based on multi-task learning
CN107301171A (en) A kind of text emotion analysis method and system learnt based on sentiment dictionary
KR102310487B1 (en) Apparatus and method for review analysis per attribute
CN106202030B (en) Rapid sequence labeling method and device based on heterogeneous labeling data
CN109840279A (en) File classification method based on convolution loop neural network
US20200065573A1 (en) Generating variations of a known shred
CN106445919A (en) Sentiment classifying method and device
CN112908436B (en) Clinical test data structuring method, clinical test recommending method and device
CN104346622A (en) Convolutional neural network classifier, and classifying method and training method thereof
CN114582470B (en) Model training method and device and medical image report labeling method
CN110222330B (en) Semantic recognition method and device, storage medium and computer equipment
CN110457677B (en) Entity relationship identification method and device, storage medium and computer equipment
CN108154198A (en) Knowledge base entity normalizing method, system, terminal and computer readable storage medium
Mazo et al. Classification of cardiovascular tissues using LBP based descriptors and a cascade SVM
WO2021238279A1 (en) Data classification method, and classifier training method and system
CN107977352A (en) Information processor and method
CN104537280B (en) Protein interactive relation recognition methods based on text relation similitude
CN108090099A (en) A kind of text handling method and device
CN106203508A (en) A kind of image classification method based on Hadoop platform
CN106326904A (en) Device and method of acquiring feature ranking model and feature ranking method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180501