CN107977352A - Information processing apparatus and method - Google Patents
- Publication number: CN107977352A (application CN201610921729.1A)
- Authority: CN (China)
- Prior art keywords: emotion, word, word vector, candidate, seed
- Legal status: Pending (an assumption, not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F40/00—Handling natural language data › G06F40/30—Semantic analysis
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F16/00—Information retrieval; Database structures therefor; File system structures therefor › G06F16/30—Information retrieval of unstructured textual data › G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This disclosure relates to an information processing apparatus and method. The information processing apparatus includes: a corpus acquiring unit that obtains a text corpus from the internet, wherein the text corpus includes a training corpus and an unlabeled corpus; a word vector training unit that trains word vectors on the training corpus, wherein the word vector of each word has k dimensions; a word vector dimensionality reduction unit that reduces the dimensionality of the matrix formed by the word vectors of all the words of each sentence in the training corpus; and a normalization unit that normalizes the reduced matrix to obtain normalized word vector features. The information processing apparatus according to the present disclosure can effectively reduce and normalize word vectors to a fixed dimensionality, thereby obtaining normalized word vector features.
Description
Technical field
The present disclosure relates to the technical field of information processing, and more particularly to an apparatus and method for emotion word classification.
Background
This section provides background information related to the present disclosure, which is not necessarily prior art.
With the continuous development of artificial intelligence technology, affective computing plays an increasingly important role in human-computer interaction. Traditional emotion recognition tasks are mainly based on methods such as emotion dictionaries and rules, and depend heavily on the emotion dictionary. This both limits coverage and consumes considerable time.
Summary of the invention
This section provides a general summary of the present disclosure, rather than a full disclosure of its complete scope or of all of its features.
An object of the present disclosure is to provide an information processing apparatus and an information processing method that effectively reduce and normalize word vectors to a fixed dimensionality, thereby obtaining normalized word vector features.
According to one aspect of the present disclosure, an information processing apparatus is provided, the apparatus including: a corpus acquiring unit that obtains a text corpus from the internet, the text corpus including a training corpus and an unlabeled corpus; a word vector training unit that trains word vectors on the training corpus, the word vector of each word having k dimensions; a word vector dimensionality reduction unit that reduces the dimensionality of the matrix formed by the word vectors of all the words of each sentence in the training corpus; and a normalization unit that normalizes the reduced matrix to obtain normalized word vector features.
According to another aspect of the present disclosure, an information processing method is provided, the method including: obtaining a text corpus from the internet, the text corpus including a training corpus and an unlabeled corpus; training word vectors on the training corpus, the word vector of each word having k dimensions; reducing the dimensionality of the matrix formed by the word vectors of all the words of each sentence in the training corpus; and normalizing the reduced matrix to obtain normalized word vector features.
According to another aspect of the present disclosure, a program product is provided, the program product including machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, causes the computer to perform the information processing method according to the present disclosure.
According to another aspect of the present disclosure, a machine-readable storage medium is provided, which carries the program product according to the present disclosure.
With the information processing apparatus and method according to the present disclosure, the matrix formed by the word vectors of all the words of each sentence in the training corpus is reduced in dimensionality, and the reduced matrix is normalized, so that normalized word vector features are obtained. The information processing apparatus and method according to the present disclosure can thus effectively reduce and normalize word vectors to a fixed dimensionality to obtain normalized word vector features, which facilitates emotion word classification.
Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
Brief description of the drawings
The drawings described herein are for illustrative purposes only of selected embodiments, not of all possible implementations, and are not intended to limit the scope of the present disclosure. In the drawings:
Fig. 1 is a block diagram of an information processing apparatus according to an embodiment of the present disclosure;
Fig. 2a to Fig. 2c schematically show the construction of word vector features;
Fig. 3 is a block diagram of an information processing apparatus according to another embodiment of the present disclosure;
Fig. 4 schematically shows a part of a seed emotion dictionary;
Fig. 5 schematically shows the format of word vector and character vector features;
Fig. 6 is a block diagram of a part of an information processing apparatus according to an embodiment of the present disclosure;
Fig. 7 is a block diagram of an information processing apparatus according to a further embodiment of the present disclosure;
Fig. 8 is a flow chart of an information processing method according to an embodiment of the present disclosure; and
Fig. 9 is a block diagram of an example configuration of a general-purpose personal computer in which the information processing apparatus and method according to embodiments of the present disclosure can be implemented.
While the present disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail herein. It should be understood, however, that the description of specific embodiments herein is not intended to restrict the present disclosure to the particular forms disclosed; on the contrary, the present disclosure is intended to cover all modifications, equivalents and alternatives falling within its spirit and scope. It should be noted that, throughout the several drawings, corresponding reference numerals indicate corresponding components.
Detailed description of embodiments
Examples of the present disclosure will now be described more fully with reference to the drawings. The following description is merely exemplary in nature and is not intended to limit the present disclosure, its application, or its uses.
Example embodiments are provided so that this disclosure will be thorough and will fully convey its scope to those skilled in the art. Numerous specific details, such as examples of particular components, apparatuses and methods, are set forth to provide a thorough understanding of embodiments of the present disclosure. It will be apparent to those skilled in the art that the example embodiments may be implemented in many different forms without these specific details, and they shall not be interpreted as limiting the scope of the present disclosure. In some example embodiments, well-known processes, well-known structures and well-known technologies are not described in detail.
With the technical solution of the present disclosure, word vectors can be effectively reduced and normalized to a fixed dimensionality to obtain normalized word vector features, which facilitates emotion word classification.
Fig. 1 shows a block diagram of an information processing apparatus 100 according to an embodiment of the present disclosure. As shown in Fig. 1, the information processing apparatus 100 according to an embodiment of the present disclosure may include a corpus acquiring unit 110, a word vector training unit 120, a word vector dimensionality reduction unit 130 and a normalization unit 140.

The corpus acquiring unit 110 may obtain a text corpus from the internet (for example, from microblogs). The text corpus may include a training corpus, i.e. labeled corpus, and an unlabeled corpus.

Next, the word vector training unit 120 may train word vectors on the training corpus, the word vector of each word having k dimensions. For example, Google's word2vec tool may be used to train the word vectors. However, the present disclosure is not limited thereto; those skilled in the art will appreciate that the word vectors may be trained with other suitable tools or means.
However, since each piece of corpus has a different length, the word vectors need to be reduced in dimensionality in order to make effective use of word vector features.

Then, the word vector dimensionality reduction unit 130 may reduce the dimensionality of the matrix M formed by the word vectors of all the words of each sentence in the training corpus. In the training corpus, the word vectors of all the words of each sentence can form a matrix M, in which each word vector has k dimensions. For example, as shown in Fig. 2a, the word vectors of all the words of the sentence "Ran into a primary school classmate today, and my mood is great." (9 words in total, with the end of the sentence replaced by </s>) can form a 9 × k matrix. The word vector dimensionality reduction unit 130 may reduce this 9 × k matrix to the 7 × 3k matrix shown in Fig. 2b.
Next, the normalization unit 140 may normalize the reduced matrix M' to obtain normalized word vector features. For example, as shown in Fig. 2c, the normalization unit 140 may normalize the 7 × 3k matrix to obtain a feature vector of fixed dimensionality, e.g. a 1 × 3k feature vector. The feature vector of fixed dimensionality can then be denoted as the word vector feature.

The information processing apparatus 100 according to an embodiment of the present disclosure effectively reduces and normalizes word vectors to a fixed dimensionality, thereby obtaining normalized word vector features.
For a better understanding of the technical solution of the present disclosure, the information processing apparatus of the present disclosure is described in more detail below.

According to an embodiment of the present disclosure, the word vector dimensionality reduction unit 130 may further include an extracting unit and a splicing unit.
Specifically, the extracting unit may extract the n-grams of each sentence in the training corpus, where n is 2, 3 or 4. For example, the tri-grams of each sentence may be extracted, where the i-th word xi ∈ Rk (1 ≤ i ≤ 3) is a k-dimensional word vector. Assuming that the whole sentence consists of m words, m-2 tri-grams can be extracted.

Next, the splicing unit may splice the word vectors of the words in each n-gram to obtain the reduced matrix M' of each sentence. For example, in the case of n = 3 (extracting the tri-grams of each sentence), the 3 word vectors in each tri-gram are spliced to obtain a new vector of 3 × k dimensions. For a sentence consisting of m words, an (m-2) × (3 × k) matrix is thus obtained, as shown in Fig. 2b.
Then, the normalization unit 140 may normalize the reduced matrix M' to obtain normalized word vector features. Those skilled in the art will appreciate that normalization may be performed by any appropriate means. According to an embodiment of the present disclosure, the normalization unit 140 may further include a normalization computing unit.

Specifically, the normalization computing unit may compute the average value of each column of the reduced matrix M' to obtain normalized word vector features. For example, in the case of n = 3 (extracting the tri-grams of each sentence), after the splicing unit has spliced the vectors to obtain the (m-2) × (3 × k) matrix (as shown in Fig. 2b), the normalization computing unit may compute the average value of each of the 3 × k columns, thereby obtaining a word vector feature of dimensionality 3 × k (as shown in Fig. 2c).
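The splicing and column-averaging steps above can be sketched in plain Python (a sketch under the assumption that trained word vectors are available as a dict from word to a list of floats; the function name is illustrative, not from the patent):

```python
def sentence_feature(words, vectors, n=3):
    """Splice the word vectors of each n-gram into one (n*k)-dim row,
    then average each column to get a fixed 1 x (n*k) feature vector."""
    k = len(next(iter(vectors.values())))
    rows = []
    for i in range(len(words) - n + 1):  # m words yield m-n+1 n-grams
        row = []
        for w in words[i:i + n]:
            row.extend(vectors[w])       # splice n vectors of k dims each
        rows.append(row)
    # Column-wise mean reduces the (m-n+1) x (n*k) matrix to 1 x (n*k).
    return [sum(r[j] for r in rows) / len(rows) for j in range(n * k)]
```

With n = 3 and a 9-word sentence this reproduces the flow of Fig. 2a to Fig. 2c: a 9 × k matrix, spliced into 7 × 3k, averaged into a fixed 1 × 3k feature.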
In addition, an information processing apparatus 300 according to another embodiment of the present disclosure is provided. Fig. 3 shows the information processing apparatus 300 according to another embodiment of the present disclosure. Except for the classifier training unit 350 and the corpus classification unit 360, the components of the information processing apparatus 300 shown in Fig. 3 are identical to those of the information processing apparatus 100 shown in Fig. 1, and are not described again here.

As shown in Fig. 3, in addition to the corpus acquiring unit 110, the word vector training unit 120, the word vector dimensionality reduction unit 130 and the normalization unit 140, the information processing apparatus 300 may further include a classifier training unit 350 and a corpus classification unit 360.
The classifier training unit 350 may train a classifier model using the normalized word vector features as classifier features. For example, the classifier training unit 350 may use the word vector feature of dimensionality 3 × k shown in Fig. 2c as the classifier feature and train the classifier model using a support vector machine (SVM). However, the present disclosure is not limited thereto; those skilled in the art will appreciate that the classifier model may be trained with appropriate tools known in the art.

Next, the corpus classification unit 360 may classify the unlabeled corpus based on the trained classifier model.

The information processing apparatus 300 according to an embodiment of the present disclosure effectively reduces and normalizes word vectors to a fixed dimensionality to obtain normalized word vector features, which facilitates emotion word classification.
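As a minimal illustration of this train-then-classify flow, the sketch below fits a stand-in classifier on fixed-dimension features and applies it to unlabeled items. A nearest-centroid rule deliberately replaces the SVM named in the text so the example stays self-contained; the feature handling and the overall flow are what it demonstrates:

```python
def train_centroid_model(features, labels):
    """Average the feature vectors of each label (stand-in for SVM training)."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        counts[y] = counts.get(y, 0) + 1
        acc = sums.setdefault(y, [0.0] * len(f))
        for j, x in enumerate(f):
            acc[j] += x
    return {y: [x / counts[y] for x in acc] for y, acc in sums.items()}

def classify(model, feature):
    """Assign the label of the nearest centroid."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda y: sqdist(model[y], feature))
```

In the apparatus, the training features would be the normalized 1 × 3k sentence vectors of the labeled corpus, and `classify` would be applied to each unlabeled sentence.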
In addition, in order to further perform emotion classification, the information processing apparatus according to a further embodiment of the present disclosure may further include a candidate word set determination unit 610 as shown in Fig. 6.

Specifically, the candidate word set determination unit 610 may obtain a seed emotion dictionary, in which each seed emotion word is classified into one of a plurality of different emotion categories. For example, according to an embodiment of the present disclosure, the candidate word set determination unit 610 may obtain the seed emotion words by extraction from an emotion ontology library. According to another embodiment of the present disclosure, the candidate word set determination unit 610 may obtain, for example from microblogs, the textual versions of emoticons, such as "giggle" and "hearty laugh", as seed emotion words. Furthermore, according to a further embodiment of the present disclosure, the candidate word set determination unit 610 may also use the emotion words extracted from the emotion ontology library together with the textual versions of emoticons as the seed emotion dictionary.
In the seed emotion dictionary, each seed emotion word can be classified into one of a plurality of different emotion categories based on its emotional tendency. For example, as shown in Fig. 5, each seed emotion word can be classified into one of joy, anger, sorrow, fear, surprise and worry. However, the present disclosure is not limited thereto; those skilled in the art will appreciate that there may be emotion categories other than those shown in Fig. 5.

Then, the candidate word set determination unit 610 may train the word vector of each seed emotion word. Likewise, the candidate word set determination unit 610 may use Google's word2vec tool to train the word vectors of the seed emotion words, but the present disclosure is not limited thereto.
Next, the candidate word set determination unit 610 may determine a candidate emotion word set based on the cosine distance between the word vector of each word in the text corpus and the word vector of each seed emotion word.

Specifically, the candidate word set determination unit 610 may determine the candidate emotion word set based on the cosine distance dij between the word vector of each word Wj in the text corpus and the word vector of each seed emotion word wi. To determine the candidate emotion word set, a threshold needs to be set for the cosine distance dij; the threshold can be determined empirically. For example, according to an embodiment of the present disclosure, assuming the threshold is set to 0.6, then when dij ≥ 0.6 the word Wj in the text corpus can be added to the candidate emotion word set as a candidate word.
For example, assume that the word vector of each seed emotion word and the word vector of each word in the text corpus are 300-dimensional, and that a seed emotion word wi = "excellent" has the word vector vi = [x1, x2, ..., x300] = [0, 1, ..., 0.5] and belongs to the "joy" category. Assume further that a word Wj (a slang word rendered here as "dazzling") in the text corpus has the word vector vj = [y1, y2, ..., y300] = [0.2, 0.3, ..., 0.6]. Next, the cosine distance between them is calculated using formula (1):

dij = (vi · vj) / (|vi| |vj|)    (1)

Here dij > 0.6, and "dazzling" is therefore added to the candidate emotion word set.
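The candidate selection just described can be sketched as follows (pure Python; the 0.6 threshold is the empirical value from the example above, and the function names are illustrative):

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

def candidate_emotion_words(corpus_vectors, seed_vectors, threshold=0.6):
    """A corpus word becomes a candidate when its cosine similarity to
    some seed emotion word reaches the threshold."""
    candidates = set()
    for word, v in corpus_vectors.items():
        if any(cosine(v, s) >= threshold for s in seed_vectors.values()):
            candidates.add(word)
    return candidates
```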
In order to perform emotion classification on the candidate emotion words in the candidate emotion word set, the information processing apparatus according to a further embodiment of the present disclosure may further include classification units 620, 630 and 640 as shown in Fig. 6.

The classification unit 620 may train the character vectors of the seed emotion words, combine the word vector and the character vectors of each seed emotion word as classifier features, and perform emotion classification on the candidate emotion words in the candidate emotion word set.

For example, assume that the word vector and the character vectors of each seed emotion word are 300-dimensional. For a two-character seed emotion word meaning "happiness", its word vector is [x1, x2, ..., x300], and the character vectors of its two characters are [cx1, cx2, ..., cx300] and [cy1, cy2, ..., cy300] respectively; the classifier feature can then be expressed as shown in Fig. 6. Next, the classification unit 620 may perform emotion classification on the emotion words in the candidate emotion word set, and the classification result is denoted as category 1.
The classification unit 630 may form a plurality of two-dimensional matrices from the word vectors of the seed emotion words in each emotion category, one matrix per category. Next, the classification unit 630 may compute the center of each matrix, for example using a Gaussian mixture model (GMM). Then, the classification unit 630 may compute the probability of a candidate emotion word at each matrix center, and take the category with the maximum probability as the category of the candidate emotion word, denoted as category 2.
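A simplified sketch of this center-based classification: the text names a Gaussian mixture model for computing the matrix centers, while the sketch below summarizes each category by a single mean vector and assumes an isotropic Gaussian per center, so the maximum-probability category reduces to the nearest center (a deliberate simplification of the GMM step):

```python
def category_centers(seed_vectors_by_category):
    """Mean word vector of each emotion category's seed words
    (stand-in for the per-category GMM centers)."""
    centers = {}
    for cat, vecs in seed_vectors_by_category.items():
        k = len(vecs[0])
        centers[cat] = [sum(v[j] for v in vecs) / len(vecs) for j in range(k)]
    return centers

def most_probable_category(candidate_vector, centers):
    """Under an isotropic Gaussian at each center, the maximum-probability
    category is the one whose center is nearest to the candidate."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centers, key=lambda c: sqdist(candidate_vector, centers[c]))
```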
For sentences in the text corpus that include a seed emotion word or a candidate emotion word, the classification unit 640 may extract the words before and after the seed emotion word to form a seed triple with it, or extract the words before and after the candidate emotion word to form a candidate triple with it. Then, the classification unit 640 may perform emotion classification on the candidate triples based on the seed triples, and take the category of a candidate triple as the category of the emotion word, denoted as category 3.

Specifically, the large-scale text corpus is first split into sentences according to punctuation marks. The sentences carrying a seed emotion word or a candidate emotion word are then selected, and corpus in which a negation word appears before or after the seed emotion word is removed.
Then, for corpus including a seed emotion word, the words before and after the seed emotion word are extracted to form a triple with that seed emotion word. Assume the seed emotion word is Ei and the words before and after it are Ei-1 and Ei+1 respectively; the triple is then T(Ei-1, Ei, Ei+1). If Ei is at the beginning or end of the sentence, its previous or next word is replaced by </s>, so that the resulting triple is, for example, T(really, happy, </s>). For a triple T(Ei-1, Ei, Ei+1), the emotion category Ci of the seed emotion word Ei is known, so the emotion category of the triple T(Ei-1, Ei, Ei+1) is labeled Ci.

Next, the classification unit 640 may form a 3 × k-dimensional feature from the word vectors of the three words in a triple T(Ei-1, Ei, Ei+1), perform emotion classification on the triple of a candidate emotion word, for example with an SVM classifier, and take the emotion category of that triple as the category of the candidate emotion word.
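The triple construction for classification unit 640 can be sketched as follows (the </s> padding follows the sentence-boundary convention described above; the function name is illustrative):

```python
def emotion_triples(words, emotion_word):
    """Form (previous word, emotion word, next word) triples for every
    occurrence of the emotion word; sentence boundaries become </s>."""
    triples = []
    for i, w in enumerate(words):
        if w == emotion_word:
            prev = words[i - 1] if i > 0 else "</s>"
            nxt = words[i + 1] if i < len(words) - 1 else "</s>"
            triples.append((prev, w, nxt))
    return triples
```

The word vectors of the three words in each triple are then spliced into the 3 × k feature on which the triple classifier is trained.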
Fig. 6 shows a part of the information processing apparatus according to an embodiment of the present disclosure, including the three classification units 620, 630 and 640. However, the present disclosure is not necessarily limited thereto; the present disclosure may include only one or two of the three classification units 620, 630 and 640.
After the three different classification units 620, 630 and 640 described above have performed emotion word classification, in order to classify further and more accurately, an information processing apparatus 700 according to a further embodiment of the present disclosure is provided. Fig. 7 shows the information processing apparatus 700 according to a further embodiment of the present disclosure. Except for the category determination unit 650 and the feature dimensionality reduction unit 750, the components of the information processing apparatus 700 shown in Fig. 7 are identical to those of the information processing apparatus 300 shown in Fig. 3, and are not described again here.

As shown in Fig. 7, in addition to the corpus acquiring unit 110, the word vector training unit 120, the word vector dimensionality reduction unit 130, the normalization unit 140, the classifier training unit 350 and the corpus classification unit 360, the information processing apparatus 700 may further include a category determination unit 650 and a feature dimensionality reduction unit 750.
Specifically, when at least two of the classification results of a candidate emotion word in the classification unit 620, the classification unit 630 and the classification unit 640 are identical, the category determination unit 650 may determine the emotion category of that candidate emotion word and add it to the seed emotion dictionary, so as to obtain an emotion word set.

For example, if the candidate emotion word "dazzling" is labeled "joy", "fear" and "joy" in category 1, category 2 and category 3 respectively, the category determination unit 650 determines that the emotion category of "dazzling" is "joy" and adds "dazzling" to the seed emotion dictionary.
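The two-out-of-three decision rule of the category determination unit 650 can be sketched as:

```python
from collections import Counter

def decide_category(label1, label2, label3):
    """Return the emotion category when at least two of the three
    classification results agree; otherwise return None, meaning the
    word goes to the unclassified emotion word set."""
    label, count = Counter([label1, label2, label3]).most_common(1)[0]
    return label if count >= 2 else None
```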
Then, the classifier training unit 350 may also use the emotion word set together with the word vector features as classifier features. Those skilled in the art will appreciate that the classifier training unit 350 may train the model in the manner described above, which is not repeated here.
In addition, as shown in Fig. 7, the information processing apparatus according to a further embodiment of the present disclosure further includes the feature dimensionality reduction unit 750, which may reduce the dimensionality of the N-gram features of the text corpus, where N is 1 or 2.

Specifically, the feature dimensionality reduction unit 750 may extract the N-grams of each text, where N = 1 or N = 2. Then, the feature dimensionality reduction unit 750 may select the N-grams whose feature weight exceeds a certain threshold. Likewise, the threshold can be determined empirically. For example, for an N-gram (denoted t) in a certain emotion category c, its weight is calculated as follows:
The concrete meanings of A, B, C and D are as follows:

|                    | belongs to class c | does not belong to class c | total   |
| contains t         | A                  | B                          | A+B     |
| does not contain t | C                  | D                          | C+D     |
| total              | A+C                | B+D                        | A+B+C+D |
Here A denotes the number of sentences in the text corpus that belong to class c and contain t; B denotes the number of sentences that do not belong to class c but contain t; C denotes the number of sentences that belong to class c but do not contain t; and D denotes the number of sentences that neither belong to class c nor contain t.
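The weight formula itself is not reproduced in this text, but the A/B/C/D contingency table above is exactly the input of the standard chi-square feature-selection statistic, which is used here as an assumption:

```python
def ngram_weight(a, b, c, d):
    """Chi-square-style weight of an N-gram t for a category, computed
    from the contingency counts defined above (assumed formula; the
    original equation is not shown in this text)."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom
```

N-grams whose weight exceeds the empirical threshold are then kept as features.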
Next, the classifier training unit 350 may also use the reduced N-gram features together with the word vector features as classifier features. Alternatively, the classifier training unit 350 may use the reduced N-gram features, the word vector features and the emotion word set together as classifier features. Likewise, those skilled in the art will appreciate that the classifier training unit 350 may train the model in the manner described above, which is not repeated here.
The information processing apparatus 700 according to an embodiment of the present disclosure effectively reduces and normalizes word vectors, and automatically builds and updates the emotion dictionary using seed emotion words. Furthermore, the information processing apparatus 700 according to an embodiment of the present disclosure further improves the accuracy of emotion word classification based on the different emotion classification and discrimination mechanisms.
On the other hand, when the classification results of a candidate emotion word in the classification unit 620, the classification unit 630 and the classification unit 640 all differ, the category determination unit 650 may further add the candidate emotion word to an unclassified emotion word set.
Next, the information processing apparatus according to a further embodiment of the present disclosure may further include an iterative classification unit, which may cause the classification units 620, 630 and 640 to repeat the emotion classification for the candidate emotion words in the unclassified emotion word set, until the candidate emotion words in the unclassified emotion word set have been assigned an emotion category or a predetermined number of iterations is reached.
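The iteration over the unclassified set can be sketched as follows, with `classify_fn` standing in for one combined pass of the three classification units (returning None when they still disagree; the function names are illustrative):

```python
def iterative_classification(unclassified, classify_fn, max_iterations=5):
    """Repeat classification on the unclassified candidates until each
    receives a category or the iteration budget is exhausted."""
    remaining = list(unclassified)
    decided = {}
    for _ in range(max_iterations):
        still_unclassified = []
        for word in remaining:
            category = classify_fn(word)
            if category is None:
                still_unclassified.append(word)
            else:
                decided[word] = category
        remaining = still_unclassified
        if not remaining:
            break
    return decided, remaining
```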
An information processing method according to an embodiment of the present disclosure is described below with reference to Fig. 8. As shown in Fig. 8, the information processing method according to an embodiment of the present disclosure starts at step S810. In step S810, a text corpus is obtained from the internet, the text corpus including a training corpus and an unlabeled corpus.

Next, in step S820, word vectors are trained on the training corpus, the word vector of each word having k dimensions.

Next, in step S830, the dimensionality of the matrix formed by the word vectors of all the words of each sentence in the training corpus is reduced.

Finally, in step S840, the reduced matrix is normalized to obtain normalized word vector features.
According to an embodiment of the present disclosure, reducing the dimensionality of the matrix formed by the word vectors of all the words of each sentence in the training corpus may further include the steps of: extracting the n-grams of each sentence in the training corpus, where n is 2, 3 or 4; and splicing the word vectors of the words in each n-gram to obtain the reduced matrix of each sentence.

According to an embodiment of the present disclosure, normalizing the reduced matrix may further include the step of computing the average value of each column of the reduced matrix to obtain normalized word vector features.
According to an embodiment of the present disclosure, after the reduced matrix is normalized, the method may further include the steps of: training a classifier model using the normalized word vector features as classifier features; and classifying the unlabeled corpus based on the trained classifier model.

According to an embodiment of the present disclosure, the method may further include the steps of: obtaining a seed emotion dictionary, in which each seed emotion word is classified into one of a plurality of different emotion categories; training the word vectors of the seed emotion words; and determining a candidate emotion word set based on the cosine distance between the word vectors of the training corpus and the word vectors of the seed emotion words.
According to an embodiment of the present disclosure, after the candidate emotion word set is determined, the method may further include the steps of: training the character vectors of the seed emotion words; and combining the word vector and the character vectors of each seed emotion word as classifier features to perform first emotion classification on the candidate emotion words in the candidate emotion word set.

According to an embodiment of the present disclosure, after the first emotion classification is performed on the candidate emotion words in the candidate emotion word set, the method may further include the steps of: forming a plurality of two-dimensional matrices from the word vectors of the seed emotion words in each emotion category; computing the centers of the plurality of two-dimensional matrices; computing the probability of a candidate emotion word at each center; and performing second emotion classification on the candidate emotion word based on the probabilities.
According to an embodiment of the present disclosure, after the second emotion classification is performed on the candidate emotion word based on the probabilities, the method may further include the steps of: for sentences in the training corpus containing a seed emotion word or a candidate emotion word, extracting the words before and after the seed emotion word to form a seed triple together with the seed emotion word, or extracting the words before and after the candidate emotion word to form a candidate triple together with the candidate emotion word; and classifying the candidate triple based on the seed triples and taking the category of the candidate triple as the third emotion classification of the candidate emotion word.
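Extraction of the context triples can be sketched as below. This is an illustrative reading of the step, not the patent's code; padding sentence boundaries with an empty string is an assumption.

```python
def extract_triples(sentence_tokens, target_words):
    """For each occurrence of a target (seed or candidate) emotion word,
    return the (previous word, emotion word, next word) context triple.
    Sentence boundaries are padded with an empty-string placeholder."""
    triples = []
    for i, tok in enumerate(sentence_tokens):
        if tok in target_words:
            prev_w = sentence_tokens[i - 1] if i > 0 else ""
            next_w = sentence_tokens[i + 1] if i + 1 < len(sentence_tokens) else ""
            triples.append((prev_w, tok, next_w))
    return triples
```

The seed triples collected this way serve as labeled examples against which the candidate triples are classified.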
According to an embodiment of the present disclosure, after the candidate triple is classified based on the seed triples and its category is taken as the third emotion classification of the candidate emotion word, the method may further include the steps of: when at least two of the results of the first emotion classification, the second emotion classification, and the third emotion classification of a candidate emotion word in the candidate emotion word set are identical, determining the emotion category of that candidate emotion word and adding it to the seed emotion dictionary to obtain an emotion word set; and using the emotion word set together with the word vector features as the classifier features.
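The agreement rule over the three classification results amounts to a majority vote, which can be sketched as follows (the function name and the use of `None` for the no-majority case are illustrative assumptions):

```python
from collections import Counter

def vote_category(first, second, third):
    """Majority vote over the three emotion classification results: if at
    least two agree, return that category; otherwise return None so the
    word can be sent to the unclassified set for another iteration."""
    label, count = Counter([first, second, third]).most_common(1)[0]
    return label if count >= 2 else None
```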
According to an embodiment of the present disclosure, the method may further include the steps of: performing dimension reduction on N-gram features of the text corpus, where N is 1 or 2; and using the dimension-reduced N-gram features together with the word vector features, or using the dimension-reduced N-gram features, the word vector features, and the emotion word set together, as the classifier features.
According to an embodiment of the present disclosure, when the results of the first emotion classification, the second emotion classification, and the third emotion classification of a candidate emotion word in the candidate emotion word set are all different from one another, that candidate emotion word may be added to an unclassified emotion word set.
According to an embodiment of the present disclosure, the method may further include the step of repeating the first emotion classification, the second emotion classification, and the third emotion classification for the candidate emotion words in the unclassified emotion word set until those candidate emotion words have been assigned emotion categories or a predetermined number of iterations is reached.
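The iteration over the unclassified set can be sketched as a simple loop. The `classify_once` callback stands in for one full pass of the three classifications followed by the vote; its interface is an assumption for illustration.

```python
def iterative_classification(unclassified, classify_once, max_iters=10):
    """Repeat the three-stage classification over the unclassified set
    until every word receives a category or the iteration budget runs out.

    classify_once(word) -> category string, or None when no majority.
    """
    assigned = {}
    for _ in range(max_iters):
        if not unclassified:
            break
        still_open = set()
        for w in unclassified:
            cat = classify_once(w)
            if cat is None:
                still_open.add(w)  # no agreement yet; retry next round
            else:
                assigned[w] = cat
        unclassified = still_open
    return assigned, unclassified
```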
According to an embodiment of the present disclosure, emotion words extracted from an emotion ontology library and the text versions of emoticons may be used as the seed emotion words.
According to an embodiment of the present disclosure, each seed emotion word in the seed emotion dictionary is classified into one of a plurality of different emotion categories based on the emotional tendency of that seed emotion word.
According to an embodiment of the present disclosure, the text corpus comes from microblogs on the Internet.
The various embodiments of the above steps of the information processing method according to the embodiments of the present disclosure have been described in detail above and are not repeated here.
Obviously, each operational process of the information processing method according to the present disclosure can be implemented in the form of computer-executable programs stored in various machine-readable storage media.
Moreover, the object of the present disclosure can also be achieved in the following manner: a storage medium storing the above executable program code is supplied, directly or indirectly, to a system or device, and a computer or a central processing unit (CPU) in that system or device reads out and executes the program code. As long as the system or device has the function of executing programs, embodiments of the present disclosure are not limited to a particular kind of program; the program may take any form, for example, an object program, a program executed by an interpreter, or a script supplied to an operating system.
The machine-readable storage media mentioned above include, but are not limited to: various memories and storage units; semiconductor devices; disk units such as optical, magnetic, and magneto-optical disks; and other media suitable for storing information.
In addition, the technical solution of the present disclosure can also be realized by a computer connecting to a corresponding website on the Internet, downloading and installing the computer program code according to the present disclosure into the computer, and then executing the program.
Fig. 9 is a block diagram of an exemplary configuration of a general-purpose personal computer in which the information processing device and method according to the embodiments of the present disclosure can be implemented.
As shown in Fig. 9, a CPU 1301 performs various processes according to a program stored in a read-only memory (ROM) 1302 or a program loaded from a storage section 1308 into a random access memory (RAM) 1303. Data required when the CPU 1301 performs the various processes is also stored in the RAM 1303 as needed. The CPU 1301, the ROM 1302, and the RAM 1303 are connected to one another via a bus 1304. An input/output interface 1305 is also connected to the bus 1304.
The following components are connected to the input/output interface 1305: an input section 1306 (including a keyboard, a mouse, and the like); an output section 1307 (including a display, such as a cathode-ray tube (CRT) or a liquid crystal display (LCD), and a speaker); a storage section 1308 (including a hard disk and the like); and a communication section 1309 (including a network interface card such as a LAN card, a modem, and the like). The communication section 1309 performs communication processes via a network such as the Internet. A drive 1310 may also be connected to the input/output interface 1305 as needed. A removable medium 1311, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1310 as needed, so that a computer program read out therefrom is installed into the storage section 1308 as needed.
In the case where the above series of processes is realized by software, a program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1311.
Those skilled in the art will understand that such a storage medium is not limited to the removable medium 1311 shown in Fig. 9, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 1311 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 1302, a hard disk included in the storage section 1308, or the like, in which the program is stored and which is distributed to the user together with the device containing it.
In the system and method for the disclosure, it is clear that each component or each step can be decomposed and/or reconfigured.
These decompose and/or reconfigure the equivalents that should be regarded as the disclosure.Also, the step of performing above-mentioned series of processes can be certainly
So perform, but and need not be necessarily performed sequentially in time in chronological order according to the order of explanation.Some steps can
To perform parallel or independently of one another.
Although the embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, it should be understood that the embodiments described above are merely intended to illustrate the present disclosure and do not constitute a limitation thereon. Those skilled in the art can make various changes and modifications to the above embodiments without departing from the spirit and scope of the present disclosure. Therefore, the scope of the present disclosure is defined only by the appended claims and their equivalents.
Regarding the embodiments including the above examples, the following notes are also disclosed:
Note 1. An information processing device, including:
a corpus acquiring unit that acquires a text corpus from the Internet, wherein the text corpus includes a training corpus and an unlabeled corpus;
a word vector training unit that trains word vectors for the training corpus, wherein the word vector of each word is k-dimensional;
a word vector dimension reduction unit that performs dimension reduction on the matrix formed by the word vectors of all the words of each sentence in the training corpus; and
a normalization unit that normalizes the dimension-reduced matrix to obtain normalized word vector features.
Note 2. The device according to Note 1, wherein the word vector dimension reduction unit further includes:
an extracting unit that extracts the n-grams of each sentence in the training corpus, where n is 2, 3, or 4; and
a concatenating unit that concatenates the word vectors of the words in each n-gram to obtain the dimension-reduced matrix of each sentence.
Note 3. The device according to Note 2, wherein the normalization unit further includes:
a normalization calculating unit that calculates the average value of each column of the dimension-reduced matrix to obtain the normalized word vector features.
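The n-gram concatenation of Note 2 followed by the column-wise averaging of Note 3 can be sketched together as below. This is an illustrative reading, not the patent's code; it assumes each sentence has at least n words, and the function name is hypothetical.

```python
import numpy as np

def ngram_feature(sentence_vectors, n=2):
    """Concatenate the word vectors inside each n-gram of a sentence into
    the rows of a matrix, then take the column-wise mean. The result is a
    fixed-length (n * k) feature regardless of sentence length."""
    rows = [np.concatenate(sentence_vectors[i:i + n])  # one row per n-gram
            for i in range(len(sentence_vectors) - n + 1)]
    return np.stack(rows).mean(axis=0)  # column-wise average
```

Averaging over the rows is what makes sentences of different lengths comparable as classifier inputs.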
Note 4. The device according to Note 1, 2, or 3, further including:
a classifier training unit that trains a classifier model using the normalized word vector features as classifier features; and
a corpus classifying unit that classifies the unlabeled corpus based on the trained classifier model.
Note 5. The device according to Note 4, further including a candidate word set determination unit configured to:
obtain a seed emotion dictionary, in which each seed emotion word is classified into one of a plurality of different emotion categories;
train the word vectors of the seed emotion words; and
determine a candidate emotion word set based on the cosine distances between the word vectors of the training corpus and the word vectors of the seed emotion words.
Note 6. The device according to Note 5, further including a first classification unit configured to:
train the character vectors of the seed emotion words; and
combine the word vectors and the character vectors of the seed emotion words as classifier features to perform a first emotion classification on the candidate emotion words in the candidate emotion word set.
Note 7. The device according to Note 6, further including a second classification unit configured to:
form a plurality of two-dimensional matrices from the word vectors of the seed emotion words in each emotion category;
calculate the center of each of the plurality of two-dimensional matrices;
calculate the probability of a candidate emotion word at each center; and
perform a second emotion classification on the candidate emotion word based on the probabilities.
Note 8. The device according to Note 7, further including a third classification unit configured to:
for sentences in the training corpus containing a seed emotion word or a candidate emotion word, extract the words before and after the seed emotion word to form a seed triple together with the seed emotion word, or extract the words before and after the candidate emotion word to form a candidate triple together with the candidate emotion word; and
classify the candidate triple based on the seed triples and take the category of the candidate triple as the third emotion classification of the candidate emotion word.
Note 9. The device according to Note 8, further including a category determination unit, wherein when at least two of the results of the first emotion classification, the second emotion classification, and the third emotion classification of a candidate emotion word in the candidate emotion word set are identical, the category determination unit determines the emotion category of that candidate emotion word and adds it to the seed emotion dictionary to obtain an emotion word set, and
the classifier training unit further uses the emotion word set together with the word vector features as the classifier features.
Note 10. The device according to Note 4 or 9, further including:
a classification dimension reduction unit that performs dimension reduction on N-gram features of the text corpus, where N is 1 or 2, wherein
the classifier training unit further uses the dimension-reduced N-gram features together with the word vector features, or uses the dimension-reduced N-gram features, the word vector features, and the emotion word set together, as the classifier features.
Note 11. The device according to Note 9, wherein when the results of the first emotion classification, the second emotion classification, and the third emotion classification of a candidate emotion word in the candidate emotion word set are all different from one another, the category determination unit further adds that candidate emotion word to an unclassified emotion word set.
Note 12. The device according to Note 11, further including an iterative classification unit that repeats the first emotion classification, the second emotion classification, and the third emotion classification for the candidate emotion words in the unclassified emotion word set until those candidate emotion words have been assigned emotion categories or a predetermined number of iterations is reached.
Note 13. The device according to Note 5, wherein the candidate word set determination unit uses emotion words extracted from an emotion ontology library and the text versions of emoticons as the seed emotion words.
Note 14. The device according to Note 5, wherein each seed emotion word in the seed emotion dictionary is classified into one of a plurality of different emotion categories based on the emotional tendency of that seed emotion word.
Note 15. An information processing method, including:
acquiring a text corpus from the Internet, wherein the text corpus includes a training corpus and an unlabeled corpus;
training word vectors for the training corpus, wherein the word vector of each word is k-dimensional;
performing dimension reduction on the matrix formed by the word vectors of all the words of each sentence in the training corpus; and
normalizing the dimension-reduced matrix to obtain normalized word vector features.
Note 16. The method according to Note 15, wherein performing dimension reduction on the matrix formed by the word vectors of all the words of each sentence in the training corpus further includes:
extracting the n-grams of each sentence in the training corpus, where n is 2, 3, or 4; and
concatenating the word vectors of the words in each n-gram to obtain the dimension-reduced matrix of each sentence.
Note 17. The method according to Note 16, wherein normalizing the dimension-reduced matrix further includes:
calculating the average value of each column of the dimension-reduced matrix to obtain the normalized word vector features.
Note 18. The method according to Note 15, 16, or 17, further including:
training a classifier model using the normalized word vector features as classifier features; and
classifying the unlabeled corpus based on the trained classifier model.
Note 19. The method according to Note 18, further including:
obtaining a seed emotion dictionary, in which each seed emotion word is classified into one of a plurality of different emotion categories;
training the word vectors of the seed emotion words; and
determining a candidate emotion word set based on the cosine distances between the word vectors of the training corpus and the word vectors of the seed emotion words.
Note 20. A program product including machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, causes the computer to perform the method according to any one of Notes 15 to 19.
Claims (10)
1. An information processing device, including:
a corpus acquiring unit that acquires a text corpus from the Internet, wherein the text corpus includes a training corpus and an unlabeled corpus;
a word vector training unit that trains word vectors for the training corpus, wherein the word vector of each word is k-dimensional;
a word vector dimension reduction unit that performs dimension reduction on the matrix formed by the word vectors of all the words of each sentence in the training corpus; and
a normalization unit that normalizes the dimension-reduced matrix to obtain normalized word vector features.
2. The device according to claim 1, wherein the word vector dimension reduction unit further includes:
an extracting unit that extracts the n-grams of each sentence in the training corpus, where n is 2, 3, or 4; and
a concatenating unit that concatenates the word vectors of the words in each n-gram to obtain the dimension-reduced matrix of each sentence.
3. The device according to claim 1 or 2, further including:
a classifier training unit that trains a classifier model using the normalized word vector features as classifier features; and
a corpus classifying unit that classifies the unlabeled corpus based on the trained classifier model.
4. The device according to claim 3, further including a candidate word set determination unit configured to:
obtain a seed emotion dictionary, in which each seed emotion word is classified into one of a plurality of different emotion categories;
train the word vectors of the seed emotion words; and
determine a candidate emotion word set based on the cosine distances between the word vectors of the training corpus and the word vectors of the seed emotion words.
5. The device according to claim 4, further including a first classification unit configured to:
train the character vectors of the seed emotion words; and
combine the word vectors and the character vectors of the seed emotion words as classifier features to perform a first emotion classification on the candidate emotion words in the candidate emotion word set.
6. The device according to claim 5, further including a second classification unit configured to:
form a plurality of two-dimensional matrices from the word vectors of the seed emotion words in each emotion category;
calculate the center of each of the plurality of two-dimensional matrices;
calculate the probability of a candidate emotion word at each center; and
perform a second emotion classification on the candidate emotion word based on the probabilities.
7. The device according to claim 6, further including a third classification unit configured to:
for sentences in the training corpus containing a seed emotion word or a candidate emotion word, extract the words before and after the seed emotion word to form a seed triple together with the seed emotion word, or extract the words before and after the candidate emotion word to form a candidate triple together with the candidate emotion word; and
classify the candidate triple based on the seed triples and take the category of the candidate triple as the third emotion classification of the candidate emotion word.
8. The device according to claim 7, further including a category determination unit, wherein when at least two of the results of the first emotion classification, the second emotion classification, and the third emotion classification of a candidate emotion word in the candidate emotion word set are identical, the category determination unit determines the emotion category of that candidate emotion word and adds it to the seed emotion dictionary to obtain an emotion word set, and
the classifier training unit further uses the emotion word set together with the word vector features as the classifier features.
9. The device according to claim 3 or 8, further including:
a feature dimension reduction unit that performs dimension reduction on N-gram features of the text corpus, where N is 1 or 2, wherein
the classifier training unit further uses the dimension-reduced N-gram features together with the word vector features, or uses the dimension-reduced N-gram features, the word vector features, and the emotion word set together, as the classifier features.
10. An information processing method, including:
acquiring a text corpus from the Internet, wherein the text corpus includes a training corpus and an unlabeled corpus;
training word vectors for the training corpus, wherein the word vector of each word is k-dimensional;
performing dimension reduction on the matrix formed by the word vectors of all the words of each sentence in the training corpus; and
normalizing the dimension-reduced matrix to obtain normalized word vector features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610921729.1A CN107977352A (en) | 2016-10-21 | 2016-10-21 | Information processor and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107977352A true CN107977352A (en) | 2018-05-01 |
Family
ID=62004764
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610921729.1A Pending CN107977352A (en) | 2016-10-21 | 2016-10-21 | Information processor and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107977352A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100332287A1 (en) * | 2009-06-24 | 2010-12-30 | International Business Machines Corporation | System and method for real-time prediction of customer satisfaction |
CN103678318A (en) * | 2012-08-31 | 2014-03-26 | 富士通株式会社 | Multi-word unit extraction method and equipment and artificial neural network training method and equipment |
CN103885933A (en) * | 2012-12-21 | 2014-06-25 | 富士通株式会社 | Method and equipment for evaluating text sentiment |
CN103927529A (en) * | 2014-05-05 | 2014-07-16 | 苏州大学 | Acquiring method, application method and application system of final classifier |
CN105930411A (en) * | 2016-04-18 | 2016-09-07 | 苏州大学 | Classifier training method, classifier and sentiment classification system |
US9460076B1 (en) * | 2014-11-18 | 2016-10-04 | Lexalytics, Inc. | Method for unsupervised learning of grammatical parsers |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108804423A (en) * | 2018-05-30 | 2018-11-13 | 平安医疗健康管理股份有限公司 | Medical Text character extraction and automatic matching method and system |
CN108804423B (en) * | 2018-05-30 | 2023-09-08 | 深圳平安医疗健康科技服务有限公司 | Medical text feature extraction and automatic matching method and system |
CN109933793A (en) * | 2019-03-15 | 2019-06-25 | 腾讯科技(深圳)有限公司 | Text polarity identification method, apparatus, equipment and readable storage medium storing program for executing |
CN109933793B (en) * | 2019-03-15 | 2023-01-06 | 腾讯科技(深圳)有限公司 | Text polarity identification method, device and equipment and readable storage medium |
CN111696674A (en) * | 2020-06-12 | 2020-09-22 | 电子科技大学 | Deep learning method and system for electronic medical record |
CN111696674B (en) * | 2020-06-12 | 2023-09-08 | 电子科技大学 | Deep learning method and system for electronic medical records |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
2018-05-01 | WD01 | Invention patent application deemed withdrawn after publication | |