CN104572613A - Data processing device, data processing method and program - Google Patents


Info

Publication number
CN104572613A
CN104572613A (application CN201310495278.6A)
Authority
CN
China
Prior art keywords
text, probability, theme, emotion, data processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310495278.6A
Other languages
Chinese (zh)
Inventor
孙健
夏迎炬
王云芝
李中华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Fujitsu Ltd
Priority to CN201310495278.6A
Publication of CN104572613A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data processing device for judging whether a text posted by a user in a social service network is a question. The data processing device comprises a topic feature acquisition unit, an emotion feature acquisition unit, a question mark feature extraction unit and a classifier. The topic feature acquisition unit is configured to acquire topic features of the text according to a pre-trained topic model; the emotion feature acquisition unit is configured to acquire emotion features of the text according to a pre-trained emotion model; the question mark feature extraction unit is configured to acquire question mark features of the text; and the classifier is configured to classify the text according to the topic features, the emotion features and the question mark features.

Description

Data processing device, data processing method and program
Technical field
The present disclosure relates to the field of data processing, and in particular to a data processing device, a data processing method and a program for judging whether a text posted by a user in a social service network is a question. The disclosure further relates to a method of training the topic model used in the above data processing device, data processing method or program, and to a method of training the emotion model used therein.
Background art
In social service networks, such as microblogs, Facebook and the like, users often post viewpoints, comments, evaluations and so on concerning certain topics. For example, a user may post views about physical health or express emotions. A method of identifying which of these posts are questions is therefore needed.
Summary of the invention
A brief summary of the invention is given below in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to delimit its scope. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that follows.
In view of the need described in the Background section, the present invention is concerned with apparatus and methods for identifying whether a text posted by a user in a social service network is a question. In particular, the invention proposes a data processing device and method that use pre-trained models to obtain relevant features of a text and judge, based on those features, whether the text is a question.
According to one aspect of the invention, there is provided a data processing device for judging whether a text posted by a user in a social service network is a question, comprising: a topic feature acquisition unit configured to obtain topic features of the text using a pre-trained topic model; an emotion feature acquisition unit configured to obtain emotion features of the text using a pre-trained emotion model; a question mark feature extraction unit configured to obtain question mark features of the text; and a classifier configured to classify the text using the topic features, emotion features and question mark features.
According to another aspect of the invention, there is provided a data processing method for judging whether a text posted by a user in a social service network is a question, comprising: obtaining topic features of the text using a pre-trained topic model; obtaining emotion features of the text using a pre-trained emotion model; obtaining question mark features of the text; and classifying the text with a classifier using the topic features, emotion features and question mark features.
According to a further aspect of the invention, there is provided a method of training a topic model for judging whether a text in a social service network is a question, comprising: preparing an expert knowledge corpus; performing word segmentation on each text in the corpus; extracting one or more content words of the text as keywords reflecting its topics; and computing, as the topic model, at least part of the following: the probabilities of texts, keywords and topics, and the probabilities, joint probabilities or conditional probabilities of combinations thereof.
According to yet another aspect of the invention, there is provided a method of training an emotion model for judging whether a text in a social service network is a question, comprising: preparing a data set annotated as to whether each text is a question; performing word segmentation on each text in the data set; extracting one or more non-nouns and/or symbols of the text as emotion words and/or symbols reflecting its sentiment orientation; and computing, as the emotion model, at least part of the following: the probabilities of texts, emotion words and/or symbols and sentiment orientations, and the probabilities, joint probabilities or conditional probabilities of combinations thereof.
According to other aspects of the invention, corresponding computer program code, computer-readable storage media and computer program products are also provided.
These and other advantages of the invention will become more apparent from the following detailed description of embodiments of the invention taken in conjunction with the accompanying drawings.
Accompanying drawing explanation
To further illustrate the above and other advantages and features of the application, embodiments of the application are described in more detail below with reference to the accompanying drawings, which are incorporated in this specification together with the following detailed description and form a part thereof. Elements having the same function and structure are denoted by the same reference numerals. It should be understood that these drawings merely depict typical examples of the application and should not be regarded as limiting its scope. In the drawings:
Fig. 1 shows a structural block diagram of a data processing device according to an embodiment of the application;
Fig. 2 shows a structural block diagram of the topic feature acquisition unit in the data processing device according to an embodiment of the application;
Fig. 3 shows a schematic diagram of the generative process of the topic model according to an embodiment of the application;
Fig. 4 shows a structural block diagram of the emotion feature acquisition unit in the data processing device according to an embodiment of the application;
Fig. 5 shows a schematic diagram of the generative process of the emotion model according to an embodiment of the application;
Fig. 6 shows a flowchart of the data processing method according to an embodiment of the application;
Fig. 7 shows a flowchart of the topic feature obtaining step in the data processing method according to an embodiment of the application;
Fig. 8 shows a flowchart of the emotion feature obtaining step in the data processing method according to an embodiment of the application;
Fig. 9 shows a flowchart of the topic model training method according to an embodiment of the application;
Fig. 10 shows a flowchart of the emotion model training method according to an embodiment of the application; and
Fig. 11 is a block diagram of an exemplary configuration of a general-purpose personal computer in which the methods and/or devices according to embodiments of the invention may be implemented.
Embodiment
Exemplary embodiments of the invention are described below with reference to the accompanying drawings. For the sake of clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such actual implementation, many implementation-specific decisions must be made in order to achieve the developer's specific goals, for example compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. Moreover, it should be appreciated that, although such a development effort might be complex and time-consuming, it would nevertheless be a routine undertaking for those skilled in the art having the benefit of this disclosure.
It should further be noted that, in order not to obscure the invention with unnecessary detail, the drawings show only those device structures and/or processing steps closely related to the solution of the invention, and other details of little relevance to the invention are omitted.
The description below proceeds in the following order:
1. Data processing device
2. Data processing method
3. Topic model training method
4. Emotion model training method
5. Computing device for implementing the apparatus and methods of the application
[1. Data processing device]
First, the structure of a data processing device 100 according to an embodiment of the application is described with reference to Fig. 1. As shown in Fig. 1, the data processing device 100 comprises: a topic feature acquisition unit 101 configured to obtain topic features of a text using a pre-trained topic model; an emotion feature acquisition unit 102 configured to obtain emotion features of the text using a pre-trained emotion model; a question mark feature extraction unit 103 configured to obtain question mark features of the text; and a classifier 104 configured to classify the text using the topic features, emotion features and question mark features.
Specifically, when the data processing device 100 is used to judge whether a text posted by a user is a question, the topic feature acquisition unit 101, the emotion feature acquisition unit 102 and the question mark feature extraction unit 103 obtain the topic features, emotion features and question mark features of the text, respectively, and the classifier then uses the obtained features to classify the text, that is, to judge whether the text is a question.
Here, the topic features represent one or more topics involved in the text, the emotion features represent the sentiment orientation of the poster as reflected by the text, and the question mark features refer to words or symbols in the text that indicate a question. How these features are obtained is described in detail in the following.
The structure and function of the topic feature acquisition unit 101 and of the emotion feature acquisition unit 102 are first described in detail with reference to Figs. 2 to 5.
< Topic feature acquisition unit >
As shown in Fig. 2, the topic feature acquisition unit 101 comprises: a word segmentation module 1001 configured to perform word segmentation on the text; a keyword extraction module 1002 configured to extract one or more content words of the text as keywords reflecting its topics; and a topic feature computation module 1003 configured to compute the topic features of the text based on the keywords using the topic model, wherein the topic model comprises at least part of the following: the probabilities of texts, keywords and topics, and the probabilities, joint probabilities or conditional probabilities of combinations thereof.
The word segmentation module 1001 may use any of various existing techniques to segment the text to be processed. From the segmented text, the keyword extraction module 1002 extracts content words such as nouns, verbs and adjectives as keywords reflecting the topics of the text; these keywords are then used to compute the topic features of the text.
Some texts contain portions called topic markers; for example, content placed between two '#' signs is treated as topic-marked. Accordingly, the keyword extraction module 1002 is also configured to extract at least part of the topic-marked content as keywords reflecting the topics of the text.
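As an illustrative sketch (not part of the patent text itself), extracting the content between two '#' signs might look like the following in Python; the function name and regular expression are assumptions made here for illustration.

```python
import re

def extract_topic_markers(text):
    """Return the content enclosed between pairs of '#' signs.

    Microblog posts often mark a topic as #topic#; each such span is
    treated as keyword material reflecting the topic of the text.
    """
    # [^#]+ keeps each match inside one pair of '#' signs, so
    # "#a# ... #b#" yields two topics rather than one long span.
    return re.findall(r"#([^#]+)#", text)
```

The extracted spans would then be segmented like the rest of the text before being used as keywords.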
As mentioned above, the topic feature computation module 1003 computes the topic features of the text based on the extracted keywords using a pre-trained topic model. The topic model is obtained by training on an expert knowledge corpus, that is, a collection of materials that carry a certain amount of knowledge and can therefore help users who have questions to solve their problems. Such materials may, for example, be knowledge microblogs posted by experts, these experts including knowledgeable persons in a certain field or the customer service representatives of certain brands or companies.
Each material (text) in the expert knowledge corpus is first segmented into words, and content words such as nouns, verbs and adjectives are then extracted as keywords. Since every text is written to express one or more topics, to solve a class of problems, or to provide some kind of technical support, these keywords reflect the topics of the text. In other words, a topic layer exists between the text layer and the keyword layer. This topic layer is not stated explicitly but is implicit, so a topic may be treated as a hidden variable.
In one embodiment, a generative model, for example a PLSA or LDA model, may be built over texts, topics and keywords. In a PLSA model, for instance, a text is selected with probability p(d), each text belongs to a topic t with probability p(t|d), and, given a topic t, each keyword w is generated with probability p(w|t), as shown in Fig. 3. By training this generative model on the corpus, at least part of the following can be obtained as the topic model: the probabilities of texts (d), keywords (w) and topics (t), and the probabilities, joint probabilities or conditional probabilities of combinations thereof.
How a topic model is obtained is illustrated below taking the PLSA model as an example. As stated above, each text in the corpus expresses one or more topics, and each topic is filled with keywords. Following the generative process shown in Fig. 3, the joint probability expression below is obtained:
p(w, t, d) = p(d) \sum_{k=1}^{N} p(w \mid t_k) \, p(t_k \mid d)    (1)

where N is the number of topics, t_k denotes the k-th topic, and the remaining variables have the meanings defined above. p(t|d) and p(w|t) are solved for by maximum likelihood using the EM (expectation maximization) algorithm. The objective function of the maximum likelihood estimation is given by formula (2):

L = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log p(d_i, w_j)    (2)

where n(d_i, w_j) denotes the number of occurrences of the j-th keyword w_j in the i-th text d_i, and N and M here denote the numbers of texts and keywords, respectively.
The E (expectation) step first computes:

p(t_k \mid d_i, w_j) = \frac{p(w_j \mid t_k) \, p(t_k) \, p(d_i \mid t_k)}{\sum_{k'} p(w_j \mid t_{k'}) \, p(t_{k'}) \, p(d_i \mid t_{k'})}    (3)

Then the M (maximization) step is performed:

p(w_j \mid t_k) \propto \sum_{i} n(d_i, w_j) \, p(t_k \mid d_i, w_j)
p(d_i \mid t_k) \propto \sum_{j} n(d_i, w_j) \, p(t_k \mid d_i, w_j)    (4)
p(t_k) \propto \sum_{i} \sum_{j} n(d_i, w_j) \, p(t_k \mid d_i, w_j)

By the above algorithm, the probabilities p(d), p(w|t_k), p(t_k|d) and so on can be obtained.
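The E and M steps above can be sketched in code. The following is a minimal, illustrative PLSA-style EM under the symmetric parameterization of formulas (3) and (4), operating on a text-by-keyword count matrix with random initialization; the function name, matrix representation and smoothing constant are assumptions made for illustration, not prescribed by the patent.

```python
import random

def train_plsa(counts, num_topics, iters=50, seed=0):
    """Fit p(t), p(d|t), p(w|t) by EM on a text-by-keyword count matrix.

    counts[i][j] = n(d_i, w_j).  Returns (p_t, p_d_t, p_w_t) where
    p_d_t[k][i] = p(d_i | t_k) and p_w_t[k][j] = p(w_j | t_k).
    """
    rng = random.Random(seed)
    D, W = len(counts), len(counts[0])

    def normalize(v):
        s = sum(v)
        return [x / s for x in v]

    # Random positive initialization of all three distributions.
    p_t = normalize([rng.random() + 0.1 for _ in range(num_topics)])
    p_d_t = [normalize([rng.random() + 0.1 for _ in range(D)]) for _ in range(num_topics)]
    p_w_t = [normalize([rng.random() + 0.1 for _ in range(W)]) for _ in range(num_topics)]

    for _ in range(iters):
        # E step: posterior p(t_k | d_i, w_j), formula (3).
        post = [[None] * W for _ in range(D)]
        for i in range(D):
            for j in range(W):
                probs = [p_w_t[k][j] * p_t[k] * p_d_t[k][i] for k in range(num_topics)]
                z = sum(probs) or 1.0
                post[i][j] = [p / z for p in probs]
        # M step: re-estimate the parameters, formula (4).
        for k in range(num_topics):
            p_w_t[k] = normalize([sum(counts[i][j] * post[i][j][k] for i in range(D)) + 1e-12
                                  for j in range(W)])
            p_d_t[k] = normalize([sum(counts[i][j] * post[i][j][k] for j in range(W)) + 1e-12
                                  for i in range(D)])
        p_t = normalize([sum(counts[i][j] * post[i][j][k]
                             for i in range(D) for j in range(W)) + 1e-12
                         for k in range(num_topics)])
    return p_t, p_d_t, p_w_t
```

A production system would add a convergence check on the likelihood of formula (2) rather than running a fixed number of iterations.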
Therefore, as an example, the topic model may comprise: the probability of each text (p(d)), the conditional probability of a topic given each text (p(t|d)), the probability of a keyword in a text given the topic of that text (p(w|t)), and the joint probability of a text, its topic and its keywords (p(d, t, w)).
On this basis, in one embodiment, the topic feature computation module 1003 is configured to compute the conditional probability of each topic given the text.
As an example, this conditional probability p(t|d) is proportional to the product of the conditional probabilities of the keywords given the topic and the prior probability of that topic, as shown in the following formula:

p(t \mid d) = \frac{p(d \mid t) \, p(t)}{p(d)} \propto p(d \mid t) \, p(t) = \prod_{w \in T} p(w \mid t)^{n(w, T)} \times p(t)    (5)

where p(d|t) is the probability of the text given topic t, p(w|t) is the probability of keyword w given topic t, the exponent n(w, T) denotes the number of occurrences of keyword w in text T, and p(t) is the prior probability of topic t. This prior probability is the distribution of the topics obtained after training the generative model on the expert knowledge corpus. The resulting conditional probabilities p(t|d) serve as the topic features of the text for classification.
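Formula (5) can be evaluated in log space to avoid numerical underflow when a text contains many keywords. The sketch below is one possible rendering under assumed data structures (dictionaries mapping each topic to its prior and its per-keyword probabilities); these names and the unseen-word floor are illustrative assumptions.

```python
import math

def topic_features(keywords, topic_priors, word_given_topic, floor=1e-9):
    """Score p(t|d) ∝ p(t) · Π_w p(w|t)^n(w,T) for each topic (formula (5)).

    keywords: keywords extracted from the text (repeats encode counts);
    topic_priors: {topic: p(t)}; word_given_topic: {topic: {word: p(w|t)}}.
    Returns the scores normalized to sum to 1 over the topics.
    """
    log_scores = {}
    for t, prior in topic_priors.items():
        s = math.log(prior)
        for w in keywords:
            # Floor unseen keywords so the log stays finite.
            s += math.log(word_given_topic[t].get(w, floor))
        log_scores[t] = s
    # Normalize via a softmax over the log scores.
    m = max(log_scores.values())
    exp = {t: math.exp(s - m) for t, s in log_scores.items()}
    z = sum(exp.values())
    return {t: v / z for t, v in exp.items()}
```

The returned distribution over topics is what would be handed to the classifier as the topic feature vector.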
It should be noted that the way in which the topic model is produced is not limited to the above PLSA or LDA; any approach capable of yielding the above probabilities may be used.
< Emotion feature acquisition unit >
As shown in Fig. 4, the emotion feature acquisition unit 102 comprises: a word segmentation module 2001 configured to perform word segmentation on the text; an emotion word and/or symbol extraction module 2002 configured to extract one or more non-nouns and/or symbols of the text as emotion words and/or symbols reflecting the sentiment orientation of the text; and an emotion feature computation module 2003 configured to compute the emotion features of the text based on the emotion words and/or symbols using the emotion model, wherein the emotion model comprises at least part of the following: the probabilities of texts, emotion words and/or symbols and sentiment orientations, and the probabilities, joint probabilities or conditional probabilities of combinations thereof.
The word segmentation module 2001 has a function and structure similar to the aforementioned word segmentation module 1001 and may use any of various existing techniques to segment the text to be processed. In a concrete implementation, the word segmentation modules 2001 and 1001 may be the same module or element, or different modules or elements with the same function.
From the segmented text, the emotion word and/or symbol extraction module 2002 extracts non-nouns together with symbols in the text, such as emotion animations or emoticons, as emotion words and/or symbols reflecting the sentiment orientation of the text. Accordingly, in one embodiment, for a text containing symbolic expressions, these symbols are also converted after segmentation into corresponding words, such as "going mad", "about to cry", "laughing" and so on. In the following description, for simplicity, emotion words and/or symbols are collectively referred to as emotion words.
In addition, in one embodiment, the emotion word and/or symbol extraction module 2002 is also configured to convert an emotion word and/or symbol immediately adjacent to a negation word into its antonym. Negation words include, but are not limited to: avoid, not, no, cannot, hardly, less, never, no longer, fail to, seldom and the like. For example, in the text "Went to see 'Avatar' today, it really did not disappoint me", the word "disappoint" is a pessimistic word, but the sentence does not express a pessimistic mood, so "disappoint" is converted into an antonym such as "hope" and used as the emotion word.
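A minimal sketch of the negation handling just described: when an emotion word directly follows a negation word, it is replaced by its antonym. The negation list and antonym dictionary below are tiny illustrative stand-ins, not the ones a real system would use.

```python
# Illustrative stand-ins; a production system would use full lexicons.
NEGATION_WORDS = {"not", "no", "never", "hardly", "cannot"}
ANTONYMS = {"disappointed": "hopeful", "sad": "happy", "bad": "good"}

def flip_negated_emotions(tokens):
    """Replace an emotion word that immediately follows a negation word
    with its antonym, dropping the negation word itself."""
    out, i = [], 0
    while i < len(tokens):
        if (tokens[i] in NEGATION_WORDS and i + 1 < len(tokens)
                and tokens[i + 1] in ANTONYMS):
            out.append(ANTONYMS[tokens[i + 1]])
            i += 2  # consume the negation word and the emotion word
        else:
            out.append(tokens[i])
            i += 1
    return out
```

On the "Avatar" example above, "not disappointed" would thus be rewritten to the single optimistic emotion word "hopeful".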
Generally speaking, there are three kinds of sentiment orientation: positive emotion (e.g. active, optimistic), negative emotion (e.g. passive, pessimistic) and neutral emotion. A text posted by a user embodies one or more of these emotions through its emotion words.
As mentioned above, the emotion feature computation module 2003 computes the emotion features of the text based on the extracted emotion words using a pre-trained emotion model. The emotion model is obtained by training on a manually annotated question data set. Specifically, user-posted texts such as microblog texts are captured and collected in advance and annotated manually; the annotation labels may, for example, be {0, 1}, where 0 indicates that the text is not a question and 1 indicates that it is. The emotion model is trained using the texts labelled 1, that is, the question texts.
Each question text is first segmented into words; then the non-nouns, together with the words converted from symbols such as emotion animations, are extracted as emotion words, and in this process an emotion word immediately adjacent to a negation word may also be converted into its antonym. After the emotion words of the texts are obtained, a model relating these emotion words to the various sentiment orientations (positive, negative and neutral) needs to be computed.
In one embodiment, similarly to the case of the topic model, a generative model, for example a PLSA or LDA model, may be built over texts, sentiment orientations and emotion words. By training this generative model on the question texts, at least part of the following can be obtained as the emotion model: the probabilities of texts (d), emotion words (w) and sentiment orientations (s), and the probabilities, joint probabilities or conditional probabilities of combinations thereof.
How an emotion model is obtained is illustrated below taking the PLSA model as an example. Within the question data set, a text is selected with probability p(d), each text belongs to a class of emotion s with probability p(s|d), and, given a class of emotion s, each emotion word w is generated with probability p(w|s), as shown in Fig. 5. This process is represented by the joint probability expression (6):

p(w, s, d) = p(d) \sum_{k=1}^{3} p(w \mid s_k) \, p(s_k \mid d)    (6)

where there are the three sentiment orientations described above, s_k denotes the k-th sentiment orientation, and the remaining variables have the meanings defined above. p(s|d) and p(w|s) are solved for by maximum likelihood using the EM (expectation maximization) algorithm. The objective function of the maximum likelihood estimation is given by formula (7):

L = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log p(d_i, w_j)    (7)

where n(d_i, w_j) denotes the number of occurrences of the j-th emotion word w_j in the i-th text d_i, and N and M denote the numbers of texts and emotion words, respectively.
The E (expectation) step first computes:

p(s_k \mid d_i, w_j) = \frac{p(w_j \mid s_k) \, p(s_k) \, p(d_i \mid s_k)}{\sum_{k'} p(w_j \mid s_{k'}) \, p(s_{k'}) \, p(d_i \mid s_{k'})}    (8)

Then the M (maximization) step is performed:

p(w_j \mid s_k) \propto \sum_{i} n(d_i, w_j) \, p(s_k \mid d_i, w_j)
p(d_i \mid s_k) \propto \sum_{j} n(d_i, w_j) \, p(s_k \mid d_i, w_j)    (9)
p(s_k) \propto \sum_{i} \sum_{j} n(d_i, w_j) \, p(s_k \mid d_i, w_j)

By the above algorithm, the probabilities p(d), p(w|s_k), p(s_k|d) and so on can be obtained.
As an example, the emotion model may comprise: the probability of each text (p(d)), the conditional probability of each sentiment orientation given each text (p(s|d)), the probability of an emotion word in a text given the sentiment orientation of that text (p(w|s)), and the joint probability of a text, its sentiment orientation and its emotion words (p(d, s, w)).
On this basis, in one embodiment, the emotion feature computation module 2003 is configured to compute the conditional probability of each sentiment orientation given the text.
As an example, this conditional probability is proportional to the product of the conditional probabilities of the emotion words given the sentiment orientation and the prior probability of that sentiment orientation, as shown in the following formula:

p(s \mid d) = \frac{p(d \mid s) \, p(s)}{p(d)} \propto p(d \mid s) \, p(s) = \prod_{w \in T} p(w \mid s)^{n(w, T)} \times p(s)    (10)

where p(d|s) is the probability of the text given sentiment orientation s, p(w|s) is the probability of emotion word w given sentiment orientation s, the exponent n(w, T) denotes the number of occurrences of emotion word w in text T, and p(s) is the prior probability of sentiment orientation s. This prior probability is the distribution of the sentiment orientations obtained after training the generative model on the annotated question data set. The resulting conditional probabilities p(s|d) serve as the emotion features of the text for classification.
It should similarly be noted that the way in which the emotion model is produced is not limited to the above PLSA or LDA; any approach capable of yielding the above probabilities may be used.
< Question mark feature extraction unit >
The question mark feature extraction unit 103 extracts the question mark features of the text. Any existing extraction method may be used for this. A question mark feature may be an interrogative word, such as "what" or "how", or a question symbol such as a question mark. Such a feature may, for example, be defined as a Boolean value {0, 1}, indicating its presence or absence.
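As an illustrative sketch of such Boolean features, one might check for an interrogative word among the segmented tokens and for a question mark (ASCII or full-width, the latter common in Chinese microblog text) in the raw string; the interrogative list and function name are assumptions for illustration.

```python
# Tiny illustrative interrogative list; a real system would use a
# language-appropriate lexicon.
INTERROGATIVES = {"what", "how", "why", "when", "where", "who", "which"}

def question_mark_features(tokens, raw_text):
    """Return two Boolean {0, 1} features: presence of an interrogative
    word and presence of a question mark (ASCII '?' or full-width '？')."""
    has_interrogative = int(any(t.lower() in INTERROGATIVES for t in tokens))
    has_question_mark = int("?" in raw_text or "？" in raw_text)
    return [has_interrogative, has_question_mark]
```

Either feature alone is a weak signal (rhetorical questions, omitted punctuation), which is why the classifier combines them with the topic and emotion features.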
< Classifier >
After the topic features, emotion features and question mark features of the text are obtained as above, the classifier 104 classifies the text using these features, that is, predicts whether or not the text is a question.
A classifier built with any of various existing classification techniques may be used to classify the text, including but not limited to: support vector machines, random forests, decision trees, K nearest neighbours, maximum entropy and the like.
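To illustrate the final step, the three feature groups can be concatenated into a single input vector for any of the classifiers listed. The toy 1-nearest-neighbour classifier below stands in for those techniques purely as a sketch; the vectors in the usage example are fabricated for illustration and are not data from the patent.

```python
def combine_features(topic_feats, emotion_feats, qmark_feats):
    """Concatenate the p(t|d) values, the p(s|d) values and the Boolean
    question mark features into one classifier input vector."""
    return list(topic_feats) + list(emotion_feats) + list(qmark_feats)

def nearest_neighbour_classify(vector, labelled_vectors):
    """Toy 1-NN classifier: return the label of the closest training
    vector under squared Euclidean distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(labelled_vectors, key=lambda lv: dist(vector, lv[0]))[1]
```

In practice any trained classifier from the list above would replace the 1-NN stand-in; the point is only that all three feature groups enter the decision jointly.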
Since the topic features, emotion features and question mark features are all taken into account in the classification process, the data processing device 100 of the application can obtain more accurate classification results than conventional devices; that is, it can judge with higher accuracy whether a text posted by a user is a question.
[2. Data processing method]
The embodiments of the data processing device of the invention described above with reference to the drawings in fact also disclose a data processing method. This method is briefly described below with reference to Figs. 6 to 8; for details, refer to the foregoing description of the data processing device.
The data processing method is for judging whether a text posted by a user in a social service network is a question. As shown in Fig. 6, the method comprises the steps of: obtaining topic features of the text using a pre-trained topic model (S11); obtaining emotion features of the text using a pre-trained emotion model (S12); obtaining question mark features of the text (S13); and classifying the text with a classifier using the topic features, emotion features and question mark features (S14).
It should be noted that although steps S11 to S13 are shown in Fig. 6 as being performed in sequence, this is not necessarily the case; steps S11 to S13 may be performed in various other orders, or partly or entirely in parallel.
In one embodiment, the topic model is pre-trained on an expert knowledge corpus using a generative model, and the emotion model is pre-trained on a manually annotated question data set using a generative model.
Fig. 7 shows an embodiment of the topic feature obtaining step S11. As shown in Fig. 7, step S11 comprises: performing word segmentation on the text (S101); extracting one or more content words of the text as keywords reflecting its topics (S102); and computing the topic features of the text based on the keywords using the topic model (S103). The topic model comprises at least part of the following: the probabilities of texts, keywords and topics, and the probabilities, joint probabilities or conditional probabilities of combinations thereof. As mentioned above, the topic layer is a hidden layer between the text layer and the keyword layer.
In one embodiment, step S102 further comprises extracting at least part of the topic-marked content as keywords. As an example, computing the topic features (S103) comprises computing the conditional probability of each topic given the text. In one embodiment, this conditional probability is proportional to the product of the conditional probabilities of the keywords given the corresponding topic and the prior probability of that topic.
Fig. 8 shows an embodiment of the emotion feature obtaining step S12. As shown in Fig. 8, step S12 comprises: performing word segmentation on the text (S201); extracting one or more non-nouns and/or symbols of the text as emotion words and/or symbols reflecting its sentiment orientation (S202); and computing the emotion features of the text based on the emotion words and/or symbols using the emotion model (S203). The emotion model comprises at least part of the following: the probabilities of texts, emotion words and/or symbols and sentiment orientations, and the probabilities, joint probabilities or conditional probabilities of combinations thereof. The sentiment orientations may comprise three classes: positive emotion, negative emotion and neutral emotion.
In one embodiment, step S202 further comprises converting an emotion word and/or symbol immediately adjacent to a negation word into its antonym. As an example, computing the emotion features (S203) comprises computing the conditional probability of each sentiment orientation given the text. In one embodiment, this conditional probability is proportional to the product of the conditional probabilities of the emotion words and/or symbols given the corresponding sentiment orientation and the prior probability of that sentiment orientation.
Details of the above embodiments have been given in the description of the data processing device and are not repeated here.
[3. Topic model training method]
The foregoing description of the data processing device 100 in fact also discloses a method of training a topic model for judging whether a text in a social networking service is a question. As shown in Fig. 9, the method comprises: preparing an expert corpus (S21); performing word segmentation on each text in the expert corpus (S22); extracting one or more content words from each text as keywords reflecting the topic of the text (S23); and calculating at least some of the following probabilities as the topic model: the probabilities of texts, keywords and topics, and the joint or conditional probabilities of combinations thereof (S24).
In one embodiment, the topic model includes: the probability of each text, the conditional probability of the topic of each text given that text, the conditional probability of each keyword in the text given the topic of the text, and the joint probability of each text, its topic and its keywords.
A topic model trained by the above method reflects the distribution relations among texts, topics and keywords, and can therefore be used to calculate the topic feature of a text to be predicted. Details have been given in the description of the data processing device and are not repeated here.
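The probabilities listed in step S24 can be estimated by relative-frequency counting when each training text carries a topic annotation. That annotation is an assumption made here for illustration; the patent leaves the estimation procedure open, and with a genuinely hidden topic layer one would instead fit the model with, e.g., expectation-maximization.

```python
from collections import Counter, defaultdict

def train_topic_model(corpus):
    """Relative-frequency estimates for the probabilities named in S24.

    `corpus` is a hypothetical list of (topic, keywords) pairs, one per
    segmented text in the expert corpus. Returns the topic priors P(t)
    and the keyword conditionals P(w | t).
    """
    topic_count = Counter()
    kw_count = defaultdict(Counter)
    for topic, keywords in corpus:
        topic_count[topic] += 1
        for w in keywords:
            kw_count[topic][w] += 1
    n = sum(topic_count.values())
    topic_prior = {t: c / n for t, c in topic_count.items()}
    kw_given_topic = {t: {w: c / sum(cnt.values()) for w, c in cnt.items()}
                      for t, cnt in kw_count.items()}
    return topic_prior, kw_given_topic
```

The two returned tables are exactly the inputs that the topic-feature computation of step S103 consumes at prediction time.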
[4. Sentiment model training method]
The foregoing description of the data processing device 100 in fact also discloses a method of training a sentiment model for judging whether a text in a social networking service is a question. As shown in Fig. 10, the method comprises: preparing a question dataset annotated as to whether each text is a question (S31); performing word segmentation on each text in the question dataset (S32); extracting one or more non-nouns and/or symbols from each text as sentiment words and/or symbols reflecting the sentiment orientation of the text (S33); and calculating at least some of the following probabilities as the sentiment model: the probabilities of texts, sentiment words and/or symbols, and sentiment orientations, and the joint or conditional probabilities of combinations thereof (S34).
In one embodiment, the sentiment model includes: the probability of each text, the conditional probability of the sentiment orientation of each text given that text, the conditional probability of each sentiment word and/or symbol in the text given the sentiment orientation of the text, and the joint probability of each text, its sentiment orientation and its sentiment words and/or symbols.
A sentiment model trained by the above method reflects the distribution relations among texts, sentiment orientations and sentiment words, and can therefore be used to calculate the sentiment feature of a text to be predicted. Details have been given in the description of the data processing device and are not repeated here.
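The same counting scheme can estimate the sentiment model of step S34. Treating each training text as carrying a sentiment-orientation annotation is an illustrative assumption; the patent only requires the dataset to be annotated as question / non-question, and the orientation labels would have to come from elsewhere.

```python
from collections import Counter, defaultdict

def train_sentiment_model(dataset):
    """Relative-frequency estimates for the probabilities named in S34.

    `dataset` is a hypothetical list of (orientation, tokens) pairs with
    orientation in {'positive', 'negative', 'neutral'}, tokens being the
    sentiment words/symbols extracted in S33. Returns the orientation
    priors P(o) and the token conditionals P(token | o).
    """
    orient_count = Counter()
    tok_count = defaultdict(Counter)
    for orientation, tokens in dataset:
        orient_count[orientation] += 1
        for tok in tokens:
            tok_count[orientation][tok] += 1
    n = sum(orient_count.values())
    orient_prior = {o: c / n for o, c in orient_count.items()}
    tok_given_orient = {o: {t: c / sum(cnt.values()) for t, c in cnt.items()}
                        for o, cnt in tok_count.items()}
    return orient_prior, tok_given_orient
```

As with the topic model, the returned tables feed the sentiment-feature computation of step S203 at prediction time.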
[5. Computing device for implementing the apparatus and methods of the present application]
In the apparatus described above, the modules and units may be configured by software, firmware, hardware or a combination thereof. The specific means or manners of such configuration are well known to those skilled in the art and are not repeated here. Where software or firmware is used, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure (e.g. the general-purpose computer 1100 shown in Fig. 11), which, when various programs are installed thereon, is capable of performing various functions.
In Fig. 11, a central processing unit (CPU) 1101 performs various processes according to programs stored in a read-only memory (ROM) 1102 or loaded from a storage section 1108 into a random access memory (RAM) 1103. The RAM 1103 also stores, as needed, data required when the CPU 1101 performs the various processes. The CPU 1101, ROM 1102 and RAM 1103 are connected to one another via a bus 1104, to which an input/output interface 1105 is also connected.
The following components are connected to the input/output interface 1105: an input section 1106 (including a keyboard, a mouse, etc.), an output section 1107 (including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, etc.), a storage section 1108 (including a hard disk, etc.), and a communication section 1109 (including a network interface card such as a LAN card, a modem, etc.). The communication section 1109 performs communication via a network such as the Internet. A drive 1110 may also be connected to the input/output interface 1105 as needed. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 1110 as needed, so that a computer program read therefrom is installed into the storage section 1108 as needed.
When the series of processes described above is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1111.
Those skilled in the art will appreciate that the storage medium is not limited to the removable medium 1111 shown in Fig. 11, which stores the program and is distributed separately from the device to provide the program to the user. Examples of the removable medium 1111 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)) and a semiconductor memory. Alternatively, the storage medium may be the ROM 1102, a hard disk contained in the storage section 1108, or the like, which stores the program and is distributed to the user together with the device containing it.
The present invention also proposes a program product storing machine-readable instruction codes. When read and executed by a machine, the instruction codes perform the methods according to the embodiments of the present invention described above.
Accordingly, a storage medium carrying the above program product storing the machine-readable instruction codes is also included in the present disclosure. The storage medium includes but is not limited to a floppy disk, an optical disk, a magneto-optical disk, a memory card, a memory stick and the like.
Finally, it should also be noted that the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article or device. In the absence of further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
Although embodiments of the present invention have been described above in detail with reference to the accompanying drawings, it should be understood that the embodiments described above are merely illustrative of the invention and do not limit the invention. Various changes and modifications may be made to the above embodiments by those skilled in the art without departing from the spirit and scope of the invention. Accordingly, the scope of the invention is defined only by the appended claims and their equivalents.
From the above description, embodiments of the present invention provide the following technical solutions.
Supplementary note 1. A data processing device for judging whether a text published by a user in a social networking service is a question, comprising:
a topic feature acquisition unit configured to obtain a topic feature of the text using a pre-trained topic model;
a sentiment feature acquisition unit configured to obtain a sentiment feature of the text using a pre-trained sentiment model;
a question label feature extraction unit configured to obtain a question label feature of the text; and
a classifier configured to classify the text using the topic feature, the sentiment feature and the question label feature.
Supplementary note 2. The data processing device according to supplementary note 1, wherein the topic feature acquisition unit comprises:
a word segmentation module configured to perform word segmentation on the text;
a keyword extraction module configured to extract one or more content words from the text as keywords reflecting the topic of the text; and
a topic feature calculation module configured to calculate the topic feature of the text based on the keywords using the topic model,
wherein the topic model includes at least some of the following probabilities: the probabilities of texts, keywords and topics, and the joint or conditional probabilities of combinations thereof.
Supplementary note 3. The data processing device according to supplementary note 1, wherein the sentiment feature acquisition unit comprises:
a word segmentation module configured to perform word segmentation on the text;
a sentiment word and/or symbol extraction module configured to extract one or more non-nouns and/or symbols from the text as sentiment words and/or symbols reflecting the sentiment orientation of the text; and
a sentiment feature calculation module configured to calculate the sentiment feature of the text based on the sentiment words and/or symbols using the sentiment model,
wherein the sentiment model includes at least some of the following probabilities: the probabilities of texts, sentiment words and/or symbols, and sentiment orientations, and the joint or conditional probabilities of combinations thereof.
Supplementary note 4. The data processing device according to supplementary note 2, wherein the topic feature calculation module is configured to calculate the conditional probability of each topic given the text.
Supplementary note 5. The data processing device according to supplementary note 3, wherein the sentiment feature calculation module is configured to calculate the conditional probability of each sentiment orientation given the text.
Supplementary note 6. The data processing device according to supplementary note 4, wherein the conditional probability of a topic given the text is the product of the conditional probabilities of the keywords given that topic and the prior probability of that topic.
Supplementary note 7. The data processing device according to supplementary note 5, wherein the conditional probability of a sentiment orientation given the text is the product of the conditional probabilities of the sentiment words and/or symbols given that sentiment orientation and the prior probability of that sentiment orientation.
Supplementary note 8. The data processing device according to supplementary note 2, wherein the keyword extraction module is configured to extract at least part of the content bearing a topic tag as the keywords reflecting the topic of the text.
Supplementary note 9. The data processing device according to supplementary note 3, wherein the sentiment word and/or symbol extraction module is further configured to convert a sentiment word and/or symbol immediately adjacent to a negation word into its antonym.
Supplementary note 10. A data processing method for judging whether a text published by a user in a social networking service is a question, comprising:
obtaining a topic feature of the text using a pre-trained topic model;
obtaining a sentiment feature of the text using a pre-trained sentiment model;
obtaining a question label feature of the text; and
classifying the text with a classifier using the topic feature, the sentiment feature and the question label feature.
Supplementary note 11. The data processing method according to supplementary note 10, wherein obtaining the topic feature comprises:
performing word segmentation on the text;
extracting one or more content words from the text as keywords reflecting the topic of the text; and
calculating the topic feature of the text based on the keywords using the topic model,
wherein the topic model includes at least some of the following probabilities: the probabilities of texts, keywords and topics, and the joint or conditional probabilities of combinations thereof.
Supplementary note 12. The data processing method according to supplementary note 10, wherein obtaining the sentiment feature comprises:
performing word segmentation on the text;
extracting one or more non-nouns and/or symbols from the text as sentiment words and/or symbols reflecting the sentiment orientation of the text; and
calculating the sentiment feature of the text based on the sentiment words and/or symbols using the sentiment model,
wherein the sentiment model includes at least some of the following probabilities: the probabilities of texts, sentiment words and/or symbols, and sentiment orientations, and the joint or conditional probabilities of combinations thereof.
Supplementary note 13. The data processing method according to supplementary note 11, wherein calculating the topic feature comprises calculating the conditional probability of each topic given the text.
Supplementary note 14. The data processing method according to supplementary note 12, wherein calculating the sentiment feature comprises calculating the conditional probability of each sentiment orientation given the text.
Supplementary note 15. The data processing method according to supplementary note 13, wherein the conditional probability of a topic given the text is the product of the conditional probabilities of the keywords given that topic and the prior probability of that topic.
Supplementary note 16. The data processing method according to supplementary note 14, wherein the conditional probability of a sentiment orientation given the text is the product of the conditional probabilities of the sentiment words and/or symbols given that sentiment orientation and the prior probability of that sentiment orientation.
Supplementary note 17. A method of training a topic model for judging whether a text in a social networking service is a question, comprising:
preparing an expert corpus;
performing word segmentation on each text in the expert corpus;
extracting one or more content words from each text as keywords reflecting the topic of the text; and
calculating at least some of the following probabilities as the topic model: the probabilities of texts, keywords and topics, and the joint or conditional probabilities of combinations thereof.
Supplementary note 18. The method according to supplementary note 17, wherein the topic model includes: the probability of each text, the conditional probability of the topic of each text given that text, the conditional probability of each keyword in the text given the topic of the text, and the joint probability of each text, its topic and its keywords.
Supplementary note 19. A method of training a sentiment model for judging whether a text in a social networking service is a question, comprising:
preparing a question dataset annotated as to whether each text is a question;
performing word segmentation on each text in the question dataset;
extracting one or more non-nouns and/or symbols from each text as sentiment words and/or symbols reflecting the sentiment orientation of the text; and
calculating at least some of the following probabilities as the sentiment model: the probabilities of texts, sentiment words and/or symbols, and sentiment orientations, and the joint or conditional probabilities of combinations thereof.
Supplementary note 20. The method according to supplementary note 19, wherein the sentiment model includes: the probability of each text, the conditional probability of the sentiment orientation of each text given that text, the conditional probability of each sentiment word and/or symbol in the text given the sentiment orientation of the text, and the joint probability of each text, its sentiment orientation and its sentiment words and/or symbols.

Claims (10)

1. A data processing device for judging whether a text published by a user in a social networking service is a question, comprising:
a topic feature acquisition unit configured to obtain a topic feature of the text using a pre-trained topic model;
a sentiment feature acquisition unit configured to obtain a sentiment feature of the text using a pre-trained sentiment model;
a question label feature extraction unit configured to obtain a question label feature of the text; and
a classifier configured to classify the text using the topic feature, the sentiment feature and the question label feature.
2. The data processing device according to claim 1, wherein the topic feature acquisition unit comprises:
a word segmentation module configured to perform word segmentation on the text;
a keyword extraction module configured to extract one or more content words from the text as keywords reflecting the topic of the text; and
a topic feature calculation module configured to calculate the topic feature of the text based on the keywords using the topic model,
wherein the topic model includes at least some of the following probabilities: the probabilities of texts, keywords and topics, and the joint or conditional probabilities of combinations thereof.
3. The data processing device according to claim 1, wherein the sentiment feature acquisition unit comprises:
a word segmentation module configured to perform word segmentation on the text;
a sentiment word and/or symbol extraction module configured to extract one or more non-nouns and/or symbols from the text as sentiment words and/or symbols reflecting the sentiment orientation of the text; and
a sentiment feature calculation module configured to calculate the sentiment feature of the text based on the sentiment words and/or symbols using the sentiment model,
wherein the sentiment model includes at least some of the following probabilities: the probabilities of texts, sentiment words and/or symbols, and sentiment orientations, and the joint or conditional probabilities of combinations thereof.
4. The data processing device according to claim 2, wherein the topic feature calculation module is configured to calculate the conditional probability of each topic given the text.
5. The data processing device according to claim 3, wherein the sentiment feature calculation module is configured to calculate the conditional probability of each sentiment orientation given the text.
6. The data processing device according to claim 4, wherein the conditional probability of a topic given the text is the product of the conditional probabilities of the keywords given that topic and the prior probability of that topic.
7. The data processing device according to claim 5, wherein the conditional probability of a sentiment orientation given the text is the product of the conditional probabilities of the sentiment words and/or symbols given that sentiment orientation and the prior probability of that sentiment orientation.
8. A data processing method for judging whether a text published by a user in a social networking service is a question, comprising:
obtaining a topic feature of the text using a pre-trained topic model;
obtaining a sentiment feature of the text using a pre-trained sentiment model;
obtaining a question label feature of the text; and
classifying the text with a classifier using the topic feature, the sentiment feature and the question label feature.
9. A method of training a topic model for judging whether a text in a social networking service is a question, comprising:
preparing an expert corpus;
performing word segmentation on each text in the expert corpus;
extracting one or more content words from each text as keywords reflecting the topic of the text; and
calculating at least some of the following probabilities as the topic model: the probabilities of texts, keywords and topics, and the joint or conditional probabilities of combinations thereof.
10. The method according to claim 9, wherein the topic model includes: the probability of each text, the conditional probability of the topic of each text given that text, the conditional probability of each keyword in the text given the topic of the text, and the joint probability of each text, its topic and its keywords.
CN201310495278.6A 2013-10-21 2013-10-21 Data processing device, data processing method and program Pending CN104572613A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310495278.6A CN104572613A (en) 2013-10-21 2013-10-21 Data processing device, data processing method and program


Publications (1)

Publication Number Publication Date
CN104572613A true CN104572613A (en) 2015-04-29

Family

ID=53088717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310495278.6A Pending CN104572613A (en) 2013-10-21 2013-10-21 Data processing device, data processing method and program

Country Status (1)

Country Link
CN (1) CN104572613A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107315797A * 2017-06-19 2017-11-03 Jiangxi Hongdu Aviation Industry Group Co., Ltd. Internet news acquisition and text sentiment prediction system
CN108280164A * 2018-01-18 2018-07-13 Wuhan University Short text filtering and classification method based on category-related words
CN109684444A * 2018-11-02 2019-04-26 Xiamen Kuaishangtong Information Technology Co., Ltd. Intelligent customer service method and system
CN109783800A * 2018-12-13 2019-05-21 Beijing Baidu Netcom Science and Technology Co., Ltd. Method, apparatus, device and storage medium for acquiring sentiment keywords
CN117151344A * 2023-10-26 2023-12-01 Chengmu Technology (Zhuhai) Co., Ltd. Digital twin city population management method

Citations (2)

Publication number Priority date Publication date Assignee Title
CN102737629A * 2011-11-11 2012-10-17 Southeast University Embedded speech emotion recognition method and device
US8337208B1 (en) * 2009-11-19 2012-12-25 The United States Of America As Represented By The Administrator Of The National Aeronautics & Space Administration Content analysis to detect high stress in oral interviews and text documents

Non-Patent Citations (2)

Title
Probability and Statistics Teaching and Research Group, Tongji University: "Probability and Statistics", 31 May 2013 *
Yin Hang: "Research on a Domain-Specific Chinese Opinion-Oriented Question Answering System", China Masters' Theses Full-text Database, Information Science and Technology *



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150429