CN111538834A - Emotion dictionary construction method and system, emotion recognition method and system and storage medium - Google Patents
Emotion dictionary construction method and system, emotion recognition method and system and storage medium
- Publication number
- CN111538834A (application CN202010073983.7A)
- Authority
- CN
- China
- Prior art keywords
- word
- pad
- emotion
- similarity
- dictionary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to an emotion dictionary construction method comprising the following steps: preprocessing corpus data to obtain a segmented word list; performing word vector training on the preprocessed corpus data to obtain word vectors; determining the similarity between each word in a dictionary vocabulary, which includes the segmented word list, and each of a plurality of PAD seed emotion words, wherein the PAD seed emotion words have a corresponding plurality of standard PAD values; and determining the PAD value of each word in the dictionary vocabulary according to the standard PAD values and the similarity, thereby forming the emotion dictionary.
Description
Technical Field
The invention relates to a mechanism for performing corpus emotion analysis by utilizing semantics, in particular to an emotion dictionary construction method and system, an emotion recognition method and system and a computer readable storage medium.
Background
Existing emotion dictionaries are mainly one-dimensional, that is, they consist of two emotion sets: a positive emotion word list and a negative emotion word list. Widely used dictionaries include the National Taiwan University Sentiment Dictionary (NTUSD), the HowNet emotion dictionary, the HL emotion dictionary, and the like.
NTUSD is a widely used emotion dictionary compiled by National Taiwan University and contains 2810 positive words and 8276 negative words. The HowNet dictionary was compiled by the HowNet researchers; its Chinese part contains 4570 positive words and 4374 negative words, and its English part contains 4360 positive words and 4574 negative words. The HL dictionary is an emotion dictionary published and maintained by Hu and Liu, built mainly by hand, and contains 2006 positive words and 4783 negative words.
However, existing dictionaries consist mainly of positive and negative word lists and are not supported by a psychological model; that is, they are compiled mainly through subjective manual collection. Dictionary construction may therefore be affected by subjective human factors, and the resulting dictionary is not sufficiently scientific or accurate.
Second, a single dimension cannot fully describe an emotion. A traditional one-dimensional emotion dictionary can only distinguish whether the user's emotion is positive or negative. In man-machine conversation, however, it is necessary to capture not only the user's positive and negative emotions but also fine-grained emotions such as anger and carelessness, so that a suitable soothing response can be returned. A traditional one-dimensional dictionary clearly cannot accomplish this task, and the lack of a multi-dimensional emotion dictionary makes user emotion detection on man-machine conversation platforms difficult or ineffective. Mainstream man-machine conversation platforms currently on the market all fall short in this respect: platforms such as Zhuiyi have no emotion detection mechanism, and the Zhujian platform can detect only three user emotions, namely anger, dissatisfaction, and satisfaction.
Furthermore, existing dictionaries are built mostly by hand. Manual construction lets the subjective factors mentioned above interfere with dictionary construction and is also inefficient. In addition, domain-specific dictionaries are lacking: using a general-purpose dictionary in a human-computer interaction robot for intelligent customer service may reduce the accuracy of the final emotion analysis. At present, both domain-specific emotion dictionaries and cross-domain dictionary construction methods are lacking.
Disclosure of Invention
The invention provides a mechanism for performing corpus emotion analysis by utilizing semantics, which can describe emotion along multiple dimensions on a psychological basis. Specifically:
According to an aspect of the present invention, there is provided an emotion dictionary construction method including the steps of: preprocessing corpus data to obtain a segmented word list; performing word vector training on the preprocessed corpus data to obtain word vectors; determining the similarity between each word in a dictionary vocabulary, which includes the segmented word list, and each of a plurality of PAD seed emotion words, wherein the PAD seed emotion words have a corresponding plurality of standard PAD values; and determining the PAD value of each word in the dictionary vocabulary according to the standard PAD values and the similarity, thereby forming the emotion dictionary.
In some embodiments of the invention, optionally, the corpus data comprises general corpus.
In some embodiments of the invention, optionally, the corpus data comprises domain-related corpora.
In some embodiments of the present invention, optionally, the preprocessing includes word segmentation, stop-word removal, and unification of simplified and traditional Chinese characters.
In some embodiments of the invention, optionally, the plurality of PAD seed emotion words comprise happy, bored, dependent, contemptuous, relaxed, anxious, mild, and hostile.
In some embodiments of the present invention, optionally, the dictionary vocabulary further includes words in a synonym thesaurus whose usage frequency is above a first threshold.
In some embodiments of the present invention, optionally, each of the plurality of PAD seed emotion words is expanded based on the word vectors and the synonym thesaurus to form a corresponding plurality of seed emotion word sets, and the similarity is determined based on the plurality of seed emotion word sets.
In some embodiments of the present invention, optionally, the seed emotion word set to which a word belongs is determined according to the similarity between its word vector and the plurality of PAD seed emotion words.
In some embodiments of the present invention, optionally, synonyms of each of the plurality of PAD seed emotion words are obtained from the synonym thesaurus to expand that seed emotion word.
In some embodiments of the present invention, optionally, the similarity between each word in the dictionary vocabulary and each of the plurality of PAD seed emotion words is determined according to a weighted average of the similarities between that word and the members of the corresponding seed emotion word set.
In some embodiments of the invention, optionally, the step of determining a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarity comprises: determining the maximum similarity between each word in the dictionary vocabulary and its corresponding PAD seed emotion word; and determining the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, and $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity.
In some embodiments of the invention, optionally, the step of determining a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarity comprises: determining the maximum similarity between each word in the dictionary vocabulary and its corresponding PAD seed emotion word; if the maximum similarity of the word is greater than or equal to a second threshold, determining the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, and $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity; and if the maximum similarity of the word is smaller than the second threshold, all dimensions of the PAD value of the word are 0.
In some embodiments of the invention, optionally, the step of determining a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarity comprises: determining the maximum similarity between each word in the dictionary vocabulary and its corresponding PAD seed emotion word; and determining the PAD value of the word according to $PAD = \mathrm{sigmoid}(\alpha \max(SIM_i) - \beta) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity, and $\alpha$ and $\beta$ are parameters.
In some embodiments of the invention, optionally, the parameter α is equal to 15 and the parameter β is equal to 0.17.
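The three scoring variants described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the example second threshold of 0.5, and the two-entry seed table are assumptions made for the example; only the formulas and the values α = 15 and β = 0.17 come from the text.

```python
import math

# Illustrative subset of seed emotion words with standard PAD values
# (assumed for this sketch; a real table would hold all 8 seeds).
SEED_PAD = {
    "happy": (1.0, 1.0, 1.0),
    "hostile": (-1.0, 1.0, 1.0),
}


def similarity_weight(max_sim, mode="linear", threshold=0.5, alpha=15.0, beta=0.17):
    """Turn the maximum similarity into a scaling factor for the seed PAD value."""
    if mode == "linear":
        # PAD = max(SIM_i) * PAD_i
        return max_sim
    if mode == "threshold":
        # Same as linear, but words below the (hypothetical) threshold get 0.
        return max_sim if max_sim >= threshold else 0.0
    if mode == "sigmoid":
        # PAD = sigmoid(alpha * max(SIM_i) - beta) * PAD_i
        return 1.0 / (1.0 + math.exp(-(alpha * max_sim - beta)))
    raise ValueError(f"unknown mode: {mode}")


def word_pad(similarities, **kwargs):
    """similarities maps each seed word to the word's similarity with it."""
    best_seed = max(similarities, key=similarities.get)
    weight = similarity_weight(similarities[best_seed], **kwargs)
    if weight == 0.0:
        return (0.0, 0.0, 0.0)
    return tuple(weight * value for value in SEED_PAD[best_seed])
```

For instance, `word_pad({"happy": 0.8, "hostile": 0.3})` scales the standard value of "happy" by 0.8, while the same call with `mode="threshold"` and a maximum similarity below the threshold returns all zeros.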
According to another aspect of the present invention, there is provided a system for constructing an emotion dictionary, including: a preprocessing module configured to preprocess corpus data to obtain a segmented word list; a word vector training module configured to perform word vector training on the corpus data preprocessed by the preprocessing module to obtain word vectors; a similarity determination module configured to determine the similarity between each word in a dictionary vocabulary, which includes the segmented word list, and each of a plurality of PAD seed emotion words having a corresponding plurality of standard PAD values; and an emotion dictionary generation module configured to determine a PAD value of each word in the dictionary vocabulary according to the plurality of standard PAD values and the similarity, thereby forming the emotion dictionary.
In some embodiments of the invention, optionally, the corpus data comprises general corpus.
In some embodiments of the invention, optionally, the corpus data comprises domain-related corpora.
In some embodiments of the present invention, optionally, the preprocessing includes word segmentation, stop-word removal, and unification of simplified and traditional Chinese characters.
In some embodiments of the invention, optionally, the plurality of PAD seed emotion words comprise happy, bored, dependent, contemptuous, relaxed, anxious, mild, and hostile.
In some embodiments of the present invention, optionally, the dictionary vocabulary further includes words in a synonym thesaurus whose usage frequency is above a first threshold.
In some embodiments of the present invention, optionally, the system further comprises an emotion word expansion module configured to expand each of the plurality of PAD seed emotion words based on the word vectors and the synonym thesaurus to form a corresponding plurality of seed emotion word sets, and the similarity determination module determines the similarity based on the plurality of seed emotion word sets.
In some embodiments of the present invention, optionally, the emotion word expansion module determines the seed emotion word set to which a word belongs according to the similarity between its word vector and the plurality of PAD seed emotion words.
In some embodiments of the present invention, optionally, the emotion word expansion module obtains synonyms of each of the plurality of PAD seed emotion words from the synonym thesaurus to expand that seed emotion word.
In some embodiments of the present invention, optionally, the similarity determination module determines the similarity between each word in the dictionary vocabulary and each of the plurality of PAD seed emotion words according to a weighted average of the similarities between that word and the members of the corresponding seed emotion word set.
In some embodiments of the present invention, optionally, the emotion dictionary generation module is configured to determine the maximum similarity between each word in the dictionary vocabulary and its corresponding PAD seed emotion word, and to determine the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, and $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity.
In some embodiments of the present invention, optionally, the emotion dictionary generation module is configured to determine the maximum similarity between each word in the dictionary vocabulary and its corresponding PAD seed emotion word; if the maximum similarity of the word is greater than or equal to a second threshold, to determine the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, and $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity; and if the maximum similarity of the word is smaller than the second threshold, all dimensions of the PAD value of the word are 0.
In some embodiments of the invention, optionally, the emotion dictionary generation module is configured to determine the maximum similarity between each word in the dictionary vocabulary and its corresponding PAD seed emotion word, and to determine the PAD value of the word according to $PAD = \mathrm{sigmoid}(\alpha \max(SIM_i) - \beta) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity, and $\alpha$ and $\beta$ are parameters.
In some embodiments of the invention, optionally, the parameter α is equal to 15 and the parameter β is equal to 0.17.
According to another aspect of the present invention, there is provided a method for emotion recognition using an emotion dictionary constructed according to any one of the emotion dictionary construction methods described above or the emotion dictionary construction system described above, the method including the steps of: preprocessing the corpus to be identified; determining evaluation words in the preprocessed corpus; mapping the evaluation words to target words in the emotion dictionary, and determining PAD values of the target words; determining the PAD value of the corpus according to the PAD value of the target word; and determining the emotion type of the corpus according to the PAD value of the corpus.
In some embodiments of the present invention, optionally, the step of determining the PAD value of the corpus according to the PAD value of the target word includes: and taking the weighted average value of the PAD values of the target words as the PAD value of the corpus.
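The last two recognition steps can be sketched as follows. The weighted average follows the text; the uniform default weights and the sign-based mapping from a PAD value to an emotion type are assumptions made only for illustration, since the text does not specify either at this point. The function names are hypothetical.

```python
def corpus_pad(target_pads, weights=None):
    """Weighted average of the PAD values of the target words.

    target_pads: list of (P, A, D) tuples, one per target word.
    weights: optional per-word weights; uniform weights are assumed if omitted.
    """
    if not target_pads:
        return (0.0, 0.0, 0.0)
    if weights is None:
        weights = [1.0] * len(target_pads)
    total = sum(weights)
    return tuple(
        sum(w * pad[dim] for w, pad in zip(weights, target_pads)) / total
        for dim in range(3)
    )


def pad_octant(pad):
    """Assumed classification rule: label the corpus by the sign of each dimension,
    e.g. (0.5, 0.9, -0.2) -> "+P+A-D"."""
    return "".join(("+" if value >= 0 else "-") + axis for axis, value in zip("PAD", pad))
```

A call such as `pad_octant(corpus_pad(pads))` then yields a PAD representation that can be looked up against the seed emotion table.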
According to another aspect of the present invention, there is provided a system for emotion recognition using an emotion dictionary constructed according to any one of the emotion dictionary construction methods described above or according to any one of the emotion dictionary construction systems described above, including: a preprocessing module configured to preprocess a corpus to be recognized; an evaluation word determination module configured to determine an evaluation word in the corpus preprocessed by the preprocessing module; a target word PAD value determination module configured to map the evaluation word to a target word in the emotion dictionary and determine a PAD value of the target word; a corpus PAD value determining module configured to determine a PAD value of the corpus according to the PAD value of the target word; and the emotion type determining module is configured to determine the emotion type of the corpus according to the PAD value of the corpus.
In some embodiments of the present invention, optionally, the corpus PAD value determination module uses a weighted average of PAD values of the target words as the PAD value of the corpus.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform any of the methods described above.
Drawings
The above and other objects and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings, in which like or similar elements are designated by like reference numerals.
FIG. 1 shows an emotion dictionary construction method according to an embodiment of the present invention.
FIG. 2 shows a system for constructing an emotion dictionary according to an embodiment of the present invention.
FIG. 3 illustrates a method for emotion recognition according to an embodiment of the present invention.
FIG. 4 illustrates a system for emotion recognition according to one embodiment of the present invention.
FIG. 5 illustrates an emotion dictionary construction and application method according to an embodiment of the present invention.
Detailed Description
For the purposes of brevity and explanation, the principles of the present invention are described herein with reference primarily to exemplary embodiments thereof. However, those skilled in the art will readily recognize that the same principles are equally applicable to all types of emotion dictionary construction methods and systems, emotion recognition methods and systems, and computer-readable storage media, and that these same or similar principles may be implemented therein, with any such variations not departing from the true spirit and scope of the present patent application.
In order to solve one or more of the following problems, namely the lack of theoretical support, the inability of a one-dimensional dictionary to meet the requirements of emotion analysis in human-computer conversation, the low efficiency of manual construction, and the scarcity of domain-specific emotion dictionaries, the invention provides a multi-dimensional emotion dictionary construction mechanism based on the PAD model.
In some aspects of the invention, construction of the emotion dictionary relies on a widely used psychological model that is broadly accepted by psychologists and artificial intelligence researchers. The construction process requires almost no manual participation, which makes the final emotion dictionary more accurate. In particular, some aspects of the invention introduce the PAD psychological model into the construction of the emotion dictionary. The PAD model is a psychological model that describes human emotional states along three dimensions, P, A, and D. P is pleasure (pleasure-displeasure), which represents the positive or negative character of an individual's emotional state; A is arousal (arousal-nonarousal), which represents the individual's level of neurophysiological activation; D is dominance (dominance-submissiveness), which represents the individual's degree of control over the situation and over others. The PAD model provides solid theoretical support for the construction of the emotion dictionary.
In some aspects of the invention, each word is described along a plurality of dimensions. The model describes emotion in three dimensions, so emotion is characterized more accurately and the requirement for a multi-dimensional emotion dictionary is met. As a result, the words in the dictionary describe emotion more accurately and at a finer granularity, which helps to probe the user's state of mind accurately during man-machine conversation. In some aspects of the invention, the three dimensions of the PAD model are used to describe the emotion of each word, so that a multi-dimensional emotion dictionary is constructed. The three dimensions of a word are its pleasure level, its arousal level, and its dominance level. For example, "anger" and "fear" are both low in pleasure and high in arousal, but they are diametrically opposed in dominance: "anger" is high in dominance and "fear" is low in dominance. The three-dimensional PAD model thus describes human emotion completely and at a fine granularity, and meets the high requirements for emotion description in man-machine conversation practice.
In some aspects of the invention, the dictionary is constructed automatically by seed word expansion. Constructing and expanding seed words reduces the interference of subjective human factors in the construction process and significantly improves construction efficiency. In some aspects of the invention, the dictionary is constructed automatically from seed emotion words. First, the basic seed words are derived from the basic definition of the PAD model and comprise 8 basic emotions. They are then expanded using the semantic space and a synonym table, which brings the number of seed words to the order of hundreds. Finally, an emotion dictionary containing the PAD dimensions is constructed automatically on the basis of the semantic space. The whole construction process requires essentially no manual participation, which greatly improves construction efficiency.
In some aspects of the invention, domain information is added during the construction process. Related words in the field can be imported in the construction of the dictionary, the style of the final dictionary is influenced, and the emotion dictionary in the specific field can be flexibly customized. In some aspects of the invention, a domain-specific sentiment dictionary is constructed. Some aspects of the invention incorporate domain information in several ways: training by using domain-related linguistic data to obtain a word vector file; adding related words of the field when the seed words are expanded; the segmentation result of the domain-related corpus is used as a component of the vocabulary. Therefore, the domain information is added in all aspects of the construction process, and the domain related emotion dictionary is finally obtained. It is worth mentioning that only general information may be selected for use, so that the final emotion dictionary is applicable to general fields.
FIG. 1 shows an emotion dictionary construction method according to an embodiment of the present invention. As shown, the method includes the following steps. In step S102, corpus data is preprocessed to obtain a segmented word list, and word vector training is performed on the preprocessed corpus data to obtain word vectors.
The corpus data may be a general-purpose Chinese corpus, for example the Sogou internet corpus. Since such corpus data is derived from the general domain (and is therefore referred to as a general corpus), the emotion dictionary formed from it can be applied to the general domain. In addition, the corpus data may also be the data that needs to be analyzed, such as domain-specific user feedback data, online customer service dialogue data, and the like. Such corpus data belongs to (specific) domain-related corpora, and the emotion dictionary formed from it can therefore be applied to a specific domain. Moreover, when a (specific) domain-related corpus is used, introducing a general-domain corpus as well can improve the training effect of the word vectors. The following table illustrates domain-related corpora according to an aspect of the present invention:
Preprocessing may include, for example, one or more of data cleansing, word segmentation, stop-word removal, unification of simplified and traditional Chinese characters, and the like, with the aim of obtaining clean processed corpus data for further analysis. Word segmentation in the preprocessing may be implemented with the jieba Chinese word segmentation toolkit; in other examples of the present invention, other natural language processing word segmentation tools, such as ICTCLAS, may be used.
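A minimal sketch of such a preprocessing step is shown below. It is illustrative only: the function name, the tiny traditional-to-simplified character map, and the stop-word set are assumptions, and the segmenter is passed in as a parameter. The default `str.split` merely splits on whitespace; in practice a real segmenter such as `jieba.lcut` would be supplied instead.

```python
def preprocess(text, segment=str.split, stop_words=frozenset(), trad_to_simp=None):
    """Unify character variants, segment the text, and drop stop words.

    segment: callable that turns a string into a list of tokens
             (whitespace splitting here; a Chinese segmenter in practice).
    stop_words: tokens to discard after segmentation.
    trad_to_simp: optional dict mapping single traditional characters
                  to their simplified forms.
    """
    if trad_to_simp:
        text = text.translate(str.maketrans(trad_to_simp))
    tokens = segment(text)
    return [token for token in tokens if token.strip() and token not in stop_words]


# Example with pre-spaced text standing in for segmenter output.
tokens = preprocess(
    "我 很 開心 的",
    stop_words={"的"},
    trad_to_simp={"開": "开"},
)
```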
In step S102, word vector training is also performed on the preprocessed corpus data to obtain word vectors; this may alternatively be implemented as a separate step. In one example of the invention, the preprocessed corpus data may be trained using a word vector toolkit such as Gensim. For example, the vector dimension is set to 200 and the sliding window size to 5, the remaining parameters may use their default configurations, and either the CBOW or the Skip-Gram model may be selected. The result is a set of general or domain-specific word vectors (depending on the source of the corpus data), which provides semantic relation information among words for the subsequent construction of the emotion dictionary.
In step S104, the similarity (SIM value) between each word in the dictionary vocabulary, which includes the segmented word list, and each of a plurality of PAD seed emotion words is determined, the PAD seed emotion words having a corresponding plurality of standard PAD values. For example, each PAD seed emotion word may be represented by three quantitative values, P, A, and D, which together form its standard PAD value. In some examples of the invention, cosine similarity in the semantic space may be used to measure the similarity between each word in the dictionary vocabulary and each of the plurality of PAD seed emotion words. In the present invention, the dictionary vocabulary includes the word list obtained by segmentation in step S102. In other words, if the corpus data comes from the general domain, the dictionary vocabulary will have the characteristics of the general domain; if the corpus data comes from both the general domain and a specific domain, the dictionary vocabulary will have the characteristics of both.
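The cosine similarity measure mentioned above can be sketched in plain Python as follows. The helper names and the toy two-dimensional vectors are assumptions for illustration; real word vectors would come from the trained model (for example 200 dimensions).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def seed_similarities(word_vector, seed_vectors):
    """SIM value of one word against every seed emotion word."""
    return {
        seed: cosine_similarity(word_vector, vector)
        for seed, vector in seed_vectors.items()
    }


# Toy vectors, purely illustrative.
sims = seed_similarities([1.0, 0.0], {"happy": [2.0, 0.0], "hostile": [0.0, 3.0]})
```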
In step S106, a PAD value of each word in the dictionary vocabulary is determined according to the plurality of standard PAD values and the similarity, and an emotion dictionary is formed. Thus, each word in the emotion dictionary formed has a PAD value that is related to the standard PAD value and the similarity to the seed emotion word. The PAD value of each word may thus reflect the emotional properties behind the word.
In some embodiments of the present invention, the plurality of PAD seed emotion words may include happy, bored, dependent, contemptuous, relaxed, anxious, mild, and hostile, which are the 8 emotions defined by the Chinese version of the PAD emotion scale. Equivalently, when each of P, A, and D takes either a positive or a negative value (that is, +P or -P, +A or -A, and +D or -D), a total of 8 combinations is obtained. The following table shows one possible set of standard PAD values for these PAD seed emotion words.
Emotion | PAD representation | Standard PAD value |
Happy | +P+A+D | [+1,+1,+1] |
Bored | -P-A-D | [-1,-1,-1] |
Dependent | +P+A-D | [+1,+1,-1] |
Contemptuous | -P-A+D | [-1,-1,+1] |
Relaxed | +P-A+D | [+1,-1,+1] |
Anxious | -P+A-D | [-1,+1,-1] |
Mild | +P-A-D | [+1,-1,-1] |
Hostile | -P+A+D | [-1,+1,+1] |
The table above covers all 8 sign combinations of the P, A and D values, one per seed emotion. Consequently, following the description of step S106 above, the value of each dimension of the PAD value of each word in the dictionary vocabulary may, for example, lie within the interval [-1, +1].
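As a small illustration of why exactly 8 seed emotions arise, the sketch below stores the standard PAD values as a mapping and checks them against every sign combination of P, A and D. The English emotion names are translations chosen for this sketch, and the name `STANDARD_PAD` and the dictionary layout are assumptions.

```python
from itertools import product

# Seed emotions and sign patterns following the table of standard PAD values.
# The English names are translations; the layout is illustrative only.
STANDARD_PAD = {
    "happy":     (+1, +1, +1),
    "bored":     (-1, -1, -1),
    "dependent": (+1, +1, -1),
    "contempt":  (-1, -1, +1),
    "relaxed":   (+1, -1, +1),
    "anxious":   (-1, +1, -1),
    "mild":      (+1, -1, -1),
    "hostile":   (-1, +1, +1),
}

# Every +/- combination over the three dimensions P, A, D: 2**3 = 8 patterns.
all_sign_patterns = set(product((+1, -1), repeat=3))
covers_all = set(STANDARD_PAD.values()) == all_sign_patterns
```

Here `covers_all` is true only if each of the 8 sign patterns is used by exactly one seed emotion.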
In some embodiments of the invention, the dictionary vocabulary further includes words in a synonym lexicon whose word frequency is above a first threshold. Thus, the dictionary vocabulary will include the word segmentation results together with the words in the synonym lexicon whose word frequency is above the first threshold. For example, in some examples, the synonyms general-purpose thesaurus may be used as a vocabulary source, and words with a word frequency above 100 are selected as members of the dictionary vocabulary; the synonyms thesaurus contains a total of 56633 words with a word frequency above 100. Of course, in other examples of the invention, other lexicons may be selected as the vocabulary source, and the words whose word frequency is above a certain threshold may likewise be selected as members of the dictionary vocabulary.
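The construction of the dictionary vocabulary described above can be sketched as a set union. The function name `build_dictionary_vocabulary`, the representation of the thesaurus as a word-to-frequency mapping, and the sample words and counts are all hypothetical; only the threshold of 100 is taken from the text.

```python
def build_dictionary_vocabulary(segmented_words, thesaurus_freq, first_threshold=100):
    """Union of segmented corpus words and thesaurus words above the frequency threshold."""
    frequent = {word for word, freq in thesaurus_freq.items() if freq > first_threshold}
    return set(segmented_words) | frequent

# Hypothetical inputs for illustration.
segmented = ["red_envelope", "trouble", "money"]
thesaurus = {"happy": 5200, "trouble": 830, "obscure_word": 12}

vocab = build_dictionary_vocabulary(segmented, thesaurus)
```

Words that appear in both sources are kept once, and low-frequency thesaurus entries such as `obscure_word` are left out.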
In some embodiments of the present invention, each of the plurality of PAD seed emotion words may be expanded based on the word vectors and a synonym lexicon to form a corresponding plurality of seed emotion word sets, and the similarity is determined based on the plurality of seed emotion word sets. For example, a given PAD seed emotion word may be expanded using the word vectors to obtain one part of the corresponding seed emotion word set. As another example, the PAD seed emotion word may be expanded using the synonym lexicon to obtain another part of the corresponding seed emotion word set. Accordingly, the seed emotion word set can be composed of these two parts; that is, each PAD seed emotion word can be expanded into its corresponding seed emotion word set based on the word vectors and the synonym lexicon.
In some embodiments of the invention, the seed emotion word set to which a word belongs can be determined according to the similarity between its word vector and the plurality of PAD seed emotion words. Specifically, for example, words close to a seed emotion word in semantic space may first be obtained using cosine distance to form the set $s_i^{cosine}$.
Here $w_i$ denotes the word vector of the $i$-th PAD seed emotion word, where the range of $i$ depends on the number of seed emotion words (for example, $i$ may be 1 to 8), and $w$ denotes the word vector to be classified. The PAD seed emotion word with which the word vector to be classified has the maximum cosine measure, i.e. $\arg\max_i \cos(w, w_i)$, is computed, and the word is assigned to the seed emotion word set corresponding to that PAD seed emotion word.
In some embodiments of the present invention, each of the plurality of PAD seed emotion words may also be expanded with its synonyms, obtained from a synonym lexicon:
$s_i^{synonyms} = \mathrm{synonyms}(w_i)$
where $w_i$ denotes a PAD seed emotion word, and the function synonyms looks up the synonyms of the word $w_i$ in the synonym lexicon and forms the corresponding set $s_i^{synonyms}$.
The final expanded seed set is the union of the two sets, taken as the seed emotion word set $s_i^{seed}$:
$s_i^{seed} = \{ w_i,\ s_i^{cosine} \cup s_i^{synonyms} \}$
Here $i$ ranges from 1 to 8 to represent the 8 seed emotions, and $w_i$ is a PAD seed emotion word. In some examples, each expansion route may add about 50 words for each PAD seed emotion word, i.e., about 100 words in total per seed. The following table shows the seed emotion word sets obtained by expanding the 8 PAD seed emotion words to 8 × (50 + 50) = 800 words.
In some embodiments of the present invention, the similarity (SIM value) of each word in the dictionary vocabulary with each of the plurality of PAD seed emotion words may be determined according to a weighted average of its similarities with the members of the corresponding seed emotion word set. That is, the SIM value of a word $w$ in the vocabulary is its similarity measure with respect to each of the 8 seed emotions. This similarity measure can likewise be obtained by using cosine similarity in semantic space and taking a weighted average over the expansion words (the seed emotion word set) of each seed emotion,
where $i$ ranges from 1 to 8 to represent the 8 seed emotions, $k$ is the number of expansion words (the number of set members), and $j$ ranges from 1 to $k$.
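A minimal sketch of the SIM value as a weighted average of cosine similarities over the k members of one seed emotion word set is given below. The weights are not specified in this text, so equal weights are assumed by default; the function names and toy vectors are likewise hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sim_value(word_vec, seed_set_vecs, weights=None):
    """Weighted average of cosine similarities between a word and the k set members."""
    k = len(seed_set_vecs)
    if k == 0:
        raise ValueError("seed emotion word set must not be empty")
    if weights is None:
        weights = [1.0] * k  # assumption: equal weights, since none are given
    total = sum(weights)
    weighted = sum(wt * cosine(word_vec, vec) for wt, vec in zip(weights, seed_set_vecs))
    return weighted / total

# Toy example: one member identical to the word, one orthogonal to it.
word = [1.0, 0.0]
members = [[1.0, 0.0], [0.0, 1.0]]
sim_equal = sim_value(word, members)               # equal weights
sim_weighted = sim_value(word, members, [3.0, 1.0])  # first member counts three times
```

Repeating this for each of the 8 seed emotion word sets gives the 8 SIM values of a word.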
Taking the word "fear" as an example, its similarity to each of the 8 seed emotion word sets is calculated; the larger the value, the more similar the word is to that seed emotion. It should be noted that the following values are merely illustrative and may differ from the actual situation.
PAD seed emotion word | SIM value |
Happy | 0.1 |
Bored | 0.2 |
Dependent | 0.1 |
Contempt | 0.4 |
Relaxed | 0.2 |
Anxious | 0.8 |
Mild | 0.2 |
Hostile | 0.7 |
As can be seen from the above table, the word "fear" is more similar to the "hostile" or "anxious" emotions.
In some embodiments of the present invention, the step of determining a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarities, as recited in the above embodiments, may comprise: determining the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words; and determining the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, where $PAD$ denotes the PAD value of the word, $\max(SIM_i)$ denotes the maximum similarity, and $PAD_i$ denotes the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity. Continuing the above example, the maximum SIM value calculated is that of the "anxious" emotion, and according to the table given previously, the PAD value of the word "fear" is 0.8 × [-1, +1, -1] = [-0.8, +0.8, -0.8].
In some embodiments of the present invention, some adjustment principles may also be considered when calculating the PAD value, and the step of determining the PAD value of each word in the dictionary vocabulary according to the plurality of standard PAD values and the similarity, as recited in the above embodiments, may include the following. First, the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words is determined. Next, it is determined whether the maximum similarity of the word is greater than or equal to a second threshold. If the maximum similarity of the word is greater than or equal to the second threshold, the PAD value of the word is determined according to $PAD = \max(SIM_i) \times PAD_i$, where $PAD$ denotes the PAD value of the word, $\max(SIM_i)$ denotes the maximum similarity, and $PAD_i$ denotes the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity. If the maximum similarity of the word is less than the second threshold, every dimension of the PAD value of the word is 0. When even the maximum similarity of a word is very small, the word is probably not strongly related to any of the PAD seed emotion words described above, so there is no need to force an association between the word and a particular PAD seed emotion word; instead, each dimension of its PAD value may be set to 0 in some examples.
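The maximum-similarity rule and its thresholded variant can be sketched together, reusing the illustrative SIM values for "fear". The function name `pad_value`, the English emotion keys, and the threshold value 0.9 used in the second call are hypothetical; the text does not state a value for the second threshold.

```python
# Standard PAD values of the 8 seed emotions (English names are translations).
STANDARD_PAD = {
    "happy": (+1, +1, +1), "bored": (-1, -1, -1),
    "dependent": (+1, +1, -1), "contempt": (-1, -1, +1),
    "relaxed": (+1, -1, +1), "anxious": (-1, +1, -1),
    "mild": (+1, -1, -1), "hostile": (-1, +1, +1),
}

def pad_value(sim_by_seed, second_threshold=0.0):
    """PAD = max(SIM_i) * PAD_i, or all zeros when max(SIM_i) is below the threshold."""
    best_seed = max(sim_by_seed, key=sim_by_seed.get)
    best_sim = sim_by_seed[best_seed]
    if best_sim < second_threshold:
        return (0.0, 0.0, 0.0)
    return tuple(best_sim * component for component in STANDARD_PAD[best_seed])

# Illustrative SIM values for the word "fear".
fear_sims = {
    "happy": 0.1, "bored": 0.2, "dependent": 0.1, "contempt": 0.4,
    "relaxed": 0.2, "anxious": 0.8, "mild": 0.2, "hostile": 0.7,
}

fear_pad = pad_value(fear_sims)                                 # (-0.8, 0.8, -0.8)
fear_pad_strict = pad_value(fear_sims, second_threshold=0.9)    # (0.0, 0.0, 0.0)
```

With the default threshold the result reproduces the worked example 0.8 × [-1, +1, -1]; with a threshold above 0.8 the word is treated as emotionally neutral.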
In some embodiments of the present invention, the step of determining the PAD value of each word in the dictionary vocabulary according to the plurality of standard PAD values and similarities, as described in the above embodiments, may include: first, determining the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words; and second, determining the PAD value of the word using the sigmoid function according to $PAD = \mathrm{sigmoid}(\alpha \max(SIM_i) - \beta) \times PAD_i$, where $PAD$ denotes the PAD value of the word, $\max(SIM_i)$ denotes the maximum similarity, $PAD_i$ denotes the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity, and $\alpha$ and $\beta$ are parameters. This approach is more refined than the previous embodiments, and the resulting PAD value can better reflect the emotional attributes of the corpus data. A threshold $\beta$ is set so that a value of $\max(SIM_i)$ that is too low also leads to a PAD value of 0; in addition, the value of $\alpha$ controls the amplitude. Continuing the above example, for the word "fear", the PAD value is $\mathrm{sigmoid}(\alpha \max(SIM_i) - \beta) \times [-1, +1, -1]$.
In some embodiments of the present invention, the parameter α is equal to 15 and the parameter β is equal to 0.17, both of which are experimentally optimized values.
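The sigmoid-based variant can be sketched as below, with α = 15 and β = 0.17 as stated. The grouping sigmoid(α·max(SIM) − β) is one reading of the formula as it appears in this text and should be treated as an assumption, as should the function names.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def pad_value_sigmoid(max_sim, standard_pad, alpha=15.0, beta=0.17):
    """Scale the standard PAD value by sigmoid(alpha * max_sim - beta)."""
    scale = sigmoid(alpha * max_sim - beta)
    return tuple(scale * component for component in standard_pad)

# Continuing the "fear" example: maximum similarity 0.8 with the "anxious" seed.
fear_pad = pad_value_sigmoid(0.8, (-1, +1, -1))
```

For a maximum similarity of 0.8 the scale factor is very close to 1, so the PAD value of "fear" is very close to the standard value [-1, +1, -1]; smaller similarities give a smaller scale factor.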
FIG. 2 shows a system for constructing an emotion dictionary according to an embodiment of the present invention. As shown, system 20 includes a preprocessing module 22, a word vector training module 24, a similarity determination module 26, and an emotion dictionary generation module 28.
The preprocessing module 22 is configured to preprocess the corpus data to obtain a word-segmented vocabulary. The corpus data may be a general-purpose Chinese corpus, for example the Sogou internet corpus. Since such corpus data is derived from the general domain (and is therefore referred to as a general corpus), the emotion dictionary formed from it can be applied to the general domain. In addition, the corpus data may also be the data that needs to be analyzed, such as domain-specific user feedback data, online customer service dialogue data, and the like. Such corpus data belongs to (specific) domain-related corpora, and thus the emotion dictionary formed from it can be applied to that specific domain. Moreover, when a (specific) domain-related corpus is used, introducing a general-domain corpus as well can improve the training of the word vectors.
Preprocessing may include, for example, one or more of data cleaning, word segmentation, stop-word removal, conversion between simplified and traditional Chinese, and the like, the aim being to obtain clean processed corpus data for further analysis. Word segmentation in preprocessing may be implemented with the jieba Chinese word segmentation toolkit; in other examples of the present invention, other natural language processing word segmentation tools, such as ICTCLAS, may be used.
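The order of the preprocessing steps can be illustrated with a minimal standard-library sketch. The whitespace tokenizer is only a stand-in for a real Chinese word segmenter (no segmentation toolkit is called here), and the function name, the `segment` parameter, the stop-word list and the sample sentence are all hypothetical.

```python
import re

def preprocess(text, stop_words, segment=str.split):
    """Clean a raw sentence, segment it, and drop stop words."""
    # Data cleaning: replace punctuation and symbols with spaces.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    # Word segmentation: `segment` is a placeholder for a real segmenter.
    tokens = segment(cleaned)
    # Stop-word removal.
    return [token for token in tokens if token not in stop_words]

tokens = preprocess(
    "The money is too troublesome to spend!",
    stop_words={"The", "is", "too", "to"},
)
```

Passing a different callable as `segment` is how a dedicated segmentation tool would be plugged into this sketch.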
The word vector training module 24 is configured to perform word vector training on the corpus data preprocessed by the preprocessing module 22 to obtain word vectors. In one example of the invention, the preprocessed corpus data may be trained using a word vector toolkit such as Gensim. The parameters may use the default configuration, and either CBOW or Skip-Gram may be chosen as the model. The result is a set of general or domain-specific word vectors (depending on the source of the corpus data), which provides the semantic relation information among words for the later construction of the emotion dictionary.
The similarity determination module 26 is configured to determine a similarity (SIM value) between each word in the dictionary vocabulary, which includes the word-segmented vocabulary, and each of a plurality of PAD seed emotion words having a corresponding plurality of standard PAD values. For example, each PAD seed emotion word may be represented by three quantitative values P, A and D, which together form its standard PAD value. In some examples of the invention, cosine similarity in semantic space may be used to measure the similarity of each word in the dictionary vocabulary to each of the plurality of PAD seed emotion words. In the present invention, the dictionary vocabulary includes the vocabulary obtained by word segmentation. In other words, if the corpus data comes from the general domain, the dictionary vocabulary will have general-domain characteristics; if the corpus data is derived from both the general domain and a specific domain, the dictionary vocabulary will have both general-domain and domain-specific characteristics.
The emotion dictionary generation module 28 is configured to determine a PAD value of each word in the dictionary vocabulary according to the plurality of standard PAD values and the similarity, and thereby form an emotion dictionary. Thus, each word in the emotion dictionary formed has a PAD value that is related to the standard PAD value and the similarity to the seed emotion word. The PAD value of each word may thus reflect the emotional properties behind the word.
In some embodiments of the present invention, the plurality of PAD seed emotion words may include happy, bored, dependent, contempt, relaxed, anxious, mild and hostile, which are the 8 emotions defined by the Chinese version of the PAD emotion scale. Equivalently, when each of P, A and D in the standard PAD value is allowed to take a positive or negative sign (i.e., +P, -P, +A, -A, +D and -D), a total of 8 sign combinations is obtained. The following table shows one possible set of standard PAD values for the PAD seed emotion words happy, bored, dependent, contempt, relaxed, anxious, mild and hostile.
The table above covers all 8 sign combinations of the P, A and D values, one per seed emotion, whereby the value of each dimension of the PAD value of each word in the dictionary vocabulary may, for example, lie within the interval [-1, +1].
In some embodiments of the invention, the dictionary vocabulary further includes words in a synonym lexicon whose word frequency is above a first threshold. Thus, the dictionary vocabulary will include the word segmentation results together with the words in the synonym lexicon whose word frequency is above the first threshold. For example, in some examples, the synonyms general-purpose thesaurus may be used as a vocabulary source, and words with a word frequency above 100 are selected as members of the dictionary vocabulary; the synonyms thesaurus contains a total of 56633 words with a word frequency above 100. Of course, in other examples of the invention, other lexicons may be selected as the vocabulary source, and the words whose word frequency is above a certain threshold may likewise be selected as members of the dictionary vocabulary.
In some embodiments of the present invention, system 20 further comprises an emotion word expansion module (not shown) configured to expand each of the plurality of PAD seed emotion words based on the word vectors and a synonym lexicon to form a corresponding plurality of seed emotion word sets, and the similarity determination module determines the similarity based on the plurality of seed emotion word sets. For example, a given PAD seed emotion word may be expanded using the word vectors to obtain one part of the corresponding seed emotion word set. As another example, the PAD seed emotion word may be expanded using the synonym lexicon to obtain another part of the corresponding seed emotion word set. Accordingly, the seed emotion word set can be composed of these two parts; that is, each PAD seed emotion word can be expanded into its corresponding seed emotion word set based on the word vectors and the synonym lexicon.
In some embodiments of the invention, the emotion word expansion module may determine the seed emotion word set to which a word belongs according to the similarity between its word vector and the plurality of PAD seed emotion words. Specifically, for example, words close to a seed emotion word in semantic space may first be obtained using cosine distance to form the set $s_i^{cosine}$.
Here $w_i$ denotes the word vector of the $i$-th PAD seed emotion word, where the range of $i$ depends on the number of seed emotion words (for example, $i$ may be 1 to 8), and $w$ denotes the word vector to be classified. The PAD seed emotion word with which the word vector to be classified has the maximum cosine measure, i.e. $\arg\max_i \cos(w, w_i)$, is computed, and the word is assigned to the seed emotion word set corresponding to that PAD seed emotion word.
In some embodiments of the present invention, the emotion word expansion module may obtain the synonyms of each of the plurality of PAD seed emotion words from a synonym lexicon in order to expand it:
$s_i^{synonyms} = \mathrm{synonyms}(w_i)$
where $w_i$ denotes a PAD seed emotion word, and the function synonyms looks up the synonyms of the word $w_i$ in the synonym lexicon and forms the corresponding set $s_i^{synonyms}$.
The final expanded seed set is the union of the two sets, taken as the seed emotion word set $s_i^{seed}$:
$s_i^{seed} = \{ w_i,\ s_i^{cosine} \cup s_i^{synonyms} \}$
Here $i$ ranges from 1 to 8 to represent the 8 seed emotions, and $w_i$ is a PAD seed emotion word. In some examples, each expansion route may add about 50 words for each PAD seed emotion word, i.e., about 100 words in total per seed. The following table shows the seed emotion word sets obtained by expanding the 8 PAD seed emotion words to 8 × (50 + 50) = 800 words.
In some embodiments of the present invention, the similarity determination module 26 may determine the similarity (SIM value) of each word in the dictionary vocabulary with each of the plurality of PAD seed emotion words according to a weighted average of its similarities with the members of the corresponding seed emotion word set. That is, the SIM value of a word $w$ in the vocabulary is its similarity measure with respect to each of the 8 seed emotions. This similarity measure can likewise be obtained by using cosine similarity in semantic space and taking a weighted average over the expansion words (the seed emotion word set) of each seed emotion,
where $i$ ranges from 1 to 8 to represent the 8 seed emotions, and $k$ is the number of expansion words.
Taking the word "fear" as an example, its similarity to each of the 8 seed emotion word sets is calculated; the larger the value, the more similar the word is to that seed emotion. It should be noted that the following values are merely illustrative and may differ from the actual situation.
As can be seen from the above table, the word "fear" is more similar to the "hostile" or "anxious" emotions.
In some embodiments of the present invention, the emotion dictionary generation module 28 may be configured to determine the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words, and to determine the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, where $PAD$ denotes the PAD value of the word, $\max(SIM_i)$ denotes the maximum similarity, and $PAD_i$ denotes the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity. Continuing the above example, the maximum SIM value calculated is that of the "anxious" emotion, and according to the table given previously, the PAD value of the word "fear" is 0.8 × [-1, +1, -1] = [-0.8, +0.8, -0.8].
In some embodiments of the present invention, some adjustment principles may also be considered when calculating the PAD value, and the emotion dictionary generation module 28 may be configured to determine the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words; if the maximum similarity of the word is greater than or equal to a second threshold, to determine the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, where $PAD$ denotes the PAD value of the word, $\max(SIM_i)$ denotes the maximum similarity, and $PAD_i$ denotes the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity; and, if the maximum similarity of the word is less than the second threshold, to set every dimension of the PAD value of the word to 0. When even the maximum similarity of a word is very small, the word is probably not strongly related to any of the PAD seed emotion words described above, so there is no need to force an association between the word and a particular PAD seed emotion word; instead, each dimension of its PAD value may be set to 0 in some examples.
In some embodiments of the invention, the emotion dictionary generation module 28 may be configured to determine the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words, and to determine the PAD value of the word according to $PAD = \mathrm{sigmoid}(\alpha \max(SIM_i) - \beta) \times PAD_i$, where $PAD$ denotes the PAD value of the word, $\max(SIM_i)$ denotes the maximum similarity, $PAD_i$ denotes the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity, and $\alpha$ and $\beta$ are parameters. This approach is more refined than the previous embodiments, and the resulting PAD value can better reflect the emotional attributes of the corpus data. A threshold $\beta$ is set so that a value of $\max(SIM_i)$ that is too low also leads to a PAD value of 0; in addition, the value of $\alpha$ controls the amplitude. Continuing the above example, for the word "fear", the PAD value is $\mathrm{sigmoid}(\alpha \max(SIM_i) - \beta) \times [-1, +1, -1]$.
In some embodiments of the present invention, optionally, the parameter α is equal to 15 and the parameter β is equal to 0.17, both of which are experimentally obtained optimal values.
According to another aspect of the present invention, as shown in fig. 3, there is provided a method for emotion recognition using an emotion dictionary constructed by any one of the above emotion dictionary construction methods or by any one of the above emotion dictionary construction systems, the method including the following steps. First, dictionary mapping is carried out: the corpus to be recognized is preprocessed in step S302; an evaluation word in the preprocessed corpus is determined in step S304; and the evaluation word is mapped to a target word in the emotion dictionary in step S306, and the PAD value of the target word is determined. How to build the PAD emotion dictionary has been described above. For a user reply in a human-machine conversation, the sentence is cleaned and segmented, and the segmentation result is then mapped against the PAD emotion dictionary to obtain the PAD value of each target word. For example, for the sentence "I basically never use the red envelopes below; spending the money is too troublesome, and it feels like chicken ribs!" (where "chicken ribs" is an idiom for something of little value), stop-word removal and word segmentation are performed first, and dependency parsing and rules are then used to obtain the evaluation words. Finally, the evaluation words are mapped against the dictionary to obtain the target words {trouble, chicken ribs}.
Next, emotion analysis is performed. In step S308, the PAD value of the corpus is determined according to the PAD value of the target word; in step S310, the emotion type of the corpus is determined according to the PAD value of the corpus. In some embodiments of the present invention, the step of determining the PAD value of the corpus based on the PAD value of the target word comprises taking the weighted average of the PAD values of the target words as the PAD value of the corpus. For example, the PAD values of all target words in the sentence can be weighted and averaged to obtain the final PAD value of the sentence. Some standard emotions can be looked up in the PAD dictionary, the distance between the PAD value of the sentence and each standard emotion is examined, and the emotion to which the sentence belongs is finally judged. These standard emotions can be derived from the HOE model (hourglass emotion model). The HOE model contains 24 standard emotions in total, which are respectively: "calm", "happy", "mad", "sad", "hurting", "reception", "trusted", "worrisd", "hate", "worry", "fear", "frightened", "surprise", "anger", "rage", "undeveloped", "surprise", "concern", "expectation", "alert". Finally, by calculating the distance between the PAD value of the sentence and each standard emotion, the sentence is judged to belong to the most similar emotion, "impatience".
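Steps S308 and S310 can be sketched as a weighted average followed by a nearest-neighbour lookup. Everything numeric below is hypothetical: the PAD values of the target words, the coordinates of the standard emotions, and the use of equal weights and Euclidean distance are assumptions, since the text only speaks of a weighted average and a distance.

```python
import math

def corpus_pad(target_pads, weights=None):
    """Weighted average of the PAD values of the target words."""
    n = len(target_pads)
    if n == 0:
        raise ValueError("at least one target word is required")
    if weights is None:
        weights = [1.0] * n  # assumption: equal weights
    total = sum(weights)
    return tuple(
        sum(w * pad[d] for w, pad in zip(weights, target_pads)) / total
        for d in range(3)
    )

def nearest_emotion(pad, standard_emotions):
    """Name of the standard emotion closest to `pad` (Euclidean distance assumed)."""
    return min(standard_emotions, key=lambda name: math.dist(pad, standard_emotions[name]))

# Hypothetical PAD values for the target words {trouble, chicken ribs}.
target_pads = [(-0.6, 0.4, -0.2), (-0.4, -0.2, -0.2)]
# Hypothetical PAD coordinates for a few standard emotions.
standard_emotions = {
    "impatience": (-0.5, 0.2, -0.1),
    "happy": (0.8, 0.5, 0.4),
    "calm": (0.3, -0.6, 0.2),
}

sentence_pad = corpus_pad(target_pads)
label = nearest_emotion(sentence_pad, standard_emotions)
```

With these made-up numbers the sentence PAD value is about (-0.5, 0.1, -0.2), and the closest standard emotion is "impatience".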
According to another aspect of the present invention, as shown in fig. 4, there is provided a system 40 for emotion recognition using an emotion dictionary constructed by any one of the above emotion dictionary construction methods or by any one of the above emotion dictionary construction systems, including: a preprocessing module 402 configured to preprocess a corpus to be recognized; an evaluation word determination module 404 configured to determine an evaluation word in the corpus preprocessed by the preprocessing module; and a target word PAD value determination module 406 configured to map the evaluation word to a target word in the emotion dictionary and determine a PAD value for the target word. How to build the PAD emotion dictionary has been described above. For a user reply in a human-machine conversation, the sentence is cleaned and segmented, and the segmentation result is then mapped against the PAD emotion dictionary to obtain the PAD value of each target word. The system further includes a corpus PAD value determination module 408 configured to determine a PAD value of the corpus according to the PAD value of the target word, and an emotion type determination module 410 configured to determine an emotion type of the corpus according to the PAD value of the corpus. In some embodiments of the present invention, the corpus PAD value determination module 408 takes a weighted average of the PAD values of the target words as the PAD value of the corpus. For example, the PAD values of all target words in the sentence can be weighted and averaged to obtain the final PAD value of the sentence. Some standard emotions can be looked up in the PAD dictionary, the distance between the PAD value of the sentence and each standard emotion is examined, and the emotion to which the sentence belongs is finally judged.
FIG. 5 illustrates an emotion dictionary construction and application method according to an embodiment of the present invention, in which the application of the basic principles of the present invention is more fully illustrated in one possible application scenario. First, domain word vector training is performed in step one, and more detailed steps are shown in fig. 5, which specifically includes obtaining domain-related corpus 501, data preprocessing 502, and training word vector 503, where one of the purposes of step one is to generate domain word vector 504. Secondly, performing seed word expansion in step two, and referring to fig. 5 in more detail, specifically including expanding 507 the base seed emotion words by using a synonym table 506 based on the spatial similarity between the segmented result 505 and the base seed emotion words, and finally forming an expanded seed set 508. Thirdly, constructing an emotion dictionary in the third step to form a PAD emotion dictionary 512, specifically including vocabulary acquisition, SIM value calculation 509, PAD value calculation 510 and PAD value adjustment 511, although the PAD value calculation 510 and PAD value adjustment 511 may also be expressed as directly calculating an optimized PAD value. Finally, the dictionary is applied in the human-computer conversation for emotion analysis, specifically comprising dictionary mapping 513, emotion analysis 514, human-computer conversation scenario QA application 515.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform any of the methods described above. Computer-readable media, as referred to in the present invention, include all types of computer storage media, which can be any available media that can be accessed by a general purpose or special purpose computer. By way of example, computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other transitory or non-transitory medium that can be used to carry or store desired program code means in the form of instructions or data structures and that can be accessed by a general purpose or special purpose computer, or a general purpose or special purpose processor. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
To test the effectiveness of the constructed PAD emotion dictionary, the most widely used Chinese emotion dictionary, NTUSD, can be selected as the baseline. The emotion classification data set COAE2015 is selected as the data set, and its statistical information is as follows:
In order to keep the upper-level emotion analysis method from influencing the comparison of the underlying resources, an SVM algorithm can be used uniformly, together with the dictionary feature representation method proposed by Duy Tin Vo: the dictionary is used as one of the SVM features, a classification model is obtained from the training data, and the model is verified on the test data. Accuracy (Acc) and the F1 value can be selected as the test indexes, calculated as follows:
$Acc = \frac{TP + TN}{TP + TN + FP + FN}$, $P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, $F1 = \frac{2 \times P \times R}{P + R}$
where P is the precision and R is the recall. TP indicates that an originally positive sentence is predicted to be positive, TN indicates that an originally negative sentence is predicted to be negative, FP indicates that an originally negative sentence is predicted to be positive, and FN indicates that an originally positive sentence is predicted to be negative.
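The two test indexes can be computed from the four counts defined above using the standard definitions of accuracy, precision, recall and F1. The function name and the counts in the example call are hypothetical and are not the COAE2015 results.

```python
def accuracy_and_f1(tp, tn, fp, fn):
    """Return (Acc, F1) from the confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return acc, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return acc, f1

# Hypothetical counts for illustration only.
acc, f1 = accuracy_and_f1(tp=40, tn=45, fp=5, fn=10)
```

For these counts the accuracy is 85/100 and the F1 value equals 2·TP / (2·TP + FP + FN) = 80/95.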
The results are shown below:
Dictionary | Acc | F1 |
NTUSD | 0.793 | 0.642 |
PAD | 0.826 | 0.714 |
As can be seen from the results, the PAD emotion dictionary is superior to NTUSD in both the accuracy Acc and F1 values, mainly because the PAD model captures emotion information of multiple dimensions of each word and can provide more information in the machine learning model, and therefore, compared with the traditional two-dimensional dictionary, the PAD emotion dictionary has better effect in emotion analysis tasks. It is also worth mentioning that the PAD emotion dictionary has the advantage of multi-dimensional representation of emotion, so that the use method and application scene thereof have more possibilities.
It should be noted that some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
A computer program (also known as a program, software application, script, or code) can be written in any form of programming language, including compiled, interpreted, declarative, and procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, libraries, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC). Such a special-purpose circuit may be referred to as a computer processor, even if it is not a general-purpose processor.
The above examples mainly illustrate the emotion dictionary construction method and system, the emotion recognition method and system, and the computer-readable storage medium of the present invention. As can be seen from the above description, in some embodiments of the present invention they describe emotion comprehensively along multiple dimensions by means of a psychological model. Furthermore, the dictionary construction mechanism of the present invention is efficient, and some examples of the present invention take the domain characteristics of the dictionary into account. Taken together, these features make the emotion analysis of a corpus more accurate. Although only a few embodiments of the present invention have been described, those skilled in the art will appreciate that the present invention may be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and various modifications and substitutions may be made therein without departing from the spirit and scope of the present invention as defined by the appended claims.
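As an illustration only, the following Python sketch shows one way the PAD value assignment recited in the claims could be computed once word vectors are available. It covers the weighted-average similarity to a seed emotion word set and the three scaling variants (plain maximum similarity, thresholded, and sigmoid with α = 15 and β = 0.17). The names `SEED_PAD`, `cosine`, `seed_similarity` and `pad_value`, the use of cosine similarity, the weights inside a seed set, and the numeric PAD values are all assumptions made for this sketch and are not taken from the specification.

```python
import math

# Hypothetical standard PAD values (P, A, D) for two seed emotion words.
# The numbers are illustrative only, not the values used in the invention.
SEED_PAD = {
    "happy": (0.8, 0.5, 0.4),
    "anxious": (-0.6, 0.6, -0.4),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (an assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def seed_similarity(word_vec, seed_set):
    """Weighted average similarity between a word and one seed emotion word set.

    seed_set is a list of (member_vector, weight) pairs; the weighting scheme
    is an assumption of this sketch.
    """
    total = sum(weight for _, weight in seed_set)
    return sum(weight * cosine(word_vec, vec) for vec, weight in seed_set) / total


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pad_value(sims, mode="linear", threshold=0.0, alpha=15.0, beta=0.17):
    """Map per-seed similarities {seed: similarity} to a PAD triple.

    mode="linear":    PAD = max(SIM_i) * PAD_i
    mode="threshold": as "linear", but all zeros when max(SIM_i) < threshold
    mode="sigmoid":   PAD = sigmoid(alpha * max(SIM_i) - beta) * PAD_i
    """
    if mode not in ("linear", "threshold", "sigmoid"):
        raise ValueError(f"unknown mode: {mode}")
    best_seed = max(sims, key=sims.get)   # seed word with the maximum similarity
    best_sim = sims[best_seed]
    if mode == "threshold" and best_sim < threshold:
        return (0.0, 0.0, 0.0)
    scale = sigmoid(alpha * best_sim - beta) if mode == "sigmoid" else best_sim
    return tuple(scale * x for x in SEED_PAD[best_seed])


# Usage: a word whose similarities to the seed sets are already known.
example_sims = {"happy": 0.5, "anxious": 0.2}
print(pad_value(example_sims))
print(pad_value(example_sims, mode="threshold", threshold=0.6))
print(pad_value(example_sims, mode="sigmoid"))
```

In this sketch the three variants differ only in the scale factor applied to the standard PAD value of the best-matching seed word, so a single function with a `mode` switch is enough to compare them.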
Claims (33)
1. An emotion dictionary construction method, characterized in that the method comprises:
preprocessing the corpus data to obtain a word list after word segmentation;
performing word vector training on the preprocessed corpus data to obtain word vectors;
determining a similarity between each word in a dictionary vocabulary and each of a plurality of PAD seed emotion words, wherein the dictionary vocabulary comprises the word list after word segmentation, and the plurality of PAD seed emotion words have a corresponding plurality of standard PAD values; and
determining the PAD value of each word in the dictionary vocabulary according to the plurality of standard PAD values and the similarity, thereby forming the emotion dictionary.
2. The method of claim 1, wherein the corpus data comprises general corpus.
3. The method of claim 1, wherein the corpus data comprises domain-related corpora.
4. The method of claim 1, wherein the pre-processing comprises word segmentation, stop-word removal, and unification of simplified and traditional Chinese characters.
5. The method of claim 1, wherein the plurality of PAD seed emotion words comprise happy, bored, dependent, disdainful, relaxed, anxious, mild and hostile.
6. The method of claim 1, wherein the dictionary vocabulary further comprises words in the thesaurus whose usage frequency is above a first threshold.
7. The method of claim 1, wherein each of the plurality of PAD seed emotion words is expanded based on the word vectors and a synonym library to form a corresponding plurality of seed emotion word sets, and wherein the similarity is determined based on the plurality of seed emotion word sets.
8. The method of claim 7, wherein the seed emotion word set to which a word belongs is determined according to the similarity between its word vector and the plurality of PAD seed emotion words.
9. The method of claim 7, wherein synonyms of each of the plurality of PAD seed emotion words are obtained from the synonym library to expand that seed emotion word.
10. The method of claim 7, wherein the similarity between each word in the dictionary vocabulary and each of the plurality of PAD seed emotion words is determined according to a weighted average of the similarities between the word and the members of the corresponding one of the plurality of seed emotion word sets.
11. The method of claim 1, wherein the step of determining a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarity comprises:
determining the maximum similarity between each word in the dictionary vocabulary and the corresponding PAD seed emotion word; and
determining the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, and $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity.
12. The method of claim 1, wherein the step of determining a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarity comprises:
determining the maximum similarity between each word in the dictionary vocabulary and the corresponding PAD seed emotion word;
if the maximum similarity of the word is greater than or equal to a second threshold, determining the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, and $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity; and
if the maximum similarity of the word is smaller than the second threshold, setting all dimensions of the PAD value of the word to 0.
13. The method of claim 1, wherein the step of determining a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarity comprises:
determining the maximum similarity between each word in the dictionary vocabulary and the corresponding PAD seed emotion word; and
determining the PAD value of the word according to $PAD = \mathrm{sigmoid}(\alpha \max(SIM_i) - \beta) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity, and $\alpha$ and $\beta$ are parameters.
14. The method according to claim 13, characterized in that the parameter α is equal to 15 and the parameter β is equal to 0.17.
15. A system for constructing an emotion dictionary, the system comprising:
a preprocessing module configured to preprocess corpus data to obtain a word list after word segmentation;
a word vector training module configured to perform word vector training on the corpus data preprocessed by the preprocessing module to obtain a word vector;
a similarity determination module configured to determine a similarity between each word in a dictionary vocabulary and each of a plurality of PAD seed emotion words, wherein the dictionary vocabulary comprises the word list after word segmentation, and the plurality of PAD seed emotion words have a corresponding plurality of standard PAD values; and
an emotion dictionary generation module configured to determine a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarity, thereby forming the emotion dictionary.
16. The system according to claim 15, wherein said corpus data comprises general corpus.
17. The system according to claim 15, wherein said corpus data comprises domain-related corpora.
18. The system of claim 15, wherein the pre-processing includes word segmentation, stop-word removal, and unification of simplified and traditional Chinese characters.
19. The system of claim 15, wherein the plurality of PAD seed emotion words include happy, bored, dependent, disdainful, relaxed, anxious, mild and hostile.
20. The system of claim 15, wherein the dictionary vocabulary further comprises words in the thesaurus whose usage frequency is above a first threshold.
21. The system of claim 15, further comprising an emotion word expansion module configured to expand each of the plurality of PAD seed emotion words based on the word vectors and a synonym library to form a corresponding plurality of seed emotion word sets, and wherein the similarity determination module determines the similarity based on the plurality of seed emotion word sets.
22. The system of claim 21, wherein said emotion word expansion module determines the seed emotion word set to which a word belongs according to the similarity between its word vector and said plurality of PAD seed emotion words.
23. The system of claim 21, wherein said emotion word expansion module obtains synonyms of each of said plurality of PAD seed emotion words from said synonym library to expand that seed emotion word.
24. The system of claim 21, wherein the similarity determination module determines the similarity between each word in the dictionary vocabulary and each of the plurality of PAD seed emotion words based on a weighted average of the similarities between the word and the members of the corresponding one of the plurality of seed emotion word sets.
25. The system of claim 15, wherein the emotion dictionary generation module is configured to determine a maximum similarity of each word in the dictionary vocabulary and its corresponding PAD seed emotion word; and
determine the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, and $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity.
26. The system of claim 15, wherein the emotion dictionary generation module is configured to determine a maximum similarity of each word in the dictionary vocabulary and its corresponding PAD seed emotion word;
if the maximum similarity of the word is greater than or equal to a second threshold, determine the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, and $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity; and
if the maximum similarity of the word is smaller than the second threshold, set all dimensions of the PAD value of the word to 0.
27. The system of claim 15, wherein the emotion dictionary generation module is configured to determine a maximum similarity of each word in the dictionary vocabulary and its corresponding PAD seed emotion word; and
determine the PAD value of the word according to $PAD = \mathrm{sigmoid}(\alpha \max(SIM_i) - \beta) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity, and $\alpha$ and $\beta$ are parameters.
28. The system of claim 27, wherein the parameter α is equal to 15 and the parameter β is equal to 0.17.
29. A method for emotion recognition using an emotion dictionary constructed according to the method of any one of claims 1-14 or the system of any one of claims 15-28, the method comprising:
preprocessing the corpus to be identified;
determining evaluation words in the preprocessed corpus;
mapping the evaluation words to target words in the emotion dictionary, and determining PAD values of the target words;
determining the PAD value of the corpus according to the PAD value of the target word; and
determining the emotion type of the corpus according to the PAD value of the corpus.
30. The method of claim 29, wherein determining the PAD value of the corpus based on the PAD value of the target word comprises: taking the weighted average of the PAD values of the target words as the PAD value of the corpus.
31. A system for emotion recognition using an emotion dictionary constructed according to the method of any one of claims 1 to 14 or the system of any one of claims 15 to 28, the system comprising:
a preprocessing module configured to preprocess a corpus to be recognized;
an evaluation word determination module configured to determine an evaluation word in the corpus preprocessed by the preprocessing module;
a target word PAD value determination module configured to map the evaluation word to a target word in the emotion dictionary and determine a PAD value of the target word;
a corpus PAD value determination module configured to determine a PAD value of the corpus according to the PAD value of the target word; and
an emotion type determination module configured to determine the emotion type of the corpus according to the PAD value of the corpus.
32. The system of claim 31, wherein said corpus PAD value determination module takes a weighted average of PAD values of said target words as PAD values of said corpus.
33. A computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the method of any of claims 1-14, 29, 30.
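As a rough illustration of the recognition steps in claims 29 and 30, the Python sketch below averages the PAD values of the target words with weights and then picks an emotion type. The claims do not say how the emotion type is derived from the corpus PAD value; choosing the emotion type whose reference PAD point is nearest in Euclidean distance is purely an assumption of this sketch, as are the names `EMOTION_DICT`, `EMOTION_TYPES`, `corpus_pad` and `emotion_type` and all numeric values.

```python
import math

# Hypothetical emotion dictionary: target word -> (P, A, D). Illustrative values.
EMOTION_DICT = {
    "delighted": (0.7, 0.4, 0.3),
    "worried": (-0.5, 0.5, -0.3),
}

# Hypothetical reference PAD points for two emotion types. Illustrative values.
EMOTION_TYPES = {
    "happy": (0.8, 0.5, 0.4),
    "anxious": (-0.6, 0.6, -0.4),
}


def corpus_pad(weighted_words):
    """Weighted average of the PAD values of the target words.

    weighted_words is a list of (target_word, weight) pairs; words missing
    from the dictionary are skipped (an assumption of this sketch).
    """
    pairs = [(EMOTION_DICT[w], wt) for w, wt in weighted_words if w in EMOTION_DICT]
    total = sum(wt for _, wt in pairs)
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(sum(pad[k] * wt for pad, wt in pairs) / total for k in range(3))


def emotion_type(pad):
    """Pick the emotion type whose reference point is nearest (assumed rule)."""
    return min(EMOTION_TYPES, key=lambda name: math.dist(pad, EMOTION_TYPES[name]))


# Usage: two evaluation words already mapped to target words, with weights.
pad = corpus_pad([("delighted", 2.0), ("worried", 1.0)])
print(pad, emotion_type(pad))
```

Other decision rules, such as a trained classifier over the three PAD dimensions, would fit the same interface.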
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010073983.7A CN111538834A (en) | 2020-01-21 | 2020-01-21 | Emotion dictionary construction method and system, emotion recognition method and system and storage medium |
PCT/CN2020/107688 WO2021147298A1 (en) | 2020-01-21 | 2020-08-07 | Sentiment lexicon construction method and system, sentiment recognition method and system, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111538834A (en) | 2020-08-14 |
Family
ID=71973104
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010073983.7A Pending CN111538834A (en) | 2020-01-21 | 2020-01-21 | Emotion dictionary construction method and system, emotion recognition method and system and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111538834A (en) |
WO (1) | WO2021147298A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102163191A (en) * | 2011-05-11 | 2011-08-24 | 北京航空航天大学 | Short text emotion recognition method based on HowNet |
CN102184232A (en) * | 2011-05-11 | 2011-09-14 | 北京航空航天大学 | Chinese vocabulary emotion modeling method based on pleasure, arousal and dominance (PAD) |
CN105956095A (en) * | 2016-04-29 | 2016-09-21 | 天津大学 | Psychological pre-warning model establishment method based on fine-granularity sentiment dictionary |
CN108563635A (en) * | 2018-04-04 | 2018-09-21 | 北京理工大学 | A kind of sentiment dictionary fast construction method based on emotion wheel model |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3903993B2 (en) * | 2004-02-05 | 2007-04-11 | セイコーエプソン株式会社 | Sentiment recognition device, sentence emotion recognition method and program |
CN102663139B (en) * | 2012-05-07 | 2013-04-03 | 苏州大学 | Method and system for constructing emotional dictionary |
CN103678278A (en) * | 2013-12-16 | 2014-03-26 | 中国科学院计算机网络信息中心 | Chinese text emotion recognition method |
CN109376251A (en) * | 2018-09-25 | 2019-02-22 | 南京大学 | A kind of microblogging Chinese sentiment dictionary construction method based on term vector learning model |
Non-Patent Citations (2)
Title |
---|
徐小阳 等: "基于网络文本挖掘的投资者情绪对股票市场风险的预警研究", 江苏大学出版社, pages: 83 - 89 * |
曹海涛: "基于PAD模型的中文微博情感分析研究", 《中国优秀硕士学位论文全文数据库》, no. 08, pages 139 - 95 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114648015A (en) * | 2022-03-15 | 2022-06-21 | 北京理工大学 | Dependency relationship attention model-based aspect-level emotional word recognition method |
CN114648015B (en) * | 2022-03-15 | 2022-11-15 | 北京理工大学 | Dependency relationship attention model-based aspect-level emotional word recognition method |
Also Published As
Publication number | Publication date |
---|---|
WO2021147298A1 (en) | 2021-07-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||