CN111538834A - Emotion dictionary construction method and system, emotion recognition method and system and storage medium - Google Patents

Emotion dictionary construction method and system, emotion recognition method and system and storage medium

Info

Publication number
CN111538834A
Authority
CN
China
Prior art keywords
word
pad
emotion
similarity
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010073983.7A
Other languages
Chinese (zh)
Inventor
王阳
邱雪涛
王宇
佘萧寒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd
Priority to CN202010073983.7A
Priority to PCT/CN2020/107688 (WO2021147298A1)
Publication of CN111538834A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an emotion dictionary construction method, which comprises the following steps: preprocessing corpus data to obtain a word-segmented vocabulary; performing word vector training on the preprocessed corpus data to obtain word vectors; determining a similarity between each word in a dictionary vocabulary, which comprises the word-segmented vocabulary, and each of a plurality of PAD seed emotion words, wherein the PAD seed emotion words have a corresponding plurality of standard PAD values; and determining the PAD value of each word in the dictionary vocabulary according to the standard PAD values and the similarity, thereby forming the emotion dictionary.

Description

Emotion dictionary construction method and system, emotion recognition method and system and storage medium
Technical Field
The invention relates to a mechanism for performing corpus emotion analysis by utilizing semantics, in particular to an emotion dictionary construction method and system, an emotion recognition method and system and a computer readable storage medium.
Background
Existing emotion dictionaries are mainly single-dimensional, that is, they consist of two emotion sets: a positive emotion word list and a negative emotion word list. Dictionaries in wide use at present include the National Taiwan University Sentiment Dictionary (NTUSD), the HowNet emotion dictionary, the HL emotion dictionary, and the like.
NTUSD is a widely used emotion dictionary compiled by National Taiwan University and contains 2810 positive words and 8276 negative words. The HowNet dictionary is compiled by the HowNet researchers; its Chinese part contains 4570 positive words and 4374 negative words, and its English part contains 4360 positive words and 4574 negative words. The HL dictionary is an emotion dictionary released and maintained by Hu and Liu, built mainly by hand, and comprises 2006 positive words and 4783 negative words.
However, existing dictionaries consist mainly of positive and negative word lists and are not supported by a psychological model; they are compiled mainly through manual, subjective collection. The construction of such a dictionary may therefore be affected by human subjective factors, and the resulting dictionary is not sufficiently scientific or accurate.
Second, a single dimension cannot fully describe an emotion. A traditional single-dimensional emotion dictionary can only distinguish whether the user's emotion is positive or negative, but in human-machine dialogue it is necessary to capture not only positive and negative emotions but also fine-grained emotions such as anger and carelessness, so that a suitable placating response can be returned. A traditional single-dimensional dictionary cannot accomplish this task, and the lack of a multi-dimensional emotion dictionary makes user emotion detection on human-machine dialogue platforms difficult to carry out or poor in effect. Mainstream human-machine dialogue platforms currently on the market all fall short in this respect: platforms such as Zhuiyi have no emotion detection mechanism, and the Zhujian platform can detect only three user emotions, namely anger, dissatisfaction and satisfaction.
Furthermore, existing dictionaries are mostly built by hand. Manual construction allows the subjective factors mentioned above to interfere with dictionary construction, and it is also inefficient. In addition, domain-specific dictionaries are lacking: using a general-purpose dictionary in a human-machine dialogue robot for intelligent customer service may reduce the accuracy of the final emotion analysis. At present, both domain-specific emotion dictionaries and cross-domain dictionary construction methods are lacking.
Disclosure of Invention
The invention provides a mechanism for performing corpus emotion analysis by utilizing semantics, which can describe emotion in multiple dimensions on a psychological basis. Specifically:
According to an aspect of the present invention, there is provided an emotion dictionary construction method, including the steps of: preprocessing corpus data to obtain a word-segmented vocabulary; performing word vector training on the preprocessed corpus data to obtain word vectors; determining a similarity between each word in a dictionary vocabulary, which comprises the word-segmented vocabulary, and each of a plurality of PAD seed emotion words, wherein the PAD seed emotion words have a corresponding plurality of standard PAD values; and determining the PAD value of each word in the dictionary vocabulary according to the standard PAD values and the similarity, thereby forming the emotion dictionary.
In some embodiments of the invention, optionally, the corpus data comprises general corpus.
In some embodiments of the invention, optionally, the corpus data comprises domain-related corpora.
In some embodiments of the present invention, optionally, the preprocessing includes word segmentation, stop-word removal, and unification of Simplified and Traditional Chinese characters.
In some embodiments of the invention, optionally, the plurality of PAD seed emotion words comprise happiness, boredom, dependence, contempt, relaxation, anxiety, mildness and hostility.
In some embodiments of the present invention, optionally, the dictionary vocabulary further includes words in a synonym thesaurus whose usage frequency is above a first threshold.
In some embodiments of the present invention, optionally, each of the plurality of PAD seed emotion words is expanded based on the word vectors and the synonym thesaurus to form a corresponding plurality of seed emotion word sets, and the similarity is determined based on the plurality of seed emotion word sets.
In some embodiments of the present invention, optionally, the seed emotion word set to which a word vector belongs is determined according to the similarity between that word vector and the plurality of PAD seed emotion words.
In some embodiments of the present invention, optionally, each of the plurality of PAD seed emotion words is expanded with its synonyms obtained from the synonym thesaurus.
In some embodiments of the present invention, optionally, the similarity between each word in the dictionary vocabulary and each of the plurality of PAD seed emotion words is determined according to a weighted average of the word's similarities to the members of the corresponding seed emotion word set.
In some embodiments of the invention, optionally, the step of determining a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarity comprises: determining the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words; and determining the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, where $PAD$ is the PAD value of the word, $\max(SIM_i)$ is the maximum similarity, and $PAD_i$ is the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity.
In some embodiments of the invention, optionally, the step of determining a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarity comprises: determining the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words; if the maximum similarity of the word is greater than or equal to a second threshold, determining the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, where $PAD$ is the PAD value of the word, $\max(SIM_i)$ is the maximum similarity, and $PAD_i$ is the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity; and if the maximum similarity of the word is smaller than the second threshold, setting all dimensions of the PAD value of the word to 0.
In some embodiments of the invention, optionally, the step of determining a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarity comprises: determining the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words; and determining the PAD value of the word according to $PAD = \mathrm{sigmoid}(\alpha \max(SIM_i) - \beta) \times PAD_i$, where $PAD$ is the PAD value of the word, $\max(SIM_i)$ is the maximum similarity, $PAD_i$ is the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity, and $\alpha$ and $\beta$ are parameters.
In some embodiments of the invention, optionally, the parameter α is equal to 15 and the parameter β is equal to 0.17.
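The three scoring variants described above differ only in how the maximum similarity is turned into a scale factor before it multiplies the standard PAD value. The following Python sketch illustrates that under stated assumptions: the function names are hypothetical, the value used for the second threshold is arbitrary, and the sigmoid argument is read as α·max(SIM) − β, which is an assumption about the grouping of the formula rather than a confirmed detail.

```python
import math


def scale_direct(max_sim):
    """Variant 1: use the maximum similarity as the scale factor."""
    return max_sim


def scale_thresholded(max_sim, second_threshold):
    """Variant 2: zero the PAD value when the word is too far from every seed."""
    return max_sim if max_sim >= second_threshold else 0.0


def scale_sigmoid(max_sim, alpha=15.0, beta=0.17):
    """Variant 3: squash the similarity with a sigmoid.

    Assumption: the argument is alpha * max_sim - beta.
    """
    return 1.0 / (1.0 + math.exp(-(alpha * max_sim - beta)))


def pad_from_scale(scale, standard_pad):
    """Multiply each dimension of the seed's standard PAD value by the scale."""
    return tuple(scale * component for component in standard_pad)


# Hypothetical word whose closest seed has standard PAD value [-1, +1, +1].
hostile_pad = (-1, +1, +1)
direct = pad_from_scale(scale_direct(0.5), hostile_pad)
gated = pad_from_scale(scale_thresholded(0.1, second_threshold=0.3), hostile_pad)
squashed = pad_from_scale(scale_sigmoid(0.5), hostile_pad)
```

The thresholded variant keeps weakly related words emotionally neutral, while the sigmoid variant keeps the scale factor strictly between 0 and 1 for any similarity.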
According to another aspect of the present invention, there is provided a system for constructing an emotion dictionary, including: a preprocessing module configured to preprocess corpus data to obtain a word-segmented vocabulary; a word vector training module configured to perform word vector training on the corpus data preprocessed by the preprocessing module to obtain word vectors; a similarity determination module configured to determine a similarity between each word in a dictionary vocabulary, which comprises the word-segmented vocabulary, and each of a plurality of PAD seed emotion words, the PAD seed emotion words having a corresponding plurality of standard PAD values; and an emotion dictionary generation module configured to determine a PAD value of each word in the dictionary vocabulary according to the plurality of standard PAD values and the similarity, thereby forming the emotion dictionary.
In some embodiments of the invention, optionally, the corpus data comprises general corpus.
In some embodiments of the invention, optionally, the corpus data comprises domain-related corpora.
In some embodiments of the present invention, optionally, the preprocessing includes word segmentation, stop-word removal, and unification of Simplified and Traditional Chinese characters.
In some embodiments of the invention, optionally, the plurality of PAD seed emotion words comprise happiness, boredom, dependence, contempt, relaxation, anxiety, mildness and hostility.
In some embodiments of the present invention, optionally, the dictionary vocabulary further includes words in a synonym thesaurus whose usage frequency is above a first threshold.
In some embodiments of the present invention, optionally, the system further comprises an emotion word expansion module configured to expand each of the plurality of PAD seed emotion words based on the word vectors and the synonym thesaurus to form a corresponding plurality of seed emotion word sets, and the similarity determination module determines the similarity based on the plurality of seed emotion word sets.
In some embodiments of the present invention, optionally, the emotion word expansion module determines the seed emotion word set to which a word vector belongs according to the similarity between that word vector and the plurality of PAD seed emotion words.
In some embodiments of the present invention, optionally, the emotion word expansion module expands each of the plurality of PAD seed emotion words with its synonyms obtained from the synonym thesaurus.
In some embodiments of the present invention, optionally, the similarity determination module determines the similarity of each word in the dictionary vocabulary to each of the plurality of PAD seed emotion words according to a weighted average of the word's similarities to the members of the corresponding seed emotion word set.
In some embodiments of the present invention, optionally, the emotion dictionary generation module is configured to determine the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words, and to determine the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, where $PAD$ is the PAD value of the word, $\max(SIM_i)$ is the maximum similarity, and $PAD_i$ is the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity.
In some embodiments of the present invention, optionally, the emotion dictionary generation module is configured to determine the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words; if the maximum similarity of the word is greater than or equal to a second threshold, to determine the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, where $PAD$ is the PAD value of the word, $\max(SIM_i)$ is the maximum similarity, and $PAD_i$ is the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity; and if the maximum similarity of the word is smaller than the second threshold, to set all dimensions of the PAD value of the word to 0.
In some embodiments of the invention, optionally, the emotion dictionary generation module is configured to determine the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words, and to determine the PAD value of the word according to $PAD = \mathrm{sigmoid}(\alpha \max(SIM_i) - \beta) \times PAD_i$, where $PAD$ is the PAD value of the word, $\max(SIM_i)$ is the maximum similarity, $PAD_i$ is the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity, and $\alpha$ and $\beta$ are parameters.
In some embodiments of the invention, optionally, the parameter α is equal to 15 and the parameter β is equal to 0.17.
According to another aspect of the present invention, there is provided a method for emotion recognition using an emotion dictionary constructed according to any one of the emotion dictionary construction methods described above or the emotion dictionary construction system described above, the method including the steps of: preprocessing the corpus to be identified; determining evaluation words in the preprocessed corpus; mapping the evaluation words to target words in the emotion dictionary, and determining PAD values of the target words; determining the PAD value of the corpus according to the PAD value of the target word; and determining the emotion type of the corpus according to the PAD value of the corpus.
In some embodiments of the present invention, optionally, the step of determining the PAD value of the corpus according to the PAD value of the target word includes: and taking the weighted average value of the PAD values of the target words as the PAD value of the corpus.
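As a rough illustration of the last two recognition steps, the following Python sketch averages the PAD values of the target words and then picks an emotion type. The text does not state how the emotion type is derived from the corpus PAD value, so the sign-pattern lookup here is purely an assumption of this sketch, as are the function names, the English labels and the example weights.

```python
# Hypothetical labels: each sign pattern of (P, A, D) names one emotion type.
OCTANT_LABELS = {
    (+1, +1, +1): "happiness",
    (-1, -1, -1): "boredom",
    (+1, +1, -1): "dependence",
    (-1, -1, +1): "contempt",
    (+1, -1, +1): "relaxation",
    (-1, +1, -1): "anxiety",
    (+1, -1, -1): "mildness",
    (-1, +1, +1): "hostility",
}


def corpus_pad(target_pads, weights=None):
    """Weighted average of the PAD triples of the target words.

    Weights are assumed to be positive; equal weights are used by default.
    """
    if not target_pads:
        return (0.0, 0.0, 0.0)
    if weights is None:
        weights = [1.0] * len(target_pads)
    total = sum(weights)
    return tuple(
        sum(w * pad[d] for w, pad in zip(weights, target_pads)) / total
        for d in range(3)
    )


def emotion_type(pad):
    """Assumed decision rule: look up the sign pattern of the corpus PAD value."""
    signs = tuple(1 if x >= 0 else -1 for x in pad)
    return OCTANT_LABELS[signs]


# Two hypothetical target words; the second is weighted three times as heavily.
pad = corpus_pad([(1.0, 1.0, 1.0), (-1.0, 1.0, -1.0)], weights=[1.0, 3.0])
label = emotion_type(pad)
```

Other decision rules, such as choosing the seed whose standard PAD value is nearest in Euclidean distance, would fit the same interface.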
According to another aspect of the present invention, there is provided a system for emotion recognition using an emotion dictionary constructed according to any one of the emotion dictionary construction methods described above or according to any one of the emotion dictionary construction systems described above, including: a preprocessing module configured to preprocess a corpus to be recognized; an evaluation word determination module configured to determine an evaluation word in the corpus preprocessed by the preprocessing module; a target word PAD value determination module configured to map the evaluation word to a target word in the emotion dictionary and determine a PAD value of the target word; a corpus PAD value determining module configured to determine a PAD value of the corpus according to the PAD value of the target word; and the emotion type determining module is configured to determine the emotion type of the corpus according to the PAD value of the corpus.
In some embodiments of the present invention, optionally, the corpus PAD value determination module uses a weighted average of PAD values of the target words as the PAD value of the corpus.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform any of the methods described above.
Drawings
The above and other objects and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings, in which like or similar elements are designated by like reference numerals.
FIG. 1 shows an emotion dictionary construction method according to an embodiment of the present invention.
FIG. 2 shows a system for constructing an emotion dictionary according to an embodiment of the present invention.
FIG. 3 illustrates a method for emotion recognition according to an embodiment of the present invention.
FIG. 4 illustrates a system for emotion recognition according to one embodiment of the present invention.
FIG. 5 illustrates an emotion dictionary construction and application method according to an embodiment of the present invention.
Detailed Description
For the purposes of brevity and explanation, the principles of the present invention are described herein with reference primarily to exemplary embodiments thereof. However, those skilled in the art will readily recognize that the same principles are equally applicable to all types of emotion dictionary construction methods and systems, emotion recognition methods and systems, and computer-readable storage media, and that these same or similar principles may be implemented therein, with any such variations not departing from the true spirit and scope of the present patent application.
To address one or more of the following problems, namely the lack of theoretical support, the inability of single-dimensional dictionaries to meet the needs of emotion analysis in human-machine dialogue, the low efficiency of manual construction, and the scarcity of domain-specific emotion dictionaries, the invention provides a mechanism for constructing a multi-dimensional emotion dictionary based on the PAD model.
In some aspects of the invention, construction of the emotion dictionary relies on a widely used psychological model that is broadly accepted by both psychologists and artificial intelligence researchers. Almost no manual participation is needed during construction, so the final emotion dictionary is more accurate. In particular, some aspects of the invention introduce the PAD psychological model into the construction of the emotion dictionary. The PAD model is a psychological model that describes human emotional states along three dimensions, P, A and D. P is pleasure (pleasure-displeasure), which represents the positive or negative character of an individual's emotional state; A is arousal (arousal-nonarousal), which represents the individual's level of neurophysiological activation; D is dominance (dominance-submissiveness), which represents the individual's state of control over the situation and over others. The PAD model provides solid theoretical support for constructing the emotion dictionary.
In some aspects of the invention, each word carries a plurality of dimensions. The model describes emotion in three dimensions, which explains emotion more precisely and meets the need for a multi-dimensional emotion dictionary. As a result, the words in the dictionary describe emotion more accurately and at a finer grain, which helps probe the user's state of mind accurately during human-machine dialogue. In some aspects of the invention, words are described along the three dimensions of the PAD model, namely pleasure, arousal and dominance, to construct a multi-dimensional emotion dictionary. For example, "anger" and "fear" are both low in pleasure and high in arousal, but they are opposite in dominance: "anger" is high in dominance, whereas "fear" is low. The three-dimensional PAD model thus describes human emotion completely and in fine detail, meeting the high demands that human-machine dialogue places on emotion description.
In some aspects of the invention, the dictionary is constructed automatically by seed word expansion. Constructing and expanding seed words reduces the interference of human subjectivity in the construction process and markedly improves construction efficiency. First, the basic seed words are derived from the basic definition of the PAD model and comprise 8 basic emotions. They are then expanded, using the semantic space and a synonym table, to the order of hundreds of seed words. Finally, an emotion dictionary with PAD dimensions is constructed automatically from the semantic space. The whole process requires essentially no manual participation, which greatly improves construction efficiency.
In some aspects of the invention, domain information is added during construction so that a domain-specific emotion dictionary is obtained. Domain-related words can be imported while the dictionary is built, which shapes the final dictionary and allows an emotion dictionary for a specific domain to be customized flexibly. Some aspects of the invention incorporate domain information in several ways: training on domain-related corpora to obtain the word vector file; adding domain-related words when the seed words are expanded; and using the segmentation result of the domain-related corpus as a component of the vocabulary. Domain information is thus added throughout the construction process, and a domain-related emotion dictionary is finally obtained. Alternatively, only general information may be used, so that the final emotion dictionary is applicable to general domains.
FIG. 1 shows an emotion dictionary construction method according to an embodiment of the present invention. As shown, the method includes the following steps. In step S102, the corpus data is preprocessed to obtain a word-segmented vocabulary, and word vector training is performed on the preprocessed corpus data to obtain word vectors.
The corpus data may be a general-purpose Chinese corpus, for example the Sogou internet corpus. Since such corpus data is derived from the general domain (and is thus referred to as a general corpus), the emotion dictionary formed from it can be applied to the general domain. In addition, the corpus data may also be the data that needs to be analyzed, such as domain-specific user feedback data, online customer service dialogue data, and the like. Such corpus data belongs to the (specific) domain-related corpora, and the emotion dictionary formed from it can be applied to that specific domain. Moreover, when (specific) domain-related corpora are used, introducing a general-domain corpus as well can improve the training of the word vectors. The following table illustrates domain-related corpora according to an aspect of the present invention:
[Table of domain-related corpora: provided as images in the original publication (BDA0002377192600000081, BDA0002377192600000091)]
Preprocessing may include, for example, one or more of data cleaning, word segmentation, stop-word removal, and unification of Simplified and Traditional Chinese characters, provided that clean, processed corpus data is obtained for further analysis. Word segmentation may be implemented with the jieba Chinese word segmentation toolkit; in other examples of the present invention, other natural language processing word segmentation tools, such as ICTCLAS, may be used.
In step S102, word vector training is also performed on the preprocessed corpus data to obtain word vectors; this may alternatively be implemented as a separate step. In one example of the invention, the preprocessed corpus data may be trained using a word vector tool such as the Gensim toolkit. For example, the vector dimension is set to 200 and the sliding window size to 5, the remaining parameters may use their default configurations, and the model may be CBOW or Skip-Gram. The result is a set of general or domain-specific word vectors (depending on the source of the corpus data), which provide the semantic relations among words for the subsequent construction of the emotion dictionary.
In step S104, a similarity (SIM value) is determined between each word in the dictionary vocabulary, which includes the word-segmented vocabulary, and each of a plurality of PAD seed emotion words having a corresponding plurality of standard PAD values. For example, each PAD seed emotion word may be represented by three quantitative values, P, A and D, which together form its standard PAD value. In some examples of the invention, cosine similarity in semantic space may be used to measure the similarity of each word in the dictionary vocabulary to each of the plurality of PAD seed emotion words. In the present invention, the dictionary vocabulary includes the word-segmented vocabulary obtained in step S102. In other words, if the corpus data is from the general domain, the dictionary vocabulary will have the characteristics of the general domain; if the corpus data is derived from both the general domain and a specific domain, the dictionary vocabulary will have the characteristics of both.
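The cosine similarity used in step S104 can be sketched in a few lines of plain Python. This is a minimal illustration, not the patented implementation: the function names, the two English seed labels and the toy 3-dimensional vectors are all hypothetical, and real vectors would come from the word vector training of step S102.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_u * norm_v)


def similarity_to_seeds(word_vec, seed_vecs):
    """Map each seed emotion word to its cosine similarity with word_vec."""
    return {
        seed: cosine_similarity(word_vec, vec) for seed, vec in seed_vecs.items()
    }


# Hypothetical toy vectors for two seed emotion words.
seed_vecs = {"happiness": [1.0, 0.0, 0.0], "anxiety": [0.0, 1.0, 0.0]}
sims = similarity_to_seeds([2.0, 0.0, 0.0], seed_vecs)
```

Because cosine similarity ignores vector length, the example word scores 1.0 against "happiness" even though its vector is twice as long.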
In step S106, a PAD value of each word in the dictionary vocabulary is determined according to the plurality of standard PAD values and the similarity, and an emotion dictionary is formed. Thus, each word in the emotion dictionary formed has a PAD value that is related to the standard PAD value and the similarity to the seed emotion word. The PAD value of each word may thus reflect the emotional properties behind the word.
In some embodiments of the present invention, the plurality of PAD seed emotion words may include happiness, boredom, dependence, contempt, relaxation, anxiety, mildness and hostility, which are also 8 emotions defined by the Chinese version of the PAD emotion scale. Equivalently, when each of P, A and D in a standard PAD value takes either a positive or a negative value (that is, +P or -P, +A or -A, +D or -D), a total of 8 combinations is obtained. The following table shows one possible set of standard PAD values for these PAD seed emotion words.
Emotion       PAD representation    Standard PAD value
Happiness     +P+A+D                [+1,+1,+1]
Boredom       -P-A-D                [-1,-1,-1]
Dependence    +P+A-D                [+1,+1,-1]
Contempt      -P-A+D                [-1,-1,+1]
Relaxation    +P-A+D                [+1,-1,+1]
Anxiety       -P+A-D                [-1,+1,-1]
Mildness      +P-A-D                [+1,-1,-1]
Hostility     -P+A+D                [-1,+1,+1]
The 8 seed emotions in the table above cover all sign combinations of the P, A and D values. Consequently, following the description of step S106 above, each dimension of the PAD value of a word in the dictionary vocabulary may, for example, lie within the interval [-1, +1].
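To make step S106 concrete, the sketch below scales the standard PAD value of the most similar seed by the maximum similarity, PAD = max(SIM_i) × PAD_i. The English keys are this sketch's own labels for the eight rows of the table, and the function name and similarity scores are hypothetical.

```python
# Standard PAD values for the eight seed emotions (one row of the table each).
STANDARD_PAD = {
    "happiness": (+1, +1, +1),
    "boredom": (-1, -1, -1),
    "dependence": (+1, +1, -1),
    "contempt": (-1, -1, +1),
    "relaxation": (+1, -1, +1),
    "anxiety": (-1, +1, -1),
    "mildness": (+1, -1, -1),
    "hostility": (-1, +1, +1),
}


def pad_value(sims):
    """Return max(SIM_i) x PAD_i for the seed with the highest similarity."""
    best_seed = max(sims, key=sims.get)
    best_sim = sims[best_seed]
    return tuple(best_sim * component for component in STANDARD_PAD[best_seed])


# Hypothetical similarities of one word to three of the seeds.
example = pad_value({"happiness": 0.2, "anxiety": 0.75, "hostility": 0.5})
```

With similarities in [0, 1] and standard values of ±1, every dimension of the result stays inside [-1, +1], matching the interval stated above.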
In some embodiments of the invention, the dictionary vocabulary further includes words in a synonym thesaurus whose usage frequency is above a first threshold. The dictionary vocabulary thus comprises the word segmentation results together with the thesaurus words whose frequency exceeds the first threshold. For example, in some examples the Synonyms general-purpose thesaurus may be used as a vocabulary source, and words with a frequency above 100 are selected as members of the dictionary vocabulary; the Synonyms thesaurus contains a total of 56633 words with a frequency above 100. Of course, in other examples of the invention other lexicons may be selected as the vocabulary source, and words whose frequency is above some other threshold may be selected as members of the dictionary vocabulary.
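Assembling the dictionary vocabulary amounts to a set union with a frequency filter. The following sketch assumes the thesaurus is available as a word-to-frequency mapping; the function name, the mapping format and the example words are hypothetical, and only the threshold of 100 comes from the text.

```python
def build_dictionary_vocabulary(segmented_vocab, thesaurus_freq, first_threshold=100):
    """Union of the segmented vocabulary and the frequent thesaurus words.

    thesaurus_freq maps each thesaurus word to its usage frequency; only words
    whose frequency is strictly above first_threshold are kept.
    """
    frequent = {
        word for word, freq in thesaurus_freq.items() if freq > first_threshold
    }
    return set(segmented_vocab) | frequent


# Hypothetical inputs: words from a segmented corpus and thesaurus frequencies.
vocabulary = build_dictionary_vocabulary(
    {"refund", "card"},
    {"happy": 250, "card": 90, "rare": 3},
)
```

A word such as "card" stays in the vocabulary through the segmentation result even though its thesaurus frequency is below the threshold.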
In some embodiments of the present invention, each of the plurality of PAD seed emotion words may be expanded based on the word vectors and a synonym thesaurus to form a corresponding plurality of seed emotion word sets, and the similarity is determined based on the plurality of seed emotion word sets. For example, a PAD seed emotion word may be expanded using the word vectors to obtain one part of the corresponding seed emotion word set, and expanded using the synonym thesaurus to obtain another part. The seed emotion word set is composed of these two parts; that is, each PAD seed emotion word can be expanded into its corresponding seed emotion word set based on the word vectors and the synonym thesaurus.
In some embodiments of the invention, the seed emotion word set to which a word belongs can be determined according to the similarity between its word vector and the plurality of PAD seed emotion words. Specifically, for example, words close to a seed emotion word in semantic space may first be obtained using cosine distance to form the set $s_i^{cosine}$, where:
Figure BDA0002377192600000111
$w_i$ denotes the word vector of the i-th PAD seed emotion word, where the range of i depends on the number of seed emotion words (for example, i may be 1 to 8), and $w$ denotes the word vector to be classified. The above formula identifies the PAD seed emotion word to which the word vector is closest in cosine terms, and the word is then assigned to the seed emotion word set corresponding to that PAD seed emotion word.
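The assignment step can be illustrated with a minimal sketch. The formula itself is only available as an image, so the code below assumes the plain reading of the text: a word goes to the seed whose vector has the highest cosine similarity. The helper names and the two-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_seed(word_vec, seed_vecs):
    """Return the seed word whose vector is most similar to word_vec."""
    return max(seed_vecs, key=lambda seed: cosine(word_vec, seed_vecs[seed]))


# Toy 2-dimensional vectors; real word vectors would come from the trained model.
seed_vecs = {"happy": [1.0, 0.0], "anxious": [0.0, 1.0]}
assigned = nearest_seed([0.2, 0.9], seed_vecs)
print(assigned)
```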
In some embodiments of the present invention, the synonyms of each of the plurality of PAD seed emotion words may be obtained from a synonym library in order to expand it:
$$s_i^{synonyms} = \mathrm{synonyms}(w_i)$$
where $w_i$ denotes a PAD seed emotion word, and the function synonyms looks up the synonyms of $w_i$ in the synonym library and forms the corresponding set $s_i^{synonyms}$.
The final expanded seed set is the union of the two sets together with the seed word itself, giving the seed emotion word set $s_i^{seed}$:
$$s_i^{seed} = \{w_i\} \cup s_i^{cosine} \cup s_i^{synonyms}$$
Here i ranges from 1 to 8 to represent the 8 seed emotions, and $w_i$ is a PAD seed emotion word. In some examples, each expansion mode may contribute about 50 words for a PAD seed emotion word, i.e., about 100 words per seed word in total. The following table shows the seed emotion word sets obtained by expanding the 8 PAD seed emotion words, 8 × (50 + 50) = 800 words in all.
Figure BDA0002377192600000121
Figure BDA0002377192600000131
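The union that forms each seed emotion word set can be sketched as below. The neighbour and synonym lookups are passed in as plain dictionaries purely for illustration; in practice they would come from the trained word vectors and from a synonym library, and the function name `expand_seed` is hypothetical. The limit of 50 per expansion mode follows the example above.

```python
def expand_seed(seed, cosine_neighbours, synonym_lookup, limit=50):
    """Build the seed set as {seed} ∪ s_cosine ∪ s_synonyms, keeping at most
    `limit` words from each expansion mode."""
    s_cosine = set(cosine_neighbours.get(seed, [])[:limit])
    s_synonyms = set(synonym_lookup.get(seed, [])[:limit])
    return {seed} | s_cosine | s_synonyms


# Toy lookups (illustrative words only).
seed_set = expand_seed(
    "anxious",
    {"anxious": ["worried", "nervous"]},
    {"anxious": ["uneasy", "worried"]},
)
print(sorted(seed_set))
```

Because the two expansion modes can return overlapping words, the resulting set may hold slightly fewer than 100 members per seed word.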
In some embodiments of the present invention, the similarity (SIM value) between each word in the dictionary vocabulary and each of the plurality of PAD seed emotion words may be determined according to a weighted average of its similarities with the members of the corresponding seed emotion word set. The SIM value of a word w in the vocabulary is thus its similarity measure with respect to each of the 8 seed emotions. This measure can be obtained by computing, in semantic space, the cosine similarity between w and each expansion word (member of the seed emotion word set) of a seed emotion and taking the weighted average:
Figure BDA0002377192600000132
where i ranges from 1 to 8 to represent the 8 seed emotions, k is the number of expansion words (the number of members of the set), and j ranges from 1 to k.
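A minimal sketch of the SIM computation follows. The exact formula is only available as an image, so the weights are an assumption: the code defaults to equal weights (a plain mean over the k members) and accepts explicit weights as an option. All names and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sim(word_vec, member_vecs, weights=None):
    """Weighted average cosine similarity between a word and the k members
    of one seed emotion word set; equal weights are assumed by default."""
    if weights is None:
        weights = [1.0] * len(member_vecs)
    total = sum(wt * cosine(word_vec, m) for wt, m in zip(weights, member_vecs))
    return total / sum(weights)


# Two toy members: one identical in direction to the word, one orthogonal to it.
members = [[1.0, 0.0], [0.0, 1.0]]
value = sim([1.0, 0.0], members)
weighted_value = sim([1.0, 0.0], members, weights=[3.0, 1.0])
print(value, weighted_value)
```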
Taking the word "fear" as an example, its similarity with each of the 8 seed emotion word sets is calculated; the larger the value, the more similar the word is to that seed emotion. It should be noted that the following values are merely illustrative and may differ from the actual situation.
PAD seed emotion word    SIM value
Happy                    0.1
Boring                   0.2
Dependent                0.1
Slight                   0.4
Relaxed                  0.2
Anxious                  0.8
Mild                     0.2
Hostile                  0.7
As can be seen from the above table, the word "fear" is more similar to the "hostile" or "anxious" emotions.
In some embodiments of the present invention, the step of determining a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarities described in the above embodiments may comprise: determining the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words; and determining the PAD value of the word according to $\mathrm{PAD} = \max(\mathrm{SIM}_i) \times \mathrm{PAD}_i$, where PAD denotes the PAD value of the word, $\max(\mathrm{SIM}_i)$ denotes the maximum similarity, and $\mathrm{PAD}_i$ denotes the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity. Continuing the above example, the maximum SIM value is obtained for the "anxious" emotion, so according to the table given previously the PAD value of the word "fear" is 0.8 × [-1, +1, -1] = [-0.8, +0.8, -0.8].
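The "fear" example can be reproduced with a short sketch. The standard PAD values for anxious, hostile and mild and the SIM values are the ones quoted in the text; the function name `pad_from_sims` and the dictionary layout are hypothetical.

```python
# Standard PAD values as quoted in the text for three of the eight seed emotions.
STANDARD_PAD = {
    "anxious": (-1, +1, -1),
    "hostile": (-1, +1, +1),
    "mild": (+1, -1, -1),
}


def pad_from_sims(sims, standard_pad):
    """PAD = max(SIM_i) × PAD_i: scale the standard PAD value of the most
    similar seed emotion by that maximum similarity."""
    best = max(sims, key=sims.get)
    return tuple(sims[best] * x for x in standard_pad[best])


# Illustrative SIM values for the word "fear" (subset of the table above).
sims = {"anxious": 0.8, "hostile": 0.7, "mild": 0.2}
pad = pad_from_sims(sims, STANDARD_PAD)
print(pad)
```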
In some embodiments of the present invention, some adjustment principles may also be considered in calculating the PAD value, and the step of determining the PAD value of each word in the dictionary vocabulary according to the plurality of standard PAD values and the similarity described in the above embodiments may include the following. First, the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words is determined. Next, it is determined whether the maximum similarity of the word is greater than or equal to a second threshold. If so, the PAD value of the word is determined according to $\mathrm{PAD} = \max(\mathrm{SIM}_i) \times \mathrm{PAD}_i$, where PAD denotes the PAD value of the word, $\max(\mathrm{SIM}_i)$ denotes the maximum similarity, and $\mathrm{PAD}_i$ denotes the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity. If the maximum similarity of the word is less than the second threshold, every dimension of the PAD value of the word is 0. When even the maximum similarity of a word is very small, the word is probably not strongly related to any of the PAD seed emotion words described above, so there is no need to force an association with a particular PAD seed emotion word; instead, each dimension of its PAD value may be set to 0 in some examples.
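The thresholded variant can be sketched as follows. The text does not give a value for the second threshold, so the 0.3 used here is purely hypothetical, as are the function name and the toy SIM values.

```python
STANDARD_PAD = {"anxious": (-1, +1, -1), "hostile": (-1, +1, +1)}


def pad_with_threshold(sims, standard_pad, second_threshold):
    """Return max(SIM_i) × PAD_i if the maximum similarity reaches the second
    threshold, otherwise a PAD value that is 0 in every dimension."""
    best = max(sims, key=sims.get)
    if sims[best] < second_threshold:
        return (0.0, 0.0, 0.0)
    return tuple(sims[best] * x for x in standard_pad[best])


# second_threshold=0.3 is a hypothetical value chosen only for the demonstration.
strong = pad_with_threshold({"anxious": 0.8, "hostile": 0.7}, STANDARD_PAD, 0.3)
weak = pad_with_threshold({"anxious": 0.1, "hostile": 0.05}, STANDARD_PAD, 0.3)
print(strong, weak)
```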
In some embodiments of the present invention, the step of determining the PAD value of each word in the dictionary vocabulary according to the plurality of standard PAD values and similarities described in the above embodiments may include: first, determining the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words; and second, determining the PAD value of the word using the sigmoid function according to $\mathrm{PAD} = \mathrm{sigmoid}(\alpha \max(\mathrm{SIM}_i) - \beta) \times \mathrm{PAD}_i$, where PAD denotes the PAD value of the word, $\max(\mathrm{SIM}_i)$ denotes the maximum similarity, $\mathrm{PAD}_i$ denotes the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity, and α and β are parameters. This approach is more fine-grained than the previous embodiment, and the resulting PAD value better reflects the emotional attributes of the corpus data. The parameter β acts as a threshold, so that a value of $\max(\mathrm{SIM}_i)$ that is too low also drives the PAD value toward 0; in addition, α controls the amplitude. Continuing the above example, for the word "fear", the PAD value is $\mathrm{sigmoid}(\alpha \max(\mathrm{SIM}_i) - \beta) \times [-1, +1, -1]$.
In some embodiments of the present invention, the parameter α is equal to 15 and the parameter β is equal to 0.17; both are optimal values obtained experimentally.
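A sketch of the sigmoid variant with α = 15 and β = 0.17 is given below. One caveat: the grouping of α and β inside the sigmoid is read literally from the text as sigmoid(α · max(SIM) − β); if the intended form is instead sigmoid(α · (max(SIM) − β)), only the line computing `scale` would change. The function names are hypothetical.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def pad_sigmoid(max_sim, seed_pad, alpha=15.0, beta=0.17):
    """PAD = sigmoid(alpha * max_sim - beta) × seed_pad.
    The grouping of alpha and beta is an assumption (literal reading of the text)."""
    scale = sigmoid(alpha * max_sim - beta)
    return tuple(scale * x for x in seed_pad)


# "fear": maximum SIM of 0.8 with the "anxious" seed, standard PAD [-1, +1, -1].
pad = pad_sigmoid(0.8, (-1, +1, -1))
print(pad)
```

Under this reading, a large maximum similarity gives a scale close to 1, so the word inherits nearly the full standard PAD value of its closest seed emotion.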
FIG. 2 shows a system for constructing an emotion dictionary according to an embodiment of the present invention. As shown, system 20 includes a preprocessing module 22, a word vector training module 24, a similarity determination module 26, and an emotion dictionary generation module 28.
The preprocessing module 22 is configured to preprocess the corpus data to obtain a word list after word segmentation. The corpus data may be a general-purpose Chinese corpus, for example the Sogou Internet corpus. Since such corpus data is derived from the general domain (and is thus referred to as a general corpus), the emotion dictionary formed from it can be applied to the general domain. In addition, the corpus data may also be the data that needs to be analyzed, such as domain-specific user feedback data, online customer service dialogue data, and the like. Such corpus data belongs to (specific) domain-related corpora, and the emotion dictionary formed from it can therefore be applied to the specific domain. Furthermore, when a (specific) domain-related corpus is used, introducing a general-domain corpus as well can improve the training of the word vectors.
Preprocessing may include, for example, one or more of data cleansing, word segmentation, stop-word removal, unification of simplified and traditional Chinese characters, etc., the aim being to obtain clean, processed corpus data for further analysis. Word segmentation during preprocessing may be implemented with the jieba Chinese word segmentation toolkit; in other examples of the present invention, other natural language processing word segmentation tools, such as ICTCLAS, may be used.
Word vector training module 24 is configured to perform word vector training on the corpus data preprocessed by preprocessing module 22 to obtain word vectors. In one example of the invention, the preprocessed corpus data may be trained using a word vector toolkit such as Gensim. The parameters may use the default configuration, and the model may be either CBOW or Skip-Gram. The result is a set of general or domain-specific word vectors (depending on the corpus data source), which provides semantic relation information among words for the subsequent construction of the emotion dictionary.
The similarity determination module 26 is configured to determine the similarity (SIM value) between each word in a dictionary vocabulary, which includes the word-segmented vocabulary, and each of a plurality of PAD seed emotion words having a corresponding plurality of standard PAD values. For example, each PAD seed emotion word may be represented by three quantitative values P, A and D, which together form its standard PAD value. In some examples of the invention, cosine similarity in semantic space may be used to measure the similarity between each word in the dictionary vocabulary and each of the plurality of PAD seed emotion words. In the present invention, the dictionary vocabulary includes the vocabulary obtained after word segmentation. In other words, if the corpus data comes from the general domain, the dictionary vocabulary will have the characteristics of the general domain; if the corpus data is derived from both the general domain and a specific domain, the dictionary vocabulary will have both general-domain and specific-domain characteristics.
The emotion dictionary generation module 28 is configured to determine a PAD value for each word in the dictionary vocabulary according to the plurality of standard PAD values and the similarity, thereby forming the emotion dictionary. Each word in the resulting emotion dictionary thus has a PAD value that depends on the standard PAD values and on its similarity to the seed emotion words, and this PAD value may reflect the emotional attributes behind the word.
In some embodiments of the present invention, the plurality of PAD seed emotion words may include happy, boring, dependent, slight, relaxed, anxious, mild and hostile, which are the 8 emotions defined by the Chinese version of the PAD emotion scale. Correspondingly, when each of P, A and D in a standard PAD value takes either a positive or a negative value (i.e., +P or -P, +A or -A, +D or -D), a total of 8 combinations is obtained. The following table shows one possible set of standard PAD values for the PAD seed emotion words happy, boring, dependent, slight, relaxed, anxious, mild and hostile.
Figure BDA0002377192600000161
Figure BDA0002377192600000171
The table above covers all 8 sign combinations of the P, A and D values, one for each seed emotion. Accordingly, each dimension of the PAD value of a word in the dictionary vocabulary may, for example, lie within the interval [-1, +1].
In some embodiments of the invention, the dictionary vocabulary further includes words in a synonym library whose usage frequency is above a first threshold. The dictionary vocabulary then comprises both the word segmentation results and the synonym-library words whose frequency exceeds the first threshold. For example, in some examples, the Synonyms general-purpose thesaurus may be used as a vocabulary source, with words whose frequency is above 100 selected as members of the dictionary vocabulary; the Synonyms thesaurus contains a total of 56633 words with a frequency above 100. Of course, other lexicons may serve as the vocabulary source in other examples of the invention, with words whose frequency is above a certain threshold likewise selected as members of the dictionary vocabulary.
In some embodiments of the present invention, system 20 further comprises an emotion word expansion module (not shown) configured to expand each of the plurality of PAD seed emotion words based on the word vectors and a synonym library to form a corresponding plurality of seed emotion word sets, and the similarity determination module determines the similarity based on the plurality of seed emotion word sets. For example, a PAD seed emotion word may be expanded using the word vectors to obtain one part of the corresponding seed emotion word set. As another example, a PAD seed emotion word may be expanded using the synonym library to obtain another part of the corresponding seed emotion word set. The seed emotion word set is then composed of these two parts; that is, each PAD seed emotion word is expanded into its corresponding seed emotion word set based on both the word vectors and the synonym library.
In some embodiments of the invention, the emotion word expansion module may determine the seed emotion word set to which a word belongs according to the similarity between its word vector and the plurality of PAD seed emotion words. Specifically, for example, words close to a seed emotion word in semantic space may first be obtained using cosine distance to form the set $s_i^{cosine}$, where:
Figure BDA0002377192600000172
$w_i$ denotes the word vector of the i-th PAD seed emotion word, where the range of i depends on the number of seed emotion words (for example, i may be 1 to 8), and $w$ denotes the word vector to be classified. The above formula identifies the PAD seed emotion word to which the word vector is closest in cosine terms, and the word is then assigned to the seed emotion word set corresponding to that PAD seed emotion word.
In some embodiments of the present invention, the emotion word expansion module may obtain the synonyms of each of the plurality of PAD seed emotion words from a synonym library in order to expand it:
$$s_i^{synonyms} = \mathrm{synonyms}(w_i)$$
where $w_i$ denotes a PAD seed emotion word, and the function synonyms looks up the synonyms of $w_i$ in the synonym library and forms the corresponding set $s_i^{synonyms}$.
The final expanded seed set is the union of the two sets together with the seed word itself, giving the seed emotion word set $s_i^{seed}$:
$$s_i^{seed} = \{w_i\} \cup s_i^{cosine} \cup s_i^{synonyms}$$
Here i ranges from 1 to 8 to represent the 8 seed emotions, and $w_i$ is a PAD seed emotion word. In some examples, each expansion mode may contribute about 50 words for a PAD seed emotion word, i.e., about 100 words per seed word in total. The following table shows the seed emotion word sets obtained by expanding the 8 PAD seed emotion words, 8 × (50 + 50) = 800 words in all.
Figure BDA0002377192600000181
Figure BDA0002377192600000191
In some embodiments of the present invention, similarity determination module 26 may determine the similarity (SIM value) between each word in the dictionary vocabulary and each of the plurality of PAD seed emotion words according to a weighted average of its similarities with the members of the corresponding seed emotion word set. The SIM value of a word w in the vocabulary is thus its similarity measure with respect to each of the 8 seed emotions. This measure can be obtained by computing, in semantic space, the cosine similarity between w and each expansion word (member of the seed emotion word set) of a seed emotion and taking the weighted average:
Figure BDA0002377192600000192
where i ranges from 1 to 8 to represent the 8 seed emotions, and k is the number of expansion words.
Taking the word "fear" as an example, its similarity with each of the 8 seed emotion word sets is calculated; the larger the value, the more similar the word is to that seed emotion. It should be noted that the following values are merely illustrative and may differ from the actual situation.
Figure BDA0002377192600000193
Figure BDA0002377192600000201
As can be seen from the above table, the word "fear" is more similar to the "hostile" or "anxious" emotions.
In some embodiments of the present invention, emotion dictionary generation module 28 may be configured to determine the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words, and to determine the PAD value of the word according to $\mathrm{PAD} = \max(\mathrm{SIM}_i) \times \mathrm{PAD}_i$, where PAD denotes the PAD value of the word, $\max(\mathrm{SIM}_i)$ denotes the maximum similarity, and $\mathrm{PAD}_i$ denotes the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity. Continuing the above example, the maximum SIM value is obtained for the "anxious" emotion, so according to the table given previously the PAD value of the word "fear" is 0.8 × [-1, +1, -1] = [-0.8, +0.8, -0.8].
In some embodiments of the present invention, some adjustment principles may also be considered in calculating the PAD value, and the emotion dictionary generation module 28 may be configured to determine the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words; if the maximum similarity of the word is greater than or equal to a second threshold, to determine the PAD value of the word according to $\mathrm{PAD} = \max(\mathrm{SIM}_i) \times \mathrm{PAD}_i$, where PAD denotes the PAD value of the word, $\max(\mathrm{SIM}_i)$ denotes the maximum similarity, and $\mathrm{PAD}_i$ denotes the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity; and if the maximum similarity of the word is less than the second threshold, to set every dimension of the PAD value of the word to 0. When even the maximum similarity of a word is very small, the word is probably not strongly related to any of the PAD seed emotion words described above, so there is no need to force an association with a particular PAD seed emotion word; instead, each dimension of its PAD value may be set to 0 in some examples.
In some embodiments of the invention, emotion dictionary generation module 28 may be configured to determine the maximum similarity between each word in the dictionary vocabulary and the PAD seed emotion words, and to determine the PAD value of the word according to $\mathrm{PAD} = \mathrm{sigmoid}(\alpha \max(\mathrm{SIM}_i) - \beta) \times \mathrm{PAD}_i$, where PAD denotes the PAD value of the word, $\max(\mathrm{SIM}_i)$ denotes the maximum similarity, $\mathrm{PAD}_i$ denotes the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity, and α and β are parameters. This approach is more fine-grained than the previous embodiment, and the resulting PAD value better reflects the emotional attributes of the corpus data. The parameter β acts as a threshold, so that a value of $\max(\mathrm{SIM}_i)$ that is too low also drives the PAD value toward 0; in addition, α controls the amplitude. Continuing the above example, for the word "fear", the PAD value is $\mathrm{sigmoid}(\alpha \max(\mathrm{SIM}_i) - \beta) \times [-1, +1, -1]$.
In some embodiments of the present invention, optionally, the parameter α is equal to 15 and the parameter β is equal to 0.17; both are optimal values obtained experimentally.
According to another aspect of the present invention, as shown in fig. 3, there is provided a method for emotion recognition using an emotion dictionary constructed by any one of the above emotion dictionary construction methods or by any one of the above emotion dictionary construction systems. The method first performs dictionary mapping: the corpus to be recognized is preprocessed in step S302; an evaluation word in the preprocessed corpus is determined in step S304; and in step S306 the evaluation word is mapped to a target word in the emotion dictionary and the PAD value of the target word is determined. How to build the PAD emotion dictionary has been described above. For a user reply in a man-machine conversation, the sentence is cleaned and segmented, and the segmentation result is then mapped against the PAD emotion dictionary to obtain the PAD value of each target word. For example, for the sentence "I basically never use the red envelopes, spending the money is too troublesome, it feels like chicken ribs!", stop-word removal and word segmentation are performed first, and dependency parsing and rules are then used to obtain the evaluation words. Finally, the evaluation words are mapped against the dictionary to obtain the target words {trouble, chicken ribs}.
Next, sentiment analysis is performed. In step S308, the PAD value of the corpus is determined according to the PAD value of the target word; in step S310, the emotion type of the corpus is determined according to the PAD value of the corpus. In some embodiments of the present invention, the step of determining the PAD value of the corpus based on the PAD value of the target word comprises taking the weighted average of the PAD values of the target words as the PAD value of the corpus. For example, the PAD values of all target words in a sentence can be weighted and averaged to obtain the PAD value of the sentence. Standard emotions can be looked up in the PAD dictionary, the distance between the PAD value of the sentence and each standard emotion is computed, and the emotion to which the sentence belongs is finally determined. The standard emotions can be derived from the HOE model (hourglass emotion model), which contains 24 standard emotions in total, including: "calm", "happy", "mad", "sad", "hurting", "reception", "trusted", "worried", "hate", "worry", "fear", "frightened", "surprise", "anger", "rage", "undeveloped", "surprise", "concern", "expectation", "alert". By calculating the distance between the PAD value of the sentence and the standard emotions, the sentence in the above example is finally judged to belong to the most similar emotion, "impatience".
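Steps S308 and S310 can be sketched as follows. Several points are assumptions rather than details from the text: equal weights are used for the average, Euclidean distance is used as the distance measure, and the PAD values of the target words and of the two standard emotions are invented toy numbers.

```python
import math


def sentence_pad(target_pads, weights=None):
    """Weighted average of the target words' PAD values (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(target_pads)
    total = sum(weights)
    return tuple(
        sum(wt * pad[d] for wt, pad in zip(weights, target_pads)) / total
        for d in range(3)
    )


def nearest_emotion(pad, standard_emotions):
    """Name of the standard emotion whose PAD value is closest (Euclidean distance)."""
    return min(standard_emotions, key=lambda name: math.dist(pad, standard_emotions[name]))


# Hypothetical PAD values for two target words and two standard emotions.
target_pads = [(-0.5, 0.5, -0.5), (-0.5, 0.0, 0.0)]
standard = {"calm": (0.5, -0.5, 0.5), "impatience": (-0.5, 0.5, -0.25)}

s_pad = sentence_pad(target_pads)
label = nearest_emotion(s_pad, standard)
print(s_pad, label)
```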
According to another aspect of the present invention, as shown in fig. 4, there is provided a system 40 for emotion recognition using an emotion dictionary constructed by any one of the above emotion dictionary construction methods or by any one of the above emotion dictionary construction systems. The system includes: a preprocessing module 402 configured to preprocess a corpus to be recognized; an evaluation word determination module 404 configured to determine an evaluation word in the corpus preprocessed by the preprocessing module; a target word PAD value determination module 406 configured to map the evaluation word to a target word in the emotion dictionary and determine the PAD value of the target word; a corpus PAD value determination module 408 configured to determine the PAD value of the corpus according to the PAD value of the target word; and an emotion type determination module 410 configured to determine the emotion type of the corpus according to the PAD value of the corpus. How to build the PAD emotion dictionary has been described above. For a user reply in a man-machine conversation, the sentence is cleaned and segmented, and the segmentation result is then mapped against the PAD emotion dictionary to obtain the PAD value of each target word. In some embodiments of the present invention, the corpus PAD value determination module 408 takes a weighted average of the PAD values of the target words as the PAD value of the corpus. For example, the PAD values of all target words in a sentence can be weighted and averaged to obtain the PAD value of the sentence. Standard emotions can be looked up in the PAD dictionary, the distance between the PAD value of the sentence and each standard emotion is computed, and the emotion to which the sentence belongs is finally determined.
FIG. 5 illustrates an emotion dictionary construction and application method according to an embodiment of the present invention, in which the application of the basic principles of the present invention is more fully illustrated in one possible application scenario. First, domain word vector training is performed in step one, and more detailed steps are shown in fig. 5, which specifically includes obtaining domain-related corpus 501, data preprocessing 502, and training word vector 503, where one of the purposes of step one is to generate domain word vector 504. Secondly, performing seed word expansion in step two, and referring to fig. 5 in more detail, specifically including expanding 507 the base seed emotion words by using a synonym table 506 based on the spatial similarity between the segmented result 505 and the base seed emotion words, and finally forming an expanded seed set 508. Thirdly, constructing an emotion dictionary in the third step to form a PAD emotion dictionary 512, specifically including vocabulary acquisition, SIM value calculation 509, PAD value calculation 510 and PAD value adjustment 511, although the PAD value calculation 510 and PAD value adjustment 511 may also be expressed as directly calculating an optimized PAD value. Finally, the dictionary is applied in the human-computer conversation for emotion analysis, specifically comprising dictionary mapping 513, emotion analysis 514, human-computer conversation scenario QA application 515.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform any of the methods described above. Computer-readable media as referred to in the present invention include all types of computer storage media, which can be any available media that can be accessed by a general purpose or special purpose computer. By way of example, computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other transitory or non-transitory medium that can be used to carry or store desired program code means in the form of instructions or data structures and that can be accessed by a general purpose or special purpose computer, or a general purpose or special purpose processor. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
To test the effectiveness of the constructed PAD emotion dictionary, the most widely used Chinese emotion dictionary NTUSD can be selected as the baseline standard. The data set selects emotion classification data set COAE2015, and the statistical information of the data set is as follows:
Figure BDA0002377192600000231
In order to keep the upper-layer emotion analysis method from influencing the comparison of the underlying resources, an SVM algorithm can be used uniformly: the dictionary feature representation method proposed by Duy Tin Vo is adopted, the dictionary is used as one of the SVM features, a classification model is trained on the training data, and the model is verified on the test data. Accuracy (Acc) and the F1 value can be chosen as test metrics, calculated as follows:
Figure BDA0002377192600000241
Figure BDA0002377192600000242
where P is the precision and R is the recall. TP denotes a positive example sentence predicted as positive, TN denotes a negative example sentence predicted as negative, FP denotes a negative example sentence predicted as positive, and FN denotes a positive example sentence predicted as negative.
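The two metrics can be computed from the four counts as sketched below. The formulas in the original are only available as images, so the code assumes the standard definitions: Acc = (TP + TN) / (TP + TN + FP + FN), P = TP / (TP + FP), R = TP / (TP + FN) and F1 = 2PR / (P + R). The counts used are hypothetical and unrelated to the reported results.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all sentences that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision P and recall R."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion counts for illustration only.
acc = accuracy(tp=40, tn=45, fp=5, fn=10)
f1 = f1_score(tp=40, fp=5, fn=10)
print(round(acc, 3), round(f1, 3))
```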
The results are shown below:
Dictionary    Acc      F1
NTUSD         0.793    0.642
PAD           0.826    0.714
As can be seen from the results, the PAD emotion dictionary is superior to NTUSD in both the accuracy Acc and F1 values, mainly because the PAD model captures emotion information of multiple dimensions of each word and can provide more information in the machine learning model, and therefore, compared with the traditional two-dimensional dictionary, the PAD emotion dictionary has better effect in emotion analysis tasks. It is also worth mentioning that the PAD emotion dictionary has the advantage of multi-dimensional representation of emotion, so that the use method and application scene thereof have more possibilities.
It should be noted that some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
A computer program (also known as a program, software application, script, or code) can be written in any form of programming language, including compiled, interpreted, declarative, and procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, libraries, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC). Such a special-purpose circuit may be referred to as a computer processor, even if it is not a general-purpose processor.
The above examples mainly illustrate the emotion dictionary construction method and system, emotion recognition method and system, and computer-readable storage medium of the present invention. From the above description, it can be seen that the emotion dictionary construction method and system, the emotion recognition method and system, and the computer readable storage medium in some embodiments of the present invention can completely describe emotion from multiple dimensions by means of a psychological model, and furthermore, the dictionary construction mechanism of the present invention is efficient, and domain characteristics of a dictionary are considered in some examples of the present invention. By combining the above, the mechanism for performing the corpus emotion analysis is more accurate. Although only a few embodiments of the present invention have been described, those skilled in the art will appreciate that the present invention may be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and various modifications and substitutions may be made therein without departing from the spirit and scope of the present invention as defined by the appended claims.

Claims (33)

1. An emotion dictionary construction method, characterized in that the method comprises:
preprocessing corpus data to obtain a segmented word list;
performing word vector training on the preprocessed corpus data to obtain word vectors;
determining a similarity between each word in a dictionary vocabulary and each of a plurality of PAD seed emotion words, wherein the dictionary vocabulary comprises the segmented word list and the plurality of PAD seed emotion words have a corresponding plurality of standard PAD values; and
determining the PAD value of each word in the dictionary vocabulary according to the plurality of standard PAD values and the similarity, thereby forming the emotion dictionary.
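The similarity step of claim 1 can be illustrated with a minimal sketch. It assumes cosine similarity over word vectors, which the claim does not specify; the function names and the toy two-dimensional vectors are hypothetical and stand in for vectors produced by the word vector training step.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (an assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy 2-D vectors, invented for illustration only.
word_vectors = {
    "delighted": [0.9, 0.1],
    "furious": [-0.8, 0.6],
}
seed_vectors = {
    "happy": [1.0, 0.0],
    "hostile": [-1.0, 0.5],
}

def similarity_table(words, seeds):
    """Similarity of every dictionary word to every PAD seed emotion word."""
    return {w: {s: cosine(wv, sv) for s, sv in seeds.items()}
            for w, wv in words.items()}

table = similarity_table(word_vectors, seed_vectors)
```

Each row of `table` is the input that the PAD value determination step would consume for one word.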
2. The method of claim 1, wherein the corpus data comprises general corpus.
3. The method of claim 1, wherein the corpus data comprises domain-related corpora.
4. The method of claim 1, wherein the pre-processing comprises word segmentation, stop-word removal, and unification of simplified and traditional Chinese characters.
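The three pre-processing operations named in claim 4 can be sketched as follows. The conversion table, stop-word list, and vocabulary are tiny hypothetical stand-ins, and forward maximum matching is used here only as a simple substitute for whatever segmenter an actual implementation would use.

```python
# Hypothetical toy resources; a real system would use a full conversion
# table, a complete stop-word list, and a proper Chinese word segmenter.
TRAD_TO_SIMP = {"這": "这", "個": "个", "務": "务", "歡": "欢"}
STOP_WORDS = {"的", "了", "这", "个"}
VOCAB = {"服务", "喜欢", "这", "个", "我", "的"}

def unify(text):
    """Map traditional characters to simplified ones, leaving others as-is."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)

def segment(text, vocab=VOCAB, max_len=4):
    """Forward maximum matching: take the longest vocabulary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            # Fall back to a single character when nothing longer matches.
            if n == 1 or piece in vocab:
                tokens.append(piece)
                i += n
                break
    return tokens

def preprocess(text):
    """Unify the script, segment, then drop stop words."""
    return [t for t in segment(unify(text)) if t not in STOP_WORDS]

tokens = preprocess("我喜歡這個服務")
```

In this sketch the script is unified before segmentation so that a single simplified vocabulary suffices; the claim itself does not fix the order of the three operations.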
5. The method of claim 1, wherein the plurality of PAD seed emotion words comprise happy, boring, dependent, slight, relaxed, anxious, mild and hostile.
6. The method of claim 1, wherein the dictionary vocabulary further comprises words in a thesaurus whose usage frequency is above a first threshold.
7. The method of claim 1, wherein each of the plurality of PAD seed emotion words is expanded based on the word vectors and a near-synonym library to form a corresponding plurality of seed emotion word sets, and wherein the similarity is determined based on the plurality of seed emotion word sets.
8. The method of claim 7, wherein the seed emotion word set to which a word belongs is determined according to the similarity between the word vector of the word and the plurality of PAD seed emotion words.
9. The method of claim 7, wherein each of the plurality of PAD seed emotion words is expanded with its near-synonyms obtained from the near-synonym library.
10. The method of claim 7, wherein the similarity between each word in the dictionary vocabulary and each of the plurality of PAD seed emotion words is determined according to a weighted average of the similarities between the word and the members of the corresponding seed emotion word set.
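The weighted average of claim 10 can be sketched as below. The claim does not state how the weights are chosen, so the weights here (the seed word counting twice as much as its expanded members) are purely an assumption, as are the member words and similarity values.

```python
def set_similarity(word_sims, weights):
    """Weighted average of a word's similarity to the members of one seed set.

    word_sims: {member: similarity of the word to that member}
    weights:   {member: non-negative weight}; hypothetical, not from the source
    """
    total = sum(weights[m] for m in word_sims)
    if total == 0:
        return 0.0
    return sum(word_sims[m] * weights[m] for m in word_sims) / total

# Invented similarities of one dictionary word to an expanded "happy" set.
happy_set_sims = {"happy": 0.8, "joyful": 0.6, "cheerful": 0.4}
happy_set_weights = {"happy": 2.0, "joyful": 1.0, "cheerful": 1.0}

sim_to_happy = set_similarity(happy_set_sims, happy_set_weights)
```

With these invented numbers the result is (2 × 0.8 + 0.6 + 0.4) / 4 = 0.65, which then serves as the word's similarity to the seed emotion word "happy".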
11. The method of claim 1, wherein the step of determining a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarity comprises:
determining the maximum similarity of each word in the dictionary vocabulary and the corresponding PAD seed emotional word; and
determining the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, and $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity.
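The scaling rule of claim 11 can be sketched as follows: the word inherits the standard PAD value of its most similar seed emotion word, scaled by that maximum similarity. The function name and the standard PAD values are illustrative placeholders, not values from the source.

```python
def pad_by_max_similarity(sims, standard_pad):
    """Return (best seed word, max similarity times that seed's standard PAD)."""
    best_seed = max(sims, key=sims.get)
    scale = sims[best_seed]
    return best_seed, tuple(scale * v for v in standard_pad[best_seed])

# Placeholder (P, A, D) triples invented for illustration.
standard_pad = {
    "happy": (2.0, 1.0, 1.0),
    "hostile": (-2.0, 1.0, 2.0),
}
sims = {"happy": 0.5, "hostile": 0.2}

seed, pad = pad_by_max_similarity(sims, standard_pad)
```

Here the word is closest to "happy" with similarity 0.5, so each dimension of the "happy" triple is halved.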
12. The method of claim 1, wherein the step of determining a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarity comprises:
determining the maximum similarity of each word in the dictionary vocabulary and the corresponding PAD seed emotional word;
if the maximum similarity of the word is greater than or equal to a second threshold, determining the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, and $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity; and
if the maximum similarity of the word is smaller than the second threshold, setting all dimensions of the PAD value of the word to 0.
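The thresholded variant of claim 12 can be sketched as below. The threshold value 0.3 and the standard PAD triples are hypothetical; the claim only requires that words whose maximum similarity falls below the second threshold receive an all-zero PAD value.

```python
def pad_with_threshold(sims, standard_pad, threshold):
    """Scale the best seed's standard PAD, or return zeros below the threshold."""
    best_seed = max(sims, key=sims.get)
    best_sim = sims[best_seed]
    if best_sim < threshold:
        # The word is treated as emotionally neutral.
        return (0.0, 0.0, 0.0)
    return tuple(best_sim * v for v in standard_pad[best_seed])

# Placeholder (P, A, D) triples invented for illustration.
standard_pad = {
    "happy": (2.0, 1.0, 1.0),
    "hostile": (-2.0, 1.0, 2.0),
}

emotional = pad_with_threshold({"happy": 0.5, "hostile": 0.2}, standard_pad, 0.3)
neutral = pad_with_threshold({"happy": 0.1, "hostile": 0.2}, standard_pad, 0.3)
```

The hard cutoff keeps weakly related words from picking up a small but spurious emotion value.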
13. The method of claim 1, wherein the step of determining a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarity comprises:
determining the maximum similarity of each word in the dictionary vocabulary and the corresponding PAD seed emotional word; and
determining the PAD value of the word according to $PAD = \mathrm{sigmoid}(\alpha \max(SIM_i) - \beta) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity, and $\alpha$ and $\beta$ are parameters.
14. The method according to claim 13, characterized in that the parameter α is equal to 15 and the parameter β is equal to 0.17.
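The sigmoid-weighted variant of claims 13 and 14 can be sketched as below, using α = 15 and β = 0.17 from claim 14. The formula is read here as sigmoid(α · max(SIM_i) − β); this grouping follows the claim text literally and should be treated as an assumption. The standard PAD triples are placeholders.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def pad_with_sigmoid(sims, standard_pad, alpha=15.0, beta=0.17):
    """Scale the best seed's standard PAD by sigmoid(alpha * max_sim - beta)."""
    best_seed = max(sims, key=sims.get)
    gate = sigmoid(alpha * sims[best_seed] - beta)
    return tuple(gate * v for v in standard_pad[best_seed])

# Placeholder (P, A, D) triples invented for illustration.
standard_pad = {
    "happy": (2.0, 1.0, 1.0),
    "hostile": (-2.0, 1.0, 2.0),
}

strong = pad_with_sigmoid({"happy": 0.6, "hostile": 0.2}, standard_pad)
weak = pad_with_sigmoid({"happy": 0.1, "hostile": 0.05}, standard_pad)
```

Compared with a hard threshold, the sigmoid gives a smooth transition: a word with high maximum similarity receives nearly the full standard PAD value, while a weakly similar word receives a reduced one.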
15. A system for constructing an emotion dictionary, the system comprising:
a preprocessing module configured to preprocess corpus data to obtain a segmented word list;
a word vector training module configured to perform word vector training on the corpus data preprocessed by the preprocessing module to obtain word vectors;
a similarity determination module configured to determine a similarity between each word in a dictionary vocabulary and each of a plurality of PAD seed emotion words, wherein the dictionary vocabulary comprises the segmented word list and the plurality of PAD seed emotion words have a corresponding plurality of standard PAD values; and
an emotion dictionary generation module configured to determine a PAD value for each word in the dictionary vocabulary from the plurality of standard PAD values and the similarity, thereby forming the emotion dictionary.
16. The system according to claim 15, wherein said corpus data comprises general corpus.
17. The system according to claim 15, wherein said corpus data comprises domain-related corpora.
18. The system of claim 15, wherein the pre-processing includes word segmentation, stop-word removal, and unification of simplified and traditional Chinese characters.
19. The system of claim 15, wherein the plurality of PAD seed emotion words include happy, boring, dependent, slight, relaxed, anxious, mild and hostile.
20. The system of claim 15, wherein the dictionary vocabulary further comprises words in a thesaurus whose usage frequency is above a first threshold.
21. The system of claim 15, further comprising an emotion word expansion module configured to expand each of the plurality of PAD seed emotion words based on the word vectors and a near-synonym library to form a corresponding plurality of seed emotion word sets, and wherein the similarity determination module determines the similarity based on the plurality of seed emotion word sets.
22. The system of claim 21, wherein the emotion word expansion module determines the seed emotion word set to which a word belongs according to the similarity between the word vector of the word and the plurality of PAD seed emotion words.
23. The system of claim 21, wherein the emotion word expansion module expands each of the plurality of PAD seed emotion words with its near-synonyms obtained from the near-synonym library.
24. The system of claim 21, wherein the similarity determination module determines the similarity of each word in the dictionary vocabulary to each of the plurality of PAD seed emotion words based on a weighted average of the similarity of each word to members of each of a corresponding plurality of sets of seed emotion words.
25. The system of claim 15, wherein the emotion dictionary generation module is configured to determine a maximum similarity of each word in the dictionary vocabulary and its corresponding PAD seed emotion word; and
determine the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, and $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity.
26. The system of claim 15, wherein the emotion dictionary generation module is configured to determine a maximum similarity of each word in the dictionary vocabulary and its corresponding PAD seed emotion word;
if the maximum similarity of the word is greater than or equal to a second threshold, determine the PAD value of the word according to $PAD = \max(SIM_i) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, and $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity; and
if the maximum similarity of the word is smaller than the second threshold, set all dimensions of the PAD value of the word to 0.
27. The system of claim 15, wherein the emotion dictionary generation module is configured to determine a maximum similarity of each word in the dictionary vocabulary and its corresponding PAD seed emotion word; and
determine the PAD value of the word according to $PAD = \mathrm{sigmoid}(\alpha \max(SIM_i) - \beta) \times PAD_i$, wherein $PAD$ represents the PAD value of the word, $\max(SIM_i)$ represents the maximum similarity, $PAD_i$ represents the standard PAD value of the PAD seed emotion word corresponding to the maximum similarity, and $\alpha$ and $\beta$ are parameters.
28. The system of claim 27, wherein the parameter α is equal to 15 and the parameter β is equal to 0.17.
29. A method for emotion recognition using an emotion dictionary constructed according to the method of any one of claims 1-14 or the system of any one of claims 15-28, the method comprising:
preprocessing the corpus to be identified;
determining evaluation words in the preprocessed corpus;
mapping the evaluation words to target words in the emotion dictionary, and determining PAD values of the target words;
determining the PAD value of the corpus according to the PAD value of the target word; and
determining the emotion type of the corpus according to the PAD value of the corpus.
30. The method of claim 29, wherein determining the PAD value of the corpus based on the PAD value of the target word comprises: taking the weighted average of the PAD values of the target words as the PAD value of the corpus.
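The last two steps of claim 29, together with the weighted average of claim 30, can be sketched as below. The weights, the standard PAD triples, and the nearest-standard-value rule for picking the emotion type are all assumptions; the claims do not state how the emotion type is derived from the PAD value of the corpus.

```python
import math

def corpus_pad(word_pads, weights):
    """Weighted average of the target words' (P, A, D) values."""
    total = sum(weights)
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(
        sum(w * pad[d] for pad, w in zip(word_pads, weights)) / total
        for d in range(3)
    )

def nearest_emotion(pad, standard_pad):
    """Pick the emotion whose standard PAD is closest (Euclidean); an assumed rule."""
    return min(standard_pad, key=lambda name: math.dist(pad, standard_pad[name]))

# Placeholder (P, A, D) triples invented for illustration.
standard_pad = {
    "happy": (2.0, 1.0, 1.0),
    "hostile": (-2.0, 1.0, 2.0),
}
target_word_pads = [(2.0, 1.0, 1.0), (1.0, 0.0, 0.0)]
target_word_weights = [1.0, 1.0]

text_pad = corpus_pad(target_word_pads, target_word_weights)
emotion = nearest_emotion(text_pad, standard_pad)
```

With equal weights the corpus PAD value is the plain mean of the target word values, and the example text is classified as "happy".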
31. A system for emotion recognition using an emotion dictionary constructed according to the method of any one of claims 1 to 14 or the system of any one of claims 15 to 28, the system comprising:
a preprocessing module configured to preprocess a corpus to be recognized;
an evaluation word determination module configured to determine an evaluation word in the corpus preprocessed by the preprocessing module;
a target word PAD value determination module configured to map the evaluation word to a target word in the emotion dictionary and determine a PAD value of the target word;
a corpus PAD value determining module configured to determine a PAD value of the corpus according to the PAD value of the target word; and
an emotion type determination module configured to determine the emotion type of the corpus according to the PAD value of the corpus.
32. The system of claim 31, wherein said corpus PAD value determination module takes a weighted average of PAD values of said target words as PAD values of said corpus.
33. A computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the method of any of claims 1-14, 29, 30.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010073983.7A CN111538834A (en) 2020-01-21 2020-01-21 Emotion dictionary construction method and system, emotion recognition method and system and storage medium
PCT/CN2020/107688 WO2021147298A1 (en) 2020-01-21 2020-08-07 Sentiment lexicon construction method and system, sentiment recognition method and system, and storage medium

Publications (1)

Publication Number Publication Date
CN111538834A true CN111538834A (en) 2020-08-14

Family

ID=71973104

Country Status (2)

Country Link
CN (1) CN111538834A (en)
WO (1) WO2021147298A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163191A (en) * 2011-05-11 2011-08-24 北京航空航天大学 Short text emotion recognition method based on HowNet
CN102184232A (en) * 2011-05-11 2011-09-14 北京航空航天大学 Chinese vocabulary emotion modeling method based on pleasure, arousal and dominance (PAD)
CN105956095A (en) * 2016-04-29 2016-09-21 天津大学 Psychological pre-warning model establishment method based on fine-granularity sentiment dictionary
CN108563635A (en) * 2018-04-04 2018-09-21 北京理工大学 A kind of sentiment dictionary fast construction method based on emotion wheel model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3903993B2 (en) * 2004-02-05 2007-04-11 セイコーエプソン株式会社 Sentiment recognition device, sentence emotion recognition method and program
CN102663139B (en) * 2012-05-07 2013-04-03 苏州大学 Method and system for constructing emotional dictionary
CN103678278A (en) * 2013-12-16 2014-03-26 中国科学院计算机网络信息中心 Chinese text emotion recognition method
CN109376251A (en) * 2018-09-25 2019-02-22 南京大学 A kind of microblogging Chinese sentiment dictionary construction method based on term vector learning model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
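Xu Xiaoyang et al.: "Research on Early Warning of Stock Market Risk from Investor Sentiment Based on Web Text Mining", Jiangsu University Press, pages: 83 - 89 *
Cao Haitao: "Research on Chinese Microblog Sentiment Analysis Based on the PAD Model", China Master's Theses Full-text Database, no. 08, pages 139 - 95 *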

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114648015A (en) * 2022-03-15 2022-06-21 北京理工大学 Dependency relationship attention model-based aspect-level emotional word recognition method
CN114648015B (en) * 2022-03-15 2022-11-15 北京理工大学 Dependency relationship attention model-based aspect-level emotional word recognition method

Also Published As

Publication number Publication date
WO2021147298A1 (en) 2021-07-29

Similar Documents

Publication Publication Date Title
Jin et al. A compact statistical model of the song syntax in Bengalese finch
Herbelot et al. Building a shared world: Mapping distributional to model-theoretic semantic spaces
Futrell et al. Do RNNs learn human-like abstract word order preferences?
CN109977215B (en) Statement recommendation method and device based on associated interest points
CN113609264B (en) Data query method and device for power system nodes
CN110472022A (en) Dialogue method and device, storage medium and terminal based on deep learning
Ettinger et al. Evaluating vector space models using human semantic priming results
CN117094291B (en) Automatic news generation system based on intelligent writing
Arruti et al. Feature selection for speech emotion recognition in Spanish and Basque: on the use of machine learning to improve human-computer interaction
CN114722833B (en) Semantic classification method and device
Ravichander et al. How would you say it? eliciting lexically diverse dialogue for supervised semantic parsing
Shah et al. Articulation constrained learning with application to speech emotion recognition
CN110069601A (en) Mood determination method and relevant apparatus
CN111538834A (en) Emotion dictionary construction method and system, emotion recognition method and system and storage medium
CN111783473B (en) Method and device for identifying best answer in medical question and answer and computer equipment
CN113761104A (en) Method and device for detecting entity relationship in knowledge graph and electronic equipment
Engonopoulos et al. Generating effective referring expressions using charts
Wang et al. Computational models to study language processing in the human brain: A survey
Aakur et al. Going deeper with semantics: Video activity interpretation using semantic contextualization
CN116450855A (en) Knowledge graph-based reply generation strategy method and system for question-answering robot
CN115203356B (en) Professional field question-answering library construction method, question-answering method and system
CN114048319B (en) Humor text classification method, device, equipment and medium based on attention mechanism
CN114969375A (en) Method and system for giving artificial intelligence learning to machine based on psychological knowledge
Mao et al. Spatial versus graphical representation of distributional semantic knowledge.
CN113157932A (en) Metaphor calculation and device based on knowledge graph representation learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination