CN110569497A - Opinion vocabulary expansion system and opinion vocabulary expansion method

Info

Publication number: CN110569497A
Application number: CN201811341060.4A
Authority: CN
Inventors: 萧瑞祥; 王雅诗
Original assignee: Tamkang University
Current assignee: Tamkang University
Priority date: 2018-06-06
Filing date: 2018-11-12
Publication date: 2019-12-13
Also published as: TWI675304B; TW202001619A

Abstract

The invention discloses an opinion vocabulary expansion system and an opinion vocabulary expansion method, wherein the opinion vocabulary expansion method comprises the following steps: calculating a plurality of domain representative vocabularies representing a target domain from a plurality of vocabularies; extracting a plurality of candidate opinion vocabularies from the plurality of vocabularies according to a part-of-speech combination; dividing the candidate opinion vocabularies into a plurality of clusters according to the similarity of the candidate opinion vocabularies; and selecting a plurality of positive seed vocabularies and a plurality of negative seed vocabularies from the plurality of domain representative vocabularies, and calculating the emotional tendency of each candidate opinion vocabulary of each cluster according to the positive seed vocabularies and the negative seed vocabularies.

Description

Opinion vocabulary expansion system and opinion vocabulary expansion method

Technical Field

The invention relates to an opinion vocabulary expansion system, in particular to an opinion vocabulary expansion system based on part of speech combination. The invention also relates to an opinion vocabulary expansion method adopted by the opinion vocabulary expansion system.

Background

The expansion and construction of an opinion vocabulary is the foundation of opinion analysis, and the part-of-speech determination of opinion vocabulary is also an important link in opinion analysis. Generally, there are three ways to expand and build an opinion vocabulary: (1) manual approach: the required opinion vocabulary is extracted and built by hand; (2) dictionary-based approach: the existing opinion vocabulary is expanded by using an existing dictionary together with synonym and antonym resources, or any other resource that encodes relations between words; (3) corpus-based approach: rules about the part of speech, context, and other properties of the opinion vocabulary to be captured are learned through statistics or observation, and the required opinion vocabulary is then found in the corpus by applying those rules.

However, the manual approach to expanding and building the opinion vocabulary is inefficient and cannot effectively increase its coverage; the dictionary-based and corpus-based approaches likewise cannot effectively increase the coverage of the opinion vocabulary.

The part-of-speech determination of opinion vocabulary is generally performed by the same three approaches. The manual approach can achieve higher accuracy but is less efficient, while the dictionary-based and corpus-based approaches suffer from low accuracy.

Therefore, how to provide an opinion vocabulary analysis technique that effectively overcomes these limitations of the prior art has become an urgent problem.

Disclosure of Invention

In view of the above problems in the prior art, it is an object of the present invention to provide an opinion vocabulary expansion system and an opinion vocabulary expansion method, so as to solve various problems in the prior art.

According to one aspect of the present invention, an opinion vocabulary expansion system is provided, which comprises a target domain vocabulary calculation module, an opinion vocabulary extraction module, an opinion vocabulary similarity grouping module, and an opinion vocabulary emotional tendency analysis module. The target domain vocabulary calculation module can calculate a plurality of domain representative vocabularies representing a target domain from a plurality of vocabularies. The opinion vocabulary extraction module can extract a plurality of candidate opinion vocabularies from the plurality of vocabularies according to a part-of-speech combination. The opinion vocabulary similarity grouping module can divide the candidate opinion vocabularies into a plurality of clusters according to the similarity of the candidate opinion vocabularies. The opinion vocabulary emotional tendency analysis module can select a plurality of positive seed vocabularies and a plurality of negative seed vocabularies from the plurality of domain representative vocabularies, and can calculate the emotional tendency of each candidate opinion vocabulary of each cluster according to the positive seed vocabularies and the negative seed vocabularies.

According to another aspect of the present invention, an opinion vocabulary expansion method is provided, which comprises the following steps: calculating a plurality of domain representative vocabularies representing a target domain from a plurality of vocabularies; extracting a plurality of candidate opinion vocabularies from the plurality of vocabularies according to a part-of-speech combination; dividing the candidate opinion vocabularies into a plurality of clusters according to the similarity of the candidate opinion vocabularies; and selecting a plurality of positive seed vocabularies and a plurality of negative seed vocabularies from the plurality of domain representative vocabularies, and calculating the emotional tendency of each candidate opinion vocabulary of each cluster according to the positive seed vocabularies and the negative seed vocabularies.

In view of the above, the opinion vocabulary expansion system and the opinion vocabulary expansion method according to the present invention may have one or more of the following advantages:

(1) In an embodiment of the invention, the opinion vocabulary expansion system can extract candidate opinion vocabularies by special part-of-speech combinations including idiom types and adjective types, so that the coverage rate of the opinion vocabularies can be greatly improved.

(2) In an embodiment of the invention, the opinion vocabulary expansion system can analyze the emotional tendency of the opinion vocabulary through a more effective emotional tendency analysis step, so that the part-of-speech judgment accuracy of the opinion vocabulary can be greatly improved.

(3) In an embodiment of the invention, the opinion vocabulary expansion system can adopt a specially designed mechanism to more rapidly perform the expansion and establishment of the opinion vocabulary and the part of speech judgment of the opinion vocabulary, thereby greatly improving the efficiency.

Drawings

FIG. 1 is a block diagram of an opinion vocabulary expansion system according to a first embodiment of the present invention.

FIG. 2 is a flowchart of the first embodiment of the present invention.

FIG. 3 is a block diagram of an opinion vocabulary expansion system according to a second embodiment of the present invention.

FIG. 4 is a flowchart of the second embodiment of the present invention.

Description of reference numerals: 1 - opinion vocabulary expansion system; 11 - data preprocessing module; 12 - target domain vocabulary calculation module; 13 - opinion vocabulary extraction module; 14 - opinion vocabulary similarity clustering module; 15 - invalid opinion vocabulary filtering module; 16 - opinion vocabulary emotional tendency analysis module; D - review database; S21-S25, S41-S46 - method steps.

Detailed Description

Embodiments of the opinion vocabulary expansion system and opinion vocabulary expansion method according to the present invention will be described below with reference to the related drawings, in which components may be exaggerated or reduced in size or in scale for clarity and convenience in illustration. In the following description and/or claims, when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present; when an element is referred to as being "directly connected" or "directly coupled" to another element, there are no intervening elements present, and other words used to describe the relationship between the elements or layers should be interpreted in the same manner. For ease of understanding, the same components in the following embodiments are illustrated with the same reference numerals.

Please refer to FIG. 1, which is a block diagram of an opinion vocabulary expansion system according to a first embodiment of the present invention. As shown in the figure, the opinion vocabulary expansion system 1 may include a data preprocessing module 11, a target domain vocabulary calculation module 12, an opinion vocabulary extraction module 13, an opinion vocabulary similarity clustering module 14, and an invalid opinion vocabulary filtering module 15.

The data preprocessing module 11 can obtain a plurality of product review articles from the review database D; the plurality of product review articles may be obtained by an automated web crawler. Then, the data preprocessing module 11 may perform word segmentation and part-of-speech tagging on the product review articles through a word segmenter to generate a plurality of words; in one embodiment, the word segmenter may be a word segmentation algorithm (e.g., Jieba).

The target domain vocabulary calculation module 12 can calculate a plurality of domain representative vocabularies representing a target domain from the plurality of vocabularies; in one embodiment, the target domain vocabulary calculation module 12 may calculate the domain representative vocabularies from the plurality of vocabularies by a term frequency-inverse document frequency (TF-IDF) algorithm.
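As a minimal sketch of how TF-IDF could rank the words of each article, assuming every article has already been segmented into a list of words. The function name `top_tfidf_words`, the unsmoothed IDF formula, and the `top_k` parameter are illustrative assumptions and are not taken from the patent.

```python
import math
from collections import Counter

def top_tfidf_words(docs, top_k=3):
    """Return the top_k highest TF-IDF words for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        }
        # Sort by descending score, then alphabetically so ties are stable.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:top_k])
    return results

docs = [
    ["skin", "cream", "good"],
    ["meal", "service", "good"],
    ["skin", "serum", "good"],
]
print(top_tfidf_words(docs, top_k=1))
```

A word such as "good" that appears in every article receives an IDF of zero, which is why it never ranks as a representative word in this toy corpus.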

The opinion vocabulary extraction module 13 can extract a plurality of candidate opinion vocabularies from the plurality of vocabularies according to the part-of-speech combination; in one embodiment, the part-of-speech combination may be generated according to the definition of a word segmentation algorithm (e.g., Jieba); for example, the part-of-speech combination may include an adverbial verb type, a verb type, an adverb plus verb type, and an adverb plus adverbial verb type, and may further be extended into an idiom type and an adjective type.
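The extraction by part-of-speech combination can be sketched as a pattern match over tagged words. The sketch assumes the input is a list of (word, tag) pairs and that the combinations are written with Jieba-style tags, where `d` stands for adverb, `v` for verb, and `vd` for adverbial verb; the tag mapping, the `PATTERNS` list, and the longest-match-first policy are assumptions made for illustration, not details stated in the patent.

```python
# Part-of-speech patterns written with Jieba-style tags (an assumption):
# "d" = adverb, "v" = verb, "vd" = adverbial verb.
PATTERNS = [("d", "v"), ("d", "vd"), ("v",), ("vd",)]

def extract_candidates(tagged_words, patterns=PATTERNS):
    """Scan (word, tag) pairs and collect spans whose tags match a pattern."""
    # Try longer patterns first so that "d + v" wins over a bare "v".
    ordered = sorted(patterns, key=len, reverse=True)
    candidates = []
    i = 0
    while i < len(tagged_words):
        for pattern in ordered:
            window = tagged_words[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                candidates.append("".join(word for word, _ in window))
                i += len(pattern)
                break
        else:
            i += 1  # no pattern starts at this position; move on
    return candidates

tagged = [("很", "d"), ("喜欢", "v"), ("这", "r"), ("味道", "n"), ("推荐", "v")]
print(extract_candidates(tagged))
```

Matching the longer pattern first keeps an adverb attached to the verb it modifies instead of emitting the verb alone.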

The opinion vocabulary similarity clustering module 14 may divide the candidate opinion vocabularies into a plurality of clusters according to the similarities of the candidate opinion vocabularies; in one embodiment, the opinion vocabulary similarity clustering module 14 may employ a Single-Pass algorithm and a Levenshtein distance algorithm to calculate the similarity of the candidate opinion vocabularies and divide the candidate opinion vocabularies into the clusters.

The invalid opinion vocabulary filtering module 15 may calculate the pointwise mutual information (PMI) of each pairwise combination of the candidate opinion vocabularies and the domain representative vocabularies, so as to filter out some invalid opinion vocabularies from the candidate opinion vocabularies.
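A minimal sketch of PMI-based filtering follows, assuming probabilities are estimated from document-level co-occurrence and each document is a set of words. The patent does not state how the PMI values are turned into a keep-or-drop decision, so the rule used here (keep a candidate when its best PMI with any domain representative word exceeds `threshold`) is an assumption, as are the function names.

```python
import math

def pmi(word_a, word_b, docs):
    """Pointwise mutual information estimated from document co-occurrence."""
    n = len(docs)
    count_a = sum(1 for d in docs if word_a in d)
    count_b = sum(1 for d in docs if word_b in d)
    count_ab = sum(1 for d in docs if word_a in d and word_b in d)
    if count_ab == 0:
        return float("-inf")  # the two words never co-occur
    return math.log2((count_ab / n) / ((count_a / n) * (count_b / n)))

def filter_candidates(candidates, domain_words, docs, threshold=0.0):
    """Keep candidates whose best PMI with any domain word exceeds threshold."""
    kept = []
    for cand in candidates:
        best = max(pmi(cand, dw, docs) for dw in domain_words)
        if best > threshold:
            kept.append(cand)
    return kept

docs = [
    {"skin", "smooth", "also"},
    {"skin", "smooth", "also"},
    {"meal", "also"},
    {"meal", "also"},
]
print(filter_candidates(["smooth", "also"], ["skin", "meal"], docs))
```

In this toy corpus "also" appears in every document, so it is statistically independent of both domain words (PMI of zero) and is dropped, while "smooth" co-occurs only with "skin" and is kept.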

As can be seen from the above, the opinion vocabulary expansion system 1 may extract candidate opinion vocabularies by special part-of-speech combinations including an adverbial verb type, a verb type, an adverb plus verb type, and an adverb plus adverbial verb type, and the part-of-speech combinations may further be extended into idiom types and adjective types, thereby greatly increasing the coverage of the opinion vocabularies.

Of course, the above description is only an example, and the components of the opinion vocabulary expansion system 1 and the coordination relationship thereof may also vary according to the actual requirement, and the invention is not limited thereto.

Please refer to fig. 2, which is a flowchart illustrating a first embodiment of the present invention. As shown in the figure, the opinion vocabulary expansion method adopted by the opinion vocabulary expansion system 1 can comprise the following steps:

Step S21: performing word segmentation and part-of-speech tagging on a plurality of product review articles with a word segmenter to generate a plurality of vocabularies.

Step S22: calculating a plurality of domain representative vocabularies representing a target domain from the plurality of vocabularies.

Step S23: extracting a plurality of candidate opinion vocabularies from the plurality of vocabularies according to a part-of-speech combination.

Step S24: dividing the candidate opinion vocabularies into a plurality of clusters according to the similarity of the candidate opinion vocabularies.

Step S25: calculating the pointwise mutual information (PMI) of each pairwise combination of the candidate opinion vocabularies and the domain representative vocabularies, so as to filter out some invalid opinion vocabularies from the candidate opinion vocabularies.

Please refer to FIG. 3, which is a block diagram of an opinion vocabulary expansion system according to a second embodiment of the present invention, wherein the embodiment takes the food domain and the makeup domain as examples. As shown in the figure, the opinion vocabulary expansion system 1 may include a data preprocessing module 11, a target domain vocabulary calculation module 12, an opinion vocabulary extraction module 13, an opinion vocabulary similarity clustering module 14, an invalid opinion vocabulary filtering module 15, and an opinion vocabulary emotional tendency analysis module 16.

The data preprocessing module 11 can obtain a plurality of food and makeup product review articles from the review database D; these articles can be obtained through an automated web crawler, and invalid information in the articles can be filtered out. Since word segmentation and part-of-speech tagging are required before the subsequent steps are performed, the data preprocessing module 11 can segment and part-of-speech tag the plurality of food and makeup product review articles through a word segmentation algorithm (e.g., Jieba) to generate a plurality of vocabularies, wherein the part-of-speech tags are shown in Table 1 below:

TABLE 1

The definitions of the symbols appearing in table 1 are derived from the part-of-speech table of Jieba, and should be well known to those skilled in the art, and therefore are not described herein in detail.

To determine how strongly a vocabulary is associated with the food and makeup domains, the vocabularies that can represent each domain must first be calculated. The target domain vocabulary calculation module 12 can mark the vocabularies tagged as nouns in all segmented food and makeup product review articles as the working vocabularies, which later steps use for matching against candidate opinion vocabularies. Next, the target domain vocabulary calculation module 12 may use a term frequency-inverse document frequency (TF-IDF) algorithm to obtain the TF-IDF value of each selected vocabulary in each food and makeup product review article, and record the top several representative vocabularies of each article. Then, the target domain vocabulary calculation module 12 may use the number of times a vocabulary becomes an article representative vocabulary as a threshold, and determine the domain tendency of the vocabulary according to the probability that it appears in review articles of the food domain and of the makeup domain, so as to find a plurality of domain representative vocabularies for the food and makeup domains. In this embodiment, this stage excludes vocabularies whose probability of being a representative vocabulary is close to 50% in both domains, since such vocabularies are not representative of either the food or the makeup domain. The output of this stage is shown in Table 2:

Vocabulary	Food domain tendency	Makeup domain tendency	Main representative domain
Effect	0.20%	99.80%	Makeup
Service	99.96%	0.04%	Food
Skin and skin	0.01%	99.99%	Makeup
Dining with food	99.95%	0.05%	Food
Skin(s)	0.04%	99.96%	Makeup

TABLE 2
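The domain-tendency stage that produces Table 2 can be sketched as follows. The sketch assumes that, for every word, we already know how many times it was selected as a representative word in food articles and in makeup articles. The function name `domain_tendency`, the `min_count` and `margin` parameters, and the counts in the usage example are all hypothetical; the patent only states that a count threshold is used and that words near 50% in both domains are excluded.

```python
def domain_tendency(rep_counts, min_count=5, margin=0.1):
    """Assign each word to the domain where it is most often representative.

    rep_counts maps word -> (times representative in food articles,
                             times representative in makeup articles).
    """
    result = {}
    for word, (food, makeup) in rep_counts.items():
        total = food + makeup
        if total < min_count:
            continue  # too rarely chosen as a representative word
        food_share = food / total
        if abs(food_share - 0.5) < margin:
            continue  # close to 50/50: representative of neither domain
        domain = "food" if food_share > 0.5 else "makeup"
        result[word] = (food_share, 1.0 - food_share, domain)
    return result

# Invented counts, chosen only to illustrate the calculation.
counts = {"service": (2499, 1), "effect": (1, 499), "good": (50, 50), "rare": (1, 0)}
for word, (food, makeup, domain) in domain_tendency(counts).items():
    print(f"{word}\t{food:.2%}\t{makeup:.2%}\t{domain}")
```

Here "good" is dropped because it is split evenly between the two domains, and "rare" is dropped because it falls below the count threshold.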

The opinion vocabulary extraction module 13 can extract a plurality of candidate opinion vocabularies from the plurality of vocabularies according to the part-of-speech combination; in one embodiment, the part-of-speech combination may be generated according to the definition of a word segmentation algorithm (e.g., Jieba), as shown in Table 3:

TABLE 3

The definitions of the symbols appearing in table 3 are derived from the part-of-speech table of Jieba, and should be well known to those skilled in the art, and therefore are not described herein in detail.

The part-of-speech combination can be further extended into idiom types, as shown in Table 4:

TABLE 4

The part-of-speech combination may also be extended into adjective types, as shown in Table 5:

Rule	Example
N+A	Texture/freshness
A+N	Skin tone/evenness
V+A	Not enough/lasting
A+V	Easy/absorb
ADV+A	Super smooth and tender

TABLE 5

Taking the adjective as the reference point, the opinion vocabulary extraction module 13 searches the vocabularies immediately before and after the adjective for combinations that match these rules; combinations containing a noun are reduced to the adjective alone in subsequent steps, while the other combinations are kept in the form of opinion phrases.

In Tables 4 and 5, N denotes a noun; I denotes an exclamation word; ADV denotes an adverb; U denotes an auxiliary word; V denotes a verb; A denotes an adjective.
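One possible reading of the adjective-type rules of Table 5 is sketched below: each adjective is used as an anchor, its left and right neighbours are checked against the rules, noun combinations are reduced to the bare adjective, and the remaining combinations are kept as phrases. The tag names reuse the abbreviations of Tables 4 and 5; the function name, the space-joined phrase format, and the exact reduction behaviour are interpretive assumptions.

```python
# Tags follow the abbreviations used in Tables 4 and 5: N, A, V, ADV.
BEFORE_ADJ = {"N", "V", "ADV"}  # rules N+A, V+A, ADV+A
AFTER_ADJ = {"N", "V"}          # rules A+N, A+V

def adjective_candidates(tagged_words):
    """Collect opinion words and phrases anchored on each adjective."""
    found = []
    for i, (word, tag) in enumerate(tagged_words):
        if tag != "A":
            continue
        if i > 0 and tagged_words[i - 1][1] in BEFORE_ADJ:
            prev_word, prev_tag = tagged_words[i - 1]
            # Noun combinations are reduced to the bare adjective.
            found.append(word if prev_tag == "N" else prev_word + " " + word)
        if i + 1 < len(tagged_words) and tagged_words[i + 1][1] in AFTER_ADJ:
            next_word, next_tag = tagged_words[i + 1]
            found.append(word if next_tag == "N" else word + " " + next_word)
    return found

tagged = [
    ("texture", "N"), ("fresh", "A"),
    ("super", "ADV"), ("smooth", "A"),
    ("easy", "A"), ("absorb", "V"),
]
print(adjective_candidates(tagged))
```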

The opinion vocabulary similarity clustering module 14 may calculate the similarities of the candidate opinion vocabularies by using a Single-Pass algorithm and a Levenshtein distance algorithm, and may divide the candidate opinion vocabularies into the clusters, wherein the formula of the Levenshtein distance algorithm is as follows:

$$\text{Levenshtein distance} = 1 - \frac{\text{number of edits}}{\max(\text{string 1 length},\ \text{string 2 length})} \tag{1}$$

The "number of edits" in the numerator of formula (1) is the number of operations needed to make the two strings being compared [string 1, string 2] identical, where the edit operations are character insertion, character deletion, and character replacement; the denominator, Max (string 1 length, string 2 length), is the length of the longer of the two strings.
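Formula (1) can be sketched in a few lines. Note that, as written, the formula yields a similarity in the range 0 to 1 (1 for identical strings) rather than a raw edit count. The function names `edit_count` and `levenshtein_similarity` and the handling of two empty strings are illustrative assumptions.

```python
def edit_count(s1, s2):
    """Minimum number of insertions, deletions, and replacements."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # replace c1 with c2
        prev = curr
    return prev[-1]

def levenshtein_similarity(s1, s2):
    """Formula (1): 1 - edits / max(len(s1), len(s2))."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_count(s1, s2) / longest

print(levenshtein_similarity("abcd", "abcf"))  # one replacement out of four characters
```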

The Single-Pass algorithm may comprise the following steps:

Step 1: take one vocabulary from the vocabulary set; when no clustering result exists yet, this vocabulary forms the first cluster and also serves as the representative word of that cluster.

Step 2: take each remaining vocabulary and calculate its string similarity (Levenshtein distance) against the representative words of the existing clusters.

Step 3: if the similarity reaches the threshold, add the vocabulary to that cluster and recalculate the cluster's representative word, using the highest word frequency as the selection criterion.

Step 4: if the vocabulary cannot be assigned to any existing cluster, it forms a new cluster of its own and serves as that cluster's representative word.

Step 5: repeat Steps 2 to 4 until every vocabulary has been clustered.

In the above manner, the opinion vocabulary similarity clustering module 14 may divide the candidate opinion vocabularies into the clusters.
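The five Single-Pass steps can be sketched as below. The sketch assumes a word-frequency table `freq` is available for re-electing the representative word, picks the most similar cluster when several pass the threshold, and uses an arbitrary `threshold` of 0.5; these choices, like the function names, are assumptions rather than details given in the patent.

```python
def similarity(s1, s2):
    """Normalized Levenshtein similarity, as in formula (1)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - prev[-1] / longest

def single_pass(words, freq, threshold=0.5):
    """Cluster words in one pass; freq maps word -> corpus frequency."""
    clusters = []  # each cluster: {"rep": str, "members": [str, ...]}
    for word in words:
        # Compare against the representative word of every existing cluster.
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = similarity(word, cluster["rep"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(word)
            # Re-elect the most frequent member as the representative word.
            best["rep"] = max(best["members"], key=lambda w: freq.get(w, 0))
        else:
            # No cluster is close enough: the word starts its own cluster.
            clusters.append({"rep": word, "members": [word]})
    return clusters

freq = {"smooth": 10, "smoothy": 2, "tasty": 5, "smooths": 1, "taste": 7}
for c in single_pass(list(freq), freq, threshold=0.6):
    print(c["rep"], c["members"])
```

Because each word is compared only with cluster representatives, the cost grows with the number of clusters rather than with the number of words already processed, at the price of results that depend on input order.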

Finally, the opinion vocabulary emotional tendency analysis module 16 may select a plurality of positive seed vocabularies and a plurality of negative seed vocabularies from the plurality of domain representative vocabularies, and may calculate the emotional tendency of each candidate opinion vocabulary of each cluster according to the positive seed vocabularies and the negative seed vocabularies through a sentiment orientation pointwise mutual information (SO-PMI) algorithm; the SO-PMI algorithm adopted in this embodiment is shown in formula (2):

Here, SO-PMI (word) denotes the result calculated by the sentiment orientation pointwise mutual information algorithm for the given vocabulary.
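Formula (2) is not reproduced in this text. The sketch below therefore uses the commonly cited formulation of SO-PMI, in which the score of a word is the sum of its PMI with every positive seed minus the sum of its PMI with every negative seed; this may differ in detail from the patent's own formula. Document-level co-occurrence counts, the smoothing constant `eps`, and the function names are likewise assumptions.

```python
import math

def pmi(word_a, word_b, docs, eps=0.01):
    """Smoothed PMI from document co-occurrence (eps avoids log of zero)."""
    n = len(docs)
    count_a = sum(1 for d in docs if word_a in d)
    count_b = sum(1 for d in docs if word_b in d)
    count_ab = sum(1 for d in docs if word_a in d and word_b in d)
    return math.log2(((count_ab + eps) * n) / ((count_a + eps) * (count_b + eps)))

def so_pmi(word, positive_seeds, negative_seeds, docs):
    """Sum of PMI with positive seeds minus sum of PMI with negative seeds."""
    pos = sum(pmi(word, seed, docs) for seed in positive_seeds)
    neg = sum(pmi(word, seed, docs) for seed in negative_seeds)
    return pos - neg

def polarity(word, positive_seeds, negative_seeds, docs):
    """Map the sign of the SO-PMI score to a polarity label."""
    score = so_pmi(word, positive_seeds, negative_seeds, docs)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

docs = [{"smooth", "good"}, {"smooth", "good"}, {"greasy", "bad"}, {"greasy", "bad"}]
print(polarity("smooth", ["good"], ["bad"], docs))
print(polarity("greasy", ["good"], ["bad"], docs))
```

A positive score means the word co-occurs more strongly with the positive seeds than with the negative ones, and a negative score means the opposite.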

In this embodiment, the seed set is shown in table 6:

TABLE 6

Therefore, the opinion vocabulary expansion system 1 can extract candidate opinion vocabularies through special part-of-speech combinations, and the part-of-speech combinations can further extend idiom types and adjective types, so that the coverage rate of the opinion vocabularies can be greatly improved; in addition, the opinion vocabulary expansion system 1 can perform emotion tendency analysis of the opinion vocabulary through more effective emotion tendency analysis steps, so that the accuracy and efficiency of part of speech judgment of the opinion vocabulary can be greatly improved. Therefore, the opinion vocabulary expansion system 1 can effectively improve the deficiency of the prior art.

It is worth mentioning that the expansion and construction of the opinion vocabulary are currently performed manually, in a dictionary-based manner, or in a corpus-based manner; however, the manual approach is inefficient and cannot effectively increase the coverage of the opinion vocabulary, and the dictionary-based and corpus-based approaches likewise cannot effectively increase the coverage of the opinion vocabulary. In contrast, according to the embodiment of the invention, the opinion vocabulary expansion system can extract the candidate opinion vocabularies through special part-of-speech combinations including idiom types and adjective types, thereby greatly improving the coverage of the opinion vocabulary.

At present, the part of speech of the opinion vocabulary is generally determined manually, based on a dictionary, or based on a corpus. The manual approach can achieve higher accuracy but is less efficient, while the dictionary-based and corpus-based approaches suffer from low accuracy. In contrast, according to the embodiment of the invention, the opinion vocabulary expansion system can analyze the emotional tendency of the opinion vocabularies through a more effective emotional tendency analysis step, so that the part-of-speech judgment accuracy of the opinion vocabularies can be greatly improved; the opinion vocabulary expansion system can also adopt a specially designed mechanism to perform the expansion and construction of the opinion vocabulary and the part-of-speech determination more rapidly, so that the efficiency can be greatly improved. From the above, the present invention satisfies the patentability requirement of inventive step.

Please refer to fig. 4, which is a flowchart illustrating a second embodiment of the present invention. As shown in the figure, the opinion vocabulary expansion method adopted by the opinion vocabulary expansion system 1 can comprise the following steps:

Step S41: performing word segmentation and part-of-speech tagging on a plurality of product review articles through a word segmentation algorithm (e.g., Jieba) to generate a plurality of vocabularies.

Step S42: calculating a plurality of domain representative vocabularies representing a target domain from the plurality of vocabularies by a term frequency-inverse document frequency (TF-IDF) algorithm.

Step S43: extracting a plurality of candidate opinion vocabularies from the plurality of vocabularies according to a part-of-speech combination.

Step S44: dividing the candidate opinion vocabularies into a plurality of clusters according to the similarity of the candidate opinion vocabularies through a Single-Pass algorithm and a Levenshtein distance algorithm.

Step S45: calculating the pointwise mutual information (PMI) of each pairwise combination of the candidate opinion vocabularies and the domain representative vocabularies, so as to filter out some invalid opinion vocabularies from the candidate opinion vocabularies.

Step S46: selecting a plurality of positive seed vocabularies and a plurality of negative seed vocabularies from the plurality of domain representative vocabularies, and calculating the emotional tendency of each candidate opinion vocabulary of each cluster according to the positive seed vocabularies and the negative seed vocabularies through a sentiment orientation pointwise mutual information (SO-PMI) algorithm.

In summary, according to the embodiment of the invention, the opinion vocabulary expansion system can extract the candidate opinion vocabularies by the special part-of-speech combination including idiom type and adjective type, so as to greatly improve the coverage of the opinion vocabularies.

In addition, according to the embodiment of the invention, the opinion vocabulary expansion system can carry out emotional tendency analysis on the opinion vocabularies through more effective emotional tendency analysis steps, so that the part-of-speech judgment accuracy of the opinion vocabularies can be greatly improved.

In addition, according to the embodiment of the invention, the opinion vocabulary expansion system can adopt a specially designed mechanism to more rapidly perform the augmentation and establishment of the opinion vocabulary and the part of speech judgment of the opinion vocabulary, so that the efficiency can be greatly improved.

The foregoing is by way of example only, and not limiting. Any other equivalent modifications or variations without departing from the spirit and scope of the present invention should be included in the protection scope of the present application.

Claims

1. An opinion vocabulary expansion system, comprising:

a target domain vocabulary calculation module, which calculates a plurality of domain representative vocabularies representing a target domain from a plurality of vocabularies;

an opinion vocabulary extraction module, which extracts a plurality of candidate opinion vocabularies from the plurality of vocabularies according to a part-of-speech combination;

an opinion vocabulary similarity grouping module, which divides the candidate opinion vocabularies into a plurality of clusters according to the similarity of the candidate opinion vocabularies; and

an opinion vocabulary emotional tendency analysis module, which selects a plurality of positive seed vocabularies and a plurality of negative seed vocabularies from the plurality of domain representative vocabularies and calculates the emotional tendency of each candidate opinion vocabulary of each cluster according to the positive seed vocabularies and the negative seed vocabularies.

2. The system of claim 1, further comprising a data preprocessing module for generating the plurality of vocabularies by performing word segmentation and part-of-speech tagging on a plurality of product review articles with a word segmenter.

3. The opinion vocabulary expansion system of claim 2, wherein the word segmenter is a word segmentation algorithm.

4. The system of claim 1, further comprising an invalid opinion vocabulary filtering module for calculating the pointwise mutual information of each pairwise combination of the candidate opinion vocabularies and the domain representative vocabularies, so as to filter out some invalid opinion vocabularies from the candidate opinion vocabularies.

5. The opinion vocabulary expansion system of claim 1, wherein the part-of-speech combination is generated according to a definition of a word segmentation algorithm.

6. The opinion vocabulary expansion system of claim 5, wherein the part-of-speech combination includes an adverbial verb type, a verb type, an adverb plus verb type, and an adverb plus adverbial verb type.

7. The system of claim 6, wherein the part-of-speech combination further includes an idiom type and an adjective type.

8. The system of claim 1, wherein the target domain vocabulary calculation module calculates the domain representative vocabularies representing the target domain from the plurality of vocabularies by a term frequency-inverse document frequency algorithm.

9. The system of claim 1, wherein the opinion vocabulary similarity grouping module calculates the similarity of the candidate opinion vocabularies by using a Single-Pass algorithm and a Levenshtein distance algorithm, and divides the candidate opinion vocabularies into the clusters.

10. The system of claim 1, wherein the opinion vocabulary emotional tendency analysis module calculates the emotional tendency of each candidate opinion vocabulary of each cluster according to the positive seed vocabularies and the negative seed vocabularies through a sentiment orientation pointwise mutual information algorithm.

11. An opinion vocabulary expansion method is characterized by comprising the following steps:

calculating a plurality of domain representative vocabularies representing a target domain from a plurality of vocabularies;

extracting a plurality of candidate opinion vocabularies from the plurality of vocabularies according to a part-of-speech combination;

dividing the candidate opinion vocabularies into a plurality of clusters according to the similarity of the candidate opinion vocabularies; and

selecting a plurality of positive seed vocabularies and a plurality of negative seed vocabularies from the plurality of domain representative vocabularies, and calculating the emotional tendency of each candidate opinion vocabulary of each cluster according to the positive seed vocabularies and the negative seed vocabularies.

12. The method of claim 11, further comprising the steps of:

performing word segmentation and part-of-speech tagging on a plurality of product review articles with a word segmenter to generate the plurality of vocabularies.

13. The method of claim 12, wherein the word segmenter is a word segmentation algorithm.

14. The method of claim 11, further comprising the steps of:

calculating the pointwise mutual information of each pairwise combination of the candidate opinion vocabularies and the domain representative vocabularies, so as to filter out some invalid opinion vocabularies from the candidate opinion vocabularies.

15. The method of claim 11, wherein the part-of-speech combination is generated according to a definition of a word segmentation algorithm.

16. The method of claim 15, wherein the part-of-speech combination includes an adverbial verb type, a verb type, an adverb plus verb type, and an adverb plus adverbial verb type.

17. The method of claim 16, wherein the part-of-speech combination further includes an idiom type and an adjective type.

18. The method of claim 11, wherein the plurality of domain representative vocabularies representing the target domain are calculated by a term frequency-inverse document frequency algorithm.

19. The method of claim 11, wherein the similarity of the candidate opinion vocabularies is calculated by a Single-Pass algorithm and a Levenshtein distance algorithm, and the candidate opinion vocabularies are divided into the clusters accordingly.

20. The method of claim 11, wherein the emotional tendency of each candidate opinion vocabulary in each cluster is calculated by a sentiment orientation pointwise mutual information algorithm.