CN114153967A - Public opinion classification optimization method for long text - Google Patents

Public opinion classification optimization method for long text

Info

Publication number
CN114153967A
CN114153967A
Authority
CN
China
Prior art keywords
public opinion
text
character
public
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111060615.XA
Other languages
Chinese (zh)
Inventor
唐亮
曹特磊
赵伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Social Touch Beijing Technology Co ltd
Original Assignee
Social Touch Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Social Touch Beijing Technology Co ltd filed Critical Social Touch Beijing Technology Co ltd
Priority to CN202111060615.XA priority Critical patent/CN114153967A/en
Publication of CN114153967A publication Critical patent/CN114153967A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a public opinion classification optimization method for long texts, which comprises the following steps: a. performing public opinion judgment on an input text with a conventionally fine-tuned BERT model and, for text judged to be of neutral opinion, checking whether the text length exceeds a set length threshold; b. if the length does not exceed the threshold, keeping the original public opinion judgment; if it does, performing a finer-grained public opinion analysis; c. feeding the current text to both the pre-trained and the fine-tuned BERT models to obtain the semantic vector of each character in the current text before and after fine-tuning. The application exploits the change in character semantics of a BERT model before and after fine-tuning and applies it to a public opinion classification task for long texts; by identifying the text segments that carry a public opinion tendency, it reduces the probability that the whole text is judged neutral and better identifies the user's detailed public opinion tendencies.

Description

Public opinion classification optimization method for long text
Technical Field
The invention relates to the technical field of text public opinion classification, and in particular to a public opinion classification optimization method for long texts.
Background
When classifying the public opinion of a text with many characters and a long span, the BERT model commonly used in industry often returns a "neutral" judgment. On the one hand, in long content most paragraphs are objective statements with a neutral tendency, and only a small number of text segments expressing a public opinion tendency are mixed in; these are not easy to find even when read manually. On the other hand, when classifying text opinion, the BERT model simply gives a single judgment of opinion tendency for the whole text, which can be regarded as a weighted average of the opinion tendencies across the text; the longer the text, the more the probability of a positive or negative judgment is diluted. As a result, when long texts are classified, the important opinion fragments they carry are ignored and an overall neutral public opinion judgment is given.
Specifically, the BERT model is trained by large companies such as Google, Tencent, and Huawei, using massive accumulated text data and large-scale computing server clusters. By constructing labeled training samples (randomly "masking" a character in a passage of text, using the original correct character as the target value of a positive sample and other randomly chosen characters as the target values of negative samples) and predicting the masked real character with a multi-layer semantic vector model (a deep learning model), it learns the context in which each character usually appears, i.e. the semantics of each character (represented by a floating-point vector of several hundred dimensions). The trained model is called a pre-trained model; it generally covers thousands to tens of thousands of common characters, represents their semantics with floating-point vectors of several hundred dimensions, and typically supports a stacking depth of up to 12 layers.
The pre-trained model serves as the base model for downstream natural language tasks (text classification, named entity recognition, relation extraction, text generation, and the like). A downstream task uses its own small set of training samples to "fine-tune" the pre-trained model, i.e. to adjust the semantic vector value of each character (or combination) through the prediction error, thereby learning the contexts and character semantic relationships of the current task. Additional network layers are added on top of the BERT model to map the semantic vectors derived from the (fine-tuned) BERT model into the solution space of the target problem (for example, in the public opinion classification task, a network layer with three outputs: positive, negative, and neutral). The three-way public opinion judgment then yields three probability values in [0, 1], one each for the positive, negative, and neutral categories. The category with the largest probability value is the public opinion judgment for the current input text, and that value can be regarded as the probability or confidence of the category.
For the BERT model, whether pre-trained or fine-tuned, the data format of the output semantic vectors (an array) is generally as follows: the first vector is the semantic vector of the whole input text (which is also the vector commonly used by the downstream public opinion classification task), and the following N vectors are the semantic vectors of each character of the current input (including unknown characters, placeholder characters, and so on). All N + 1 semantic vectors have the same dimensionality, generally 768.
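The output layout just described (one whole-text vector followed by N per-character vectors, all of the same dimensionality) can be sketched as follows. The array here is random stand-in data rather than real BERT output, and the function name is illustrative:

```python
import numpy as np

DIM = 768  # dimensionality of each semantic vector, as stated above

def split_bert_output(vectors: np.ndarray):
    """Split a BERT output array into the whole-text vector and the
    per-character vectors, following the layout described above:
    row 0 is the semantic vector of the entire input text, and
    rows 1..N are the vectors of its N characters."""
    return vectors[0], vectors[1:]

# Stand-in for the model output on a 5-character input: (N + 1) x 768.
n_chars = 5
output = np.random.rand(n_chars + 1, DIM)

text_vec, char_vecs = split_bert_output(output)
print(text_vec.shape)   # (768,)
print(char_vecs.shape)  # (5, 768)
```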
Experiments show that, in text public opinion classification, after fine-tuning on a small number of opinion samples, the semantic vectors of the characters (or combinations) with a public opinion tendency, among the last N character vectors (excluding unknown, placeholder, and other meaningless characters), change markedly compared with those before fine-tuning (the pre-trained model); that is, their vector distance changes much more than that of the characters (or combinations) without a public opinion tendency.
Exploiting this regularity in a public opinion classification task for long texts, the text segments whose semantic vectors change greatly in distance before and after fine-tuning are extracted from the long text as the segments expressing the author's public opinion tendency, thereby avoiding the overall neutral judgment caused by an excess of characters and the resulting loss of the user's public opinion information.
Disclosure of Invention
The invention aims to: in order to solve the above problems, provide a public opinion classification optimization method for long texts.
In order to achieve the purpose, the invention adopts the following technical scheme:
the public opinion classification optimization method for the long text comprises the following steps:
a. performing public opinion judgment on an input text with a conventionally fine-tuned BERT model and, for text judged to be of neutral opinion, checking whether the text length exceeds a set length threshold;
b. if the length does not exceed the threshold, keeping the original public opinion judgment; if it does, performing a finer-grained public opinion analysis;
c. feeding the current text to both the pre-trained and the fine-tuned BERT models to obtain the semantic vector of each character in the current text before and after fine-tuning;
d. comparing the two vectors of each character to find the characters whose semantic vectors change greatly in distance, i.e. the characters with a public opinion tendency;
e. using the semantic vectors of the fine-tuned model, extracting characters that are adjacent in position to the public opinion characters and close to them in semantic distance, so as to extract semantically complete text segments with a public opinion tendency;
f. classifying the public opinion of the extracted segments with the fine-tuned public opinion model;
g. combining the original text length and the full text's original public opinion score to give the final public opinion judgment.
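The steps a-g above can be sketched as a small driver routine. This is a hedged illustration, not the patented implementation: the classifier and segment extractor are passed in as stand-in callables, and only the control flow of steps a, b, f, and g is shown (the length weighting and normalization follow the formulas given in the description of step g):

```python
from typing import Callable, List, Tuple

Probs = Tuple[float, float, float]  # (negative, neutral, positive)

def classify_long_text(
    text: str,
    classify: Callable[[str], Probs],              # stand-in for the fine-tuned classifier (steps a, f)
    extract_segments: Callable[[str], List[str]],  # stand-in for steps c-e (opinion segments)
    length_threshold: int = 300,                   # step a threshold given in the text
) -> Probs:
    # Step a: whole-text public opinion judgment.
    pn, pm, pp = classify(text)
    is_neutral = pm >= pn and pm >= pp
    # Step b: keep the original judgment for non-neutral or short text.
    if not is_neutral or len(text) <= length_threshold:
        return (pn, pm, pp)
    # Steps c-f: classify each extracted opinion segment.
    n = len(text)
    pns = pms = pps = 0.0
    for seg in extract_segments(text):
        sn, sm, sp = classify(seg)
        w = len(seg) / n          # step g: weight by segment length
        pns += sn * w
        pms += sm * w
        pps += sp * w
    # Step g: accumulate into the original scores and normalize to [0, 1].
    pnr, pmr, ppr = pn + pns, pm + pms, pp + pps
    total = pnr + pmr + ppr
    return (pnr / total, pmr / total, ppr / total)
```

With a toy classifier that calls any long input neutral and any short fragment negative, a 400-character text gains negative mass from its extracted fragment while the three outputs still sum to 1.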
Preferably, the threshold value in step a is 300.
Preferably, the process in step d is as follows:
traversing each character of the input text one by one, respectively taking out semantic vectors of the current character after pre-training and fine-tuning, calculating the cosine distance of the two vectors, comparing with the calculation value of the formula 1, if the value is smaller than the value, determining that the current character has larger semantic change before and after fine-tuning, and determining that the character has public opinion tendency; otherwise, it is regarded as having no public sentiment tendency, formula 1: 1-1/log (N/m) where N is the number of characters of the current text; m is a coefficient, the sensitivity to semantic distance change is adjusted, the current setting is 4, and characters with larger semantic distance change and position indexes of the characters in the text are extracted.
Preferably, the process in step e is as follows:
Expand outward, toward both the left and the right, from the public opinion characters extracted in step d at their positions in the original text. For public opinion character strings whose positions are contiguous, traverse and expand leftward and rightward from the first and last character positions of the string respectively. For each newly traversed character, judge whether it is a punctuation mark or another stop character; if so, stop the expansion on that side. If the traversal length on the current side exceeds the set traversal length threshold, also stop. Otherwise, compute the semantic distance between the newly traversed character and the adjacent character already in the public opinion segment; only the fine-tuned semantic vectors are needed here, again using the vector cosine distance, so a fixed distance threshold can decide whether the newly traversed character should join the current segment. The semantic distance threshold for adjacent characters is currently set to 0.75: if the cosine similarity of the semantics is greater than 0.75, the newly traversed character is considered close in meaning to the adjacent segment characters, or frequently co-occurring with them, and is added to the finally extracted public opinion segment as a fixed collocation; otherwise the character is considered to belong to another semantic segment, unrelated to the meaning of the currently extracted segment, is excluded from it, and the expansion on that side stops.
Preferably, the process in step g is as follows:
Let the original three-way public opinion value of the original text be (Pn, Pm, Pp), where Pn, Pm, Pp are the probabilities of being judged negative, neutral, and positive respectively. If the original text is judged neutral and is a long text, let the public opinion values of the k segments extracted in step e be (Pni, Pmi, Ppi), where Pni, Pmi, Ppi are the probabilities that the i-th segment is judged negative, neutral, and positive respectively, with i ranging over [1, k];
The public opinion values of the k extracted segments, weighted by length and accumulated, are:
(Pns, Pms, Pps) = Σ_{i=1..k} (Pni·Li/N, Pmi·Li/N, Ppi·Li/N)
where Pns, Pms, and Pps are respectively the negative, neutral, and positive public opinion values of the k segments after weighted accumulation, Li is the character length of the i-th segment, and N is the character length of the original text;
These are then accumulated into the original public opinion values of the original text to obtain:
(Pnr,Pmr,Ppr)=(Pn+Pns,Pm+Pms,Pp+Pps)
In order to unify the values into the range [0, 1], they can be normalized as follows:
(Pn, Pm, Pp) = (Pnr/(Pnr+Pmr+Ppr), Pmr/(Pnr+Pmr+Ppr), Ppr/(Pnr+Pmr+Ppr)).
in summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
in the application, the change of character semantics of a bert model before and after fine adjustment is utilized, and the method is applied to a public opinion classification task aiming at long texts; by identifying the text segments with public opinion tendencies, the probability that the whole text segments are judged to be neutral is reduced, and the detailed public opinion tendencies of the users are better identified.
Drawings
FIG. 1 is a schematic flow chart of step a provided in accordance with an embodiment of the present invention;
FIG. 2 is a schematic flow chart of steps b-g provided according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-2, the present invention provides a technical solution:
the public opinion classification optimization method for the long text comprises the following steps:
a. performing public opinion judgment on the input text with a conventionally fine-tuned BERT model and, for text judged to be of neutral opinion, checking whether the text length exceeds a set length threshold, the threshold being 300, i.e. whether the text is longer than 300 characters;
b. if the length does not exceed the threshold, keeping the original public opinion judgment; if it does, performing a finer-grained public opinion analysis;
c. feeding the current text to both the pre-trained and the fine-tuned BERT models to obtain the semantic vector of each character in the current text before and after fine-tuning;
d. comparing the two vectors of each character to find the characters whose semantic vectors change greatly in distance, i.e. the characters with a public opinion tendency:
traverse the characters of the input text one by one; for each character, take its semantic vectors after pre-training and after fine-tuning and compute the cosine distance of the two vectors (its value lies in [0, 1]; the closer to 1, the more similar the two vectors, and the closer to 0, the more dissimilar). Compare it with the value of formula 1; if it is smaller, the character's semantics are considered to have changed greatly before and after fine-tuning, and the character is judged to have a public opinion tendency; otherwise it is considered to have no public opinion tendency. Formula 1: 1 - 1/log(N/m), where N is the number of characters in the current text and m is a coefficient adjusting the sensitivity to semantic distance change, currently set to 4. The characters with large semantic distance change are extracted and indexed together with their positions in the original text, so that they can later be merged by position and expanded toward both ends to obtain semantically complete text segments;
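A minimal sketch of this step d, assuming the per-character vectors are available as numpy arrays. The log base in formula 1 is not specified in the text, so the natural logarithm is assumed here, and the function names are illustrative:

```python
import math
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two semantic vectors (1 = identical
    direction, 0 = orthogonal), as used in the comparison above."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_opinion_chars(pre: np.ndarray, post: np.ndarray, m: float = 4.0):
    """Step d sketch: pre and post are (N, dim) arrays holding each
    character's vector before and after fine-tuning.  A character whose
    cosine value falls below 1 - 1/log(N/m) (formula 1, with m = 4, and
    natural log assumed) is taken to have a public opinion tendency;
    the function returns the position indexes of those characters."""
    n = len(pre)
    threshold = 1.0 - 1.0 / math.log(n / m)
    return [i for i in range(n)
            if cosine_similarity(pre[i], post[i]) < threshold]
```

For a 400-character text, the threshold evaluates to roughly 0.78, so only characters whose vectors rotated substantially during fine-tuning are flagged.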
e. using the semantic vectors of the fine-tuned model, extracting characters that are adjacent in position to the public opinion characters and close to them in semantic distance, so as to extract semantically complete text segments with a public opinion tendency:
expand outward, toward both the left and the right, from the public opinion characters extracted in step d at their positions in the original text; for public opinion character strings whose positions are contiguous, traverse and expand leftward and rightward from the first and last character positions of the string respectively;
for each newly traversed character, judge whether it is a punctuation mark or another stop character; if so, stop the expansion on that side; if the traversal length on the current side exceeds the set traversal length threshold (for example, 8), also stop; otherwise, compute the semantic distance between the newly traversed character and the adjacent character already in the public opinion segment;
here only the fine-tuned semantic vectors are needed, again using the vector cosine distance, so a fixed distance threshold can decide whether the newly traversed character should join the current segment; the semantic distance threshold for adjacent characters is currently set to 0.75;
if the cosine similarity of the semantics is greater than 0.75, the newly traversed character is considered close in meaning to the adjacent segment characters, or frequently co-occurring with them, and is added to the finally extracted public opinion segment as a fixed collocation; otherwise the character is considered to belong to another semantic segment, unrelated to the meaning of the currently extracted segment, is excluded from it, and the expansion on that side stops.
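The expansion procedure of step e can be sketched as follows, assuming fine-tuned per-character vectors in a numpy array. The stop-character set is illustrative, while the 0.75 similarity threshold and the traversal cap of 8 come from the text:

```python
import numpy as np

STOP_CHARS = set("，。！？；：、,.!?;: \n")  # punctuation/stop characters (illustrative set)

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand_segment(text: str, vectors: np.ndarray, start: int, end: int,
                   sim_threshold: float = 0.75, max_steps: int = 8) -> str:
    """Step e sketch: grow the contiguous opinion run text[start:end]
    leftward and rightward.  Expansion on a side stops at a stop
    character, after max_steps characters, or when the newly met
    character's cosine similarity to the adjacent in-segment character
    (fine-tuned vectors only) does not exceed sim_threshold."""
    left, steps = start, 0
    while left > 0 and steps < max_steps:
        if text[left - 1] in STOP_CHARS or \
           cos(vectors[left - 1], vectors[left]) <= sim_threshold:
            break
        left -= 1
        steps += 1
    right, steps = end, 0
    while right < len(text) and steps < max_steps:
        if text[right] in STOP_CHARS or \
           cos(vectors[right], vectors[right - 1]) <= sim_threshold:
            break
        right += 1
        steps += 1
    return text[left:right]
```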
f. classifying the public opinion of the extracted segments with the fine-tuned public opinion model;
g. combining the original text length and the full text's original public opinion score to give the final public opinion judgment:
Let the original three-way public opinion value of the original text be (Pn, Pm, Pp), where Pn, Pm, Pp are the probabilities of being judged negative, neutral, and positive respectively. If the original text is judged neutral and is a long text, let the public opinion values of the k (k ≥ 0) segments extracted in step e be (Pni, Pmi, Ppi), where Pni, Pmi, Ppi are the probabilities that the i-th segment is judged negative, neutral, and positive respectively, with i ranging over [1, k].
The public opinion values of the k extracted segments, weighted by length and accumulated, are:
(Pns, Pms, Pps) = Σ_{i=1..k} (Pni·Li/N, Pmi·Li/N, Ppi·Li/N)
where Pns, Pms, and Pps are respectively the negative, neutral, and positive public opinion values of the k segments after weighted accumulation, Li is the character length of the i-th segment, and N is the character length of the original text.
These are then accumulated into the original public opinion values of the original text to obtain:
(Pnr, Pmr, Ppr) = (Pn + Pns, Pm + Pms, Pp + Pps)
In order to unify the values into the range [0, 1], they can be normalized as follows:
(Pn, Pm, Pp) = (Pnr/(Pnr+Pmr+Ppr), Pmr/(Pnr+Pmr+Ppr), Ppr/(Pnr+Pmr+Ppr)).
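The length-weighted accumulation and normalization of step g can be sketched as a small function; the input format (a list of per-segment score triples paired with segment lengths) is an assumption for illustration:

```python
from typing import List, Tuple

Probs = Tuple[float, float, float]  # (negative, neutral, positive)

def combine_scores(original: Probs,
                   segments: List[Tuple[Probs, int]],
                   n_chars: int) -> Probs:
    """Step g sketch: original holds (Pn, Pm, Pp) for the full text,
    segments holds ((Pni, Pmi, Ppi), Li) for each of the k extracted
    opinion fragments, and n_chars is N, the character length of the
    original text.  Returns the length-weighted, accumulated, and
    normalized (Pn, Pm, Pp) triple."""
    # Length-weighted accumulation over the k segments.
    pns = sum(p[0] * li / n_chars for p, li in segments)
    pms = sum(p[1] * li / n_chars for p, li in segments)
    pps = sum(p[2] * li / n_chars for p, li in segments)
    # Accumulate into the original full-text scores.
    pnr = original[0] + pns
    pmr = original[1] + pms
    ppr = original[2] + pps
    # Normalize so the three values again lie in [0, 1] and sum to 1.
    total = pnr + pmr + ppr
    return (pnr / total, pmr / total, ppr / total)
```

For instance, a 400-character text judged (0.2, 0.6, 0.2) that contains one strongly negative 100-character fragment shifts toward negative after combination, while the outputs still sum to 1.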
experimental analysis:
A data comparison experiment shows that, in public opinion classification of long texts, this strategy gives an overall public opinion value with better discrimination by analyzing the opinion segments contained in the text; and, by returning the specific opinion segments it contains, the opinion tendency and other information expressed by the user can be better identified.
An example of partial data (truncated for length) is as follows:
[Example data table not reproduced here: it appears as images in the original publication.]
It can be seen that the original public opinion of the original text in the above example is biased toward neutral (the neutral value is the largest); however, after the detailed opinion segments are extracted by this method and their opinion values are weighted and accumulated into the final result, the overall public opinion value and judgment show a clearer opinion tendency (the positive or negative value becomes larger). Moreover, the extracted opinion segments allow the user's detailed public opinion tendencies to be better mined, enriching the result dimensions of data insight.
The previous description of the embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A public opinion classification optimization method for long text, characterized by comprising the following steps:
a. performing public opinion judgment on an input text with a conventionally fine-tuned BERT model and, for text judged to be of neutral opinion, checking whether the text length exceeds a set length threshold;
b. if the length does not exceed the threshold, keeping the original public opinion judgment; if it does, performing a finer-grained public opinion analysis;
c. feeding the current text to both the pre-trained and the fine-tuned BERT models to obtain the semantic vector of each character in the current text before and after fine-tuning;
d. comparing the two vectors of each character to find the characters whose semantic vectors change greatly in distance, i.e. the characters with a public opinion tendency;
e. using the semantic vectors of the fine-tuned model, extracting characters that are adjacent in position to the public opinion characters and close to them in semantic distance, so as to extract semantically complete text segments with a public opinion tendency;
f. classifying the public opinion of the extracted segments with the fine-tuned public opinion model;
g. combining the original text length and the full text's original public opinion score to give the final public opinion judgment.
2. The method as claimed in claim 1, wherein the threshold in step a is 300.
3. The method for optimizing the public opinion classification of the long text according to claim 1, wherein the process in step d is as follows: traverse the characters of the input text one by one; for each character, take its semantic vectors after pre-training and after fine-tuning and compute the cosine distance of the two vectors; compare it with the value of formula 1, and if it is smaller, consider that the character's semantics changed greatly before and after fine-tuning and judge the character to have a public opinion tendency; otherwise, consider it to have no public opinion tendency;
formula 1 is specifically: 1 - 1/log(N/m), where N is the number of characters in the current text and m is a coefficient adjusting the sensitivity to semantic distance change, currently set to 4; the characters with larger semantic distance change, together with their position indexes in the text, are extracted.
4. The method as claimed in claim 1, wherein the process in step e is as follows:
expand outward, toward both the left and the right, from the public opinion characters extracted in step d at their positions in the original text; for public opinion character strings whose positions are contiguous, traverse and expand leftward and rightward from the first and last character positions of the string respectively; for each newly traversed character, judge whether it is a punctuation mark or another stop character, and if so, stop the expansion on that side;
if the traversal length on the current side exceeds the set traversal length threshold, also stop; otherwise, compute the semantic distance between the newly traversed character and the adjacent character already in the public opinion segment, using the vector cosine distance, and judge with a fixed distance threshold whether the newly traversed character should join the current segment;
the semantic distance threshold for adjacent characters is set to 0.75: if the cosine similarity of the semantics is greater than 0.75, the newly traversed character is considered close in meaning to the adjacent segment characters, or frequently co-occurring with them, and is added to the finally extracted public opinion segment as a fixed collocation; otherwise the character is considered to belong to another semantic segment, unrelated to the meaning of the currently extracted segment, is excluded from it, and the expansion on that side stops.
5. The method for optimizing the public opinion classification of the long text according to claim 1, wherein the process in step g is as follows:
let the original three-way public opinion value of the original text be (Pn, Pm, Pp), where Pn, Pm, Pp are the probabilities of being judged negative, neutral, and positive respectively; if the original text is judged neutral and is a long text, let the public opinion values of the k segments extracted in step e be (Pni, Pmi, Ppi), where Pni, Pmi, Ppi are the probabilities that the i-th segment is judged negative, neutral, and positive respectively, with i ranging over [1, k];
the public opinion values of the k extracted segments, weighted by length and accumulated, are:
(Pns, Pms, Pps) = Σ_{i=1..k} (Pni·Li/N, Pmi·Li/N, Ppi·Li/N)
where Pns, Pms, and Pps are respectively the negative, neutral, and positive public opinion values of the k segments after weighted accumulation, Li is the character length of the i-th segment, and N is the character length of the original text;
these are then accumulated into the original public opinion values of the original text to obtain:
(Pnr, Pmr, Ppr) = (Pn + Pns, Pm + Pms, Pp + Pps)
in order to unify the values into the range [0, 1], they can be normalized as follows:
(Pn, Pm, Pp) = (Pnr/(Pnr+Pmr+Ppr), Pmr/(Pnr+Pmr+Ppr), Ppr/(Pnr+Pmr+Ppr)).
CN202111060615.XA 2021-09-10 2021-09-10 Public opinion classification optimization method for long text Pending CN114153967A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111060615.XA CN114153967A (en) 2021-09-10 2021-09-10 Public opinion classification optimization method for long text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111060615.XA CN114153967A (en) 2021-09-10 2021-09-10 Public opinion classification optimization method for long text

Publications (1)

Publication Number Publication Date
CN114153967A true CN114153967A (en) 2022-03-08

Family

ID=80462796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111060615.XA Pending CN114153967A (en) 2021-09-10 2021-09-10 Public opinion classification optimization method for long text

Country Status (1)

Country Link
CN (1) CN114153967A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073646A (en) * 2009-11-23 2011-05-25 北京科技大学 Blog group-oriented subject propensity processing method and system
CN108959268A (en) * 2018-07-20 2018-12-07 科大讯飞股份有限公司 A kind of text emotion analysis method and device
CN111539212A (en) * 2020-04-13 2020-08-14 腾讯科技(武汉)有限公司 Text information processing method and device, storage medium and electronic equipment
CN111984793A (en) * 2020-09-03 2020-11-24 平安国际智慧城市科技股份有限公司 Text emotion classification model training method and device, computer equipment and medium
CN112307771A (en) * 2020-10-29 2021-02-02 平安科技(深圳)有限公司 Course analysis method, device, equipment and medium based on emotion analysis
US20210192141A1 (en) * 2019-12-20 2021-06-24 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for generating vector representation of text, and related computer device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
杨玮祺; 杜晔: "TextCGA: a text classification network based on pre-trained models", Modern Computer (现代计算机), no. 12, 25 April 2020 *
王昆; 郑毅; 方书雅; 刘守印: "Aspect-level sentiment analysis of long text based on text filtering and improved BERT", Journal of Computer Applications (计算机应用), no. 10, 8 June 2020 *

Similar Documents

Publication Publication Date Title
WO2023065544A1 (en) Intention classification method and apparatus, electronic device, and computer-readable storage medium
CN106372061B (en) Short text similarity calculation method based on semantics
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
CN108984526A (en) A kind of document subject matter vector abstracting method based on deep learning
CN107562772B (en) Event extraction method, device, system and storage medium
CN108027814B (en) Stop word recognition method and device
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN103678684A (en) Chinese word segmentation method based on navigation information retrieval
CN104199972A (en) Named entity relation extraction and construction method based on deep learning
WO2004066090A2 (en) Query string matching method and apparatus
CN115630640B (en) Intelligent writing method, device, equipment and medium
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN103646112A (en) Dependency parsing field self-adaption method based on web search
CN109492230B (en) Method for extracting insurance contract key information based on interested text field convolutional neural network
WO2020232898A1 (en) Text classification method and apparatus, electronic device and computer non-volatile readable storage medium
CN108614897B (en) Content diversification searching method for natural language
CN112069312B (en) Text classification method based on entity recognition and electronic device
WO2023065642A1 (en) Corpus screening method, intention recognition model optimization method, device, and storage medium
CN110807324A (en) Video entity identification method based on IDCNN-crf and knowledge graph
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN114997288A (en) Design resource association method
CN107451116B (en) Statistical analysis method for mobile application endogenous big data
CN117113982A (en) Big data topic analysis method based on embedded model
CN104794209A (en) Chinese microblog sentiment classification method and system based on Markov logic network
CN110569355A (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination