CN115759088A

CN115759088A - Text analysis method and storage medium for comment information

Info

Publication number: CN115759088A
Application number: CN202310033845.XA
Authority: CN
Inventors: 赵习枝; 王苑; 张福浩; 欧尔格力; 仇阿根; 张朝坤; 李彬; 索菲; 朱鹏; 陶坤旺; 陆文
Original assignee: Qinghai Province Geospatial And Natural Resources Big Data Center; Chinese Academy of Surveying and Mapping
Current assignee: Qinghai Province Geospatial And Natural Resources Big Data Center; Chinese Academy of Surveying and Mapping
Priority date: 2023-01-10
Filing date: 2023-01-10
Publication date: 2023-03-07

Abstract

A text analysis method for comment information and a storage medium thereof are disclosed, firstly, text comment data are preprocessed, text feature vectors are obtained by text vectorization on the preprocessed data, data noise reduction is carried out by using a self-encoder model, and then high-level feature vectors of the text comment information are extracted through a long-term and short-term memory network, so that text analysis of the comment information is realized. According to the invention, the data is subjected to noise reduction processing through the AE model, redundant features in the data are eliminated, and the efficiency of comment information analysis is effectively improved; by adopting the LSTM model, the document information is effectively utilized, so that the characteristics are more judgment, and the accuracy of comment information analysis is improved.

Description

Text analysis method and storage medium for comment information

Technical Field

The invention relates to the technical field of natural language processing, in particular to a text analysis method and a storage medium for comment information.

Background

With the development of internet technology, social media platforms become an important way for the public to publish opinions and communicate information. A large amount of social media comment data are collected from the Internet, valuable information in the social media comment data is mined, and the like of the people on the preference degree of a certain product or the attention degree and emotional change of a certain social phenomenon can be obtained.

Because social media websites are various in types and have too large amount of comment information, the arrangement and analysis of the comment information only by manpower can be a difficult task, so that the text comment information analysis needs to be further explored, complicated language features are automatically learned from a large amount of text data by adopting a more automatic and intelligent method and the text analysis is carried out, a large amount of manpower and material resources are saved, and the text comment analysis efficiency and accuracy are improved.

Disclosure of Invention

The invention aims to provide a text comment analysis method aiming at the problems of low efficiency, large workload, low accuracy and the like in the working mode of manual text comment analysis, so as to solve the problems.

In order to achieve the purpose, the invention adopts the following technical scheme:

a text analysis method of comment information comprises the following steps:

text comment data preprocessing step S110:

preprocessing the text comment data, filtering out irrelevant information, and performing word segmentation processing on the text comment data;

text comment vector extraction and processing step S120:

text feature vectors are obtained by text vectorization of the preprocessed text comment data, data noise reduction is carried out by using a self-encoder model, and then high-level feature vectors of text comment information are extracted through an LSTM model to represent the text comment data;

calculating emotion prediction results of the text comments S130:

receiving the high-level feature vector of the text comment information extracted in step S120, and calculating an emotion prediction result of the text comment.

Optionally, in step S110, the text comment data preprocessing specifically includes: and deleting punctuation marks and blank spaces by adopting a regular expression, introducing a field dictionary into the text data, and performing word segmentation processing on the data.

Optionally, in step S120, the self-encoder model is an unsupervised learning model, which can eliminate redundant features in data, reduce noise in data, and improve efficiency of comment information analysis.

The output vector of the self-encoder model

The calculation formula is as follows:

wherein, the first and the second end of the pipe are connected with each other,

in order to be a function of the ReLU,

in order to input the feature vector of the text,

to represent

The weight matrix of (a) is determined,

is composed of

The bias term of (d);

optionally, in step S120, the LSTM model is a bidirectional improved recurrent neural network, and a bidirectional coding structure with stronger semantic ability is used to train the corpus, so as to implement deep bidirectional representation of corpus training.

Optionally, in step S120, the LSTM model is composed of 3 gate structures and 1 state unit, where the 3 gate structures include an input gate, a forgetting gate, and an output gate;

wherein the input gate receives two inputs, the output of the last-in-time LSTM model

And input of the current time

Output of input gate at time t

The calculation formula is as follows:

wherein the content of the first and second substances,

in order to be a sigmoid function,

a weight matrix representing the input gate,

indicating that two vectors are concatenated into one longer vector,

is the bias term of the input gate;

output of the forgetting gate

Also receiving the output result of the last time LSTM model

And input of the current time

And determining whether to discard information from the state unit, the output calculation formula is:

wherein the content of the first and second substances,

is a function of the sigmoid and is,

is a weight matrix of the forgetting gate,

meaning that two vectors are concatenated into one longer vector,

is a biased term for a forgetting gate.

Instantaneous state cell value at current time

Expressed as:

wherein the content of the first and second substances,

is that

The weight matrix of (a) is determined,

representing the concatenation of two vectors into one longer vector, tanh represents the hyperbolic tangent activation function,

is that

The bias term of (c).

The status cell at the current time

Receiving values for the input gate and the forget gate, expressed as:

the cell state at the previous time is initialized to 1.

Output gate

For controlling the output of the LSTM state unit, the expression:

is a function of the sigmoid and is,

is a weight matrix of the output gates,

meaning that two vectors are concatenated into one longer vector,

is the bias term for the output gate.

Finally, outputting the state unit of the LSTM model at the current moment

Expressed as:

。

optionally, in step S130, an output of the LSTM model, which is a high-level feature vector of the text comment information extracted in step S120, is received

Obtaining emotion prediction result of text comment through softmax function

The calculation formula is as follows:

wherein the content of the first and second substances,

is a function of the sigmoid and is,

is an emotion prediction result

The weight matrix of (a) is determined,

is an emotion prediction result

The bias term of (c). When the temperature is higher than the set temperature

When so, the emotion prediction is positive.

Further, the present invention also discloses a storage medium for storing computer-executable instructions, which, when executed by a processor, perform the above-mentioned text analysis method for comment information.

Compared with the prior art, the invention has the following advantages:

1) Because the AE model is adopted for data noise reduction, redundant features in data can be eliminated, and text analysis efficiency of comment information is improved.

2) The invention adopts the LSTM model and effectively utilizes the document information, thereby enabling the characteristics to be more judgment and improving the accuracy of text analysis of comment information.

Drawings

Fig. 1 is a flowchart of a text analysis method of comment information according to a specific embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

The invention is characterized in that: and (3) carrying out data noise reduction by using an automatic encoder model (AE model), eliminating redundant features in the data, and extracting a high-level feature vector of the text comment information by using an LSTM model to realize text analysis of the comment information.

Referring to fig. 1, a flowchart of a text analysis method of comment information according to an embodiment of the present invention is shown, including the following steps:

text comment data preprocessing step S110:

and preprocessing the text comment data, filtering out irrelevant information, and performing word segmentation processing on the text comment data.

Specifically, in step S110, the text comment data preprocessing specifically includes: and deleting punctuation marks and blank spaces by adopting a regular expression, introducing a field dictionary into the text data, and performing word segmentation processing on the data.

Text comment vector extraction and processing step S120:

and for the preprocessed text comment data, text feature vectors are obtained by text vectorization, data noise reduction is carried out by using an Auto-Encoder model (Auto-Encoder), and then high-level feature vectors of text comment information are extracted through a Long Short-Term Memory network (LSTM) model to represent the text comment data.

Specifically, in step S120, the self-encoder model is an unsupervised learning model, which can eliminate redundant features in data, reduce noise in data, and improve efficiency of comment information analysis.

The output vector of the self-encoder model

The calculation formula is as follows:

wherein the content of the first and second substances,

in order to be a function of the ReLU,

for the feature vector of the text to be input,

represent

The weight matrix of (a) is determined,

is composed of

The bias term of (c).

The LSTM model is a bidirectional improved recurrent neural network, and a bidirectional coding structure with stronger semantic ability is adopted to train the corpus so as to realize deep bidirectional representation of corpus training.

Specifically, in step S120, the LSTM model is composed of 3 gate structures and 1 state unit, where the 3 gate structures include an input gate, a forgetting gate, and an output gate;

wherein the input gate receives two inputs, the output of the last-in LSTM model

And input of the current time

Output of input gate at time t

The calculation formula is as follows:

in order to be a sigmoid function,

a weight matrix representing the input gate,

meaning that two vectors are concatenated into one longer vector,

is the bias term of the input gate;

output of the forgetting gate

Also receiving the output result of the last time LSTM model

And input of the current time

is a function of the sigmoid and is,

is a weight matrix of the forgetting gate,

meaning that two vectors are concatenated into one longer vector,

is a biased term for a forgetting gate.

Instantaneous state cell value at current time

Expressed as:

wherein the content of the first and second substances,

is that

The weight matrix of (a) is determined,

is that

The bias term of (1).

The status cell at the current time

Receiving values for the input gate and the forget gate, expressed as:

wherein the content of the first and second substances,

the cell state at the previous time is initialized to 1.

Output gate

For controlling the output of the LSTM state unit, the expression:

wherein the content of the first and second substances,

is a function of the sigmoid and is,

is a weight matrix of the output gates,

indicating that two vectors are concatenated into one longer vector,

is the bias term for the output gate.

Finally, outputting the state unit of the LSTM model at the current moment

Expressed as:

。

calculating emotion prediction results of the text comments S130:

Specifically, in step S130, the input of the LSTM model, which is the high-level feature vector of the text comment information extracted in step S120, is receivedGo out

Obtaining emotion prediction result of text comment through softmax function

The calculation formula is as follows:

wherein the content of the first and second substances,

is a function of the sigmoid and is,

is an emotion prediction result

The weight matrix of (a) is determined,

is an emotion prediction result

The bias term of (c).

When in use

Then, the emotion prediction is positive.

Further, the present invention also discloses a storage medium for storing computer-executable instructions, which, when executed by a processor, perform the above-mentioned text analysis method of comment information.

Compared with the prior art, the text analysis method of the comment information provided by the invention has the following advantages:

1) According to the invention, the AE model is adopted for data noise reduction, so that redundant features in the data can be eliminated, and the text analysis efficiency of comment information is improved.

It will be apparent to those skilled in the art that the various elements or steps of the invention described above may be implemented using a general purpose computing device, they may be centralized on a single computing device, or alternatively, they may be implemented using program code that is executable by a computing device, such that they may be stored in a memory device and executed by a computing device, or they may be separately fabricated into various integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

While the invention has been described in further detail with reference to specific preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A text analysis method for comment information is characterized by comprising the following steps:

text comment data preprocessing step S110:

text comment vector extraction and processing step S120:

text feature vectors are obtained by text vectorization on the preprocessed text comment data, data noise reduction is carried out by using a self-encoder model, and then high-level feature vectors of text comment information are extracted through an LSTM model to represent the text comment data;

calculating emotion prediction results of the text comments S130:

2. The text analysis method according to claim 1, wherein:

in step S110, the text comment data preprocessing specifically includes: and deleting punctuation marks and blank spaces by adopting a regular expression, introducing a field dictionary into the text data, and performing word segmentation processing on the data.

3. The text analysis method of claim 1, wherein:

in step S120, the self-encoder model is an unsupervised learning model,

the output vector of the self-encoder model

The calculation formula is as follows:

wherein the content of the first and second substances,

in order to be a function of the ReLU,

for the feature vector of the text to be input,

to represent

The weight matrix of (a) is determined,

is composed of

The bias term of (1).

4. The text analysis method of claim 3, wherein:

in step S120, the LSTM model is a bidirectional improved recurrent neural network, and a bidirectional coding structure with a stronger semantic ability is used to train the corpus, so as to implement deep bidirectional representation of corpus training.

5. The text analysis method of claim 4, wherein:

in step S120, the LSTM model is composed of 3 gate structures and 1 state unit, where the 3 gate structures include an input gate, a forgetting gate, and an output gate;

And input of the current time

Output of input gate at time t

The calculation formula is as follows:

in order to be a sigmoid function,

a weight matrix representing the input gates is shown,

meaning that two vectors are concatenated into one longer vector,

is the bias term of the input gate;

output of the forgetting gate

Also receiving the output result of the last time LSTM model

And input of the current time

is a function of the sigmoid and is,

is a weight matrix of the forgetting gate,

indicating that two vectors are concatenated into one longer vector,

is a biased term for a forgetting gate;

instantaneous state unit value of current time

Expressed as:

is that

The weight matrix of (a) is determined,

is that

The bias term of (a);

the status cell at the current time

Receiving values for the input gate and the forget gate, expressed as:

the cell state at the previous moment is initialized to 1;

output gate

For controlling the output of the LSTM state unit, the expression:

wherein the content of the first and second substances,

is a function of the sigmoid and is,

is a weight matrix of the output gates,

indicating that two vectors are concatenated into one longer vector,

is the bias term of the output gate;

finally, outputting the state unit of the LSTM model at the current moment

Expressed as:

。

6. the text analysis method of claim 5, wherein:

in step S130, the output of the LSTM model, which is the high-level feature vector of the text comment information extracted in step S120, is received

Obtaining emotion prediction results of text comments through softmax function

The calculation formula is as follows:

wherein the content of the first and second substances,

is a function of the sigmoid and is,

is an emotion prediction result

The weight matrix of (a) is determined,

is an emotion prediction result

The bias term of (1).

7. The text analysis method of claim 6, wherein:

when the temperature is higher than the set temperature

When so, the emotion prediction is positive.

8. A storage medium storing computer-executable instructions, characterized in that:

the computer-executable instructions, when executed by a processor, perform a method of text analysis of review information as recited in any of claims 1-7.