CN114861629B - Automatic judgment method for text style - Google Patents


Info

Publication number
CN114861629B
CN114861629B (application CN202210475512.8A)
Authority
CN
China
Prior art keywords
model
text
style
data
label
Prior art date
Legal status
Active
Application number
CN202210475512.8A
Other languages
Chinese (zh)
Other versions
CN114861629A (en)
Inventor
陈峥
陈建树
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202210475512.8A
Publication of CN114861629A
Application granted
Publication of CN114861629B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 - Indexing; Data structures therefor; Storage structures
    • G06F16/316 - Indexing structures
    • G06F16/322 - Trees
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/355 - Class or cluster creation or modification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to an automatic text-style judgment method in the technical field of artificial intelligence. An automatic text-style judgment model is obtained through a pipeline of style-label extraction from specific texts and deep-learning model training and tuning, and the judgment model is deployed in a text evaluation system. The method increases the efficiency of text screening, requires no retraining of the model when a new label is added, accords better with human understanding, and has good extensibility.

Description

Automatic judgment method for text style
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an automatic text-style judgment method.
Background
Some automated text-generation systems currently need to obtain text that satisfies high-level stylistic constraints, for example lyrics that must convey a particular emotion. The computing power of today's devices can automatically generate a large amount of candidate text in a short time, but automated methods for screening and evaluating that text are scarce. The common practice is manual screening, which suffers on two fronts. On the one hand, the sheer volume of text overwhelms a human screener, the amount that can be screened in a given time is extremely limited, and working efficiency is low. On the other hand, physical fatigue, mental strain and even mood swings strongly affect a screener's subjective judgment and therefore the results. Manual screening thus has two major disadvantages: 1. high labor and time costs; 2. evaluation results strongly influenced by subjective factors.
The prior art also contains a small number of methods that use a machine to automatically screen and classify text. These mainly set K labels in advance, abstract the text-classification task into a K-class classification model, and input each sample into the model to obtain a K-dimensional vector in which each dimension represents the probability that the corresponding label holds. This approach has two disadvantages: 1. when the model must judge several labels at once, the mutual-exclusion property lets the highest-probability label suppress the probabilities of the other labels, and even when all labels are false the model still outputs a highest-probability label; 2. when new labels must be added, the data must be reconstructed and the model retrained, so extensibility is limited and efficiency is low.
Disclosure of Invention
In order to solve the above technical problems, the invention provides an automatic text-style judgment method that aims to reduce the heavy workload and the labor and time costs of manual evaluation, improve screening efficiency, reduce the misjudgment rate, and improve the extensibility of the classification model.
In order to achieve the above object, the present invention provides an automatic text-style judgment method comprising the following steps:
step 1), performing syntactic analysis on the existing text comment data with the syntactic analysis tool HanLP to obtain the parsed data result;
step 2), single-datum style label extraction:
a) defining a custom Node class and recursively constructing a multi-way tree bottom-up, restoring the parsed data result to the tree structure of a phrase structure tree, and recording the phrase type and word content of every node in a hash table A;
b) extracting labels with the screening rule that the node type is VP and the number of words contained in the VP lies in a set range, filtering the nodes recorded in hash table A by this rule, and storing the results that meet the condition in another hash table B;
step 3), constructing the full style-label set: performing the operation of step 2) on every piece of data in the database to obtain the final hash table B, sorting in descending order of phrase frequency, and taking the top K phrases as the style labels of the model training data, where candidate labels are compared with the longest-common-substring algorithm and two labels with high textual similarity are not both used; the model is an ALBERT pre-trained model;
step 4), model training data construction: building a binary-classification data set by negative sampling while keeping positive and negative samples balanced; a positive sample is labeled 1, and a negative sample is constructed by randomly selecting a style label that does not appear in the comment, splicing it in the same way, and labeling it 0;
and step 5), training and tuning the deep-learning model: fine-tuning the ALBERT pre-trained model with a deep-learning training framework and verifying performance on a validation set.
Preferably, step 5) specifically comprises the following steps:
a. shuffling the constructed data set and feeding it into the ALBERT pre-trained model sequentially in mini-batches;
b. the ALBERT pre-trained model preprocesses the input: it converts the input into one-hot vectors, performs the embedding operation, and then adds position and segment embeddings, where the segment id of the label text is 0 and the segment id of the essay text is 1;
c. the preprocessed result is fed into the neural network and multiplied with three weight matrices to obtain the matrices Q, K and V, which are passed through the self-attention module to obtain an attention-score matrix between each character and every other character, computed as follows:
Z_i = softmax((Q·K^T)/√d_k + M)·V,
where Z_i is the encoded vector of attention head i, T denotes the matrix transpose, M is the mask matrix, d_k is the hidden-layer vector dimension of a single attention head, and i is a positive integer from 1 to n;
d. multi-head attention concatenates Z_1 to Z_n and passes the result through a linear layer to obtain a final output Z with the same dimension as the multi-head-attention input matrix X;
e. after the final output Z with the same dimension as the input matrix X is obtained, the output Z of the multi-head-attention module is residually connected with X, and then a layer-normalization operation is applied that rescales the inputs of each layer of neurons to zero mean and unit variance, i.e. toward a standard normal distribution;
f. the feed-forward module in the ALBERT pre-trained model processes the result with two fully connected layers so that the output dimension matches the input dimension, then residual connection and layer normalization are applied once more, and the output serves as the input of the next block; this is repeated for N blocks;
g. the CLS vector of the ALBERT pre-trained model is fed into a linear layer and activated, the loss is computed with the binary cross-entropy loss function, and the model parameters are optimized by back-propagation;
h. steps a-g are repeated until model training is complete.
The invention has the beneficial effects that:
1) The style labels are extracted by syntactic analysis, so the obtained label text accords better with human understanding;
2) The invention splices human-readable label text with the input text to construct a binary-classification task, so the model acquires an understanding of the label text; adding a new label therefore requires no retraining of the model, which gives good extensibility and greatly improves text-classification efficiency;
3) The invention greatly reduces labor and time costs, improves screening efficiency, and reduces the misjudgment rate.
Drawings
FIG. 1 is an example diagram of a phrase structure tree used for style keyword extraction in the present invention;
FIG. 2 is a schematic diagram of the classification method of the ALBERT model of the present invention.
Detailed Description
Fig. 1 shows an example of the phrase structure tree obtained when the present invention performs style keyword extraction. For reasons of space, the illustration uses only one short sentence, "Xiao Ming goes to the mall to look at electronic products". A phrase structure tree always contains the words of the sentence as its leaf nodes, while the other, non-leaf nodes represent the constituents of the sentence, typically verb phrases (Verb Phrase, VP) and noun phrases (Noun Phrase, NP).
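FIG. 1 itself is not reproduced here. Purely as an illustration, the parse of that short sentence can be rendered in the same nested-list form that HanLP returns in step 1) below; the word segmentation and bracketing in this sketch are assumptions for readability, not the actual content of FIG. 1:

```python
# Hypothetical parse of the FIG. 1 example sentence in the nested-list form
# used in step 1). The label '_' marks a terminal leaf node holding a word.
example_parse = \
    ['IP', [
        ['NP', [['_', ['小明']]]],                                 # subject NP
        ['VP', [
            ['VP', [['_', ['去']], ['NP', [['_', ['商场']]]]]],     # "go to the mall"
            ['VP', [['_', ['看']], ['NP', [['_', ['电子产品']]]]]],  # "look at electronics"
        ]],
    ]]
```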
The present invention will be described in detail below with reference to a comment on a student essay (as an example of existing text comment data), whose content is: "This prose looks plain yet rewards careful tasting and is rich in meaning; the writer sings the praises of youth from different angles. The language of the article is deep and cadenced and rich in romantic color, and the piece can be called a successful exercise."
The invention provides an automatic judgment method of a text style, which comprises the following steps:
step 1), syntactic analysis is performed on the existing text comment data with the syntactic analysis tool HanLP, and a data result with a python list structure is obtained. The result is a nested list of the form ['IP', [['NP', [...]], ['VP', [...]], ...]]: every element pairs a phrase label (such as IP, NP, VP, PP, ADJP, ADVP, CP, QP, CLP, DP, DNP, VRD, VCD) with the list of its children, and the label '_' marks a terminal leaf node whose single child is a word of the sentence. The full parse of the example comment consists of several such IP subtrees, one per clause; the Chinese leaf words of the original output were lost in reproduction and are therefore omitted here;
step 2), single-datum style label extraction:
a) defining a custom Node class and recursively constructing a multi-way tree bottom-up, restoring the parsed data result to the tree structure of a phrase structure tree, and recording the phrase type and word content of every node in a hash table A;
b) extracting labels with the screening rule that the node type is VP and the number of words contained in the VP lies in a set range, filtering the nodes recorded in hash table A by this rule, and storing the results that meet the condition in another hash table B;
through step 2), phrases meeting the conditions are obtained, such as "looks plain yet rewards careful tasting", "rich in meaning", "deep and cadenced" and "rich in romantic color"; a minimal sketch of this extraction is given below;
step 3), constructing the full style-label set: the operation of step 2) is performed on every piece of data in the database to obtain the final hash table B, which is sorted in descending order of phrase frequency, and the top K phrases are taken as the style labels of the model training data (the value of K is determined by subsequent experiments); candidate style labels are compared with the longest-common-substring algorithm, and two labels with high textual similarity are not both used (the threshold length of the longest common substring can be chosen freely); the model is an ALBERT pre-trained model; a sketch of this selection follows;
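A minimal sketch of the step 3) selection, assuming the frequency table table_b from the previous sketch; the similarity threshold min_overlap stands in for the freely chosen longest-common-substring length:

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest common substring of a and b (dynamic programming)."""
    best = 0
    dp = [0] * (len(b) + 1)           # dp[j]: common suffix length ending at b[j-1]
    for ch_a in a:
        prev = 0                      # dp[j-1] from the previous row
        for j, ch_b in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if ch_a == ch_b else 0
            best = max(best, dp[j])
            prev = cur
    return best

def select_style_labels(table_b, k, min_overlap=3):
    """Take the K most frequent phrases, skipping any phrase whose longest
    common substring with an already chosen label is too long."""
    labels = []
    for phrase, _freq in sorted(table_b.items(), key=lambda kv: -kv[1]):
        if all(longest_common_substring(phrase, kept) < min_overlap for kept in labels):
            labels.append(phrase)
        if len(labels) == k:
            break
    return labels
```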
step 4), model training data construction: a binary-classification data set is built by negative sampling, keeping positive and negative samples balanced. A positive sample, for example, splices a label from the comment with the essay as "[CLS] looks plain yet rewards careful tasting [SEP] Eighteen, a flower-like age; when the wind of youth blows, the breath of the young always brims over our lives …", with label 1; a style label that does not appear in the comment, such as "good at citing typical examples", is then selected at random and spliced in the same way to form a negative sample with label 0. Because the number of labels is relatively large and positive samples are sparse, the data set is constructed by negative sampling; a sketch follows;
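A sketch of this construction under stated assumptions: samples pairs each essay with the style labels actually extracted from its comment, the 1:1 positive-to-negative ratio realizes the balance requirement, and the "[CLS] label [SEP] essay" splicing itself is deferred to the tokenizer in step 5 b):

```python
import random

def build_dataset(samples, all_labels):
    """samples: list of (comment_labels, essay_text) pairs.
    Returns (label_text, essay_text, 0-or-1) triples, one negative sampled
    per positive so that the two classes stay balanced (step 4))."""
    dataset = []
    for comment_labels, essay in samples:
        for pos in comment_labels:
            dataset.append((pos, essay, 1))      # positive: label fits the essay
            neg = random.choice([l for l in all_labels if l not in comment_labels])
            dataset.append((neg, essay, 0))      # negative: label does not fit
    random.shuffle(dataset)                       # step 5 a) shuffling
    return dataset
```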
step 5), training and tuning the deep-learning model: the ALBERT pre-trained model is fine-tuned with a deep-learning training framework and its performance is verified on a validation set, specifically comprising the following steps:
a. shuffling the constructed data set and feeding it into the ALBERT pre-trained model sequentially in mini-batches;
b. the ALBERT pre-trained model preprocesses the input: it converts the input into one-hot vectors and performs the embedding operation, then adds position and segment embeddings, where the segment id of the label text is 0 and the segment id of the essay text is 1 (see the tokenizer sketch after step h);
c. the preprocessed result is fed into the neural network and multiplied with three weight matrices to obtain the matrices Q, K and V, which are passed through the self-attention module to obtain an attention-score matrix between each character and every other character, computed as follows (a PyTorch sketch of this computation also follows step h):
Z_i = softmax((Q·K^T)/√d_k + M)·V,
where Z_i is the encoded vector of attention head i, T denotes the matrix transpose, M is the mask matrix, d_k is the hidden-layer vector dimension of a single attention head, and i is a positive integer from 1 to n;
d. multi-head attention (Multi-Head Attention) concatenates (concat) Z_1 to Z_n and passes the result through a linear layer to obtain a final output Z with the same dimension as the multi-head-attention input matrix X;
e. the Add & Norm layer in the ALBERT pre-trained model consists of an Add part and a Norm part: after the final output Z with the same dimension as the input matrix X is obtained, the output Z of the multi-head-attention module is residually connected (Add) with X, and then layer normalization (Layer Normalization) is applied, rescaling the inputs of each layer of neurons to zero mean and unit variance, i.e. LayerNorm(X + Z);
f. the feed-forward (Feed Forward) module in the ALBERT pre-trained model processes the result with two fully connected layers so that the output dimension matches the input dimension, then residual connection and layer normalization are applied once more, and the output serves as the input of the next block; this is repeated for N blocks;
g. the CLS vector of the ALBERT pre-trained model is fed into a linear layer and activated, the loss is computed with the binary cross-entropy loss function, and the model parameters are optimized by back-propagation; the loss is computed as
loss = -[y_n·log(x_n) + (1 - y_n)·log(1 - x_n)],
where y_n is the true label with value range {0,1} and x_n is the model's output probability that the sample is positive, with value range (0,1);
h. steps a-g are repeated until model training is complete.
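To make sub-steps b) to g) concrete, two minimal sketches follow. The first implements the masked single-head attention formula of sub-step c) in PyTorch; the tensor shapes and the additive form of the mask are assumptions, since the patent prescribes only the mathematics:

```python
import math
import torch

def masked_self_attention(x, w_q, w_k, w_v, mask):
    """Z = softmax(Q·K^T / sqrt(d_k) + M)·V for one attention head.
    x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) weight matrices;
    mask: additive mask M, 0 where attention is allowed and a large negative
    value (e.g. -1e9) at positions to be ignored."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # the three weight multiplications
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k) + mask
    return torch.softmax(scores, dim=-1) @ v     # attention-weighted values Z_i
```

The second sketch shows the segment ids of sub-step b) and one optimization step of sub-step g) with the Hugging Face transformers library; the checkpoint path is a placeholder for whichever Chinese ALBERT weights are used, and the two-logit head trained with cross-entropy is one standard way to realize the binary loss above:

```python
import torch
from transformers import AutoTokenizer, AlbertForSequenceClassification

CHECKPOINT = "path/to/chinese-albert"   # placeholder, not a real model id
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AlbertForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(label_texts, essay_texts, targets):
    """One mini-batch update. Encoding a text pair yields
    '[CLS] label [SEP] essay [SEP]' with token_type_ids (segment ids)
    0 for the label segment and 1 for the essay segment."""
    enc = tokenizer(label_texts, essay_texts, truncation=True,
                    padding=True, return_tensors="pt")
    out = model(**enc, labels=torch.tensor(targets))   # targets: list of 0/1
    out.loss.backward()                  # cross-entropy over the two classes
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```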
The present invention and its embodiments have been described above; the description is not restrictive, and the drawings show only one embodiment, to which the actual structure is not limited. In summary, those skilled in the art can devise, without inventive effort, structural modes and embodiments similar to this technical solution without departing from the spirit of the invention, and all of these fall within the scope of protection of the invention.

Claims (3)

1. An automatic text-style judgment method, characterized by comprising the following steps:
step 1), performing syntactic analysis on the existing text comment data with the syntactic analysis tool HanLP to obtain a data result with a python list structure;
step 2), single-datum style label extraction:
a) defining a custom Node class and recursively constructing a multi-way tree bottom-up, restoring the parsed data result to the tree structure of a phrase structure tree, and recording the phrase type and word content of every node in a hash table A;
b) extracting labels with the screening rule that the node type is VP and the number of words is 3-5, filtering the data nodes of hash table A by this rule, and storing the results that meet the condition in another hash table B, where VP denotes a verb phrase;
step 3), constructing the full style-label set: performing the operation of step 2) on every piece of data in the database to obtain the final hash table B, sorting in descending order of phrase frequency, and taking the top K phrases as the style labels of the model training data, where the model is an ALBERT pre-trained model;
step 4), model training data construction: building a binary-classification data set by negative sampling while keeping positive and negative samples balanced, marking a positive sample with label 1, then randomly selecting a style label that does not appear in the comment, constructing a negative sample in the same spliced form, and marking it with label 0;
step 5), deep-learning model training and tuning: fine-tuning the ALBERT pre-trained model with a deep-learning training framework and verifying performance on a validation set.
2. The automatic text-style judgment method according to claim 1, wherein in step 3) candidate style labels are checked for duplication with the longest-common-substring algorithm, and two labels with high textual similarity cannot both be used.
3. The automatic text-style judgment method according to claim 2, wherein step 5) specifically comprises the following steps:
a. shuffling the constructed data set and feeding it into the ALBERT pre-trained model sequentially in mini-batches;
b. the ALBERT pre-trained model preprocesses the input: it converts the input into one-hot vectors, performs the embedding operation, and then adds position and segment embeddings, where the segment id of the label text is 0 and the segment id of the essay text is 1;
c. the preprocessed result is fed into the neural network and multiplied with three weight matrices to obtain the matrices Q, K and V, which are passed through the self-attention module to obtain an attention-score matrix between each character and every other character, computed as follows:
Z_i = softmax((Q·K^T)/√d_k + M)·V,
where Z_i is the encoded vector of attention head i, T denotes the matrix transpose, M is the mask matrix, d_k is the hidden-layer vector dimension of a single attention head, and i is a positive integer from 1 to n;
d. multi-head attention concatenates Z_1 to Z_n and passes the result through a linear layer to obtain a final output Z with the same dimension as the multi-head-attention input matrix X;
e. after the final output Z with the same dimension as the input matrix X is obtained, the output Z of the multi-head-attention module is residually connected with X, and then layer normalization is applied, rescaling the inputs of each layer of neurons to zero mean and unit variance, i.e. LayerNorm(X + Z);
f. the feed-forward module in the ALBERT pre-trained model processes the result with two fully connected layers so that the output dimension matches the input dimension, then residual connection and layer normalization are applied once more, and the output serves as the input of the next block; this is repeated for N blocks;
g. the CLS vector of the ALBERT pre-trained model is fed into a linear layer and activated, the loss is computed with the binary cross-entropy loss function, and the model parameters are optimized by back-propagation, the loss being computed as
loss = -[y_n·log(x_n) + (1 - y_n)·log(1 - x_n)],
where y_n is the true label with value range {0,1} and x_n is the model's output probability that the sample is positive, with value range (0,1);
h. steps a-g are repeated until model training is complete.
CN202210475512.8A 2022-04-29 2022-04-29 Automatic judgment method for text style Active CN114861629B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210475512.8A CN114861629B (en) 2022-04-29 2022-04-29 Automatic judgment method for text style

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210475512.8A CN114861629B (en) 2022-04-29 2022-04-29 Automatic judgment method for text style

Publications (2)

Publication Number Publication Date
CN114861629A CN114861629A (en) 2022-08-05
CN114861629B 2023-04-04

Family

ID=82635015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210475512.8A Active CN114861629B (en) 2022-04-29 2022-04-29 Automatic judgment method for text style

Country Status (1)

Country Link
CN (1) CN114861629B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122416A (en) * 2017-03-31 2017-09-01 北京大学 A kind of Chinese event abstracting method
KR20210037934A (en) * 2019-09-30 2021-04-07 한국과학기술원 Method and system for trust level evaluationon personal data collector with privacy policy analysis
CN112101004A (en) * 2020-09-23 2020-12-18 电子科技大学 General webpage character information extraction method based on conditional random field and syntactic analysis
CN112214599A (en) * 2020-10-20 2021-01-12 电子科技大学 Multi-label text classification method based on statistics and pre-training language model
CN113158674A (en) * 2021-04-01 2021-07-23 华南理工大学 Method for extracting key information of document in field of artificial intelligence

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Alberto Holts et al. Automated Text Binary Classification Using Machine Learning Approach. 2010 XXIX International Conference of the Chilean Computer Science Society. 2011, pp. 212-217. *
景栋盛 et al. A spam text classification method based on a deep Q-network (基于深度Q网络的垃圾邮件文本分类方法). 计算机与现代化 (Computer and Modernization), 2020, No. 06, pp. 89-94. *

Also Published As

Publication number Publication date
CN114861629A (en) 2022-08-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant