CN114925687A - Chinese composition scoring method and system based on dynamic word vector representation - Google Patents

Chinese composition scoring method and system based on dynamic word vector representation

Info

Publication number
CN114925687A
Authority
CN
China
Prior art keywords
composition
chinese
chinese composition
word
scoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210536676.7A
Other languages
Chinese (zh)
Inventor
蔡远利
刘美
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202210536676.7A
Publication of CN114925687A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese composition scoring method and system based on dynamic word vector representation. The method comprises the following steps: acquiring a word-segmented Chinese composition text to be scored; inputting the word-segmented Chinese composition text to be scored into a pre-trained Chinese composition scoring model, and outputting a scoring result through the Chinese composition scoring model. The method is based on deep learning: it introduces a Bidirectional Long Short-Term Memory (BiLSTM) network and dynamically trains the Chinese word vectors of the compositions, so that the accuracy of composition scoring can be ensured.

Description

Chinese composition scoring method and system based on dynamic word vector representation
Technical Field
The invention belongs to the technical field of text classification in natural language processing, and particularly relates to a Chinese composition scoring method and system based on dynamic word vector representation.
Background
The Chinese language embodies the excellent traditional culture of China and carries profound cultural meaning and spiritual strength. In Chinese-language examinations the composition accounts for a considerable proportion of the marks, so the composition score largely determines a student's result in the subject. Traditional manual marking of compositions has several problems, such as a heavy workload for teachers, one-dimensional feedback, and subjectivity. Education systems around the world have therefore gradually adopted computers in place of manual marking and developed Automated Essay Scoring (AES) systems.
Early computer-based composition scoring relied mainly on statistics and machine learning. Specifically, composition scoring standards and teaching research were analyzed, criteria for evaluating composition quality were summarized, well-defined and measurable feature indexes related to composition quality were extracted, and scoring was then carried out with machine learning techniques such as regression and classification. With the development of natural language processing, and of natural language understanding in particular, automatic composition scoring methods adopted more statistical features and deeper grammatical and semantic features; for example, techniques such as latent semantic analysis and the vector space model were used to extract features at the levels of language, content and structure, after which machine learning was used to score the composition. However, scoring methods based on statistics and machine learning require manual extraction of text features, which is costly; their scoring mechanism is overly simple and their scoring mode is rigid, so they cannot judge complex structures and content.
In recent years, with the development of deep learning, many scoring models built with deep learning have appeared. Existing deep-learning schemes for composition scoring mainly use Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), and achieve better results than methods based on statistics and machine learning. Specifically, in the prior art the Chinese text is segmented so that the composition is divided into word units, and the word vectors are then initialized either randomly or from Word2Vec; once the word vectors of the text are obtained, they are input into a convolutional or recurrent neural network for feature extraction, and finally a single linear layer maps the output vector to a score between 0 and 1. Deep-learning-based scoring does not require manual extraction of text features: the features are learned automatically by the neural network, the model can extract features at the levels of language, content and structure, its generalization ability is stronger, and it can capture more complex language structures. However, existing methods for scoring Chinese compositions still have some disadvantages, mainly the following:
(1) Existing Chinese word vector representations are coarse. English and Chinese word vectors differ: in English, each word generally corresponds to one word vector of fixed dimension; in Chinese, it is the "word" (rather than the single character) that expresses complete semantics, and each word generally corresponds to one word vector of fixed dimension. The number of words is far larger than the number of characters, so training word vectors with the word as the unit consumes not only storage space but also a large amount of computing resources, which leads to inaccurate Chinese word vectors and unsatisfactory scoring results. If an existing large-scale pre-trained Chinese word vector model is used instead, the scoring result is also unsatisfactory when the model is applied directly to Chinese composition scoring, because word vectors differ between domains.
(2) A recurrent neural network model such as a Long Short-Term Memory (LSTM) network can only predict the output at the next time step from the sequence information of earlier time steps; that is, the model can only capture the relationship between a word and the words that precede it in a sentence. In some sentences, however, the output at the current time step is related not only to previous states but possibly also to future states. For example, to understand the meaning of a word in a sentence, both the preceding and the following text must be considered, so that the judgment is truly based on context.
Based on the above analysis, existing automatic scoring methods for Chinese compositions cannot achieve the desired accuracy in their scoring results.
Disclosure of Invention
The invention aims to provide a Chinese composition scoring method and system based on dynamic word vector representation, so as to solve one or more of the above technical problems. The method is based on deep learning, introduces a Bidirectional Long Short-Term Memory (BiLSTM) network, and dynamically trains the Chinese word vectors of the compositions, thereby ensuring the accuracy of composition scoring.
To achieve this purpose, the invention adopts the following technical solutions:
The invention provides a Chinese composition scoring method based on dynamic word vector representation, which comprises the following steps:
acquiring a word-segmented Chinese composition text to be scored;
inputting the word-segmented Chinese composition text to be scored into a pre-trained Chinese composition scoring model, and outputting a scoring result through the Chinese composition scoring model;
wherein the Chinese composition scoring model comprises:
an input layer, which receives the word-segmented Chinese composition text and converts the words in it into a sequence matrix based on a pre-trained binary word vector dictionary;
an embedding layer, which receives the sequence matrix and converts it into a word vector matrix for output;
a BiLSTM layer, which receives the word vector matrix, performs feature extraction, and outputs the composition context feature vector;
a fully connected layer, which receives the composition context feature vector and outputs a classification result vector;
and an output layer, which receives the classification result vector and outputs the scoring result.
In a further improvement of the method, the step of obtaining the pre-trained Chinese composition scoring model comprises:
acquiring a training sample set, in which each training sample comprises a scoring label and the word-segmented Chinese composition text of the sample;
during each training update, inputting the word-segmented Chinese composition text of the selected training sample into the Chinese composition scoring model to obtain a predicted score; measuring the difference between the predicted score and the scoring label of the selected training sample with a cross-entropy loss function, and updating the parameters of the Chinese composition scoring model accordingly; when a preset convergence condition is reached, the pre-trained Chinese composition scoring model is obtained.
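The loss computation in the training step above can be illustrated with a minimal pure-Python sketch. It assumes that the classification result vector is turned into probabilities with a softmax (the source only says the outputs are mapped into the (0, 1) interval), and the grade order, the example logits and the function names are hypothetical.

```python
import math

GRADES = ["excellent", "good", "medium"]  # grade order is an assumption


def softmax(logits):
    """Map raw output values to probabilities in (0, 1) that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label_index):
    """Cross-entropy loss of one sample against its integer grade label."""
    probs = softmax(logits)
    return -math.log(probs[label_index])


logits = [2.0, 0.5, -1.0]  # hypothetical classification result vector
probs = softmax(logits)
predicted = GRADES[probs.index(max(probs))]
loss_if_excellent = cross_entropy(logits, GRADES.index("excellent"))
loss_if_medium = cross_entropy(logits, GRADES.index("medium"))
```

The loss is small when the label agrees with the most probable grade and large otherwise, which is the signal used to update the model parameters.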
In a further improvement of the method, the Chinese composition scoring model further comprises:
a Dropout layer disposed between the BiLSTM layer and the fully connected layer; the Dropout layer receives the composition context feature vector output by the BiLSTM layer and randomly sets a preset proportion of the neuron outputs to 0 to obtain a processed composition context feature vector; the processed composition context feature vector is input into the fully connected layer.
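As an illustration of the Dropout step, the sketch below zeroes a fixed proportion of the entries of a feature vector. The function name, the feature values and the rate of 0.3 are hypothetical; the source does not state the proportion it uses.

```python
import random


def dropout(features, rate, rng):
    """Set a fixed proportion `rate` of the entries in `features` to 0.

    Exactly round(rate * len(features)) randomly chosen positions are zeroed.
    """
    n_drop = round(rate * len(features))
    dropped = set(rng.sample(range(len(features)), n_drop))
    return [0.0 if i in dropped else v for i, v in enumerate(features)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
context_features = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1, 0.2, -0.4, 0.9, 0.6]
processed = dropout(context_features, rate=0.3, rng=rng)
```

Deep learning frameworks usually differ from this sketch in two respects: each output is dropped independently with the given probability, and the surviving outputs are rescaled during training. Neither detail is specified in the source.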
In a further improvement of the method, the step of obtaining the pre-trained binary word vector dictionary comprises:
forming a new training sample set from the word-segmented Chinese composition texts of the training samples in the training sample set used for the Chinese composition scoring model;
based on the new training sample set, training the word vectors with the Word2Vec function in the gensim library to obtain the pre-trained binary word vector dictionary.
In a further improvement of the method, the Word2Vec function uses either an algorithm that predicts the central word from its context or an algorithm that predicts the context from the central word.
In a further improvement of the method, the step of acquiring the word-segmented Chinese composition text to be scored specifically comprises:
acquiring an initial Chinese composition to be scored;
performing data cleaning on the initial Chinese composition to be scored to unify its format, obtaining the cleaned Chinese composition;
performing word segmentation on the cleaned Chinese composition with the jieba.cut() function to obtain the word-segmented Chinese composition text to be scored.
The invention also provides a Chinese composition scoring system based on dynamic word vector representation, which comprises:
a data acquisition module for acquiring a word-segmented Chinese composition text to be scored;
a result acquisition module for inputting the word-segmented Chinese composition text to be scored into a pre-trained Chinese composition scoring model and outputting a scoring result through the Chinese composition scoring model;
wherein the Chinese composition scoring model comprises:
an input layer, which receives the word-segmented Chinese composition text and converts the words in it into a sequence matrix based on a pre-trained binary word vector dictionary;
an embedding layer, which receives the sequence matrix and converts it into a word vector matrix for output;
a BiLSTM layer, which receives the word vector matrix, performs feature extraction, and outputs the composition context feature vector;
a fully connected layer, which receives the composition context feature vector and outputs a classification result vector;
and an output layer, which receives the classification result vector and outputs the scoring result.
In a further improvement of the system, the step of obtaining the pre-trained Chinese composition scoring model comprises:
acquiring a training sample set, in which each training sample comprises a scoring label and the word-segmented Chinese composition text of the sample;
during each training update, inputting the word-segmented Chinese composition text of the selected training sample into the Chinese composition scoring model to obtain a predicted score; measuring the difference between the predicted score and the scoring label of the selected training sample with a cross-entropy loss function, and updating the parameters of the Chinese composition scoring model accordingly; when a preset convergence condition is reached, the pre-trained Chinese composition scoring model is obtained.
In a further improvement of the system, the Chinese composition scoring model further comprises:
a Dropout layer disposed between the BiLSTM layer and the fully connected layer; the Dropout layer receives the composition context feature vector output by the BiLSTM layer and randomly sets a preset proportion of the neuron outputs to 0 to obtain a processed composition context feature vector; the processed composition context feature vector is input into the fully connected layer.
In a further improvement of the system, the step of obtaining the pre-trained binary word vector dictionary comprises:
forming a new training sample set from the word-segmented Chinese composition texts of the training samples in the training sample set used for the Chinese composition scoring model;
based on the new training sample set, training the word vectors with the Word2Vec function in the gensim library to obtain the pre-trained binary word vector dictionary.
Compared with the prior art, the invention has the following beneficial effects:
In the method provided by the invention, to address the problem that a long short-term memory network can only predict the output at the next time step from the sequence information of earlier time steps, a bidirectional long short-term memory network is introduced, and the feature indexes of the composition are extracted using the context information of the words, so that the timeliness and accuracy of composition scoring can be ensured.
In addition, to address the problem of inaccurate word vectors caused by using existing large-scale pre-trained Chinese word vectors, the invention introduces a dynamic word vector training mode: during training, the word vector of each Chinese word is not fixed but changes dynamically as the model is trained, which ultimately improves the accuracy of composition scoring.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below; it is obvious that the drawings in the following description are some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic flowchart of a Chinese composition scoring method based on dynamic word vector representation according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a neural network structure of a Chinese composition scoring model based on dynamic word vector representation according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a hidden layer of the bidirectional long short-term memory network according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention is described in further detail below with reference to the accompanying drawings:
Referring to FIG. 1, the Chinese composition scoring method based on dynamic word vector representation according to the embodiment of the present invention provides an automatic Chinese composition scoring model: a composition is input and the model outputs an accurate score grade for it. The scoring process, shown in FIG. 1, comprises a training process for the Chinese composition scoring model and a prediction process for composition scoring.
In the training process of the Chinese composition scoring model, the database of scored compositions is first preprocessed; Word2Vec is then used to train word vectors on the processed texts; the trained word vectors are input into the bidirectional long short-term memory network for feature extraction; finally, a single fully connected layer maps the outputs of a plurality of neurons into the (0, 1) interval to perform multi-class classification. The structures and parameters of the trained word vector model, the trained feature extraction network and the trained classification linear layer are then saved for use in model prediction.
In the Chinese composition scoring process, i.e. the model prediction process, the composition to be scored is first preprocessed; the word vector model saved during training is then used to represent the composition text as word vectors; the word vectors of the text are input into the trained model for feature extraction and scoring; and the last layer of the model outputs the score grade of the composition.
The embodiment of the invention provides a Chinese composition scoring method based on dynamic word vector representation, which can assist teachers in marking compositions, help students improve their writing and their understanding of composition, and offer a new way of applying deep learning to intelligent education. The method can ensure the timeliness and accuracy of composition scoring: the composition scoring model is trained online, and in prediction the model is applied directly to the composition. In addition, the composition scoring standard does not need to be studied in depth and the quality evaluation criteria do not need to be summarized; only a rough understanding is needed, so little manual participation is required. The introduced bidirectional long short-term memory network extracts the feature indexes of the composition using the context information of the words, and can make effective use of the textual context so that the feature indexes used for scoring are more accurate. During training of the scoring model the word vectors are trained dynamically, and the word vectors finally obtained better represent the relationships among the language, content and structure of words, so that they are more accurate.
In summary, the method provided by the embodiment of the present invention uses the bidirectional long short-term memory network from deep learning to train on and predict the scores of Chinese compositions, while the word vectors are trained dynamically during model training; this improves the model's use of the input information and ultimately ensures the timeliness and accuracy of composition scoring.
The following describes the training and prediction process in detail.
(1) Data preprocessing, comprising:
The first step is to clean the scored compositions; for example, some compositions are empty files and need to be deleted, and the text encoding is uniformly converted to UTF-8. The scored compositions are arranged in the format <content, result> and stored one by one in a CSV file, where content is the content of the composition and result is its score, which can be divided into three grades: excellent, good and medium.
The second step is word segmentation of the composition text: the jieba.cut() function is used to segment the composition data. jieba offers three segmentation modes: accurate mode (segmenting the sentence as accurately as possible), full mode (outputting all words that can be formed from the sentence) and search engine mode (further segmenting the long words produced by accurate mode). The embodiment of the invention uses the default accurate mode by way of example.
The final preprocessed composition text can be expressed as
[Formula image BDA0003648548350000081, not reproduced in the text]
where $x_i$ denotes the i-th composition text, $y_i$ denotes the score of that composition, the symbols in
[Formula image BDA0003648548350000082, not reproduced in the text]
denote the individual words of the i-th composition text after word segmentation and stop-word removal, and $T_{x_i}$ denotes the length of the i-th text sequence.
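The cleaning and storage part of step (1) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: the sample compositions, the function names and the in-memory buffer are hypothetical, and the jieba segmentation step is left out so that the sketch has no third-party dependency.

```python
import csv
import io

GRADES = {"excellent", "good", "medium"}  # the three grades named in the text


def clean_records(records):
    """Drop empty compositions and strip surrounding whitespace from the rest."""
    cleaned = []
    for content, result in records:
        content = content.strip()
        if content and result in GRADES:
            cleaned.append((content, result))
    return cleaned


def to_csv(records):
    """Serialize <content, result> rows, one composition per row."""
    buffer = io.StringIO()  # a real run would open a file with encoding="utf-8"
    writer = csv.writer(buffer)
    writer.writerow(["content", "result"])
    writer.writerows(records)
    return buffer.getvalue()


raw = [
    ("春天来了，花开了。\n", "excellent"),
    ("", "good"),  # empty file: removed by cleaning
    ("我的家乡很美。", "medium"),
]
rows = clean_records(raw)
csv_text = to_csv(rows)
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

Reading the CSV back with `csv.DictReader` gives one dictionary per composition, which is a convenient input for the later segmentation and training steps.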
(2) Training the word vectors, comprising:
Word2Vec is a word representation method that is trained on context: it uses context information to represent each word as a vector of fixed dimension. Unlike one-hot encoding, in which each word is a sparse vector with a single 1 and 0 in all other positions, each Word2Vec word vector is a dense vector of fixed dimension. First, Word2Vec word vectors directly reduce storage and computation overhead. Second, in terms of deeper semantic understanding, the trained word vectors can use context information to identify similar words. Word2Vec has two implementations: CBOW (Continuous Bag of Words), which predicts the central word from its context, and Skip-Gram, which predicts the context from the central word. The embodiment of the invention adopts the CBOW model.
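The difference between the two Word2Vec variants can be seen from the training examples each one draws from a sentence. The sketch below is illustrative only: the function names, the sample sentence and the window size m = 1 are hypothetical, and it builds the examples without training anything.

```python
def cbow_examples(words, m):
    """Yield (context_words, central_word): the context predicts the central word."""
    for t, central in enumerate(words):
        context = words[max(0, t - m):t] + words[t + 1:t + 1 + m]
        if context:
            yield context, central


def skip_gram_examples(words, m):
    """Yield (central_word, context_word): the central word predicts each context word."""
    for context, central in cbow_examples(words, m):
        for background in context:
            yield central, background


sentence = ["我", "喜欢", "写", "作文"]
cbow = list(cbow_examples(sentence, m=1))
skip = list(skip_gram_examples(sentence, m=1))
```

With m = 1, CBOW produces one example per word, with up to two context words each, while Skip-Gram produces one example per (central word, context word) pair.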
The CBOW model predicts a central word from the background words before and after it. The CBOW model needs to maximize the probability of generating each central word given its background words, i.e.
[Formula image BDA0003648548350000091, not reproduced in the text]
Maximum likelihood estimation of the above expression is equivalent to minimizing the following loss function:
[Formula image BDA0003648548350000092, not reproduced in the text]
where $T_{x_i}$ denotes the length of the i-th text sequence, the symbol in
[Formula image BDA0003648548350000093, not reproduced in the text]
denotes the t-th word of the i-th text sequence, and m denotes the size of the sliding window, i.e. m words before and after the central word.
Each word in the dictionary has two vectors, written v and u, one for its role as a central word and one for its role as a background word. Let the central word
[Formula image BDA0003648548350000094, not reproduced in the text]
have index c in the dictionary and the background word
[Formula image BDA0003648548350000095, not reproduced in the text]
have index o in the dictionary. These two vectors of every word in the dictionary are the model parameters to be learned by the CBOW model. To bring the model parameters into the loss function, the probability in the loss function of generating the central word given the background words is expressed in terms of the model parameters; it is defined by the Softmax function:
[Formula image BDA0003648548350000096, not reproduced in the text]
When the text sequence length $T_{x_i}$ is large, a shorter subsequence is typically sampled at random in each iteration and the loss is computed on that subsequence. The gradient of the loss with respect to the word vectors is then computed and the word vectors are updated iteratively. Differentiating the logarithm of the conditional probability in the above formula with respect to any background word vector $v_{o_j}$ ($j = 1, \dots, 2m$) gives the gradient
[Formula image BDA0003648548350000101, not reproduced in the text]
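For orientation, the textbook CBOW formulation that matches the description above is given below. It is not a transcription of the formula images, which are not available in the text. It assumes that $v$ denotes background word vectors and $u$ denotes central word vectors, writes $T$ for the sequence length $T_{x_i}$, $w^{(t)}$ for the t-th word, and $\mathcal{V}$ for the set of dictionary indices.

```latex
% Likelihood to maximize and the equivalent loss to minimize
\prod_{t=1}^{T} P\bigl(w^{(t)} \mid w^{(t-m)}, \dots, w^{(t-1)}, w^{(t+1)}, \dots, w^{(t+m)}\bigr)
\qquad
-\sum_{t=1}^{T} \log P\bigl(w^{(t)} \mid w^{(t-m)}, \dots, w^{(t-1)}, w^{(t+1)}, \dots, w^{(t+m)}\bigr)

% Softmax definition of the conditional probability
P\bigl(w_c \mid w_{o_1}, \dots, w_{o_{2m}}\bigr)
  = \frac{\exp\bigl(\tfrac{1}{2m}\, u_c^{\top} (v_{o_1} + \dots + v_{o_{2m}})\bigr)}
         {\sum_{i \in \mathcal{V}} \exp\bigl(\tfrac{1}{2m}\, u_i^{\top} (v_{o_1} + \dots + v_{o_{2m}})\bigr)}

% Gradient of the log-probability with respect to a background word vector
\frac{\partial \log P\bigl(w_c \mid w_{o_1}, \dots, w_{o_{2m}}\bigr)}{\partial v_{o_j}}
  = \frac{1}{2m} \Bigl( u_c - \sum_{i \in \mathcal{V}} P\bigl(w_i \mid w_{o_1}, \dots, w_{o_{2m}}\bigr)\, u_i \Bigr),
  \qquad j = 1, \dots, 2m
```

The gradient follows from differentiating the log of the softmax: the first term comes from the numerator and the second from the normalizing sum, each carrying the factor $1/(2m)$ from the averaged context.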
In the invention, the Word2Vec function in the gensim library is used to train the word vectors; the input data are the composition texts stored in the CSV file after data preprocessing. The trained word vectors take the form of a dictionary in which each word corresponds to one word vector, and they are stored in a binary file. The word vectors have the form
$w_1 \rightarrow [w_{1,1}, w_{1,2}, \dots, w_{1,n-1}, w_{1,n}]$
$w_2 \rightarrow [w_{2,1}, w_{2,2}, \dots, w_{2,n-1}, w_{2,n}]$
$\dots$
$w_{T-1} \rightarrow [w_{T-1,1}, w_{T-1,2}, \dots, w_{T-1,n-1}, w_{T-1,n}]$
$w_T \rightarrow [w_{T,1}, w_{T,2}, \dots, w_{T,n-1}, w_{T,n}]$
where $w_i$ denotes a word, $[w_{i,1}, w_{i,2}, \dots, w_{i,n-1}, w_{i,n}]$ denotes its word vector, T is the number of words in the dictionary, and n is the dimension of the word vector.
(3) Feature extraction and composition scoring, comprising:
Referring to FIG. 2, the embodiment of the present invention extracts features of the composition text with a bidirectional long short-term memory (BiLSTM) network based on dynamic word vectors, followed by a fully connected layer that maps the outputs of a plurality of neurons into the (0, 1) interval for multi-class classification and outputs the grade of the composition (excellent, good or medium), thereby completing the scoring of the composition. The structure is shown in FIG. 2.
In the embodiment of the invention, the Embedding layer is introduced to realize the learning of dynamic word vectors; the Embedding layer contains the word vectors of all the words obtained in step (2) (training the word vectors), i.e. the word vector dictionary.
When a word-segmented composition is input into the network, the word sequence $[word_1, word_2, \dots, word_M]$ is first converted, according to the binary word vector dictionary, into $[index_1, index_2, \dots, index_M]$, where $index_i$ is the sequence number of the i-th word $word_i$ of the composition in the word vector dictionary and M is the number of words in the text. $[index_1, index_2, \dots, index_M]$ is then input into the Embedding layer, which looks up the word vector corresponding to each word and outputs $[x^{<1>}, x^{<2>}, \dots, x^{<M>}]$.
Using the Embedding layer in the embodiment of the invention has two advantages. First, it saves the time needed to convert words into word vectors: after the composition text is input, each word is converted into an index, so the word vector dictionary does not have to be searched word by word, and all words in the composition text can be converted into word vectors with a single matrix multiplication. Second, it makes dynamic word vectors possible: the word vector dictionary is placed in the Embedding layer, so the parameters of each word vector change during model training, and as the classification accuracy improves during training the word vectors represent each word more accurately.
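The "single matrix multiplication" can be illustrated as follows: multiplying a matrix of one-hot rows (one row per word index) by the embedding matrix gives the same result as picking rows by index. This is a minimal pure-Python sketch with made-up values; the helper names are hypothetical, and deep learning frameworks typically implement the lookup directly rather than materialising one-hot rows.

```python
def one_hot(index, size):
    """Row vector with a 1.0 at position `index` and 0.0 elsewhere."""
    return [1.0 if j == index else 0.0 for j in range(size)]


def matmul(a, b):
    """Multiply an (M x T) matrix by a (T x n) matrix given as nested lists."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


# Toy embedding matrix with T = 3 words and n = 2 dimensions (made-up values).
embedding_matrix = [[0.5, -1.0], [2.0, 0.25], [-0.75, 1.5]]
indices = [2, 0, 2]

one_hot_rows = [one_hot(i, len(embedding_matrix)) for i in indices]
via_matmul = matmul(one_hot_rows, embedding_matrix)    # one multiplication
via_lookup = [embedding_matrix[i] for i in indices]    # row-by-row lookup
```

Because the embedding matrix is an ordinary weight matrix in this view, its entries can be updated by back-propagation like any other parameter, which is what makes the word vectors dynamic.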
In the embodiment of the invention, the BiLSTM is a typical structure among recurrent neural networks (RNNs). A simple RNN can in theory establish dependencies between states separated by long time intervals, but because of exploding or vanishing gradients it can in practice learn only short-term dependencies; the LSTM alleviates the vanishing-gradient and exploding-gradient problems well. However, an LSTM predicts the output at the next time step only from the sequence information of previous time steps, whereas the BiLSTM uses the information both before and after the current input in the sequence and extracts the feature indexes of the composition by combining the context of each word, so that the dependency relationships of the text are mined better.
Referring to fig. 3, fig. 3 shows the structure of a single BiLSTM cell; because a BiLSTM network processes the data in both directions, it has twice as many hidden-layer cells as an LSTM. A single BiLSTM hidden layer contains three gates, $\Gamma_u$, $\Gamma_f$ and $\Gamma_o$, which are the update gate, the forget gate and the output gate respectively. The corresponding expressions are:
$$\tilde{c}^{\langle t \rangle} = \tanh(W_c[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)$$
$$\Gamma_f = \sigma(W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f)$$
$$\Gamma_u = \sigma(W_u[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)$$
$$\Gamma_o = \sigma(W_o[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o)$$
$$c^{\langle t \rangle} = \Gamma_u * \tilde{c}^{\langle t \rangle} + \Gamma_f * c^{\langle t-1 \rangle}$$
$$a^{\langle t \rangle} = \Gamma_o * \tanh c^{\langle t \rangle}$$
where $W_c, W_f, W_u, W_o, b_c, b_f, b_u, b_o$ are the parameters to be learned, $\sigma$ denotes the sigmoid activation function, $\tanh$ denotes the hyperbolic tangent function, and $*$ denotes the element-wise product of matrices or vectors. The forget gate decides what information to discard from the cell state: it looks at $a^{\langle t-1 \rangle}$ (the previous hidden state) and $x^{\langle t \rangle}$ (the current input) and outputs a number between 0 and 1 for each number in the state $c^{\langle t-1 \rangle}$. The sigmoid layer of the input gate (the update gate $\Gamma_u$) decides which values will be updated; a tanh layer then creates a candidate vector $\tilde{c}^{\langle t \rangle}$ that will be added to the cell state. These two vectors are combined, and the values of the forget gate and the input gate determine how much old and new information is remembered: the previous state value $c^{\langle t-1 \rangle}$ is multiplied by $\Gamma_f$, so that the part to be forgotten is discarded, and $\Gamma_u$ multiplied by $\tilde{c}^{\langle t \rangle}$ is added to the result to obtain the new state value. Finally, the output gate decides what to output; the final output state $a^{\langle t \rangle}$ can be passed through a Softmax function to obtain the output $y^{\langle t \rangle}$.
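To make the gate computations above concrete, the following is a minimal pure-Python sketch of one LSTM time step for a single direction. It assumes the standard cell in which the candidate state is a tanh of an affine map of the previous hidden state and the current input, and the new state is the update gate times the candidate plus the forget gate times the previous state. The function names and the dictionary layout of the parameters are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W v + b for a nested-list matrix W and list vectors v and b."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def lstm_step(x_t, a_prev, c_prev, params):
    """One time step; params holds W_c, W_f, W_u, W_o and b_c, b_f, b_u, b_o."""
    concat = a_prev + x_t  # the concatenation [a<t-1>, x<t>]
    c_tilde = [math.tanh(z) for z in affine(params["W_c"], concat, params["b_c"])]
    gate_f = [sigmoid(z) for z in affine(params["W_f"], concat, params["b_f"])]
    gate_u = [sigmoid(z) for z in affine(params["W_u"], concat, params["b_u"])]
    gate_o = [sigmoid(z) for z in affine(params["W_o"], concat, params["b_o"])]
    # New cell state: keep part of the old state, add part of the candidate.
    c_t = [u * ct + f * cp
           for u, ct, f, cp in zip(gate_u, c_tilde, gate_f, c_prev)]
    # New hidden state: output gate applied to tanh of the new cell state.
    a_t = [o * math.tanh(c) for o, c in zip(gate_o, c_t)]
    return a_t, c_t


# Toy check with hidden size 2 and input size 1; all parameters are zero, so
# every gate equals sigmoid(0) = 0.5 and the candidate equals tanh(0) = 0.
zeros_W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
zeros_b = [0.0, 0.0]
params = {name: zeros_W for name in ("W_c", "W_f", "W_u", "W_o")}
params.update({name: zeros_b for name in ("b_c", "b_f", "b_u", "b_o")})
a_t, c_t = lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], params)
```

A bidirectional network would run one such recurrence from the first word to the last and a second, separately parameterised one from the last word to the first.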
In the embodiment of the invention, a Dense layer is introduced to score the Chinese composition. The output of the BiLSTM network
Figure BDA0003648548350000122
is fed into the Dense layer, which has 2N input neurons and 3 output neurons. Here N is the number of neurons of the long short-term memory network in the forward or backward direction of the BiLSTM network, and 3 is the number of final composition grades, namely excellent, good and medium.
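The mapping from 2N features to 3 grades can be sketched as an affine layer followed by a normalisation into the (0, 1) interval. Using Softmax for that normalisation is an assumption here, as are the toy weights, the feature values (N = 2) and the function names.

```python
import math

GRADES = ["excellent", "good", "medium"]


def softmax(logits):
    """Map raw scores to probabilities in (0, 1) that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_softmax(features, W, b):
    """Map 2N features to 3 grade probabilities (W is 3 x 2N, b has 3 entries)."""
    logits = [sum(w * f for w, f in zip(row, features)) + b_k
              for row, b_k in zip(W, b)]
    return softmax(logits)


# Toy example with N = 2, i.e. 4 input features (made-up values).
features = [0.2, -0.1, 0.4, 0.3]
W = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0, 0.0],
]
b = [0.0, 0.0, 0.0]

probs = dense_softmax(features, W, b)
grade = GRADES[probs.index(max(probs))]
```

The predicted grade is simply the class with the largest probability.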
In addition, to prevent overfitting, a Dropout layer is added between the BiLSTM layer and the Dense layer. The Dropout layer randomly disables a given proportion of the neurons of the layer, which increases the diversity of the model and helps prevent overfitting. This layer is used only when the model is trained; Dropout is not used when the model is tested, so as to ensure the accuracy of the model.
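A minimal sketch of this train-only behaviour is given below. It uses the common "inverted dropout" convention, in which surviving activations are rescaled during training; that rescaling is an assumption and is not stated in the text, and the function name and its parameters are hypothetical.

```python
import random


def dropout(values, rate, training, rng=random):
    """Zero each value with probability `rate` in training; identity otherwise."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    # Inverted dropout (an assumption): survivors are divided by `keep` so the
    # expected activation is the same in training and in testing.
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
features = [1.0] * 10
train_out = dropout(features, rate=0.2, training=True, rng=rng)
test_out = dropout(features, rate=0.2, training=False)
```

In training mode each entry is either 0 or the rescaled activation; in test mode the features pass through unchanged.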
In the embodiment of the invention, the optimization algorithm and the evaluation indexes are as follows. The optimization algorithm is the method used to adjust the model parameters while the model is learning. For neural network models, the optimization methods in use today are mainly based on gradient descent, and include stochastic gradient descent (SGD), the momentum method (Momentum), the adaptive gradient algorithm (AdaGrad), the AdaDelta method, adaptive moment estimation (Adam), and the like. Adam is an adaptive learning-rate method that dynamically adjusts the learning rate of each parameter using first-moment and second-moment estimates of the gradient. An evaluation index is an index for evaluating the quality of the model; the embodiment of the invention uses the Precision, Recall and F1 score of the multi-class classification problem to measure how accurately the model scores Chinese compositions. The higher the precision and recall, the more reliable the model's scoring of Chinese compositions; however, precision and recall tend to trade off against each other, so the F1 score is added to take both into account.
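As an illustration of the evaluation indexes, the sketch below computes per-class precision, recall and F1 and averages them over the three grades. Macro-averaging is an assumption, since the text does not say how the multi-class scores are aggregated, and the labels and predictions are made up.

```python
def per_class_counts(y_true, y_pred, label):
    """Return true positives, false positives and false negatives for a label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def macro_scores(y_true, y_pred, labels):
    """Macro-averaged precision, recall and F1 over the given labels."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp, fp, fn = per_class_counts(y_true, y_pred, label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


labels = ["excellent", "good", "medium"]
y_true = ["excellent", "excellent", "good", "good", "medium", "medium"]
y_pred = ["excellent", "good", "good", "good", "medium", "excellent"]
precision, recall, f1 = macro_scores(y_true, y_pred, labels)
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high, which is why it serves as the combined measure.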
In the training process of the model, the words of the word-segmented composition texts in the database are first converted into a sequence matrix according to their sequence numbers in the binary word vector dictionary. The model inputs the sequence matrix into the Embedding layer to convert the text into a word vector matrix, the word vector matrix is then input into the BiLSTM layer for feature extraction, and finally the Dense layer scores the Chinese composition. The model reduces the value of the loss function by gradient descent, thereby dynamically adjusting the word vector parameters and the model parameters, and ultimately improves the composition scoring performance indexes of the model.
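The loss-reduction step can be illustrated with the cross-entropy loss, which the claims name as the loss function. The sketch below is a deliberate simplification: it treats the three pre-Softmax scores directly as the trainable parameters and applies one plain gradient-descent step, whereas the actual model back-propagates through the Dense, BiLSTM and Embedding layers. The names and the learning rate are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label_index):
    """Negative log-probability assigned to the correct grade."""
    return -math.log(probs[label_index])


def logits_gradient(probs, label_index):
    """Gradient of the loss w.r.t. the logits of a Softmax: p - one_hot."""
    return [p - (1.0 if k == label_index else 0.0) for k, p in enumerate(probs)]


logits = [0.0, 0.0, 0.0]  # toy scores for excellent / good / medium
label = 1                 # the true grade is "good" in this toy example
learning_rate = 0.5       # hypothetical value chosen for illustration

loss_before = cross_entropy(softmax(logits), label)
grad = logits_gradient(softmax(logits), label)
logits = [z - learning_rate * g for z, g in zip(logits, grad)]
loss_after = cross_entropy(softmax(logits), label)
```

After the step the probability of the correct grade rises, so the loss falls.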
In the prediction process of the model, i.e., the scoring of unscored compositions, the embodiment of the invention first segments the composition into words with jieba, then converts it into a sequence matrix according to the word vector dictionary and inputs the sequence matrix into the prediction model. The data pass in turn through the Embedding layer, the BiLSTM layer and the Dense layer, and the model finally outputs the score grade of the composition. The word vector parameters and model parameters in the prediction model are those obtained after model training.
The specific embodiment is as follows:
The experimental design and result analysis in the embodiment of the invention are as follows. The Chinese composition scoring experiment based on the dynamic word vector representation mainly verifies the effectiveness and accuracy of the method for scoring Chinese compositions.
The experimental environment is as follows: an Intel(R) Core(TM) i5-10500 CPU @ 3.10 GHz processor, 16.0 GB of memory, and a 64-bit Windows 10 system. The experimental software is Jupyter Notebook, and the deep learning framework is TensorFlow 2.2. The function libraries used in the experiments and their versions are shown in table 1.
TABLE 1 function library names and versions
Figure BDA0003648548350000131
The data set selection and processing are as follows. The embodiment of the invention uses 201907 composition texts written by primary-school pupils in grades one to six, each with its content and a corresponding grade score. The encoding of the composition texts is converted to UTF-8, stop words are removed using a stop word list, and the composition data are segmented into words in precise mode with the cut() function of the jieba library. Finally, the preprocessed data set is stored in a CSV document.
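The stop-word removal and CSV storage steps can be sketched with the Python standard library alone. The sketch assumes the text has already been segmented (the jieba call is left out), and the stop word list, the column names and the space-joined token format are all hypothetical choices made for illustration.

```python
import csv
import io

STOP_WORDS = {"的", "了"}  # hypothetical stop word list


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in stop_words]


def write_rows(rows, handle):
    """Write (grade, tokens) rows as CSV with space-separated tokens."""
    writer = csv.writer(handle)
    writer.writerow(["grade", "text"])  # assumed column names
    for grade, tokens in rows:
        writer.writerow([grade, " ".join(tokens)])


# One already segmented toy sample with a made-up grade label.
samples = [("excellent", ["春天", "的", "花", "开", "了"])]
cleaned = [(g, remove_stop_words(toks, STOP_WORDS)) for g, toks in samples]

buffer = io.StringIO()  # stands in for a UTF-8 encoded CSV file on disk
write_rows(cleaned, buffer)
csv_lines = buffer.getvalue().splitlines()
```

Writing to a real file would replace the buffer with `open(path, "w", encoding="utf-8", newline="")`.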
The word vector training uses the Word2Vec() function in the gensim library; the input is the preprocessed CSV document and the output is the trained binary word vector dictionary, which contains 137559 words. The Word2Vec word vector training adopts the CBOW algorithm, the vector dimension is set to 200, the maximum distance between the current word and the predicted word in a sentence is 5, and training runs for 8 iterations. The word vector dictionary finally obtained has size 137559 × 200.
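To illustrate what the window parameter means for CBOW, the sketch below lists, for each word, the context words within a fixed distance that would be used to predict it. This only illustrates the idea and does not reproduce gensim's internal sampling; the function name and the toy sentence are hypothetical.

```python
def cbow_pairs(tokens, window=5):
    """Return (context_words, center_word) pairs for CBOW-style training."""
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        pairs.append((left + right, center))
    return pairs


tokens = ["我", "喜欢", "春天", "的", "花"]
pairs = cbow_pairs(tokens, window=1)  # small window so the output is easy to read
```

With a window of 5, as in the experiment, each word would be predicted from up to five words on either side.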
The parameters of the Chinese composition scoring model are shown in Table 2. To prevent overfitting during training, a Dropout mechanism is added between the BiLSTM layer and the Dense layer, i.e., a given proportion of the neurons of the layer are randomly disabled during training (the output of a disabled neuron is set directly to 0). The value is set to 0.2, so 20% of the neurons are randomly disabled. The mechanism is used only while training the model and not during testing, where the value is effectively 0.
TABLE 2 model structural parameters
Figure BDA0003648548350000141
Figure BDA0003648548350000151
The network is configured according to these parameters to construct the Chinese composition scoring model, and 95% of the data in the data set is used as the training set to train the network. After model training is completed, the other 5% of the data is used as the test set to test the model; the evaluation index values of the model on the test set are shown in table 3. As can be seen from the table, when the Chinese composition scoring model based on the dynamic word vector representation scores compositions, the accuracy reaches 92.18% and the F1 index reaches 0.9209, so the model scores Chinese compositions fairly accurately.
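A 95% / 5% split can be sketched as below. Shuffling with a fixed seed before cutting is an assumption, since the text does not say how the samples are assigned to the two sets, and the function name is hypothetical.

```python
import random


def split_dataset(samples, train_fraction=0.95, seed=0):
    """Shuffle a copy of the samples and cut it into training and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


samples = list(range(100))  # stand-ins for 100 preprocessed compositions
train_set, test_set = split_dataset(samples)
```

Every sample lands in exactly one of the two sets, so the test set contains no composition seen during training.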
TABLE 3 model prediction evaluation index
Figure BDA0003648548350000152
It should be noted that the model parameters in table 2, such as Dropout and the initial learning rate, were selected by grid search. The Dropout search range is [0.1, 0.2, 0.3], and the search range of the initial learning rate is [0.01, 0.001, 0.0001]. For the same Chinese composition scoring model based on the dynamic word vector representation, the evaluation index values on the test set under different parameters are shown in Table 4. As can be seen from the table, the model achieves its best performance when Dropout is 0.2 and the initial learning rate is 0.001.
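The grid search over the two ranges can be sketched as an exhaustive loop over all nine combinations. The `toy_evaluate` function below is a hypothetical stand-in for "train the model with these settings and return its evaluation index"; its values are invented purely so that the loop has something to compare and are not experimental results.

```python
from itertools import product

DROPOUT_RATES = [0.1, 0.2, 0.3]
LEARNING_RATES = [0.01, 0.001, 0.0001]


def grid_search(evaluate, dropout_rates, learning_rates):
    """Try every (dropout, learning rate) pair and keep the best-scoring one."""
    best_score, best_params = None, None
    for dropout, lr in product(dropout_rates, learning_rates):
        score = evaluate(dropout, lr)
        if best_score is None or score > best_score:
            best_score, best_params = score, (dropout, lr)
    return best_params, best_score


def toy_evaluate(dropout, lr):
    """Hypothetical scoring surface; NOT real model results."""
    return 1.0 - abs(dropout - 0.2) - 100 * abs(lr - 0.001)


best_params, best_score = grid_search(toy_evaluate, DROPOUT_RATES, LEARNING_RATES)
tried = len(list(product(DROPOUT_RATES, LEARNING_RATES)))
```

In a real run, `evaluate` would train the scoring model once per combination, so the cost grows with the product of the two range sizes.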
TABLE 4. prediction and evaluation indexes of different model parameters
Figure BDA0003648548350000153
Figure BDA0003648548350000161
To show the superiority of the Chinese composition scoring model based on the dynamic word vector representation, the model is compared with models using static word vectors, a convolutional neural network (CNN) and a unidirectional long short-term memory network (LSTM). All models use the same trained binary word vector dictionary; the parameters of the static word vectors do not change with model training, and the convolutional neural network uses one-dimensional convolution with a kernel size of 3. The parameters of every model are the optimal ones found by grid search, and the same data set is used throughout. The evaluation index values of the different models on the test set are shown in table 5. As can be seen from the table, the accuracy and F1 score of composition scoring are improved by combining the composition representation based on dynamic word vectors with feature extraction by a bidirectional long short-term memory network.
TABLE 5 prediction and evaluation indexes of different model structures
Figure BDA0003648548350000162
In summary, the invention discloses a Chinese composition scoring method based on dynamic word vector representation, which takes a Chinese composition as input and finally outputs a composition score grade. The composition scoring model provided by the embodiment of the invention is based on Word2Vec word vectors; an Embedding layer is introduced into the neural network structure to train the word vectors dynamically, so that the trained word vectors represent words more accurately. A bidirectional long short-term memory (BiLSTM) network is used in the composition scoring model; it uses the information both before and after each position of the input sequence and extracts the feature indexes of the composition by combining the context of each word, so that the dependency relationships of the text are mined better. Experiments show that the Chinese composition scoring model based on the dynamic word vector representation can train word vectors dynamically, mine the scoring features in composition texts and finally score the input compositions, and that it improves the accuracy of composition scoring compared with other static word vector models or convolutional neural network models.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details not described in the apparatus embodiments, please refer to the method embodiments of the present invention.
In another embodiment of the present invention, a Chinese composition scoring system based on dynamic word vector representation is provided, which includes:
the data acquisition module is used for acquiring the Chinese composition text to be scored after word segmentation processing;
the result acquisition module is used for inputting the Chinese composition text to be scored after word segmentation into a pre-trained Chinese composition scoring model and outputting a scoring result through the Chinese composition scoring model;
wherein, the Chinese composition scoring model comprises:
the input layer is used for inputting the Chinese composition text after word segmentation processing, and converting words in the Chinese composition text into a sequence matrix based on a pre-trained binary word vector dictionary;
the embedded layer is used for inputting the sequence matrix and converting the sequence matrix into a word vector matrix for output;
the BiLSTM layer is used for inputting the word vector matrix, extracting the characteristics and outputting composition context characteristic vectors;
the full connection layer is used for inputting the composition context feature vector and outputting a classification result vector;
and the output layer is used for inputting the classification result vector and outputting a grading result.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (10)

1. A Chinese composition scoring method based on dynamic word vector representation is characterized by comprising the following steps:
acquiring a Chinese composition text to be scored after word segmentation;
inputting the Chinese composition text to be scored after word segmentation into a pre-trained Chinese composition scoring model, and outputting a scoring result through the Chinese composition scoring model;
wherein, the Chinese composition scoring model comprises:
the input layer is used for inputting the Chinese composition text after word segmentation processing, and converting words in the Chinese composition text into a sequence matrix based on a pre-trained binary word vector dictionary;
the embedded layer is used for inputting the sequence matrix and converting the sequence matrix into a word vector matrix for output;
the BiLSTM layer is used for inputting the word vector matrix, extracting the characteristics and outputting the composition context characteristic vector;
the full connection layer is used for inputting the composition context feature vector and outputting a classification result vector;
and the output layer is used for inputting the classification result vector and outputting a grading result.
2. The method according to claim 1, wherein the step of obtaining the pre-trained chinese composition scoring model comprises:
acquiring a training sample set; each training sample in the training sample set comprises a scoring label and a Chinese composition text of the sample after word segmentation;
during each training update, the word-segmented Chinese composition text of a selected training sample is input into the Chinese composition scoring model to obtain a score prediction; the difference between the score prediction and the scoring label of the selected training sample is calculated, the loss is computed with a cross-entropy loss function, and the parameters of the Chinese composition scoring model are updated until a preset convergence condition is reached, yielding the pre-trained Chinese composition scoring model.
3. The method of claim 1, wherein the chinese composition scoring model further comprises:
a Dropout layer disposed between the BiLSTM layer and the fully-connected layer; the Dropout layer is used for inputting the composition context feature vector output by the BiLSTM layer and randomly setting a preset proportion of the neuron outputs to 0 to obtain a processed composition context feature vector; the processed composition context feature vector is used as the input of the fully-connected layer.
4. The method of claim 1, wherein the obtaining of the pre-trained binary word vector dictionary comprises:
acquiring a new training sample set composed of the Chinese composition texts of the samples after word segmentation processing of each training sample in the training sample set adopted by the Chinese composition scoring model;
and based on the new training sample set, adopting the Word2Vec function in the gensim library to realize the training of word vectors, and obtaining the pre-trained binary word vector dictionary.
5. The method of claim 4, wherein the Word2Vec function uses an algorithm that predicts the center word from its context or an algorithm that predicts the context from the center word.
6. The method for scoring the Chinese composition based on the dynamic word vector representation according to claim 1, wherein the step of obtaining the Chinese composition text to be scored after the word segmentation process specifically comprises the steps of:
acquiring an initial Chinese composition to be evaluated;
carrying out data cleaning on the initial Chinese composition to be evaluated to enable the format to be uniform and obtain the Chinese composition after the data cleaning;
and performing word segmentation on the Chinese composition after the data cleaning by adopting the jieba.cut() function to obtain the Chinese composition text to be scored after word segmentation.
7. A Chinese composition scoring system based on dynamic word vector representation is characterized by comprising the following components:
the data acquisition module is used for acquiring the Chinese composition text to be scored after word segmentation processing;
the result acquisition module is used for inputting the Chinese composition text to be scored after word segmentation into a pre-trained Chinese composition scoring model and outputting a scoring result through the Chinese composition scoring model;
wherein, the Chinese composition scoring model comprises:
the input layer is used for inputting the Chinese composition text after word segmentation processing, and converting words in the Chinese composition text into a sequence matrix based on a pre-trained binary word vector dictionary;
the embedded layer is used for inputting the sequence matrix and converting the sequence matrix into a word vector matrix for output;
the BiLSTM layer is used for inputting the word vector matrix, extracting the characteristics and outputting the composition context characteristic vector;
the full connection layer is used for inputting the composition context feature vector and outputting a classification result vector;
and the output layer is used for inputting the classification result vector and outputting a grading result.
8. The system according to claim 7, wherein the obtaining of the pre-trained chinese composition scoring model comprises:
acquiring a training sample set; each training sample in the training sample set comprises a scoring label and a Chinese composition text in the sample after word segmentation;
during each training update, the word-segmented Chinese composition text of a selected training sample is input into the Chinese composition scoring model to obtain a score prediction; the difference between the score prediction and the scoring label of the selected training sample is calculated, the loss is computed with a cross-entropy loss function, and the parameters of the Chinese composition scoring model are updated until a preset convergence condition is reached, yielding the pre-trained Chinese composition scoring model.
9. The system of claim 7, wherein the chinese composition scoring model further comprises:
the Dropout layer is arranged between the BiLSTM layer and the fully-connected layer; the Dropout layer is used for inputting the composition context feature vector output by the BiLSTM layer and randomly setting a preset proportion of the neuron outputs to 0 to obtain a processed composition context feature vector; the processed composition context feature vector is used as the input of the fully-connected layer.
10. The system of claim 7, wherein the pre-trained binary word vector dictionary obtaining step comprises:
acquiring a new training sample set composed of the Chinese composition texts of the samples after word segmentation processing of each training sample in the training sample set adopted by the Chinese composition scoring model;
and based on the new training sample set, adopting the Word2Vec function in the gensim library to realize the training of word vectors, and obtaining the pre-trained binary word vector dictionary.
CN202210536676.7A 2022-05-17 2022-05-17 Chinese composition scoring method and system based on dynamic word vector representation Pending CN114925687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210536676.7A CN114925687A (en) 2022-05-17 2022-05-17 Chinese composition scoring method and system based on dynamic word vector representation

Publications (1)

Publication Number Publication Date
CN114925687A true CN114925687A (en) 2022-08-19

Family

ID=82807970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210536676.7A Pending CN114925687A (en) 2022-05-17 2022-05-17 Chinese composition scoring method and system based on dynamic word vector representation

Country Status (1)

Country Link
CN (1) CN114925687A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629211A (en) * 2023-02-25 2023-08-22 浙江研几网络科技股份有限公司 Writing method and system based on artificial intelligence
CN116629211B (en) * 2023-02-25 2023-10-27 浙江研几网络科技股份有限公司 Writing method and system based on artificial intelligence
CN117520817A (en) * 2023-11-08 2024-02-06 广州水沐青华科技有限公司 Power fingerprint identification model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination