CN113343690B - Text readability automatic evaluation method and device - Google Patents

Text readability automatic evaluation method and device

Info

Publication number
CN113343690B
CN113343690B CN202110692831.XA
Authority
CN
China
Prior art keywords
sentence
text
difficulty
readability
evaluated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110692831.XA
Other languages
Chinese (zh)
Other versions
CN113343690A (en)
Inventor
于东
唐玉玲
张宇飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Original Assignee
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority to CN202110692831.XA
Publication of CN113343690A
Application granted
Publication of CN113343690B
Legal status: Active

Links

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/30 Semantic analysis
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods


Abstract

The invention discloses a method and device for automatic text readability assessment. The method comprises the following steps: constructing a Chinese character difficulty level table suitable for native Chinese learners, the table comprising the Chinese characters whose difficulty is to be evaluated and the difficulty level corresponding to each of those characters; obtaining the difficulty level information of each Chinese character in the text to be evaluated from the Chinese character difficulty level table; and combining the character difficulty level information with a graph neural network to realize automatic assessment of the readability of the text to be evaluated, wherein readability assessment of sentences is converted into a graph node classification task and readability assessment of paragraphs and chapters into a graph classification task. The method analyzes the text in a more targeted way and achieves a better assessment effect.

Description

Text readability automatic evaluation method and device
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method and device for automatic text readability assessment.
Background
Readability is one of the important topics in linguistics and psychology, and readability analysis of text is at the heart of this research. The task of readability analysis is, given a text, to determine by analysis its difficulty value or the reader level for which it is suitable.
Early readability analysis relied mainly on experienced experts or teachers evaluating texts subjectively; this approach is highly subjective, and because evaluators apply different standards for different purposes, their results often disagree. According to differences in analysis ideas and key technologies, automatic readability analysis methods divide into formula methods, classification methods and ranking methods. Formula methods predict a text's difficulty value through a linear equation whose variables are language features affecting difficulty, generally shallow features such as word length and sentence length. Classification methods treat difficulty prediction as a classification task: a series of discriminative text features is learned from texts of different levels to build a classification model, and when a new unlabeled text is input, the model estimates its difficulty level from what it has learned. Ranking methods order texts by relative difficulty; their defect is that they cannot give a specific difficulty value or difficulty level.
Classification-based methods are currently the most common approach to automatic readability assessment. In machine learning, classification is defined as follows: given a set of training instances {X_1, X_2, ..., X_n}, each with a class label, a model f: X -> Y is trained to make class predictions for new instances. A large body of research shows that, beyond shallow features such as sentence length and word length, classification-based readability analysis can take more language features into account, such as vocabulary familiarity and syntactic complexity; its results are more accurate than readability formulas, and it has a clear advantage in distinguishing high-difficulty texts.
In order of development, existing classification-based automatic analysis models include traditional statistical machine learning models, such as the N-gram membership model and the support vector machine (SVM), and deep learning models based on RNN, CNN and the Transformer, typified by BiLSTM, TextCNN and BERT.
The N-gram membership model is a statistical language model based on word-string probabilities. It treats a text as a sequence of characters and assumes that the readability level of the text depends on its N-gram word strings, which are taken to be mutually independent. In the training stage, the method counts, from the training sample data, the probability that each N-gram string belongs to each level. In the prediction stage, for a text T of unknown level, the membership degree of every level is calculated, and the level with the largest membership is taken as the difficulty level matching the text.
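The membership computation just described can be sketched in Python. The patent does not give the exact membership formula, so this sketch assumes a smoothed per-level N-gram log-probability as the membership score; `train_ngram_levels`, `predict_level` and `texts_by_level` are hypothetical names.

```python
import math
from collections import Counter

def train_ngram_levels(texts_by_level, n=2):
    """Count per-level n-gram statistics from training texts.
    Add-one smoothing is an assumption; the original formula is unspecified."""
    models = {}
    for level, texts in texts_by_level.items():
        counts = Counter()
        for t in texts:
            counts.update(t[i:i + n] for i in range(len(t) - n + 1))
        total = sum(counts.values())
        vocab = len(counts) + 1          # +1 for unseen n-grams
        models[level] = (counts, total, vocab)
    return models

def predict_level(text, models, n=2):
    """Return the level with the largest (log-)membership for the text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    best, best_lp = None, -math.inf
    for level, (counts, total, vocab) in models.items():
        lp = sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams)
        if lp > best_lp:
            best, best_lp = level, lp
    return best
```

A text dominated by n-grams frequent at one level is assigned that level, mirroring the maximum-membership rule above.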
The support vector machine, proposed by Cortes et al., is based on the structural risk minimization principle of statistical learning theory and is mainly used for classification problems. For readability, the SVM is modeled in combination with various linguistic features that characterize difficulty. These features are either text features learned from N-grams or various shallow or deep linguistic features summarized and extracted manually, such as lexical and syntactic features.
Traditional machine learning methods that combine various language difficulty features can automatically evaluate text readability quite well, but they only consider difficulty features at the level of wording and lack consideration of semantic difficulty information.
Deep learning methods perform well on the automatic readability analysis task. Earlier work trained BiLSTM models that learn difficulty features automatically, with good results. The LSTM is good at capturing long sequence information: each time step takes two inputs, the information retained from the previous step and the original input of the current step, so that at the last time step the LSTM holds information about the whole sequence, having discarded what the model deems useless. For readability analysis, the factors affecting readability relate to the whole text sequence, and a sequence feature extractor with an RNN kernel can effectively capture difficulty information representing the whole text; however, its representation capability is limited, and when it encounters a longer text sequence its feature-capturing ability falls short.
Convolutional neural networks also perform prominently on text classification, with good results obtained by applying one-dimensional convolution to text. Applied to automatic readability analysis, the method slides convolution kernels of several different sizes over the text vectors from top to bottom to extract key information, each kernel covering a different number of words at a time, and follows the convolution with a max-pooling or average-pooling operation; the output of this process is the automatically learned difficulty feature. Because this resembles N-grams with multiple windows, it captures local correlations in the text well. In readability analysis, what affects the readability of a text is not only whole-sequence information: one or more particularly difficult words or phrases may be the direct cause of a higher difficulty level. From this starting point, TextCNN, which can capture local correlations in text, is a suitable model. However, while capturing local difficulty information it loses a large amount of important information and lacks a grasp of global difficulty information, so the improvement in effect is limited.
BERT is a Transformer-based bidirectional pre-trained language model that encodes text with the Transformer encoder. The Transformer uses an attention mechanism that lets the model attend to specific parts of the input when constructing the output vector, giving high attention weights to important content and low weights to unimportant content. BERT acquires rich information, including difficulty and semantic information, through unsupervised pre-training on a large corpus, and is then fine-tuned on the downstream task of automatic readability analysis, so that it can both use the learned prior knowledge and adapt to the specific task. This method clearly improves on earlier ones and to some extent overcomes the shortcomings of models with RNN and CNN kernels, but it shares similar problems: the interpretability of the model is low and the room for further improvement is limited.
Disclosure of Invention
The invention provides a method and device for automatic text readability assessment, to solve the technical problems in the prior art that character-level language difficulty features are only indirect, model interpretability is low, and the room for performance improvement is limited.
In order to solve the technical problems, the invention provides the following technical scheme:
in one aspect, the present invention provides a text readability automatic assessment method, including:
constructing a Chinese character difficulty level table suitable for a Chinese native language learner; the Chinese character difficulty level table comprises Chinese characters with difficulty to be evaluated and difficulty levels corresponding to the Chinese characters with the difficulty to be evaluated;
based on the Chinese character difficulty level table, obtaining difficulty level information of each Chinese character in the text to be evaluated;
composing the text to be evaluated into a graph, and combining the difficulty level information of the Chinese characters in the text with a graph neural network to realize automatic assessment of the readability of the text to be evaluated; wherein readability assessment of sentences is converted into a graph node classification task and readability assessment of paragraphs and chapters into a graph classification task.
Further, the construction of the Chinese character difficulty level table suitable for the Chinese native language learner comprises the following steps:
collecting the appendices of new words and expressions required to be mastered after each lesson in the Chinese textbooks of grades one to nine;
combining the collected new-word and word appendices into a Chinese character table, and taking the grade at which a character first appears in the textbooks and is required to be mastered as that character's difficulty grade, so that all Chinese characters are divided into nine difficulty grades and a Chinese character difficulty level table suitable for native Chinese learners is constructed.
Further, based on the Chinese character difficulty level table, obtaining difficulty level information of each Chinese character in the text to be evaluated comprises:
for Chinese characters which do not appear in the Chinese character difficulty level table, the difficulty level is represented by 0.
Further, when the text to be evaluated is a sentence, composing the text into a graph and combining the difficulty level information of its Chinese characters with a graph neural network to realize automatic assessment of its readability comprises the following steps:
fixing the length of, and vectorizing, the number sequence formed by the difficulty level information of all Chinese characters in each sentence, to obtain a first feature vector for each sentence representing its Chinese character difficulty features;
constructing a global heterogeneous graph over the sentence corpus formed by all sentences, and obtaining from the constructed global heterogeneous graph a second feature vector for each sentence representing its structural features;
composing each sentence into an independent graph, initializing the semantic representation vector of each word of the sentence through a pre-trained language model, and obtaining, through node information interaction in each sentence's independent graph, a third feature vector for each sentence representing its semantic features;
and fusing the first, second and third feature vectors of each sentence into a fused feature vector for the current sentence, and classifying the difficulty level of the corresponding sentence by this fused feature vector, thereby realizing automatic assessment of sentence readability.
Further, fixing the length of and vectorizing the number sequences formed by the difficulty level information of all Chinese characters in each sentence, to obtain the first feature vector representing each sentence's Chinese character difficulty features, comprises:
when a sentence falls short of the fixed length, representing the missing part by a preset character whose difficulty level is set to 0; when a sentence exceeds the fixed length, truncating and discarding the excess.
Further, the global heterogeneous graph construction is performed on a sentence corpus formed by all sentences, including:
firstly, constructing a global heterogeneous graph for the whole sentence corpus: if the corpus contains N sentences, there are N sentence nodes; then performing word segmentation and de-duplication on the whole corpus and removing stop words, obtaining M words and hence M word nodes;
the constructed heterogeneous graph contains two kinds of node relations, which determine the edges between nodes and the weights on those edges. One is the relation between sentence nodes and word nodes: if sentence i contains word j, an edge exists between the sentence node S_i corresponding to sentence i and the word node W_j corresponding to word j, and its weight is the TF-IDF value of word j relative to sentence i. The other is the relation between word nodes, obtained from word co-occurrence: if the PMI value of two words within a fixed sliding window is greater than 0, an edge exists between them, and its weight is the PMI value of the two words.
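The word-word edges just described can be sketched as follows. This assumes the common formulation in which PMI is computed from sliding-window co-occurrence counts; `pmi_edges` is a hypothetical helper, and the text does not spell out the counting details.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(token_lists, window=3):
    """Word-word edges weighted by PMI over fixed sliding windows.
    An edge is kept only when PMI > 0, as in the construction above."""
    win_count = 0
    single = Counter()   # windows containing word w
    pair = Counter()     # windows containing both words of a pair
    for tokens in token_lists:
        n = max(len(tokens) - window + 1, 1)
        for s in range(n):
            win = set(tokens[s:s + window])
            win_count += 1
            single.update(win)
            pair.update(frozenset(p) for p in combinations(sorted(win), 2))
    edges = {}
    for p, c in pair.items():
        a, b = tuple(p)
        # PMI(a,b) = log( p(a,b) / (p(a) p(b)) ) over window counts
        val = math.log(c * win_count / (single[a] * single[b]))
        if val > 0:
            edges[(a, b)] = val
    return edges
```

The sentence-word TF-IDF edges would be computed analogously from term frequencies within each sentence and inverse sentence frequencies over the corpus.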
Further, composing each sentence into an independent graph, initializing the semantic representation vector of each word of the sentence through the pre-trained language model, and obtaining through graph node information interaction in each sentence's independent graph a third feature vector representing the sentence's semantic features, comprises the following steps:
composing each sentence independently with words as nodes, where edges between word nodes are given by word co-occurrence; for each word node, obtaining the word's semantic information from the pre-trained language model as the node's initial state information; when updating the state of a word node, fusing the state information of its neighboring nodes as the memory information of a GRU network and combining it with the current node's state for the update; and after all word node information has been updated, applying an attention mechanism over all word nodes to obtain the third feature vector representing the semantic features of the current sentence.
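A minimal numerical sketch of the node update and readout just described, assuming a single shared GRU cell and dot-product attention (details the text does not fix); all parameter names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(H, M, P):
    """One GRU-style update: the aggregated neighbor memory M acts as the
    input, the current node states H as the hidden state."""
    z = sigmoid(M @ P["Wz"] + H @ P["Uz"])             # update gate
    r = sigmoid(M @ P["Wr"] + H @ P["Ur"])             # reset gate
    h_cand = np.tanh(M @ P["Wh"] + (r * H) @ P["Uh"])  # candidate state
    return (1 - z) * H + z * h_cand

def sentence_vector(A, H, P, q, steps=2):
    """Propagate node states over adjacency A, then attention-pool the
    word-node states into a single sentence representation."""
    for _ in range(steps):
        H = gru_step(H, A @ H, P)       # fuse neighbors' states as memory
    scores = H @ q                      # dot-product attention scores
    att = np.exp(scores - scores.max())
    att = att / att.sum()
    return att @ H                      # weighted sum of node states
```

Here `A` is the word co-occurrence adjacency matrix, `H` the PLM-initialized word vectors, and `q` a learned attention query.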
Further, when the text to be evaluated is a paragraph or a chapter, composing the text into a graph and combining the difficulty level information of its Chinese characters with a graph neural network to realize automatic assessment of its readability comprises the following steps:
firstly splitting the text to be evaluated into sentences and performing word and character segmentation on each sentence; then composing each sentence into a graph with words as nodes to obtain the sentence's readability feature representation, and finally composing the whole text with sentences as nodes, thereby realizing readability assessment of the text to be evaluated.
Further, composing the sentence with words as nodes to obtain its readability feature representation comprises:
taking the first word of the sentence as the master node and linking the remaining word nodes into a directed graph in text order to obtain the sentence graph model, wherein the master node establishes bidirectional connections with all word nodes in the current sentence;
fixing the length of and vectorizing the sequence formed by the difficulty level information of all Chinese characters in the sentence to obtain the sentence difficulty vector representing its Chinese character difficulty features;
fusing the sentence difficulty vector with the graph semantic vector output by the sentence graph model to obtain the readability feature representation carrying the current sentence's difficulty information.
Composing the whole text to be evaluated with sentences as nodes to realize its readability assessment comprises:
taking sentences as nodes with their readability feature representations as node information, constructing a directed graph in text order to obtain the directed graph model of the text to be evaluated, learning and updating the nodes through information transmission within the model, and then reading out with an attention mechanism to obtain a representation vector of the difficulty of the whole text to be evaluated, thereby realizing its readability assessment.
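The sentence-level graph construction above can be sketched as an edge list; `sentence_graph` is a hypothetical helper, and node indices stand in for word nodes.

```python
def sentence_graph(tokens):
    """Edge list for the sentence graph described above: token 0 is the
    master node, bidirectionally connected to every other word node, and
    the remaining word nodes are chained by directed edges in text order."""
    edges = []
    for i in range(1, len(tokens)):
        edges += [(0, i), (i, 0)]      # master <-> word, bidirectional
    for i in range(1, len(tokens) - 1):
        edges.append((i, i + 1))       # directed edge following text order
    return edges
```

The chapter-level graph would be built analogously, chaining sentence nodes in text order with each node carrying that sentence's readability feature representation.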
On the other hand, the invention also provides an automatic text readability assessment device, which comprises:
the Chinese character difficulty information acquisition module is used for acquiring the difficulty level information of each Chinese character in the text to be evaluated based on a pre-constructed Chinese character difficulty level table; the Chinese character difficulty level table comprises Chinese characters with difficulty to be evaluated and difficulty levels corresponding to the Chinese characters with the difficulty to be evaluated;
the graph neural network based text readability automatic assessment module, used for composing the text to be evaluated into a graph and combining the Chinese character difficulty level information acquired by the Chinese character difficulty information acquisition module with a graph neural network to realize automatic assessment of the readability of the text to be evaluated; wherein readability assessment of sentences is converted into a graph node classification task and readability assessment of paragraphs and chapters into a graph classification task.
In yet another aspect, the present invention also provides an electronic device including a processor and a memory; wherein the memory stores at least one instruction that is loaded and executed by the processor to implement the above-described method.
In yet another aspect, the present invention also provides a computer readable storage medium having at least one instruction stored therein, the instruction being loaded and executed by a processor to implement the above method.
The technical scheme provided by the invention has the beneficial effects that at least:
the Chinese character difficulty assessment method provides a very efficient and referent standard for the Chinese character cognition difficulty of the Chinese native language learner, and adds the Chinese character difficulty as a characteristic into the text, so that the characteristic extraction efficiency is improved, and a better assessment effect is achieved. Furthermore, the method provided by the invention respectively models the two types of texts by considering the difference of sentences and chapters in structure, and can be combined with a graph model with stronger interpretation, so that the analysis of the text by the model is more targeted, and the evaluation effect of the readability evaluation model is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a sentence readability evaluation modeling flow chart provided by an embodiment of the present invention;
FIG. 2 is a diagram illustrating a process of obtaining a Chinese character difficulty feature representation of a sentence through a first channel according to an embodiment of the present invention;
FIG. 3 is a diagram showing the difference between the sentence difficulty and the Chinese character difficulty according to the embodiment of the present invention;
FIG. 4 is a process diagram of a global structural feature representation of a second-pass acquisition sentence provided by an embodiment of the present invention;
FIG. 5 is a process diagram of a diagram semantic feature representation of a channel three acquisition sentence provided by an embodiment of the present invention;
FIG. 6 is a flowchart of a chapter level text readability assessment modeling provided by an embodiment of the present invention;
FIG. 7 is a detailed process diagram of sentence-level composition provided by an embodiment of the present invention;
fig. 8 is a detailed process diagram of a chapter level composition using sentences as nodes according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
First embodiment
The embodiment provides an automatic text readability assessment method, which adopts a classification method to carry out modeling analysis on text readability. By composing the text, the difficulty information interaction among the nodes is enhanced, the interpretability of the model is enhanced, and the accuracy of automatic assessment of the readability is improved. The method may be implemented by an electronic device, which may be a terminal or a server. The execution flow of the method comprises the following steps:
S1, constructing a Chinese character difficulty level table suitable for a Chinese native language learner; the Chinese character difficulty level table comprises Chinese characters with difficulty to be evaluated and difficulty levels corresponding to the Chinese characters with the difficulty to be evaluated;
it should be noted that existing difficulty features are mainly indirect language features, such as the stroke count of Chinese characters and their occurrence frequency. Such features reflect character difficulty to some extent and thus support a judgment of overall text readability, but character difficulty itself is a more direct, intuitive and accurate text difficulty feature than these indirect ones. This embodiment therefore provides a fine-grained character difficulty level assessment for native Chinese learners, and uses the resulting character difficulty as a feature in the subsequent automatic readability assessment.
S2, acquiring difficulty level information of each Chinese character in the text to be evaluated based on the Chinese character difficulty level table;
s3, composition is carried out on the text to be evaluated, and difficulty level information of Chinese characters in the text to be evaluated is combined with a graphic neural network to realize automatic evaluation of the readability of the text to be evaluated; wherein the readability assessment of sentences is converted into graph node classification tasks and the readability assessment of paragraphs and chapters is converted into graph classification tasks.
It should be noted that, to address the limited accuracy gains and weak interpretability of existing deep-learning readability classifiers, this embodiment evaluates text readability with a graph neural network model. The text is composed into a graph, each graph node is endowed with various kinds of difficulty information, and information interaction between nodes intuitively displays how the factors related to readability are computed, so that readability is intuitively understood and the accuracy and interpretability of the model are improved.
This embodiment proposes adding Chinese character difficulty as a feature for text readability evaluation; in the evaluation process the text is composed into a graph using a graph neural network, converting the readability level evaluation problem into a node classification or graph classification problem. The evaluation method of this embodiment is described in detail in three parts: 1. construction of the Chinese character difficulty level table; 2. construction of the sentence readability evaluation model using a graph neural network; 3. construction of the chapter text readability evaluation model using a graph neural network.
1. Construction of Chinese character difficulty level table
As the finest-grained units of text, Chinese characters have a very important influence on text readability. Chinese textbooks are the main way native-language learners learn characters; their content is arranged from easy to difficult, with character difficulty an important basis for that arrangement, so the ordering of textbook texts is a valuable reference. This embodiment therefore collects, from the current Chinese textbooks of grades one to nine, the new-word and word appendices required to be mastered after each lesson, combines them into a Chinese character table, and takes the grade at which a character first appears in the textbooks and is required to be mastered as that character's difficulty grade. On this basis the characters are divided into nine difficulty levels, and a Chinese character difficulty level table suitable for native Chinese learners is constructed.
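The table construction just described can be sketched as follows, assuming the per-grade character sets have already been collected from the textbook appendices; `grade_vocab` and `build_difficulty_table` are hypothetical names.

```python
def build_difficulty_table(grade_vocab):
    """Assign each character the grade at which it first appears.
    `grade_vocab` maps grade (1-9) to the set of characters required
    to be mastered at that grade."""
    table = {}
    for grade in sorted(grade_vocab):     # iterate from grade 1 upward
        for ch in grade_vocab[grade]:
            table.setdefault(ch, grade)   # keep only the earliest grade
    return table
```

A character recurring in a later grade's appendix keeps the level of its first appearance, as the construction above requires.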
2. Sentence readability evaluation model for graph neural network
When modeling sentence readability, sentence difficulty representation vectors are obtained at three levels: a difficulty representation vector RD_i that takes Chinese character difficulty as its feature; a difficulty representation vector RC_i representing sentence structure features, obtained from a global heterogeneous graph; and a difficulty representation vector RS_i containing semantic information, obtained from an independent per-sentence graph. These three feature vectors are obtained through three different channels, then fused into a difficulty vector R(Si) representing the whole sentence, and R(Si) is classified by difficulty level, realizing sentence readability evaluation.
As shown in fig. 1, the model is a three-channel readability model; channels one, two and three correspond to three different feature representation methods, yielding three different sentence feature representations. Channel one maps each character of a sentence through the Chinese character difficulty table proposed in this embodiment, obtaining a character-level difficulty representation of the sentence. Channel two builds a global heterogeneous graph over the whole sentence corpus, obtaining a global structural feature representation of each sentence. Channel three builds an independent graph for each sentence in the corpus, initializes the semantic representation vector of each word with a pre-trained language model, and obtains the graph semantic feature representation of each sentence through information interaction among graph nodes.
The process of channel one is shown in figs. 2 and 3. Chinese character difficulty is mapped by table lookup; after the difficulty values of all characters in a sentence are obtained, the whole sentence is vectorized at a fixed length, yielding a vector representation RD_i for each sentence. In addition, when mapping character difficulty into a sentence feature representation, sentences must be normalized to a fixed length for convenient reading and processing by subsequent models: positions beyond the actual sentence length are filled with 'unk', and characters exceeding the fixed length are truncated and discarded. As shown in fig. 3, assume the fixed length is set to 15 and take the sentence "Today's weather is clear." as an example: during difficulty mapping, the difficulty level of out-of-vocabulary characters that do not appear in the difficulty table is represented by '0', and the difficulty level of 'unk' is also represented by '0'.
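The fixed-length lookup of channel one can be sketched as follows. The table here is a hypothetical fragment; unknown characters, punctuation, and 'unk' padding all map to difficulty 0, and overlong sentences are truncated:

```python
def sentence_to_difficulty_vector(sentence, table, fixed_len=15):
    """Channel-one sketch: map each character to its difficulty level,
    pad short sentences with 0 ('unk'), truncate long ones."""
    vec = [table.get(ch, 0) for ch in sentence[:fixed_len]]
    vec += [0] * (fixed_len - len(vec))  # 'unk' positions
    return vec

toy_table = {"今": 1, "天": 1, "气": 2, "晴": 3}
rd = sentence_to_difficulty_vector("今天天气晴。", toy_table)
```

The full stop is not in the table and so receives 0, just like the padded positions.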
The process of channel two is shown in fig. 4. In channel two, sentence nodes are denoted S and word nodes W. The global heterogeneous graph contains n sentence nodes and m word nodes, so the graph contains (m+n) nodes in total. The adjacency matrix of the graph, of dimension (m+n)×(m+n), represents the information of the whole graph; the relation value between node i and node j is the entry Aij. The initial state of each node is the vectorized row of the adjacency matrix corresponding to that node. Each node is updated by combining its initial state with the information of its neighboring nodes; after all nodes have been updated, the final structural feature representation RC_i of each sentence node is obtained by fusing the information of its neighboring word nodes through this information interaction. The specific process is as follows. A global heterogeneous graph is built over the whole sentence corpus: if the corpus contains N sentences, there are N sentence nodes. The corpus is segmented into words, deduplicated, and stop words are removed, giving M words and hence M word nodes W. The heterogeneous graph contains two kinds of node relations, representing edges between nodes and the weights on those edges. One is the relation between sentence nodes and word nodes: if sentence i contains word j, there is an edge between sentence node S_i and word node W_j, and the weight of the edge is the TF-IDF value of the word relative to the sentence.
TF-IDF is a statistical method used to evaluate the importance of a word to a document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus. The main idea of TF-IDF is: if a word or phrase appears frequently in one article (high TF) and rarely in other articles, it is considered to have good category discrimination and to be suitable for classification. TF-IDF is in fact TF × IDF, where TF (Term Frequency) is the frequency with which a term occurs in document d, and IDF (Inverse Document Frequency) reflects how rarely the term occurs across the documents of the current corpus.
TF-IDF(i,j) = tf(i,j) × idf(i,j), (1)

where i denotes a document and j denotes a term; tf(i,j) is the frequency with which term j occurs in document i; |D| denotes the size of the corpus; and idf(i,j) = log(|D| / |{d ∈ D : term j appears in d}|) is the inverse document frequency of term j in the current corpus.
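A minimal TF-IDF sketch for the sentence-word edge weights, using the plain tf × log(|D|/df) form described above (the embodiment does not specify any smoothing, so none is assumed):

```python
import math

def tfidf(term, doc_tokens, all_docs):
    """Weight of a word (term) relative to one tokenized sentence/document."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in all_docs if term in d)  # documents containing the term
    return tf * math.log(len(all_docs) / df)    # df > 0 when term is in doc_tokens

docs = [["a", "b"], ["a", "c"]]
w = tfidf("b", docs[0], docs)  # edge weight for word "b" in the first sentence
```

A term occurring in every document ("a" above) gets idf = log(1) = 0 and hence weight 0, reflecting its lack of discriminative power.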
The other is the relation between word nodes, obtained through word co-occurrence. If the PMI value of two words within a fixed sliding window is greater than 0, an edge exists between the two words, and the weight of the edge is the PMI value. PMI (Pointwise Mutual Information) derives from the concept of mutual information in information theory and is typically used to measure the correlation between two random variables, such as the correlation between two words.
PMI(i,j) = log( p(i,j) / (p(i) p(j)) ), with p(i,j) = #W(i,j)/#W and p(i) = #W(i)/#W,

where #W(i) denotes the number of sliding windows containing word i, #W(i,j) denotes the number of sliding windows containing both words i and j, and #W denotes the total number of sliding windows. When PMI(i,j) is positive, words i and j have strong semantic relevance; when PMI(i,j) is negative, their semantic relevance is low.
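The window-based PMI used for the word-word edges can be sketched as follows, treating #W as the total number of sliding windows and keeping an edge only when the value is positive:

```python
import math

def pmi(i, j, windows):
    """windows: list of word sets, one per fixed-size sliding window."""
    n = len(windows)
    p_i = sum(1 for w in windows if i in w) / n
    p_j = sum(1 for w in windows if j in w) / n
    p_ij = sum(1 for w in windows if i in w and j in w) / n
    if p_ij == 0.0:
        return float("-inf")  # never co-occur: no edge
    return math.log(p_ij / (p_i * p_j))

windows = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"c"}]
```

Words that co-occur more often than chance ("a" and "b" above) get positive PMI and an edge; words that never share a window get no edge.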
The relational expression between the nodes is as follows:

A_ij = PMI(i,j), if i and j are words and PMI(i,j) > 0; TF-IDF(i,j), if i is a sentence and j is a word it contains; 1, if i = j; 0, otherwise.

After the graph is constructed, we normalize the relation (adjacency) matrix A of the whole graph to obtain Ã and substitute it into a two-layer graph convolutional network (GCN) to obtain the global structural feature vector RC of each sentence:

RC = ReLU(Ã ReLU(Ã X W_0) W_1),

where X is the initial node feature matrix, W_0 denotes the weight matrix of the first layer of the graph neural network, W_1 denotes the weight matrix of the second layer, and ReLU is an activation function used to introduce non-linear factors into the model and enhance its expressive power.
The process of channel three is shown in fig. 5. Each sentence in the corpus is composed into its own graph containing only word nodes: the sentence is segmented into words, and the edges between word nodes are given by word co-occurrence relations, represented in the same way as the word-node relations in channel two. The initial state of the i-th word node is a representation containing semantic information obtained from a pre-trained language model; here a pre-trained language model refers to a model such as BERT, obtained by unsupervised training on a large-scale corpus, that provides general-purpose word representations in natural language processing. The update of a word node depends on information interaction with its neighboring word nodes. Taking the update of the second word node in fig. 5 as an example, starting from its initial state, the node updates its information by fusing the initial information of the other nodes connected to it and passing the result through a GGNN (gated graph neural network), a graph neural network that uses a GRU as its gating unit to realize information interaction among graph nodes. The GRU (Gate Recurrent Unit) is a type of recurrent neural network (RNN) proposed to address long-term memory and gradient problems in back-propagation; it contains an update gate and a reset gate.
The method is as follows. For each word node, the semantic information of the word obtained from the pre-trained language model serves as its initial state representation. When the state of a word node is updated, the state information of its neighboring nodes is first fused and used as the memory information of the GRU; combined with the node's own initial state, the information of the corresponding time step is obtained through the update gate and reset gate, and the updated node state is obtained after a non-linear layer. Because each graph-node update step can only gather first-order neighbor information, higher-order feature interaction is realized over multiple time steps: after t steps, a node can obtain information from nodes t hops away. After all updated node states are obtained, an attention mechanism is applied over all nodes, yielding the graph semantic representation RS_i of the sentence.
a_t = A h_{t-1} W_a, (9)

z_t = σ(W_z a_t + U_z h_{t-1} + b_z), (10)

r_t = σ(W_r a_t + U_r h_{t-1} + b_r), (11)

h̃_t = tanh(W a_t + U (r_t ⊙ h_{t-1}) + b), (12)

h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t, (13)

where A denotes the adjacency matrix of the sentence graph, W and U denote weight matrices learned during training, b denotes a bias learned during training, z denotes the update gate, r denotes the reset gate, σ denotes the sigmoid function, ⊙ denotes element-wise multiplication, and V denotes the number of nodes in the graph structure.
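One GGNN time step implementing formulas (9)-(11), plus the standard GRU candidate-state and interpolation steps, might look like this in NumPy. The row-vector convention and parameter shapes are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(A, h, params):
    """One update of all V node states h (V x d) on adjacency A (V x V)."""
    Wa, Wz, Uz, bz, Wr, Ur, br, W, U, b = params
    a = A @ h @ Wa                              # (9)  aggregate neighbour states
    z = sigmoid(a @ Wz + h @ Uz + bz)           # (10) update gate
    r = sigmoid(a @ Wr + h @ Ur + br)           # (11) reset gate
    h_tilde = np.tanh(a @ W + (r * h) @ U + b)  # candidate state
    return (1.0 - z) * h + z * h_tilde          # gated interpolation
```

Running t such steps lets each node see information from nodes up to t hops away, as described above.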
After the feature representations RD_i, RC_i and RS_i of the different levels of a sentence are obtained through the three channels, the three levels of features are fused into the final feature R(Si) of sentence Si, which is fed into a softmax layer to classify the sentence's difficulty level. The classical cross-entropy loss is used:

L = − Σ_{d ∈ YD} Σ_f Y_{df} ln Z_{df},

where YD denotes the set of labeled documents, Y_{df} denotes the gold label, and Z_{df} denotes the predicted label.
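The fusion and classification step can be sketched as follows. Concatenation is only one plausible fusion operator (the embodiment says the three vectors are "fused" without fixing the operator), and the layer weights are placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify_sentence(rd, rc, rs, W, b):
    """Fuse RD_i, RC_i, RS_i (here: concatenation) and classify with softmax."""
    r = np.concatenate([rd, rc, rs])  # fused sentence feature R(Si)
    return softmax(W @ r + b)         # distribution over difficulty levels
```

With nine difficulty levels, W has nine rows; the argmax of the output distribution is the predicted level.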
3. Chapter readability evaluation model based on a graph neural network
As shown in fig. 6, chapter-level text readability is modeled hierarchically. First, from the bottom up, each sentence in the chapter is composed into a graph to obtain its Chinese character difficulty feature representation and its graph semantic representation; then a chapter-level graph is built with sentences as nodes, and a feature representation of the whole chapter is obtained through interaction among the sentence-node feature representations, so that the readability level of the chapter text can be evaluated. The specific implementation process is as follows:
considering the structural differences between chapters and sentences, we model readability hierarchically. The chapter is first split into sentences, and each sentence is then segmented into words and characters. From the bottom up, with words as nodes, each sentence in the chapter is modeled to obtain its readability feature representation; then, with sentences as nodes, the whole chapter is modeled, achieving readability analysis of the chapter.
The details of composing each sentence within a chapter are shown in fig. 7. When modeling at the sentence level, on the one hand, the first word of each sentence is taken as the master node; the remaining word nodes form a directed graph following the running-text order, and the master node establishes bidirectional connections with all word nodes in the sentence. This composition both respects the front-to-back order of the text, so that the sequential semantic order is well represented, and lets the degree of the master node effectively reflect readability information such as sentence length, giving the model relatively comprehensive control over sentence difficulty. During interactive node updating, the master node's information does not participate in updates; the other nodes are updated as shown in fig. 5. When a node is updated, the information of all its neighbor nodes is aggregated to obtain M_{t+1}; the aggregation function is shown in formula (17) below and is realized through a single-layer perceptron (MLP). The aggregated information M_{t+1} is then combined with the node's own state h_t through a GRU, as shown in formula (18) below, giving the node's updated state h_{t+1}. After all nodes have been updated, the master node's information is fused with that of all the other nodes, yielding the semantic difficulty vector representation G_send_i of the sentence.
On the other hand, each Chinese character in the sentence is mapped to its difficulty to obtain a sentence difficulty vector D_send_i that takes character difficulty as its feature; D_send_i is then fused with the graph semantic vector G_send_i output by the sentence graph model, obtaining the vector representation R(Si) of the sentence.
M t+1 =MLP t+1 (D -1 Ah t ), (17)
h t+1 =GRU(h t ,M t+1 ), (18)
The GRU is a gated recurrent neural network; in formula (18), the combination realized by the GRU is given by formulas (10)-(13).
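Formulas (17)-(18), degree-normalized neighbour aggregation through an MLP followed by a GRU combine, can be sketched as below. The single linear ReLU layer for the MLP and the bias-free GRU gates are simplifications:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_update_step(A, h, W_mlp, gru_params):
    """h: V x d node states; A: V x V adjacency; deg plays the role of D."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0.0] = 1.0                       # isolated nodes: zero aggregate
    m = np.maximum((A @ h / deg) @ W_mlp, 0.0)  # (17) M_{t+1} = MLP(D^{-1} A h_t)
    Wz, Uz, Wr, Ur, W, U = gru_params           # (18) h_{t+1} = GRU(h_t, M_{t+1})
    z = sigmoid(m @ Wz + h @ Uz)
    r = sigmoid(m @ Wr + h @ Ur)
    h_tilde = np.tanh(m @ W + (r * h) @ U)
    return (1.0 - z) * h + z * h_tilde
```

The same step serves both levels of the hierarchy: word nodes within a sentence graph, and sentence nodes within the chapter graph.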
As shown in fig. 8, after sentence-level modeling we obtain the difficulty vector of each sentence in the chapter. Then, with sentences as nodes and R(Si) as each node's initial state, a directed graph of the whole chapter is built following the running-text order. Through information aggregation and updating of the sentence nodes in the graph model, node information is learned and exchanged across the whole chapter; aggregation is realized as in formula (17), and the combined update of node information as in formula (18). After all sentence nodes have been interactively updated, a difficulty representation vector R(Pi) of the whole chapter is obtained through a readout process with an attention mechanism, and the chapter's level is analyzed by a multi-layer perceptron, thereby classifying the chapter's readability level.
In summary, addressing the lack of direct language-difficulty features at the Chinese character level in the prior art, this embodiment provides a character difficulty grading method suitable for native Chinese learners and uses the resulting character difficulty as an important feature for evaluating the readability of Chinese text. Addressing the low interpretability and limited performance gains of existing deep learning models on readability evaluation, readability evaluation models based on graph neural networks are provided for sentence-level and chapter-level texts respectively, with character difficulty features integrated into the models, realizing multi-dimensional, multi-channel and multi-level text readability modeling. The method of this embodiment makes readability easier to understand intuitively and improves the accuracy and interpretability of the model.
Second embodiment
The embodiment provides an automatic text readability assessment device, which comprises the following modules:
the Chinese character difficulty information acquisition module is used for acquiring the difficulty level information of each Chinese character in the text to be evaluated based on a pre-constructed Chinese character difficulty level table; the Chinese character difficulty level table comprises Chinese characters with difficulty to be evaluated and difficulty levels corresponding to the Chinese characters with the difficulty to be evaluated;
The graph-neural-network-based automatic text readability evaluation module is used for composing the text to be evaluated into a graph and combining the Chinese character difficulty level information acquired by the Chinese character difficulty information acquisition module with a graph neural network to realize automatic evaluation of the readability of the text to be evaluated; wherein the readability evaluation of sentences is converted into a graph node classification task and the readability evaluation of paragraphs and chapters is converted into a graph classification task.
The text readability automatic evaluation apparatus of the present embodiment corresponds to the text readability automatic evaluation method of the above-described first embodiment; the functions implemented by the functional modules in the text readability automatic assessment apparatus of the present embodiment are in one-to-one correspondence with the flow steps in the text readability automatic assessment method of the first embodiment; therefore, the description is omitted here.
Third embodiment
The embodiment provides an electronic device, which comprises a processor and a memory; wherein the memory stores at least one instruction that is loaded and executed by the processor to implement the method of the first embodiment.
The electronic device may vary considerably in configuration or performance and may include one or more processors (central processing units, CPU) and one or more memories having at least one instruction stored therein that is loaded by the processors and performs the methods described above.
Fourth embodiment
The present embodiment provides a computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the method of the first embodiment described above. The computer readable storage medium may be, among other things, ROM, random access memory, CD-ROM, magnetic tape, floppy disk, optical data storage device, etc. The instructions stored therein may be loaded by a processor in the terminal and perform the methods described above.
Furthermore, it should be noted that the present invention can be provided as a method, an apparatus, or a computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or terminal device comprising the element.
Finally, it should be noted that the above describes preferred embodiments of the invention. It will be obvious to those skilled in the art that, once the basic inventive concepts are known, several modifications and adaptations can be made without departing from the principles of the invention, and these modifications and adaptations are intended to fall within the scope of the invention. It is therefore intended that the appended claims be interpreted as covering the preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Claims (9)

1. A method for automatically evaluating text readability, comprising:
constructing a Chinese character difficulty level table suitable for a Chinese native language learner; the Chinese character difficulty level table comprises Chinese characters with difficulty to be evaluated and difficulty levels corresponding to the Chinese characters with the difficulty to be evaluated;
based on the Chinese character difficulty level table, obtaining difficulty level information of each Chinese character in the text to be evaluated;
the text to be evaluated is composed into a graph, and the difficulty level information of Chinese characters in the text to be evaluated is combined with a graph neural network to realize automatic evaluation of the readability of the text to be evaluated; wherein the readability evaluation of sentences is converted into graph node classification tasks, and the readability evaluation of paragraphs and chapters is converted into graph classification tasks;
When the text to be evaluated is a sentence, the text to be evaluated is patterned, and the difficulty level information of the Chinese characters in the text to be evaluated is combined with the graphic neural network to realize automatic evaluation of the readability of the text to be evaluated, and the method comprises the following steps:
respectively taking a fixed length and vectorizing a number sequence formed by difficulty level information corresponding to all Chinese characters in each sentence to obtain a first feature vector which corresponds to each sentence and is used for representing the difficulty features of the Chinese characters;
carrying out global heterogeneous graph construction on a sentence corpus formed by all sentences, and obtaining a second feature vector which corresponds to each sentence and is used for representing the structural features of each sentence from the constructed global heterogeneous graph;
independent composition is carried out on each sentence to obtain an independent graph corresponding to each sentence, semantic representation vectors of each word of the sentence are initialized through a pre-training language model, and a third feature vector which corresponds to each sentence and is used for representing semantic features of the sentence is obtained through node information interaction in the independent graph corresponding to each sentence;
and respectively carrying out fusion processing on the first feature vector, the second feature vector and the third feature vector corresponding to each sentence to obtain a fusion feature vector corresponding to the current sentence, and classifying the difficulty level of the corresponding sentence through the fusion feature vector so as to realize automatic evaluation of sentence readability.
2. The method for automatically evaluating the readability of text according to claim 1, wherein said constructing a chinese character difficulty level table suitable for a chinese native language learner comprises:
collecting the appendices of the raw words and the words required to be mastered after each text in the Chinese teaching materials of one to nine grades;
the collected new words and word appendices are assembled into a Chinese character table, and the grades which are first appeared in the teaching material and are required to be mastered are used as the corresponding difficulty grades of the current Chinese character, so that all Chinese characters are divided into nine difficulty grades, and a Chinese character difficulty grade table suitable for Chinese native language learners is constructed.
3. The method for automatically evaluating the readability of a text according to claim 2, wherein obtaining difficulty level information of each kanji in the text to be evaluated based on the kanji difficulty level table comprises:
for Chinese characters which do not appear in the Chinese character difficulty level table, the difficulty level is represented by 0.
4. The method for automatically evaluating the readability of text according to claim 1, wherein the step of respectively performing a length-fixing and vectorization operation on a sequence of difficulty level information corresponding to all Chinese characters in each sentence to obtain a first feature vector corresponding to each sentence and used for representing difficulty features of the Chinese characters comprises the steps of:
When the sentence is insufficient in length, representing the insufficient part by a preset character, and setting the difficulty level of the insufficient part to be 0; when the sentence length is exceeded, the excess portion is intercepted and discarded.
5. The method for automatically evaluating the readability of text according to claim 1, wherein the global heterogeneous map construction is performed on a sentence corpus composed of all sentences, and comprises:
firstly, constructing a global heterogram for the whole sentence corpus, and if the sentence corpus contains N pieces of sentence data, having N sentence nodes; then, word segmentation and duplication removal are carried out on the whole sentence corpus, word disabling operation is carried out, M words are obtained, and M word nodes are obtained;
the constructed heterogeneous graph comprises two node relations, representing edges between nodes and the weights on the edges; one node relation is between sentence nodes and word nodes: if sentence i contains word j, an edge exists between the sentence node S_i corresponding to sentence i and the word node W_j corresponding to word j, and the weight of the edge is the TF-IDF value of word j relative to sentence i; the other node relation is between word nodes, obtained through the word co-occurrence relation: if the PMI value of two words within a fixed sliding window is greater than 0, an edge exists between the two words, and the weight of the edge is the PMI value of the two words.
6. The method for automatically evaluating the readability of text according to claim 5, wherein the step of individually composing each sentence to obtain an independent graph corresponding to each sentence, initializing semantic representation vectors of each word of the sentence through a pre-training language model, and obtaining a third vector corresponding to each sentence and representing semantic features thereof through graph node information interaction in the independent graph corresponding to each sentence comprises:
taking words as nodes, and independently composing each sentence; wherein, the edges between word nodes are represented by word co-occurrence relations; for each word node, semantic information of each word is obtained through a pre-training language model to serve as initial state information of the corresponding word node, when the state of each word node is updated, state information of other adjacent nodes is fused to serve as memory information of a GRU network, and then the initial state information of the current word node is combined to update; after obtaining all updated word node information, processing attention mechanisms on all word nodes, so as to obtain a third vector which corresponds to the current sentence and is used for representing semantic features of the current sentence.
7. The automatic text readability assessment method according to claim 1, wherein when the text to be assessed is a paragraph or chapter, the text to be assessed is patterned, and the difficulty level information of the Chinese characters in the text to be assessed is combined with a graphic neural network to realize automatic assessment of the readability of the text to be assessed, comprising:
Firstly, dividing sentences of a text to be evaluated, and then dividing words and dividing words of each divided sentence; and then taking the words as nodes, firstly composing the sentence to obtain the readability characteristic representation of the sentence, and then taking the sentence as the node to compose the whole text to be evaluated, thereby realizing the readability evaluation of the text to be evaluated.
8. The method for automatically evaluating text readability according to claim 7, wherein composing a sentence into a graph with words as nodes to obtain a readability feature representation of the sentence comprises:
taking the first word of the sentence as a master node and connecting the remaining word nodes into a directed graph according to textual order to obtain a sentence graph model, wherein the master node establishes bidirectional connections with all word nodes in the current sentence;
performing fixed-length and vectorization operations on the sequence formed by the difficulty level information of all Chinese characters in the sentence to obtain a sentence difficulty vector representing the sentence's Chinese character difficulty features;
fusing the sentence difficulty vector with the graph semantic vector output by the sentence graph model to obtain a readability feature representation carrying the current sentence's difficulty information;
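The graph construction and difficulty vectorization steps above can be sketched as follows; this is a minimal illustration, and the maximum length, number of difficulty levels, embedding dimension, and mean pooling are all assumptions not specified by the claim:

```python
import numpy as np

def sentence_graph(n_words):
    """Adjacency for one sentence: node 0 is the master node (first word),
    the remaining words are chained in textual order, and the master node
    is bidirectionally connected to every word node."""
    A = np.zeros((n_words, n_words))
    for i in range(1, n_words - 1):
        A[i, i + 1] = 1                    # directed edge in textual order
    A[0, 1:] = 1                           # master node -> every word node
    A[1:, 0] = 1                           # every word node -> master node
    return A

def difficulty_vector(levels, max_len=32, n_levels=7, dim=16, seed=0):
    """Fix the length of the per-character difficulty-level sequence
    (truncate or zero-pad), embed each level, and mean-pool the embeddings
    into one sentence difficulty vector."""
    emb = np.random.default_rng(seed).normal(size=(n_levels + 1, dim))
    seq = levels[:max_len] + [0] * (max_len - min(len(levels), max_len))
    return emb[np.array(seq)].mean(axis=0)
```

The final fusion step could then be as simple as concatenating this difficulty vector with the graph semantic vector, e.g. `np.concatenate([diff_vec, graph_vec])`.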
wherein taking sentences as nodes and composing the whole text to be evaluated into a graph to realize readability evaluation of the text comprises:
taking sentences as nodes, using the readability feature representation of each sentence as its node information, and constructing a directed graph according to textual order to obtain a directed graph model corresponding to the text to be evaluated; learning and updating the nodes through information transfer among the nodes of the directed graph model; and then reading out with an attention mechanism to obtain a representation vector characterizing the difficulty of the whole text to be evaluated, thereby realizing readability evaluation of the text to be evaluated.
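The text-level step can be sketched as below; the single message pass along the textual-order chain and the dot-product attention query are illustrative assumptions, not the claim's required formulation:

```python
import numpy as np

def text_readout(S, q):
    """S: (n_sent, dim) sentence readability representations in textual
    order; q: (dim,) attention query. One message pass along the directed
    textual-order chain, then attention pooling into a single vector
    representing the difficulty of the whole text."""
    n = S.shape[0]
    A = np.eye(n, k=-1)                    # edge from sentence i-1 to sentence i
    H = np.tanh(S + A @ S)                 # update each node with its predecessor
    scores = H @ q                         # attention score per sentence node
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()            # softmax over sentence nodes
    return alpha @ H                       # attention readout: text-level vector
```

The returned vector would then feed a classifier over text difficulty levels (the graph classification task).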
9. An automatic text readability evaluation apparatus, comprising:
the Chinese character difficulty information acquisition module, configured to acquire the difficulty level information of each Chinese character in the text to be evaluated based on a pre-constructed Chinese character difficulty level table, wherein the Chinese character difficulty level table comprises the Chinese characters whose difficulty is to be evaluated and the difficulty level corresponding to each such character;
the graph-neural-network-based automatic text readability evaluation module, configured to compose the text to be evaluated into a graph and to combine the difficulty level information of the Chinese characters acquired by the Chinese character difficulty information acquisition module with a graph neural network to realize automatic evaluation of the readability of the text to be evaluated, wherein readability evaluation of sentences is converted into a graph node classification task, and readability evaluation of paragraphs and chapters is converted into a graph classification task;
wherein, when the text to be evaluated is a sentence, composing the text to be evaluated into a graph and combining the difficulty level information of its Chinese characters with the graph neural network to realize automatic evaluation of its readability comprises:
performing fixed-length and vectorization operations on the number sequence formed by the difficulty level information of all Chinese characters in each sentence to obtain a first feature vector representing the Chinese character difficulty features of each sentence;
constructing a global heterogeneous graph over the sentence corpus formed by all sentences, and obtaining from the constructed global heterogeneous graph a second feature vector representing the structural features of each sentence;
composing each sentence into an independent graph, initializing the semantic representation vector of each word in the sentence through a pre-trained language model, and obtaining, through node information interaction within each sentence's independent graph, a third feature vector representing the semantic features of each sentence; and
fusing, for each sentence, the corresponding first, second and third feature vectors to obtain a fused feature vector of the current sentence, and classifying the difficulty level of the sentence via the fused feature vector, thereby realizing automatic evaluation of sentence readability.
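The final fusion-and-classification step can be sketched as a concatenation followed by a softmax linear classifier; the concatenation fusion, weight shapes, and number of difficulty levels are illustrative assumptions:

```python
import numpy as np

def classify_difficulty(v1, v2, v3, W, b):
    """Fuse the difficulty (v1), structural (v2) and semantic (v3) feature
    vectors by concatenation, then score each difficulty level with a
    softmax linear classifier."""
    fused = np.concatenate([v1, v2, v3])   # fused feature vector
    logits = fused @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()                     # probabilities over difficulty levels
```

The predicted sentence difficulty level is then the argmax of the returned distribution.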
CN202110692831.XA 2021-06-22 2021-06-22 Text readability automatic evaluation method and device Active CN113343690B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110692831.XA CN113343690B (en) 2021-06-22 2021-06-22 Text readability automatic evaluation method and device


Publications (2)

Publication Number Publication Date
CN113343690A CN113343690A (en) 2021-09-03
CN113343690B true CN113343690B (en) 2024-03-12

Family

ID=77477664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110692831.XA Active CN113343690B (en) 2021-06-22 2021-06-22 Text readability automatic evaluation method and device

Country Status (1)

Country Link
CN (1) CN113343690B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113934850B (en) * 2021-11-02 2022-06-17 北京语言大学 Chinese text readability evaluation method and system fusing text distribution law characteristics
CN115859962B (en) * 2022-12-26 2023-06-16 北京师范大学 Text readability evaluation method and system
CN117236343B (en) * 2023-11-15 2024-03-12 江西师范大学 Automatic readability assessment method based on language feature interpreter and contrast learning

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814066A (en) * 2009-02-23 2010-08-25 富士通株式会社 Text reading difficulty judging device and method thereof
CN107506346A (en) * 2017-07-10 2017-12-22 北京享阅教育科技有限公司 A kind of Chinese reading grade of difficulty method and system based on machine learning
WO2018218705A1 (en) * 2017-05-27 2018-12-06 中国矿业大学 Method for recognizing network text named entity based on neural network probability disambiguation
CN109241540A (en) * 2018-08-07 2019-01-18 中国科学院计算技术研究所 A kind of blind automatic switching method of Chinese based on deep neural network and system
CN109977408A (en) * 2019-03-27 2019-07-05 西安电子科技大学 The implementation method of English Reading classification and reading matter recommender system based on deep learning
CN110569332A (en) * 2019-09-09 2019-12-13 腾讯科技(深圳)有限公司 Sentence feature extraction processing method and device
CN110727796A (en) * 2019-09-17 2020-01-24 华南理工大学 Multi-scale difficulty vector classification method for graded reading materials
CN111523299A (en) * 2019-09-12 2020-08-11 宋继华 Sentence difficulty level assessment method and system for international Chinese teaching
CN111563146A (en) * 2020-04-02 2020-08-21 华南理工大学 Inference-based difficulty controllable problem generation method
CN112487143A (en) * 2020-11-30 2021-03-12 重庆邮电大学 Public opinion big data analysis-based multi-label text classification method
CN112883714A (en) * 2021-03-17 2021-06-01 广西师范大学 ABSC task syntactic constraint method based on dependency graph convolution and transfer learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10726252B2 (en) * 2017-05-17 2020-07-28 Tab2Ex Llc Method of digitizing and extracting meaning from graphic objects


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Document-Level Text Classification Using Single-Layer Multisize Filters Convolutional Neural Network; Zheng Jiangbin et al.; IEEE; vol. 8; pp. 42689-42707 *
Sentence Difficulty Evaluation of Chinese Textbooks Based on Crowdsourced Annotation; Yu Dong, Wu Siyuan, Geng Chaoyang, Tang Yuling; Journal of Chinese Information Processing; vol. 34, no. 02; pp. 16-26 *
Research on Weibo Sentiment Analysis Based on Improved Dependency Parsing; Li Xuehong et al.; Computer and Digital Engineering; vol. 45, no. 3; pp. 506-511 *
A Survey of Automatic Analysis of Text Readability; Wu Siyuan et al.; Journal of Chinese Information Processing; vol. 32, no. 12; pp. 1-10 *
Research on Aspect-Level Sentiment Analysis of Product Reviews; Gao Tongchao; China Masters' Theses Full-text Database, Information Science and Technology; no. 2; p. I138-2611 *


Similar Documents

Publication Publication Date Title
CN106980683B (en) Blog text abstract generating method based on deep learning
CN110502749B (en) Text relation extraction method based on double-layer attention mechanism and bidirectional GRU
CN108021616B (en) Community question-answer expert recommendation method based on recurrent neural network
CN107992597B (en) Text structuring method for power grid fault case
CN112487143B (en) Public opinion big data analysis-based multi-label text classification method
CN113343690B (en) Text readability automatic evaluation method and device
CN109376242B (en) Text classification method based on cyclic neural network variant and convolutional neural network
US10503791B2 (en) System for creating a reasoning graph and for ranking of its nodes
CN109670039B (en) Semi-supervised e-commerce comment emotion analysis method based on three-part graph and cluster analysis
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN109697232A (en) A kind of Chinese text sentiment analysis method based on deep learning
CN107025284A (en) The recognition methods of network comment text emotion tendency and convolutional neural networks model
CN111274790B (en) Chapter-level event embedding method and device based on syntactic dependency graph
CN108874896B (en) Humor identification method based on neural network and humor characteristics
CN110750648A (en) Text emotion classification method based on deep learning and feature fusion
CN111078833A (en) Text classification method based on neural network
CN113254675B (en) Knowledge graph construction method based on self-adaptive few-sample relation extraction
CN112836051B (en) Online self-learning court electronic file text classification method
CN111368082A (en) Emotion analysis method for domain adaptive word embedding based on hierarchical network
CN114925205B (en) GCN-GRU text classification method based on contrast learning
CN114265935A (en) Science and technology project establishment management auxiliary decision-making method and system based on text mining
CN114897167A (en) Method and device for constructing knowledge graph in biological field
Gu et al. Enhancing text classification by graph neural networks with multi-granular topic-aware graph
CN115544252A (en) Text emotion classification method based on attention static routing capsule network
Sanuvala et al. A study of automated evaluation of student’s examination paper using machine learning techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant