CN115630140A - English reading material difficulty judgment method based on text feature fusion
- Publication number: CN115630140A (application number CN202211364247.2A)
- Authority: CN (China)
- Prior art keywords: sentence, bert, difficulty, english, embedding
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/3344: Query execution using natural language analysis
- G06F40/126: Character encoding
- G06F40/253: Grammatical analysis; Style critique
- G06F40/30: Semantic analysis
- G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08: Learning methods
Description
Technical Field
The invention relates to a method for judging the difficulty of English reading materials based on text feature fusion, and belongs to the technical field of natural language processing.
Background Art
English is widely learned as a second language, and reading is an important part of English learning. Accurately judging the difficulty of English reading materials, so that learners with different English levels can receive material suited to their own level, is particularly important for further promoting personalized learning.
Research on measuring the difficulty of English reading materials appeared in the early 20th century, and difficulty judgment for English reading materials has remained a core concern of researchers at home and abroad. Many researchers have therefore studied the factors that affect the difficulty of English reading materials, summarized numerous influencing factors, and produced many formulas for calculating the difficulty of English reading materials; these formulas have long helped people select suitable English texts. However, with the continuous development of informatization, texts have become increasingly complex, while rule-based formulas are usually simple and lack good generalization ability, so they cannot achieve good results.
With the continuous development of language models, Google proposed the BERT (Bidirectional Encoder Representations from Transformers) model in October 2018, bringing natural language processing into a new stage. BERT is a pre-trained language model. Unlike traditional language models, which use a unidirectional language model or a shallow concatenation of two unidirectional language models, BERT uses a masked language model (MLM) to train bidirectional Transformers, generating deep bidirectional language representations, and it performed excellently on 11 different natural language processing (NLP) benchmarks. Many scholars have achieved good results by combining BERT with other tasks in the NLP field. Transferring an already-trained model to a new model for further training is called transfer learning. Since most tasks are related to some extent, passing the learned parameters to the new model can greatly improve training efficiency. Fine-tuning, one transfer learning method, freezes part of the layers in the pre-trained model and trains only the remaining layers and the fully connected layers, which further shortens training time and reduces the cost of model training.
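As an illustration of the layer-freezing strategy involved in fine-tuning, the following is a minimal sketch assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is prescribed here, and the number of frozen layers is an arbitrary choice for the sketch):

```python
# Sketch: freeze the lower layers of a pre-trained BERT so that only the upper
# layers and any task-specific head are updated during fine-tuning.
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")

# Freeze the embedding layer and the first 8 of the 12 encoder layers.
for param in bert.embeddings.parameters():
    param.requires_grad = False
for layer in bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in bert.parameters() if p.requires_grad)
print(f"Trainable parameters after freezing: {trainable}")
```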
Summary of the Invention
The technical problem to be solved by the present invention is to provide a method for judging the difficulty of English reading materials based on text feature fusion, which improves the accuracy and efficiency of difficulty judgment for English reading materials.
By summarizing linguists' views on the factors that affect the difficulty of English reading materials, and considering the advantages of pre-trained language models in natural language processing tasks, the present invention proposes a method for judging the difficulty of English reading materials based on text feature fusion: multiple text features are fused, and deep learning technology is used to judge the difficulty of English reading materials.
The technical solution of the present invention is a method for judging the difficulty of English reading materials based on text feature fusion. First, for an English reading material data set, the input English text is encoded and the encoded information is fed into a trained pre-trained language model to obtain a feature vector containing semantic information. Then the input text is part-of-speech tagged, and the resulting part-of-speech sequence is fed into an LSTM to obtain a feature vector containing grammatical information. Factors that affect the difficulty of English reading materials are counted and embedded. All features are concatenated and passed through a fully connected layer, and finally a sigmoid layer outputs a value between 0 and 1 representing the difficulty.
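For illustration only, the overall fusion described above can be sketched in PyTorch as follows (the class name, dimensions and parameter choices below are assumptions for the sketch and are not specified by the present invention):

```python
import torch
import torch.nn as nn

class DifficultyModel(nn.Module):
    """Fuses a BERT sentence vector, an LSTM grammar feature and statistical embeddings."""
    def __init__(self, bert_dim=768, pos_vocab=50, pos_dim=64, stat_dim=32):
        super().__init__()
        self.pos_embedding = nn.Embedding(pos_vocab, pos_dim)   # part-of-speech tag embeddings
        self.lstm = nn.LSTM(pos_dim, pos_dim, batch_first=True)
        self.fc = nn.Linear(bert_dim + pos_dim + 3 * stat_dim, 1)

    def forward(self, cls_vec, pos_ids, stat_vec):
        # cls_vec:  (B, bert_dim)    [CLS] vector from a pre-trained BERT
        # pos_ids:  (B, L)           part-of-speech tag ids of the sentence
        # stat_vec: (B, 3*stat_dim)  embedded length / preposition-count / word-length features
        _, (h_n, _) = self.lstm(self.pos_embedding(pos_ids))
        grammar = h_n[-1]                          # last hidden state as the grammar feature
        fused = torch.cat([cls_vec, grammar, stat_vec], dim=-1)
        return torch.sigmoid(self.fc(fused))       # difficulty score in [0, 1]
```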
The specific steps for judging English reading difficulty are as follows:
Step1: Use the pre-trained language model to extract the semantic features of the text.
First, for the English reading material data set (the Newsela data set and a self-collected data set are used for the experiments), the input English text is encoded, and the encoded information is fed into the trained pre-trained language model to obtain a feature vector containing semantic information.
Specifically, word, sentence-position and word-position information is first extracted from the sentence and one-hot encoded, then input into the pre-trained language model to obtain the semantic feature vector. The pre-trained model used in the present invention is BERT.
Step2: Grammatical information feature extraction.
The text is part-of-speech tagged, and the resulting part-of-speech sequence is fed into an LSTM to obtain a feature vector containing grammatical information.
Step3: Statistical information feature extraction.
Factors that affect the difficulty of English reading materials are counted and embedded; all features are concatenated and input into the fully connected layer, and finally the sigmoid layer outputs a value between 0 and 1 representing the difficulty.
Step4: Difficulty prediction.
The sigmoid layer output yields a value between 0 and 1 representing the difficulty.
Step1 is specifically as follows:
Step1.1: Assume the current input English text is S_t, which contains n words: S_t = {w_1, w_2, ..., w_i, ..., w_n}, where w_i denotes the i-th word.
The BERT model conventionally adds [CLS] at the beginning to mark the start of a passage and inserts [SEP] between two sentences to separate them.
The transformed sentence is S_BERT = {[CLS], w_1, w_2, ..., [SEP], ..., w_(n-2), w_(n-1), w_n, [SEP]}.
Step1.2: Set the maximum length of S_BERT to M. If the length of S_t is less than M, [PAD] tokens are appended to S_BERT; after padding, S_BERT is:
S_BERT = {[CLS], w_1, w_2, ..., [SEP], ..., w_(n-2), w_(n-1), w_n, [SEP], ..., [PAD]}
If the length of S_t is greater than M, the excess content is truncated and discarded; after truncation, S_BERT is:
S_BERT = {[CLS], w_1, w_2, ..., [SEP], ..., w_(M-2), w_(M-1), w_M, [SEP]}
Step1.3: Perform embedding encoding on every element of S_BERT, obtaining S_embedding ∈ R^(M×D_BERT), where D_BERT denotes the embedding dimension set by the pre-trained language model.
Step1.4: Perform segment (sentence-position) encoding on the contents of S_BERT, i.e.:
S_segmentembedding = {E_A, E_A, E_A, E_B, E_B, E_B, E_B, ..., E_i, E_i}
where E_A denotes the first sentence, E_B denotes the second sentence, and so on for subsequent sentences; E_i denotes the i-th sentence.
Step1.5: Perform word-position encoding on the contents of S_BERT, i.e.:
S_positionembedding = {E_1, E_2, E_3, ..., E_i, ..., E_(n-2), E_(n-1), E_n, ..., E_M}
where E_i denotes the position encoding of the i-th word, E_i ∈ R^(D_BERT).
Step1.6: Input S_embedding, S_segmentembedding and S_positionembedding into the pre-trained language model (BERT by default) and take the feature vector O_BERT output by the last layer, O_BERT ∈ R^(M×D_BERT).
Step1.7: There are several options for selecting the sentence vector, e.g.: 1) take X_[CLS] as the sentence vector; 2) average-pool O_BERT and take the result; 3) max-pool O_BERT and take the result; 4) further extract features from O_BERT with a CNN; 5) feed O_BERT into an LSTM to extract features. In the task of the present invention, X_[CLS] is selected as the sentence vector.
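As a hedged illustration of Steps 1.3 to 1.7 (assuming the Hugging Face transformers API, which computes the token, segment and position embeddings internally; the checkpoint name is an assumption for the sketch):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

text = "English is widely learned as a second language."
inputs = tokenizer(text, padding="max_length", truncation=True,
                   max_length=128, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Option 1) of Step 1.7: take the last-layer [CLS] vector as the sentence vector.
cls_vector = outputs.last_hidden_state[:, 0, :]    # shape (1, 768)
```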
Step2 is specifically as follows:
Step2.1: For the input text S_t = {w_1, w_2, w_3, ..., w_n}, [CLS] is added at the beginning to mark the start of a sentence and [SEP] is added between two sentences to separate them; the transformed sentence is:
S_sen = {[CLS], w_1, w_2, ..., [SEP], ..., w_(n-2), w_(n-1), w_n, [SEP], ..., [PAD]}
Step2.2: Perform part-of-speech tagging on S_sen, obtaining:
S_POS = {[SPACE], [PRP], [VBP], [NNP], ..., [RB], [JJ], [SPACE], ..., [PAD]}
where [SPACE] stands for [CLS] and [SEP], [PRP] denotes a pronoun, [VBP] a verb, [NNP] a noun, [RB] an adverb of degree, and [JJ] an adjective.
Step2.3: Embed S_POS to obtain E_POS ∈ R^(M×D_POS), where D_POS denotes the embedding dimension of the part-of-speech tokens.
Step2.4: Feed E_POS into the LSTM and take the last-layer output O_POS as the feature vector of the sentence's part-of-speech sequence (i.e., the grammatical feature).
In Step2, the grammatical features of the sentence are computed. Grammar and vocabulary are key to distinguishing the difficulty of English texts, so grammatical complexity must be considered. The present invention takes the part-of-speech sequence of the sentence as input and uses an LSTM to learn features of the sequence, thereby obtaining a vectorized representation of the grammar that is fed into the neural network for the subsequent steps. In existing methods, grammatical information is mainly obtained by counting keywords or keyword co-occurrences, which cannot fully represent the sequence information; using an LSTM in this invention therefore learns the grammatical features better.
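A minimal sketch of Step2 follows (NLTK is used here as one possible part-of-speech tagger; the patent does not prescribe a specific tool, and the tag vocabulary and dimensions are illustrative):

```python
import torch
import torch.nn as nn
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)

words = "The committee postponed the controversial decision indefinitely".split()
tags = [tag for _, tag in nltk.pos_tag(words)]             # e.g. ['DT', 'NN', 'VBD', ...]

# Map each POS tag to an integer id (an illustrative vocabulary built on the fly).
tag2id = {tag: i for i, tag in enumerate(sorted(set(tags)))}
tag_ids = torch.tensor([[tag2id[t] for t in tags]])        # shape (1, sequence length)

pos_embedding = nn.Embedding(num_embeddings=len(tag2id), embedding_dim=64)
lstm = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)

_, (h_n, _) = lstm(pos_embedding(tag_ids))
grammar_feature = h_n[-1]                                  # (1, 64) grammar feature O_POS
```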
Step3 is specifically as follows:
In addition to semantics and grammar, factors such as sentence length, the number of prepositions and average word length also affect the difficulty of English reading materials, so these factors are counted, encoded and input into the model. After adding this information, the model converges faster during training and its robustness is further improved.
The specific steps are as follows:
Step3.1: Count the sentence length and embed it. For the sentence S_t = {w_1, w_2, ..., w_n}, the sentence-length embedding is E_n^L ∈ R^D, where L indicates that the vector is the embedding of the sentence length, n denotes the number of words, and D denotes the embedding dimension.
Step3.2: Count the number of prepositions and embed it. For the sentence S_t = {w_1, w_2, ..., w_n}, the preposition-count embedding is E_*^P ∈ R^D, where P indicates that the vector is the embedding of the preposition count, * denotes the specific count, and D denotes the embedding dimension.
Step3.3: Count the average word length and embed it. For the sentence S_t = {w_1, w_2, ..., w_n}, the average-word-length embedding is E_*^A ∈ R^D, where A indicates that the vector is the embedding of the average word length, * denotes the specific value, and D denotes the embedding dimension.
Step3.4: Concatenate E_n^L, E_*^P and E_*^A as the statistical information of the sentence:
O_STA = [E_n^L; E_*^P; E_*^A], where O_STA ∈ R^(3D).
Step4 is specifically as follows:
Step4.1: Concatenate the semantic feature X_[CLS], the grammatical feature O_POS and the statistical feature O_STA, input the result into the fully connected layer, and then into the sigmoid layer to obtain the predicted output:
y_pred = sigmoid(W · [X_[CLS]; O_POS; O_STA] + b)
Step4.2: Compute the loss:
Loss = -(1/N) Σ_i Σ_c y_ic · log(p_ic)
where y_ic denotes the true class of sample i (1 if it equals c, 0 otherwise), and p_ic denotes the predicted probability that sample i belongs to class c.
Step4.3: Use Adam to optimize the loss with the goal of minimizing it; when the loss reaches its minimum, the model achieves its best performance.
In this part, the above three features are concatenated and input into the neural network, and the sigmoid function limits the output to [0, 1], thereby realizing the difficulty judgment.
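A hedged sketch of one training step for Step4, reusing the DifficultyModel sketch given earlier (the batch contents below are random placeholders, and binary cross-entropy over the sigmoid output stands in for the loss described above):

```python
import torch
import torch.nn as nn

model = DifficultyModel()                       # the sketch class defined earlier
criterion = nn.BCELoss()                        # cross-entropy for the binary difficulty label
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Placeholder batch: [CLS] vectors, POS id sequences, statistical features, labels.
cls_vec = torch.randn(16, 768)
pos_ids = torch.randint(0, 50, (16, 128))
stat_vec = torch.randn(16, 96)
labels = torch.randint(0, 2, (16, 1)).float()

optimizer.zero_grad()
pred = model(cls_vec, pos_ids, stat_vec)        # difficulty scores in [0, 1]
loss = criterion(pred, labels)
loss.backward()
optimizer.step()
```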
The beneficial effects of the present invention are: when judging the difficulty of English text, the present invention comprehensively considers the semantic information, grammatical information and statistical information of the text. Compared with traditional methods, the present invention takes into account the importance of the semantic information of English text, uses an LSTM to learn the grammatical information of the text, and also feeds the traditional statistical information into the neural network for calculation. A difficulty-judgment model with better performance and stronger robustness than traditional methods is thus obtained.
Description of the Drawings
Figure 1 is a flow chart of the steps of the present invention.
Detailed Description of Embodiments
The present invention is further described below in conjunction with the accompanying drawings and specific embodiments.
Embodiment 1: As shown in Figure 1, a method for judging the difficulty of English reading materials based on text feature fusion. First, for the English reading material data set, the input English text is encoded and the encoded information is fed into the trained pre-trained language model to obtain a feature vector containing semantic information; then the English text is part-of-speech tagged, and the resulting part-of-speech sequence is fed into an LSTM to obtain a feature vector containing grammatical information; factors affecting the difficulty of English reading materials are counted and embedded; all features are concatenated and passed through a fully connected layer, and finally the sigmoid layer outputs a value between 0 and 1 representing the difficulty.
Suppose there is a collection A of English reading materials containing N English reading texts; then A = {S_1, S_2, S_3, ..., S_N}, where S_i denotes the i-th English reading text in the collection. The specific steps for judging English reading difficulty are as follows:
Step1: The pre-trained model selected by the present invention is BERT. The pre-trained language model part is mainly used to learn the semantic information of the text. Its input requires three kinds of features, namely the feature of each word, the sentence-position feature and the word-position feature; these three features are extracted.
Step2: Grammatical feature extraction.
Step3: Statistical information feature extraction.
Step4: Difficulty prediction.
Step1 is specifically as follows:
Step1.1: Assume the current input English text is S_t, which contains n words: S_t = {w_1, w_2, ..., w_i, ..., w_n}, where w_i denotes the i-th word.
The BERT model conventionally adds [CLS] at the beginning to mark the start of a passage and inserts [SEP] between two sentences to separate them.
The transformed sentence is S_BERT = {[CLS], w_1, w_2, ..., [SEP], ..., w_(n-2), w_(n-1), w_n, [SEP]}.
Step1.2: Set the maximum length of S_BERT to M. If the length of S_t is less than M, [PAD] tokens are appended to S_BERT; after padding, S_BERT = {[CLS], w_1, w_2, ..., [SEP], ..., w_(n-2), w_(n-1), w_n, [SEP], ..., [PAD]}.
If the length of S_t is greater than M, the excess content is truncated and discarded; after truncation, S_BERT = {[CLS], w_1, w_2, ..., [SEP], ..., w_(M-2), w_(M-1), w_M, [SEP]}.
Step1.3: Perform embedding encoding on every element of S_BERT, obtaining S_embedding ∈ R^(M×D_BERT), where D_BERT denotes the embedding dimension set by the pre-trained language model.
Step1.4: Perform segment (sentence-position) encoding on the contents of S_BERT, i.e.:
S_segmentembedding = {E_A, E_A, E_A, E_B, E_B, E_B, E_B, ..., E_i, E_i}
where E_A denotes the first sentence, E_B denotes the second sentence, and so on for subsequent sentences; E_i denotes the i-th sentence.
Step1.5: Perform word-position encoding on the contents of S_BERT, i.e.:
S_positionembedding = {E_1, E_2, E_3, ..., E_i, ..., E_(n-2), E_(n-1), E_n, ..., E_M}
where E_i denotes the position encoding of the i-th word, E_i ∈ R^(D_BERT).
Step1.6: Input S_embedding, S_segmentembedding and S_positionembedding into the pre-trained language model (BERT by default) and take the feature vector O_BERT output by the last layer, O_BERT ∈ R^(M×D_BERT).
Step1.7: There are several options for selecting the sentence vector, e.g.: 1) take X_[CLS] as the sentence vector; 2) average-pool O_BERT and take the result; 3) max-pool O_BERT and take the result; 4) further extract features from O_BERT with a CNN; 5) feed O_BERT into an LSTM to extract features. In the task of the present invention, X_[CLS] is selected as the sentence vector.
Step2 is specifically as follows:
Step2.1: For the input text S_t = {w_1, w_2, w_3, ..., w_n}, [CLS] is added at the beginning to mark the start of a sentence and [SEP] is added between two sentences to separate them; the transformed sentence is:
S_sen = {[CLS], w_1, w_2, ..., [SEP], ..., w_(n-2), w_(n-1), w_n, [SEP], ..., [PAD]}
Step2.2: Perform part-of-speech tagging on S_sen, obtaining:
S_POS = {[SPACE], [PRP], [VBP], [NNP], ..., [RB], [JJ], [SPACE], ..., [PAD]}
where [SPACE] stands for [CLS] and [SEP], [PRP] denotes a pronoun, [VBP] a verb, [NNP] a noun, [RB] an adverb of degree, and [JJ] an adjective.
Step2.3: Embed S_POS to obtain E_POS ∈ R^(M×D_POS), where D_POS denotes the embedding dimension of the part-of-speech tokens.
Step2.4: Feed E_POS into the LSTM and take the last-layer output O_POS as the feature vector of the sentence's grammatical information.
Step3 is specifically as follows:
In addition to semantics and grammar, factors such as sentence length, the number of prepositions and average word length also affect the difficulty of English reading materials, so these factors are counted, encoded and input into the model. The specific steps are as follows:
Step3.1: Count the sentence length and embed it. For the sentence S_t = {w_1, w_2, ..., w_n}, the sentence-length embedding is E_n^L ∈ R^D, where L indicates that the vector is the embedding of the sentence length, n denotes the number of words, and D denotes the embedding dimension.
Step3.2: Count the number of prepositions and embed it. For the sentence S_t = {w_1, w_2, ..., w_n}, the preposition-count embedding is E_*^P ∈ R^D, where P indicates that the vector is the embedding of the preposition count, * denotes the specific count, and D denotes the embedding dimension.
Step3.3: Count the average word length and embed it. For the sentence S_t = {w_1, w_2, ..., w_n}, the average-word-length embedding is E_*^A ∈ R^D, where A indicates that the vector is the embedding of the average word length, * denotes the specific value, and D denotes the embedding dimension.
Step3.4: Concatenate E_n^L, E_*^P and E_*^A as the statistical information of the sentence:
O_STA = [E_n^L; E_*^P; E_*^A], where O_STA ∈ R^(3D).
Step4 is specifically as follows:
Step4.1: Concatenate the semantic feature X_[CLS], the grammatical feature O_POS and the statistical feature O_STA, input the result into the fully connected layer, and then into the sigmoid layer to obtain the predicted output:
y_pred = sigmoid(W · [X_[CLS]; O_POS; O_STA] + b)
Step4.2: Compute the loss:
Loss = -(1/N) Σ_i Σ_c y_ic · log(p_ic)
where y_ic denotes the true class of sample i (1 if it equals c, 0 otherwise), and p_ic denotes the predicted probability that sample i belongs to class c.
Step4.3: Use Adam to optimize the loss with the goal of minimizing it; when the loss reaches its minimum, the model achieves its best performance.
In this embodiment, two English reading material data sets with difficulty labels, CEFR and Newsela, and one data set manually constructed by the present invention, CEED, are selected. CEFR and CEED are public graded English reading text data sets, while the Newsela data set is a non-public graded English reading text data set (it can be requested on the Newsela website). Basic statistics of the three data sets are shown in Table 1, where Num denotes the number of texts in the data set and Class denotes the number of grade categories.
Table 1: Basic information of the data sets
(1) CEFR consists of 1493 English texts, which are labeled by difficulty according to the Common European Framework of Reference (CEFR) levels A1, A2, B1, B2, C1 and C2, with difficulty increasing from A1 to C2. The English texts in the data set are taken from free online resources, including the British Council, ESLFast and the CNN/Daily Mail data set. The texts include dialogues, descriptions, short stories, newspaper stories and other articles.
(2) CEED is collected from 469 reading passages taken from real English examinations, including the senior high school entrance examination, the college entrance examination, CET-4, CET-6, TEM-4 and TEM-8. The difficulty labels are: senior high school entrance examination Z, college entrance examination G, CET-4 S, CET-6 L, TEM-4 E, and TEM-8 B. Difficulty increases from the senior high school entrance examination to TEM-8.
(3) Newsela consists of 10722 English texts. Difficulty is divided according to the American K-12 education standard: each English text is labeled with a number from 2 to 12, with difficulty increasing from 2 to 12.
The present invention organizes the English texts in the data sets as follows: first, each English text is read by paragraph; second, the difficulty level corresponding to each paragraph is marked; third, each paragraph is given a difficulty label; fourth, the number of words, the number of prepositions and the average word length of each paragraph are calculated, and the results are finally organized into a csv file. The numbers of paragraphs in the organized data sets are: CEFR contains 12096 paragraphs, Newsela contains 227971 paragraphs, and CEED contains 3381 paragraphs.
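The preprocessing described above can be sketched as follows (file paths, column names and the preposition list are illustrative assumptions; the exact paragraph segmentation of the source texts is not specified here):

```python
import csv

PREPOSITIONS = {"in", "on", "at", "of", "to", "for", "with", "by", "from", "about"}

def paragraph_stats(paragraph):
    words = paragraph.split()
    return {
        "text": paragraph,
        "num_words": len(words),
        "num_prepositions": sum(w.lower() in PREPOSITIONS for w in words),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
    }

def build_csv(texts_with_levels, out_path="dataset.csv"):
    """texts_with_levels: iterable of (full_text, difficulty_level) pairs."""
    fields = ["text", "num_words", "num_prepositions", "avg_word_length", "level"]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for full_text, level in texts_with_levels:
            for para in filter(None, (p.strip() for p in full_text.split("\n"))):
                row = paragraph_stats(para)
                row["level"] = level
                writer.writerow(row)
```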
To better obtain the difficulty coefficient in subsequent experiments, corresponding difficulty labels are added to the extracted paragraphs. In the CEFR data set, the difficulty labels of A1, A2, B1 and B2 are set to 0, and those of C1 and C2 are set to 1. In the Newsela data set, difficulty labels for levels greater than or equal to 6 are set to 1, and those for levels below 6 are set to 0. Because of similarities among the CEED categories, the present invention divides the data set into three subsets: the senior high school entrance examination and college entrance examination data form one subset, CEED-EE; the CET-4 and CET-6 data form one subset, CEED-CET; and the TEM-4 and TEM-8 data form one subset, CEED-TEM. The difficulty labels of the senior high school entrance examination, CET-4 and TEM-4 are set to 0, and those of the college entrance examination, CET-6 and TEM-8 are set to 1. The numbers of positive and negative samples in each organized data set are shown in Table 2.
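The binary label assignment above amounts to simple mapping rules; a sketch follows (the function names are illustrative):

```python
def cefr_label(level):        # A1/A2/B1/B2 -> 0, C1/C2 -> 1
    return 1 if level in {"C1", "C2"} else 0

def newsela_label(grade):     # grade >= 6 -> 1, otherwise 0
    return 1 if grade >= 6 else 0

def ceed_label(exam_tag):     # Z, S, E (lower-level exams) -> 0; G, L, B -> 1
    return 1 if exam_tag in {"G", "L", "B"} else 0
```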
Table 2: Number of positive and negative samples
The present invention selects classic pre-trained language models for fill-mask tasks from recent years, such as BERT, BART, XLNet, RoBERTa and XLM-RoBERTa, for testing, and compares them with CNN, LSTM and BiLSTM. For the parameter settings, PyTorch 1.10 and an NVIDIA GeForce RTX 2080Ti GPU are used. All pre-trained models are obtained from Hugging Face. The hyperparameters are selected as follows: batch size from {16, 32, 64}, learning rate from {1e-3, 1e-4, 1e-5}, and word embedding dimension 768. Different models are tested on the different data sets, with the following results:
Table 3: Experimental results of different models on CEFR and Newsela
As can be seen from Table 3, on both data sets the method of the present invention (when using BERT as the pre-trained language model) achieves the best results on all three metrics, AUC, ACC and RMSE. On the CEFR data set, the method of the present invention outperforms the second-best method on AUC, ACC and RMSE: AUC improves by 5.81%, ACC improves by 7.02%, and RMSE decreases by 5.14%. On the Newsela data set, the method of the present invention also outperforms the second-best method: AUC improves by 1.63%, ACC improves by 1.04%, and RMSE decreases by 1.15%. When the data set is small (the CEFR data set), the pre-trained language model needs less data to perform better.
Table 4: Results of different pre-trained language models on CEFR and Newsela
As shown in Table 4, the present invention compares the effects of different pre-trained language models, each of which improves and enhances BERT for different tasks. From the results, the BERT model achieves the best results on the CEFR data set: BERT outperforms the second-best model on AUC, ACC and RMSE, with AUC improved by 0.35%, ACC improved by 0.24%, and RMSE reduced by 0.92%. The XLNet model achieves the best results on the Newsela data set: compared with BERT, AUC improves by 0.37%, ACC improves by 0.58%, and RMSE decreases by 0.60%. However, the overall gap among these pre-trained models is small, and all of them outperform CNN and LSTM.
Table 5: Experimental results of different models on CEED
As can be seen from Table 5, the method of the present invention (when using BERT as the pre-trained language model) achieves the best results on all three metrics, AUC, ACC and RMSE, across the three CEED subsets. On the CEED-EE data set, the method of the present invention outperforms the second-best method on AUC, ACC and RMSE: AUC improves by 8.20%, ACC improves by 4.71%, and RMSE decreases by 7.05%. On the CEED-CET data set, the method of the present invention also outperforms the second-best method: AUC improves by 5.32%, ACC improves by 3.77%, and RMSE decreases by 1.95%. On the CEED-TEM data set, AUC and ACC improve by 9.09% and 12.5% respectively over the second-best method, and RMSE decreases by 8.51%.
Table 6: Results of different pre-trained language models on CEED
As shown in Table 6, the present invention also compares the effects of different pre-trained language models on the CEED data set. Overall, RoBERTa achieves good results on all three CEED subsets. On the CEED-EE data set, compared with BERT, AUC improves by 5.06%, ACC improves by 8.49%, and RMSE decreases by 13.1%. On the CEED-CET data set, compared with BERT, AUC improves by 3.99%, ACC improves by 11.32%, and RMSE decreases by 11.20%. On the CEED-TEM data set, compared with BERT, AUC improves by 3.17%, ACC improves by 3.12%, and RMSE decreases by 4.35%. Overall, these pre-trained language models outperform CNN and LSTM on all three metrics.
The specific embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments; various changes can be made within the scope of knowledge possessed by a person of ordinary skill in the art without departing from the gist of the present invention.
Claims (5)
Priority Applications (1)
- CN202211364247.2A (granted as CN115630140B), priority date 2022-11-02, filing date 2022-11-02: A method for judging the difficulty of English reading materials based on text feature fusion
Publications (2)
- CN115630140A, published 2023-01-20
- CN115630140B, published 2025-06-27
Family
- Family ID: 84909207
- Family application: CN202211364247.2A, filed 2022-11-02, granted as CN115630140B (active)
- Country status: CN, CN115630140B
Patent Citations (2)
- WO2018028077A1, priority date 2016-08-11, published 2018-02-15: Deep learning based method and device for Chinese semantics analysis
- CN113536808A, priority date 2021-08-18, published 2021-10-22: An automatic method for predicting the difficulty of reading comprehension test questions by introducing multiple textual relationships
Non-Patent Citations (1)
- Cheng Yong; Xu Dekuan; Dong Jun: "Research on automatic difficulty grading of Chinese text reading based on the fusion of multiple linguistic features and deep features" (基于多元语言特征与深度特征融合的中文文本阅读难度自动分级研究), Journal of Chinese Information Processing (中文信息学报), No. 04, 15 April 2020, pages 104-113
Cited By (5)
- CN116796045A (priority 2023-08-23, published 2023-09-22), 北京人天书店集团股份有限公司: Multi-dimensional book grading method, system and readable medium
- CN116796045B (priority 2023-08-23, published 2023-11-10), 北京人天书店集团股份有限公司: Multi-dimensional book grading method, system and readable medium
- CN118673911A (priority 2024-06-28, published 2024-09-20), 北京光年无限科技有限公司: Large model junior middle school English complete filling generation method based on self-supervision learning
- CN118916749A (priority 2024-10-09, published 2024-11-08), 云南师范大学: Llama 3-based English text difficulty judging method
- CN118916749B (priority 2024-10-09, published 2024-12-10), 云南师范大学: Llama 3-based English text difficulty judging method
Also Published As
- CN115630140B, published 2025-06-27
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant