CN111078874B - Foreign Chinese difficulty assessment method based on decision tree classification of random subspace - Google Patents
- Publication number
- CN111078874B (application CN201911206414.9A)
- Authority
- CN
- China
- Prior art keywords
- chinese
- article
- svm
- decision tree
- foreign
- Prior art date
- Legal status: Active
Classifications
- G—PHYSICS; G06—COMPUTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/35—Clustering; Classification (information retrieval of unstructured textual data)
- G06F16/3344—Query execution using natural language analysis
- G06F18/2411—Classification based on the proximity to a decision surface, e.g. support vector machines
- G06F18/24323—Tree-organised classifiers
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management (climate change mitigation in ICT)
Abstract
The invention discloses a method for assessing the difficulty of Chinese-as-a-foreign-language texts based on decision-tree classification over random-subspace feature selection with SVM and BERT models. From each article's length, readability, and related properties, 86 statistical features are generated and classified with an SVM to obtain confidence score 1. The article's encoding features are classified with an SVM to obtain confidence score 2. The two confidence scores are fused as new features and classified with a decision tree. As for the encoding features: the output of the -1 (last) layer of a BERT model is taken as the encoding, then average->max pooling is applied, yielding 768-dimensional features in total that require no normalization. The invention avoids the inefficiency and underfitting of traditional algorithms and makes the most reasonable use of all available information, so the enlarged basis for classification yields a marked improvement. The method achieves an accuracy of 85.6% on the difficulty assessment of Chinese as a foreign language.
Description
Technical Field
The invention belongs to the field of educational informatization, and specifically relates to a method for assessing the difficulty of Chinese as a foreign language based on decision-tree classification over random-subspace feature selection with SVM and BERT models.
Background
It is well known that reading should progress step by step, from easy to difficult. Material that is too hard frustrates students' confidence and kills their interest in reading; material that is too easy means low-level repetition, hinders the continued improvement of reading ability, and cannot meet the academic demands of reading complex texts and conducting research after entering university. In short, only material of suitable difficulty is best. As China plays an increasingly important role on the international stage, more people want to learn Chinese. Studying Chinese texts is one of the most effective ways to do so, but texts of a given difficulty require a corresponding level of proficiency from the learner; if a text demands more than the learner's current command of Chinese, the effort yields little and the learner's interest is severely discouraged. Moreover, when cultivating learners' writing ability, a variety of text styles should be provided for reference, and compositions should be judged and scored according to the style in which they are written. The classification of Chinese texts is therefore a key technology for computer-assisted Chinese learning systems.
The difficulty level of a graded Chinese-as-a-foreign-language reader refers to whether a reader at a given level is suitable for learners whose Chinese proficiency has reached that level, i.e. whether the reader turns out to be too difficult or too easy.
Text classification uses computers to automatically label a collection of texts according to some classification scheme or standard. Depending on whether deep learning is used, methods fall into two broad categories: the first is based on traditional machine learning, the second on deep learning. Some techniques in the second category combine deep learning with traditional machine learning methods.
In the late 1990s traditional machine learning developed rapidly and settled into a fixed pattern for text classification: feature engineering plus a classifier model. Feature engineering distills the information in the text so that a computer can easily read it, and is usually divided into three steps: text preprocessing, feature extraction, and text representation. Well-known classifier models include the naive Bayes classifier, KNN, SVM, maximum entropy, and so on.
In NLP methods based on deep neural networks, the characters/words of a text are usually represented by one-dimensional vectors (commonly called "word vectors"). The network takes the word vector of each character or word in the text as input and, after a series of transformations, outputs a vector as the semantic representation of the text. In particular, semantically similar characters/words are expected to lie close together in the feature vector space, so that text vectors built from character/word vectors carry more accurate semantic information. Accordingly, the main input to the BERT model is the raw word vector of each character/word in the text, which can be randomly initialized or pre-trained with algorithms such as Word2Vec as an initial value; the output is, for each character/word, a vector representation that fuses the semantics of the full text.
At present most Chinese text classification targets short, simple collections such as Weibo posts and news articles; applied to Chinese texts intended for language learners, existing methods give unsatisfactory results.
Summary of the Invention
Addressing at least one of the above deficiencies or needs for improvement of the prior art, in particular the complexity of text classification for Chinese learners, where classification standards shift with learners' differing needs, the present invention proposes a method for assessing the difficulty of Chinese as a foreign language based on the fusion of BERT-model, SVM, and decision-tree features. From the article's length, readability, and related characteristics, 86 statistical features are generated and classified with an SVM to obtain confidence score 1. The encoding features are classified with an SVM to obtain confidence score 2. The two confidence scores are fused as new features and classified with a decision tree.
To achieve the above object, according to one aspect of the present invention, a method for assessing the difficulty of Chinese as a foreign language based on decision-tree classification over random-subspace feature selection with SVM and BERT models is provided, comprising the following steps:
S1. Preprocess the Chinese-as-a-foreign-language articles.
S2. For the articles preprocessed in step S1, generate multiple features from the article's length, its readability, and its number of new words.
S3. Classify the articles, represented by all the above features, with a random-subspace SVM ensemble to obtain confidence score 1.
S4. For the articles preprocessed in step S1, take the output of the -1 (last) layer of a BERT model as the encoding, then apply average->max pooling to obtain the article's multi-dimensional encoding features.
S5. Classify the encoding features with a random-subspace SVM to obtain confidence score 2.
S6. Fuse the two confidence scores as new features and classify them with a decision tree.
Preferably, in step S1, preprocessing the articles includes saving them in txt format.
Preferably, in step S1, preprocessing the articles includes deleting blank lines from them.
Preferably, in step S1, preprocessing the articles includes splitting them into sentences.
Preferably, in step S1, sentence splitting uses python to cut each article into sentence units, store them in a list structure, and remove punctuation marks.
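The patent does not give the splitting code itself; a minimal sketch consistent with step S1 (cut on sentence-final marks, then strip the remaining punctuation, keeping the sentences in a list) might look like the following. The particular punctuation sets are an assumption, not taken from the patent.

```python
import re

# End-of-sentence marks used to cut the text; remaining punctuation is
# stripped afterwards. The exact rules are not given in the patent, so
# these two character classes are one plausible reading of step S1.
SENT_END = r"[。！？!?\n]+"
OTHER_PUNCT = r"[，、；：“”‘’（）《》,.;:\"'()<>]"

def split_sentences(article: str) -> list:
    pieces = re.split(SENT_END, article)
    sentences = [re.sub(OTHER_PUNCT, "", p).strip() for p in pieces]
    return [s for s in sentences if s]  # drop empty fragments

print(split_sentences("今天天气很好。我们去公园，好吗？好！"))
```

Each article then becomes a list of punctuation-free sentence strings, which is the input unit used by the later classification steps.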
Preferably, the features generated in step S2 include the total number of characters, total number of strokes, number of paragraphs, total number of sentences, and number of new words.
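A few of these statistical features can be sketched directly. Note that stroke counts need a per-character lookup table and the new-word count needs a graded vocabulary list, neither of which the patent supplies; the tiny `STROKES` and `KNOWN_WORDS` tables below are hypothetical stand-ins that exist only to make the sketch runnable.

```python
# Hypothetical lookup tables: a real system needs a full per-character
# stroke dictionary (e.g. built from Unihan data) and a graded vocabulary
# list; these tiny stand-ins exist only to make the sketch runnable.
STROKES = {"天": 4, "气": 4, "好": 6}
KNOWN_WORDS = {"天", "气"}

def statistical_features(paragraphs):
    """Compute a handful of the 86 statistical features of step S2 from an
    article given as a list of paragraphs, each a list of sentences."""
    sentences = [s for p in paragraphs for s in p]
    chars = [c for s in sentences for c in s]
    n_chars, n_sents = len(chars), len(sentences)
    return {
        "total_chars": n_chars,
        "total_strokes": sum(STROKES.get(c, 0) for c in chars),
        "paragraphs": len(paragraphs),
        "total_sentences": n_sents,
        "new_words": sum(1 for c in chars if c not in KNOWN_WORDS),
        # readability measures named in the description:
        "avg_sentence_len": n_chars / max(n_sents, 1),
        "sents_per_100_chars": 100.0 * n_sents / max(n_chars, 1),
    }

feats = statistical_features([["天气好", "好天气"], ["好"]])
print(feats["total_chars"], feats["total_strokes"], feats["new_words"])
```

The full 86-dimensional vector would extend this dictionary with further length, readability, and vocabulary measures of the same kind.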
Preferably, in step S6, a weighted average of confidence score 1 and confidence score 2 is taken as the overall output for the article. The above preferred technical features may be combined with one another as long as they do not conflict.
In general, compared with the prior art, the above technical scheme conceived by the present invention has the following beneficial effects. The method exploits the BERT model's strong text-feature extraction ability to obtain a semantically rich representation of each article, combined with the traditional statistical features of the article's characters and words, thereby making full use of the article's various features. The invention avoids the inefficiency and underfitting of traditional algorithms and makes the most reasonable use of all available information, so the enlarged basis for classification yields a marked improvement. The method achieves an accuracy of 85.6% on the difficulty assessment of Chinese as a foreign language.
Description of the Drawings
Figure 1 is an overall schematic diagram of the method of the present invention for assessing the difficulty of Chinese as a foreign language based on decision-tree classification over random-subspace feature selection with SVM and BERT models.
Figure 2 is a structural diagram of extracting an article's encoding features with the BERT model as used by the present invention.
Detailed Description
To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention, not to limit it. In addition, the technical features of the embodiments described below may be combined with one another as long as they do not conflict. The invention is further detailed below in combination with specific embodiments.
As shown in Figure 1, the present invention provides a method for assessing the difficulty of Chinese as a foreign language based on decision-tree classification over random-subspace feature selection with SVM and BERT models, comprising the following steps:
S1. Preprocess the articles: save them in txt format, delete blank lines, and split each article into sentences. Sentence splitting uses python to cut each article into sentence units, store them in a list structure, and remove punctuation marks.
S2. For the articles preprocessed in step S1, generate multiple features (for example 86) from the article's length, readability, and number of new words, including the total number of characters, total number of strokes, number of paragraphs, total number of sentences, and number of new words.
S3. Classify the articles, represented by all the above features, with a random-subspace SVM ensemble to obtain confidence score 1.
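The patent names a random-subspace SVM ensemble but specifies no library or hyperparameters. One standard realization of the random subspace method is scikit-learn's `BaggingClassifier` with feature subsampling and no sample bootstrapping; the toy data, feature count, and settings below are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Toy stand-in for the 86 statistical features, two difficulty classes.
X = np.vstack([rng.normal(0, 1, (40, 86)), rng.normal(2, 1, (40, 86))])
y = np.array([0] * 40 + [1] * 40)

# Random subspace: bootstrap=False leaves the samples intact, while
# max_features=0.5 gives every SVM its own random half of the features.
clf = BaggingClassifier(
    SVC(kernel="rbf", probability=True),
    n_estimators=10,
    max_features=0.5,
    bootstrap=False,
    random_state=0,
)
clf.fit(X, y)
conf1 = clf.predict_proba(X)  # per-class scores: "confidence 1" of step S3
print(conf1.shape)
```

The ensemble's averaged class probabilities play the role of the first confidence score that is later fused with the BERT-side score.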
S4. For the articles preprocessed in step S1, take the output of the -1 (last) layer of a BERT model as the encoding, then apply average->max pooling to obtain the article's multi-dimensional encoding features, as shown in Figure 2.
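The patent does not spell out the exact order of the "average->max pooling", so the sketch below assumes one plausible reading: a per-sentence average over token vectors, then an element-wise max over the sentence vectors. It also takes the last-layer hidden states as given rather than running BERT, to stay self-contained.

```python
import numpy as np

def article_encoding(sentence_hidden_states):
    """Pool last-layer BERT hidden states into one article vector.
    Each element is a (num_tokens, hidden_dim) array for one sentence
    (hidden_dim = 768 for BERT-base). Assumed reading of the patent's
    average->max pooling: mean over tokens, then max over sentences."""
    sentence_vecs = np.stack([h.mean(axis=0) for h in sentence_hidden_states])
    return sentence_vecs.max(axis=0)

# Tiny 2-dimensional example so the arithmetic is checkable by hand;
# real hidden states would be 768-dimensional.
enc = article_encoding([
    np.array([[1.0, 3.0], [3.0, 1.0]]),  # sentence 1 -> mean [2, 2]
    np.array([[0.0, 4.0], [0.0, 4.0]]),  # sentence 2 -> mean [0, 4]
])
print(enc)  # element-wise max of the two sentence means
```

With BERT-base hidden states the result is the 768-dimensional encoding vector the patent feeds to the second SVM, used without normalization.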
S5. Classify the encoding features with a random-subspace SVM to obtain confidence score 2.
S6. Fuse the two confidence scores as new features and classify them with a decision tree. Preferably, a weighted average of confidence score 1 and confidence score 2 is taken as the overall output for the article. The above preferred technical features may be combined with one another as long as they do not conflict.
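The fusion of step S6 can be sketched with scikit-learn by concatenating the two per-class confidence vectors and fitting a decision tree on them. The class count, simulated confidences, and their noise level are assumptions standing in for the real SVM outputs.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
n, k = 200, 12  # articles, difficulty levels (12 grades in the experiment)
labels = rng.randint(0, k, n)

def simulated_confidences(y):
    """Per-class confidence vectors peaked at the true label, standing in
    for the outputs of the two SVM stages."""
    c = rng.rand(n, k) * 0.5
    c[np.arange(n), y] += 1.0
    return c / c.sum(axis=1, keepdims=True)

conf1 = simulated_confidences(labels)  # from the statistical-feature SVM
conf2 = simulated_confidences(labels)  # from the BERT-encoding SVM
fused = np.hstack([conf1, conf2])      # the fused feature vector of S6

tree = DecisionTreeClassifier(random_state=0).fit(fused, labels)
print(fused.shape, tree.score(fused, labels))
```

The decision tree sees 2k confidence values per article; on these noiseless toy data an unpruned tree fits the training set perfectly, whereas real confidences would of course be evaluated on held-out data.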
The method is illustrated below with a detailed example. The present invention provides a method for assessing the difficulty of Chinese as a foreign language based on decision-tree classification over random-subspace feature selection with SVM and BERT models, comprising the following steps:
(1) Use crawler technology to scrape compositions from composition websites by grade (first grade of elementary school to third grade of high school), partition the dataset correctly by grade, prepend the grade information to each file name, and store the files in txt format.
(2) For each grade, select the single most representative composition and set it aside as the benchmark article, serving as the standard representative of that class.
(3) Use python to cut each article into sentence units, store them in a list structure, and remove punctuation marks.
(4) For the preprocessed articles, generate multiple features (for example 86) from the article's length, readability, and number of new words, including the total number of characters, total number of strokes, number of paragraphs, total number of sentences, and number of new words. The invention examines the difficulty of graded Chinese-as-a-foreign-language readers from three angles: first, the length of the reader, i.e. the number of Chinese characters it contains; second, its readability, i.e. its average sentence length and average number of sentences per hundred characters; and third, its new-word load, i.e. the number of new words appearing in the reader.
(5) Classify the articles, represented by all the above features, with a random-subspace SVM ensemble to obtain confidence score 1.
(6) For the preprocessed articles, take the output of the -1 (last) layer of a BERT model as the encoding, then apply average->max pooling to obtain the article's multi-dimensional encoding features, as shown in Figure 2. The BERT structure encodes each input sentence, so the label attention weighting mechanism and the word weights change accordingly; multiple cores make the boundaries of the label embedding finer and fit the data better.
(7) Classify the encoding features with a random-subspace SVM to obtain confidence score 2.
(8) Fuse the two confidence scores as new features and classify them with a decision tree. During training each article is cut into a combination of multiple sentences, so the sentence is the basic input unit; after every sentence of an article has been classified, a weighted average is taken as the overall output for the article.
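The sentence-to-article aggregation described above can be sketched in plain Python. The weighting scheme itself is not specified in the patent, so the uniform default and the hand-picked weights in the example are assumptions.

```python
def article_confidence(sentence_confs, weights=None):
    """Aggregate per-sentence confidence vectors into one article-level
    vector by weighted averaging. Uniform weights by default; the patent
    does not specify the weighting scheme."""
    if weights is None:
        weights = [1.0] * len(sentence_confs)
    total = float(sum(weights))
    k = len(sentence_confs[0])
    return [sum(w * c[j] for w, c in zip(weights, sentence_confs)) / total
            for j in range(k)]

# Two sentences, two difficulty classes, the first sentence weighted 3x.
combined = article_confidence([[0.8, 0.2], [0.6, 0.4]], weights=[3, 1])
print(combined)
```

The resulting vector is the article's overall difficulty distribution, from which the predicted level is the arg-max class.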
<Experiment Description and Results>
In this example a total of 51,356 compositions were crawled from 13 composition websites and classified by the 12 grades from elementary school through high school; 4,000 compositions of each class were then selected, for 48,000 compositions in all, stored in txt format. The data were split into training, test, and validation sets at a ratio of 7:2:1, the training set was used for training as described in the detailed implementation, and the accuracy on the validation set was monitored to choose the point at which to stop training.
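The 7:2:1 split with reshuffling between rounds can be sketched as follows; the use of Python's `random` module and the seed are assumptions, since the patent names only the ratio.

```python
import random

def split_dataset(items, seed=0):
    """Shuffle and split a dataset into train/test/validation sets at the
    7:2:1 ratio used in the experiment. Calling with a fresh seed each
    round redraws the three sets, as the evaluation protocol requires."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_test = int(n * 0.7), int(n * 0.2)
    return (items[:n_train],
            items[n_train:n_train + n_test],
            items[n_train + n_test:])

train, test, val = split_dataset(range(48000))
print(len(train), len(test), len(val))  # 33600 9600 4800
```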
Each time a model with a fixed core was trained, all samples were reshuffled and the training, test, and validation sets were redrawn before training and validating again. Ten such rounds were carried out in total, and the results in the table below are the averages of the ten runs.
The specific experimental results are shown in Table 1.
Table 1. Experimental results
In summary, for the text classification problem of assessing the difficulty of articles in Chinese as a foreign language, the present invention proposes a difficulty assessment and automatic classification method based on decision-tree classification over random-subspace feature selection with SVM and BERT models. It exploits the BERT model's strong text-feature extraction ability to obtain a semantically rich representation of each article, combined with the traditional statistical features of the article's characters and words, thereby making full use of the article's various features. The invention avoids the inefficiency and underfitting of traditional algorithms and makes the most reasonable use of all available information, so the enlarged basis for classification yields a marked improvement. The method achieves an accuracy of 85.6% on the difficulty assessment of Chinese as a foreign language.
Those skilled in the art will readily understand that the above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within its scope of protection.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911206414.9A CN111078874B (en) | 2019-11-29 | 2019-11-29 | Foreign Chinese difficulty assessment method based on decision tree classification of random subspace |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111078874A CN111078874A (en) | 2020-04-28 |
CN111078874B true CN111078874B (en) | 2023-04-07 |
Family
ID=70312204
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW200521895A (en) * | 2003-12-26 | 2005-07-01 | Inventec Besta Co Ltd | System and method to recognize the degree of mastering difficulty for a language text |
CN101814066A (en) * | 2009-02-23 | 2010-08-25 | 富士通株式会社 | Text reading difficulty judging device and method thereof |
CN103207854A (en) * | 2012-01-11 | 2013-07-17 | 宋曜廷 | Chinese text readability measuring system and method thereof |
CN105068993A (en) * | 2015-07-31 | 2015-11-18 | 成都思戴科科技有限公司 | Method for evaluating text difficulty |
CN105468713A (en) * | 2015-11-19 | 2016-04-06 | 西安交通大学 | Multi-model fused short text classification method |
CN107145514A (en) * | 2017-04-01 | 2017-09-08 | 华南理工大学 | Chinese sentence pattern sorting technique based on decision tree and SVM mixed models |
CN107506346A (en) * | 2017-07-10 | 2017-12-22 | 北京享阅教育科技有限公司 | A kind of Chinese reading grade of difficulty method and system based on machine learning |
CN107977362A (en) * | 2017-12-11 | 2018-05-01 | 中山大学 | A kind of method defined the level for Chinese text and calculate the scoring of Chinese text difficulty |
CN108984531A (en) * | 2018-07-23 | 2018-12-11 | 深圳市悦好教育科技有限公司 | Books reading difficulty method and system based on language teaching material |
CN109977408A (en) * | 2019-03-27 | 2019-07-05 | 西安电子科技大学 | The implementation method of English Reading classification and reading matter recommender system based on deep learning |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007249755A (en) * | 2006-03-17 | 2007-09-27 | Ibm Japan Ltd | System and method for evaluating difficulty of understanding document |
WO2019204086A1 (en) * | 2018-04-18 | 2019-10-24 | HelpShift, Inc. | System and methods for processing and interpreting text messages |
Non-Patent Citations (2)
Title |
---|
基于回归模型的对外汉语阅读材料的可读性自动评估研究 (Automatic readability assessment of Chinese-as-a-foreign-language reading materials based on a regression model); 曾致中; 《中国教育信息化》 (China Education Informatization); full text *
基于随机森林算法的对外汉语文本可读性评估 (Readability assessment of Chinese-as-a-foreign-language texts based on the random forest algorithm); 杨文媞; 《中国教育信息化》 (China Education Informatization); full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |