CN113934850B

CN113934850B - Chinese text readability evaluation method and system fusing text distribution law characteristics

Info

Publication number: CN113934850B
Application number: CN202111289536.6A
Authority: CN
Inventors: 赵慧周; 郭雯钰
Original assignee: BEIJING LANGUAGE AND CULTURE UNIVERSITY
Current assignee: BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority date: 2021-11-02
Filing date: 2021-11-02
Publication date: 2022-06-17
Anticipated expiration: 2041-11-02
Also published as: CN113934850A

Abstract

The invention discloses a Chinese text readability evaluation method and system fusing text distribution law characteristics, wherein the method comprises the following steps: determining a candidate set of text features, the candidate set of text features comprising: six types of characteristics of characters, words, sentences, articles, distribution laws and readability formulas; calculating characteristic values of the six types of characteristics of the text of the training chapters for fitting the readability formula parameters and training the machine learning model; performing readability formula design or machine learning model training based on the features in the text feature candidate set and the calculated feature values; and performing readability prediction on the text of any chapters by using a designed readability formula or a trained machine learning model. The invention combines the text distribution law characteristics with the characters, words, sentences, articles and readability formula characteristics, and performs readability formula design and machine learning model training after characteristic selection, so that the readability prediction accuracy of the text of the discourse is obviously improved.

Description

Chinese text readability evaluation method and system fusing text distribution law characteristics

Technical Field

The invention relates to the technical field of Chinese information processing, in particular to a method and a system for evaluating the readability of a Chinese text by fusing text distribution law characteristics.

Background

Reading is an important way to obtain information, and developing reading ability is an important part of language learning. Language learners should practice graded reading, that is, read text material whose difficulty matches their comprehension ability. This helps sustain interest in reading and builds reading ability while reading habits are formed.

Text readability evaluation quantitatively assesses the reading difficulty of text material. Li Shaoshan (A review of readability research [J]. Journal of PLA University of Foreign Languages, 2000) holds that readability, also called ease of reading or intelligibility, refers to the degree to which a text is easy to read and understand, and is an important attribute of a text. Research on text readability evaluation methods is significant for applications such as graded reading and textbook compilation.

The Chinese text readability prediction method includes a readability formula-based prediction method, a language model-based prediction method, a traditional machine learning model-based prediction method, a deep neural network-based prediction method and the like. Both readability formula methods and traditional machine learning models need to rely on text features.

A text readability formula is regarded as an objective way of predicting the readability level of a text. Wang Lei (A preliminary study of a text readability formula for elementary and intermediate Japanese and Korean students [D]. Master's thesis, Beijing Language and Culture University, 2005) holds that a readability formula integrates all quantifiable factors that affect reading difficulty (text factors in particular) into a formula for evaluating text difficulty. Hiebert et al. (Standards, assessments, and text difficulty. In A.E. Farstrup & S.J. Samuels (Eds.), What research has to say about reading instruction. Newark, DE: International Reading Association, 2002) showed that in the 20th century, teaching materials in the American educational system were evaluated and graded using readability formulas, and that teaching materials in all fields are still required to match the semantics and parameters of certain readability formulas.

Formally, a readability formula can be expressed as G = F(X), where G is the readability grade, X is the text feature vector, and F is the formula, generally a linear function. After choosing the feature vector, a researcher fits the formula to training texts to obtain the coefficients and constant term of the linear function, which yields the readability formula. Table 1 lists some Chinese readability formulas.
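Written out, the linear form described above looks as follows. The coefficient symbols β, the feature count n and the training-set size m are notation introduced here for illustration, and ordinary least squares is shown only as one common way to do the fitting; the source does not state which fitting criterion each cited formula used.

```latex
G = F(X) = \beta_0 + \sum_{i=1}^{n} \beta_i x_i, \qquad X = (x_1, \dots, x_n)

(\hat{\beta}_0, \dots, \hat{\beta}_n)
  = \arg\min_{\beta_0, \dots, \beta_n}
    \sum_{j=1}^{m} \Bigl( g_j - \beta_0 - \sum_{i=1}^{n} \beta_i x_{ji} \Bigr)^2
```

Here g_j is the labeled readability grade of the j-th training text and x_{ji} is the value of feature i for that text.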

TABLE 1 Chinese readability formula summary

In Table 1, GL is the readability score, wd is the number of words, sent is the number of sentences, ease is the number of familiar words (typically the number of lower-difficulty words according to a graded vocabulary list), stroke is the average number of strokes per character, hard is the proportion of difficult words (typically the proportion of higher-difficulty words according to a graded vocabulary list), syll/send is the average number of characters (syllables) per sentence, wd/send is the average number of words per sentence, syll/wd is the average number of characters (syllables) per word, x_2 and x_3 refer to glyph complexity, func is the number of function words, splitsense is the number of sentences, ease/wd is the proportion of familiar words (typically the proportion of lower-difficulty words according to a graded vocabulary list), and lengh is the largest wd in the corpus minus the smallest wd.

In prediction methods based on traditional machine learning, after the features are selected, the text feature values are fed into an existing machine learning model, which is trained to predict the readability level.

Wu Siyuan et al. (Construction and validation of a Chinese text readability feature system [J]. Chinese Teaching in the World, 2020, 34(1)) performed sentence-level five-class prediction on Chinese textbooks from primary school through grade 12; the support vector machine model achieved the highest accuracy, 0.638. Jiang et al. (GRAW+: A two-view graph propagation method with word coupling for readability assessment. Journal of the Association for Information Science and Technology 70.5 (2019): 433-447) used the GRAW+ model for sentence-level six-class prediction on primary school Chinese textbooks, with an accuracy of 0.54. Cheng Yong et al. (Automatic grading of text reading difficulty based on multiple features [J]. Data Analysis and Knowledge Discovery, 2019, No. 07) performed chapter-level binary classification of primary school and senior high school Chinese textbooks using multi-feature fusion; the logistic regression model achieved the highest accuracy, 0.88. Sun Gang (Research on Chinese text readability prediction based on linear regression [D]. Master's thesis, Nanjing University, 2015) performed chapter-level six-class classification of primary school Chinese textbooks using a log-linear regression model, with an accuracy of 0.46. In studies that predict readability on primary school Chinese textbook data sets, sentence-level prediction is more accurate than chapter-level prediction, and for chapter-level six-class prediction the best prior result is that of Sun Gang, with an accuracy of 0.46.

Existing readability formula methods and traditional machine learning methods that evaluate and predict the readability of Chinese texts from text features mainly have the following problems:

(1) In terms of text features, although many features are used, the text features used for Chinese text readability evaluation all belong to the categories of characters, words, sentences and chapters. The text features used in existing Chinese text readability evaluation techniques, listed in Table 2, can be summarized from the following works: Wang Lei (A preliminary study of a text readability formula for elementary and intermediate Japanese and Korean students [D]. Master's thesis, Beijing Language and Culture University, 2005), Guo Wanghao (Research on a readability formula for texts of Chinese as a foreign language [D]. Master's thesis, Shanghai Jiao Tong University, 2009), Yang Jinyu (Research on determining the language difficulty of advanced Chinese intensive reading textbooks [D]. Master's thesis, Peking University, 2008), Chen Alin (A neural network model for quantitative calculation of Chinese reading difficulty and comparison of results [J]. Chongqing university journal (Natural Science Edition), 2000), Jing Xiyu (A study on the readability of Chinese textbooks: estimating the suitable reading grade [J]. Educational Research Information, 1995, 3(3): 113-127), [...] (2006), Yang (A readability formula for Chinese language [D]. The University of Wisconsin, 1970), and Wu Siyuan et al. (Construction and validation of a Chinese text readability feature system [J]. Chinese Teaching in the World, 2020, 34(1)).

TABLE 2 characteristics for text readability evaluation

It is held here that, beyond the above character, word, sentence and chapter features, more categories of features can be mined. For chapter-level readability evaluation in particular, readability is affected by the topic, structure and long-distance contextual relations of a chapter, and quantitative features for these aspects can be borrowed from the field of quantitative linguistics. One of the key technical points of the invention is to use text distribution law features, obtained by further computation on the distribution functions of character-related and word-related measures of the text, for the readability evaluation of chapter texts.

(2) In the application of Chinese text readability evaluation, fine-grained chapter-level readability evaluation methods for native-language learners are lacking.

In the research on readability formulas (see Table 1), the formulas of Guo, Tao, Wang and Zhouyuan were all proposed on the basis of teaching or reading materials for Chinese as a foreign language. The Jing Xiyu formula is a readability formula for traditional Chinese characters for grades 1-12 in Taiwan. Only the Sun Hanyin formula is a readability formula for simplified-character texts aimed at native-language learners.

For traditional machine learning methods, among studies on simplified Chinese teaching materials, only about half target chapter-level readability evaluation, and only the work of Sun Gang takes six-class classification of primary school chapter texts as its target. The main reason is that traditional machine learning needs a certain amount of graded text as training corpus. Primary and secondary school Chinese textbooks are the most suitable graded texts, but such corpora are small. Taking primary school Chinese textbooks as an example, 12 editions of grade 1-6 textbooks contain fewer than 3000 modern-text chapters, so with six classes there are on average fewer than 500 chapters per class. Most studies have therefore adopted one of two approaches:

The first is to evaluate sentence readability, because there are far more sentences than chapters. It is held here that chapter readability and sentence readability are different evaluation problems, and that chapters are the objects to be evaluated in Chinese graded reading and textbook compilation.

The second is to reduce the number of classes, for example by dividing the 12 grades of primary and secondary school into 5 classes, or by classifying primary school and senior high school texts into two classes. It is held here that text readability evaluation is mainly applied to selecting and evaluating texts for the language learning stage, and matters most for texts of primary school grades 1-6. Moreover, the grading of texts for language learners should be as fine-grained as possible, so that it better satisfies the i+1 requirement of language learning input and can be applied in text recommendation systems for learners' autonomous reading.

In summary, the readability evaluation method and system provided by the invention aim to achieve better results than comparable work on the six-class classification of chapter texts for primary school grades 1-6.

Disclosure of Invention

To address the above problems, the invention provides a Chinese text readability evaluation method and system fusing text distribution law characteristics, suitable for classification and prediction of chapter-level texts at multiple readability levels. An example test on six-class classification of the texts of primary school grade 1-6 Chinese textbooks by grade shows that the accuracy of readability evaluation can be improved.

To solve the above technical problem, an embodiment of the present invention provides the following solutions:

In one aspect, a Chinese text readability evaluation method fusing text distribution law characteristics is provided, comprising the following steps:

S1, determining a text feature candidate set, wherein the text feature candidate set comprises six types of features: characters, words, sentences, chapters, distribution laws and readability formulas;

S2, calculating the feature values of the six types of features for the training chapter texts used for fitting the readability formula parameters and training the machine learning model;

S3, performing readability formula design or machine learning model training based on the features in the text feature candidate set and the calculated feature values;

S4, performing readability prediction on any chapter text using the designed readability formula or the trained machine learning model.

Preferably, the training text in step S2 is a text labeled with a plurality of readability classification levels, and the arbitrary text in step S4 is an arbitrary text to be classified and predicted according to the readability classification levels of the training text in step S2.

Preferably, the readability formula design specifically comprises the following steps:

performing Pearson correlation analysis on the features of a first preset group in the text feature candidate set, and selecting the features whose Pearson correlation coefficient is below a preset value α as the features for establishing a multiple linear regression model;

and carrying out regression analysis on the screened features to obtain a regression model with the highest goodness of fit with the readability grade.
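The screening step above can be sketched as follows. This is a minimal illustration under a stated assumption: the threshold α is applied to the absolute pairwise correlation between features, and features are kept greedily in their given order. The source does not spell out which pairs of variables the coefficient is computed on or how ties are resolved, and the function names `pearson` and `screen_features` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def screen_features(columns, alpha=0.7):
    """Keep a feature only if |r| with every already-kept feature is below alpha.

    `columns` maps a feature name to its list of values over the training texts.
    Assumption: alpha bounds the correlation between features, not with the level.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < alpha for k in kept):
            kept.append(name)
    return kept
```

The kept features would then be passed to a multiple linear regression routine of the reader's choice.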

Preferably, when α is 0.7 and the training texts in step S2 are texts from multiple editions of primary school grade 1-6 Chinese textbooks labeled with six readability classification levels, the readability formula obtained by fitting is expressed as:

Y = 15.739 + 0.025*avesen_char + 0.04*difficult_char + 51.588*difficult_word + 6.380*Gini + 0.253*strokefre + 1.437*lgcharfre - 1.914*charwordpro - 1.013*TC + 6.121*substanpro - 2.914*adjpro + 4.38*funcpro + 2.5*unlistwordpro + 4.236*wordlenfre + 0.688*Lambda + 0.644*avelgwordfre;

where avesen_char is the average sentence length, difficult_char is the Chinese character difficulty, difficult_word is the vocabulary difficulty, Gini is the Gini coefficient, strokefre is the frequency-weighted stroke count, lgcharfre is the average logarithmic character frequency, charwordpro is the character-to-word ratio, TC is the topic concentration, substanpro is the content word ratio, adjpro is the adjective ratio, funcpro is the function word ratio, unlistwordpro is the unlisted (out-of-vocabulary) word ratio, wordlenfre is the frequency-weighted word length, Lambda is the Lambda value, and avelgwordfre is the average logarithmic word frequency.
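As an illustration, the fitted formula can be evaluated with a few lines of code. The coefficients are copied from the formula above (with the feature names spelled `substanpro` and `Lambda`); the function name `readability_index` and the dictionary input format are assumptions made for this sketch, and computing the 15 feature values themselves is outside its scope.

```python
INTERCEPT = 15.739

# Coefficients of the fitted readability formula, keyed by feature name.
COEFFICIENTS = {
    "avesen_char": 0.025,
    "difficult_char": 0.04,
    "difficult_word": 51.588,
    "Gini": 6.380,
    "strokefre": 0.253,
    "lgcharfre": 1.437,
    "charwordpro": -1.914,
    "TC": -1.013,
    "substanpro": 6.121,
    "adjpro": -2.914,
    "funcpro": 4.38,
    "unlistwordpro": 2.5,
    "wordlenfre": 4.236,
    "Lambda": 0.688,
    "avelgwordfre": 0.644,
}


def readability_index(features):
    """Return Y for one text, given a dict with all 15 feature values."""
    missing = sorted(set(COEFFICIENTS) - set(features))
    if missing:
        raise KeyError(f"missing feature values: {missing}")
    return INTERCEPT + sum(
        coef * features[name] for name, coef in COEFFICIENTS.items()
    )
```

Raising on a missing feature, rather than defaulting it to zero, avoids silently producing an index from an incomplete feature vector.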

Preferably, the machine learning model training specifically comprises the steps of:

performing Pearson correlation analysis on the features of a second preset group in the text feature candidate set, and screening out features which are relatively high in association with classification levels and relatively low in association with each other;

and training machine learning models with the screened features as input features, and selecting the best-performing machine learning model.

In another aspect, a Chinese text readability evaluation system fusing text distribution law characteristics is provided, including:
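One way to read this selection step is as a relevance filter followed by a redundancy filter, sketched below. The two thresholds `min_relevance` and `max_redundancy`, their default values, and the greedy order are hypothetical choices made for illustration; the source only says that the selected features correlate relatively strongly with the classification level and relatively weakly with each other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def select_features(columns, levels, min_relevance=0.3, max_redundancy=0.7):
    """Pick features relevant to `levels` and weakly correlated with each other.

    `columns` maps a feature name to its values over the training texts;
    `levels` holds the labeled readability level of each text.
    """
    relevance = {n: abs(pearson(v, levels)) for n, v in columns.items()}
    # Most relevant features are considered first.
    ranked = sorted(
        (n for n in columns if relevance[n] >= min_relevance),
        key=lambda n: relevance[n],
        reverse=True,
    )
    selected = []
    for name in ranked:
        if all(
            abs(pearson(columns[name], columns[s])) < max_redundancy
            for s in selected
        ):
            selected.append(name)
    return selected
```

The selected columns would then be used as inputs when training and comparing candidate classifiers.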

a text feature candidate set determining module, configured to determine a text feature candidate set, where the text feature candidate set includes: six types of characteristics of characters, words, sentences, articles, distribution laws and readability formulas;

the characteristic value calculation module is used for calculating the characteristic values of the six types of characteristics (characters, words, sentences, chapters, distribution laws and readability formulas) of the training chapter texts used for fitting the readability formula parameters and training the machine learning model;

the design and training module is used for carrying out readability formula design or machine learning model training based on the features in the text feature candidate set and the calculated feature values;

and the prediction module is used for performing readability prediction on any discourse text by utilizing a designed readability formula or a trained machine learning model.

Preferably, the training text is a text labeled with a plurality of readability classification levels, and the arbitrary text is an arbitrary text to be classified and predicted according to the readability classification levels of the training text.

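Preferably, the design and training module is specifically configured to:

perform Pearson correlation analysis on the features of a first preset group in the text feature candidate set, and select the features whose Pearson correlation coefficient is below a preset value α as the features for establishing a multiple linear regression model;

perform regression analysis on the selected features to obtain the regression model with the highest goodness of fit to the readability level.

Preferably, when α is 0.7 and the training texts are texts from multiple editions of primary school grade 1-6 Chinese textbooks labeled with six readability classification levels, the readability formula obtained by fitting is expressed as:

Y = 15.739 + 0.025*avesen_char + 0.04*difficult_char + 51.588*difficult_word + 6.380*Gini + 0.253*strokefre + 1.437*lgcharfre - 1.914*charwordpro - 1.013*TC + 6.121*substanpro - 2.914*adjpro + 4.38*funcpro + 2.5*unlistwordpro + 4.236*wordlenfre + 0.688*Lambda + 0.644*avelgwordfre.

Preferably, the design and training module is further specifically configured to:

perform Pearson correlation analysis on the features of a second preset group in the text feature candidate set, and screen out features that are relatively strongly associated with the classification levels and relatively weakly associated with each other;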

and training the machine learning model by taking the screened features as input features, and selecting the optimal machine learning model.

The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:

in the embodiment of the invention, the text distribution law characteristics are combined with the characters, words, sentences and readability formula characteristics, readability formula design and machine learning model training are carried out after characteristic selection, and readability prediction accuracy of text of chapters is obviously improved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a flowchart of a method for evaluating readability of a Chinese text with fusion of text distribution law characteristics according to an embodiment of the present invention;

fig. 2 is a schematic structural diagram of a chinese text readability evaluation system fusing text distribution law characteristics according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The embodiment of the invention firstly provides a Chinese text readability evaluation method fusing text distribution law characteristics, and as shown in fig. 1, the method comprises the following steps:

s1, determining a text feature candidate set, wherein the text feature candidate set comprises: characters, words, sentences, articles, distribution rules and readability formulas.

As an embodiment of the present invention, a total of 93 features were identified and are listed in table 3.

TABLE 3 text feature candidate set

The calculation methods of the text features are described in detail in the following references: Wang Lei (A preliminary study of a text readability formula for elementary and intermediate Japanese and Korean students [D]. Master's thesis, Beijing Language and Culture University, 2005), Guo Wanghao (Research on a readability formula for texts of Chinese as a foreign language [D]. Master's thesis, Shanghai Jiao Tong University, 2009), Yang Jinyu (Research on determining the language difficulty of advanced Chinese intensive reading textbooks [D]. Master's thesis, Peking University, 2008), Chen Alin (A neural network model for quantitative calculation of Chinese reading difficulty and comparison of results [J]. Chongqing university journal (Natural Science Edition), 2000), Jing Xiyu (A study on the readability of Chinese textbooks: estimating the suitable reading grade [J]. Educational Research Information, 1995, 3(3): 113-127), [...] (2006), Yang (A readability formula for Chinese language [D]. The University of Wisconsin, 1970), Wu Siyuan et al. (Construction and validation of a Chinese text readability feature system [J]. Chinese Teaching in the World, 2020, 34(1)), Liu Haitao (Introduction to Quantitative Linguistics [M]. The Commercial Press, 2017, pages 134-138), the SMOG formula (SMOG grading: a new readability formula [J]. Journal of Reading, 1969, 12(8): 629-646), the Flesch-Kincaid formula (Derivation of new readability formulas, 1975), and Flesch (readability formula, 1979).

Adding distribution law features to the text feature candidate set is one of the key technical points of the invention. Distribution law features are text features obtained by further computation on the distribution functions of character, word and sentence measures of the text; for details see Liu Haitao (Introduction to Quantitative Linguistics [M]. The Commercial Press, 2017, pages 134-138).
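To give a flavor of a distribution law feature, the sketch below computes a Gini coefficient over the word frequency distribution of one text. It uses the standard textbook Gini formula on sorted frequency counts, which is an assumption: the exact definitions used for the Gini, Lambda and TC features follow the cited reference and may differ. The input is assumed to be an already word-segmented token list, and the function name `gini_coefficient` is hypothetical.

```python
from collections import Counter


def gini_coefficient(tokens):
    """Gini coefficient of the word frequency distribution of a token list.

    0 means every distinct word occurs equally often; values closer to 1
    mean a few words account for most occurrences.
    """
    counts = sorted(Counter(tokens).values())  # ascending frequencies
    n = len(counts)
    if n == 0:
        raise ValueError("tokens must not be empty")
    total = sum(counts)
    # Standard closed form on ascending values x_1..x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * count for rank, count in enumerate(counts, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

For example, `gini_coefficient(["a", "a", "a", "b"])` evaluates to 0.25, while a text in which every word occurs once gives 0.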

And S2, calculating characteristic values of the six types of characteristics of the text of the training chapters for fitting the readability formula parameters and training the machine learning model.

The training chapter texts are texts labeled with a plurality of readability classification levels. The example data used in the invention comprise all modern-text chapters from 12 editions of primary school grade 1-6 Chinese textbooks; the details of the corpus are shown in Table 4. The feature values of the 93 features listed in Table 3 were calculated for each chapter text.

TABLE 4 corpus Scale statistics

Readability level	Grade	Number of texts	Total number of words	Total number of sentences
1	Grade 1	340	59,957	3,990
2	Grade 2	523	173,758	10,299
3	Grade 3	516	286,135	14,926
4	Grade 4	450	314,060	15,007
5	Grade 5	468	431,557	19,908
6	Grade 6	401	430,626	19,327
Total	-	2,698	1,696,093	83,457

And S3, performing readability formula design or machine learning model training based on the features in the text feature candidate set and the calculated feature values.

The readability formula design specifically includes the following steps:

first, performing Pearson correlation analysis on the features of a first preset group in the text feature candidate set, and selecting the features whose Pearson correlation coefficient is below a preset value α as the features for establishing a multiple linear regression model;

second, performing regression analysis on the selected features to obtain the regression model with the highest goodness of fit to the readability level.

Specifically, Pearson correlation analysis is performed on features No. 1 to 80, feature screening is performed with the upper limit α of the correlation coefficient set to 0.7, and the readability formula obtained by fitting is expressed as:

Y = 15.739 + 0.025*avesen_char + 0.04*difficult_char + 51.588*difficult_word + 6.380*Gini + 0.253*strokefre + 1.437*lgcharfre - 1.914*charwordpro - 1.013*TC + 6.121*substanpro - 2.914*adjpro + 4.38*funcpro + 2.5*unlistwordpro + 4.236*wordlenfre + 0.688*Lambda + 0.644*avelgwordfre.

The machine learning model training specifically comprises the following steps:

Specifically, Pearson correlation analysis on features No. 1 to 93 yields 22 features (listed in Table 5) that are relatively strongly associated with the classification level and relatively weakly associated with each other.

TABLE 5 The 22 screened features

S4, performing readability prediction on any chapter text using the designed readability formula or the trained machine learning model. Here, any chapter text means any chapter text to be classified and predicted according to the readability classification levels of the training chapter texts.

The effect of the technical scheme of the invention was tested on primary school Chinese textbook texts.

With the readability formula method, 15 features were selected after screening the text feature candidate set. The readability index was calculated with the fitted formula, and the results were mapped to 6 readability levels for six-class chapter prediction. The resulting accuracy is 0.46, better than the best-performing existing readability formula, the Guo-Tanzi formula (accuracy 0.36).

With the machine learning method, the 22 features with the best prediction effect were obtained through feature calculation and selection. The logistic regression model taking these features as input performs best, with an accuracy of 0.52, which is better than the existing work of Sun Gang (accuracy 0.46).
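The evaluation described above, mapping a readability index to one of six levels and scoring predictions by accuracy, can be sketched as follows. The source does not say how the index is mapped to levels, so the cut-point scheme and the names `index_to_level`, `accuracy` and `cut_points` are assumptions for illustration; any cut-point values would have to be chosen on the training data.

```python
from bisect import bisect_right


def index_to_level(index, cut_points):
    """Map a readability index to a 1-based level.

    `cut_points` is an ascending list of k-1 boundaries for k levels
    (hypothetical values); an index equal to a boundary goes to the
    higher level.
    """
    return bisect_right(cut_points, index) + 1


def accuracy(predicted, actual):
    """Fraction of texts whose predicted level equals the labeled level."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

With five cut points the mapping produces levels 1 to 6, matching the six-class setting of the test; the same `accuracy` function applies to the levels predicted by a trained classifier.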

Correspondingly, an embodiment of the present invention further provides a system for evaluating readability of a chinese text by fusing text distribution law characteristics, as shown in fig. 2, the system includes:

the characteristic value calculation module is used for calculating the characteristic values of the six types of characteristics of the text of the training chapters for fitting the readability formula parameters and training the machine learning model;

Further, the training text is a text labeled with a plurality of readability classification levels, and the arbitrary text is an arbitrary text to be classified and predicted according to the readability classification levels of the training text.

Further, the design and training module is specifically configured to:

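Further, when α is 0.7 and the training texts are texts from multiple editions of primary school grade 1-6 Chinese textbooks labeled with six readability classification levels, the readability formula obtained by fitting is expressed as:

Y = 15.739 + 0.025*avesen_char + 0.04*difficult_char + 51.588*difficult_word + 6.380*Gini + 0.253*strokefre + 1.437*lgcharfre - 1.914*charwordpro - 1.013*TC + 6.121*substanpro - 2.914*adjpro + 4.38*funcpro + 2.5*unlistwordpro + 4.236*wordlenfre + 0.688*Lambda + 0.644*avelgwordfre.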

Further, the design and training module is further specifically configured to:

performing Pearson correlation analysis on the features of a second preset group in the text feature candidate set, and screening out features which are relatively large in association with classification levels and relatively small in association with each other;

When the training text is a text marked by six readability classification levels of primary school grade 1-6 Chinese textbooks comprising a plurality of versions, the 22 screened features are listed in table 5, and the logistic regression model has the best effect.

The system of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A Chinese text readability evaluation method fusing text distribution law characteristics is characterized by comprising the following steps:

s4, performing readability prediction on any discourse text by using a designed readability formula or a trained machine learning model;

the readability formula design specifically includes the steps of:

performing regression analysis on the screened features to obtain a regression model with the highest goodness of fit with readability grade;

when α is 0.7 and the training text in step S2 is a text labeled with six readability classification levels of the primary school grade 1-6 language text including multiple versions, the readability formula obtained by the fitting is expressed as:

Y = 15.739 + 0.025*avesen_char + 0.04*difficult_char + 51.588*difficult_word + 6.380*Gini + 0.253*strokefre + 1.437*lgcharfre - 1.914*charwordpro - 1.013*TC + 6.121*substanpro - 2.914*adjpro + 4.38*funcpro + 2.5*unlistwordpro + 4.236*wordlenfre + 0.688*Lambda + 0.644*avelgwordfre;

2. The method for evaluating readability of chinese text according to claim 1, wherein the training text at step S2 is a text labeled with a plurality of readability classification levels, and the arbitrary text at step S4 is an arbitrary text to be classified and predicted according to the readability classification levels of the training text at step S2.

3. The method for evaluating the readability of chinese text according to claim 1, wherein said training of machine learning model specifically comprises the steps of:

4. A Chinese text readability evaluation system fusing text distribution law characteristics is characterized by comprising the following steps:

the prediction module is used for predicting the readability of any discourse text by utilizing a designed readability formula or a trained machine learning model;

the design and training module is specifically configured to:

performing Pearson correlation analysis on the features of a first preset group in the text feature candidate set, screening out the features of which the Pearson correlation coefficient is below a preset value alpha, and using the features as the features for establishing a multiple linear regression model;

when α is 0.7 and the training texts are texts from multiple editions of primary school grade 1-6 Chinese textbooks labeled with six readability classification levels, the readability formula obtained by fitting is expressed as:

Y = 15.739 + 0.025*avesen_char + 0.04*difficult_char + 51.588*difficult_word + 6.380*Gini + 0.253*strokefre + 1.437*lgcharfre - 1.914*charwordpro - 1.013*TC + 6.121*substanpro - 2.914*adjpro + 4.38*funcpro + 2.5*unlistwordpro + 4.236*wordlenfre + 0.688*Lambda + 0.644*avelgwordfre;

where avesen_char is the average sentence length, difficult_char is the Chinese character difficulty, difficult_word is the vocabulary difficulty, Gini is the Gini coefficient, strokefre is the frequency-weighted stroke count, lgcharfre is the average logarithmic character frequency, charwordpro is the character-to-word ratio, TC is the topic concentration, substanpro is the content word ratio, adjpro is the adjective ratio, funcpro is the function word ratio, unlistwordpro is the unlisted (out-of-vocabulary) word ratio, wordlenfre is the frequency-weighted word length, Lambda is the Lambda value, and avelgwordfre is the average logarithmic word frequency.

5. The system for evaluating readability of chinese text according to claim 4, wherein the training text is a text labeled with a plurality of readability classification levels, and the arbitrary text is an arbitrary text to be classified and predicted according to the readability classification levels of the training text.

6. The system for assessing the readability of chinese text according to claim 4, wherein said design and training module is further configured to: