CN113934850B - Chinese text readability evaluation method and system fusing text distribution law characteristics - Google Patents

Chinese text readability evaluation method and system fusing text distribution law characteristics

Info

Publication number
CN113934850B
Authority
CN
China
Prior art keywords
text
readability
features
training
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111289536.6A
Other languages
Chinese (zh)
Other versions
CN113934850A (en
Inventor
赵慧周
郭雯钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Original Assignee
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING LANGUAGE AND CULTURE UNIVERSITY filed Critical BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority to CN202111289536.6A priority Critical patent/CN113934850B/en
Publication of CN113934850A publication Critical patent/CN113934850A/en
Application granted granted Critical
Publication of CN113934850B publication Critical patent/CN113934850B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Chinese text readability evaluation method and system fusing text distribution law characteristics. The method comprises the following steps: determining a text feature candidate set covering six types of features (character, word, sentence, chapter, distribution law and readability formula features); calculating the values of these six types of features for the training chapter texts used to fit readability formula parameters and train machine learning models; performing readability formula design or machine learning model training based on the features in the text feature candidate set and the calculated feature values; and performing readability prediction on any chapter text by using the designed readability formula or the trained machine learning model. The invention combines text distribution law features with character, word, sentence, chapter and readability formula features, and performs readability formula design and machine learning model training after feature selection, so that the accuracy of readability prediction for chapter texts is significantly improved.

Description

Chinese text readability evaluation method and system fusing text distribution law characteristics
Technical Field
The invention relates to the technical field of Chinese information processing, in particular to a method and a system for evaluating the readability of a Chinese text by fusing text distribution law characteristics.
Background
Reading is an important way to obtain information, and training reading ability is an important aspect of language learning. Language learners should practice graded reading, that is, reading text material whose difficulty matches their comprehension ability. This helps maintain reading interest and develop reading ability while forming reading habits.
Text readability evaluation quantitatively assesses the reading difficulty of text material. Li Shaoshan (Review of readability research [J]. Journal of the PLA Foreign Languages Institute, 2000) holds that readability, also called ease of reading or intelligibility, refers to the degree to which a text is easy to read and understand, and is an important attribute of a text. Research on text readability evaluation methods is significant for applications such as graded reading and textbook compilation.
The Chinese text readability prediction method includes a readability formula-based prediction method, a language model-based prediction method, a traditional machine learning model-based prediction method, a deep neural network-based prediction method and the like. Both readability formula methods and traditional machine learning models need to rely on text features.
A text readability formula is regarded as a way of predicting the readability level of a text that allows objective evaluation. Wang Lei (A preliminary study of a text readability formula for elementary- and intermediate-level Japanese and Korean students [D]. Master's thesis, Beijing Language and Culture University, 2005) holds that a readability formula evaluates text difficulty by integrating all quantifiable factors (particularly text factors) that affect reading difficulty. Hiebert et al. (Standards, assessments, and text difficulty. In A. E. Farstrup & S. J. Samuels (Eds.), What research has to say about reading instruction. Newark, DE: International Reading Association, 2002) showed that teaching materials in the American education system were evaluated and graded with readability formulas throughout the 20th century, and that teaching materials in all fields are still required to match the semantics and parameters of certain readability formulas.
Formally, the readability formula prediction method can be expressed as G = F(X), where G is the readability level, X is the text feature vector, and F is the formula, generally a linear function. After determining the feature vector, a researcher fits the formula to training texts to obtain the constants of the linear function, thereby producing the readability formula. Table 1 lists some Chinese readability formulas.
TABLE 1 Chinese readability formula summary
Figure BDA0003334139780000021
In Table 1, GL is the readability score, wd is the number of words, sent is the number of sentences, ease is the number of familiar words (typically the number of lower-difficulty words according to a graded vocabulary list), stroke is the average number of strokes per character, hard is the proportion of difficult words (typically the proportion of higher-difficulty words according to a graded vocabulary list), syll/sent is the average number of characters per sentence, wd/sent is the average number of words per sentence, syll/wd is the average number of characters per word, x_2 and x_3 refer to glyph complexity, func is the number of function words, splitsense is the number of sentences, ease/wd is the proportion of familiar words (typically the proportion of lower-difficulty words according to a graded vocabulary list), and lengh is the largest word count minus the smallest word count (wd) in the corpus.
In prediction methods based on traditional machine learning, after the features are selected, the text feature values are input into an existing machine learning model, which learns from them to predict the readability level.
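As a notational sketch of the linear form G = F(X) described above: the symbols b_0, b_i, n and m, and the least-squares fitting criterion, are assumptions introduced here for illustration. The source states only that F is generally a linear function whose constants are obtained by fitting to training texts.

```latex
\begin{align}
G &= F(X) = b_0 + \sum_{i=1}^{n} b_i x_i, \qquad X = (x_1, \dots, x_n) \\
(\hat{b}_0, \dots, \hat{b}_n) &= \arg\min_{b_0, \dots, b_n}
  \sum_{j=1}^{m} \Bigl( g_j - b_0 - \sum_{i=1}^{n} b_i x_{ji} \Bigr)^{2}
\end{align}
```

Here x_{ji} would be the value of feature i for training text j and g_j its labeled readability level, with m training texts in total.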
Wu Siyuan et al. (Construction and validity verification of a Chinese text readability feature system [J]. Chinese Teaching in the World, 2020, 34(1)) performed sentence-level five-class prediction on Chinese textbooks from primary school through grade 12; the support vector machine model achieved the highest accuracy, 0.638. Jiang et al. (GRAW+: A two-view graph propagation method with word coupling for readability assessment. Journal of the Association for Information Science and Technology 70.5 (2019): 433-447) used the GRAW+ model for sentence-level six-class prediction on primary school Chinese textbooks, with an accuracy of 0.54. Another study (Automatic grading of text reading difficulty based on multiple features [J]. Data Analysis and Knowledge Discovery, 2019(07)) performed chapter-level binary classification of primary school and senior high school Chinese textbooks with a multi-feature fusion method; the logistic regression model achieved the highest accuracy, 0.88. Sun Gang (Research on Chinese text readability prediction methods based on linear regression [D]. Master's thesis, Nanjing University, 2015) performed chapter-level six-class classification of primary school Chinese textbooks with a log-linear regression model, with an accuracy of 0.46. In readability prediction studies on primary school Chinese textbook data sets, sentence-level prediction is more accurate than chapter-level prediction, and for chapter-level six-class prediction the best prior result is that of Sun Gang, with an accuracy of 0.46.
The existing readability formula method and the traditional machine learning method for performing readability evaluation and prediction on Chinese texts based on text characteristics mainly have the following problems:
(1) Although many text features have been used, the features applied to Chinese text readability evaluation all belong to four categories: character, word, sentence and paragraph features. The text features used in existing Chinese text readability evaluation techniques, listed in Table 2, can be obtained by summarizing the work of Wang Lei (A preliminary study of a text readability formula for elementary- and intermediate-level Japanese and Korean students [D]. Master's thesis, Beijing Language and Culture University, 2005), Guo Wanghao (Research on a readability formula for texts of Chinese as a foreign language [D]. Master's thesis, Shanghai Jiao Tong University, 2009), Yang Jinyu (Research on determining the language difficulty of advanced Chinese intensive reading textbooks [D]. Master's thesis, Peking University, 2008), Chen Alin (A neural network quantitative model of Chinese reading difficulty and a comparison of results [J]. Journal of Chongqing Normal College (Natural Science Edition), 2000), Jing Xiyu (Readability research on Chinese textbooks: estimating the suitable reading age [J]. Educational Research Information, 1995, 3(3): 113-127), Yang (A readability formula for Chinese language [D]. The University of Wisconsin, 1970) and Wu Siyuan (Construction and validity verification of a Chinese text readability feature system [J]. Chinese Teaching in the World, 2020, 34(1)), among others.
TABLE 2 characteristics for text readability evaluation
Figure BDA0003334139780000041
The inventors consider that, beyond the character, word, sentence and paragraph categories above, further categories of features can be mined. For chapter readability evaluation in particular, readability is affected by a chapter's theme, structure and long-distance contextual relevance, and measures of these aspects can be borrowed from the field of quantitative linguistics. One of the key technical points of the invention is to use, for the readability evaluation of chapter texts, text distribution law features obtained by further computation on the distribution functions of character- and word-related measures of the text.
(2) In the application of Chinese text readability evaluation, fine-grained chapter readability evaluation methods for native-language learners are lacking.
Among the readability formula studies (see Table 1), the formulas of Guo, Tao, Wang and Zhouyuan were all proposed on the basis of teaching or reading materials for Chinese as a foreign language. The Jing Xiyu formula is a readability formula for traditional Chinese texts of grades 1-12 in Taiwan. Only the Sun Hanyin formula is a readability formula suited to simplified Chinese texts for native-language learners.
Among traditional machine learning studies on simplified Chinese teaching materials, only half target chapter readability evaluation, and only the work of Sun Gang takes six-class classification of primary school chapters as its goal. The main reason is that traditional machine learning needs a sufficient amount of graded text as training corpus. Primary and secondary school Chinese textbooks are the most suitable graded texts, but such corpora are small. Taking primary school as an example, 12 editions of grade 1-6 Chinese textbooks contain fewer than 3000 modern-text chapters, so a six-class split leaves fewer than 500 chapters per class on average. Most studies have therefore adopted one of two approaches:
The first is to evaluate sentence readability, because sentences are far more numerous than chapters. The inventors consider chapter readability and sentence readability to be different evaluation problems, and chapters are the objects that need to be evaluated for Chinese graded reading and textbook compilation.
The second is to reduce the number of classes, for example by dividing the 12 grades of primary and secondary school into 5 classes, or by classifying primary school and senior high school texts into two classes. The inventors consider that text readability evaluation is mainly applied to text selection in the language learning stage, so the evaluation of texts for primary school grades 1-6 matters most. Moreover, the grading used by language learners should be as fine-grained as possible, so that it better meets the "i+1" requirement of language learning input and can be applied in text recommendation systems for learners' independent reading.
In summary, the readability evaluation method and system provided by the invention aim to outperform comparable work on the six-class classification of chapters for primary school grades 1-6.
Disclosure of Invention
In view of the above problems, the invention provides a Chinese text readability evaluation method and system fusing text distribution law characteristics, suitable for classifying chapter texts into multiple readability levels and predicting those levels. An example test on six-class classification of primary school grade 1-6 texts by grade shows that the accuracy of readability evaluation can be improved.
To solve the above technical problem, an embodiment of the present invention provides the following solutions:
In one aspect, a Chinese text readability evaluation method fusing text distribution law characteristics is provided, comprising the following steps:
S1, determining a text feature candidate set, wherein the text feature candidate set comprises six types of features: character, word, sentence, chapter, distribution law and readability formula features;
S2, calculating the values of the six types of features for the training chapter texts used to fit the readability formula parameters and train the machine learning model;
S3, performing readability formula design or machine learning model training based on the features in the text feature candidate set and the calculated feature values;
S4, performing readability prediction on any chapter text by using the designed readability formula or the trained machine learning model.
Preferably, the training text in step S2 is a text labeled with a plurality of readability classification levels, and the arbitrary text in step S4 is an arbitrary text to be classified and predicted according to the readability classification levels of the training text in step S2.
Preferably, the readability formula design specifically comprises the following steps:
performing Pearson correlation analysis on the features of a first preset group in the text feature candidate set, and screening out the features of which the Pearson correlation coefficient is below a preset value alpha as the features for establishing a multiple linear regression model;
and carrying out regression analysis on the screened features to obtain a regression model with the highest goodness of fit with the readability grade.
Preferably, when α is 0.7 and the training chapter texts in step S2 are multiple editions of primary school grade 1-6 Chinese textbook texts labeled with six readability classification levels, the readability formula obtained by fitting is expressed as:
Y = 15.739 + 0.025*avesen_char + 0.04*difficult_char + 51.588*difficult_word + 6.380*Gini + 0.253*strokefre + 1.437*lgcharfre - 1.914*charwordpro - 1.013*TC + 6.121*substanpro - 2.914*adjpro + 4.38*funcpro + 2.5*unlistwordpro + 4.236*wordlenfre + 0.688*Lambda + 0.644*avelgwordfre;
where avesen_char is the average sentence length, difficult_char is the Chinese character difficulty, difficult_word is the vocabulary difficulty, Gini is the Gini coefficient, strokefre is the frequency-weighted stroke count, lgcharfre is the average logarithmic character frequency, charwordpro is the character-word ratio, TC is the topic concentration, substanpro is the content word ratio, adjpro is the adjective ratio, funcpro is the function word ratio, unlistwordpro is the unlisted word ratio, wordlenfre is the frequency-weighted word length, Lambda is the Lambda value, and avelgwordfre is the average logarithmic word frequency.
Preferably, the machine learning model training specifically comprises the steps of:
performing Pearson correlation analysis on the features of a second preset group in the text feature candidate set, and screening out features which are relatively high in association with classification levels and relatively low in association with each other;
and training the machine learning model by taking the screened features as input features, and selecting the optimal machine learning model.
In another aspect, a Chinese text readability evaluation system fusing text distribution law characteristics is provided, comprising:
a text feature candidate set determining module, configured to determine a text feature candidate set, wherein the text feature candidate set comprises six types of features: character, word, sentence, chapter, distribution law and readability formula features;
a feature value calculation module, configured to calculate the values of the six types of features for the training chapter texts used to fit the readability formula parameters and train the machine learning model;
a design and training module, configured to perform readability formula design or machine learning model training based on the features in the text feature candidate set and the calculated feature values;
and a prediction module, configured to perform readability prediction on any chapter text by using the designed readability formula or the trained machine learning model.
Preferably, the training text is a text labeled with a plurality of readability classification levels, and the arbitrary text is an arbitrary text to be classified and predicted according to the readability classification levels of the training text.
Preferably, the design and training module is specifically configured to:
performing Pearson correlation analysis on the features of a first preset group in the text feature candidate set, and screening out the features of which the Pearson correlation coefficient is below a preset value alpha as the features for establishing a multiple linear regression model;
and carrying out regression analysis on the screened features to obtain a regression model with the highest goodness of fit with the readability grade.
Preferably, when α is 0.7 and the training chapter texts are multiple editions of primary school grade 1-6 Chinese textbook texts labeled with six readability classification levels, the readability formula obtained by fitting is expressed as:
Y = 15.739 + 0.025*avesen_char + 0.04*difficult_char + 51.588*difficult_word + 6.380*Gini + 0.253*strokefre + 1.437*lgcharfre - 1.914*charwordpro - 1.013*TC + 6.121*substanpro - 2.914*adjpro + 4.38*funcpro + 2.5*unlistwordpro + 4.236*wordlenfre + 0.688*Lambda + 0.644*avelgwordfre;
where avesen_char is the average sentence length, difficult_char is the Chinese character difficulty, difficult_word is the vocabulary difficulty, Gini is the Gini coefficient, strokefre is the frequency-weighted stroke count, lgcharfre is the average logarithmic character frequency, charwordpro is the character-word ratio, TC is the topic concentration, substanpro is the content word ratio, adjpro is the adjective ratio, funcpro is the function word ratio, unlistwordpro is the unlisted word ratio, wordlenfre is the frequency-weighted word length, Lambda is the Lambda value, and avelgwordfre is the average logarithmic word frequency.
Preferably, the design and training module is further specifically configured to:
performing Pearson correlation analysis on the features of a second preset group in the text feature candidate set, and screening out features which are relatively high in association with classification levels and relatively low in association with each other;
and training the machine learning model by taking the screened features as input features, and selecting the optimal machine learning model.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
In the embodiments of the invention, text distribution law features are combined with character, word, sentence, chapter and readability formula features, and readability formula design and machine learning model training are performed after feature selection, which significantly improves the accuracy of readability prediction for chapter texts.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for evaluating readability of a Chinese text with fusion of text distribution law characteristics according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a chinese text readability evaluation system fusing text distribution law characteristics according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The embodiment of the invention firstly provides a Chinese text readability evaluation method fusing text distribution law characteristics, and as shown in fig. 1, the method comprises the following steps:
S1, determining a text feature candidate set, wherein the text feature candidate set comprises six types of features: character, word, sentence, chapter, distribution law and readability formula features.
As an embodiment of the present invention, a total of 93 features were identified and are listed in table 3.
TABLE 3 text feature candidate set
Figure BDA0003334139780000091
Figure BDA0003334139780000101
The text feature calculation methods are described in detail in the following references: Wang Lei (A preliminary study of a text readability formula for elementary- and intermediate-level Japanese and Korean students [D]. Master's thesis, Beijing Language and Culture University, 2005), Guo Wanghao (Research on a readability formula for texts of Chinese as a foreign language [D]. Master's thesis, Shanghai Jiao Tong University, 2009), Yang Jinyu (Research on determining the language difficulty of advanced Chinese intensive reading textbooks [D]. Master's thesis, Peking University, 2008), Chen Alin (A neural network quantitative model of Chinese reading difficulty and a comparison of results [J]. Journal of Chongqing Normal College (Natural Science Edition), 2000), Jing Xiyu (Readability research on Chinese textbooks: estimating the suitable reading age [J]. Educational Research Information, 1995, 3(3): 113-127), Yang (A readability formula for Chinese language [D]. The University of Wisconsin, 1970), Wu Siyuan (Construction and validity verification of a Chinese text readability feature system [J]. Chinese Teaching in the World, 2020, 34(1)), Liu Haitao (Introduction to Quantitative Linguistics [M]. The Commercial Press, 2017, pp. 134-138), SMOG (SMOG grading: a new readability formula [J]. Journal of Reading, 1969, 12(8): 629-646), the Flesch-Kincaid formula (Derivation of new readability formulas [J]. Adult Basic Education, 1975, 49), and Flesch (A new readability yardstick [J]).
Adding distribution law features to the text feature candidate set is one of the key technical points of the invention. Distribution law features are text features obtained by further computation on the distribution functions of character, word and sentence measures of the text; see Liu Haitao (Introduction to Quantitative Linguistics [M]. The Commercial Press, 2017, pp. 134-138) for details.
And S2, calculating characteristic values of the six types of characteristics of the text of the training chapters for fitting the readability formula parameters and training the machine learning model.
The training chapter texts are texts labeled with one of several readability classification levels. The example data used with the invention comprise all modern-text chapters in 12 editions of primary school grade 1-6 Chinese textbooks; details of the corpus are shown in Table 4. The values of the 93 features listed in Table 3 were calculated for each chapter text.
TABLE 4 corpus Scale statistics
Readability level   Corresponding grade   Number of texts   Total number of words   Total number of sentences
1                   Grade 1               340               59,957                  3,990
2                   Grade 2               523               173,758                 10,299
3                   Grade 3               516               286,135                 14,926
4                   Grade 4               450               314,060                 15,007
5                   Grade 5               468               431,557                 19,908
6                   Grade 6               401               430,626                 19,327
Total               -                     2,698             1,696,093               83,457
And S3, performing readability formula design or machine learning model training based on the features in the text feature candidate set and the calculated feature values.
The readability formula design specifically includes the following steps:
first, performing Pearson correlation analysis on the features of a first preset group in the text feature candidate set, and screening out the features whose Pearson correlation coefficient is below a preset value α as the features for establishing a multiple linear regression model;
second, performing regression analysis on the screened features to obtain the regression model with the highest goodness of fit to the readability levels.
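The screening step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented procedure: it assumes the threshold α applies to pairwise correlations between features, and it resolves conflicts greedily in input order. The feature name `avesen_word` and all values are hypothetical toy data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / sqrt(vx * vy)

def screen_by_alpha(features, alpha=0.7):
    """Greedily keep features whose |r| with every kept feature is below alpha.

    `features` maps a feature name to its per-text values.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < alpha for k in kept):
            kept.append(name)
    return kept

# Hypothetical feature values for five texts (not data from the patent).
toy = {
    "avesen_char": [8.0, 10.0, 12.0, 15.0, 19.0],
    "avesen_word": [5.0, 6.1, 7.4, 9.0, 11.8],  # nearly collinear with avesen_char
    "adjpro": [0.05, 0.02, 0.06, 0.03, 0.05],
}
selected = screen_by_alpha(toy, alpha=0.7)
print(selected)
```

The retained features would then be passed to an ordinary multiple linear regression; which regression routine is used is not specified in the source.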
Specifically, Pearson correlation analysis is performed on features 1 to 80, feature screening is performed with the upper limit α of the correlation coefficient set to 0.7, and the readability formula obtained by fitting is expressed as:
Y = 15.739 + 0.025*avesen_char + 0.04*difficult_char + 51.588*difficult_word + 6.380*Gini + 0.253*strokefre + 1.437*lgcharfre - 1.914*charwordpro - 1.013*TC + 6.121*substanpro - 2.914*adjpro + 4.38*funcpro + 2.5*unlistwordpro + 4.236*wordlenfre + 0.688*Lambda + 0.644*avelgwordfre;
where avesen_char is the average sentence length, difficult_char is the Chinese character difficulty, difficult_word is the vocabulary difficulty, Gini is the Gini coefficient, strokefre is the frequency-weighted stroke count, lgcharfre is the average logarithmic character frequency, charwordpro is the character-word ratio, TC is the topic concentration, substanpro is the content word ratio, adjpro is the adjective ratio, funcpro is the function word ratio, unlistwordpro is the unlisted word ratio, wordlenfre is the frequency-weighted word length, Lambda is the Lambda value, and avelgwordfre is the average logarithmic word frequency.
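To make the fitted formula concrete, the following sketch evaluates it for one text. The intercept and coefficients are copied from the formula above; the function name `readability_score` and the input values are hypothetical, and how each feature value is computed from raw text is outside the scope of this sketch.

```python
# Intercept and coefficients as printed in the fitted formula.
INTERCEPT = 15.739
COEFFICIENTS = {
    "avesen_char": 0.025,
    "difficult_char": 0.04,
    "difficult_word": 51.588,
    "Gini": 6.380,
    "strokefre": 0.253,
    "lgcharfre": 1.437,
    "charwordpro": -1.914,
    "TC": -1.013,
    "substanpro": 6.121,
    "adjpro": -2.914,
    "funcpro": 4.38,
    "unlistwordpro": 2.5,
    "wordlenfre": 4.236,
    "Lambda": 0.688,
    "avelgwordfre": 0.644,
}

def readability_score(features):
    """Return Y for a dict holding a value for each of the 15 formula features."""
    missing = sorted(COEFFICIENTS.keys() - features.keys())
    if missing:
        raise KeyError(f"missing feature values: {missing}")
    return INTERCEPT + sum(c * features[name] for name, c in COEFFICIENTS.items())

# Hypothetical input: every feature zero except the average sentence length.
example = dict.fromkeys(COEFFICIENTS, 0.0)
example["avesen_char"] = 10.0
score = readability_score(example)
print(round(score, 3))
```

Keeping the coefficients in a dictionary keyed by feature name makes a missing or misspelled feature fail loudly instead of silently contributing zero.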
The machine learning model training specifically includes the following steps:
performing Pearson correlation analysis on a second preset group of features in the text feature candidate set, and selecting features that are strongly correlated with the classification levels and weakly correlated with each other;
training machine learning models with the selected features as input features, and selecting the best-performing machine learning model.
Specifically, for features 1 to 93, Pearson correlation analysis yields 22 features (listed in Table 5) that are strongly correlated with the classification levels and weakly correlated with each other.
Table 5. The 22 selected features
[Table 5 is provided as an image in the original publication.]
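The selection criterion described for the machine learning model (strong correlation with the classification levels, weak correlation among the features) can be sketched in Python as follows. This is one possible reading, not the procedure of the embodiment: the thresholds `min_relevance` and `max_redundancy`, the greedy ranking, the function names, and the toy values are all assumptions, and the classifier training that follows is not shown.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def select_features(columns, levels, min_relevance=0.3, max_redundancy=0.7):
    """Keep features relevant to the levels and not redundant with each other."""
    relevance = {n: abs(pearson(v, levels)) for n, v in columns.items()}
    # Consider the most level-correlated features first.
    ranked = sorted(
        (n for n in columns if relevance[n] >= min_relevance),
        key=lambda n: relevance[n],
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[k])) <= max_redundancy for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: six texts, one per readability level.
levels = [1, 2, 3, 4, 5, 6]
toy = {
    "avesen_char": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
    "wordlenfre": [1.0, 2.0, 3.0, 4.0, 5.0, 7.0],  # redundant with avesen_char
    "Gini": [2.0, 1.0, 3.0, 2.0, 4.0, 3.0],
    "TC": [1.0, -1.0, -1.0, 1.0, 1.0, -1.0],       # weakly related to the level
}
chosen = select_features(toy, levels)
print(chosen)
```

The columns named in `chosen` would then serve as the input features of the candidate classifiers.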
S4: Perform readability prediction on an arbitrary discourse text using the designed readability formula or the trained machine learning model. Here, an arbitrary discourse text is any discourse text to be classified according to the readability classification levels of the training discourse texts.
The effect of the technical solution of the invention was tested using primary school Chinese textbook texts.
In the readability formula method, 15 features were selected after screening the text feature candidate set, a readability index was calculated with the fitted formula, and the results were mapped to 6 readability levels for six-class discourse prediction. The resulting accuracy is 0.46, which is better than that of the best-performing existing readability formula, the Guo-Tanzi formula (accuracy 0.36).
In the machine learning method, the 22 features with the best predictive effect were obtained through feature calculation and selection. Among the models taking these features as input, the logistic regression model performed best, with an accuracy of 0.52, which is better than the existing work it was compared with (accuracy 0.46).
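For illustration, the accuracy figures above correspond to a computation of the following kind. The text does not state how the readability index is mapped to the six levels, so the rounding and clamping rule in `to_level` is purely an assumption, as are the function names and the toy values.

```python
from math import floor

def to_level(index, lowest=1, highest=6):
    """Map a readability index to a level by rounding half up and clamping.

    The mapping rule is an assumption; the source does not specify it.
    """
    return max(lowest, min(highest, floor(index + 0.5)))

def accuracy(predicted, gold):
    """Fraction of texts whose predicted level equals the labeled level."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Hypothetical readability indices and labeled levels for five texts.
indices = [0.4, 2.6, 3.2, 5.5, 7.9]
gold = [1, 3, 4, 6, 6]
predicted = [to_level(v) for v in indices]
print(predicted, accuracy(predicted, gold))
```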
Correspondingly, an embodiment of the present invention further provides a Chinese text readability evaluation system fusing text distribution law characteristics. As shown in FIG. 2, the system includes:
a text feature candidate set determining module, configured to determine a text feature candidate set, where the text feature candidate set includes six types of features: character, word, sentence, discourse, distribution law, and readability formula features;
a feature value calculation module, configured to calculate the feature values of the six types of features for the training discourse texts, the feature values being used to fit the readability formula parameters and to train the machine learning model;
a design and training module, configured to perform readability formula design or machine learning model training based on the features in the text feature candidate set and the calculated feature values;
and a prediction module, configured to perform readability prediction on an arbitrary discourse text using the designed readability formula or the trained machine learning model.
Further, the training discourse texts are texts labeled with a plurality of readability classification levels, and the arbitrary discourse text is any discourse text to be classified according to the readability classification levels of the training discourse texts.
Further, the design and training module is specifically configured to:
perform Pearson correlation analysis on a first preset group of features in the text feature candidate set, and select the features whose Pearson correlation coefficient is below a preset value α as the features for building a multiple linear regression model;
and perform regression analysis on the selected features to obtain the regression model with the highest goodness of fit to the readability levels.
Further, when α is 0.7 and the training discourse texts are texts from multiple editions of primary school Chinese textbooks for grades 1 to 6, labeled with six readability classification levels, the readability formula obtained by fitting is expressed as:
Y = 15.739 + 0.025*avesen_char + 0.04*difficult_char + 51.588*difficult_word + 6.380*Gini + 0.253*strokefre + 1.437*lgcharfre - 1.914*charwordpro - 1.013*TC + 6.121*substanpro - 2.914*adjpro + 4.38*funcpro + 2.5*unlistwordpro + 4.236*wordlenfre + 0.688*Lambda + 0.644*avelgwordfre;
where avesen_char is the average sentence length, difficult_char is the Chinese character difficulty, difficult_word is the vocabulary difficulty, Gini is the Gini coefficient, strokefre is the frequency-weighted stroke count, lgcharfre is the average logarithmic character frequency, charwordpro is the character-word ratio, TC is the topic concentration, substanpro is the content word ratio, adjpro is the adjective ratio, funcpro is the function word ratio, unlistwordpro is the unlisted (out-of-vocabulary) word ratio, wordlenfre is the frequency-weighted word length, Lambda is the Lambda value, and avelgwordfre is the average logarithmic word frequency.
Further, the design and training module is also specifically configured to:
perform Pearson correlation analysis on a second preset group of features in the text feature candidate set, and select features that are strongly correlated with the classification levels and weakly correlated with each other;
and train machine learning models with the selected features as input features, and select the best-performing machine learning model.
When the training discourse texts are texts from multiple editions of primary school Chinese textbooks for grades 1 to 6, labeled with six readability classification levels, the 22 selected features are those listed in Table 5, and the logistic regression model performs best.
The system of this embodiment may be used to implement the technical solution of the method embodiment shown in FIG. 1. Its implementation principle and technical effect are similar and are not described again here.
In the embodiment of the invention, text distribution law features are combined with character, word, sentence, and readability formula features; readability formula design and machine learning model training are carried out after feature selection; and the readability prediction accuracy for discourse texts is improved (0.46 versus 0.36 for the formula method and 0.52 versus 0.46 for the machine learning method in the test above).
The above description only illustrates the preferred embodiments of the present invention and is not to be construed as limiting the invention. Any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included within its scope of protection.

Claims (6)

1. A Chinese text readability evaluation method fusing text distribution law characteristics, characterized by comprising the following steps:
S1, determining a text feature candidate set, wherein the text feature candidate set comprises six types of features: character, word, sentence, discourse, distribution law, and readability formula features;
S2, calculating feature values of the six types of features for training discourse texts, the feature values being used to fit readability formula parameters and to train a machine learning model;
S3, performing readability formula design or machine learning model training based on the features in the text feature candidate set and the calculated feature values;
S4, performing readability prediction on an arbitrary discourse text by using the designed readability formula or the trained machine learning model;
wherein the readability formula design specifically includes the steps of:
performing Pearson correlation analysis on a first preset group of features in the text feature candidate set, and selecting the features whose Pearson correlation coefficient is below a preset value α as the features for building a multiple linear regression model;
performing regression analysis on the selected features to obtain the regression model with the highest goodness of fit to the readability levels;
when α is 0.7 and the training discourse texts in step S2 are texts from multiple editions of primary school Chinese textbooks for grades 1 to 6, labeled with six readability classification levels, the readability formula obtained by fitting is expressed as:
Y = 15.739 + 0.025*avesen_char + 0.04*difficult_char + 51.588*difficult_word + 6.380*Gini + 0.253*strokefre + 1.437*lgcharfre - 1.914*charwordpro - 1.013*TC + 6.121*substanpro - 2.914*adjpro + 4.38*funcpro + 2.5*unlistwordpro + 4.236*wordlenfre + 0.688*Lambda + 0.644*avelgwordfre;
where avesen_char is the average sentence length, difficult_char is the Chinese character difficulty, difficult_word is the vocabulary difficulty, Gini is the Gini coefficient, strokefre is the frequency-weighted stroke count, lgcharfre is the average logarithmic character frequency, charwordpro is the character-word ratio, TC is the topic concentration, substanpro is the content word ratio, adjpro is the adjective ratio, funcpro is the function word ratio, unlistwordpro is the unlisted (out-of-vocabulary) word ratio, wordlenfre is the frequency-weighted word length, Lambda is the Lambda value, and avelgwordfre is the average logarithmic word frequency.
2. The Chinese text readability evaluation method according to claim 1, wherein the training discourse texts in step S2 are texts labeled with a plurality of readability classification levels, and the arbitrary discourse text in step S4 is any discourse text to be classified according to the readability classification levels of the training discourse texts in step S2.
3. The Chinese text readability evaluation method according to claim 1, wherein the machine learning model training specifically comprises the steps of:
performing Pearson correlation analysis on a second preset group of features in the text feature candidate set, and selecting features that are strongly correlated with the classification levels and weakly correlated with each other;
training machine learning models with the selected features as input features, and selecting the best-performing machine learning model.
4. A Chinese text readability evaluation system fusing text distribution law characteristics, characterized by comprising:
a text feature candidate set determining module, configured to determine a text feature candidate set, wherein the text feature candidate set comprises six types of features: character, word, sentence, discourse, distribution law, and readability formula features;
a feature value calculation module, configured to calculate feature values of the six types of features for training discourse texts, the feature values being used to fit readability formula parameters and to train a machine learning model;
a design and training module, configured to perform readability formula design or machine learning model training based on the features in the text feature candidate set and the calculated feature values;
a prediction module, configured to perform readability prediction on an arbitrary discourse text by using the designed readability formula or the trained machine learning model;
wherein the design and training module is specifically configured to:
perform Pearson correlation analysis on a first preset group of features in the text feature candidate set, and select the features whose Pearson correlation coefficient is below a preset value α as the features for building a multiple linear regression model;
perform regression analysis on the selected features to obtain the regression model with the highest goodness of fit to the readability levels;
when α is 0.7 and the training discourse texts are texts from multiple editions of primary school Chinese textbooks for grades 1 to 6, labeled with six readability classification levels, the readability formula obtained by fitting is expressed as:
Y = 15.739 + 0.025*avesen_char + 0.04*difficult_char + 51.588*difficult_word + 6.380*Gini + 0.253*strokefre + 1.437*lgcharfre - 1.914*charwordpro - 1.013*TC + 6.121*substanpro - 2.914*adjpro + 4.38*funcpro + 2.5*unlistwordpro + 4.236*wordlenfre + 0.688*Lambda + 0.644*avelgwordfre;
where avesen_char is the average sentence length, difficult_char is the Chinese character difficulty, difficult_word is the vocabulary difficulty, Gini is the Gini coefficient, strokefre is the frequency-weighted stroke count, lgcharfre is the average logarithmic character frequency, charwordpro is the character-word ratio, TC is the topic concentration, substanpro is the content word ratio, adjpro is the adjective ratio, funcpro is the function word ratio, unlistwordpro is the unlisted (out-of-vocabulary) word ratio, wordlenfre is the frequency-weighted word length, Lambda is the Lambda value, and avelgwordfre is the average logarithmic word frequency.
5. The Chinese text readability evaluation system according to claim 4, wherein the training discourse texts are texts labeled with a plurality of readability classification levels, and the arbitrary discourse text is any discourse text to be classified according to the readability classification levels of the training discourse texts.
6. The Chinese text readability evaluation system according to claim 4, wherein the design and training module is further configured to:
perform Pearson correlation analysis on a second preset group of features in the text feature candidate set, and select features that are strongly correlated with the classification levels and weakly correlated with each other;
and train machine learning models with the selected features as input features, and select the best-performing machine learning model.
CN202111289536.6A 2021-11-02 2021-11-02 Chinese text readability evaluation method and system fusing text distribution law characteristics Active CN113934850B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111289536.6A CN113934850B (en) 2021-11-02 2021-11-02 Chinese text readability evaluation method and system fusing text distribution law characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111289536.6A CN113934850B (en) 2021-11-02 2021-11-02 Chinese text readability evaluation method and system fusing text distribution law characteristics

Publications (2)

Publication Number Publication Date
CN113934850A CN113934850A (en) 2022-01-14
CN113934850B true CN113934850B (en) 2022-06-17

Family

ID=79285457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111289536.6A Active CN113934850B (en) 2021-11-02 2021-11-02 Chinese text readability evaluation method and system fusing text distribution law characteristics

Country Status (1)

Country Link
CN (1) CN113934850B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147013B (en) * 2022-08-31 2023-07-18 南京复保科技有限公司 Insurance product readability calculating method, apparatus, computer device and storage medium
CN115859962B (en) * 2022-12-26 2023-06-16 北京师范大学 Text readability evaluation method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009032240A (en) * 2007-06-27 2009-02-12 Nagaoka Univ Of Technology Text legibility evaluation system and text legibility evaluation method
CN107506346A (en) * 2017-07-10 2017-12-22 北京享阅教育科技有限公司 A kind of Chinese reading grade of difficulty method and system based on machine learning
CN109933668A (en) * 2019-03-19 2019-06-25 北京师范大学 The classified estimation modeling method of simplified Chinese language text readability
CN113569556A (en) * 2021-07-28 2021-10-29 怀化学院 Rous model-based classification method for text difficulty in reading test of children

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9858695B2 (en) * 2014-05-30 2018-01-02 Focus Reading Technology Inc System and methods for improving the readability of content
US20200066174A1 (en) * 2018-08-22 2020-02-27 Harald John Torgesen, III Read Write Communicate Education Tool
CN112651356B (en) * 2020-12-30 2024-01-23 杭州菲助科技有限公司 Video difficulty grading model acquisition method and video difficulty grading method
CN113343690B (en) * 2021-06-22 2024-03-12 北京语言大学 Text readability automatic evaluation method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
《阿Q正传》译入译出文本的风格计量学对比;许明 等;《外语研究》;20200715;第37卷(第03期);86-92、103 *
Attention-based Deep Learning Model for Text Readability Evaluation;Y. Sun 等;《2020 International Joint Conference on Neural Networks (IJCNN)》;20200928;1-8 *
阅读初探:基于小学教材的汉语性公式研究;刘苗苗 等;《语言文字应用》;20210515(第02期);116-126 *

Also Published As

Publication number Publication date
CN113934850A (en) 2022-01-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant