CN109933668B

CN109933668B - Hierarchical evaluation modeling method for readability of simplified Chinese text

Info

Publication number: CN109933668B
Application number: CN201910206775.7A
Authority: CN
Inventors: 李虹; 李苗苗; 李燕
Original assignee: Beijing Normal University
Current assignee: Beijing Normal University
Priority date: 2019-03-19
Filing date: 2019-03-19
Publication date: 2021-03-26
Anticipated expiration: 2039-03-19
Also published as: CN109933668A

Abstract

The invention belongs to the field of Chinese language data processing, and particularly relates to a hierarchical evaluation modeling method for readability of simplified Chinese texts. The method comprises the following steps: creating a standard corpus; extracting text features; and constructing a readability formula and evaluating the effect of the formula. On the basis of traditional Chinese readability formulas, the invention selects text features at three levels (Chinese characters, vocabulary and sentences) and constructs a Chinese text readability formula with grade classification that is suitable for native readers of simplified Chinese at the primary-school stage.

Description

Hierarchical evaluation modeling method for readability of simplified Chinese text

Technical Field

The invention belongs to the field of Chinese language data processing, and particularly relates to a hierarchical evaluation modeling method for readability of simplified Chinese texts.

Background

In the modern information society, the number of children's books is growing exponentially, and how to select good books suitable for children from this vast sea of books is a problem that troubles teachers and parents. According to the zone of proximal development theory, the difficulty of children's reading materials should be slightly above the child's current level of development, but not too far above it, so that reading trains and improves the child's reading ability. If the selected material is too difficult, the child's reading efficiency suffers and the child may come to avoid reading; material that is too simple bores the child and erodes interest in reading, so that neither reading habits nor reading ability are developed. At present, most existing book grading systems are led by publishers. They are not grounded in solid theoretical research, and their effectiveness has not been verified by empirical research; as a result they are not sufficiently scientific, enjoy little public credibility or influence, and offer limited guidance for young readers. To match children's reading ability with the difficulty of books, reading ability must be assessed accurately and text difficulty must be evaluated accurately as well, so developing an objective and efficient Chinese text readability formula is one of the difficult and actively studied problems in current graded-reading research.

A readability formula extracts quantifiable text features that affect reading difficulty and uses a mathematical expression to determine the functional relationship between these features and text difficulty. There are currently dozens of readability formulas and leveling systems for English, such as the Lexile readability formula in the United States, the A-Z leveling method, and the Oxford Reading Tree series in the United Kingdom. These have high accuracy and a wide range of application, and large graded-reading systems have been built on them, which has greatly promoted the development of reading ability and reading habits among English-speaking children.

Because Chinese and English differ greatly, readability formulas from the English-speaking world cannot be applied directly to Chinese text. However, only 7 Chinese readability formulas with a retrievable mathematical expression currently exist; they mainly target traditional Chinese or the teaching of Chinese to learners, most of them do not provide clear grade-division standards, and they offer limited guidance for primary-school pupils in mainland China choosing reading material. Creating a text readability formula for native readers of simplified Chinese at the primary-school level therefore remains a challenging frontier task.

Disclosure of Invention

The invention aims to provide a simplified Chinese text readability grading evaluation modeling method.

The method for modeling simplified Chinese text readability by hierarchical evaluation according to the specific embodiment of the invention comprises the following steps:

selecting a proper text to establish a standard corpus and carrying out grade marking on the text;

the characteristics of the text are extracted,

defining text difficulty features at the character, word and sentence levels, performing word segmentation and character, word and sentence labeling and the like on the texts in the standard corpus, calculating the difficulty feature values of each text, and then selecting an optimal feature set from the text difficulty features;

a text readability grading evaluation formula is constructed,

the text in the standard corpus is divided into a training text set and a test text set,

the marked grade of the training text set is used as the dependent variable $Y$, and the optimal feature set is used as the independent variables $(X_1, X_2, X_3)$; a linear regression model is adopted to obtain the following readability grading evaluation formula:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \mu_i$$

wherein $Y_i$ represents the readability level (1-12) of the text; $X_{1i}$, $X_{2i}$ and $X_{3i}$ represent the values of this text on the three features of the optimal feature set; $\beta_0$ is a constant representing the intercept; and $\beta_1$, $\beta_2$ and $\beta_3$ are partial regression coefficients, each representing the amount of change in $Y$ when $X_1$, $X_2$ or $X_3$ changes by one unit with the other variables remaining unchanged;

and evaluating the readability formula by taking the test text set as a reference.

According to the grading evaluation modeling method for the readability of the simplified Chinese text, in the step of extracting the text characteristics, an NLPIR Chinese word segmentation system is adopted to perform word segmentation and part-of-speech tagging on the text.

According to the grading evaluation modeling method for readability of simplified Chinese texts, which is disclosed by the embodiment of the invention, the optimal feature set is selected through the following steps:

respectively calculating the correlation between all the text difficulty characteristics and the text difficulty grades, and sequencing the text difficulty characteristics from large to small according to the absolute value of the correlation coefficient;

according to the sorting, sequentially selecting text difficulty characteristic values to enter an alternative characteristic set, and establishing a regression equation;

and selecting the text difficulty features to be kept in the candidate feature set through collinearity judgment to obtain an optimal feature set.

According to the grading evaluation modeling method for readability of simplified Chinese texts, the method for selecting the text difficulty characteristics left in the alternative characteristic set through collinearity judgment comprises the following steps:

if, for the text difficulty features $X_1, X_2, \ldots, X_k$ in the candidate feature set, there exist numbers $\lambda_1, \lambda_2, \ldots, \lambda_k$, not all zero, such that $\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k + \mu_i = 0$, the candidate feature set has a collinearity problem; in that case the two text difficulty features with the collinearity problem are found, the $\Delta R^2$ obtained by adding each of the two features is compared with the other features held unchanged, and the feature with the larger $\Delta R^2$ is retained in the candidate feature set; if the candidate feature set has no collinearity problem, the $\Delta R^2$ after adding the feature is calculated, and the feature is retained in the candidate feature set if $\Delta R^2 > 2\%$ and deleted otherwise;

and circulating the steps until all the text difficulty features in the alternative feature set are traversed.

According to the hierarchical evaluation modeling method for readability of simplified Chinese texts, the construction method of the hierarchical evaluation formula for readability of simplified Chinese texts comprises the following steps:

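the marked grade of the training text set is used as the dependent variable $Y$, and the optimal feature set is used as the independent variables $(X_1, X_2, X_3)$. Let $Y$ vary with $X_1$, $X_2$, $X_3$ in a linear relationship:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \mu_i \quad (i = 1, 2, 3, \ldots, n).$$

Suppose $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ are the estimates of the parameters $\beta_0, \beta_1, \beta_2, \beta_3$ respectively; the regression value of $Y$ can then be expressed as:

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \hat{\beta}_3 X_{3i}.$$

The residual $e_i$ between the observed value $Y_i$ and the regression value $\hat{Y}_i$ is

$$e_i = Y_i - \hat{Y}_i.$$

According to the method of least squares, $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ should be such that the sum of squared deviations between all observations $Y_i$ and the regression values $\hat{Y}_i$ is minimized, i.e. such that

$$Q = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{1i} - \hat{\beta}_2 X_{2i} - \hat{\beta}_3 X_{3i}\right)^2$$

takes its minimum value. According to the extreme value principle of multivariate functions, the first-order partial derivatives of $Q$ with respect to $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ are calculated and set equal to zero, i.e.

$$\frac{\partial Q}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n} e_i = 0, \qquad \frac{\partial Q}{\partial \hat{\beta}_j} = -2\sum_{i=1}^{n} e_i X_{ji} = 0 \quad (j = 1, 2, 3).$$

In matrix form, with $X$ the $n \times 4$ sample observation matrix whose $i$-th row is $(1, X_{1i}, X_{2i}, X_{3i})$ and $e = (e_1, \ldots, e_n)'$, these conditions read

$$X'e = 0.$$

Let $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3)'$ be the vector of estimates. Multiplying both sides of the sample regression model $Y = X\hat{\beta} + e$ by the transpose $X'$ of the sample observation matrix $X$ gives

$$X'Y = X'X\hat{\beta} + X'e,$$

and, because $X'e = 0$, the normal equation system is

$$X'X\hat{\beta} = X'Y.$$

Since there is no multicollinearity, the 4th-order square matrix $X'X$ has full rank and its inverse $(X'X)^{-1}$ exists, thus

$$\hat{\beta} = (X'X)^{-1}X'Y,$$

i.e. the OLS estimator for $\beta$, from which $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ are obtained.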
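The least-squares estimate described above can be sketched in a few lines of Python. This is only an illustration: the helper names `solve` and `ols` and the toy data are assumptions made for the example, and the normal equations $X'X\hat{\beta} = X'Y$ are solved by Gaussian elimination rather than by forming the inverse explicitly.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # pivot row
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x


def ols(columns, y):
    """Return [b0, b1, ..., bk] solving the normal equations X'X b = X'Y."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    return solve(xtx, xty)


# Toy data (hypothetical), generated exactly from y = 1 + 2*x1 - x2 + 0.5*x3.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
x3 = [0, 2, 0, 2, 4, 4]
y = [1 + 2 * a - b + 0.5 * c for a, b, c in zip(x1, x2, x3)]

beta = ols([x1, x2, x3], y)  # approximately [1.0, 2.0, -1.0, 0.5]
```

If the feature columns were exactly collinear, $X'X$ would be singular and `solve` would fail with a division by zero, which mirrors the full-rank requirement stated in the derivation.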

According to the grading evaluation modeling method for readability of the simplified Chinese text, which is provided by the specific embodiment of the invention, a test text set is taken as a reference, and a grading evaluation formula for readability of the simplified Chinese text is evaluated through the following steps:

calculating the correlation coefficient $r$ between the observed value $Y_{\text{observed}}$ calculated from the readability formula and the actual value $Y_{\text{actual}}$ of the test text set;

calculating $R^2$, the proportion of variance in the test text set data that is explained by the readability formula, where $R^2 = r^2$;

calculating the adjacent accuracy: an evaluation is deemed correct if $|Y_{\text{observed}} - Y_{\text{actual}}| \le 1$, and the adjacent accuracy is the proportion of correctly evaluated texts in the total number of texts in the test text set;

calculating the root mean square error over the $n$ evaluated texts:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_{\text{observed},i} - Y_{\text{actual},i}\right)^2};$$

the closer $r$ is to 1 (with $0 < r < 1$), the closer $R^2$ is to 1 (with $0 < R^2 < 1$), the closer the adjacent accuracy is to 1, and the smaller the root mean square error, the more accurate the readability grading evaluation formula is judged to be.

The invention has the beneficial effects that:

based on the characteristics of Chinese, the invention provides a hierarchical evaluation modeling method that automatically analyzes the difficulty features of Chinese texts at three levels (Chinese characters, vocabulary and syntax), ensuring the objectivity of text difficulty evaluation;

based on statistical principles, features are optimized after a comprehensive analysis of 44 text features, which simplifies the model, avoids the problem of multicollinearity, and improves the interpretability of the model while maintaining prediction accuracy;

the invention constructs a Chinese readability formula and a text grading system that can be combined with the assessment of Chinese reading ability, so that a graded reading system with Chinese characteristics can ultimately be established and promoted, the reading ability of students can be effectively matched with the difficulty of books, and the development of reading ability in children and teenagers can be promoted on a scientific basis.

Drawings

FIG. 1 shows a flow chart of a hierarchical assessment method of the present invention;

FIG. 2 shows a flow chart of optimal feature set selection.

Detailed Description

Example 1

As shown in FIG. 1, the modeling method for hierarchical evaluation of readability of simplified Chinese text of the present invention comprises the following steps:

1. Establishing the gold-standard corpus, i.e. defining the dependent variable

1.1 selecting appropriate text

The invention is mainly aimed at reading materials for primary-school children in mainland China, so the selected texts come from four editions of primary-school Chinese textbooks widely used in mainland China, namely the editions of the People's Education, Beijing University, Jiangsu Education and Southwest University publishers. Each publisher contributes one set (12 volumes), for 48 volumes in total, and each volume carries clear grade information (the volume number) that can be used as the grade of its texts.

1.2 screening text

Because ancient Chinese differs greatly from modern Chinese in syntax and word meaning, and because modern poetry has no punctuation marks, which makes sentence-level text features difficult to compute, texts such as ancient poetry, classical Chinese prose and modern poetry were removed by manual inspection. The final gold-standard corpus contains 1478 texts totaling 801550 characters; the specific information is shown in Table 1.

TABLE 1 Standard corpus

1.3 text rating labels

Each text is labeled with a level from 1 to 12 according to the volume of the textbook in which it appears (each of the six grades is divided into a first and a second semester, giving 12 volumes in total).

2. Extracting text features, i.e. defining the independent variables

2.1 defining text features

The invention defines 44 text difficulty features at three levels (characters, words and sentences); the specific feature names and definitions are shown in Table 2:

TABLE 2 Text feature summary

2.2 text preprocessing

The method uses the NLPIR Chinese word segmentation system (from NLPIR.org, the natural language processing and information retrieval sharing platform) to perform word segmentation and part-of-speech tagging on the text; the word segmentation and tagging accuracy of the system reaches 98.45%.

2.3 text feature computation

2.3.1 counting the number of characters, the number of words, the number of types and the number of punctuation marks in the text;

2.3.2 comparing the characters and words with a Chinese character stroke-count table, a word difficulty level table and the like to obtain the relevant information for each character and word;

2.3.3 counting the part-of-speech distribution of the vocabulary;

2.3.4 according to the operational definitions of the 44 features in Table 2 and the results of 2.3.1 to 2.3.3, obtaining the corresponding 44 feature values for each text.
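As a rough illustration of steps 2.3.1 to 2.3.4, the following Python sketch computes a few features from a text that has already been segmented and tagged. The input format (a list of `(word, tag)` pairs), the tag names in `FUNCTION_TAGS` and `PUNCT_TAG`, the `char_difficulty` lookup table and the exact feature definitions are all assumptions made for this example; the operational definitions of Table 2 are not reproduced here.

```python
# Hypothetical part-of-speech tags; the tag set of the real segmenter may differ.
FUNCTION_TAGS = frozenset({"p", "c", "u", "y"})  # e.g. preposition, conjunction, auxiliary, particle
PUNCT_TAG = "w"


def text_features(tagged_tokens, char_difficulty):
    """Compute a few illustrative features from (word, tag) pairs.

    char_difficulty maps a character to a difficulty level (hypothetical table).
    """
    words = [(word, tag) for word, tag in tagged_tokens if tag != PUNCT_TAG]
    chars = [ch for word, _ in words for ch in word]
    char_types = set(chars)

    # Average difficulty over the distinct characters that appear in the table.
    known = [char_difficulty[ch] for ch in char_types if ch in char_difficulty]
    avg_difficulty = sum(known) / len(known) if known else 0.0

    # Share of non-punctuation words carrying a function-word tag.
    function_words = sum(1 for _, tag in words if tag in FUNCTION_TAGS)
    function_ratio = function_words / len(words) if words else 0.0

    return {
        "char_count": len(chars),
        "char_type_count": len(char_types),
        "avg_char_difficulty": avg_difficulty,
        "function_word_ratio": function_ratio,
    }


# Toy input (hypothetical tags and difficulty levels).
tagged = [("我", "r"), ("在", "p"), ("学校", "n"), ("读", "v"), ("书", "n"), ("。", "w")]
difficulty = {"我": 1, "在": 1, "学": 1, "校": 2, "读": 2, "书": 1}
features = text_features(tagged, difficulty)
```

In practice the `(word, tag)` pairs would come from the segmentation and tagging step in 2.2, and the lookup tables from the reference lists mentioned in 2.3.2.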

2.4 selecting an optimal feature set

2.4.1 Calculate the correlation coefficient ($r$) between each of the 44 features ($X_1, X_2, X_3, \ldots, X_{44}$) and the text difficulty level ($Y$), specifically

$$r_j = \frac{\sum_{i=1}^{n}\left(X_{ji} - \bar{X}_j\right)\left(Y_i - \bar{Y}\right)}{n\,\sigma_{X_j}\,\sigma_Y}$$

where $j = 1, 2, 3, \ldots, 44$; $n = 1478$; $\sigma_{X_j}$ and $\sigma_Y$ are the standard deviations of $X_j$ and $Y$; $X_{ji}$ is the score of the $i$-th text on the $j$-th text feature; $Y_i$ is the text difficulty level of the $i$-th text; $\bar{X}_j$ is the average score of all texts on the $j$-th text feature; and $\bar{Y}$ is the average of the $Y$ values of all texts.
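The correlation-and-ranking step of 2.4.1 can be sketched as follows. The function names and the toy data are hypothetical; the computation is the ordinary Pearson coefficient, written with sums of squares so that the factor $n$ cancels.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, grades):
    """Return feature names sorted by |r| with the grade labels, largest first."""
    scored = {name: pearson_r(values, grades) for name, values in features.items()}
    return sorted(scored, key=lambda name: abs(scored[name]), reverse=True)


# Toy data (hypothetical, not taken from the corpus).
grades = [1, 2, 3, 4, 5, 6]
features = {
    "char_types": [50, 80, 120, 150, 210, 260],
    "function_word_ratio": [0.30, 0.28, 0.31, 0.27, 0.29, 0.30],
    "avg_sentence_len": [20, 18, 15, 14, 11, 9],
}
ranking = rank_features(features, grades)
```

Sorting by the absolute value means that a strongly negative correlation (as for the toy `avg_sentence_len`) ranks as high as a strongly positive one.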

2.4.2 Sort the 44 features from largest to smallest by the absolute value of the correlation coefficient ($r$), add one feature at a time to the candidate feature set in that order, and establish the regression equation

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \mu_i,$$

where $Y_i$ is the difficulty level of the $i$-th text; $X_{1i}, X_{2i}, \ldots, X_{ki}$ are the scores of that text on the $k$ features in the candidate feature set; $\beta_0$ is a constant representing the intercept; and $\beta_1, \beta_2, \ldots, \beta_k$ are partial regression coefficients, each representing the amount of change in $Y$ when the corresponding variable $X_1, X_2, \ldots, X_k$ changes by one unit with the other variables remaining unchanged.

2.4.3 making collinearity decisions

If, for the features $X_1, X_2, \ldots, X_k$ currently in the candidate feature set, there exist constants $\lambda_1, \lambda_2, \ldots, \lambda_k$, not all zero, and $\mu$ such that $\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k + \mu = 0$, the candidate feature set is judged to have a collinearity problem. Conversely, if the equation has no solution, that is, no constants $\lambda_1, \lambda_2, \ldots, \lambda_k$, not all zero, and $\mu$ can be found that make it hold, there is no collinearity problem.

When the candidate feature set has a collinearity problem, the pairwise correlation coefficients among the $k$ features $X_1, X_2, \ldots, X_k$ are calculated; if the correlation coefficient between two features is greater than 0.75, those two features are determined to have a collinearity problem.

Suppose features $X_{k-1}$ and $X_k$ have a collinearity problem. First, a regression model $M_0$ containing neither feature is established: $Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_{k-2} X_{k-2,i} + \mu_i$ (the parameters have the same meaning as in 2.4.2), and the multiple coefficient of determination of the model is calculated:

$$R^2 = \frac{\sum_{i}\left(\hat{Y}_i - \bar{Y}\right)^2}{\sum_{i}\left(Y_i - \bar{Y}\right)^2} = 1 - \frac{\sum_{i}\left(Y_i - \hat{Y}_i\right)^2}{\sum_{i}\left(Y_i - \bar{Y}\right)^2}$$

where $\hat{Y}_i$ is the $Y$ value of each text calculated from the regression model, $Y_i$ is the actual value of $Y$, and $\bar{Y}$ is the average of the $Y$ values.

Then features $X_{k-1}$ and $X_k$ are added separately to model $M_0$, giving model $M_1$: $Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_{k-2} X_{k-2,i} + \beta_{k-1} X_{k-1,i} + \mu_i$ and model $M_2$: $Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_{k-2} X_{k-2,i} + \beta_k X_{ki} + \mu_i$ (the parameters have the same meaning as in 2.4.2), and the multiple coefficients of determination $R^2_{M_1}$ and $R^2_{M_2}$ of models $M_1$ and $M_2$ are obtained in the same way. Finally, the increase in $R^2$ of models $M_1$ and $M_2$ relative to model $M_0$ is calculated: $\Delta R^2_{M_1} = R^2_{M_1} - R^2_{M_0}$ and $\Delta R^2_{M_2} = R^2_{M_2} - R^2_{M_0}$, and all features of the model with the larger $\Delta R^2$ are retained in the candidate feature set.

If the candidate feature set has no collinearity problem, the $\Delta R^2$ after adding the feature is calculated; if $\Delta R^2 > 2\%$, the feature is retained in the candidate feature set, otherwise the feature is deleted.

2.4.4 Repeat steps 2.4.2-2.4.3 until all features have been traversed; the flow chart is shown in FIG. 2.

2.4.5 Finally, the optimal feature set is obtained. It comprises three features: the number of character types, the average difficulty of character types in the literacy table, and the function-word ratio.
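The selection loop of steps 2.4.2 to 2.4.4 can be sketched as below. This is a simplified reading, not the exact procedure: collinearity is detected only through the pairwise check (|r| > 0.75) against features already kept, and all names (`select_features`, `r_squared`, the toy features `x1` to `x4`) are hypothetical.

```python
import math


def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def solve(a, b):
    """Gaussian elimination with partial pivoting for a x = b."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x


def r_squared(columns, y):
    """R^2 of an OLS fit of y on the given columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - rss / tss


def select_features(ranked, features, y, r_max=0.75, min_gain=0.02):
    selected = []
    for name in ranked:
        # First kept feature that is collinear with the newcomer, if any.
        rival = next((s for s in selected
                      if abs(pearson_r(features[s], features[name])) > r_max), None)
        if rival is None:
            base = r_squared([features[s] for s in selected], y)
            gain = r_squared([features[s] for s in selected + [name]], y) - base
            if gain > min_gain:  # keep only if R^2 rises by more than 2%
                selected.append(name)
        else:
            # Both candidates are compared against the same base model, so
            # comparing R^2 is equivalent to comparing the two delta-R^2 values.
            others = [features[s] for s in selected if s != rival]
            if r_squared(others + [features[name]], y) > r_squared(others + [features[rival]], y):
                selected[selected.index(rival)] = name
    return selected


# Toy data (hypothetical): y = x1 + x2; x3 is a noisy copy of x1; x4 is unrelated.
y = [1, 2, 3, 4, 5, 6, 7, 8]
features = {
    "x1": [0, 0, 0, 0, 4, 4, 4, 4],
    "x2": [1, 2, 3, 4, 1, 2, 3, 4],
    "x3": [0, 0, 0, 0, 4, 4, 4, 0],
    "x4": [1, 0, 0, 1, 0, 1, 1, 0],
}
ranked = sorted(features, key=lambda nm: abs(pearson_r(features[nm], y)), reverse=True)
selected = select_features(ranked, features, y)
```

On this toy data `x3` loses to its collinear rival `x1`, `x2` is kept because it raises $R^2$ substantially, and `x4` is dropped because it adds essentially nothing.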

3. Establishing readability formula and evaluating formula effect

3.1 determining training and test text sets

The texts in each volume of the Chinese textbooks are randomly divided into a training text set and a test text set, such that within each edition and each volume the ratio of the number of training texts to the number of test texts is 1:1.
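A minimal sketch of such a per-volume random split is given below. The record layout `(text_id, edition, volume)`, the function name `split_by_volume` and the fixed seed are assumptions for the example; the handling of volumes with an odd number of texts (the extra text goes to the training set) is also an assumption, since the source does not say.

```python
import random
from collections import defaultdict


def split_by_volume(texts, seed=0):
    """Split (text_id, edition, volume) records 1:1 within each edition and volume."""
    groups = defaultdict(list)
    for text_id, edition, volume in texts:
        groups[(edition, volume)].append(text_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for key in sorted(groups):
        ids = groups[key][:]
        rng.shuffle(ids)
        half = (len(ids) + 1) // 2  # assumption: an odd leftover goes to training
        train.extend(ids[:half])
        test.extend(ids[half:])
    return train, test


# Toy corpus (hypothetical): 2 editions x 2 volumes x 4 texts.
texts = [(f"{e}-{v}-{k}", e, v) for e in ("A", "B") for v in (1, 2) for k in range(4)]
train_ids, test_ids = split_by_volume(texts, seed=1)
```

Splitting inside each (edition, volume) group, rather than over the whole corpus at once, is what keeps the 1:1 ratio for every edition and every volume.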

3.2 establishing readability formulas

The labeled grade of the training text set is taken as the dependent variable $Y$, and the optimal feature set determined in step 2.4 (number of character types, average difficulty of character types in the literacy table, and function-word ratio) is taken as the independent variables $(X_1, X_2, X_3)$. A readability formula is constructed with a linear regression model, as follows:

Let $Y$ vary with $X_1$, $X_2$, $X_3$ in a linear relationship, formulated as:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \mu_i,$$

where $Y_i$ is the readability level of the text; $X_{1i}$, $X_{2i}$ and $X_{3i}$ are the text's values for the number of character types, the average difficulty of character types in the literacy table, and the function-word ratio, respectively; $\beta_0$ is a constant representing the intercept; and $\beta_1$, $\beta_2$, $\beta_3$ are partial regression coefficients, each representing the amount of change in $Y$ when $X_1$, $X_2$ or $X_3$ changes by one unit with the other variables remaining unchanged.

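Suppose that $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ are the estimates of the parameters $\beta_0, \beta_1, \beta_2, \beta_3$, so that the regression value of $Y$ is $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \hat{\beta}_3 X_{3i}$. The residual $e_i$ between the observed value $Y_i$ and the regression value $\hat{Y}_i$ is

$$e_i = Y_i - \hat{Y}_i.$$

According to the method of least squares, the estimates should be such that the sum of squared deviations between all observations $Y_i$ and the regression values $\hat{Y}_i$ is minimized, i.e. such that

$$Q = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{1i} - \hat{\beta}_2 X_{2i} - \hat{\beta}_3 X_{3i}\right)^2$$

takes its minimum value. The first-order partial derivatives of $Q$ with respect to $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ are calculated and set equal to zero, i.e.

$$\frac{\partial Q}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n} e_i = 0, \qquad \frac{\partial Q}{\partial \hat{\beta}_j} = -2\sum_{i=1}^{n} e_i X_{ji} = 0 \quad (j = 1, 2, 3).$$

After rearranging and simplifying, the matrix form is

$$X'e = 0,$$

where $X$ is the $n \times 4$ sample observation matrix whose $i$-th row is $(1, X_{1i}, X_{2i}, X_{3i})$ and $e = (e_1, \ldots, e_n)'$. Let $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3)'$ be the vector of estimates. Multiplying both sides of the sample regression model $Y = X\hat{\beta} + e$ by $X'$ and using $X'e = 0$ gives the normal equation system

$$X'X\hat{\beta} = X'Y,$$

so that

$$\hat{\beta} = (X'X)^{-1}X'Y,$$

i.e. the OLS estimator for $\beta$, from which $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ are finally obtained.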

The resulting readability formula is:

$$\text{Grade} = -4.84 + 0.01 \times \text{(number of character types)} + 3.34 \times \text{(average difficulty of character types in the literacy table)} + 7.83 \times \text{(function-word ratio)}.$$
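Written as a function, the formula above might look like the sketch below. It reads the leading term as an intercept of -4.84 and the three inputs as the number of character types, the average literacy-table difficulty of the character types, and the function-word ratio; the parameter names and the sample input values are hypothetical and are not taken from the corpus.

```python
def predict_grade(char_type_count, avg_char_difficulty, function_word_ratio):
    """Apply the readability formula; returns an unrounded level estimate."""
    return (-4.84
            + 0.01 * char_type_count
            + 3.34 * avg_char_difficulty
            + 7.83 * function_word_ratio)


# Hypothetical feature values for one text.
level = predict_grade(char_type_count=200, avg_char_difficulty=1.5, function_word_ratio=0.2)
# -4.84 + 2.0 + 5.01 + 1.566 = 3.736
```

Whether and how the estimate is rounded or clipped to the 1-12 range before it is compared with the labeled level is not specified here, so the sketch leaves the value unrounded.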

3.3 readability formula evaluation

And evaluating the readability formula by taking the test text set as a reference, wherein the method specifically comprises the following steps:

3.3.1 Calculate the $r$ value: calculate the correlation coefficient between the observed value ($Y_{\text{observed}}$) calculated from the readability formula and the actual value ($Y_{\text{actual}}$) of the test text set (the calculation formula is the same as in 2.4.1), specifically

$$r = \frac{\sum_{i=1}^{n}\left(Y_{\text{observed},i} - \bar{Y}_{\text{observed}}\right)\left(Y_{\text{actual},i} - \bar{Y}_{\text{actual}}\right)}{n\,\sigma_{Y_{\text{observed}}}\,\sigma_{Y_{\text{actual}}}}$$

where $n = 1478$; $\sigma_{Y_{\text{observed}}}$ and $\sigma_{Y_{\text{actual}}}$ are the standard deviations of $Y_{\text{observed}}$ and $Y_{\text{actual}}$ respectively; $Y_{\text{observed},i}$ is the difficulty level of the $i$-th text calculated by the readability formula; $Y_{\text{actual},i}$ is the actual text difficulty level of the $i$-th text; $\bar{Y}_{\text{observed}}$ is the average of all observed text difficulty levels; and $\bar{Y}_{\text{actual}}$ is the average of all actual text difficulty levels. The value of $r$ ranges from 0 to 1, and the closer it is to 1, the better the readability formula.

3.3.2 Calculate $R^2$: $R^2$ is an important index for measuring the regression result and represents the proportion of variance in the difficulty values of the test text set that is explained by the readability formula, $R^2 = r^2$.

The value of $R^2$ ranges from 0 to 1, and the closer it is to 1, the better the readability formula.

3.3.3 Calculate the adjacent accuracy: a prediction is counted as adjacent-accurate when the observed value differs from the actual value by no more than one level. For example, if the actual value of a text is 3, an observed value of 2, 3 or 4 is counted as correct. The adjacent accuracy is the proportion of texts for which $|Y_{\text{observed}} - Y_{\text{actual}}| \le 1$; it ranges from 0 to 1, and the closer it is to 1, the better the readability formula.

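3.3.4 Root mean square error: the root mean square error is the square root of the mean squared deviation between the observed values and the actual values, and is calculated as follows:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_{\text{observed},i} - Y_{\text{actual},i}\right)^2}$$

The smaller the value, the better.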
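The four indices of 3.3.1 to 3.3.4 can be computed together, as in the following sketch. The function name `evaluate`, the dictionary keys and the toy levels are hypothetical; the adjacent-accuracy test uses a difference of at most one level, matching the example in 3.3.3.

```python
import math


def evaluate(observed, actual):
    """Return r, R^2, adjacent accuracy and RMSE for predicted vs. labeled levels."""
    n = len(actual)
    mean_o = sum(observed) / n
    mean_a = sum(actual) / n
    cov = sum((o - mean_o) * (a - mean_a) for o, a in zip(observed, actual))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    r = cov / math.sqrt(var_o * var_a)

    # A text counts as correct when prediction and label differ by at most one level.
    adjacent = sum(1 for o, a in zip(observed, actual) if abs(o - a) <= 1) / n
    rmse = math.sqrt(sum((o - a) ** 2 for o, a in zip(observed, actual)) / n)
    return {"r": r, "r2": r * r, "adjacent_accuracy": adjacent, "rmse": rmse}


# Toy levels (hypothetical): three of the four predictions are within one level.
metrics = evaluate(observed=[1, 2, 4, 7], actual=[1, 3, 4, 4])
```

Here the observed values are passed in as they come; if the formula's output is rounded to whole levels before evaluation, that rounding would be applied before calling `evaluate`.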

The indexes of the readability formula constructed by the invention are shown in table 3:

TABLE 3 readability formula indices

According to the result, the Chinese readability formula constructed by the method can be used for predicting the difficulty of Chinese texts in the primary school stage and carrying out 1-12-grade difficulty calibration.

Claims

1. The hierarchical evaluation modeling method for the readability of the simplified Chinese text is characterized by comprising the following steps of:

extracting text features;

defining text difficulty features at the character, word and sentence levels, performing word segmentation and character, word and sentence labeling on the texts in a standard corpus, calculating the difficulty feature values of each text, and then selecting an optimal feature set from the text difficulty features;

a text readability grading evaluation formula is constructed,

Y_i＝β₀+β₁X_1i+β₂X_2i+β₃X_3i+μ_i，

wherein $\beta_0$ is a constant representing the intercept, and $\beta_1$, $\beta_2$ and $\beta_3$ are partial regression coefficients, each representing the amount of change in $Y$ when $X_1$, $X_2$ or $X_3$ changes by one unit with the other variables remaining unchanged,

evaluating the readability grading evaluation formula by taking the test text set as a reference,

wherein,

selecting an optimal feature set by:

respectively calculating correlation coefficients of the text difficulty features and the text difficulty grades, and sequencing the text difficulty features according to absolute values of the correlation coefficients;

according to the sorting, sequentially selecting the difficulty features to enter an alternative feature set, and establishing a regression equation;

selecting the text difficulty characteristics left in the candidate characteristic set through collinearity judgment to obtain an optimal characteristic set,

wherein,

the method for selecting the text difficulty characteristics left in the alternative characteristic set through collinearity judgment comprises the following steps:

if, for the text difficulty features $X_1, X_2, \ldots, X_k$ in the candidate feature set, there exist numbers $\lambda_1, \lambda_2, \ldots, \lambda_k$, not all zero, such that $\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k + \mu_i = 0$, the candidate feature set has a collinearity problem; in that case the two text difficulty features with the collinearity problem are found, the $\Delta R^2$ obtained by adding each of the two features is compared with the other features held unchanged, and the feature with the larger $\Delta R^2$ is retained in the candidate feature set; if the candidate feature set has no collinearity problem, the $\Delta R^2$ after adding the feature is calculated, and the text difficulty feature is retained in the candidate feature set if $\Delta R^2 > 2\%$ and deleted otherwise;

2. The modeling method for hierarchical assessment of readability of simplified Chinese text according to claim 1, wherein in the step of extracting the text features, the text is processed by word segmentation and part-of-speech tagging using the NLPIR Chinese word segmentation system.

3. The modeling method for hierarchical evaluation of readability of simplified Chinese text according to claim 1, wherein the readability hierarchical evaluation formula is constructed as follows:

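the residual $e_i$ between the observed value $Y_i$ and the regression value $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \hat{\beta}_3 X_{3i}$ is

$$e_i = Y_i - \hat{Y}_i;$$

according to the method of least squares, the estimates $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ should be such that the sum of squared deviations between all observations $Y_i$ and the regression values $\hat{Y}_i$ is minimized, i.e. such that

$$Q = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$

takes its minimum value;

the first-order partial derivatives of $Q$ with respect to $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ are calculated and set equal to zero, i.e.

$$\frac{\partial Q}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n} e_i = 0, \qquad \frac{\partial Q}{\partial \hat{\beta}_j} = -2\sum_{i=1}^{n} e_i X_{ji} = 0 \quad (j = 1, 2, 3),$$

which in matrix form is

$$X'e = 0,$$

where $X$ is the sample observation matrix whose $i$-th row is $(1, X_{1i}, X_{2i}, X_{3i})$; letting $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3)'$ be the vector of estimates, multiplying both sides of the sample regression model $Y = X\hat{\beta} + e$ by $X'$ and using $X'e = 0$ gives the equation system

$$X'X\hat{\beta} = X'Y,$$

so that

$$\hat{\beta} = (X'X)^{-1}X'Y,$$

i.e. the OLS estimator for $\beta$, from which $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ are obtained.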

4. The modeling method for hierarchical assessment of readability of simplified Chinese text according to claim 1, wherein the simplified Chinese text readability hierarchical assessment formula is assessed with reference to the test text set by the following steps:

calculating the correlation coefficient $r$ between the observed value $Y_{\text{observed}}$ calculated from the readability formula and the actual value $Y_{\text{actual}}$ of the test text set;

calculating the root mean square error over the $n$ evaluated texts:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_{\text{observed},i} - Y_{\text{actual},i}\right)^2};$$

the closer $r$ is to 1 (with $0 < r < 1$), the closer $R^2$ is to 1 (with $0 < R^2 < 1$), the closer the adjacent accuracy is to 1, and