CN109933668A

CN109933668A - Graded evaluation modeling method for the readability of Simplified Chinese text

Info

Publication number: CN109933668A
Application number: CN201910206775.7A
Authority: CN
Inventors: 李虹; 李苗苗; 李燕
Original assignee: Beijing Normal University
Current assignee: Beijing Normal University
Priority date: 2019-03-19
Filing date: 2019-03-19
Publication date: 2019-06-25
Anticipated expiration: 2039-03-19
Also published as: CN109933668B

Abstract

The invention belongs to the field of Chinese language data processing and relates in particular to a graded evaluation modeling method for the readability of Simplified Chinese text. The method comprises the following steps: creating a standard corpus; extracting text features; and constructing a readability formula and assessing its performance. Building on existing Chinese readability formulas, the invention selects text features at three levels (character, word and sentence) and constructs a Chinese text readability formula with grade-level classification that is suited to native Simplified Chinese readers of primary school age.

Description

Graded evaluation modeling method for the readability of Simplified Chinese text

Technical field

The invention belongs to the field of Chinese language data processing and relates in particular to a graded evaluation modeling method for the readability of Simplified Chinese text.

Background

In today's information society the number of children's books is growing exponentially, and picking out good books that suit a child from this vast supply has become a problem for teachers and parents. According to the theory of the zone of proximal development, the difficulty of children's reading material should be slightly above the child's current level of development, but not too far above it, if reading ability is to be trained and improved. If the selected material is too difficult, it damages the child's sense of self-efficacy in reading and leads the child to avoid reading; material that is too easy bores the child, reduces interest in reading, and fails to foster reading habits or improve reading ability. Existing book grading systems are mostly led by publishers. They are neither grounded in solid theoretical research nor validated by empirical studies, so they lack scientific rigor, credibility and influence, and offer limited guidance for young readers. To match children's reading ability with book difficulty, children's reading ability must be assessed accurately and, at the same time, an objective and efficient Chinese text readability formula must be developed to assess text difficulty accurately. This is one of the difficult and actively studied problems in current graded-reading research.

A readability formula is a mathematical expression built by extracting quantifiable text features that influence reading difficulty and determining the functional relationship between these features and text difficulty. More than ten readability formulas and leveling systems currently exist for English, such as the Lexile readability formula from the United States, the A-Z leveling system, and the Oxford Reading Tree series from the United Kingdom. These are highly accurate and widely used, large graded-reading systems have been built on them, and they have played a major role in developing the reading ability and reading habits of English-speaking children.

Because Chinese and English differ greatly, readability formulas developed for English cannot be applied directly to Chinese text. Only seven Chinese readability formulas with a published mathematical expression can currently be found. They mainly target learners of Traditional Chinese or learners of Chinese as a foreign language, and most do not provide a specific grading standard, so they offer limited guidance for selecting reading material for primary school pupils in mainland China. Creating a text readability formula for native Simplified Chinese readers of primary school age therefore remains challenging, cutting-edge work.

Summary of the invention

The purpose of the present invention is to provide a graded evaluation modeling method for the readability of Simplified Chinese text.

The graded evaluation modeling method for the readability of Simplified Chinese text according to a specific embodiment of the present invention comprises the following steps:

Selecting suitable texts to establish a standard corpus, and labeling each text with a grade level;

Extracting text features:

Defining text difficulty features at the character, word and sentence levels, performing word segmentation and word and sentence annotation on the texts in the standard corpus, calculating the difficulty feature values of each text, and then selecting the optimal feature set from the text difficulty features;

Constructing the text readability graded evaluation formula:

Dividing the texts in the standard corpus into a training text set and a test text set;

Taking the labeled grade level of the training text set as the dependent variable Y and the optimal feature set as the independent variables (X₁, X₂, X₃), and using a linear regression model to obtain the readability graded evaluation formula:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \mu_i$,

where $Y_i$ is the readability level (1-12) of text i; $X_{1i}$, $X_{2i}$ and $X_{3i}$ are the values of the three optimal features for that text; $\beta_0$ is a constant representing the intercept; $\beta_1$, $\beta_2$ and $\beta_3$ are partial regression coefficients, each representing the change in Y when the variable $X_1$, $X_2$ or $X_3$ changes by one unit while the other variables remain unchanged; and $\mu_i$ is the error term;

Assessing the readability formula using the test text set as reference.

In the graded evaluation modeling method according to a specific embodiment of the present invention, in the text feature extraction step, the NLPIR Chinese word segmentation system is used to perform word segmentation and part-of-speech tagging on the text.

In the graded evaluation modeling method according to a specific embodiment of the present invention, the optimal feature set is selected by the following steps:

Calculating the correlation between each text difficulty feature and the text difficulty level, and sorting the text difficulty features by the absolute value of the correlation coefficient from largest to smallest;

Following this order, adding the text difficulty features one at a time to the candidate feature set and establishing a regression equation;

Selecting, by a collinearity check, which text difficulty features remain in the candidate feature set, thereby obtaining the optimal feature set.

In the graded evaluation modeling method according to a specific embodiment of the present invention, the text difficulty features that remain in the candidate feature set are selected by the collinearity check as follows:

If, for the text difficulty features $X_1, X_2, \dots, X_k$ in the candidate feature set, there exist numbers $\lambda_1, \lambda_2, \dots, \lambda_k$ that are not all 0 such that $\lambda_1 X_1 + \lambda_2 X_2 + \dots + \lambda_k X_k + \mu_i = 0$, then collinearity exists in the candidate feature set. In that case the two collinear text difficulty features are identified and, with all other features held fixed, the $\Delta R^2$ obtained by adding each of the two features is compared; the feature with the larger $\Delta R^2$ is retained in the candidate feature set. If there is no collinearity problem in the candidate feature set, the $\Delta R^2$ after adding the feature is calculated; if $\Delta R^2 > 2\%$ the feature is retained in the candidate feature set, otherwise it is deleted;

The above steps are repeated until all text difficulty features have been traversed.

In the graded evaluation modeling method according to a specific embodiment of the present invention, the Simplified Chinese text readability graded evaluation formula is constructed as follows:

Take the labeled grade level of the training text set as the dependent variable Y and the optimal feature set as the independent variables (X₁, X₂, X₃). Suppose Y varies with X₁, X₂ and X₃ and the relationship is linear:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \mu_i \quad (i = 1, 2, 3, \dots, n)$.

Let $\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3$ be the least-squares estimates of the parameters $\beta_0, \beta_1, \beta_2, \beta_3$. The regression value of Y can then be expressed as

$\hat Y_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i} + \hat\beta_3 X_{3i}$.

The residual $e_i$ between the observed value $Y_i$ and the regression value $\hat Y_i$ is

$e_i = Y_i - \hat Y_i$.

According to the least-squares method, $\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3$ should minimize the sum of squared deviations between all observed values $Y_i$ and regression values $\hat Y_i$, i.e. they should minimize

$Q = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \hat\beta_2 X_{2i} - \hat\beta_3 X_{3i}\right)^2$.

By the extremum principle for functions of several variables, the first-order partial derivatives of Q with respect to $\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3$ are taken and set equal to zero, i.e.

$\partial Q / \partial \hat\beta_j = 0 \quad (j = 0, 1, 2, 3)$,

which gives $\sum_i e_i = 0$ and $\sum_i X_{ji} e_i = 0$ (j = 1, 2, 3). In matrix form this is

$X'e = 0$,

where X is the n × 4 sample observation matrix whose first column is all ones.

Let $\hat\beta$ be the vector of estimates. Left-multiplying both sides of the fitted regression model $Y = X\hat\beta + e$ by the transpose $X'$ of the sample observation matrix X gives $X'Y = X'X\hat\beta + X'e$, and because $X'e = 0$ this yields the system of equations

$X'X\hat\beta = X'Y$.

Since there is no multicollinearity, the 4th-order square matrix $X'X$ has full rank and its inverse $(X'X)^{-1}$ exists. Hence $\hat\beta = (X'X)^{-1}X'Y$ is the OLS estimator of β, from which $\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3$ are obtained.

In the graded evaluation modeling method according to a specific embodiment of the present invention, the Simplified Chinese text readability graded evaluation formula is assessed, with the test text set as reference, by the following steps:

Calculating the correlation r between the value $Y_{\text{obs}}$ computed by the readability formula and the actual value $Y_{\text{actual}}$ of the test text set;

Calculating the proportion of variance in the test text set data explained by the readability formula, $R^2 = r^2$;

Calculating the adjacent accuracy: for each text the absolute deviation $|Y_{\text{obs}} - Y_{\text{actual}}|$ is computed, and if it is no greater than 1 the assessment is considered correct; the ratio of the number of correctly assessed texts to the total number of texts in the test text set is the adjacent accuracy;

Calculating the root-mean-square error:

$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(Y_{\text{obs},i} - Y_{\text{actual},i}\right)^2}$.

When 0 < r < 1 and r is closer to 1, and

0 < R² < 1 and R² is closer to 1, and

the adjacent accuracy is at most 1 and is closer to 1, and

the root-mean-square error is smaller, the readability graded evaluation formula is judged to be more accurate.

Beneficial effects of the present invention:

Based on the characteristics of the Chinese language, the present invention provides an automated graded evaluation modeling method that analyzes the difficulty features of Chinese text at three levels (character, word and syntax), ensuring the objectivity of text difficulty evaluation;

Based on statistical principles, the invention optimizes the features on the basis of a comprehensive analysis of 44 text features, which simplifies the model, avoids multicollinearity, and makes the model easier to understand while preserving prediction accuracy;

The Chinese readability formula and text grading system constructed by the invention can be combined with measures of Chinese reading ability to establish and promote a leveled reading system with Chinese characteristics, effectively matching students' reading ability with book difficulty and scientifically advancing the reading ability of children and adolescents.

Brief description of the drawings

Fig. 1 is a flow chart of the graded evaluation method of the invention;

Fig. 2 is a flow chart of the optimal feature set selection.

Specific embodiment

Embodiment 1

As shown in Fig. 1, the graded evaluation modeling method for the readability of Simplified Chinese text of the invention comprises the following steps:

1. Establishing the gold-standard corpus, i.e. defining the dependent variable

1.1 Selecting appropriate texts

The standard corpus must fit the intended use of the readability formula. The invention is aimed mainly at reading material for primary school children in mainland China, so the texts are taken from four widely used editions of mainland primary school Chinese textbooks, published by People's Education Press, Beijing Normal University Press, Jiangsu Education Publishing House and Southwest Normal University Press. One set (12 volumes) is taken from each publisher, 48 volumes in total. Each volume carries explicit level information (its volume number), which can be used as the text level.

1.2 Screening texts

Classical Chinese differs considerably from Modern Chinese in syntax and word meaning, and modern poetry lacks punctuation marks, which makes sentence-level text features hard to count. Classical poems, classical prose and modern poems were therefore removed by manual inspection. The final gold-standard corpus contains 1478 texts totaling 801,550 characters; details are given in Table 1.

Table 1. Standard corpus

1.3 Labeling text levels

According to the volume in which a text appears in the textbooks (each grade has a first-semester and a second-semester volume, so six grades give 12 volumes in total), each text is labeled with a level from 1 to 12.
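The labeling rule in 1.3 can be written as a one-line mapping. The following is a minimal sketch, assuming each text is described by a grade (1 to 6) and a semester (1 for the first, 2 for the second); the function name `volume_to_level` is hypothetical and not part of the source.

```python
def volume_to_level(grade: int, semester: int) -> int:
    """Map a textbook volume (grade 1-6, semester 1-2) to a level from 1 to 12."""
    if not 1 <= grade <= 6:
        raise ValueError("grade must be between 1 and 6")
    if semester not in (1, 2):
        raise ValueError("semester must be 1 (first) or 2 (second)")
    # Two volumes per grade: grade 1 gives levels 1-2, grade 2 gives 3-4, and so on.
    return (grade - 1) * 2 + semester


# All twelve volumes, in textbook order.
levels = [volume_to_level(g, s) for g in range(1, 7) for s in (1, 2)]
```

Under this mapping the second-semester volume of grade 3, for instance, receives level 6.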

2. Extracting text features, i.e. defining the independent variables

2.1 Defining text features

The invention defines a total of 44 text difficulty features at three levels (character, word and sentence). The feature names and definitions are given in Table 2:

Table 2. Summary of text features

2.2 Text preprocessing

The NLPIR Chinese word segmentation system (from NLPIR.org, the Natural Language Processing and Information Retrieval sharing platform) is used to perform word segmentation and part-of-speech tagging on the texts; the system's segmentation and tagging accuracy reaches 98.45%.

2.3 Calculating text features

2.3.1 Count the number of characters, words, character types, word types and punctuation marks in each text;

2.3.2 Look up the characters and words in reference tables, such as the Chinese character stroke-count table and the character and word difficulty-level tables, to obtain the relevant information for each character and word;

2.3.3 Count the part-of-speech distribution of the vocabulary;

2.3.4 Using the operational definitions of the 44 features in Table 2 and the results of 2.3.1 to 2.3.3, obtain the 44 feature values for each text.
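The counting in 2.3.1 and 2.3.3 can be illustrated on text that has already been segmented and tagged. This is a minimal sketch, not the source's implementation: it assumes the input is a list of (token, tag) pairs, that punctuation tags start with "w", and that the tags in `FUNCTION_WORD_TAGS` mark function words. Both tag conventions are assumptions; the operational definitions in Table 2 are not reproduced in this text.

```python
# Assumed tag conventions (hypothetical, adjust to the tagger actually used).
FUNCTION_WORD_TAGS = {"p", "c", "u", "y"}  # preposition, conjunction, particle, modal particle
PUNCT_TAG_PREFIX = "w"


def basic_counts(tagged):
    """Compute simple surface counts from a list of (token, pos_tag) pairs."""
    words = [(tok, pos) for tok, pos in tagged if not pos.startswith(PUNCT_TAG_PREFIX)]
    chars = [ch for tok, _ in words for ch in tok]
    n_function = sum(1 for _, pos in words if pos[:1] in FUNCTION_WORD_TAGS)
    return {
        "char_count": len(chars),                    # number of characters
        "char_types": len(set(chars)),               # number of distinct characters
        "word_count": len(words),                    # number of words
        "word_types": len({tok for tok, _ in words}),  # number of distinct words
        "punct_count": len(tagged) - len(words),     # number of punctuation marks
        "function_word_ratio": n_function / len(words) if words else 0.0,
    }


# Toy input: "我的书和我的笔。" segmented and tagged by hand.
sample = [("我", "r"), ("的", "u"), ("书", "n"), ("和", "c"),
          ("我", "r"), ("的", "u"), ("笔", "n"), ("。", "w")]
features = basic_counts(sample)
```

For the toy sentence this yields 7 characters, 5 character types, 7 words, 5 word types, 1 punctuation mark and a function word ratio of 3/7.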

2.4 Selecting the optimal feature set

2.4.1 Calculate the correlation coefficient (r) between each of the 44 features ($X_1, X_2, X_3, \dots, X_{44}$) and the text difficulty level (Y), specifically

$r_j = \dfrac{\frac{1}{n}\sum_{i=1}^{n} \left(X_{ji} - \bar X_j\right)\left(Y_i - \bar Y\right)}{\sigma_{X_j}\,\sigma_Y}$,

where j = 1, 2, 3, ..., 44; n = 1478; $\sigma_{X_j}$ and $\sigma_Y$ are the standard deviations of $X_j$ and Y; $X_{ji}$ is the score of text i on text feature j; $Y_i$ is the difficulty level of text i; $\bar X_j$ is the mean score of all texts on text feature j; and $\bar Y$ is the mean Y value of all texts.
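The correlation and ranking step in 2.4.1 might look as follows in code. This is an illustrative sketch that uses population standard deviations (division by n); the function names and the toy feature values are made up for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


def rank_features(features, levels):
    """Return feature names sorted by |r| with the level (largest first), plus the r values."""
    scores = {name: pearson_r(values, levels) for name, values in features.items()}
    order = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return order, scores


# Hypothetical feature values for four texts with levels 1 to 4.
toy_levels = [1, 2, 3, 4]
toy_features = {
    "feat_a": [10, 20, 30, 40],  # rises with the level
    "feat_b": [4, 3, 2, 1.5],    # falls with the level
    "feat_c": [1, 3, 2, 4],      # noisier relation
}
order, scores = rank_features(toy_features, toy_levels)
```

Sorting on the absolute value means a strongly negative feature such as `feat_b` ranks ahead of a weaker positive one such as `feat_c`.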

2.4.2 Sort the 44 features by the absolute value of the correlation coefficient (r) from largest to smallest. Following this order, add one feature at a time to the candidate feature set and establish the regression equation $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \mu_i$;

where $Y_i$ is the difficulty level of text i; $X_{1i}, X_{2i}, \dots, X_{ki}$ are the scores of that text on the k candidate features; $\beta_0$ is a constant representing the intercept; and $\beta_1, \beta_2, \dots, \beta_k$ are partial regression coefficients, each representing the change in Y when the corresponding variable $X_1, X_2, \dots, X_k$ changes by one unit while the other variables remain unchanged.

2.4.3 Checking for collinearity

If, for the features $X_1, X_2, \dots, X_k$ currently in the candidate feature set, there exist constants $\lambda_1, \lambda_2, \dots, \lambda_k, \mu$ that are not all 0 such that $\lambda_1 X_1 + \lambda_2 X_2 + \dots + \lambda_k X_k + \mu = 0$, collinearity is judged to exist in the candidate feature set. Conversely, if the equation has no such solution, i.e. no constants $\lambda_1, \lambda_2, \dots, \lambda_k, \mu$ that are not all 0 satisfy it, there is no collinearity problem.

When collinearity exists in the candidate feature set, the pairwise correlation coefficients among the k features $X_1, X_2, \dots, X_k$ are calculated (as in 2.4.1). If the correlation coefficient between two features is greater than 0.75, those two features are determined to be collinear.
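Both checks in 2.4.3 can be sketched with NumPy. The exact-dependence condition is equivalent to the matrix formed by a column of ones and the feature columns being rank-deficient, and the pairwise step scans the correlation matrix. The function names are hypothetical, and comparing the absolute value of the correlation with 0.75 is an assumption of this sketch (the text only says "greater than 0.75").

```python
import numpy as np


def has_exact_collinearity(X):
    """True if some non-trivial combination of the columns plus a constant is zero."""
    design = np.column_stack([np.ones(len(X)), X])
    return np.linalg.matrix_rank(design) < design.shape[1]


def collinear_pairs(X, threshold=0.75):
    """Index pairs of columns whose |correlation| exceeds the threshold (assumed rule)."""
    corr = np.corrcoef(X, rowvar=False)
    k = corr.shape[0]
    return [(a, b) for a in range(k) for b in range(a + 1, k)
            if abs(corr[a, b]) > threshold]


# Toy data: column 1 is exactly twice column 0, column 2 is unrelated.
X = np.array([[1.0, 2.0, 5.0],
              [2.0, 4.0, 3.0],
              [3.0, 6.0, 6.0],
              [4.0, 8.0, 1.0],
              [5.0, 10.0, 4.0]])
```

On the toy matrix the rank test reports collinearity and the pairwise scan singles out columns 0 and 1 as the offending pair.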

Suppose features $X_{k-1}$ and $X_k$ are collinear. First, the regression model $M_0$ without these two features is established: $Y_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_{k-2} X_{(k-2)i} + \mu_i$ (parameters as in 2.4.2), and the model's multiple coefficient of determination is calculated:

$R^2_{M_0} = \dfrac{\sum_{i=1}^{n} \left(\hat Y_i - \bar Y\right)^2}{\sum_{i=1}^{n} \left(Y_i - \bar Y\right)^2}$,

where $\hat Y_i$ is the Y value of each text computed from the regression model, $Y_i$ is the actual Y value, and $\bar Y$ is the mean Y value;

Next, features $X_{k-1}$ and $X_k$ are added separately to the features of model $M_0$, giving model $M_1$: $Y_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_{k-2} X_{(k-2)i} + \beta_{k-1} X_{(k-1)i} + \mu_i$ and model $M_2$: $Y_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_{k-2} X_{(k-2)i} + \beta_k X_{ki} + \mu_i$ (parameters as in 2.4.2). The multiple coefficients of determination $R^2_{M_1}$ and $R^2_{M_2}$ of models $M_1$ and $M_2$ are obtained in the same way. Finally, the increase in $R^2$ of models $M_1$ and $M_2$ relative to model $M_0$ is calculated: $\Delta R^2_{M_1} = R^2_{M_1} - R^2_{M_0}$; $\Delta R^2_{M_2} = R^2_{M_2} - R^2_{M_0}$. All features of the model with the larger $\Delta R^2$ are retained in the candidate feature set.

If there is no collinearity problem in the candidate feature set, the $\Delta R^2$ after adding the feature is calculated; if $\Delta R^2 > 2\%$ the feature is retained in the candidate feature set, otherwise it is deleted.

2.4.4 Steps 2.4.2 to 2.4.3 are repeated until all features have been traversed; see Fig. 2 for the flow chart.

2.4.5 The optimal feature set is finally obtained. In the present invention it contains three features: "word kind", "character learning literary name kind average difficulty" and "function word ratio".
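The loop in 2.4.2 to 2.4.4 can be sketched as follows. This is one possible reading of the procedure, not the source's code: collinearity is detected with the pairwise rule (|r| above 0.75, an assumed interpretation), a collinear newcomer replaces the clashing feature only if it gives the larger R², and a non-collinear newcomer is kept only if it raises R² by more than 0.02. All names and the synthetic data are hypothetical.

```python
import numpy as np


def r_squared(X, y, cols):
    """R^2 of an OLS fit of y on an intercept plus the listed columns of X."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def select_features(X, y, corr_threshold=0.75, min_gain=0.02):
    k = X.shape[1]
    r_with_y = [np.corrcoef(X[:, j], y)[0, 1] for j in range(k)]
    order = sorted(range(k), key=lambda j: abs(r_with_y[j]), reverse=True)
    selected = []
    for j in order:
        # Look for an already selected feature that is collinear with feature j.
        clash = next((s for s in selected
                      if abs(np.corrcoef(X[:, s], X[:, j])[0, 1]) > corr_threshold), None)
        if clash is None:
            gain = r_squared(X, y, selected + [j]) - r_squared(X, y, selected)
            if gain > min_gain:
                selected.append(j)
        else:
            # Both candidates are compared against the same base model M0,
            # so comparing R^2 is the same as comparing delta R^2.
            base = [s for s in selected if s != clash]
            if r_squared(X, y, base + [j]) > r_squared(X, y, base + [clash]):
                selected[selected.index(clash)] = j
    return selected


# Synthetic data: x1 nearly duplicates x0, x3 is pure noise.
rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
x1 = x0 + 0.05 * rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
X_demo = np.column_stack([x0, x1, x2, x3])
y_demo = 2.0 * x0 + 1.0 * x2 + 0.1 * rng.normal(size=n)
chosen = select_features(X_demo, y_demo)
```

On the synthetic data only one of the two near-duplicate columns survives, the independent predictor is kept, and the noise column is dropped because its gain in R² stays below the 2% threshold.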

3. Constructing the readability formula and assessing its performance

3.1 Determining the training and test text sets

The texts in each Chinese textbook are randomly divided into a training text set and a test text set such that, within each edition and each volume, the ratio of the number of training texts to the number of test texts is 1:1.
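A split that keeps the 1:1 ratio within every edition and volume can be sketched as a stratified random split. The record layout (dictionaries with `edition` and `volume` keys), the function name and the handling of odd-sized groups are assumptions of this example; the source does not say how an odd group is divided.

```python
import random
from collections import defaultdict


def split_half_by_volume(texts, seed=0):
    """Randomly halve the texts inside each (edition, volume) group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for text in texts:
        groups[(text["edition"], text["volume"])].append(text)
    train, test = [], []
    for key in sorted(groups):
        members = groups[key][:]
        rng.shuffle(members)
        half = (len(members) + 1) // 2  # assumption: the extra text of an odd group goes to training
        train.extend(members[:half])
        test.extend(members[half:])
    return train, test


# Toy corpus: 2 editions x 2 volumes x 4 texts.
corpus = [{"id": f"{ed}-{vol}-{k}", "edition": ed, "volume": vol}
          for ed in ("A", "B") for vol in (1, 2) for k in range(4)]
train_set, test_set = split_half_by_volume(corpus)
```

Fixing the seed makes the split reproducible, which helps when the formula is refitted later on the same training set.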

3.2 Establishing the readability formula

The labeled level of the training text set is taken as the dependent variable Y, and the optimal feature set determined in step 2.4 ("word kind", "character learning literary name kind average difficulty" and "function word ratio") is taken as the independent variables (X₁, X₂, X₃). A linear regression model is used to construct the readability formula, as follows:

Suppose Y varies with X₁, X₂ and X₃ and the relationship is linear, expressed as:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \mu_i$,

where $Y_i$ is the readability level of the text; $X_{1i}$, $X_{2i}$ and $X_{3i}$ are the scores of that text on "word kind", "character learning literary name kind average difficulty" and "function word ratio" respectively; $\beta_0$ is a constant representing the intercept; and $\beta_1$, $\beta_2$ and $\beta_3$ are partial regression coefficients, each representing the change in Y when the variable $X_1$, $X_2$ or $X_3$ changes by one unit while the other variables remain unchanged.

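Let $\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3$ be the least-squares estimates of the parameters $\beta_0, \beta_1, \beta_2, \beta_3$. The regression value of Y can then be expressed as:

$\hat Y_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i} + \hat\beta_3 X_{3i}$.

The residual $e_i$ between the observed value $Y_i$ and the regression value $\hat Y_i$ is

$e_i = Y_i - \hat Y_i$.

According to the least-squares method, $\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3$ should minimize the sum of squared deviations between all observed values $Y_i$ and regression values $\hat Y_i$, i.e. they should minimize

$Q = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \hat\beta_2 X_{2i} - \hat\beta_3 X_{3i}\right)^2$.

By the extremum principle for functions of several variables, the first-order partial derivatives of Q with respect to $\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3$ are taken and set equal to zero, i.e.

$\partial Q / \partial \hat\beta_j = 0 \quad (j = 0, 1, 2, 3)$,

which gives $\sum_i e_i = 0$ and $\sum_i X_{ji} e_i = 0$ (j = 1, 2, 3). After rearranging and simplifying, the matrix form is

$X'e = 0$,

where X is the n × 4 sample observation matrix whose first column is all ones.

Let $\hat\beta$ be the vector of estimates. Left-multiplying both sides of the fitted regression model $Y = X\hat\beta + e$ by the transpose $X'$ of the sample observation matrix X gives $X'Y = X'X\hat\beta + X'e$, and because $X'e = 0$ this yields the normal equations

$X'X\hat\beta = X'Y$.

Since there is no multicollinearity, the 4th-order square matrix $X'X$ has full rank and its inverse $(X'X)^{-1}$ exists. Hence $\hat\beta = (X'X)^{-1}X'Y$ is the OLS estimator of β.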
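The normal-equation solution can be checked numerically. The following is a minimal sketch on synthetic, noise-free data, assuming three feature columns; the function name and the coefficient values are made up and are not the ones fitted in the source.

```python
import numpy as np


def ols_normal_equations(features, y):
    """Solve X'X b = X'y, where X is the feature matrix with a leading column of ones."""
    X = np.column_stack([np.ones(len(y)), features])  # n x 4 for three features
    xtx = X.T @ X                                     # 4 x 4 matrix X'X
    if np.linalg.matrix_rank(xtx) < xtx.shape[0]:
        raise ValueError("X'X is singular: the features are collinear")
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(xtx, X.T @ y)


# Synthetic data generated from known (hypothetical) coefficients.
rng = np.random.default_rng(42)
demo_features = rng.uniform(0.0, 1.0, size=(50, 3))
true_beta = np.array([1.0, 2.0, -3.0, 0.5])
demo_y = true_beta[0] + demo_features @ true_beta[1:]
beta_hat = ols_normal_equations(demo_features, demo_y)

# Residuals of the fit are orthogonal to every column of X, i.e. X'e = 0.
X_full = np.column_stack([np.ones(len(demo_y)), demo_features])
residual_check = X_full.T @ (demo_y - X_full @ beta_hat)
```

Because the data contain no noise, the estimate recovers the generating coefficients up to rounding, and the orthogonality condition X'e = 0 holds to numerical precision.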

The estimates $\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3$ are finally obtained.

The resulting readability formula is:

Level = -4.84 + 0.01 × ("word kind") + 3.34 × ("character learning literary name kind average difficulty") + 7.83 × ("function word ratio").
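Applying the fitted formula is a single weighted sum. The sketch below uses the coefficients stated above; the parameter names are hypothetical stand-ins for the three features, the input values are invented for illustration, and rounding and clamping the result to the range 1 to 12 is an assumption of this example, since the text does not say how a fractional value is turned into a level.

```python
def readability_level(type_count, mean_type_difficulty, function_word_ratio):
    """Raw value of the readability formula for one text.

    type_count            -> the "word kind" feature
    mean_type_difficulty  -> the "... kind average difficulty" feature
    function_word_ratio   -> the "function word ratio" feature
    """
    return (-4.84
            + 0.01 * type_count
            + 3.34 * mean_type_difficulty
            + 7.83 * function_word_ratio)


def to_level_label(score):
    """Round to the nearest integer and clamp to 1..12 (assumed convention)."""
    return max(1, min(12, round(score)))


# Invented feature values, for illustration only.
raw = readability_level(type_count=300, mean_type_difficulty=1.5, function_word_ratio=0.2)
label = to_level_label(raw)
```

With these invented inputs the raw value is 4.736, which the assumed rounding rule maps to level 5.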

3.3 Assessing the readability formula

The readability formula is assessed with the test text set as reference. The specific steps are:

3.3.1 Calculating r: the correlation coefficient between the value computed by the readability formula ($Y_{\text{obs}}$) and the actual value of the test text set ($Y_{\text{actual}}$) is calculated with the same formula as in 2.4.1, specifically

$r = \dfrac{\frac{1}{n}\sum_{i=1}^{n} \left(Y_{\text{obs},i} - \bar Y_{\text{obs}}\right)\left(Y_{\text{actual},i} - \bar Y_{\text{actual}}\right)}{\sigma_{Y_{\text{obs}}}\,\sigma_{Y_{\text{actual}}}}$,

where n = 1478; $\sigma_{Y_{\text{obs}}}$ and $\sigma_{Y_{\text{actual}}}$ are the standard deviations of $Y_{\text{obs}}$ and $Y_{\text{actual}}$; $Y_{\text{obs},i}$ is the difficulty level of text i computed by the readability formula; $Y_{\text{actual},i}$ is the actual difficulty level of text i; $\bar Y_{\text{obs}}$ is the mean of the computed difficulty levels of all texts; and $\bar Y_{\text{actual}}$ is the mean of the actual difficulty levels of all texts. The value of r is expected to lie between 0 and 1; the closer it is to 1, the better the readability formula performs.

3.3.2 Calculating R²: R² is an important index of regression quality and indicates the proportion of variance in the test text set difficulty values that the readability formula explains, $R^2 = r^2$.

R² ranges from 0 to 1; the closer it is to 1, the better the readability formula performs.

3.3.3 Calculating the adjacent accuracy: adjacent accuracy treats a prediction as correct when the computed value differs from the actual value by no more than one level. For example, if the actual level of a text is 3, computed values of 2, 3 or 4 are all counted as correct. The adjacent accuracy is the proportion of texts for which $|Y_{\text{obs}} - Y_{\text{actual}}| \le 1$; it ranges from 0 to 1, and the closer it is to 1, the better the readability formula performs.

3.3.4 Root-mean-square error: the root-mean-square error measures the size of the deviation between the computed values and the actual values as the square root of the mean squared deviation. The formula is:

$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(Y_{\text{obs},i} - Y_{\text{actual},i}\right)^2}$.

The smaller its value, the better.
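The four indices in 3.3.1 to 3.3.4 can be computed together. This is an illustrative sketch in plain Python; the function name and the toy values are invented and are not results from the source.

```python
import math


def evaluate(computed, actual):
    """Return r, R^2 (= r^2), adjacent accuracy and RMSE for paired level values."""
    n = len(actual)
    mean_c = sum(computed) / n
    mean_a = sum(actual) / n
    cov = sum((c - mean_c) * (a - mean_a) for c, a in zip(computed, actual))
    var_c = sum((c - mean_c) ** 2 for c in computed)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    r = cov / math.sqrt(var_c * var_a)
    # A text counts as correct when the computed level is within one level of the actual one.
    adjacent = sum(1 for c, a in zip(computed, actual) if abs(c - a) <= 1) / n
    rmse = math.sqrt(sum((c - a) ** 2 for c, a in zip(computed, actual)) / n)
    return {"r": r, "r2": r * r, "adjacent_accuracy": adjacent, "rmse": rmse}


# Invented values for six test texts.
actual_levels = [1, 2, 3, 4, 5, 6]
computed_levels = [1.5, 2.0, 4.5, 4.0, 5.5, 5.0]
metrics = evaluate(computed_levels, actual_levels)
```

In the toy data five of the six texts fall within one level of the actual value, so the adjacent accuracy is 5/6, and the root-mean-square error is the square root of 0.625.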

The indices of the readability formula constructed by the invention are shown in Table 3:

Table 3. Indices of the readability formula

The results show that the Chinese readability formula constructed by the invention can be used to predict the difficulty of Chinese texts at primary school level and to calibrate text difficulty on a scale of levels 1 to 12.

Claims

1. A graded evaluation modeling method for the readability of Simplified Chinese text, characterized in that the graded evaluation modeling method comprises the following steps:

Extracting text features:

Defining text difficulty features at the character, word and sentence levels, performing word segmentation and word and sentence annotation on the texts in the standard corpus, calculating the difficulty feature values of each text, and then selecting the optimal feature set from the text difficulty features;

Constructing the text readability graded evaluation formula:

Taking the labeled grade level of the training text set as the dependent variable Y and the optimal feature set as the independent variables (X₁, X₂, X₃), and using a linear regression model to obtain the readability graded evaluation formula:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \mu_i$,

where $\beta_0$ is a constant representing the intercept, and $\beta_1$, $\beta_2$ and $\beta_3$ are partial regression coefficients, each representing the change in Y when the variable $X_1$, $X_2$ or $X_3$ changes by one unit while the other variables remain unchanged;

Assessing the readability graded evaluation formula using the test text set as reference.

2. The graded evaluation modeling method for the readability of Simplified Chinese text according to claim 1, characterized in that, in the text feature extraction step, the NLPIR Chinese word segmentation system is used to perform word segmentation and part-of-speech tagging on the text.

3. The graded evaluation modeling method for the readability of Simplified Chinese text according to claim 1, characterized in that the optimal feature set is selected by the following steps:

Calculating the correlation coefficient between each text difficulty feature and the text difficulty level, and sorting the text difficulty features by the absolute value of the correlation coefficient;

Following this order, adding the difficulty features one at a time to the candidate feature set and establishing a regression equation;

4. The graded evaluation modeling method for the readability of Simplified Chinese text according to claim 1, characterized in that the text difficulty features that remain in the candidate feature set are selected by a collinearity check as follows:

If, for the text difficulty features $X_1, X_2, \dots, X_k$ in the candidate feature set, there exist numbers $\lambda_1, \lambda_2, \dots, \lambda_k$ that are not all 0 such that $\lambda_1 X_1 + \lambda_2 X_2 + \dots + \lambda_k X_k + \mu_i = 0$, then collinearity exists in the candidate feature set. In that case the two collinear text difficulty features are identified and, with all other features held fixed, the $\Delta R^2$ obtained by adding each of the two features separately is compared; the feature with the larger $\Delta R^2$ is retained in the candidate feature set. If there is no collinearity problem in the candidate feature set, the $\Delta R^2$ after adding the feature is calculated; if $\Delta R^2 > 2\%$ the text difficulty feature is retained in the candidate feature set, otherwise the text difficulty feature is deleted;

5. The graded evaluation modeling method for the readability of Simplified Chinese text according to claim 1, characterized in that the readability graded evaluation formula is constructed as follows:

Take the labeled grade level of the training text set as the dependent variable Y and the optimal feature set as the independent variables (X₁, X₂, X₃). Suppose Y varies with X₁, X₂ and X₃ and the relationship is linear: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \mu_i$ (i = 1, 2, 3, ..., n). Let $\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3$ be the least-squares estimates of the parameters $\beta_0, \beta_1, \beta_2, \beta_3$; the regression value of Y can then be expressed as:

$\hat Y_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i} + \hat\beta_3 X_{3i}$.

The residual $e_i$ between the observed value $Y_i$ and the regression value $\hat Y_i$ is $e_i = Y_i - \hat Y_i$.

According to the least-squares method, $\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3$ should minimize the sum of squared deviations between all observed values $Y_i$ and regression values $\hat Y_i$, i.e. they should minimize $Q = \sum_{i=1}^{n} e_i^2$.

Let $\hat\beta$ be the vector of estimates. Left-multiplying both sides of the fitted regression model $Y = X\hat\beta + e$ by the transpose $X'$ of the sample observation matrix X gives $X'Y = X'X\hat\beta + X'e$, and because $X'e = 0$ this yields the system of equations $X'X\hat\beta = X'Y$,

from which $\hat\beta = (X'X)^{-1}X'Y$ is obtained.

6. The graded evaluation modeling method for the readability of Simplified Chinese text according to claim 1, characterized in that the Simplified Chinese text readability graded evaluation formula is assessed, with the test text set as reference, by the following steps:

Calculating the correlation coefficient r between the value $Y_{\text{obs}}$ computed by the readability formula and the actual value $Y_{\text{actual}}$ of the test text set;

Calculating the adjacent accuracy: for each text the absolute deviation $|Y_{\text{obs}} - Y_{\text{actual}}|$ is computed, and if it is no greater than 1 the assessment is considered correct; the ratio of the number of correctly assessed texts to the total number of texts in the test text set is the adjacent accuracy;

Calculating the root-mean-square error:

$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(Y_{\text{obs},i} - Y_{\text{actual},i}\right)^2}$.

When 0 < r < 1 and r is closer to 1, and

0 < R² < 1 and R² is closer to 1, and

the adjacent accuracy is at most 1 and is closer to 1, and