JP5322047B2 - Text readability evaluation system

Info

Publication number: JP5322047B2
Application number: JP2008141689A
Authority: JP
Inventors: 秀子柴崎; 信一郎原
Original assignee: Nagaoka University of Technology
Current assignee: Nagaoka University of Technology
Priority date: 2007-06-27
Filing date: 2008-05-29
Publication date: 2013-10-23
Anticipated expiration: 2028-05-29
Also published as: JP2009032240A

Description

The present invention relates to a system for evaluating the readability of text.

One existing way to measure the readability (the ease or difficulty of reading) of Japanese text is the "readability evaluation" function built into Microsoft Word. This function analyzes a text on eight items: (1) the number of characters, (2) the number of words, (3) the number of sentences, (4) the number of paragraphs, (5) the average number of sentences per paragraph, (6) the average sentence length, (7) the average number of characters between periods, and (8) the proportions of the four character types (kanji, hiragana, katakana, and alphabetic characters). However, it does not report how easy or how difficult the text actually is to read.

Patent Document 1 discloses a method that evaluates the readability of a text from the difficulty of each word in it, but word difficulty alone does not allow readability to be measured accurately.

Patent Document 1: JP 2004-334699 A

The present invention was made in view of the situation described above. It provides a highly practical text readability evaluation system that gives a standard for selecting texts in text-comprehension research in psychology, Japanese language education for native speakers, and Japanese language education for learners, and that also serves as an index for writing. It can be applied to making everyday reading materials (government publications, warning labels for hazardous materials and medicines, business documents, and so on) clear and plain. The system therefore contributes to the academic fields concerned with text comprehension and also supplies useful material for the daily lives of ordinary people. It further helps information to be shared in easy Japanese, both among Japanese readers and among foreigners whose study of Japanese is not yet sufficient, and so makes a substantial social and international contribution.

The gist of the present invention is as follows.

The invention is a text readability evaluation system comprising: text storage means for storing a target text; average character count calculation means for calculating the average number of characters per sentence X_{11} of the text stored in the text storage means; hiragana ratio calculation means for calculating the proportion of hiragana X_{12} in the whole text; average predicate count calculation means for calculating the average number of predicates per sentence X_{13} of the text; average bunsetsu count calculation means for calculating the average number of bunsetsu (phrases) per sentence X_{14} of the text; and evaluation value derivation means for deriving, as the dependent variable, an evaluation value Y that rates the readability of the text by substituting X_{11}, X_{12}, X_{13}, and X_{14}, each as an explanatory variable, into the following equation (4):

Y = a_{11} X_{11} + b_{11} X_{12} + c_{11} X_{13} + d_{11} X_{14} + X_{00}    (4)

where a_{11}, b_{11}, c_{11}, and d_{11} are coefficients and X_{00} is a constant.

The evaluation value derivation means is configured to determine which of Level 1 to Level 9, corresponding respectively to elementary school grade 1 through junior high school grade 3, the evaluation value Y is closest to. The coefficients a_{11}, b_{11}, c_{11}, d_{11} and the constant X_{00} in equation (4) are obtained by multiple regression analysis of a large number of sample texts extracted from the Japanese language textbooks for each of those grades, or from books corresponding to each grade. In that analysis, the explanatory variables are the average number of characters per sentence X_{11}, the proportion of hiragana in the whole text X_{12}, the average number of predicates per sentence X_{13}, and the average number of bunsetsu per sentence X_{14} of each sample text, and the dependent variable is the evaluation value Y assigned in advance to each sample text according to its grade.
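The last step described above, choosing the level that the evaluation value Y is closest to, can be sketched as follows. This is only an illustration under stated assumptions: the function names `nearest_level` and `level_label` are hypothetical, values outside the range 1 to 9 are mapped to the nearest end of the scale because that is what "closest" implies, and the handling of exact ties is an arbitrary choice that the text does not specify.

```python
def nearest_level(y: float) -> int:
    """Return the level in 1..9 whose value is closest to the evaluation value y."""
    # Ties such as y = 2.5 resolve to the lower level because min() keeps the
    # first minimum it meets; the source does not say how ties are handled.
    return min(range(1, 10), key=lambda level: abs(level - y))


def level_label(level: int) -> str:
    """Map Level 1..9 to elementary grades 1..6 and junior high grades 1..3."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    if level <= 6:
        return f"elementary school grade {level}"
    return f"junior high school grade {level - 6}"


print(level_label(nearest_level(7.3)))  # junior high school grade 1
```

Keeping the mapping separate from the regression formula means the same step could be reused with any set of fitted coefficients.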

Because it is configured as described above, the present invention gives a standard for selecting texts in text-comprehension research in psychology, Japanese language education for native speakers, and Japanese language education for learners, and it also serves as an index for writing. It can be applied to making everyday reading materials (government publications, warning labels for hazardous materials and medicines, business documents, and so on) clear and plain. It therefore contributes to the academic fields concerned with text comprehension and also supplies useful material for the daily lives of ordinary people. It further helps information to be shared in easy Japanese, both among Japanese readers and among foreigners whose study of Japanese is not yet sufficient. The result is a highly practical text readability evaluation system that makes a substantial social and international contribution.

Preferred embodiments of the present invention (how the invention is carried out) are briefly described below, together with the operation of the invention.

For the target text stored in the text storage means, the average number of characters per sentence X_1, the average number of words per sentence X_2, the proportion of kango (Sino-Japanese words) X_3, and the average number of dependency relations per sentence X_4 are calculated and substituted into equation (1), given later, to obtain an evaluation value Y. This value allows the readability of multiple target texts to be compared objectively. In particular, it makes possible an objective readability evaluation of Japanese text, which could not be done before.

Specifically, the coefficients a_1, b_1, c_1, d_1 and the constant X_0 are obtained, for example, by multiple regression analysis of a large number of sample texts, with the average number of characters per sentence X_1, the average number of words per sentence X_2, the proportion of kango X_3, and the average number of dependency relations X_4 of each sample as explanatory variables and the evaluation value Y assigned in advance to each sample as the dependent variable. (For example, the samples are Japanese language textbooks or books, and Y is the school grade.) If Y is further set to take values from 1 to 9, corresponding to elementary school grade 1 through junior high school grade 3, the level of a target text can be grasped intuitively, and its readability can be presented objectively and in an easily understood form.

The average number of characters per sentence X_1, the average number of words per sentence X_2, the proportion of kango X_3, and the average number of dependency relations per sentence X_4 can all be obtained easily with existing software, and the multiple regression analysis can likewise be carried out easily with existing software. Combining this readily available information therefore yields a simple and accurate evaluation of text readability.

Furthermore, as in Example 2 described later, the average number of characters per sentence X_{11}, the proportion of hiragana in the whole text X_{12}, the average number of predicates per sentence X_{13}, and the average number of bunsetsu per sentence X_{14} of the target text may be calculated and substituted into equation (4) to obtain the evaluation value Y. In that case the readability of multiple target texts can be compared objectively with still greater accuracy.

Specific examples of the present invention are described below with reference to the drawings.

Example 1 comprises: text storage means for storing a target text; average character count calculation means for calculating the average number of characters per sentence X_1 of the text stored in the text storage means; average word count calculation means for calculating the average number of words per sentence X_2 of the text; kango ratio calculation means for calculating the proportion of kango X_3 in the whole text; average dependency count calculation means for calculating the average number of dependency relations per sentence X_4 of the text; and evaluation value derivation means for deriving, as the dependent variable, an evaluation value Y that rates the readability of the text by substituting X_1, X_2, X_3, and X_4, each as an explanatory variable, into the following equation (1):

Y = a_1 X_1 + b_1 X_2 + c_1 X_3 + d_1 X_4 + X_0    (1)

where a_1, b_1, c_1, and d_1 are coefficients and X_0 is a constant.

Specifically, the coefficients a_1, b_1, c_1, d_1 and the constant X_0 in equation (1) are obtained by multiple regression analysis of a large number of sample texts, with the average number of characters per sentence X_1, the average number of words per sentence X_2, the proportion of kango X_3, and the average number of dependency relations X_4 of each sample as explanatory variables and the evaluation value Y assigned in advance to each sample as the dependent variable.

In Example 1, the readability of a text is measured by four factors: sentence length in characters, sentence length in words, the proportion of word types, and grammatical complexity. The level of the text is reported on a scale from 1 to 9. As shown in Table 1, Levels 1 through 9 correspond to elementary school grade 1 through junior high school grade 3.

The base data for determining the coefficients and the constant were sampled from 45 elementary and junior high school textbooks in total. For elementary school grades 1 through 6, the first and second volumes from three publishers (Mitsumura Shuppan, Gakko Tosho, and Tokyo Shoseki) were used, 36 books in all. For junior high school grades 1 through 3, the textbooks from three publishers (Mitsumura Shuppan, Sanseido, and Tokyo Shoseki) were used, 9 books in all.

From each textbook, only materials intended for reading were used. Materials intended for learning characters or vocabulary, composition, discussion, presentation, and the like were excluded. Poetry was also excluded because it differs greatly from prose in form and expression. As a result, 264 texts were selected from the 45 textbooks (Table 2).

From these, duplicated texts and atypical texts were removed, and 245 texts were finally used as data. Duplicated texts are those that appear in more than one textbook, such as "Okina Kabu" (elementary grade 1), "Kasako Jizo" and "Swimmy" (grade 2), "Mochimochi no Ki" (grade 3), "Shiroi Boshi", "Porepore", "Gongitsune", and "Hitotsu no Hana" (grade 4), "Daizo Jiisan to Gan" and "Chumon no Oi Ryoriten" (grade 5), "Hashire Merosu" (junior high grade 2), and "Kokyo" (junior high grade 3). An example of an atypical text is "Jugemu" (elementary grade 3), in which a single word has an extremely large number of characters.

Each explanatory variable is described below.

The first variable is the average number of characters per sentence. Memory research has shown that the longer a sentence is, the harder it is to read, and the shorter it is, the easier it is to read. Sentence length can be expressed as the number of characters per sentence. All texts to be analyzed were therefore first scanned page by page and captured with eTypist v11.0 (software) so that they could be used on a computer.

Next, the average number of characters per sentence was calculated for every text using the "readability evaluation" function of Microsoft Word. When the texts are grouped by grade, the average number of characters per sentence increases with each grade, as shown in FIG. 1.

The second variable is the average number of words per sentence. In a language written with a single script, such as English with its alphabet, the average number of characters per sentence directly reflects sentence length. Japanese, however, uses four scripts (kanji, hiragana, katakana, and the Roman alphabet), so the inventors judged that character count alone is not a valid measure of sentence length. For example, sentence (1) 「教養課程が開設される。」 ("A liberal arts course will be established.") has 10 characters, whereas sentence (2) 「きょうようかていがかいせつされる。」, the same sentence written entirely in hiragana, has 16.

In practice, a sentence like (2) is never used in everyday life, but how much kanji to use is left to the writer's judgment, and there is no standard. To solve this problem, the inventors reasoned as follows.

Sentences (1) and (2) differ in the number of characters, but they contain the same number of words and the same amount of information. Both (1) 「教養／課程／が／開設される。」 and (2) 「きょうよう／かてい／が／かいせつされる。」 consist of four words. Expressing sentence length by word count in this way is possible not only in Japanese but in any language that uses more than one script. The average number of words per sentence was therefore calculated for each text using Microsoft Word. When the texts are grouped by grade, the average number of words per sentence increases with each grade, as shown in FIG. 1.
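The contrast between the two measures can be checked with a short sketch. It assumes that the sentence-final 「。」 is not counted as a character, which is what the figures of 10 and 16 in the text imply, and it takes the word boundaries from the 「／」 marks in the example rather than from a real tokenizer. The function names are hypothetical.

```python
SEPARATOR = "／"   # word-boundary mark used in the example sentences
PERIOD = "。"      # sentence-final punctuation, assumed not to be counted


def char_count(sentence: str) -> int:
    """Count characters, ignoring the final period and any word separators."""
    return sum(1 for ch in sentence if ch not in (SEPARATOR, PERIOD))


def word_count(segmented: str) -> int:
    """Count words in a sentence whose word boundaries are marked with '／'."""
    return len([w for w in segmented.strip(PERIOD).split(SEPARATOR) if w])


kanji = "教養／課程／が／開設される。"
kana = "きょうよう／かてい／が／かいせつされる。"
print(char_count(kanji), char_count(kana))  # 10 16
print(word_count(kanji), word_count(kana))  # 4 4
```

The character counts differ by a factor of 1.6 while the word counts agree, which is the reason the text gives for adding word count as a second length measure.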

The third variable is the proportion of word types. The Japanese vocabulary comprises several word types, including kango (Sino-Japanese words), wago (native Japanese words), gairaigo (loanwords), and konshugo (hybrid words). In Example 1, the proportion of word types is treated as one of the determinants of readability. The proportions were calculated as follows.

First, all texts were morphologically analyzed with ChaSen (software), and the content words of four kinds (nouns, verbs, adjectives, and adjectival verbs) were counted. Next, Katarigusa (software) was used to classify the content words into kango, wago, gairaigo, konshugo, kango-gairaigo hybrids, wago-gairaigo hybrids, and wago-kango hybrids, and the proportions of kango, wago, gairaigo, and konshugo among all content words were calculated. The results showed that the proportions of gairaigo and konshugo did not change with grade, while the sum of the proportions of wago and kango was nearly constant in every grade. In other words, as the grade rises, wago decreases and kango increases correspondingly. FIG. 2 shows the proportion of each word type by grade.

The fourth variable is grammatical complexity. In Example 1, grammatical complexity is described by the number of bunsetsu that stand in dependency relations and by the distance between them. Every sentence of every text was parsed with the Japanese dependency parser CaboCha (software), and for each text the total number of dependency relations and the distance between each modifying bunsetsu and the bunsetsu it modifies (the number of bunsetsu lying between them) were calculated. In the CaboCha output, the number of bunsetsu between a modifying bunsetsu and the modified bunsetsu reached a maximum of 35, but it is hard to believe that as many as 35 bunsetsu actually lie between the two. This is thought to be a consequence of CaboCha's accuracy being 89.29%. In Example 1, therefore, only distances from 1 to 10 were analyzed.

Table 3 shows, from left to right, the text name, the total number of dependency relations in the text, and the number of relations at each distance between the modifying bunsetsu and the modified bunsetsu. A distance of 1 means that the modified bunsetsu is immediately next to the modifying bunsetsu; a distance of 2 means that one bunsetsu lies between them. For example, text1 is an elementary school grade 1 text. It has 32 dependency relations in total, of which 16, or 50%, have a distance of 1. Almost none have a distance of 4 or more, so in terms of the number and complexity of its dependency relations, text1 consists of simple sentences. By contrast, text4 is a junior high school grade 3 text. Of its 410 dependency relations, only 66, or 16%, have a distance of 1, and many have distances from 2 to 10, so its sentences are complex.
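The kind of tally reported in Table 3 can be sketched as follows. The input format is an assumption made for illustration: each sentence is given as a list `heads`, where `heads[i]` is the index of the bunsetsu that bunsetsu `i` depends on and -1 marks the sentence-final bunsetsu, which has no head. The function names are hypothetical and no parser is called; the sketch only shows the counting and the restriction to distances 1 to 10.

```python
from collections import Counter


def distance_histogram(heads: list[int], max_distance: int = 10) -> Counter:
    """Count dependency relations by distance for one sentence.

    Distance 1 means the modified bunsetsu immediately follows the modifier.
    """
    hist = Counter()
    for i, head in enumerate(heads):
        if head < 0:
            continue  # sentence-final bunsetsu has no head
        distance = head - i
        if 1 <= distance <= max_distance:  # only distances 1..10 are analyzed
            hist[distance] += 1
    return hist


def adjacent_share(hist: Counter) -> float:
    """Share of dependency relations whose distance is 1."""
    total = sum(hist.values())
    return hist[1] / total if total else 0.0


heads = [1, 3, 3, -1]  # toy sentence with four bunsetsu, not taken from Table 3
hist = distance_histogram(heads)
print(dict(hist))                      # {1: 2, 2: 1}
print(round(adjacent_share(hist), 2))  # 0.67
```

Applied to the totals quoted in the text, the same share works out to 16/32 = 0.50 for text1 and 66/410, about 0.16, for text4.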

In this study, grammatical complexity was therefore defined by the following formula (2).

When grammatical complexity (the number of dependency relations per sentence) was calculated for each grade with the formula above, the value rose with each grade, as shown in Table 4 and FIG. 3.

As shown above, all four factors (the average number of characters per sentence, the average number of words per sentence, the proportion of kango, and grammatical complexity, that is, the number of dependency relations per sentence) increase as the grade rises. Treating grade as a linear scale, the four values were calculated for the 245 texts taken from the elementary grade 1 through junior high grade 3 textbooks. With these four variables as explanatory variables and grade as the dependent variable, a multiple regression analysis (forced entry method) was carried out with the statistical analysis software SPSS.
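The source runs this analysis in SPSS. As a rough stand-in, the sketch below fits the same kind of model by ordinary least squares with all explanatory variables entered at once, which is what forced entry amounts to. The function name `fit_ols` is hypothetical, the solver is a plain Gauss-Jordan elimination chosen to keep the example free of third-party libraries, and the data are toy values, not the textbook data.

```python
def fit_ols(rows: list[list[float]], y: list[float]) -> list[float]:
    """Least-squares fit of y = b0 + b1*x1 + ... + bk*xk.

    Returns [b0, b1, ..., bk]. Solves the normal equations (X^T X) b = X^T y
    by Gauss-Jordan elimination with partial pivoting.
    """
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("explanatory variables are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Toy data generated from y = 1 + 2*x1 + 3*x2, used only to check the solver.
rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]]
y = [1, 3, 4, 6, 14]
print([round(b, 6) for b in fit_ols(rows, y)])  # [1.0, 2.0, 3.0]
```

With real data, each row would hold the four values measured for one text and `y` would hold its grade; the first returned value plays the role of the constant and the rest the role of the coefficients.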

The analysis showed that 22 texts deviated greatly from the predicted values. These 22 texts were excluded as outliers, and the multiple regression analysis was run again on the remaining 223 texts. The results are shown in FIG. 4.

The values labeled β are the coefficients. All four variables are significant (p < .001), which shows that it is valid to use them as explanatory variables. Multiplying each variable by its coefficient gives the following equation (3).

Y = 0.171 X_1 - 0.439 X_2 + 13.434 X_3 + 0.065 X_4 + 2.388    (3)

where Y is the grade, X_1 is the average number of characters per sentence, X_2 is the average number of words per sentence, X_3 is the proportion of kango, and X_4 is the average number of dependency relations per sentence.

The R-squared value is the coefficient of determination, which indicates the strength of the prediction. The value obtained, 0.668, is high (the maximum is 1.0), so the predictive power of this equation is high. The grade given by Y is the readability score and takes values from 1 to 9, where 1 is the easiest and 9 the most difficult. By calculating the four variables for a given text and substituting them into the formula, one can determine which grade of Japanese language textbook the text is closest to. This equation was therefore adopted as the readability formula for predicting the grade level of a text. With the formula established, software that calculates each variable can be developed so that anyone can easily determine the level of a text.
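As a worked illustration, the formula with the coefficients 0.171, -0.439, 13.434, 0.065 and the constant 2.388 can be evaluated directly. The sketch below copies those published numbers; everything else is an assumption. In particular, the kango proportion is assumed to be expressed as a fraction between 0 and 1, the function name `readability_grade` is hypothetical, and the input values are made up for the example rather than measured from any text.

```python
def readability_grade(x1: float, x2: float, x3: float, x4: float) -> float:
    """Evaluate the readability formula with the coefficients given in the text.

    x1: average number of characters per sentence
    x2: average number of words per sentence
    x3: proportion of kango among content words (assumed to lie in 0..1)
    x4: average number of dependency relations per sentence
    """
    return 0.171 * x1 - 0.439 * x2 + 13.434 * x3 + 0.065 * x4 + 2.388


# Made-up input values, for illustration only.
y = readability_grade(x1=20.0, x2=10.0, x3=0.10, x4=5.0)
print(round(y, 2))  # 3.09
```

A value near 3 would place such a text closest to elementary school grade 3 on the scale from 1 to 9.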

The individual means of Example 1 are described below.

Example 1 can be implemented on a general-purpose computer (such as a PC). Specifically, the text storage means is a storage device such as a hard disk, an optical disk, or a memory, and an arithmetic processing unit such as a CPU functions as each of the means below according to a program stored in that storage device.

The average character count calculation means reads the target text from the text storage means, calculates the average number of characters per sentence X_1 as the number of characters in the whole text divided by the number of sentences, and stores the result in the storage device.
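A minimal sketch of this calculation is shown below. It assumes that sentences end with 「。」 and that the period itself is not counted, which matches the 10-character count given earlier for 「教養課程が開設される。」. The function name is hypothetical, and real text would need more careful sentence splitting (quotations, question marks, headings).

```python
def average_chars_per_sentence(text: str, terminator: str = "。") -> float:
    """Total number of characters divided by the number of sentences."""
    # Split on the terminator and drop empty pieces such as the one after
    # the final period.
    sentences = [s.strip() for s in text.split(terminator) if s.strip()]
    if not sentences:
        raise ValueError("text contains no sentences")
    return sum(len(s) for s in sentences) / len(sentences)


text = "教養課程が開設される。きょうようかていがかいせつされる。"
print(average_chars_per_sentence(text))  # 13.0
```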

The average word count calculation means reads the target text from the text storage means, segments the whole text into words, calculates the average number of words per sentence X_2 as the number of words in the whole text divided by the number of sentences, and stores the result in the storage device.

The kango ratio calculation means reads the target text from the text storage means, classifies the words of the whole text into kango and other word types, calculates the proportion of kango X_3 as the number of kango in the whole text divided by the number of content words in the whole text, and stores the result in the storage device.
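Once each content word has been assigned a word type, the ratio itself is a simple count. The sketch below assumes that classification has already been done by some external tool and is supplied as a list of labels; the label strings and the function name are hypothetical.

```python
def kango_ratio(word_types: list[str]) -> float:
    """Number of kango divided by the number of content words.

    word_types holds one label per content word, for example "kango",
    "wago", "gairaigo", or "konshugo" (hypothetical label names).
    """
    if not word_types:
        raise ValueError("no content words")
    return sum(1 for t in word_types if t == "kango") / len(word_types)


# Toy classification of five content words, not taken from the source data.
print(kango_ratio(["kango", "kango", "wago", "wago", "gairaigo"]))  # 0.4
```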

The average dependency count calculation means reads the target text from the text storage means, calculates the number of dependency relations per sentence X_4 using formula (2) above, and stores the result in the storage device.

The evaluation value derivation means reads X_1 to X_4, which were calculated by the above means and stored in the storage device, calculates the evaluation value Y by substituting them into equation (3), and either stores the result in the storage device or shows it on a suitable display means such as a monitor. (X_1 to X_4 and the intermediate steps may be displayed together.)

Thus, for example, a program implementing each of the above means is stored in the storage device of a PC or the like, the coefficients a_1, b_1, c_1, d_1 and the constant X_0 in equation (1) are obtained in advance from suitable samples, and an equation such as equation (3) is provided in the evaluation value derivation means. The readability of Japanese text can then be evaluated simply by analyzing X_1 to X_4 of the target text with existing software and entering them.

Example 1 therefore gives a standard for selecting texts in text-comprehension research in psychology, Japanese language education for native speakers, and Japanese language education for learners, and it also serves as an index for writing. It can be applied to making everyday reading materials (government publications, warning labels for hazardous materials and medicines, business documents, and so on) clear and plain. It thus contributes to the academic fields concerned with text comprehension and also supplies useful material for the daily lives of ordinary people. It further helps information to be shared in easy Japanese, both among Japanese readers and among foreigners whose study of Japanese is not yet sufficient, and so is highly practical and makes a substantial social and international contribution.

Example 2 improves part of the formula of Example 1 to allow a more accurate evaluation. Specifically, Example 2 comprises: text storage means for storing a target text; average character count calculation means for calculating the average number of characters per sentence X_{11} of the text stored in the text storage means; hiragana ratio calculation means for calculating the proportion of hiragana X_{12} in the whole text; average predicate count calculation means for calculating the average number of predicates per sentence X_{13} of the text; average bunsetsu count calculation means for calculating the average number of bunsetsu per sentence X_{14} of the text; and evaluation value derivation means for deriving, as the dependent variable, an evaluation value Y that rates the readability of the text by substituting X_{11}, X_{12}, X_{13}, and X_{14}, each as an explanatory variable, into the following equation (4):

Y = a_{11} X_{11} + b_{11} X_{12} + c_{11} X_{13} + d_{11} X_{14} + X_{00}    (4)

where a_{11}, b_{11}, c_{11}, and d_{11} are coefficients and X_{00} is a constant.

The formula used in Example 1 takes the average number of words per sentence as one of its variables. However, after the Japanese-language textbooks were converted to electronic files and analyzed with Microsoft Word and the morphological analysis tool ChaSen, a detailed review of the results revealed some problems in how compound words are counted. For example, 長岡技術科学大学 (Nagaoka University of Technology) is a proper noun and a single word, but Word counts it as four words, 「長岡/技術/科学/大学」. This is because 長岡技術科学大学 is not in the thesaurus (something like a dictionary) on which Word relies. ChaSen, for its part, was originally built to analyze newspaper articles, so it is poorly suited to texts with a lot of hiragana, and its weaknesses show especially in the analysis of compound words. Japanese has a rule that the sound changes when words combine into a compound: 「ほん」 (hon) and 「はこ」 (hako) become 「ほんばこ」 (honbako), and 「はな」 (hana) and 「はたけ」 (hatake) become 「はなばたけ」 (hanabatake), so the sounds of 「はこ」 and 「はたけ」 change. Common vocabulary such as 「本箱」 (bookcase) and 「花畑」 (flower field) is in the ChaSen thesaurus, but uncommon compounds are not.

As a result, when the third-grade elementary school reading 「ごんぎつね」 (Gon-gitsune) is analyzed, the single word 「母さんぎつね」 (mother fox) is split into 「母さん (noun) / ぎ (unknown word) / つね (personal name, proper noun)」. This is because 「母さん」, 「きつね」 and 「狐」 are in the ChaSen thesaurus but 「母さんぎつね」 is not. The same problem appears not only with compounds of this kind but also with onomatopoeia: 「トントン」 is analyzed as 「トン」 (personal name, proper noun). Part of the analysis results is shown in Table 5.

For these reasons, the average number of words per sentence obtained from the analysis tools was judged to be somewhat unreliable, and in Example 2 it was removed from the variables of the formula.

The formula used in Example 1 also takes the average number of dependencies per sentence as a variable, intended to reflect the complexity of grammatical structure. This too turned out to be somewhat problematic.

Japanese grammar has a nested structure (one in which several phrases can be placed between subject and predicate), and the complexity of the grammatical structure may not be adequately explained by the distance between, and number of, dependent and head bunsetsu alone. For example, sentence 1) below has two dependencies, 「少年は投げて」 (the boy threw) and 「少年は走り去った」 (the boy ran away). Sentence 2) also has two, 「少年が投げた」 (the boy threw) and 「ボールは落ちた」 (the ball fell). In 1), however, 「投げて」 and 「走り去った」 stand in a parallel (coordinate) relation, whereas in 2) 「少年が投げた」 modifies 「ボール」 adnominally, so the two cases cannot be treated as the same. Because the average number of dependencies per sentence therefore cannot necessarily be said to represent grammatical complexity, it was excluded from the variables of the formula.

1) 少年はボールを投げて、走り去った。 (The boy threw the ball and ran away.)
2) 少年が投げたボールは河に落ちた。 (The ball the boy threw fell into the river.)

In light of the above, the average number of words per sentence and the average number of dependencies per sentence were removed from the variables used to predict grade. The average number of bunsetsu (phrase units) per sentence is used in place of the average number of words, and the average number of predicates per sentence is used in place of the average number of dependencies. The reasons are given below.

The average number of words per sentence was included in the formula of Example 1 because Japanese uses several scripts, so the average number of characters per sentence alone was thought insufficient as a measure of sentence length. If the average number of words cannot be used for this purpose, the number of bunsetsu per sentence can serve instead as the variable indicating sentence length. In Hashimoto grammar (school grammar), due to Shinkichi Hashimoto, a bunsetsu is described as "in Japanese, a unit of pronunciation in which clitics are attached to an independent word (noun, verb, etc.); there may be no clitic. It is a point where the sentence can be broken by inserting a word such as 'ne' or 'sa' without sounding odd." For example, the sentence 「少年が投げたボールは河に落ちた。」 can be divided as 「少年が/投げた/ボールは/河に/落ちた。」 and is therefore a sentence of five bunsetsu.
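The counting step described above can be sketched as follows. This is only an illustration under the assumption that bunsetsu boundaries have already been marked with "/" (for instance by a parser or by hand); the function names are hypothetical and the sketch does not perform the segmentation itself.

```python
def count_bunsetsu(segmented_sentence: str, delimiter: str = "/") -> int:
    """Count bunsetsu in a sentence whose boundaries are already marked.

    Assumption: segmentation was done beforehand; this only counts segments.
    """
    return sum(1 for part in segmented_sentence.split(delimiter) if part.strip())


def average_bunsetsu_per_sentence(segmented_sentences) -> float:
    """Average number of bunsetsu per sentence over pre-segmented sentences."""
    counts = [count_bunsetsu(s) for s in segmented_sentences]
    if not counts:
        raise ValueError("at least one sentence is required")
    return sum(counts) / len(counts)


# The example sentence from the text, segmented as given there.
example = "少年が/投げた/ボールは/河に/落ちた。"
print(count_bunsetsu(example))  # 5
```

Averaging the per-sentence counts over a whole text gives the quantity used as X₁₄.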

In natural language processing, bunsetsu can be counted with the dependency parser CaboCha. CaboCha also basically parses dependencies on top of ChaSen's morphological analysis, but at the bunsetsu level a particle follows immediately even after a compound word, so errors of splitting a compound and counting its parts are rare. When such errors do occur, they are easy to find by hand. FIG. 5 shows the average number of bunsetsu per sentence computed for each grade from 1 (first grade of elementary school) to 12 (third grade of high school). As shown in FIG. 6, the value rises almost linearly with grade and is considered suitable as a variable. It was therefore adopted as a variable.

In linguistics, a predicate is the component that forms the core of a sentence or clause and expresses something about the other noun phrases. A sentence with one predicate is called a simple sentence (tanbun), and a sentence with two or more predicates is called a complex sentence (fukubun). Because complex sentences are more complicated than simple ones, a text with more complex sentences is harder for the reader. In addition, an experiment by Goetz, Anderson, & Schallert in 1981 found that "for sentences of the same length, those with more propositions are harder". A proposition here is a unit consisting of one predicate and one argument, so counting predicates gives the number of propositions. The complexity of grammatical structure, which is difficult to explain by the number of dependencies, can therefore be explained to some extent by the number of predicates per sentence. The average number of predicates per sentence for each grade is shown in FIG. 6.

In Example 1, Japanese-language textbooks for grades 1 (first grade of elementary school) to 9 (third grade of junior high school) were used as samples. Nine grades are probably sufficient for building a grade-prediction formula, but in Example 2 data for the first to third grades of high school were added and all twelve grades were examined, because adding grades makes it possible to check whether grade behaves "linearly". In the end the data showed that the relationship is "linear" up to grade 9, so only the data for the nine grades were used in the analysis.

Furthermore, one major feature distinguishing Japanese from other languages is its four scripts (hiragana, kanji, katakana and Roman letters), so the proportions of the scripts can also be expected to be strong variables in a readability formula. Example 1 did not take this into account; in Example 2 the script proportion was added as a variable. When the kinds of characters appearing in Japanese-language textbooks from the first grade of elementary school to the third grade of high school were tallied by grade, the result was as shown in FIG. 7.

Hiragana decreases and kanji increases as the grade rises, so either one could be used as a variable. However, many first-grade elementary texts contain no kanji at all. Since including samples with a value of zero may be inappropriate for the calculations in the statistical analysis, the hiragana ratio was chosen as the variable.

From the above analysis, five variables were judged suitable as independent variables for predicting grade when constructing the readability formula: the average number of characters per sentence and the ratio of Sino-Japanese words (kango), both used in Example 1, together with the average number of bunsetsu per sentence, the hiragana ratio, and the average number of predicates per sentence.

Looking at how each variable changes with grade, the trend rises fairly steadily up to about grade 9, so it seems safe to assume that grade is linear from grade 1 to about grade 9 (see FIGS. 6 to 9). The texts analyzed in Example 2 partly overlap with those analyzed in Example 1 but are not identical.

Taking the third grade of junior high school, the end of compulsory education, as a provisional cutoff, the data for grades 1 to 9 were extracted from the full data set and a multiple regression analysis (forced entry method) was run with the five variables above as independent variables and grade as the dependent variable. The forced entry method treats all variables, whether their explanatory power is strong or weak, as effective variables. As a result, 40 texts that deviated from the predicted values were excluded. In Example 1, 22 texts were excluded; in Example 2 the outlier criterion was tightened to improve accuracy, and 40 texts were excluded. Using only the remaining 205 texts, a multiple regression analysis was run again with grade as the dependent variable and the other variables as independent variables, this time by the stepwise method, and the kango ratio was excluded. The stepwise method retains only variables that have explanatory power. The kango ratio was excluded despite its strong correlation with grade (r=0.749, p<.0001), probably because its correlation with the hiragana ratio is extremely high (r=-0.870, p<.0001), which produced multicollinearity. Multicollinearity means that two or more variables explain the same thing; with the stepwise method, only one of the variables explaining the same thing is kept and the rest are excluded.

Table 6 shows the mean and standard deviation of each variable, including the kango ratio, and the correlations between the variables.
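The correlation check behind the multicollinearity argument above can be sketched as follows. This is a minimal illustration that assumes the reported r values are Pearson correlation coefficients (the text does not say which coefficient was used); the function name is hypothetical and no data from the patent is reproduced.

```python
from math import sqrt


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

A value close to -1 between two candidate variables, as reported for the kango ratio and the hiragana ratio, indicates that they carry nearly the same information, which is why keeping both adds little to the regression.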

FIG. 10 shows the result of the multiple regression analysis on the four variables remaining after the kango ratio was excluded. The constant and the coefficients are read from the unstandardized coefficients of Model 4. The coefficients are -0.148 for the ratio of hiragana in the text, 1.585 for the average number of predicates per sentence, -0.117 for the average number of characters per sentence, and -0.126 for the average number of bunsetsu per sentence. The constant is 15.581. Although not shown in FIG. 10, the R² value was 0.858. R² expresses how reliable the predictive power of the formula obtained by multiple regression is, and ranges from a minimum of 0 to a maximum of 1.0. A value of 0.858 indicates that this formula has very strong power to predict grade.
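How a constant, coefficients and R² of this kind are obtained can be sketched as follows. This is a plain ordinary-least-squares fit with all variables entered at once (comparable to forced entry); it does not implement stepwise selection or outlier removal, and it is not the statistical package used for FIG. 10. The function names are hypothetical.

```python
def fit_linear_model(rows, targets):
    """Least-squares fit of y = c0 + c1*x1 + ... ; returns [c0, c1, ...].

    Solves the normal equations (X^T X) c = X^T y by Gauss-Jordan elimination.
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]  # leading 1 = constant
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("variables are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def r_squared(rows, targets, params) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    preds = [params[0] + sum(c * x for c, x in zip(params[1:], row)) for row in rows]
    mean_y = sum(targets) / len(targets)
    ss_res = sum((y - p) ** 2 for y, p in zip(targets, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in targets)
    return 1.0 - ss_res / ss_tot
```

In the setting described here, each row would hold the four measurements of one textbook text and the target would be that text's grade.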

When the four variables were checked for multicollinearity, the dimension with eigenvalue 0.008 and condition index 24.206 showed variance proportions of 0.45 for the average number of predicates per sentence, 0.99 for the average number of characters per sentence, and 0.42 for the average number of bunsetsu per sentence, which suggested possible collinearity among these three variables. To be cautious, the number of predicates was dropped and a stepwise multiple regression was run on the remaining three variables (average number of characters per sentence, hiragana ratio, and average number of bunsetsu per sentence); the average number of characters and the average number of bunsetsu were then excluded, leaving only the hiragana ratio. Even if that is the numerical outcome, however, it is unrealistic for the hiragana ratio alone to be the variable that predicts grade. Moreover, as shown above, the R² with the four explanatory variables (average number of characters per sentence, hiragana ratio, average number of predicates per sentence, and average number of bunsetsu per sentence) is very high at 0.858. The formula using these four variables was therefore judged to be valid, and formula (5) below was obtained by multiplying each variable by its coefficient and adding the constant.

$$Y = -0.117X_{11} - 0.148X_{12} + 1.585X_{13} - 0.126X_{14} + 15.581 \tag{5}$$

Y: grade; X₁₁: average number of characters per sentence; X₁₂: ratio of hiragana in the whole text; X₁₃: average number of predicates per sentence; X₁₄: average number of bunsetsu per sentence.
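Formula (5) is simple enough to write directly as a function. The sketch below uses the coefficients and constant stated above; the function names are hypothetical, and the unit of the hiragana ratio (fraction or percentage) is not stated in this section, so the caller must supply X₁₂ in whatever unit the fitted data used. The helper that picks the nearest of levels 1 to 9 is one possible reading of "which grade the text is closest to".

```python
def predicted_grade(avg_chars: float, hiragana_ratio: float,
                    avg_predicates: float, avg_bunsetsu: float) -> float:
    """Evaluation value Y from formula (5).

    avg_chars      = X11, average characters per sentence
    hiragana_ratio = X12, hiragana ratio (unit is an assumption of the caller)
    avg_predicates = X13, average predicates per sentence
    avg_bunsetsu   = X14, average bunsetsu per sentence
    """
    return (-0.117 * avg_chars
            - 0.148 * hiragana_ratio
            + 1.585 * avg_predicates
            - 0.126 * avg_bunsetsu
            + 15.581)


def nearest_level(y: float, lowest: int = 1, highest: int = 9) -> int:
    """Return the integer level in [lowest, highest] closest to y."""
    return min(range(lowest, highest + 1), key=lambda level: abs(level - y))
```

Values of Y that fall outside 1 to 9 are mapped to the nearest end of the range by `nearest_level`; whether the system should instead report them as out of range is a design choice the text leaves open.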

Thus, by calculating the four variables for a text and applying formula (5), it can be determined which grade of Japanese-language textbook the text is closest to, and more accurately than with the formula of Example 1.

Each means of Example 2 is described below.

The average character count calculating means is the same as in Example 1.

The hiragana ratio calculating means reads the target text stored in the text storage means, classifies the characters of the whole text into hiragana and other scripts, calculates the hiragana ratio X₁₂ of the whole text as (number of hiragana in the whole text) ÷ (number of characters in the whole text), and stores the result in the storage device.
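The calculation just described can be sketched as follows. The sketch assumes that hiragana means the letters U+3041 to U+3096 of the Unicode Hiragana block and that whitespace is not counted as a character; the text does not specify either point, nor whether punctuation counts, so here punctuation is counted. The function names are hypothetical.

```python
def is_hiragana(ch: str) -> bool:
    """True for hiragana letters (assumed range U+3041..U+3096)."""
    return "\u3041" <= ch <= "\u3096"


def hiragana_ratio(text: str) -> float:
    """Number of hiragana divided by number of characters in the whole text.

    Assumption: whitespace is ignored; every other character is counted.
    Returns a fraction between 0 and 1.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        raise ValueError("text contains no characters")
    return sum(1 for ch in chars if is_hiragana(ch)) / len(chars)


print(hiragana_ratio("はなばたけ"))  # 1.0
```

If the regression was fitted on percentages rather than fractions, the result would need to be multiplied by 100 before use as X₁₂.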

The average predicate count calculating means reads the target text stored in the text storage means, calculates the number of predicates per sentence X₁₃ of the target text, and stores the result in the storage device.

The average bunsetsu count calculating means reads the target text stored in the text storage means, calculates the number of bunsetsu per sentence X₁₄ of the target text, and stores the result in the storage device.

The evaluation value deriving means reads X₁₁ to X₁₄, which were calculated by the above means and stored in the storage device, substitutes them into equation (5) to calculate the evaluation value Y, and either stores the result in the storage device or displays it on suitable display means such as a display (X₁₁ to X₁₄ and the intermediate steps may be displayed together).

Accordingly, for example, programs implementing each of the above means are stored in a storage device of a PC or the like, the coefficients a₁₁, b₁₁, c₁₁, d₁₁ and the constant X₀₀ in equation (4) are obtained in advance from suitable samples, and an expression such as equation (5) is provided in the evaluation value deriving means. The readability of a Japanese text can then be evaluated simply by analyzing X₁₁ to X₁₄ of the target text with existing software or the like and entering the results.

In all other respects Example 2 is the same as Example 1.

The following comparative experiment confirmed that Example 2 evaluates the readability of Japanese text appropriately.

On October 10, 2007, the Sato laboratory at Nagoya University published a web page called 「ことば不思議箱」 containing a tool titled 「日本語テキストの難易度を測る」 ("Measuring the difficulty of Japanese text"). Eighteen texts were chosen at random from materials actually printed in Japanese-language textbooks, and the grade of each was computed both with this tool (hereinafter, the comparative example) and with the readability formula of Example 2 (hereinafter, the working example). The results are shown in Table 7. A repeated-measures analysis of variance was performed on the actual grade of the texts and the grades obtained with the working example and the comparative example. A significant main effect was found across the three, that is, actual grade (M=5.28, SD=2.47), working example (M=5.11, SD=2.40), and comparative example (M=3.78, SD=2.58) [F(2,34)=26.404, p<.001], meaning that the three differ. Simple contrasts were then run to see where the difference lies. There was no significant difference between the actual grade and the grade index of the working example [F(1,17)=1.889, p=.187, n.s.], whereas a significant difference was found between the actual grade and the comparative example [F(1,17)=33.585, p<.001].

In other words, the results calculated with the working example do not differ significantly from the actual grade, and any difference is slight, whereas the results calculated with the comparative example do differ from the actual grade. A significant difference was also found between the working example and the comparative example [F(1,17)=24.727, p<.001]. The working example was thus shown to predict the actual grade of a text more appropriately than the comparative example.

The present invention is not limited to these examples, and the specific configuration of each constituent element may be designed as appropriate.

FIG. 1 is a graph showing the average number of characters and the average number of words per sentence for each grade.
FIG. 2 is a graph showing the proportions of word types for each grade.
FIG. 3 is a graph showing grammatical complexity for each grade.
FIG. 4 is an explanatory diagram showing the results of a multiple regression analysis.
FIG. 5 is a graph showing the average number of bunsetsu per sentence for each grade.
FIG. 6 is a graph showing the average number of predicates per sentence for each grade.
FIG. 7 is a graph showing the proportions of scripts (character types) for each grade.
FIG. 8 is a graph showing the number of characters and the number of bunsetsu per sentence for each grade.
FIG. 9 is a graph showing the proportions of word types for each grade.
FIG. 10 is an explanatory diagram showing the results of a multiple regression analysis.

Claims

A text readability evaluation system comprising: text storage means for storing a target text; average character count calculating means for calculating the average number of characters per sentence X₁₁ of the text stored in the text storage means; hiragana ratio calculating means for calculating the ratio X₁₂ of hiragana in the whole text; average predicate count calculating means for calculating the average number of predicates per sentence X₁₃ of the text; average bunsetsu count calculating means for calculating the average number of bunsetsu per sentence X₁₄ of the text; and evaluation value deriving means for taking, as explanatory variables, the average number of characters per sentence X₁₁ calculated by the average character count calculating means, the ratio X₁₂ of hiragana in the text calculated by the hiragana ratio calculating means, the average number of predicates per sentence X₁₃ calculated by the average predicate count calculating means, and the average number of bunsetsu per sentence X₁₄ calculated by the average bunsetsu count calculating means, substituting them into the following equation (4), and deriving as the dependent variable an evaluation value Y that rates the readability of the text,

$$Y = a_{11}X_{11} + b_{11}X_{12} + c_{11}X_{13} + d_{11}X_{14} + X_{00} \tag{4}$$

where, in equation (4), a₁₁, b₁₁, c₁₁ and d₁₁ are coefficients and X₀₀ is a constant; wherein, in the evaluation value deriving means, the coefficients a₁₁, b₁₁, c₁₁, d₁₁ and the constant X₀₀ of equation (4) are set so that it can be determined which of levels 1 to 9, corresponding to the grades from the first grade of elementary school to the third grade of junior high school, the evaluation value Y is closest to; and wherein the coefficients a₁₁, b₁₁, c₁₁, d₁₁ and the constant X₀₀ are obtained by performing a multiple regression analysis in which, for a large number of text samples extracted from Japanese-language textbooks for each of those grades or from books equivalent to those grades, the average number of characters per sentence X₁₁, the ratio X₁₂ of hiragana in the whole text, the average number of predicates per sentence X₁₃ and the average number of bunsetsu per sentence X₁₄ obtained by analyzing each text sample are the explanatory variables, and the evaluation value Y corresponding to the grade assigned in advance to each text sample is the dependent variable.