WO2023228359A1

WO2023228359A1 - Word selection device, method, and program

Info

Publication number: WO2023228359A1
Application number: PCT/JP2022/021577
Authority: WO
Inventors: 早苗藤田; 哲生小林; 正嗣服部
Original assignee: 日本電信電話株式会社
Priority date: 2022-05-26
Filing date: 2022-05-26
Publication date: 2023-11-30

Abstract

A word selection device A1 comprises: a storage unit 11 in which a word familiarity DB storing a plurality of words and a plurality of familiarities corresponding to the plurality of words is stored, the familiarity being an indicator of intimacy with a word; and a word selection unit 12 that uses the word familiarity DB stored in the storage unit 11 to select a plurality of test words from among the plurality of words such that familiarity intervals corresponding to the test words are constant intervals.

Description

Word selection device, method and program

The disclosed technology relates to a technology for selecting words.

The total number of words that a person knows is called that person's vocabulary. The vocabulary number estimation test is a test that accurately estimates the vocabulary number in a short time (see, for example, Non-Patent Document 1). An outline of the estimation procedure is shown below.

(1) Arrange the word list in the word familiarity DB (database) in order of familiarity and select test words at approximately regular intervals (for example, select one word for every 1000 words). Familiarity (word familiarity) is a numerical value of the familiarity of a word. The higher the degree of familiarity with a word, the more familiar the word is.

(2) Present a test word to the user and ask them to answer whether they know the word or not.

(3) Perform logistic regression analysis to best explain such combinations of test words and answers. However, in this logistic regression analysis, the total number of words in the word familiarity DB that have a familiarity greater than or equal to each test word is set as the independent variable x, and the probability that the user answers that they know each word (for example, 0 or 1) as the dependent variable y. As a result of the logistic regression analysis, a logistic model (or a logistic regression equation) is obtained. An example of a logistic model is shown in FIG.

(4) In the obtained logistic model, find the value of x corresponding to y=0.5 and use it as the estimated vocabulary number. Note that the estimated number of vocabulary means the value estimated to be the user's vocabulary.

In this method, by using a word familiarity database, it is possible to accurately estimate the user's vocabulary size simply by testing whether or not the user knows the selected test word.

Note that, as shown in FIG. 11, the number of words corresponding to each intimacy level is not the same. In other words, the number of words varies depending on familiarity.

For this reason, even if test words are selected at approximately constant intervals from the words arranged in order of familiarity, the familiarity values of the selected test words will not be at constant intervals. In other words, more words are selected from a familiarity level where many words are concentrated, and conversely, fewer words are selected from a familiarity level where few words exist.

For this reason, for example, test words are not often selected from words with low familiarity, so the user may know most of the test words presented, and there may not be enough words for y=0. First, even if logistic regression analysis is performed, it is difficult to converge, so there are cases where the estimated number of vocabulary exceeds the maximum number of words in the word familiarity database. For example, in the example of FIG. 12, the value of x corresponding to y=0.5 is outside the range of the diagram.

The disclosed technology aims to make it easier to converge the logistic regression analysis and to robustly estimate the number of vocabulary.

One aspect of the disclosed technology is a word selection device in which the degree of familiarity is an index representing the degree of intimacy with respect to a word, and a word parent that stores a plurality of words and a plurality of degrees of familiarity corresponding to the plurality of words. Using a storage unit in which the density DB is stored and a word familiarity DB stored in the storage unit, multiple test words are generated from a plurality of words so that the intervals between the familiarities corresponding to the test words are constant. and a word selection section for selecting words.

According to the disclosed technology, it is possible to make it easier to converge the logistic regression analysis and to robustly estimate the number of vocabulary.

FIG. 1 is a diagram showing an example of the functional configuration of a model generation device and a word selection device. FIG. 2 is a diagram illustrating an example of the processing procedure of the model generation method and word selection method. FIG. 3 is a diagram showing an example of a logistic regression model. FIG. 4 is a diagram illustrating an example of the functional configuration of the acquisition probability acquisition device. FIG. 5 is a diagram illustrating an example of the processing procedure of the acquisition probability acquisition method. FIG. 6 is a diagram for explaining an example of generation of acquired word information. FIG. 7 is a diagram showing an example of the functional configuration of the recommended learning word extraction device. FIG. 8 is a diagram illustrating an example of the processing procedure of the recommended learning word extraction method. FIG. 9 is a diagram showing an example of recommended learning words. FIG. 10 is a diagram showing an example of a functional configuration of a computer. FIG. 11 is a diagram showing an example of the correspondence between familiarity and number of words. FIG. 12 is a diagram for explaining the background technology.

Hereinafter, embodiments of the disclosed technology will be described with reference to the drawings.

[First embodiment]
First, a first embodiment will be described. The first embodiment is a model generation device and method, and a word generation device and method.

As illustrated in FIG. 1, the model generation device 1 of this embodiment includes a storage section 11, a word selection section 12, a presentation section 13, an answer reception section 14, a model generation section 15, and a vocabulary number estimation section 16. . The model generation device 1 does not need to include the word selection section 12, the presentation section 13, the answer reception section 14, the storage section 11, and the vocabulary number estimation section 16.

Note that, as shown by the broken line in FIG. 1, the word generation device A1 is configured by the storage section 11 and the word selection section 12. Note that the word generation device A1 may include a presentation section 13 and a response reception section 14.

<Storage unit 11>
The storage unit 11 stores an intimacy database (DB) in advance. The word familiarity DB is a database that stores sets of M words (a plurality of words) and a predetermined familiarity (word familiarity) for each word. In other words, a word familiarity DB is stored that stores a plurality of words and a plurality of familiarity levels corresponding to the plurality of words.

The M words in the word familiarity DB are ranked in an order based on familiarity (for example, in order of familiarity). M is an integer of 2 or more representing the number of words included in the word familiarity DB. There is no limit to the value of M, but for example, when measuring the number of vocabulary words in a native language, M is preferably 70,000 or more, and when measuring the number of vocabulary words in a second language (for example, English for Japanese native speakers), M is preferably 10,000 or more. . This is because the vocabulary size of Japanese adults is said to be around 40,000 to 50,000, so around 70,000 words would cover most people's vocabulary, including individual differences. On the other hand, in the case of a second language, it often does not have as large a vocabulary as the mother tongue, and it is thought that most people's vocabulary can be covered with fewer words than M in the case of the mother tongue. However, the number of vocabulary varies greatly depending on how the vocabulary is counted, such as variations in spelling and the handling of derived words. Therefore, depending on how you count vocabulary, you may need M of 100,000 or more for your native language. Furthermore, the upper limit of the estimated number of vocabulary is the number of words included in the standard word familiarity DB. Therefore, when estimating the vocabulary of a person with a large vocabulary who is an outlier, it is desirable to increase the value of M.

Familiarity (word familiarity) is an index that expresses the familiarity with a word. Examples of indicators that express the familiarity of a word are: Familiarity is an indicator that expresses the degree of familiarity of a word (for example, the numerical value of the familiarity of a word introduced in Non-Patent Document 1); These are indicators that show how well you see and hear, how well you know words, how well you can write words, and how well you can speak using words. .

For example, the higher the familiarity of a word, the more intimate the word. In this embodiment, the larger the numerical value representing the degree of intimacy, the higher the degree of intimacy. However, this does not limit the invention.

The storage unit 11 receives read requests from the word selection unit 12 and the model generation unit 15, and outputs the word corresponding to the request and the familiarity of the word.

<Word selection section 12>
Input: Request for question generation from the user or system Output: N test words used for the vocabulary estimation test When the word selection unit 12 receives a question generation request from the user or the system, the word selection unit 12 selects the word parent in the storage unit 11. A plurality of test words w(1), . . . , w(N) used in the vocabulary size estimation test are selected and output from a plurality of ordered words included in the density DB.

For example, the word selection unit 12 uses the word familiarity DB stored in the storage unit 11 to select a plurality of test words w(1), ..., w(N) from a plurality of words corresponding to the test word. The intervals of intimacy are selected to be constant intervals (step S12).

For example, the word selection unit 12 evenly selects N words from all the words included in the word familiarity DB in the storage unit 11 so that the familiarity of the selected words is at approximately constant intervals, and words are output as test words w(1), ..., w(N).

For example, the word selection unit 12 selects words such that the familiarity interval is 0.1. For example, the word selection unit 12 selects a word w(1) with a familiarity of 1, a word w(2) with a familiarity of 1.1, a word w(60) with a familiarity of 6.9, a word w(60) with a familiarity of 7 words w (61), a total of 61 words may be selected.

The familiarity of the test words w(1),...,w(N) does not necessarily have to be at regular intervals, it is sufficient if they are evenly selected, and from past research, it is the boundary between what the user knows and what they do not know. If the degree of familiarity in the surrounding area is predicted, a larger number of words in the vicinity of the degree of familiarity to be investigated may be selected. That is, the familiarity values of the series of test words w(1), . . . , w(N) may vary in density.

There is no limit to the order of the test words w(1),..., w(N) output from the word selection unit 12, but the word selection unit 12 selects the test words w(1),..., w(N) in order of familiarity, for example. Output w(N).

The number N of test words may be specified by the question generation request, or may be predetermined. Although there is no limit to the value of N, it is desirable that, for example, about 50≦N≦100. In order to perform sufficient estimation, it is desirable that N≧25. A larger N allows more accurate estimation, but increases the burden on the user (subject) (step S12).

In order to reduce the burden on users and increase accuracy, tests of 50 words each are conducted multiple times (for example, 3 times), the number of vocabulary is estimated for each test, and the answers from multiple tests are combined. You may re-estimate. In this case, the number of words tested at one time can be reduced, which reduces the burden on the user, and if the results can be viewed for each test, the user's motivation to answer can be maintained. Furthermore, if the final vocabulary size estimation is performed by combining words from multiple times, the estimation accuracy can be improved.

By selecting multiple test words so that the intervals of familiarity corresponding to the test words are constant, variations in familiarity can be suppressed, making it easier for the logistic curve to converge.

<Presentation section 13>
Input: N test words Output: instruction sentence and N test words N test words w(1), ..., w(N) output from the word selection section 12 are input to the presentation section 13. Ru. The presentation unit 13 presents the test words w(1), ..., w(N) to the user 100 (subject) according to a preset display format (step S13).

For example, the presentation unit 13 displays a predetermined instruction sentence prompting the user 100 to input an answer regarding his or her knowledge of test words, and N test words w(1), . . . , in accordance with a preset display format. w(N) is presented to the user 100 in a format for a vocabulary size estimation test.

There are no limitations to this presentation format, and this information may be presented as visual information such as text or images, auditory information such as audio, or tactile information such as Braille. good.

For example, the presentation unit 13 may electronically display the instruction sentence and the test words on the display screen of a terminal device such as a PC (personal computer), tablet, or smartphone. That is, the presentation unit 13 may generate screen information to be presented on a display or the like, and may output the screen information to the display.

Alternatively, the presentation unit 13 may be a printing device, and the instruction sentences and test words may be printed on paper or the like and output. Alternatively, the presentation unit 13 may be a speaker of the terminal device and may output the instruction sentence and the test word aloud. Alternatively, the presentation unit 13 may be a Braille display and present the instruction sentence and the test word in Braille.

User 100's answer regarding knowledge of the test word indicates either "knowing" or "doing not know" the test word (response of knowing or not knowing the test word of each rank) or may represent any of three or more options including "I know" and "I don't know." Examples of options other than ``I know'' and ``I don't know'' include ``I'm not sure (I know)'' and ``I know the word, but I don't know the meaning.'' However, even if the user 100 is asked to answer from three or more options including ``I know'' and ``I don't know,'' the number of vocabulary is lower than when the user 100 is asked to answer either ``I know'' or ``I don't know.'' In some cases, the estimation accuracy may not improve. For example, if the user 100 is asked to select an answer from three options: "I know," "I don't know," and "I'm not confident," it is up to the user 100 to choose "I'm not confident." Depends on personality. In such a case, increasing the number of options does not improve vocabulary size estimation accuracy. Therefore, it is usually preferable to have the user 100 answer the test word with two choices, such as "I know" or "I don't know."

However, rather than "I know" or "I don't know," it is "I can explain the meaning of (the test word)" or "I can't create an example sentence (using the test word)." ” or “I can't explain the meaning.” By clarifying your viewpoint, the estimated number of vocabulary will change. For example, if the user is able to create example sentences, the number of vocabulary they think they can use is estimated.

In the following, an example will be described in which the user 100 is asked to answer the test word by selecting either "I know" or "I don't know."

Furthermore, for example, the test words are presented in descending order of familiarity, but the presentation order is not limited to this, and the test words may be presented in a random order.

<Reply reception section 14>
Input: Answer regarding the user's knowledge of the test word Output: Answer regarding the user's knowledge of the test word The user 100, who has been presented with the instruction sentence and the test word, sends the answer regarding the user's 100 knowledge of the test word to the response reception section. 14 (step S14).

For example, the answer reception unit 14 is a touch panel of a terminal device such as a PC, a tablet, or a smartphone, and the user 100 inputs the answer to the touch panel. The answer receiving unit 14 may be a microphone of a terminal device, and in this case, the user 100 inputs the answer by voice into the microphone.

The user 100 may input an answer into the answer reception unit 14 by clicking with a mouse or the like.

The answer reception unit 14 receives an input answer regarding knowledge of the test word (for example, an answer that the test word is known or an answer that the test word is not known), and outputs the answer as electronic data. do. The answer receiving unit 14 may output an answer for each test word, may output answers for one test at once, or may output answers for multiple tests at once.

For example, when the answer reception unit 14 receives an answer that the user 100 knows the test word, it assigns a value of 1 to the answer regarding the knowledge of the test word. On the other hand, when the answer reception unit 14 receives an answer that the user 100 does not know the test word, it assigns a numerical value of 0 to the answer regarding the knowledge of the test word. These numerical values are output to the model generation section 15.

<Model generation unit 15>
Input: Answer regarding the user's knowledge of the test word Output: Model The answer regarding the user's 100 knowledge of the test word outputted from the answer reception unit 14 is input to the model generation unit 15.

The model generation unit 15 uses the answer regarding the knowledge of the test word and the word familiarity DB stored in the storage unit 11 to generate a value based on the familiarity corresponding to the test word and the user 100's knowledge of the test word. A model representing the relationship between a value based on the probability of answering that they know is obtained (step S15). The obtained model is output to the vocabulary number estimation section 16.

The value based on the familiarity corresponding to the test word may be the familiarity itself corresponding to the test word, or may be a non-monotonically decreasing function value (for example, a monotonically increasing function value) of the familiarity corresponding to the test word. You can. To simplify the explanation, a case will be exemplified below in which the value based on the degree of familiarity corresponding to the test word is the degree of familiarity corresponding to the test word itself.

The value based on the probability that the user 100 answers that he or she knows the test word may be the probability that the user 100 answers that he or she knows the test word, or may be the probability that the user 100 answers that he or she knows the test word. It may be a non-monotonically decreasing function value (for example, a monotonically increasing function value) of the probability of answering that the answer is yes. To simplify the explanation, in the following, a case is exemplified in which the value based on the probability that the user 100 answers that he or she knows the test word is the probability that the user 100 answers that he or she knows the test word. .

Although there is no limitation to the model, an example of the model is a logistic regression model (logistic model). To simplify the explanation, below, we use a logistic curve y=f(x ,Ψ) is a model. Ψ is a model parameter.

The model generation unit 15 refers to the word familiarity DB stored in the storage unit 11 to obtain the familiarity corresponding to the test word w(n) that the user 100 answered that he/she knows. Let x(n) be the intimacy level. This familiarity x(n) is the familiarity corresponding to the test word w(n).

The model generation unit 15 calculates that for the test word w(n) that the user 100 answered that he or she knows, the probability y that the user 100 answers that he or she knows the test word w(n) is 1 (that is, 100%). ), and a point (x, y)=(x(n), 1) is set where the familiarity corresponding to the test word w(n) is x(n).

Further, the model generation unit 15 determines that the user 100 knows the test word w(n) for which the user 100 answered that he or she does not know (or does not answer that he or she knows) the test word w(n). The point (x, y) = (x(n), 0).

The model generation unit 15 fits each point (x, y) = (x(n), 1) or (x(n), 0) of n = 1, ..., N to a logistic curve, A logistic curve y=f(x,Ψ) that minimizes the error is obtained as a model. That is, the model generation unit 15 generates a logistic model that minimizes the error for each point (x, y) = (x(n), 1) or (x(n), 0) of n = 1, ..., N. A curve y=f(x,Ψ) is obtained as a model.

FIG. 3 illustrates a model of the logistic curve y=f(x, Ψ). In FIG. 3, the horizontal axis represents the degree of familiarity, and the vertical axis represents the probability (y) of a person answering that they know the word. The circles indicate the points (x, y) = (x(n), 1) for the test words w(n) that the user 100 answered that they knew, and the points that the user 100 answered that they did not know (or did not know). The point (x, y) = (x(n), 0) for the test word w(n) is expressed as (x, y) = (x(n), 0). "AIC" in FIG. 3 represents the Akaike information criterion, and the smaller the value, the better the fit of the model. "n" in FIG. 3 represents the number of test words.

Here, generation can also be translated as creation or construction. Therefore, the model generation section 15 may be the model generation section 15 or the model construction section 15. Models may also be created or constructed.

<Number of vocabulary estimation unit 16>
Input: Model Output: Number of vocabulary of user 100 The vocabulary number estimating unit 16 estimates the number of vocabulary of user 100 based on the model (step S16).

Hereinafter, estimation methods 1 to 3 will be explained as examples of methods for estimating the number of vocabulary of the user 100 by the vocabulary number estimation unit 16.

(Estimation method 1)
In the model, the vocabulary size estimating unit 16 obtains a predetermined value acquisition familiarity, which is the familiarity when the value based on the probability that the user 100 answers that he or she knows the word is at a predetermined value or in the vicinity of the predetermined value. Examples of the predetermined value are 0.5 or 0.8. Of course, the predetermined value may be any other value greater than 0 and less than 1.

Then, the vocabulary number estimating unit 16 refers to the word familiarity DB stored in the storage unit 11, obtains the number of words with a familiarity equal to or higher than a predetermined value of acquired familiarity, and uses the obtained number as a user's The number of vocabulary is 100.

(Estimation method 2)
The vocabulary number estimating unit 16 refers to the model and the word familiarity DB stored in the storage unit 11 to determine the familiarity x(m) corresponding to the word w(m) included in the word familiarity DB. ) is input to the model, the output value y(m) is obtained. In other words, the vocabulary size estimation unit 16 calculates the value of y corresponding to the familiarity x(m) corresponding to the word w(m) in the model, and sets the calculated value as the output value y(m). . The vocabulary number estimating unit 16 performs this process on each word w(m) (m=1,...,M) included in the word familiarity DB, thereby obtaining an output value y(m)(m=1,...,M). …, M) is obtained.

Then, the vocabulary number estimating unit 16 calculates Σ _m=1 ^M y (m), and sets this calculated value as the user's 100 vocabulary number.

At that time, if the word w(m) is a test word and an answer regarding the knowledge of the test word w(m) has been obtained, the vocabulary number estimation unit 16 calculates the knowledge of the test word w(m). The number of vocabulary of the user 100 may be estimated by taking into account the answers regarding the question.

For example, the vocabulary estimation unit 16 sets y(m)=1 when the answer regarding the knowledge of the test word w(m) is that the person knows the test word w(m). If the answer is no, y(m)=0. For y(m) of words other than the test word, the output value y(m) obtained from the model as described above is used.

Then, the vocabulary number estimating unit 16 uses these y(m) to calculate Σ _m=1 ^My (m), and sets this calculated value as the user's 100 vocabulary number.

By considering answers regarding knowledge of test words, a more appropriate vocabulary size can be estimated.

When the number of vocabulary is directly set to x by estimating the number of vocabulary based on the logistic model estimated from y, which is the probability that the user 100 answers that he/she knows the test word, and the familiarity x of the test word. The model converges more easily and the number of vocabulary can be estimated more robustly. Moreover, even if the distribution of the number of words corresponding to each degree of familiarity differs greatly, a sudden change in the estimated number of vocabulary can be suppressed.

(Estimation method 3)
The vocabulary number estimation unit 16 refers to the model and the word familiarity DB stored in the storage unit 11, and calculates the result when the familiarity x(i) included in the word familiarity DB is input to the model. Obtain the output value y(i). In other words, the vocabulary number estimating unit 16 calculates the value of y corresponding to the familiarity x(i) in the model, and sets the calculated value as the output value y(i). Further, the vocabulary number estimating unit 16 refers to the word familiarity DB stored in the storage unit 11, and the number n( of words corresponding to the familiarity x(i) included in the word familiarity DB) i) obtain. The vocabulary number estimation unit 16 performs these processes for each familiarity level x(i) (i=1,...,I) included in the word familiarity level DB, thereby obtaining an output value y(i) (i= 1,...,I), and the number of words n(i) (i=1,...,I). I is the number of types of intimacy.

Then, the vocabulary number estimating unit 16 calculates Σ _i=1 ^I y(i)×n(i), and sets this calculated value as the user's 100 vocabulary number.

If the intimacy levels are the same, the corresponding y values are the same. Also, there may be words with the same familiarity. Therefore, by performing calculations for each degree of familiarity as in estimation method 3 instead of estimation method 2, it is possible to quickly calculate the vocabulary size estimation.

<Modified example of first embodiment>
The word selection unit 12 does not simply select a plurality of test words w(1), ..., w(N) from a plurality of words, rather than setting the familiarity intervals corresponding to the test words at regular intervals. good.

Further, the model generation unit 15 assumes an answer regarding knowledge of the non-presentation words, and generates values based on the familiarity corresponding to the test words and the non-presentation words, and the user 100's knowledge of the test words and the non-presentation words. A model representing the relationship between the probability of answering or a value based on an assumption may be obtained.

Here, the non-presented word is a word other than the plurality of test words among the plurality of words. In order to make it easier for the logistic model to converge, answers for non-presented words that were not used as test words are assumed and used to create the model. Words near the upper limit of familiarity are words that many people know, and words near the lower limit are words that many people do not know. Therefore, if the user 100 answers that he/she knows the word with the highest degree of familiarity among the test words, it is assumed that the user 100 also knows the non-presented words with a degree of familiarity higher than that degree of familiarity. Conversely, if the user answers that he or she does not know the word with the lowest familiarity among the test words, it is assumed that the user does not know the non-presented word with a familiarity lower than that familiarity.

In other words, assuming that a non-presentation word is presented to the user 100, the answer regarding the knowledge of the non-presentation word of the user 100 will be It is assumed that the answer is "I know", and the answer is "I don't know" for words whose familiarity is lower than the minimum value of the test word's familiarity.

For example, if the user answers that they know a test word with a familiarity level of 6.5, it is assumed that the user answers that they know a non-presentation word with a familiarity level of 6.7 or 6.9. Furthermore, if the user answers that he or she does not know the test word with a familiarity level of 2, it is assumed that the user answers that he/she does not know the non-presentation words with a familiarity level of 1.8 or 1.6.

In this way, by adding non-presented words, which are words that were not presented to the user 100, and answers regarding knowledge of non-presented words, and estimating the model, the model converges more easily and a more appropriate model can be created. can be generated. This makes it easier for the model to converge, for example, even if user 100 answers that he or she knows most of the test words, or even if user 100 answers that he or she does not know most of the test words. , a more appropriate model can be generated.

[Second embodiment]
A second embodiment will be described. The second embodiment is an acquisition probability acquisition device and method.

Below, differences between the first embodiment and the modification of the first embodiment will be mainly described. Explanation of matters that have already been explained may be omitted.

As illustrated in FIG. 4, the acquisition probability acquisition device 2 of this embodiment includes a storage section 11, a model storage section 21, a word extraction section 22, a familiarity acquisition section 23, an acquisition probability acquisition section 24, and an acquisition word information generation section. It is equipped with 25. The acquisition probability acquisition device 2 does not need to include the word extraction section 22 and the acquisition word information generation section 25.

<Storage unit 11>
The storage unit 11 is the same as the storage unit 11 of the first embodiment.

The storage unit 11 stores a word familiarity DB that stores a plurality of words and a plurality of familiarity levels corresponding to the plurality of words. Here, the degree of familiarity is an index representing the degree of familiarity with a word.

<Model storage unit 21>
The model storage unit 21 stores a model representing the relationship between a value based on the degree of familiarity corresponding to each word and a value based on the probability that a certain person has acquired each word. Here, "a certain person" is a person who obtains the acquisition probability. “Someone” may be the user 100.

Here, acquiring words means, in other words, knowing the words, being able to use the words, knowing the words, or being able to explain the words.

An example of this model is a model generated by the model generation device 1 of the first embodiment and a modification of the first embodiment.

As shown by the broken line in FIG. 4, the acquisition probability acquisition device 2 may further include a model generation device 1 for generating a model stored in the model storage unit 21.

That is, the acquisition probability acquisition device 2 includes (1) a word selection unit 12 that selects a plurality of test words from a plurality of words, (2) a presentation unit 13 that presents test words to a user, and (3) a user (4) Answer reception unit 14 that accepts answers regarding knowledge of test words; (4) Answers regarding knowledge of test words and word familiarity DB stored in storage unit 11; A model expressing the relationship between a value based on , and a value based on the probability that the user answers that he or she knows the test word is obtained, and the obtained model is used as the model stored in the model storage unit. The device may further include a portion 15.

<Word extraction unit 22>
Input: Text Output: Word The word extraction unit 22 extracts each word included in the input text (step S22).

Each extracted word is output to the familiarity acquisition unit 23.

The text input to the word extraction unit 22 may be any text that can be read by the word extraction unit 22, which is an information processing device. Examples of texts are books such as textbooks and novels, newspapers and magazines, and texts published on web pages.

The word extraction unit 22 extracts each word contained in the input text, for example, by performing morphological analysis on the input text.

<Familiarity acquisition unit 23>
Input: Word Output: Word, Familiarity Each word extracted by the word extraction unit 22 is input to the intimacy acquisition unit 23 . The intimacy level acquisition unit 23 acquires the familiarity level corresponding to each word from the word familiarity level DB stored in the storage unit 11 (step S23).

If the acquisition probability acquisition device 2 does not include the word extraction unit 22, each word included in the text is input. In this case, the familiarity acquisition unit 23 acquires the familiarity corresponding to each word included in the text from the word familiarity DB stored in the storage unit 11 (step S23).

Each word and the familiarity corresponding to each word are output to the acquisition probability acquisition unit 24.

Note that the familiarity acquisition unit 23 and the word extraction unit 22 do not need to acquire the familiarity for words that are function words such as proper nouns, numerals, and particles. In other words, the word extraction unit 22 may acquire the familiarity of only words that are content words.

Function words such as number words and particles are words that many people know. Therefore, by acquiring the familiarity of these function words, in other words, by using these function words as processing targets, the acquired word information generation unit 25 calculates the familiarity of the text. The percentage of estimated acquired words can be increased. On the other hand, by not acquiring familiarity for these function words, in other words, by not processing these function words, the acquired word information generation unit 25 calculates The estimated percentage of acquired words can be lowered.

Additionally, the familiarity acquisition unit 23 may ignore words that are not included in the word familiarity DB without acquiring the familiarity. Thereby, even if the morphological analysis is incorrect, the acquisition probability acquisition process can be performed appropriately.

<Acquisition probability acquisition unit 24>
Input: word, familiarity output: word, acquisition probability The acquisition probability acquisition unit 24 uses at least the familiarity level corresponding to each word and the model stored in the model storage unit 21 to determine whether each word is The acquisition probability, which is the probability that has been acquired, is acquired (step S24).

The acquisition probability acquisition unit 24 obtains an output value when the familiarity corresponding to each word is input into the model, and uses the obtained output value as the acquisition probability corresponding to each word. In other words, the acquisition probability acquisition unit 24 calculates the value of y corresponding to the familiarity x corresponding to each word in the model, and uses the calculated value as the acquisition probability corresponding to each word.

The model stored in the model storage unit 21 uses a logistic curve y=f(x ,Ψ) is a model, the acquisition probability acquisition unit 24 calculates the value of y=f(x,Ψ) corresponding to the familiarity x corresponding to each word, and uses the calculated value as Let it be the acquisition probability corresponding to each word.

The acquisition probability acquisition unit 24 may acquire the acquisition probability by considering the part of speech, word length, etc. For example, the acquisition probability acquisition unit 24 may acquire the acquisition probability using part of speech, word length, etc. as explanatory variables.

Each word and the acquisition probability corresponding to each word are output to the acquisition word information generation unit 25.

<Acquired word information generation unit 25>
Input: word, acquisition probability Output: acquired word information The acquired word information generation unit 25 generates acquired word information, which is information regarding acquisition of words included in the text, using the acquisition probability corresponding to each word (step S25 ).

An example of the acquired word information is at least one of the following: estimated acquired words in the text, the number of estimated acquired words in the text, and the ratio of estimated acquired words in the text.

Hereinafter, examples of how each of the estimated acquired words in a text, the number of estimated acquired words in a text, and the proportion of estimated acquired words in a text are determined will be explained.

(estimated acquired words in text)
First, the acquired word information generation unit 25 estimates the number of vocabulary words of a certain person. The number of vocabulary can be estimated by the method described in the vocabulary number estimation unit 16 of the first embodiment. In order to estimate the number of vocabulary, the word familiarity DB from the storage unit 11 and the model from the model storage unit 21 may be input to the acquired word information generation unit 25, as shown by the dashed line in FIG.

Next, the acquired word information generation unit 25 obtains the number GOISU(k) of words with a familiarity level greater than or equal to the familiarity level corresponding to each input word w(k). In order to obtain GOISU(k), the word familiarity DB may be input from the storage unit 11 to the acquired word information generation unit 25, as shown by the dashed line in FIG.

Then, the acquired word information generation unit 25 sets words whose GOISU(k) is less than or equal to the number of vocabulary of a certain person as estimated acquired words in the text. Generally, the higher the familiarity of a word, the smaller the GOISU(k). Therefore, it can be assumed that a person knows words in GOISU(k) that are less than the number of words in that person's vocabulary.

FIG. 6 shows an example of GOISU(k).

(Estimated number of acquired words in text)
First, the acquired word information generation unit 25 estimates the number of vocabulary words of a certain person. The number of vocabulary can be estimated by the method described in the vocabulary number estimation unit 16 of the first embodiment. In order to estimate the number of vocabulary, the word familiarity DB from the storage unit 11 and the model from the model storage unit 21 may be input to the acquired word information generation unit 25, as shown by the dashed line in FIG.

Then, the acquired word information generation unit 25 sets the number of words whose GOISU(k) is less than or equal to the number of vocabulary of a certain person as the estimated number of acquired words in the text.

(Percentage of estimated acquired words in text)
The acquired word information generation unit 25 calculates a value determined by, for example, the following formula (1) or formula (2), and uses the calculated value as the ratio of estimated acquired words in the text.

(Σ _k=1 ^K y(k) FREQ(k))/Σ _k=1 ^K FREQ(k)…(1)
(Σ _k=1 ^K y(k)DIFF(k))/Σ _k=1 ^K DIFF(k)…(2)
Here, FREQ(k) is the number of times the word w(k) appears in the text. Assuming that the text is divided into multiple parts, DIFF(k) is the number of parts in which the word w(k) appears. An example of a part is a predetermined unit that constitutes a text, such as a unit, chapter, or section. The entire text may be used as a unit. K is the total number of words included in the text and for which the acquisition probability has been acquired by the acquisition probability acquisition unit 24.

The acquired word information generation unit 25 counts FREQ(k) and DIFF(k) based on the input word. The acquired word information generation unit 25 calculates a value determined by equation (1) or equation (2) using FREQ(k) and DIFF(k) found by counting.

FIG. 6 shows an example of FREQ(k) and DIFF(k).

In general, the more people know a word, the more frequently it appears, and the more people don't know a word, the less often it appears. Therefore, the number of occurrences of rare words in text will be less than the number of occurrences of well-known words in text.

Therefore, the proportion of estimated acquired words in the text determined by equation (1) using FREQ(k) is higher than the proportion of estimated acquired words in the text determined by equation (2) using DIFF(k). It is considered to be. Which of equations (1) and (2) to use is determined as appropriate depending on what kind of information is needed as acquired word information.

Note that the acquired word information generation unit 25 may use the number of estimated acquired words in the text/K as the ratio of the estimated acquired words in the text. The number of estimated acquired words in a text can be determined by the method described in (Estimated number of acquired words in text).

[Third embodiment]
A third embodiment will be described. The third embodiment is a recommended learning word extraction device and method.

As illustrated in FIG. 7, the recommended learning word extraction device 3 of this embodiment includes a storage section 11, a model storage section 31, an acquisition probability acquisition section 32, and a recommended learning word extraction section 33.

<Model storage unit 31>
The model storage unit 31 stores a model representing the relationship between a value based on the degree of familiarity corresponding to each word and a value based on the probability that a certain person has acquired each word. Here, "a certain person" is a person from whom recommended learning words are extracted. “Someone” may be the user 100.

As shown by the broken line in FIG. 6, the recommended learning word extraction device 3 may further include a model generation device 1 for generating a model stored in the model storage unit 31.

That is, the learning recommended word extraction device 3 includes (1) a word selection section 12 that selects a plurality of test words from a plurality of words, (2) a presentation section 13 that presents test words to a user, and (3) a usage method. and (4) the answer reception unit 14 that receives answers regarding the knowledge of the test words by the parent who corresponds to the test words, using the answers regarding the knowledge of the test words and the word familiarity DB stored in the storage unit 11. A model that obtains a model expressing the relationship between a value based on the density and a value based on the probability that the user answers that he or she knows the test word, and uses the obtained model as the model stored in the model storage unit. The generating unit 15 may further be provided.

<Acquisition probability acquisition unit 32>
Input: word Output: word, acquisition probability The acquisition probability acquisition unit 32 receives a word set consisting of a plurality of words that are candidates for recommended learning words.

The acquisition probability acquisition unit 32 uses at least the word familiarity DB stored in the storage unit 11 and the model stored in the model storage unit 31 to identify each word included in the input word set to a certain person. The acquisition probability, which is the probability that is acquired, is obtained (step S32).

The acquisition probability acquisition unit 32 obtains an output value when the familiarity corresponding to each word is input into the model, and uses the obtained output value as the acquisition probability corresponding to each word. In other words, the acquisition probability acquisition unit 32 calculates the value of y corresponding to the familiarity x corresponding to each word in the model, and uses the calculated value as the acquisition probability corresponding to each word.

The model stored in the model storage unit 31 uses a logistic curve y=f(x , Ψ) is a model, the acquisition probability acquisition unit 32 calculates the value of y=f(x, Ψ) corresponding to the familiarity x corresponding to each word, and uses the calculated value as Let it be the acquisition probability corresponding to each word.

The acquisition probability acquisition unit 32 may acquire the acquisition probability by considering the part of speech, the length of the word, etc. For example, the acquisition probability acquisition unit 32 may acquire the acquisition probability using part of speech, word length, etc. as explanatory variables.

<Learning recommended word extraction unit 33>
Input: word, acquisition probability Output: recommended learning word The recommended learning word extraction unit 33 extracts recommended learning words from the word set based on the acquired acquisition probability (step S33).

For example, the recommended learning word extraction unit 33 may extract words whose acquisition probability is close to a predetermined probability as recommended learning words.

The predetermined probability is a number greater than 0 and less than 1. An example of a predetermined probability is 0.5.

The recommended learning word extraction unit 33 may extract a predetermined number of words with a predetermined probability as recommended learning words.

If the predetermined probability is 0.5 and the predetermined number is 7, for example, the 7 words shown in FIG. 9 are extracted as recommended learning words. In FIG. 9, ENTRY is the notation of the word, PSY is the familiarity, Prob is the acquisition probability, and YN is the answer that the user 100 knows or does not know about these words. If there is, it is information about the answer, and Distance50 is the size of the difference between 0.5, which is the predetermined probability in this case, and Prob.

In this example, "-" is displayed in YN because the user 100 has not answered that he or she knows or does not know. If user 100 answers that they know the word, "1" is displayed in YN, and if user 100 answers that they do not know the word, "0" is displayed in YN. ” is displayed.

The recommended learning words are presented to the person from whom the recommended learning words are to be extracted. The recommended learning words may be presented to a person from whom the recommended learning words are to be extracted, in the form of a table shown in FIG.

The recommended learning word extraction unit 33 may extract words included within a predetermined range that includes a predetermined probability as recommended learning words.

The recommended learning word extraction unit 33 may extract words with a predetermined part of speech and whose obtained acquisition probability is close to a predetermined probability as recommended learning words. Examples of predetermined parts of speech are verbs, nouns, and adjectives. The predetermined part of speech may be two or more types of parts of speech. In this case, the recommended learning word extracting unit 33 may extract, as recommended learning words, words whose acquisition probability is close to a predetermined probability from among words of two or more types of parts of speech.

The part of speech information may be stored in the word familiarity DB. In this case, the learning recommended word extraction unit 33 can refer to the word familiarity DB, obtain the part of speech of the word, and perform the above processing.

The learning recommended word extraction unit 33 may refer to a dictionary in which words and their parts of speech are stored in a storage unit (not shown), obtain the part of speech of the word, and perform the above processing.

<Modified example of third embodiment>
The word set that is input to the acquisition probability acquisition unit 32 and is made up of a plurality of words that are candidates for learning recommended words may be words that are included in a predetermined text. For this purpose, the recommended learning word extraction device 3 may include a word extraction section 34 described below.

<Word extraction unit 34>
Input: Text Output: Word The word extraction unit 34 extracts each word included in the input text (step S34).

Each extracted word is output to the acquisition probability acquisition unit 32 as a word set that is a candidate for a recommended learning word.

The text input to the word extraction unit 34 may be any text that can be read by the word extraction unit 22, which is an information processing device. Examples of texts are books such as textbooks and novels, newspapers and magazines, and texts published on web pages.

The word extraction unit 34 extracts each word included in the input text, for example, by performing morphological analysis on the input text.

[Modified example]
Note that the present disclosure is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the present disclosure.

The various processes described in the embodiments are not only executed in chronological order according to the order described, but may also be executed in parallel or individually depending on the processing capacity of the device that executes the processes or as necessary.

For example, data may be exchanged directly between the components of the model generation device 1, acquisition probability acquisition device 2, and recommended learning word extraction device 3, or may be performed via a storage unit (not shown). .

[Program, recording medium]
The processing of each part of each device described above may be realized by a computer, and in this case, the processing contents of the functions that each device should have are described by a program. Then, by loading this program into the storage unit 1020 of the computer 1000 shown in FIG. is realized on a computer.

A program that describes this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a non-transitory recording medium, specifically a magnetic recording device, an optical disk, or the like.

Further, distribution of this program is performed, for example, by selling, transferring, lending, etc. portable recording media such as DVDs and CD-ROMs on which the program is recorded. Furthermore, this program may be distributed by storing the program in the storage device of the server computer and transferring the program from the server computer to another computer via a network.

A computer that executes such a program, for example, first stores a program recorded on a portable recording medium or a program transferred from a server computer into the auxiliary storage unit 1050, which is its own non-temporary storage device. Store. When executing a process, this computer loads a program stored in the auxiliary storage unit 1050, which is its own non-temporary storage device, into the storage unit 1020, and executes the process according to the read program. Further, as another form of execution of this program, the computer may directly load the program from a portable recording medium into the storage unit 1020 and execute processing according to the program. Each time the received program is transferred, processing may be executed in accordance with the received program. In addition, the above-mentioned processing is executed by a so-called ASP (Application Service Provider) type service, which does not transfer programs from the server computer to this computer, but only realizes processing functions by issuing execution instructions and obtaining results. You can also use it as Note that the program in this embodiment includes information that is used for processing by an electronic computer and that is similar to a program (data that is not a direct command to the computer but has a property that defines the processing of the computer, etc.).

Furthermore, in this embodiment, the present apparatus is configured by executing a predetermined program on a computer, but at least a part of these processing contents may be implemented in hardware.

For example, the word selection unit 12, the presentation unit 13, the answer reception unit 14, the model generation unit 15, the number of vocabulary estimation unit 16, the word extraction unit 22, the familiarity acquisition unit 23, the acquisition probability acquisition unit 24, the acquired word information generation unit 25 , the acquisition probability acquisition section 32, the recommended learning word extraction section 33, and the word extraction section 34 may be constituted by a processing circuit.

Furthermore, the storage unit 11, model storage unit 21, and model storage unit 31 may be configured by memory.

Regarding the above embodiments, the following additional notes are further disclosed.

(Additional note 1)
memory and
at least one processor connected to the memory;
including;
The memory stores a word familiarity DB in which familiarity is an index representing intimacy with a word, and stores a plurality of words and a plurality of familiarity levels corresponding to the plurality of words, respectively;
The processor includes:
selecting a plurality of test words from the plurality of words using the word familiarity DB stored in the memory such that the intervals of the familiarity corresponding to the test words are constant intervals;
Word selection device.

(Additional note 2)
A non-transitory storage medium storing a program executable by a computer to perform a word selection process,
The word selection process includes:
Familiarity is an index representing the familiarity with a word, and using a word familiarity DB that stores a plurality of words and a plurality of familiarity levels corresponding to the plurality of words, multiple tests are performed from the plurality of words. Select words such that the familiarity intervals corresponding to the test words are at regular intervals.
Non-transitory storage medium.

(Additional note 3)
memory and
at least one processor connected to the memory;
including;
The memory stores a word familiarity DB in which familiarity is an index representing intimacy with a word, and stores a plurality of words and a plurality of familiarity levels corresponding to the plurality of words, respectively;
The processor includes:
A plurality of test words and an answer regarding the knowledge of the test word of the user to whom the plurality of test words are presented are input, and the answer regarding the knowledge of the test word and the word familiarity DB stored in the memory are input. to obtain a model representing the relationship between a value based on the degree of familiarity corresponding to the test word and a value based on the probability that the user answers that he/she knows the test word.
Model generator.

(Additional note 4)
A non-transitory storage medium storing a program executable by a computer to perform a model generation process,
The model generation process is
A plurality of test words and an answer regarding the knowledge of the test word of the user to whom the plurality of test words were presented are input, and the answer regarding the knowledge of the test word and the word familiarity DB are used to perform the test. Obtaining a model representing the relationship between a value based on familiarity corresponding to a word and a value based on a probability that the user answers that he/she knows the test word,
The degree of familiarity is an index representing the degree of familiarity with a word, and the word familiarity DB stores a plurality of words and a plurality of degrees of familiarity corresponding to the plurality of words, respectively.
Non-transitory storage medium.

(Additional note 5)
memory and
at least one processor connected to the memory;
including;
The memory includes:
The degree of familiarity is an index representing the degree of intimacy with a word, and a word familiarity degree DB stores a plurality of words and a plurality of degrees of familiarity corresponding to the plurality of words, respectively;
A model representing the relationship between a value based on familiarity corresponding to each word and a value based on the probability that a certain person has acquired each word;
is memorized,
The processor includes:
Obtaining the familiarity corresponding to each word included in the input text from the word familiarity DB stored in the memory,
acquiring an acquisition probability that is a probability that the certain person has acquired each of the words, using at least the familiarity corresponding to each of the acquired words and the model stored in the memory;
Acquisition probability acquisition device.

(Additional note 6)
A non-temporary storage medium storing a program executable by a computer to execute an acquisition probability acquisition process,
The acquisition probability acquisition process includes:
Familiarity is an index representing the familiarity with a word, and from a word familiarity DB that stores a plurality of words and a plurality of familiarity levels corresponding to the plurality of words, each word included in the input text is Get the corresponding intimacy,
Using at least a model representing the relationship between a value based on the familiarity corresponding to each word and a value based on the probability that a certain person has acquired each word, and the familiarity corresponding to each of the acquired words. and obtain an acquisition probability that is the probability that the certain person has acquired each word;
Non-transitory storage medium.

(Supplementary Note 7)
memory and
at least one processor connected to the memory;
including;
The memory includes:
The degree of familiarity is an index representing the degree of intimacy with a word, and a word familiarity degree DB stores a plurality of words and a plurality of degrees of familiarity corresponding to the plurality of words, respectively;
A model representing the relationship between a value based on familiarity corresponding to each word and a value based on the probability that a certain person has acquired each word;
is memorized,
The processor includes:
acquisition, which is the probability that the certain person has acquired each word included in the input word set, using at least the word familiarity DB stored in the memory and the model stored in the memory; get the probability,
extracting recommended learning words from the word set based on the acquired acquisition probability;
Recommended learning word extraction device.

(Supplementary Note 8)
A non-temporary storage medium storing a program executable by a computer to execute a recommended learning word extraction process,
The learning recommended word extraction process is
Familiarity is an index representing the familiarity with a word, and there is a word familiarity DB that stores a plurality of words and a plurality of familiarity levels corresponding to the plurality of words, and a value based on the familiarity corresponding to each word. acquisition probability, which is the probability that a certain person has acquired each word included in the input word set, using at least a model representing a relationship between a value based on the probability that a certain person has acquired each word; get
extracting recommended learning words from the word set based on the acquired acquisition probability;
Non-transitory storage medium.

All documents, patent applications, and technical standards mentioned herein are incorporated by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually indicated to be incorporated by reference. Incorporated herein by reference.

Claims

The degree of familiarity is an index representing the degree of familiarity with a word, and a storage unit in which a word familiarity degree DB storing a plurality of words and a plurality of degrees of familiarity corresponding to the plurality of words, respectively, is stored;
a word selection unit that selects a plurality of test words from the plurality of words using a word familiarity DB stored in the storage unit such that familiarity intervals corresponding to the test words are constant intervals;
A word selection device containing.
The word selection device according to claim 1,
a presentation unit that presents the test words to the user;
an answer reception unit that receives answers regarding the user's knowledge of the test words;
A word selection device further comprising:
The word selection unit uses a word familiarity DB that stores a plurality of words and a plurality of familiarity degrees corresponding to the plurality of words, where the degree of familiarity is an index representing the degree of intimacy with a word. a word selection step of selecting a plurality of test words from the words such that familiarity intervals corresponding to the test words are constant;
Word selection methods including.
A program for causing a computer to function as each part of the word selection device according to claim 1.