WO2008017188A1

WO2008017188A1 - System and method for making teaching material of language class

Info

Publication number: WO2008017188A1
Application number: PCT/CN2006/001722
Authority: WO
Inventors: Luntang Liu; Song Liu; Xiaokui Wang
Original assignee: Luntang Liu; Song Liu; Xiaokui Wang
Priority date: 2006-07-17
Filing date: 2006-07-17
Publication date: 2008-02-14

Abstract

A system and method for making teaching material of language class. The system includes an input device (1), an output device (7), a word cutter (2), a judge device (3), a prompt device (4), a select device (6), and a storage device (5). The user inputs the content of one class from a primary teaching material into the system through the input device (1). The word cutter (2) analyzes the input content and obtains the word information after cutting. The judge device (3) compares each cut word with the words stored in the storage device (5) and judges whether it is new. The prompt device (4) provides the new word information to the user. The user selects a word as a new word through the select device (6). The selected word is stored in the storage device (5) and output to the user as a result.

Description

System and method for producing language teaching materials

The invention relates to a system and a method for producing language teaching materials, and in particular to a system and method for automatically recognizing and marking new words in a text, thereby reducing the labor required for producing language teaching materials. It belongs to the technical field of computer-aided teaching.

Background technique

In the production of language textbooks, a basic task is to find and mark the new words in the many texts (i.e., articles) that make up the textbook. This task seems easy, but it is tedious and often takes the producer a great deal of time. Moreover, in the process of finding new words, it is easy to lose track of where new words occur. Especially when a textbook contains many lessons, it is often unclear whether a new word has appeared in a previous lesson and where, and much unnecessary time is spent looking up this information. In the later stages of production in particular, if the content of one lesson must be modified or the position of a lesson changed, the change ripples through the whole textbook: the related new words in the other lessons must be adjusted and modified accordingly. With the current method of finding and marking new words manually, the additional labor is considerable, and such modifications are almost impossible to complete accurately in a short period of time.

Nowadays, with the development of new technologies such as computers, databases, and multimedia, e-learning has become a trend that cannot be ignored. In language teaching, however, e-learning is used mostly to produce textbooks with multimedia animation. Many corpora that can be used to produce language textbooks are now fully electronic, which has brought great convenience to textbook production, yet the task of finding new words in a text described above cannot easily be carried out by computer. This is mainly due to the complexity of language itself, especially the languages of East Asian countries.

Language combines sound and meaning, that is, speech and semantics. This is the basic feature of grammatical units. Take Chinese as an example. The largest grammatical unit is the sentence; the units smaller than the sentence are, in order, the phrase, the word, and the character. However, individual Chinese characters often do not have independent meaning. Therefore, when teaching Chinese, it is of little use to teach single characters in isolation; words composed of multiple characters must be taught for the teaching to make sense. From this perspective, the word is the smallest meaningful linguistic unit that can be used independently, and it is the basic object of the search for new words when making a language textbook.

In English, a word is a single English word, and in an article words are separated by spaces, so words can be accurately segmented by identifying the spaces in English text. This kind of segmentation needs no word segmentation technology and is relatively simple to implement. In East Asian languages such as Chinese and Japanese, there are no spaces between words, and the meaning of the whole sentence must be understood to distinguish the words accurately. In this case, a word segmentation technique is required.

Word segmentation technology is at the core of computer natural language processing. It is a technique for recombining a continuous character sequence into a word sequence according to certain specifications. Word segmentation technology is now relatively mature and is widely used in machine translation and in large-scale Internet information retrieval. However, word segmentation techniques have not yet been applied to the production of language textbooks.

Summary of the invention

It is an object of the present invention to provide a system and method for automatically recognizing and marking new words in a text, thereby reducing the amount of labor required to produce language teaching materials.

In order to achieve the above object of the invention, the present invention adopts the following technical solutions:

A system for making a language teaching material, comprising an input device and an output device, further characterized by:

a word slicer for dividing the input original textbook content into words;

a judging device for judging whether a word in the original textbook has appeared;

a prompting device, configured to prompt the judgment result of the determining device to the user;

a selecting device for the user to select a word as a new word;

a storage device, configured to store a new word selected by the user;

The input device, the word slicer, the judging device, and the prompting device are sequentially connected, and the storage device is connected to the judging device, the prompting device, the selecting device, and the output device, respectively.

Wherein, the word slicer has a space recognition unit capable of recognizing a space.

Alternatively, the word slicer has a Chinese word segmentation unit for implementing Chinese word segmentation.

As a further alternative, the word slicer includes a language discriminating unit, a space recognizing unit, and a Chinese word segmentation unit, wherein the language discriminating unit is connected to the space recognizing unit and the Chinese word segmentation unit respectively, and the output of either the space recognizing unit or the Chinese word segmentation unit is selected as the external output.

The word slicer further includes a word change type database and an analysis operator.

The word change type database is used for storing word change patterns with a high degree of commonness, and the analysis operator further processes the segmented word sequence against the word change type database.

The system further includes a sentence analysis device for dividing the content of the original textbook into a word sequence, and adding grammar and pronunciation information;

The sentence pattern analysis device is connected to the determination device.

The system also includes a network communication module that connects the input device and the output device.

A method for producing a language teaching material, implemented by the system for producing a language teaching material according to claim 1, the system comprising an input device, an output device, a word slicer, a judging device, a prompting device, a selecting device, and a storage device; characterized in that it comprises the following steps:

(1) The user inputs the content of a lesson of the original textbook into the system through the input device;

(2) The word slicer analyzes the content of the input textbook to obtain the word information after the segmentation;

(3) The judging device compares the word obtained in the step (2) with the saved word in the storage device to determine whether it is a new word;

(4) The prompting device prompts the user with the new word information;

(5) The user selects, by the selecting device, a word to be treated as a new word, and the selected word is stored in the storage device;

(6) Output the result to the user through the output device.

Wherein,

In the step (2), before word segmentation, the language of the textbook is determined by identifying the ASCII codes corresponding to the characters, thereby determining which word segmentation scheme to use.

If the textbook is a Chinese textbook, the word slicer divides the word by the Chinese word segmentation algorithm.

If the textbook is an English textbook, the word slicer segments the words by identifying spaces in the textbook content.

In the step (2), the word slicer further includes a word change type database and an analysis operator. The word change type database stores word change patterns with a high degree of commonness. The analysis operator compares the word sequence produced by the regular word segmentation device against the word change type database: if a segmented word has a higher degree of commonness in the database when divided into smaller units, it is further subdivided; if several adjacent segmented words have a higher degree of commonness in the database when combined into a larger unit, they are combined.

In the step (3), the condition used by the judging device to determine whether a word has already appeared is that three equalities hold simultaneously: the word itself of the new word equals the word itself of an existing word, the grammatical information of the new word equals the grammatical information of the existing word, and the pronunciation information of the new word equals the pronunciation information of the existing word.

The storage device saves the production history of the teaching material with the word as the minimum storage unit. The production history includes: the time the teaching material was created, personal information of the author, the class to which the word belongs, the position of the word within the class, and the grammar and pronunciation information of the word itself.

The system and method for producing language teaching materials provided by the invention automatically recognize and mark the new words in a text. This solves the problem of the user losing track of word positions while making the teaching material, greatly reduces the time the user must spend, and effectively reduces the probability of errors in the production of textbooks.

DRAWINGS

The invention will now be further described with reference to the drawings and specific embodiments.

Fig. 1 is a block diagram showing the structure of an embodiment of a language teaching material production system according to the present invention.

Fig. 2 is a basic structural block diagram of a fully automatic Chinese word segmentation system as prior art.

Fig. 3 is a flow chart showing the segmentation performed by the word slicer including the word change type database and the analysis operator.

Fig. 4 is a flow chart showing the judging means judging whether or not there is a new word in the segmentation result.

FIG. 5 is an overall flow chart of a method for producing a language teaching material according to the present invention.

Detailed description

The language-based textbooks referred to in the present invention are not limited to the so-called textbooks, but also include language game software, software for various language learning, and the like which serve as a language teaching function. As shown in Figure 1, the language teaching material production system consists of the following components:

Input device 1 for the user to input the content of a lesson in the original textbook;

a word slicer 2, for dividing the input original textbook content into words;

The judging device 3 combines the result of the word slicer 2 with the storage device to determine whether the word in the original teaching material has appeared;

a prompting device 4, configured to prompt the judgment result of the determining device to the user;

a storage device 5, configured to store a new word selected by the user and related information;

Selecting device 6 for the user to select a specific word as a new word;

Output device 7, for outputting the content of the edited class;

Wherein, the input device 1, the word slicer 2, the judging device 3, and the prompting device 4 are sequentially connected. The storage device 5 is connected to the judging device 3, the prompting device 4, the selecting device 6, and the output device 7, respectively.

The above language teaching material production system can be realized by a computer. In that case, the input device 1 and the output device 7 can use the human-computer interaction interfaces of the computer, such as a keyboard and a display; the storage device 5 can be the hard disk of the computer; the selecting device 6 can be the mouse or keyboard of the computer; and the prompting device 4 can be a peripheral such as the display or a speaker of the computer.

In addition, the language teaching material production system can also be realized using information terminal devices such as smartphones and PDAs. In this case, the input device 1, the prompting device 4, the storage device 5, the selecting device 6, and the output device 7 can all be realized by the corresponding functional units of the smartphone or PDA. Details are omitted here.

In the present invention, the word slicer 2 and the judging means 3 are core functional components in the language-based textbook production system, and the detailed description thereof will be respectively made below.

The function of the word slicer 2 is to divide the text input by the input device 1 into words. It can be implemented in a variety of ways.

As mentioned above, for European languages such as English, recognizing the words in an article is relatively simple: it is only necessary to identify the spaces in the article to segment the words accurately. Therefore, if the system is used only to make English textbooks, the word slicer in the present invention only needs a space recognition unit that finds the spaces in the article, which is easily realized in software.
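A minimal sketch of such a space recognition unit is shown below. The choice of Python, the function name `split_on_spaces`, and the stripping of punctuation are illustrative assumptions and do not come from the original text.

```python
import string


def split_on_spaces(text: str) -> list[str]:
    """Hypothetical space recognition unit: split text on whitespace.

    Leading and trailing punctuation is stripped from each token so that
    "student." and "student" are treated as the same word.
    """
    words = []
    for token in text.split():  # str.split() splits on any run of whitespace
        word = token.strip(string.punctuation)
        if word:  # skip tokens that consisted only of punctuation
            words.append(word)
    return words


print(split_on_spaces("I am a student."))
```

A real textbook tool would probably also need to decide how to treat capitalization, hyphenated words, and contractions; those choices are left open here.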

However, for East Asian languages such as Chinese or Japanese, the word slicer described above is not practical. Take Chinese as an example. Chinese is written in characters, and the characters of a sentence run together without spaces to express a meaning. For example, the English sentence "I am a student" is "我是一个学生" in Chinese. In the English sentence, a computer can easily tell from the spaces that "student" is a word, but in the Chinese sentence it cannot easily tell that the characters "学" ("study") and "生" ("sheng") combine to form one word. If this is ignored and the text is segmented mechanically by spaces, the segmentation result has large errors. Therefore, a Chinese word segmentation unit is needed to divide the Chinese character sequence into meaningful words.

At present, the main Chinese word segmentation algorithms fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics. The string-matching method, also called mechanical word segmentation, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is recognized). The basic idea of the understanding-based method is to perform syntactic and semantic analysis at the same time as word segmentation, and to use syntactic and semantic information to resolve ambiguity. The statistics-based method counts the frequency with which adjacent characters co-occur in a corpus and calculates their mutual information. It only needs frequency counts from the corpus and does not need a segmentation dictionary, so it is also called the dictionary-free or statistical word extraction method.
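To make the string-matching category concrete, here is a minimal sketch of forward maximum matching, one common matching strategy. The original text does not say which strategy is used, so the strategy, the function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are all assumptions for illustration.

```python
def forward_max_match(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Segment text by greedily taking the longest dictionary entry at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted so the scan cannot stall.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


toy_dictionary = {"我", "是", "一个", "学生"}  # illustrative entries only
print(forward_max_match("我是一个学生", toy_dictionary))
```

Greedy matching of this kind is fast but cannot resolve ambiguous strings on its own, which is the gap the understanding-based and statistics-based methods described above try to close.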

Chinese word segmentation technology is relatively mature. The earliest Chinese word segmentation system was designed and implemented by the Department of Computer Science of Beijing University of Aeronautics and Astronautics in 1983. The Chinese automatic word segmentation model of the National Language Committee takes into account the role of syntactic analysis in automatic word segmentation in order to resolve ambiguity better: the segmentation process considers all possible segmentations and uses Chinese syntax and other information to select a reasonable result among them. Since the early 1990s, Microsoft Research has developed a general-purpose natural language processing platform, NLPWin. According to reports, the parsing component of NLPWin uses bidirectional chart parsing that applies grammar rules under the guidance of a probabilistic model, and keeps the grammar separate from the parser. Experimental results show that the system can correctly handle 85% of ambiguous segmentation fields.

Figure 2 shows the basic structure of a fully automatic Chinese word segmentation system. This system is the subject of a Chinese invention patent, No. 96100831.8, and includes: (1) a Chinese source text input device; (2) a device for automatically breaking sentences according to the punctuation at the end of each Chinese sentence; (3) a node structure generating device for converting the characters of a sentence into graph nodes; (4) an edge solving device for determining word length, which detects ambiguity while solving edges and marks it with a corresponding ambiguity identifier; (5) a reasoning disambiguation device that applies disambiguation rules according to the ambiguity identifier, comprising an ambiguity rule base and a word stack rule device, where a disambiguation rule has the form: predecessor edge attribute set, current edge attribute set, context condition test, action function name; and (6) a result output device, which obtains the word segmentation result for output by traversing the word segmentation path. The Chinese source text input device starts the automatic sentence breaking device. The node structure generating device converts the characters of each sentence produced by the sentence breaking device into graph nodes, forming a node sequence that is sent to the edge solving device. The edge solving device determines the edge sequence, the reasoning disambiguation device reasons over the edges to obtain the segmented sentence, and the result is sent to the output device.

The above various Chinese word segmentation techniques can be directly used in the present invention to implement Chinese word segmentation units.

As a mature product, this language teaching material production system should be able to identify different languages automatically and segment words accordingly: for English and other European languages, words are segmented directly by the spaces in the article, without word segmentation technology; for the languages of East Asian countries such as Chinese, the corresponding word segmentation technique is applied. To do this, a language discriminating unit needs to be added to the word slicer 2. Before word segmentation is performed, this unit first identifies the language to determine whether a word segmentation technique is needed and which one to use. The language discriminating unit is simple to implement: by recognizing the ASCII codes of the text input through the input device 1, it can accurately determine the language.
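The following is a minimal sketch of a language discriminating unit. The text speaks of recognizing ASCII codes; the sketch instead inspects Unicode code points and treats any character in the CJK Unified Ideographs block (U+4E00 to U+9FFF) as a sign of Chinese text, which is an assumption made for illustration. The function names and returned labels are hypothetical.

```python
def contains_cjk(text: str) -> bool:
    """Return True if any character lies in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def choose_segmenter(text: str) -> str:
    """Hypothetical language discriminating unit.

    Returns the label of the unit that should segment the text:
    "chinese_segmentation" or "space_recognition".
    """
    if contains_cjk(text):
        return "chinese_segmentation"
    return "space_recognition"


print(choose_segmenter("I am a student"))
print(choose_segmenter("我是一个学生"))
```

This check does not separate Chinese from Japanese, since both use characters from the same block; a fuller version would also need to look for kana or use other cues.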

In addition, to obtain a better segmentation result, a word change type database and an analysis operator may be included in the word slicer 2. As shown in FIG. 3, the word change type database stores word change patterns with a high degree of commonness. The analysis operator compares the word sequence produced by the regular word segmentation device against the word change type database. If a segmented word has a higher degree of commonness in the database when divided into smaller units, it is further subdivided; if several adjacent segmented words have a higher degree of commonness in the database when combined into a larger unit, they are combined. In this way, the segmentation is adjusted again according to the more common change patterns, and a more accurate segmentation result is obtained.
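One way to picture the analysis operator is sketched below. The text does not say how the degree of commonness is measured or compared, so the sketch assumes a plain dictionary of scores, merges only pairs of adjacent words, splits a word into at most two parts, and scores a split by its weaker part. All names and scores are hypothetical.

```python
def refine(words: list[str], commonness: dict[str, float]) -> list[str]:
    """Adjust a segmentation using a table of commonness scores (assumed format)."""
    # Pass 1: merge two adjacent words when the combined unit is more common
    # than either part on its own.
    merged = []
    i = 0
    while i < len(words):
        if i + 1 < len(words):
            combined = words[i] + words[i + 1]
            parts = max(commonness.get(words[i], 0.0), commonness.get(words[i + 1], 0.0))
            if commonness.get(combined, 0.0) > parts:
                merged.append(combined)
                i += 2
                continue
        merged.append(words[i])
        i += 1

    # Pass 2: split a word in two when both parts are more common than the whole.
    result = []
    for word in merged:
        whole = commonness.get(word, 0.0)
        best = None
        for cut in range(1, len(word)):
            left, right = word[:cut], word[cut:]
            score = min(commonness.get(left, 0.0), commonness.get(right, 0.0))
            if score > whole and (best is None or score > best[0]):
                best = (score, left, right)
        if best is not None:
            result.extend([best[1], best[2]])
        else:
            result.append(word)
    return result


scores = {"学生": 0.9, "学": 0.3, "生": 0.2, "一个": 0.8, "一": 0.5, "个": 0.4}
print(refine(["我", "是", "一个", "学", "生"], scores))
```

Here the over-split pair "学" and "生" is recombined because the combined unit scores higher, while "一个" is left intact because neither of its parts is more common than the whole.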

The word segmentation result obtained by the word slicer 2 is input to the judging device 3. The judging device 3 compares each segmented word with the word library already stored in the storage device 5 and judges whether the word appearing in the text has appeared before. The judgment performed by the judging device 3 can be realized by a conventional query algorithm, as described further below.

The prompting device 4 not only displays the judgment result of the judging device 3 to the user, but can also draw on the existing word library in the storage device to show the user the rich information held there, further improving the quality of teaching material editing.

The storage device 5 can save the production history of the textbook creator when creating the teaching material, and the word is the smallest unit of storage. The content included in the production history is: time information of the teaching material creation, personal information of the producer, class information to which the word belongs, position information of the word in the class, and grammar and pronunciation information of the word itself.
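The stored record described above could be modelled as follows. This is only a sketch: the class name `WordRecord`, the field names, and the field types (for example, an integer offset for the position) are assumptions, since the text lists the kinds of information but not their representation.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class WordRecord:
    """One stored word together with its production history (hypothetical layout)."""
    word: str             # the word itself
    grammar: str          # grammar information, e.g. part of speech
    pronunciation: str    # pronunciation information, e.g. pinyin
    lesson: str           # class (lesson) to which the word belongs
    position: int         # position of the word within the lesson (assumed: an offset)
    author: str           # personal information of the producer
    created_at: datetime  # time the teaching material was created


record = WordRecord(
    word="学生",
    grammar="noun",
    pronunciation="xuésheng",
    lesson="Lesson 1",
    position=4,
    author="editor_a",
    created_at=datetime(2006, 7, 17),
)
print(record)
```

Making the record immutable (`frozen=True`) is a design choice for the sketch: a history entry, once written, is not expected to change.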

In addition, in order to further improve the accuracy and speed of teaching materials, the language teaching material production system can also add a sentence analysis device (not shown). The sentence pattern analyzing device is connected to the judging device 3, and divides the content of the original textbook into a word sequence, and adds grammar and pronunciation information. When the judging device 3 makes a judgment, the word is combined with the grammar and the pronunciation information to judge, and the unnecessary error is automatically removed.
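A minimal sketch of such a combined judgment is given below, assuming the stored words are kept in a dictionary keyed by the triple (word, grammar information, pronunciation information). The names `Word` and `judge`, the status strings, and the sample entries are illustrative and not taken from the original.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str           # the word itself
    grammar: str        # grammar information, e.g. part of speech
    pronunciation: str  # pronunciation information, e.g. pinyin


def judge(candidate: Word, existing: dict[Word, list[str]]) -> tuple[str, list[str]]:
    """Look the candidate up by all three fields at once.

    Returns a status string and the positions where the word already appeared.
    """
    positions = existing.get(candidate)
    if positions:
        return "already appeared", positions
    return "never appeared", []


# The same character with a different part of speech and reading counts as a different word.
existing_words = {Word("行", "verb", "xíng"): ["Lesson 1"]}
print(judge(Word("行", "verb", "xíng"), existing_words))
print(judge(Word("行", "noun", "háng"), existing_words))
```

Matching on the full triple is what lets the second lookup report a new word even though the written form is identical.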

The basic flow of automatically recognizing and marking new words using the language teaching material production system is described below with reference to FIG. 1.

1. First, the user inputs the content of a lesson of the original textbook into the system through the input device 1.

2. The system automatically calls the word slicer 2, and the word slicer 2 analyzes the content of the obtained textbook and obtains the word information after the segmentation.

3. As shown in FIG. 4, the system automatically calls the judging device 3 to compare the newly obtained words with the words saved in the storage device 5, and to determine whether each word has appeared in the texts already processed, how many times, and at what positions. In this comparison, the content of the word itself can be combined with the grammatical information (part of speech, etc.) and the pronunciation information (pinyin) of the word.

The words stored in the storage device 5 (referred to as existing words) are words that have already been used in other lessons of the textbook, and their information format is [the word itself, the grammatical information of the word, the pronunciation information of the word, the position information where the word is used]. A word obtained by the word slicer (referred to as a new word) has the format [the word itself, the grammatical information of the word, the pronunciation information of the word]. The judging device uses this information to search the set of existing words, with the search condition (the word itself of the new word = the word itself of the existing word) AND (the grammatical information of the new word = the grammatical information of the existing word) AND (the pronunciation information of the new word = the pronunciation information of the existing word). If an identical existing word is found, the new word has already appeared in other lessons of the textbook; the flag indicating whether the new word has appeared is set to "already appeared", and its positions of appearance are recorded. If no identical existing word is found, the new word has not appeared in other lessons of the textbook, and the flag is set to "never appeared".

4. Then, in the prompting device 4, the words "already appearing" in the original textbook are distinguished from the words "unappeared" in a different form, and the information such as the appearance position of the already appearing word is also presented to the user.

5. The user then selects, by the selecting device 6 and according to the prompted information, the words to be treated as new words. The selected words are automatically added to the storage device 5 in the format [the word itself, the grammatical information of the word, the pronunciation information of the word, the position information where the word is used].

6. Finally, the result (a list of words and related information) is automatically output to the user via the output device 7.

In this way, the user is relieved of the task of constantly checking whether and where a word has appeared, which shortens the time needed to make language teaching materials.
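The six steps above can be sketched end to end as follows. To keep the example short it assumes space-delimited text, stores only the word and the lesson where it was first selected, and stands in for the user with a function that accepts every candidate. The names `process_lesson` and `accept_all` are hypothetical.

```python
from typing import Callable


def process_lesson(
    lesson: str,
    text: str,
    store: dict[str, str],
    select: Callable[[list[str]], list[str]],
) -> list[str]:
    """Run one lesson through steps 2 to 6; step 1 is the `text` argument."""
    words = text.split()  # step 2: segmentation (space-based only in this sketch)
    candidates = []
    for word in words:  # step 3: judge each word against the stored words
        if word not in store and word not in candidates:
            candidates.append(word)
    print(f"{lesson}: possible new words: {candidates}")  # step 4: prompt the user
    chosen = select(candidates)  # step 5: the user selects the new words
    for word in chosen:
        store[word] = lesson  # step 5 (continued): selected words are stored
    return chosen  # step 6: output the result


def accept_all(candidates: list[str]) -> list[str]:
    """Stand-in for a real user's selection: accept every candidate."""
    return candidates


store: dict[str, str] = {}
first = process_lesson("Lesson 1", "I am a student", store, accept_all)
second = process_lesson("Lesson 2", "I am a teacher", store, accept_all)
print(first, second)
```

Because the store persists between calls, only "teacher" is offered as a candidate in the second lesson, which is the bookkeeping the system is meant to take over from the user.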

The invention automatically divides an article into a word sequence with the word slicer, and uses the judging device and the storage device to find out automatically whether and where each word has appeared. This information supports related statistics and analysis, such as word distribution, frequency of occurrence, and rate of recurrence. In addition, by using the storage device and the judging device of the teaching material production system, the new-word rate of a language material can be determined, which helps to assess the difficulty of the material and helps the user select appropriate material for learning.
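Two of these statistics can be sketched briefly. The sketch takes the rate mentioned above to mean the share of word occurrences in a text that are not yet in storage, which is an assumption about the intended definition; the function names are hypothetical.

```python
from collections import Counter


def word_frequencies(words: list[str]) -> Counter:
    """Count how often each word occurs in a segmented text."""
    return Counter(words)


def new_word_rate(words: list[str], known: set[str]) -> float:
    """Share of word occurrences that are not among the stored (known) words."""
    if not words:
        return 0.0
    unknown = sum(1 for word in words if word not in known)
    return unknown / len(words)


segmented = ["I", "am", "a", "teacher"]
known_words = {"I", "am", "a", "student"}
print(word_frequencies(segmented))
print(new_word_rate(segmented, known_words))
```

A text with a high rate under this definition would be harder for a learner who has covered only the stored vocabulary, which is how the figure could guide the choice of material.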

In this language teaching material production system, a network communication module that can be used for a local area network, a wide area network, or the Internet can also be added. The network communication module is connected to the input device and the output device. Using the network communication module, the system can transmit the results through the network, and can accept external teaching resources to automatically publish the teaching materials in the network.

In the network environment, the language textbook production system can provide an open textbook format. Users can bring other external learning materials under unified management, which increases the diversity and flexibility of language teaching materials. In addition, the storage device can store the online usage history of each user and the content of the teaching material separately, so that teaching materials can be edited cooperatively by several people. Because each person's usage history is recorded separately, when the same textbook is edited over the network, each person's tasks and progress can be kept clearly separate, and the resources that everyone needs can be easily shared.

It should be noted that the language teaching material production system can produce textbooks in various languages and is not limited to the English, Chinese, or Japanese mentioned above. When textbooks in other languages are to be made, the word segmentation algorithm in the word slicer 2 can be changed accordingly. Word segmentation algorithms for different languages are mature existing technologies. For example, the widely used Google search engine adopts multiple word segmentation algorithms suited to different languages. These techniques can be applied directly in the present invention.

The system and method for producing a language-based textbook according to the present invention have been described in detail above. Any obvious changes made to the invention without departing from the spirit of the invention will constitute an infringement of the patent right of the invention and will bear the corresponding legal responsibility.

Claims

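1. A system for producing a language teaching material, comprising an input device and an output device, characterized by further comprising:

a word slicer for dividing the input original textbook content into words;

a judging device for judging whether a word in the original textbook has appeared;

a prompting device, configured to prompt the judgment result of the judging device to the user;

a selecting device for the user to select a word as a new word;

a storage device, configured to store a new word selected by the user;

wherein the input device, the word slicer, the judging device, and the prompting device are sequentially connected, and the storage device is connected to the judging device, the prompting device, the selecting device, and the output device, respectively.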

2. The system for producing a language teaching material according to claim 1, wherein the word slicer has a space recognition unit capable of recognizing a space.

3. The system for making a language teaching material according to claim 1, wherein: the word slicer has a Chinese word segmentation unit for implementing Chinese word segmentation.

4. The system for creating a language teaching material according to claim 1, wherein: the word slicer comprises a language discriminating unit, a space recognizing unit, and a Chinese word segmentation unit, wherein the language discriminating unit and the space respectively The identification unit is connected to the Chinese word segmentation unit, and the space recognition unit and the Chinese word segmentation unit are selected for external output.

5. The system for producing a language teaching material according to any one of claims 1 to 4, characterized in that:

The word slicer further includes a word change type database and an analysis operator;

The word change type database is used for storing word change patterns having a high degree of commonness, and the analysis operator further processes the segmented word sequence against the word change type database.

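6. The system for producing a language teaching material according to any one of claims 1 to 5, characterized in that:

The system further includes a sentence pattern analysis device for dividing the content of the original textbook into a word sequence and adding grammar and pronunciation information;

The sentence pattern analysis device is connected to the judging device.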

7. The system for producing a language teaching material according to any one of claims 1 to 6, wherein the system further comprises a network communication module, and the network communication module is connected to the input device and the output device.

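8. A method for producing a language teaching material, implemented by the system for producing a language teaching material according to claim 1, the system comprising an input device, an output device, a word slicer, a judging device, a prompting device, a selecting device, and a storage device; characterized in that it comprises the following steps:

(1) The user inputs the content of a lesson of the original textbook into the system through the input device;

(2) The word slicer analyzes the content of the input textbook to obtain the word information after segmentation;

(3) The judging device compares the word obtained in the step (2) with the saved words in the storage device to determine whether it is a new word;

(4) The prompting device prompts the user with the new word information;

(5) The user selects, by the selecting device, a word to be treated as a new word, and the selected word is stored in the storage device;

(6) Output the result to the user through the output device.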

9. The method for producing a language teaching material according to claim 8, wherein: in the step (2), before word segmentation, the language of the textbook is determined by identifying the ASCII codes corresponding to the text, thereby determining which word segmentation scheme to use.

10. The method for producing a language teaching material according to claim 8 or 9, wherein: if the teaching material is a Chinese textbook, the word slicer segments the words by a Chinese word segmentation algorithm.

11. The method for producing a language teaching material according to claim 8 or 9, wherein: if the teaching material is an English textbook, the word slicer segments the words by identifying spaces in the textbook content.

12. The method for producing a language teaching material according to claim 8, wherein: in the step (2), the word slicer further comprises a word change type database and an analysis operator; the word change type database stores word change patterns with a high degree of commonness; the analysis operator compares the word sequence produced by the regular word segmentation device against the word change type database; if a segmented word has a higher degree of commonness in the database when divided into smaller units, it is further subdivided; and if several adjacent segmented words have a higher degree of commonness in the database when combined into a larger unit, these words are combined.

13. The method for producing a language teaching material according to claim 8, wherein in the step (3), the condition used by the judging device to determine whether a word has already appeared is that the word itself of the new word is equal to the word itself of the existing word, the grammatical information of the new word is equal to the grammatical information of the existing word, and the pronunciation information of the new word is equal to the pronunciation information of the existing word.

14. The method for producing a language teaching material according to claim 8, wherein: the storage device saves the production history of the teaching material with the word as the minimum storage unit, and the production history includes: the time the teaching material was created, the personal information of the author, the class to which the word belongs, the position of the word within the class, and the grammar and pronunciation information of the word itself.

15. The method for producing a language teaching material according to claim 8, wherein: different users respectively implement the method and share the results through a network.

16. The method for producing a language teaching material according to claim 15, wherein: the storage device stores the online usage history of the different users and the corresponding teaching material content.