WO2008017188A1 - System and method for making teaching material of language class - Google Patents
System and method for making teaching material of language class
- Publication number
- WO2008017188A1 (application PCT/CN2006/001722)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word
- teaching material
- language
- producing
- slicer
Classifications
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B19/00—Teaching not covered by other main groups of this subclass
- G09B19/06—Foreign languages
Definitions
- The invention relates to a system and method for producing language teaching materials, and in particular to a system and method that automatically recognize and mark the new words in a text, thereby reducing the labor required to produce language teaching materials. It belongs to the field of computer-assisted teaching technology.
- Language has sound and meaning; it is a combination of phonetics and semantics, and this is the basic characteristic of grammatical units. Taking Chinese as an example, the largest grammatical unit is the sentence, and the units smaller than the sentence are, in order, the phrase, the word, and the character. However, a single Chinese character often has no independent meaning. Therefore, when teaching Chinese, simply teaching individual characters is of little use; only teaching words composed of several characters is meaningful. From this perspective, the word is the smallest meaningful linguistic component that can function independently. It is the basic object handled when looking for new words while producing a language teaching material.
- In English, a word is a single English word. In an English text, words are separated from one another by spaces, so different words can be segmented accurately by identifying the spaces in the text. This kind of segmentation, which needs no word segmentation technology, is relatively simple to implement. In East Asian languages such as Chinese and Japanese, there are no spaces between words, and the meaning of the whole sentence must be understood in order to distinguish the words accurately. In this case, accurate segmentation requires word segmentation technology.
- Word segmentation technology is at the core of computer natural language processing. It is a technique for recombining a continuous sequence of characters into a sequence of words according to a given specification. Word segmentation technology has now reached a relatively mature stage and is widely used in machine translation and in large-scale information retrieval on the Internet. However, it has not yet been applied to the production of language teaching materials.
- a system for making a language teaching material comprising an input device and an output device, further characterized by:
- a prompting device configured to prompt the judgment result of the determining device to the user
- a storage device configured to store a new word selected by the user
- the input device, the word slicer, the judging device, and the prompting device are connected in sequence, and the storage device is connected to the judging device, the prompting device, the selection device, and the output device, respectively.
- the word slicer has a space recognition unit capable of recognizing a space.
- the word slicer has a Chinese word segmentation unit for implementing Chinese word segmentation.
- the word slicer includes a language discriminating unit, a space recognition unit, and a Chinese word segmentation unit, wherein the language discriminating unit is connected to the space recognition unit and to the Chinese word segmentation unit, and one of the space recognition unit and the Chinese word segmentation unit is selected to provide the external output.
- the word slicer further includes a word change type database and an analysis operator.
- the word change type database stores word change patterns with a high degree of commonness, and the analysis operator further processes the segmented word sequence against the word change type database.
- the system further includes a sentence analysis device for dividing the content of the original textbook into a word sequence, and adding grammar and pronunciation information;
- the sentence pattern analysis device is connected to the judging device.
- the system also includes a network communication module that connects the input device and the output device.
- a method for producing a language teaching material, implemented by the system for producing a language teaching material according to claim 1, the system comprising an input device, an output device, a word slicer, a judging device, a prompting device, a selection device, and a storage device; characterized in that it comprises the following steps:
- the word slicer analyzes the content of the input textbook to obtain the word information after the segmentation
- the judging device compares the word obtained in the step (2) with the saved word in the storage device to determine whether it is a new word;
- the prompting device prompts the user with the new word information
- the user selects, by means of the selection device, the words to be treated as new words, and the selected words are stored in the storage device;
- before word segmentation is performed, the language of the textbook is determined by identifying the ASCII codes corresponding to the characters, and the word segmentation scheme is chosen accordingly.
- the word slicer divides the word by the Chinese word segmentation algorithm.
- the word slicer segments the words by identifying spaces in the textbook content.
- the word slicer further includes a word change type database and an analysis operator. The word change type database stores word change patterns with a high degree of commonness. The analysis operator compares the word sequence produced by the conventional word segmentation device against the word change type database: if a segmented word has a higher degree of commonness in the database when divided into smaller units, the word is further subdivided; if several adjacent segmented words have a higher degree of commonness in the database when joined into a larger unit, those words are combined.
- the judging device decides whether a word is a new word by checking whether three conditions hold simultaneously: the new word itself equals the existing word itself, the grammatical information of the new word equals the grammatical information of the existing word, and the pronunciation information of the new word equals the pronunciation information of the existing word.
- the storage device saves the production history of the teaching material, with the word as the smallest unit of storage. The production history includes: the time at which the teaching material was produced, the personal information of the producer, the lesson to which the word belongs, the position at which the word appears in that lesson, and the grammar and pronunciation information of the word itself.
- The system and method for producing language teaching materials provided by the invention automatically recognize and mark the new words in a text. This solves the problem of users confusing the positions at which words appear while producing teaching materials, greatly reduces the time the user must spend, and effectively reduces the probability of errors in the production of teaching materials.
- Fig. 1 is a block diagram showing the structure of an embodiment of a language teaching material creation system according to the present invention.
- Fig. 2 is a block diagram of the basic structure of a fully automatic Chinese word segmentation system in the prior art.
- Fig. 3 is a flow chart showing the division of the word slicer including the word change type database and the analysis operator.
- Fig. 4 is a flow chart showing the judging means judging whether or not there is a new word in the segmentation result.
- Fig. 5 is an overall flow chart of a method for producing a language teaching material according to the present invention.
- the language-based textbooks referred to in the present invention are not limited to the so-called textbooks, but also include language game software, software for various language learning, and the like which serve as a language teaching function.
- the language teaching material production system consists of the following components:
- Input device 1 for the user to input the content of a lesson in the original textbook
- a word slicer 2 for dividing the input original textbook content into words
- a judging device 3 that combines the result of the word slicer 2 with the contents of the storage device 5 to determine whether a word in the original teaching material has already appeared;
- a prompting device 4 configured to present the judgment result of the judging device 3 to the user
- a storage device 5 configured to store a new word selected by the user and related information
- Selecting device 6 for the user to select a specific word as a new word
- the input device 1, the word slicer 2, the judging device 3, and the prompting device 4 are connected in sequence.
- the storage device 5 is connected to the determination device 3, the presentation device 4, the selection device 6, and the output device 7, respectively.
- the above language teaching material production system can be realized by a computer.
- the input device 1 and the output device 7 can use the human-computer interaction interfaces of the computer, such as a keyboard or a display.
- the storage device 5 can be a hard disk of the computer
- the selection device 6 can be a mouse or a keyboard of the computer.
- the prompting device 4 can be a peripheral such as a display or a speaker of a computer.
- the language teaching material production system can also be realized by using information terminal devices such as smartphones and PDAs.
- the input device 1, the prompting device 4, the storage device 5, the selection device 6, and the output device 7 can all be realized by the corresponding functional units of a smartphone or a PDA, which will not be described in detail here.
- the word slicer 2 and the judging means 3 are core functional components in the language-based textbook production system, and the detailed description thereof will be respectively made below.
- the function of the word slicer 2 is to divide the text input by the input device 1 into words. It can be implemented in a variety of ways.
- The word segmentation method based on string matching is also called the mechanical word segmentation method. It matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy. If a string is found in the dictionary, the match is successful (a word is recognized).
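The string-matching approach can be illustrated with forward maximum matching, one common matching strategy. This is a minimal sketch and not the method of the patent itself: the function name `forward_max_match`, the `max_len` limit, and the toy dictionary are all assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary entry at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink it until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is accepted as a fallback when nothing matches.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; a real machine dictionary would be far larger.
toy_dictionary = {"语言", "教材", "制作", "系统"}
segments = forward_max_match("语言教材制作系统", toy_dictionary)
```

With this toy dictionary the sample string is cut into four two-character words. Greedy matching cannot resolve ambiguous strings on its own, which is why the other approaches bring in syntactic, semantic, or statistical information.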
- The basic idea of the understanding-based word segmentation method is to perform syntactic and semantic analysis at the same time as word segmentation, and to use the syntactic and semantic information to resolve ambiguity.
- The statistics-based word segmentation method counts the frequency with which adjacent characters co-occur in a corpus and calculates their mutual information. This method only needs to count the frequencies of character combinations in the corpus and does not need a segmentation dictionary, so it is also called the dictionary-free or statistical word extraction method.
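The statistical idea can be sketched by computing pointwise mutual information (PMI) for adjacent character pairs. The source only says that mutual information is calculated, so the choice of PMI, the function name, and the three-sentence corpus below are assumptions for illustration.

```python
import math
from collections import Counter


def pointwise_mutual_information(corpus):
    """Return the PMI (in bits) of every adjacent character pair seen in the corpus."""
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(zip(sentence, sentence[1:]))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (first, second), count in pair_counts.items():
        p_pair = count / total_pairs
        p_first = char_counts[first] / total_chars
        p_second = char_counts[second] / total_chars
        scores[first + second] = math.log2(p_pair / (p_first * p_second))
    return scores


# Invented toy corpus; real statistics need a large corpus.
toy_corpus = ["学习语言", "语言教材", "学习教材"]
pmi = pointwise_mutual_information(toy_corpus)
```

In this toy corpus the pair "语言" scores higher than "习语", a pair that only occurs across a word boundary, so a threshold on the score could be used to decide which adjacent characters to join.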
- Figure 2 shows the basic structure of a fully automatic Chinese word segmentation system.
- The Chinese word segmentation system is the subject of a Chinese invention patent, No. 96100831.8, and includes: (1) a Chinese source text input device; (2) a device for automatically breaking sentences according to the punctuation at the end of Chinese sentences; (3) a node structure generating device for converting the characters of a sentence into graph nodes; (4) an edge solving device for determining word lengths, which performs ambiguity judgment while solving the edges and marks the ambiguities accordingly; (5) a reasoning disambiguation device that applies disambiguation rules according to the ambiguity marks, and which comprises an ambiguity rule base and a word stack rule device, the disambiguation rules having the form: predecessor edge attribute set, current edge attribute set, context condition test, action function name; (6) a result output device, which obtains the word segmentation structure for output by traversing the word segmentation path. The Chinese source text input device starts the operation of the automatic sentence breaking device, and the node structure generating device converts
- This language teaching material production system should be able to identify different language environments automatically, so that words are segmented appropriately for each language: for texts in English and other European languages, words are segmented directly by the spaces in the text, without word segmentation technology; for texts in Chinese and other East Asian languages, the corresponding word segmentation technique is applied automatically.
- A language discriminating unit therefore needs to be added to the word slicer 2. Before word segmentation is performed, this unit first identifies the language, in order to determine whether a word segmentation technique is to be used and which one.
- The language discriminating unit is also simple to implement: by recognizing the ASCII codes of the text input by the input device 1, it can accurately determine which language the text is in.
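A minimal sketch of such a language discriminating unit and the dispatch it drives is given below. The source speaks only of recognizing character codes; testing for the Unicode CJK Unified Ideographs range (U+4E00 to U+9FFF) is an assumption of this sketch, as are the names `discriminate_language` and `slice_words`.

```python
def discriminate_language(text):
    """Classify text as 'cjk' or 'space_delimited' from its character codes."""
    for ch in text:
        # CJK Unified Ideographs block; this range check is an assumption of the sketch.
        if 0x4E00 <= ord(ch) <= 0x9FFF:
            return "cjk"
    return "space_delimited"


def slice_words(text, cjk_segmenter):
    """Dispatch to space splitting or to a caller-supplied CJK segmenter."""
    if discriminate_language(text) == "cjk":
        return cjk_segmenter(text)
    return text.split()


# `list` stands in for a real Chinese word segmentation unit and simply splits per character.
english_words = slice_words("This is a book", cjk_segmenter=list)
chinese_units = slice_words("语言", cjk_segmenter=list)
```

The range check does not cover Japanese kana or Korean hangul, so a fuller discriminator would need additional ranges and one segmenter per language.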
- a word change type database and an analysis operator may be included in the word slicer 2.
- The word change type database stores word change patterns with a high degree of commonness. The analysis operator compares the word sequence produced by the conventional word segmentation device against the word change type database: if a segmented word has a higher degree of commonness in the database when divided into smaller units, it is further subdivided; if several adjacent segmented words have a higher degree of commonness in the database when joined into a larger unit, those words are combined. In this way, the segmentation is adjusted again according to the more common change patterns, and a more accurate segmentation result is obtained.
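The adjustment performed by the analysis operator can be sketched as two passes over the segmented words, one that subdivides and one that combines. The data layout is an assumption: the database is modeled as a dictionary mapping a tuple of units to a commonness score, and the scores and English examples are invented for illustration.

```python
def adjust_segmentation(words, variation_db):
    """Re-split or re-join segmented words according to commonness scores."""
    # Pass 1: subdivide a word when a finer pattern for it is more common.
    split_words = []
    for word in words:
        best = (word,)
        best_score = variation_db.get(best, 0.0)
        for pattern, score in variation_db.items():
            if "".join(pattern) == word and score > best_score:
                best, best_score = pattern, score
        split_words.extend(best)

    # Pass 2: combine two neighbours when the joined form is more common.
    merged = []
    for word in split_words:
        if merged:
            pair = (merged[-1], word)
            joined = (merged[-1] + word,)
            if variation_db.get(joined, 0.0) > variation_db.get(pair, 0.0):
                merged[-1] = joined[0]
                continue
        merged.append(word)
    return merged


# Hypothetical commonness scores.
toy_db = {
    ("text", "book"): 0.2, ("textbook",): 0.9,
    ("ice", "cream"): 0.8, ("icecream",): 0.1,
}
adjusted = adjust_segmentation(["text", "book", "icecream"], toy_db)
```

Here "text" and "book" are joined because the combined form scores higher, while "icecream" is split because the two-unit pattern scores higher. Patterns missing from the database default to a score of zero, so unknown words are left unchanged.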
- The word segmentation result obtained by the word slicer 2 is input to the judging device 3. The judging device 3 compares the segmented words with the word library already stored in the storage device 5 and judges whether each word appearing in the text has appeared before.
- the judgment performed by the judging means 3 can be realized by a conventional query algorithm. In this regard, there will be further introduction in the following.
- The prompting device 4 not only displays the judgment result of the judging device 3 to the user; in combination with the existing word library in the storage device, it can also display the rich information held in that library, thereby further improving the quality of teaching material editing.
- the storage device 5 can save the production history of the textbook creator when creating the teaching material, and the word is the smallest unit of storage.
- the content included in the production history is: time information of the teaching material creation, personal information of the producer, class information to which the word belongs, position information of the word in the class, and grammar and pronunciation information of the word itself.
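One possible record layout for this production history is sketched below, with one record per word. The class name, field names, and sample values are assumptions invented for illustration; the source only lists the kinds of information stored.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProductionRecord:
    """One stored entry; the word is the smallest unit of storage."""
    word: str
    grammar: str          # grammar information, e.g. part of speech
    pronunciation: str    # pronunciation information, e.g. pinyin
    lesson: str           # lesson to which the word belongs
    position: int         # position of the word within that lesson
    author: str           # personal information of the producer
    created_at: datetime  # time at which the teaching material was produced


def records_for_lesson(records, lesson):
    """Return the records of one lesson, ordered by position."""
    return sorted((r for r in records if r.lesson == lesson), key=lambda r: r.position)


# Sample values are invented for illustration only.
history = [
    ProductionRecord("教材", "noun", "jiàocái", "Lesson 2", 7, "editor_a", datetime(2006, 8, 1)),
    ProductionRecord("语言", "noun", "yǔyán", "Lesson 2", 3, "editor_a", datetime(2006, 8, 1)),
]
lesson_two = records_for_lesson(history, "Lesson 2")
```

Keeping the lesson and position on every record is what allows the system to report where a word has already appeared.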
- the language teaching material production system can also add a sentence analysis device (not shown).
- the sentence pattern analyzing device is connected to the judging device 3, and divides the content of the original textbook into a word sequence, and adds grammar and pronunciation information.
- When the judging device 3 makes a judgment, it considers the word together with its grammar and pronunciation information, so that unnecessary errors are eliminated automatically.
- the system automatically calls the word slicer 2, and the word slicer 2 analyzes the content of the obtained textbook and obtains the word information after the segmentation.
- The system automatically calls the judging device 3 to compare the newly obtained words with the words saved in the storage device 5, and to determine whether each word has appeared in the texts already processed, how many times it has appeared, and at what positions.
- the content of the word itself can be combined with the grammatical information (part of speech, etc.) and the pronunciation information (pinyin) of the word to be compared with the saved words.
- The words stored in the storage device 5 are words that have already been used in other lessons of the textbook, and their information format is [the word itself, grammatical information of the word, pronunciation information of the word, position information at which the word is applied].
- A word obtained by the word slicer (referred to as a new word) has the format [the word itself, grammatical information of the word, pronunciation information of the word], and the judging device compares the new word's text, grammatical information, and pronunciation information with those of the stored words.
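The comparison of the two formats can be sketched as a three-field equality test that also collects the positions of earlier occurrences. This is a sketch under assumptions: the type `StoredWord`, the function names, and the sample entries are invented for illustration, and a real system would query a database rather than scan a list.

```python
from typing import NamedTuple


class StoredWord(NamedTuple):
    text: str
    grammar: str
    pronunciation: str
    position: str


def find_occurrences(candidate, stored_words):
    """Return the positions of stored words whose text, grammar, and pronunciation all match."""
    return [w.position for w in stored_words
            if (w.text, w.grammar, w.pronunciation) == tuple(candidate)]


def is_new_word(candidate, stored_words):
    """A candidate is new when no stored word matches on all three fields."""
    return not find_occurrences(candidate, stored_words)


# Invented sample entries.
stored = [
    StoredWord("行", "verb", "xíng", "Lesson 1, sentence 3"),
    StoredWord("行", "verb", "xíng", "Lesson 2, sentence 1"),
]
seen_positions = find_occurrences(("行", "verb", "xíng"), stored)
noun_is_new = is_new_word(("行", "noun", "háng"), stored)
```

Comparing all three fields means that the same character with a different reading or part of speech is still reported as new, which matches the stated aim of removing unnecessary errors.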
- The words in the original textbook that have already appeared are displayed in a different form from the words that have not yet appeared, and information such as the positions at which the already-seen words appeared is also presented to the user.
- According to the prompted information, the user selects, by means of the selection device 6, the words to be treated as new words. The selected words are automatically added to the storage device 5 in the format [the word itself, grammatical information of the word, pronunciation information of the word, position information at which the word is applied].
- The invention automatically divides an article into a word sequence by means of the word slicer, and uses the judging device and the storage device to find out automatically whether each word has appeared before and where it appeared. With this information, related statistics and analysis can be carried out, such as word distribution, frequency of occurrence, and recurrence rate.
- The new-word rate of a piece of language material can also be determined, which helps to assess the difficulty of the material and helps the user to select language material suitable for learning.
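The statistics mentioned above can be sketched as follows. The source does not define the rate, so reading it as the share of distinct words that are not yet in the stored vocabulary is an assumption of this sketch, as are the function names and the sample sentence.

```python
from collections import Counter


def word_frequencies(words):
    """Count how often each word occurs in the segmented text."""
    return Counter(words)


def new_word_rate(words, known_words):
    """Share of distinct words in the text that are absent from the stored vocabulary."""
    distinct = set(words)
    if not distinct:
        return 0.0
    return len(distinct - set(known_words)) / len(distinct)


# Invented sample data.
lesson_words = ["I", "like", "books", "I", "like", "tea"]
known = {"I", "like"}
frequencies = word_frequencies(lesson_words)
rate = new_word_rate(lesson_words, known)
```

A higher rate suggests harder material for a learner who already knows the stored vocabulary. A rate computed over all tokens rather than distinct words would be an equally defensible choice.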
- a network communication module that can be used for a local area network, a wide area network, or the Internet can also be added.
- The network communication module is connected to the input device and the output device. Using the network communication module, the system can transmit its results over the network, and can accept external teaching resources and automatically publish teaching materials on the network.
- the language-based textbook production system can provide an open textbook format. Users can integrate other external learning materials into unified management, which can increase the diversity and flexibility of language teaching materials.
- The storage device can store each user's online usage history separately from the content of the teaching material, so that teaching materials can be edited collaboratively by several people. Each person's usage history is also recorded separately, so that when the same textbook is edited over the network, the individual tasks and processes are kept clearly apart while the resources everyone needs can still be shared easily.
- the language teaching material production system can produce textbooks of various languages, and is not limited to the production of the above-mentioned English, Chinese or Japanese.
- the word segmentation algorithm in the word slicer 2 can be used to achieve the goal.
- The word segmentation algorithms for different languages are mature existing technologies. For example, the widely used Google search engine adopts multiple word segmentation algorithms suited to different languages. These techniques can be applied directly to the present invention.
Landscapes
- Business, Economics & Management (AREA)
- Engineering & Computer Science (AREA)
- Entrepreneurship & Innovation (AREA)
- Physics & Mathematics (AREA)
- Educational Administration (AREA)
- Educational Technology (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Machine Translation (AREA)
Abstract
A system and method for making teaching material of language class. The system includes input device (1), output device (7), word cutter (2), judge device (3), prompt device (4), select device (6), and storage device (5). A certain content of one class from a primary teaching material is input into the system through the input device (1) by the user. The word cutter (2) analyzes the input content of teaching material and obtains the word information after being cut. The judge device (3) compares the word after being cut with the word stored in the storage device and judges whether it is new. The prompt device (4) provides the new word information to the user. The user selects the word through the select device (6) as a new word. The selected word is stored in the storage device (5) and outputted to the user as a result.
Description
System and method for producing language teaching materials
Technical field
The invention relates to a system and method for producing language teaching materials, and in particular to a system and method that automatically recognize and mark the new words in a text, thereby reducing the labor required to produce language teaching materials. It belongs to the field of computer-assisted teaching technology.
Background art
When producing a language teaching material, a basic task is to recognize and mark the new words in the many pieces of language material (that is, articles) that make up the teaching material. This work seems easy, but it is very tedious and often takes the producer a great deal of time. Moreover, when looking for new words, a frequent problem is that the positions at which new words appear are easily confused. Especially when a textbook contains many lessons, a producer writing the later lessons often cannot tell whether a given new word has appeared in an earlier lesson, or where it appeared, and much unnecessary time is spent looking up this information. In particular, in the later stages of production, if the content of one lesson has to be modified or the position of a lesson has to be changed, a small change affects the whole: the related new-word information in the other lessons must be adjusted and modified accordingly. With the current practice of finding and marking new words by hand, it is easy to imagine how much labor this adds. Furthermore, such modifications can hardly be completed without error in a short time.
With the development of new technologies such as computers, databases, and multimedia, technology-assisted teaching has become a trend that cannot be ignored. In language teaching, however, it is mostly applied to producing teaching materials accompanied by multimedia animation. Many corpora that can be used to produce language teaching materials have been fully digitized, which brings great convenience to the production of such materials. Nevertheless, the task of finding the new words in a text, as described above, cannot easily be carried out by computer. This is mainly due to the complexity of language itself, and especially the complexity of the languages of East Asian countries.
Language has sound and meaning; it is a combination of phonetics and semantics, and this is the basic characteristic of grammatical units. Taking Chinese as an example, the largest grammatical unit is the sentence, and the units smaller than the sentence are, in order, the phrase, the word, and the character. However, a single Chinese character often has no independent meaning. Therefore, when teaching Chinese, simply teaching individual characters is of little use; only teaching words composed of several characters is meaningful. From this perspective, the word is the smallest meaningful linguistic component that can function independently. It is the basic object handled when looking for new words while producing a language teaching material.
In English, a word is a single English word. In an English text, words are separated from one another by spaces, so different words can be segmented accurately by identifying the spaces in the text. This kind of segmentation, which needs no word segmentation technology, is relatively simple to implement. In East Asian languages such as Chinese and Japanese, there are no spaces between words, and the meaning of the whole sentence must be understood in order to distinguish the words accurately. In this case, accurate segmentation requires word segmentation technology.
Word segmentation technology is at the core of computer natural language processing. It is a technique for recombining a continuous sequence of characters into a sequence of words according to a given specification. Word segmentation technology has now reached a relatively mature stage and is widely used in machine translation and in large-scale information retrieval on the Internet. However, it has not yet been applied to the production of language teaching materials.
Summary of the invention
The object of the present invention is to provide a system and method that automatically recognize and mark the new words in a text, thereby reducing the labor required to produce language teaching materials.
To achieve the above object, the present invention adopts the following technical solutions:
A system for producing language teaching materials, comprising an input device and an output device, characterized in that it further comprises:
a word slicer for dividing the input original textbook content into words;
a judging device for judging whether a word in the original textbook has already appeared;
a prompting device for presenting the judgment result of the judging device to the user;
a selection device with which the user selects words as new words;
a storage device for storing the new words selected by the user;
wherein the input device, the word slicer, the judging device, and the prompting device are connected in sequence, and the storage device is connected to the judging device, the prompting device, the selection device, and the output device, respectively.
The word slicer has a space recognition unit capable of recognizing spaces.
Alternatively, the word slicer has a Chinese word segmentation unit for implementing Chinese word segmentation.
The word slicer includes a language discriminating unit, a space recognition unit, and a Chinese word segmentation unit; the language discriminating unit is connected to the space recognition unit and to the Chinese word segmentation unit, and one of the space recognition unit and the Chinese word segmentation unit is selected to provide the external output.
The word slicer further includes a word change type database and an analysis operator. The word change type database stores word change patterns with a high degree of commonness, and the analysis operator further processes the segmented word sequence against the word change type database.
The system further includes a sentence pattern analysis device for dividing the content of the original textbook into a word sequence and adding grammar and pronunciation information;
the sentence pattern analysis device is connected to the judging device.
The system further includes a network communication module, which connects the input device and the output device.
A method for producing language teaching materials, implemented by the system for producing language teaching materials according to claim 1, the system comprising an input device, an output device, a word slicer, a judging device, a prompting device, a selection device, and a storage device; characterized in that it comprises the following steps:
(1) the user inputs the content of one lesson of the original textbook into the system through the input device;
(2) the word slicer analyzes the input textbook content and obtains the segmented word information;
(3) the judging device compares the words obtained in step (2) with the words already saved in the storage device and judges whether each is a new word;
(4) the prompting device presents the new-word information to the user;
(5) the user selects, by means of the selection device, the words to be treated as new words, and the selected words are stored in the storage device;
(6) the result is output to the user through the output device.
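Steps (2) to (6) above can be sketched as a single function. This is a simplified illustration under assumptions: the names `produce_lesson`, `slicer`, and `choose` are invented, the stored vocabulary is a plain set, and words are compared by text only rather than by text, grammar, and pronunciation.

```python
def produce_lesson(lesson_text, slicer, stored_words, choose):
    """Slice the lesson, judge each word, let the user select, store, and return the result."""
    words = slicer(lesson_text)                  # step (2): word segmentation
    new_words = []
    for word in words:                           # step (3): compare with stored words
        if word not in stored_words and word not in new_words:
            new_words.append(word)
    selected = choose(new_words)                 # steps (4)-(5): prompt the user and take the selection
    stored_words.update(selected)                # step (5): store the selected words
    return {"words": words, "new_words": new_words, "selected": selected}  # step (6): output


# Step (1) is represented by the lesson text passed in; the callback stands in for the user.
vocabulary = {"I", "like"}
result = produce_lesson(
    "I like green tea",
    slicer=str.split,
    stored_words=vocabulary,
    choose=lambda candidates: [w for w in candidates if w != "green"],
)
```

Passing the slicer and the selection callback as parameters mirrors the separation of the word slicer and the selection device in the system, so either can be replaced without touching the rest of the flow.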
Wherein:
In step (2), before word segmentation is performed, the language of the teaching material is determined by identifying the ASCII codes of its characters, and the word segmentation scheme is chosen accordingly.
If the teaching material is in Chinese, the word slicer segments it with a Chinese word segmentation algorithm.
If the teaching material is in English, the word slicer segments it by identifying the spaces in its content.
In step (2), the word slicer further includes a word-variation database and an analysis operator. The word-variation database stores word variation patterns that are in common use. The analysis operator checks the word sequence produced by the conventional segmentation device against the word-variation database: if a segmented word is more commonly used when divided into smaller units in the database, the word is subdivided further; if several adjacent segmented words are more commonly used when joined as a larger unit, those words are merged.
In step (3), the judging device decides whether a word is a new word by testing whether three conditions hold simultaneously: the new word itself equals the existing word itself, the grammar information of the new word equals that of the existing word, and the pronunciation information of the new word equals that of the existing word. A word for which all three conditions hold has already appeared; otherwise it is a new word.
The storage device saves the production history of the teaching material with the word as the smallest unit of storage. The production history includes: the time at which the teaching material was produced, the personal information of the producer, the lesson to which the word belongs, the position at which the word appears in that lesson, and the grammar and pronunciation information of the word itself.
The system and method for producing language teaching material provided by the invention automatically recognize and mark the new words in a text. This resolves the confusion users experience about where words have appeared while producing teaching material, greatly reduces the time spent on this, and effectively lowers the probability of errors in the production process.
Brief Description of the Drawings
The invention is further described below with reference to the drawings and specific embodiments.
Fig. 1 is a structural block diagram of one embodiment of the language teaching material production system according to the invention.
Fig. 2 is a block diagram of the basic structure of a fully automatic Chinese word segmentation system known from the prior art.
Fig. 3 is a flow chart of segmentation correction by a word slicer that includes the word-variation database and the analysis operator.
Fig. 4 is a flow chart of the judging device determining whether the segmentation result contains new words.
Fig. 5 is an overall flow chart of the method for producing language teaching material according to the invention.
Detailed Description
The language teaching material referred to in the invention is not limited to teaching material in the usual sense. It also includes language game software, language learning software of all kinds, and other media that serve a language teaching function. As shown in Fig. 1, the language teaching material production system consists of the following components:
Input device 1, through which the user enters the content of one lesson of the original teaching material;
Word slicer 2, for dividing the entered content into words;
Judging device 3, which combines the result of word slicer 2 with the storage device to judge whether the words in the original teaching material have already appeared;
Prompting device 4, for presenting the result of the judging device to the user;
Storage device 5, for storing the new words selected by the user together with related information;
Selection device 6, through which the user selects specific words as new words;
Output device 7, for outputting the content of the edited lesson.
Input device 1, word slicer 2, judging device 3 and prompting device 4 are connected in sequence, and storage device 5 is connected to judging device 3, prompting device 4, selection device 6 and output device 7 respectively.
The language teaching material production system described above can be implemented on a computer. In that case, input device 1 and output device 7 can use the computer's human-machine interfaces such as the keyboard and the display, storage device 5 can be the computer's hard disk, selection device 6 can be the computer's mouse or keyboard, and prompting device 4 can be a peripheral such as the display or a loudspeaker.
The system can also be implemented on information terminal devices such as smartphones and PDAs. In that case, input device 1, prompting device 4, storage device 5, selection device 6 and output device 7 can each be realized by the corresponding functional unit of the smartphone or PDA, which need not be described one by one here.
In the invention, word slicer 2 and judging device 3 are the core functional components of the language teaching material production system. Each is described in detail below.
The role of word slicer 2 is to divide the text entered through input device 1 into words. It can be implemented in several ways.
As mentioned earlier, for European languages such as English, identifying the words in a text is relatively simple: the words can be segmented accurately just by identifying the spaces in the text. If the system is used only to produce English teaching material, the word slicer therefore needs nothing more than a space recognition unit that can find the spaces in the text, which is easy to implement in software.
For East Asian languages such as Chinese or Japanese, however, such a word slicer is of little practical use. Take Chinese as an example. Chinese is written character by character, and the characters of a sentence convey a meaning only when read together. The English sentence "I am a student", for instance, is "我是一个学生" in Chinese. A computer can easily tell from the spaces that "student" is a word, but it cannot easily tell that the two characters "学" and "生" form a word only in combination. Segmenting mechanically by spaces without taking this into account produces results with large errors. A Chinese word segmentation unit is therefore needed to divide a sequence of Chinese characters into meaningful words.
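The contrast drawn above can be illustrated with a minimal Python sketch of a space recognition unit. The function name `split_on_spaces` and the punctuation set are illustrative assumptions, not part of the described system.

```python
def split_on_spaces(text):
    """Split text at whitespace and trim common punctuation from each token.

    Hypothetical helper: the punctuation set below is an assumption chosen
    for the example, not something specified by the source.
    """
    words = []
    for token in text.split():
        cleaned = token.strip(".,;:!?\"'()")
        if cleaned:
            words.append(cleaned)
    return words


# Space-delimited text yields one token per word.
print(split_on_spaces("I am a student."))

# Chinese text has no spaces, so the whole sentence comes back as one token.
print(split_on_spaces("我是一个学生"))
```

The second call shows why splitting at spaces alone is not enough for Chinese: the sentence is returned unsegmented.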
At present, the main Chinese word segmentation algorithms fall into three broad classes: methods based on string matching, methods based on understanding, and methods based on statistics. The string-matching method, also called mechanical segmentation, matches the string of Chinese characters to be analyzed against the entries of a "sufficiently large" machine dictionary according to some strategy; if a string is found in the dictionary, the match succeeds and a word is recognized. The basic idea of the understanding-based method is to carry out syntactic and semantic analysis together with segmentation, using syntactic and semantic information to handle ambiguity. The statistics-based method counts how often combinations of adjacent characters co-occur in a corpus and computes their mutual information. It needs only frequency statistics over the character groups in the corpus and no segmentation dictionary, so it is also called dictionary-free segmentation or statistical word extraction.
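As an illustration of the string-matching class, the sketch below uses forward maximum matching, one common matching strategy. The source does not say which strategy is intended, so this choice, the function name `forward_max_match`, the `max_len` parameter and the toy dictionary are all assumptions.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text by repeatedly taking the longest dictionary entry.

    Single characters are accepted as a fallback so the scan always advances.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary, far smaller than the "sufficiently large" one the text calls for.
toy_dictionary = {"我", "是", "一个", "学生"}
print(forward_max_match("我是一个学生", toy_dictionary))
# → ['我', '是', '一个', '学生']
```

A real dictionary would be much larger, and a purely mechanical method of this kind does not resolve ambiguous segmentations.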
Chinese word segmentation technology is already fairly mature. The earliest Chinese word segmentation system was designed and implemented in 1983 by the Department of Computer Science of Beijing University of Aeronautics and Astronautics. The Chinese automatic word segmentation model of the State Language Commission (国家语委文字所) takes into account the role of syntactic analysis in automatic segmentation in order to resolve segmentation ambiguity better: the segmentation process considers all possible segmentations and uses information such as Chinese syntax to select a reasonable result from among them. Since the early 1990s, Microsoft Research has been developing NLPWin, a general-purpose multilingual language processing platform. Reportedly, the parsing component of NLPWin uses a bidirectional chart parser that applies grammar rules guided by a probabilistic model and keeps the grammar separate from the parser. Experimental results show that the system handles 85% of ambiguous segmentation fields correctly.
Fig. 2 shows the basic structure of a fully automatic Chinese word segmentation system. This system, the subject of Chinese invention patent No. 96100831.8, comprises: (1) a Chinese source-text input device; (2) a device that automatically breaks the text into sentences at sentence-final punctuation; (3) a node-structure generating device that converts the characters of a sentence into graph nodes; (4) an edge-solving device that determines word lengths and, while solving for edges, detects ambiguities and marks them accordingly; (5) an inference disambiguation device that resolves the marked ambiguities by reasoning with ambiguity rules; it contains an ambiguity rule base and a reduplicated-word rule device, and each disambiguation rule has the form: predecessor-edge attribute set, current-edge attribute set, context condition test, action function name; (6) a result output device that obtains the word segmentation structure for output by traversing the word segmentation path. The Chinese source-text input device triggers the automatic sentence-breaking device; the node-structure generating device converts the characters of each resulting sentence into graph nodes, forming the node sequence to be segmented, and passes it to the edge-solving device; the edge-solving device solves for edges over the node sequence; and the inference disambiguation device reasons over the resulting edges to obtain the segmented sentence, which is sent to the result output device.
Any of the Chinese word segmentation techniques described above can be used directly in the invention to implement the Chinese word segmentation unit.
As a mature product, the language teaching material production system should be able to recognize different language environments automatically and segment words accordingly: for European scripts such as English, no word segmentation technique is used and the words are segmented directly at the spaces in the text; for East Asian scripts such as Chinese, the appropriate word segmentation technique is applied. To achieve this, a language discrimination unit is added to word slicer 2. Before word segmentation is performed, this unit first identifies the language, which determines whether a word segmentation technique is needed and which one to use. The language discrimination unit is simple to implement: by examining the ASCII codes of the characters in the text entered through input device 1, it can determine accurately which language the text is in.
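A language discrimination unit of this kind might be sketched as follows. The source speaks of ASCII codes; because Chinese characters lie outside the ASCII range, this sketch assumes the text is available as Unicode and inspects code points instead. The function name `detect_script`, the labels it returns and the majority rule are illustrative assumptions. Note that the ideograph range alone does not distinguish Chinese from Japanese.

```python
def detect_script(text):
    """Classify text as space-delimited (ASCII letters) or ideographic.

    Assumption: characters in U+4E00..U+9FFF (CJK Unified Ideographs) mark
    text that needs a word segmentation algorithm; ASCII letters mark text
    that can be split at spaces.
    """
    ideographs = 0
    ascii_letters = 0
    for ch in text:
        code = ord(ch)
        if 0x4E00 <= code <= 0x9FFF:
            ideographs += 1
        elif code < 128 and ch.isalpha():
            ascii_letters += 1
    if ideographs == 0 and ascii_letters == 0:
        return "unknown"
    return "ideographic" if ideographs >= ascii_letters else "space_delimited"


print(detect_script("I am a student"))  # space_delimited
print(detect_script("我是一个学生"))      # ideographic
```

The returned label can then be used to choose between a space recognition unit and a Chinese word segmentation unit.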
In addition, to obtain better segmentation results, word slicer 2 can also include a word-variation database and an analysis operator. As shown in Fig. 3, the word-variation database stores word variation patterns that are in common use. The analysis operator checks the word sequence produced by the conventional segmentation device against the word-variation database: if a segmented word is more commonly used when divided into smaller units in the database, it is subdivided further; if several adjacent segmented words are more commonly used when joined as a larger unit, those words are merged. The segmentation is thus adjusted a second time according to the more common variation patterns, giving a more accurate result.
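The correction pass described above might look like the following sketch. Since the database is said to hold only commonly used patterns, the sketch treats presence in the database as "more commonly used" and does not model a numeric commonness score. The names `refine_segmentation`, `split_forms`, `merge_forms` and `max_merge` are hypothetical.

```python
def refine_segmentation(words, split_forms, merge_forms, max_merge=3):
    """Adjust a segmentation using a word-variation database.

    split_forms: maps a word to the smaller units it is commonly divided into.
    merge_forms: set of tuples of adjacent words commonly joined into one unit.
    """
    # Step 1: subdivide words that have a common split form.
    subdivided = []
    for word in words:
        subdivided.extend(split_forms.get(word, [word]))

    # Step 2: merge runs of adjacent words, trying the longest run first.
    refined = []
    i = 0
    while i < len(subdivided):
        for size in range(min(max_merge, len(subdivided) - i), 1, -1):
            run = tuple(subdivided[i:i + size])
            if run in merge_forms:
                refined.append("".join(run))  # plain concatenation suits Chinese text
                i += size
                break
        else:
            refined.append(subdivided[i])
            i += 1
    return refined


words = ["学", "生", "活动中心"]
split_forms = {"活动中心": ["活动", "中心"]}
merge_forms = {("学", "生")}
print(refine_segmentation(words, split_forms, merge_forms))
# → ['学生', '活动', '中心']
```

For a space-delimited language, the merged form would have to be joined with spaces rather than concatenated.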
The segmentation result obtained by word slicer 2 is passed to judging device 3, which compares the segmented words with the vocabulary already saved in storage device 5 and judges whether each word in the text has appeared before. The judging performed by judging device 3 can be implemented with a conventional lookup algorithm, as described further below.
Besides presenting the result of judging device 3 to the user, prompting device 4 can also draw on the word corpus already held in the storage device and display the rich information in that corpus to the user, further improving the quality of the edited teaching material.
Storage device 5 can save the production history of the teaching material, with the word as the smallest unit of storage. The production history contains: the time at which the teaching material was produced, the personal information of the producer, the lesson to which the word belongs, the position at which the word appears in that lesson, and the grammar and pronunciation information of the word itself.
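One possible shape for such a per-word history record is sketched below. The class name `WordRecord`, the field names and the field types are assumptions; the source lists only what information is kept, not how it is represented.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class WordRecord:
    """One stored word together with its production history (hypothetical layout)."""
    word: str             # the word itself
    grammar: str          # grammar information, e.g. part of speech
    pronunciation: str    # pronunciation information, e.g. pinyin
    lesson: str           # lesson to which the word belongs
    position: int         # position of the word within that lesson
    author: str           # personal information of the producer
    created_at: datetime  # time at which the teaching material was produced


record = WordRecord(
    word="学生",
    grammar="noun",
    pronunciation="xuésheng",
    lesson="Lesson 1",
    position=3,
    author="editor-01",
    created_at=datetime(2006, 7, 18),
)
print(record.word, record.lesson, record.position)
```

Making the record immutable lets it be used as a dictionary key or set member when looking words up later.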
In addition, to further improve the accuracy and speed of production, a sentence-pattern analysis device (not shown in the figures) can be added to the system. It is connected to judging device 3; it divides the content of the original teaching material into a word sequence and attaches grammar and pronunciation information. When judging device 3 makes its judgment, it considers each word together with its grammar and pronunciation information, which automatically eliminates unnecessary errors.
The basic flow of automatically recognizing and marking new words with this system is described below with reference to Fig. 5.
1. First, the user enters the content of one lesson of the original teaching material into the system through input device 1.
2. The system automatically invokes word slicer 2, which analyzes the entered content and produces the segmented word information.
3. As shown in Fig. 4, the system automatically invokes judging device 3, which compares the newly obtained words with the words already saved in storage device 5 and determines whether each word has appeared in the lessons already processed, how many times, and at which positions. In making this judgment, the word itself can be combined with its grammar information (part of speech and so on) and its pronunciation information (pinyin) for comparison with the saved words.
The words saved in storage device 5 (existing words) are words already used in other lessons of the teaching material. Their record format is [the word itself, grammar information, pronunciation information, position information]. The words produced by the word slicer (new words) have the format [the word itself, grammar information, pronunciation information]. The judging device searches the set of existing words using these three fields of the new word, with the search condition (new word itself = existing word itself) AND (grammar information of new word = grammar information of existing word) AND (pronunciation information of new word = pronunciation information of existing word). If an identical existing word is found, the new word has already appeared in another lesson of the teaching material; its flag is set to "already appeared" and the positions of its appearances are recorded. If no identical existing word is found, the new word has not yet appeared in any other lesson, and its flag is set to "never appeared".
4. Prompting device 4 then displays the "already appeared" words and the "never appeared" words of the original text in visibly different forms, and also shows the user information such as the positions at which the already-appeared words occur.
5. Based on the information shown, the user selects through selection device 6 the words to be treated as new words. The selected words are automatically added to storage device 5 in the format [the word itself, grammar information, pronunciation information, position information].
6. Finally, the result (a list of the words and related information) is automatically output to the user through output device 7.
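Steps 3 to 5 of the flow above can be sketched as a dictionary lookup keyed on the three fields that the search condition compares. The function names `judge` and `store_selected`, the tuple layout and the in-memory dictionary standing in for storage device 5 are assumptions made for illustration.

```python
def judge(words, stored):
    """Flag each (word, grammar, pronunciation) entry by whether it is already stored.

    stored maps such a triple to the list of (lesson, position) pairs where
    it has appeared; all three fields must match for a hit.
    """
    report = []
    for entry in words:
        positions = stored.get(entry, [])
        status = "already appeared" if positions else "never appeared"
        report.append((entry, status, positions))
    return report


def store_selected(report, chosen_indices, lesson, stored):
    """Save the entries the user selected, recording lesson and position."""
    for index in chosen_indices:
        entry = report[index][0]
        stored.setdefault(entry, []).append((lesson, index))


# Same characters with a different grammar and pronunciation count as a different word.
stored = {("行", "verb", "xíng"): [("Lesson 1", 2)]}
lesson_words = [("行", "verb", "xíng"), ("行", "noun", "háng")]

report = judge(lesson_words, stored)
for entry, status, positions in report:
    print(entry, status, positions)

store_selected(report, [1], "Lesson 2", stored)
print(stored[("行", "noun", "háng")])
```

Keying the lookup on the full triple means that a match on the characters alone is not enough, in line with the three-part AND condition in step 3.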
The user is thus spared the work of repeatedly checking whether a word has already appeared and where, which shortens the time needed to produce language teaching material.
The invention uses the word slicer to divide a text into a word sequence automatically, and uses the judging device and the storage device to find out automatically whether each word has already appeared and where. With this information, related statistics and analyses become possible, such as word distribution, frequency of occurrence and recurrence rate. In addition, the storage device and the judging device of the system can be used to determine the new-word rate of a piece of language material, which helps establish its difficulty level and helps users choose suitable material for study.
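The source does not define the new-word rate. The sketch below assumes it is the fraction of word tokens in a text that are not in the stored vocabulary, and the function name `new_word_rate` is hypothetical.

```python
def new_word_rate(words, known_words):
    """Return the share of tokens in words that are absent from known_words."""
    if not words:
        return 0.0
    unseen = sum(1 for word in words if word not in known_words)
    return unseen / len(words)


lesson = ["我", "是", "一个", "学生"]
known = {"我", "是"}
print(new_word_rate(lesson, known))  # 0.5
```

Under this assumed definition, a higher rate suggests more difficult material for a learner who knows the stored vocabulary.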
A network communication module usable on a local area network, a wide area network or the Internet can also be added to the system. It is connected to the input device and the output device. With it, the system can transmit results over the network and accept teaching resources from outside, so that teaching material can be published on the network automatically.
In a network environment, the system can provide an open teaching material format. Users can bring other external learning materials under unified management, which increases the variety and flexibility of the language teaching material. In addition, the storage device can store each user's online activity history separately from the content of the teaching material, which makes collaborative editing by several people possible. Each person's usage history can also be recorded individually, so that when the same teaching material is edited over the network, each person's tasks and progress can be told apart clearly, and commonly needed resources can be shared easily.
It should be noted that the system can produce teaching material for any language and is not limited to English, Chinese and Japanese as mentioned above. To produce teaching material for a different language, it is enough to replace the segmentation algorithm in word slicer 2. Segmentation algorithms for different languages are themselves mature prior art; for example, the widely used Google search engine employs several segmentation algorithms suited to different languages. All of these techniques can be applied directly to the invention.
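The system and method for producing language teaching material according to the invention have been described in detail above. Any obvious modification made to them by a person of ordinary skill in the art without departing from the essential spirit of the invention will constitute an infringement of the patent right of the invention and will incur the corresponding legal liability.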
The system and method for producing a language-based textbook according to the present invention have been described in detail above. Any obvious changes made to the invention without departing from the spirit of the invention will constitute an infringement of the patent right of the invention and will bear the corresponding legal responsibility.
Claims
1. A system for producing language teaching material, comprising an input device and an output device, characterized in that it further comprises:
a word slicer for dividing the entered content of the original teaching material into words;
a judging device for judging whether the words in the original teaching material have already appeared;
a prompting device for presenting the result of the judging device to the user;
a selection device through which the user selects specific words as new words;
a storage device for storing the new words selected by the user;
wherein the input device, the word slicer, the judging device and the prompting device are connected in sequence, and the storage device is connected to the judging device, the prompting device, the selection device and the output device respectively.
2. The system for producing language teaching material according to claim 1, characterized in that: the word slicer has a space recognition unit capable of recognizing spaces.
3. The system for producing language teaching material according to claim 1, characterized in that: the word slicer has a Chinese word segmentation unit for performing Chinese word segmentation.
4. The system for producing language teaching material according to claim 1, characterized in that: the word slicer comprises a language discrimination unit, a space recognition unit and a Chinese word segmentation unit; the language discrimination unit is connected to the space recognition unit and to the Chinese word segmentation unit; and either the space recognition unit or the Chinese word segmentation unit, but not both, provides the external output.
5. The system for producing language teaching material according to any one of claims 1 to 4, characterized in that:
the word slicer further comprises a word-variation database and an analysis operator;
the word-variation database stores word variation patterns that are in common use, and the analysis operator further processes the segmented word sequence by checking it against the word-variation database.
6. The system for producing language teaching material according to any one of claims 1 to 5, characterized in that:
the system further comprises a sentence-pattern analysis device for dividing the content of the original teaching material into a word sequence and attaching grammar and pronunciation information;
the sentence-pattern analysis device is connected to the judging device.
The sentence pattern analysis device is connected to the determination device.
7. The system for producing language teaching material according to any one of claims 1 to 6, characterized in that: the system further comprises a network communication module, which is connected to the input device and the output device.
8. A method for producing language teaching material, implemented by the system for producing language teaching material according to claim 1, the system comprising an input device, an output device, a word slicer, a judging device, a prompting device, a selection device and a storage device, characterized in that it comprises the following steps:
(1) the user enters the content of one lesson of the original teaching material into the system through the input device;
(2) the word slicer analyzes the entered content to obtain the segmented word information;
(3) the judging device compares the words obtained in step (2) with the words already saved in the storage device and judges whether each is a new word;
(4) the prompting device presents the new-word information to the user;
(5) the user selects, through the selection device, the words to be treated as new words, and the selected words are stored in the storage device;
(6) the result is output to the user through the output device.
9. 如权利要求 8所述的用于制作语言教材的方法, 其特征在于: 所述步骤 (2)中, 在进行单词切分之前, 通过识别文字所对应的 ASCII 码确定是何种语言的教材, 从而决定采用何种单词切分方案。 9. The method for producing a language teaching material according to claim 8, wherein: in the step (2), before the word segmentation, determining the language by identifying the ASCII code corresponding to the text The textbook, which determines which word segmentation scheme to use.
10. 如权利要求 8或 9所述的用于制作语言教材的方法,其特征在于: 如果教材是中文教材, 则单词切分器通过中文分词算法对单词进行切 分。 10. The method for producing a language teaching material according to claim 8 or 9, wherein: if the teaching material is a Chinese textbook, the word slicer segments the words by a Chinese word segmentation algorithm.
11. The method for producing language teaching material according to claim 8 or 9, characterized in that: if the teaching material is an English teaching material, the word slicer segments the words by identifying the spaces in the content.
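The two segmentation schemes of claims 10 and 11 can be illustrated as follows. The claims do not name a particular Chinese word segmentation algorithm, so forward maximum matching against a small dictionary is used here purely as an assumed example. The function names, the punctuation set, and the `max_len` parameter are hypothetical.

```python
from typing import List, Set


def segment_english(text: str) -> List[str]:
    """Split on spaces, then strip punctuation around each token."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.split())
    return [tok for tok in tokens if tok]


def segment_chinese(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: take the longest dictionary word at each position."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


print(segment_english("The cat sat, quietly."))
print(segment_chinese("我们学习中文", {"我们", "学习", "中文"}))
```

Characters that are not covered by the dictionary come out as single-character words, which is the usual fallback for this kind of matcher.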
12. The method for producing language teaching material according to claim 8, characterized in that: in step (2), the word slicer further comprises a word-variation database and an analysis operator; the word-variation database stores word variation forms that are in frequent use; the analysis operator checks the word sequence produced by the conventional word segmentation device against the word-variation database; if, according to the word-variation database, a segmented word is in more frequent use when divided into smaller units, the word is subdivided further; if, according to the word-variation database, several adjacent segmented words are in more frequent use when joined as a larger unit, those words are joined.
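The split-or-join refinement of claim 12 can be sketched as a second pass over the segmented words. The data layout below is an assumption made for illustration: `merge_db` maps a run of adjacent words to a usage count, `split_db` maps a word to its smaller units plus a usage count, and a fixed `threshold` decides what counts as frequent use. None of these names or values come from the application.

```python
from typing import Dict, List, Tuple


def refine(
    words: List[str],
    split_db: Dict[str, Tuple[List[str], int]],
    merge_db: Dict[Tuple[str, ...], int],
    threshold: int = 10,
    max_window: int = 3,
    joiner: str = "",
) -> List[str]:
    """Join or subdivide segmented words according to usage counts."""
    out: List[str] = []
    i = 0
    while i < len(words):
        merged = False
        # Join: prefer the longest run of adjacent words that is common as one unit.
        for size in range(min(max_window, len(words) - i), 1, -1):
            run = tuple(words[i:i + size])
            if merge_db.get(run, 0) >= threshold:
                out.append(joiner.join(run))
                i += size
                merged = True
                break
        if merged:
            continue
        # Split: subdivide a word when its smaller units are in common use.
        parts, count = split_db.get(words[i], ([words[i]], 0))
        out.extend(parts if count >= threshold else [words[i]])
        i += 1
    return out


merge_db = {("北京", "大学"): 50}
split_db = {"学生们": (["学生", "们"], 30)}
print(refine(["北京", "大学", "的", "学生们"], split_db, merge_db))
```

The empty default `joiner` suits Chinese; for a space-delimited language a caller would pass `joiner=" "`.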
13. The method for producing language teaching material according to claim 8, characterized in that: in step (3), the criterion by which the judging device determines whether a word is a new word is whether the following three conditions hold simultaneously: the word itself of the new word equals the word itself of an existing word, the grammatical information of the new word equals the grammatical information of the existing word, and the pronunciation information of the new word equals the pronunciation information of the existing word.
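The three-part comparison in claim 13 can be expressed as a small predicate. The sketch below adopts one reading of the claim, stated here as an assumption: a candidate counts as already known only when all three conditions hold against some stored entry, and is treated as a new word otherwise. The class `WordEntry` and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class WordEntry:
    text: str           # the word itself
    grammar: str        # grammatical information, e.g. part of speech
    pronunciation: str  # pronunciation information, e.g. pinyin


def matches(a: WordEntry, b: WordEntry) -> bool:
    """True only when all three conditions hold at the same time."""
    return (
        a.text == b.text
        and a.grammar == b.grammar
        and a.pronunciation == b.pronunciation
    )


def is_new_word(candidate: WordEntry, stored: Iterable[WordEntry]) -> bool:
    """Assumed reading: new unless some stored entry matches on all three fields."""
    return not any(matches(candidate, entry) for entry in stored)


stored = [WordEntry("行", "verb", "xíng")]
print(is_new_word(WordEntry("行", "noun", "háng"), stored))
print(is_new_word(WordEntry("行", "verb", "xíng"), stored))
```

Under this reading, the same written form with a different pronunciation or part of speech is offered to the user again as a candidate new word.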
14. The method for producing language teaching material according to claim 8, characterized in that: the storage device saves the production history of the teaching material with the word as the smallest unit of storage, the production history comprising: the time at which the teaching material was produced, the personal information of the producer, the lesson to which the word belongs, the position at which the word appears in that lesson, and the grammatical and pronunciation information of the word itself.
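A per-word history record with the fields listed in claim 14 might look like the following. This is only an illustrative data layout: the class name, the field names and types, the use of a character offset for the position, and the in-memory dictionary standing in for the storage device are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import DefaultDict, List


@dataclass
class WordHistoryRecord:
    word: str             # the word is the smallest unit of storage
    produced_at: datetime  # time at which the teaching material was produced
    producer: str         # personal information of the producer
    lesson: str           # lesson to which the word belongs
    position: int         # assumed: character offset of the word in the lesson
    grammar: str          # grammatical information of the word
    pronunciation: str    # pronunciation information of the word


# Hypothetical in-memory store: one list of history records per word.
history: DefaultDict[str, List[WordHistoryRecord]] = defaultdict(list)


def save_record(record: WordHistoryRecord) -> None:
    """Append a record under its word so the word stays the lookup key."""
    history[record.word].append(record)


save_record(
    WordHistoryRecord(
        word="学习",
        produced_at=datetime(2006, 7, 17, 9, 0),
        producer="teacher-01",
        lesson="Lesson 3",
        position=2,
        grammar="verb",
        pronunciation="xuéxí",
    )
)
print(len(history["学习"]))
```

Keying the store by word makes it straightforward to look up every lesson in which a given word was introduced or reused.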
15. The method for producing language teaching material according to claim 8, characterized in that: different users each carry out the method and share the results over a network.
16. The method for producing language teaching material according to claim 15, characterized in that: the storage device separately stores the online usage history of each user and the corresponding teaching material content.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2006/001722 WO2008017188A1 (en) | 2006-07-17 | 2006-07-17 | System and method for making teaching material of language class |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008017188A1 (en) | 2008-02-14 |
Family
ID=39032590
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2006/001722 WO2008017188A1 (en) | 2006-07-17 | 2006-07-17 | System and method for making teaching material of language class |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2008017188A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR3033430A1 (en) * | 2015-03-06 | 2016-09-09 | Pro Form'arts | METHOD, SERVER AND GRAMMATICAL LEARNING OR REMEDIATION SYSTEM |
CN109558596A (en) * | 2018-12-14 | 2019-04-02 | 平安城市建设科技(深圳)有限公司 | Recognition methods, device, terminal and computer readable storage medium |
CN112165627A (en) * | 2020-09-28 | 2021-01-01 | 腾讯科技(深圳)有限公司 | Information processing method, device, storage medium, terminal and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1142644A (en) * | 1996-06-27 | 1997-02-12 | 英业达股份有限公司 | Sentence pattern training instrument for language teaching |
JP2001305945A (en) * | 2000-04-18 | 2001-11-02 | Mitsuyuki Masaji | Learning system guiding process of translation |
KR20040044797A (en) * | 2002-11-22 | 2004-05-31 | 엘지전자 주식회사 | Apparatus and method for studying language using mobile |
2006-07-17 WO PCT/CN2006/001722 patent/WO2008017188A1/en active Application Filing
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108304375B (en) | Information identification method and equipment, storage medium and terminal thereof | |
CN104050256B (en) | Initiative study-based questioning and answering method and questioning and answering system adopting initiative study-based questioning and answering method | |
US10496756B2 (en) | Sentence creation system | |
EP1814047A1 (en) | Linguistic user interface | |
EP0971294A2 (en) | Method and apparatus for automated search and retrieval processing | |
JP2007122719A (en) | Automatic completion recommendation word provision system linking plurality of languages and method thereof | |
CN113505209A (en) | Intelligent question-answering system for automobile field | |
KR20120006489A (en) | Input method editor | |
CN110991180A (en) | Command identification method based on keywords and Word2Vec | |
CN105630770A (en) | Word segmentation phonetic transcription and ligature writing method and device based on SC grammar | |
JP2011118689A (en) | Retrieval method and system | |
CN113821593A (en) | Corpus processing method, related device and equipment | |
CN110321434A (en) | A kind of file classification method based on word sense disambiguation convolutional neural networks | |
US20220365956A1 (en) | Method and apparatus for generating patent summary information, and electronic device and medium | |
Antony et al. | A survey of advanced methods for efficient text summarization | |
JPH10207910A (en) | Related word dictionary preparing device | |
CN113361252A (en) | Text depression tendency detection system based on multi-modal features and emotion dictionary | |
CN103164398A (en) | Chinese-Uygur language electronic dictionary and automatic translating Chinese-Uygur language method thereof | |
CN103164397A (en) | Chinese-Kazakh electronic dictionary and automatic translating Chinese- Kazakh method thereof | |
WO2008017188A1 (en) | System and method for making teaching material of language class | |
CN111914533A (en) | Method and system for analyzing English long sentence | |
CN103164396A (en) | Chinese-Uygur language-Kazakh-Kirgiz language electronic dictionary and automatic translating Chinese-Uygur language-Kazakh-Kirgiz language method thereof | |
Fenogenova et al. | A general method applicable to the search for anglicisms in russian social network texts | |
CN103164395A (en) | Chinese-Kirgiz language electronic dictionary and automatic translating Chinese-Kirgiz language method thereof | |
CN109960720B (en) | Information extraction method for semi-structured text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 06761458; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| NENP | Non-entry into the national phase | Ref country code: RU |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 06761458; Country of ref document: EP; Kind code of ref document: A1 |