CN101576909A - Mongolian digital knowledge base system construction method - Google Patents



Publication number
CN101576909A
Authority
CN
China
Legal status
Pending
Application number
CNA2009100837496A
Other languages
Chinese (zh)
Inventor
苏雅拉图
白双成
巴图赛恒
六月
Current Assignee
INNER MONGOLIA MENKSOFT SOFTWARE CO Ltd
Original Assignee
INNER MONGOLIA MENKSOFT SOFTWARE CO Ltd
Application filed by INNER MONGOLIA MENKSOFT SOFTWARE CO Ltd
Priority to CNA2009100837496A
Publication of CN101576909A

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a method for constructing a Mongolian digital knowledge base system, comprising the following steps: acquiring Mongolian roots/stems, describing the knowledge attribute information related to each root/stem, and generating a root/stem knowledge processing field unit; acquiring the different forms of members of Mongolian roots/stems to form a member database; establishing a rule system that defines, for each root/stem, the members it can combine with, the rules for free combination among members, and the rules for encapsulation and nesting between members; generating an attribute field unit for constraining Mongolian word combination relations; and generating a statistical tool unit for real-time Mongolian statistics. The Mongolian digital knowledge base system dynamically acquires Mongolian information through known Mongolian keyboard input methods and Mongolian OCR recognition input, converts that information into rich Mongolian knowledge in real time, and provides technical support for the digital application, teaching, learning, research, and development of Mongolian.

Description

A method for constructing a Mongolian digital knowledge base system
Technical field
The present invention relates to a method for the digital knowledge processing of natural language writing, and in particular to a computer-controlled method for constructing a Mongolian digital knowledge base system, for use in the digital application, digital teaching, digital learning, digital research, and digital development of the Mongolian language and script.
Background art
Mongolian is an agglutinative language. Because of its linguistic nature, every word in the speech chain keeps changing dynamically according to the tense, form, mood, style, purpose, and other complex content it expresses. This is especially true of verbs: through various affixes and additional components, a single verb can give rise to thousands of different dynamic forms. What people see in a Mongolian dictionary is only the static dictionary form (equivalent to the terminal form in Japanese), whereas the dynamic expressive forms of Mongolian are variants that are hard to enumerate exhaustively. Chinese has no such dynamic variation; Western languages such as English have some, but it is simple and far less complex than in Mongolian. The closest comparison is Japanese, but the dynamic variation of Japanese words is easy to enumerate exhaustively, while that of Mongolian words is not. In this respect Mongolian is a special case among the world's living natural languages. To date, no linguist in China or abroad has exhaustively generated and counted these richly varied dynamic forms of Mongolian words, because no scientific and feasible method or means has existed.
Because the Mongolian computer keyboard input/output technology currently in use is not backed by a Mongolian digital knowledge base system, user input errors cannot be controlled. Mongolian phonetic, morphological, lexical, syntactic, and pragmatic information cannot be preserved during input and output according to the inherent natural structure of the language, nor converted in real time into Mongolian phonetic, morphological, lexical, syntactic, and pragmatic knowledge. As a result, the large volume of Mongolian electronic documents produced by input cannot be reused directly without repeated, complex processing.
Like the vocabulary of other natural languages, Mongolian vocabulary as a whole is a massive knowledge system made up of a set of N single words and N compound words (compound words in the broad sense, that is, non-single words). It is also a dynamic knowledge system that keeps changing and developing: as history evolves, some words fall out of use or become rare while new words are constantly coined. Until now, people have described this dynamic, massive knowledge system with the paper dictionary, an ancient recording tool. Because that tool is primitive and outdated, it can record and disseminate only the "past tense" of vocabulary, not its "present progressive tense" (that is, recording new words and new knowledge in real time as they appear, in step with the knowledge explosion). It can record and disseminate vocabulary only statically and in closed form, not dynamically and openly. It can record and disseminate vocabulary only through limited media, not through mass media. It can serve people only after publication, not in real time. And it can be compiled only by a small number of experts; experts from all trades, let alone the general public, cannot take part in collecting and compiling vocabulary.
Summary of the invention
The object of the invention is to provide a method for constructing a Mongolian digital knowledge base system. The system records and disseminates Mongolian vocabulary dynamically and openly, and dynamically acquires Mongolian information through known Mongolian keyboard input methods and Mongolian OCR recognition input. The words it generates are free of spelling (letter-combination) errors, which eliminates the laborious manual proofreading otherwise carried out across each group of Mongolian alphabetic characters. The invention thereby enables the digital application, teaching, learning, research, and development of Mongolian, so as to make full use of the digital computer as a tool for processing human knowledge.
To achieve the above object, the invention adopts the following technical solution:
A method for constructing a Mongolian digital knowledge base system, comprising the following steps:
S1: acquire Mongolian roots/stems, describe the knowledge attribute information related to each root/stem, and generate a root/stem knowledge processing field unit;
S2: acquire the different forms of members of Mongolian roots/stems to form a member database;
S3: establish a rule system that defines the members each root/stem can combine with, the rules for free combination among members, and the rules for encapsulation and nesting between members.
Preferably, the method further comprises, after step S3:
S4: generate an attribute field unit, composed of a Mongolian phrase knowledge description field, a syntactic knowledge description field, and an agent/patient knowledge description field, for constraining Mongolian word combination relations.
Preferably, the member database comprises an affix library, an agglutinated compound affix library, and a non-agglutinated compound affix library, and step S2 comprises the sub-steps of:
collecting Mongolian affixes into the affix library, which supplies the corresponding stem with objects for agglutinative attachment so as to generate the required words;
collecting agglutinated compound affixes into the agglutinated compound affix library, which supplies the corresponding stem with objects for agglutinative attachment so as to generate the required agglutinated compound words;
collecting non-agglutinated compound affixes into the non-agglutinated compound affix library, which supplies the corresponding compound root with objects for non-agglutinative attachment so as to generate the required non-agglutinated compound words.
Preferably, the member database further comprises a technical term library, a variant attachment component library, and a custom library, and step S2 further comprises the sub-steps of:
collecting the technical terms of the various disciplines of Mongolian mathematics, physics, chemistry, medicine, biology, and computer science into the technical term library;
collecting the variant attachment components of Mongolian into the variant attachment component library, which supplies data and rules for the knowledge processing of variant attachment components;
generating a custom library that is filled by the user and provides a tool for storing and generating the user's personalized words.
Preferably, the affix library, agglutinated compound affix library, non-agglutinated compound affix library, variant attachment component library, and custom library are continually expanded as needed.
Preferably, each group of rules in the rule system is written in the BDQ rule description language, which is made up of computer keyboard symbols: an uppercase English input code denotes a member database value type that can serve as an infix; a lowercase English input code denotes a member database value type that can serve as a suffix; the digits 0 to 9 denote sets of member database types that can serve as verb suffixes; a slash denotes an OR relation; parentheses denote the nesting of embedded member databases; an underscore denotes part of speech; and a # sign marks the end of one group of path combination rules and the start of the next.
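The symbol conventions above can be illustrated with a small token classifier. This is a minimal sketch, not the patented implementation: the function name and the returned labels are hypothetical, and it assumes that a token such as S1 (uppercase letter followed by a digit) counts as an uppercase code and that whatever follows the underscore is the part-of-speech tag.

```python
import re
from typing import Optional, Tuple


def classify_bdq_token(token: str) -> Tuple[str, Optional[str]]:
    """Return (kind, part_of_speech_tag) for a single BDQ token."""
    # The underscore separates the member type from its part-of-speech tag.
    name, _, pos_tag = token.partition("_")
    if name.isdigit():
        kind = "verb-suffix set"        # digits 0 to 9
    elif re.fullmatch(r"[A-Z][A-Z0-9]*", name):
        kind = "infix member type"      # uppercase input code
    elif re.fullmatch(r"[a-z][a-z0-9]*", name):
        kind = "suffix member type"     # lowercase input code
    else:
        raise ValueError(f"not a recognisable BDQ token: {token!r}")
    return kind, (pos_tag or None)


# Tokens of the kind that appear in the rule example of Table 3.
for tok in ["GL", "S1", "ti_tm", "g_n", "3"]:
    print(tok, classify_bdq_token(tok))
```

The slash, parentheses, and # sign are structural symbols rather than tokens, so they are left to a parser.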
Preferably, each group of rules in the rule system allows the stem to combine with members from several member databases of different types, with generation following a multi-way tree structure.
Preferably, the method further comprises the step of:
S5: generate a series of statistical tool units that take the various Mongolian language elements and their combined forms as statistical units and perform real-time Mongolian statistics.
The invention also provides a Mongolian digital knowledge base system, comprising:
a knowledge processing field unit, which supplies the digital computer with Mongolian roots/stems and describes the knowledge attribute information related to each Mongolian root/stem;
a member database, which collects the different forms of members of Mongolian roots/stems;
a rule system, which defines the members each root/stem can combine with, the rules for free combination among members, and the rules for encapsulation and nesting between members;
an attribute field unit, composed of a Mongolian phrase knowledge description field, a syntactic knowledge description field, and an agent/patient knowledge description field, for constraining Mongolian word combination relations.
Preferably, the system further comprises:
a series of statistical tool units, which take the various Mongolian language elements and their combined forms as statistical units and perform real-time Mongolian statistics.
The method for constructing a Mongolian digital knowledge base system provided by the invention has the following technical effects:
1) it controls user input errors and ensures that output words contain no spelling (letter-combination) errors or morphological construction errors, so that no manual proofreading is needed;
2) it preserves the inherent natural phonetic information and structure of Mongolian and converts them in real time into rich Mongolian phonetic knowledge that is computable and reusable;
3) it preserves the inherent natural morphological structure of Mongolian and converts it in real time into rich Mongolian morphological knowledge that is computable and reusable;
4) it preserves the inherent lexical composition information of Mongolian and converts it in real time into rich Mongolian lexical knowledge that is computable and reusable;
5) it preserves the inherent word combination information and phrase relation knowledge of Mongolian, making them computable and reusable;
6) it supports the paperless application, learning, teaching, research, and development of the massive Mongolian vocabulary, so that Mongolian can be applied, taught, learned, researched, and developed digitally, making full use of the digital computer as a tool for processing human knowledge.
Brief description of the drawings
Fig. 1 is a flowchart of the Mongolian digital knowledge base system construction method of the invention;
Fig. 2 is a flowchart of the Mongolian word input method of the invention;
Fig. 3 is a flowchart of the Mongolian word input method in an embodiment of the invention.
Detailed description of the embodiments
The Mongolian digital knowledge base system construction method proposed by the invention is described below with reference to the drawings and embodiments.
The Mongolian digital knowledge base system built by the construction method of the invention comprises: a Mongolian root/stem knowledge description unit; a Mongolian member database connected to it; a rule system that determines the members of each root/stem and how they are selected; an attribute field unit for describing Mongolian word combination relations; and a series of statistical tool units for real-time Mongolian statistics. Using the system built by the method, Mongolian words or phrases are acquired dynamically through known Mongolian keyboard input methods and Mongolian OCR (Optical Character Recognition) input and are converted in real time into rich Mongolian knowledge, providing technical support for the digital application, teaching, learning, research, and development of Mongolian.
Embodiment 1
As shown in Fig. 1, the Mongolian digital knowledge base system construction method comprises the following steps:
S1: acquire Mongolian roots/stems, describe the knowledge attribute information related to each root/stem, and generate a root/stem knowledge processing field unit. Table 1 shows the structure of the Mongolian root/stem knowledge processing field unit, in which the input code stands for the corresponding root or stem, and code value A and code value B stand for its different forms, such as the upper form and the independent form, together with the other related knowledge attribute information.
Table 1. Mongolian root/stem knowledge processing field unit
Sequence number | Input code | Code value A | Code value B | Descriptor code | Part of speech | Word class | Written pronunciation | Spoken pronunciation | Near-synonyms | Antonyms | Definition | Foreign-language translation | ... | Any other attribute Mongolian linguists consider necessary
1 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
...
N | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
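The columns of Table 1 can be read as a record type keyed by input code. The sketch below is one possible in-memory layout, assuming a simple dictionary lookup; the class names, field names, and the sample row are hypothetical placeholders, not data from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StemRecord:
    """One row of the root/stem field unit (fields follow the Table 1 columns)."""
    input_code: str
    code_value_a: str
    code_value_b: str
    descriptor_code: str = ""
    part_of_speech: str = ""
    written_pronunciation: str = ""
    spoken_pronunciation: str = ""
    near_synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)
    definition: str = ""
    translations: Dict[str, str] = field(default_factory=dict)
    # Table 1 is open-ended: any further attribute a linguist needs goes here.
    extra: Dict[str, str] = field(default_factory=dict)


class StemFieldUnit:
    """Maps an input code to its root/stem record."""

    def __init__(self) -> None:
        self._rows: Dict[str, StemRecord] = {}

    def add(self, record: StemRecord) -> None:
        self._rows[record.input_code] = record

    def lookup(self, input_code: str) -> Optional[StemRecord]:
        return self._rows.get(input_code)


unit = StemFieldUnit()
# Placeholder row: the code and forms are invented for illustration only.
unit.add(StemRecord("ab", "<form A>", "<form B>", part_of_speech="verb"))
print(unit.lookup("ab"))
print(unit.lookup("zz"))  # unknown code
```

Keeping an open `extra` mapping mirrors the last column of Table 1, which leaves room for attributes that are not fixed in advance.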
The root/stem is the source from which Mongolian knowledge arises. Centuries of research by Mongolian linguists in China and abroad have provided a scientific basis for correctly building a Mongolian root/stem database, correctly decomposing Mongolian stems, and scientifically describing the knowledge attributes of Mongolian roots/stems. On that basis, this embodiment builds a human-computer interactive knowledge description field unit capable of describing the knowledge attributes of Mongolian roots/stems. During input, it supplies the corresponding Mongolian root/stem for the entered code and describes that root/stem's knowledge attribute information in detail. The knowledge attribute information includes part of speech, word class (including verb transitivity features), written pronunciation, spoken pronunciation, near-synonyms, antonyms, definition, foreign-language translation, and any other attribute that Mongolian linguists consider necessary;
S2: acquire the different forms of members of Mongolian roots/stems to form a member database. Table 2 shows the structure of the various member databases built in this embodiment.
Table 2. Mongolian word member database structure
Figure A20091008374900111
The "members" of a word are the constituent elements of a Mongolian word. The invention gathers the rich variety of Mongolian members into a single "member database", which is further divided into member libraries of several value types: the affix library, the agglutinated compound affix library, the non-agglutinated compound affix library, the custom library, the technical term library, and the variant attachment component library. Step S2 comprises the sub-steps of:
collecting the many Mongolian affixes into the affix library, which supplies the corresponding stem with objects for agglutinative attachment so as to generate the words needed;
collecting the many Mongolian agglutinated compound affixes (generally used in personal names and place names; for example, the word Figure A20091008374900112 is formed by writing the two independent words Figure A20091008374900113 joined together, the former being called the agglutinated compound stem and the latter the agglutinated compound affix) into the agglutinated compound affix library, which supplies the corresponding stem with objects for agglutinative attachment so as to generate the agglutinated compound words needed;
collecting the many Mongolian non-agglutinated compound affixes (the group of words following the first word of a compound is called the affix) into the non-agglutinated compound affix library, which supplies the corresponding compound root (the first word of a compound) with objects for non-agglutinative attachment so as to generate the non-agglutinated compound words needed;
collecting the technical terms of the various disciplines such as Mongolian mathematics, physics, chemistry, medicine, biology, and computer science into the technical term library, which supplies data for the knowledge processing of technical terms and provides a tool for storing and generating personalized words (its content is filled entirely at the user's discretion);
generating a custom library whose content is filled entirely at the user's discretion, providing a tool for storing and generating personalized words;
collecting the variant attachment components of Mongolian into the variant attachment component library, which supplies data and rules for the knowledge processing of variant attachment components and supplies roots/stems with the variant attachment components they require.
Experts may also keep expanding the various libraries according to their own needs, provided the knowledge remains computable.
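The member database described above can be pictured as a set of typed, independently extensible libraries. The following is a minimal sketch under that reading; the enum names, the code-to-form mapping, and the sample entries are assumptions made for illustration and do not come from Table 2.

```python
from enum import Enum
from typing import Dict, Optional


class Library(Enum):
    AFFIX = "affix"
    AGGLUTINATED_COMPOUND_AFFIX = "agglutinated compound affix"
    NON_AGGLUTINATED_COMPOUND_AFFIX = "non-agglutinated compound affix"
    TECHNICAL_TERM = "technical term"
    VARIANT_ATTACHMENT = "variant attachment component"
    CUSTOM = "custom"


class MemberDatabase:
    """Holds one code -> written form mapping per member library."""

    def __init__(self) -> None:
        self._libs: Dict[Library, Dict[str, str]] = {lib: {} for lib in Library}

    def add(self, lib: Library, code: str, form: str) -> None:
        # Libraries stay open: entries can be added at any time.
        self._libs[lib][code] = form

    def get(self, lib: Library, code: str) -> Optional[str]:
        return self._libs[lib].get(code)

    def size(self, lib: Library) -> int:
        return len(self._libs[lib])


db = MemberDatabase()
db.add(Library.AFFIX, "g_n", "<affix form>")      # placeholder entry
db.add(Library.CUSTOM, "my1", "<user word>")      # user-filled library
print(db.get(Library.AFFIX, "g_n"), db.size(Library.CUSTOM))
```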
S3: establish a rule system that defines the members each root/stem can combine with, the rules for free combination among members, and the rules for encapsulation and nesting between members.
Each Mongolian root/stem is governed by a corresponding member combination rule. The rule system tells the digital computer which members the stem can combine with, the rules for free combination among members, and the rules for encapsulation and nesting between members. Because each Mongolian stem combines with members in its own way, the member combination rules differ from stem to stem, as do their lengths. All combination rules must be written out one by one by Mongolian linguists drawing on their knowledge of Mongolian linguistics. To let linguists describe the word generation rules of every Mongolian stem in detail, this embodiment proposes BDQ, a human-computer interactive rule description language. Each group of rules in the rule system is written in BDQ, which is entered and output entirely through the computer keyboard. BDQ is written with known computer keyboard symbols: the uppercase English letters of the keyboard input codes of Mongolian words, the lowercase English letters of those input codes, the digits 0 to 9, the slash, parentheses, the underscore (part of speech), and the # sign. An uppercase English input code denotes a member database value type that can serve as an infix, where "value type" means one of the kinds of member database listed above. A lowercase English input code denotes a member database value type that can serve as a suffix. The digits 0 to 9 denote sets of member database type libraries that can serve as verb suffixes. A slash denotes an OR relation. Parentheses denote the nesting of embedded member database value types. An underscore denotes part of speech. A # sign marks the end of one group of path combination rules and the start of the next, which ensures that a stem can generate legal dynamically varying words through multiple tree branches. Table 3 shows the member combination rule of one root/stem; the generation rule allows the root/stem to combine with members from several different member libraries, with generation following a full multi-way tree structure.
Table 3. Example of a word generation rule written in the BDQ rule description language
ID+ 1/2/3/4/g_n/e_tm/br_n/mhi_tm/mtgi_tm/ti_tm/tz_n/lk_tm/gr_n/vg_n/ksa_n/TI(hz_tm)/ta_n/L/GL/VQ/QH/JG/EH/KD(GV/GL/mv_tm)#BR(ti_tm/TI(hz_tm))#ph_n/PH(L(ksa_n)/R/S1/J(GV/GL/QH/V(ti_tm/t_tm/tz_n/ta_n))/ti_tm/t_tm/tz_n/ta_n/TI(hz_tm))#G2(L/S1(mv_tm/mk_n/L)/z_n/t_tm/ti_tm)#GV(g_n)#E1(S1/J/L/sk_tm/qa1_n/qv_n)#s_n/S2(GV/GL/QH/tz_n/z_n/L(GV/GL/QH))……N
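A rule of the kind shown in Table 3 can be turned into a tree with a small recursive-descent parser. The sketch below is an assumption-laden illustration, not the patent's own parser: it assumes the grammar "groups separated by #, alternatives separated by /, optional nested alternatives in parentheses", it treats spaces inside the rule text as layout debris and strips them, and it does not handle the leading "ID+" header or the trailing ellipsis. The names `Node` and `parse_bdq` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    token: str
    children: List["Node"] = field(default_factory=list)


TOKEN = re.compile(r"[A-Za-z0-9_]+")


def parse_bdq(text: str) -> List[List[Node]]:
    """Parse a BDQ rule fragment into '#'-separated groups of alternative trees."""
    src = re.sub(r"\s+", "", text)  # assumption: whitespace is not significant
    pos = 0

    def parse_alternatives() -> List[Node]:
        nonlocal pos
        nodes = [parse_node()]
        while pos < len(src) and src[pos] == "/":  # slash = OR relation
            pos += 1
            nodes.append(parse_node())
        return nodes

    def parse_node() -> Node:
        nonlocal pos
        match = TOKEN.match(src, pos)
        if not match:
            raise ValueError(f"expected a token at offset {pos}")
        pos = match.end()
        node = Node(match.group())
        if pos < len(src) and src[pos] == "(":     # parentheses = nesting
            pos += 1
            node.children = parse_alternatives()
            if pos >= len(src) or src[pos] != ")":
                raise ValueError(f"missing ')' at offset {pos}")
            pos += 1
        return node

    groups = [parse_alternatives()]
    while pos < len(src) and src[pos] == "#":      # '#' ends one rule group
        pos += 1
        groups.append(parse_alternatives())
    if pos != len(src):
        raise ValueError(f"unexpected {src[pos]!r} at offset {pos}")
    return groups


groups = parse_bdq("BR(ti_tm/TI(hz_tm))#GV(g_n)")
for group in groups:
    print([(n.token, [c.token for c in n.children]) for n in group])
```

Each parsed group is a list of sibling trees, which matches the multi-way tree reading of the rules given in the text.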
S4: generate an attribute field unit composed of a Mongolian phrase knowledge description field, a syntactic knowledge description field (including punctuation attribute feature marks), and an agent/patient knowledge description field, for constraining Mongolian word combination relations.
As shown in Table 1, the main purpose of the phrase knowledge description field of this attribute field unit is to provide a tool for describing the phrase-structure relations and rules of the words concerned when, after a root/stem has been combined with members under its rule, a phrase made up of several words is obtained. The purpose of the syntactic knowledge description field is to provide a tool for describing the syntactic relations and rules of the words concerned after a root/stem has been combined with members under its rule. The main purpose of the agent/patient knowledge description field is to provide a tool for describing the agent/patient relations and rules of the words concerned after a root/stem has been combined with members under its rule. The attribute annotation language and rule description language of these fields are defined by Mongolian linguists themselves, but the symbols used must be known computer keyboard symbols or combinations of them.
The Mongolian statistical tool series serves the varied needs of Mongolian users, learners, teachers, researchers, and developers: taking the various Mongolian language elements and their combined forms as statistical units, it computes word frequencies in real time to determine how commonly each word is used.
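Real-time word frequency statistics of this kind can be sketched as a running counter that is updated each time a word is input or output. This is an illustrative sketch only; the class name and methods are hypothetical, and the "words" are placeholder strings rather than Mongolian data.

```python
from collections import Counter
from typing import List, Tuple


class RunningWordStats:
    """Keeps word counts up to date as words arrive one at a time."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()
        self._total = 0

    def add(self, word: str) -> None:
        self._counts[word] += 1
        self._total += 1

    def top(self, n: int) -> List[Tuple[str, int, float]]:
        # Returns (word, count, relative frequency), most common first.
        return [(w, c, c / self._total) for w, c in self._counts.most_common(n)]


stats = RunningWordStats()
for word in ["w1", "w2", "w1", "w3", "w1", "w2"]:
    stats.add(word)
print(stats.top(2))
```

The same counter could take any other statistical unit named in the text, such as a stem, an affix, or a combination of members, in place of whole words.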
Embodiment 2
Before the invention is discussed, some general background on a typical digital computer is given. A typical digital computer consists of three main units: (a) a central processing unit (CPU); (b) memory; and (c) a number of I/O ports. Memory stores instructions and data; an instruction is a piece of coded information that directs the actions of the CPU. A logically related group of instructions stored in memory is called a program. The CPU reads each instruction from memory in logical order and uses it to start a processing operation. If the instruction sequence is coherent and logical, processing the program produces intelligible and useful results. Memory is also used to store the instructions that direct operations and the data to be operated on. The program must be structured so that the CPU does not read a non-instruction word when it expects an instruction.
As shown in Figs. 2 and 3, Mongolian word input in this embodiment comprises the following steps:
S21: enter the code of one complete word using a Mongolian code input method;
Any known Mongolian phonetic fuzzy-code input method, phoneme-code input method, glyph-code input method, or OCR recognition input method may be used. Whether a Mongolian keyboard input method or Mongolian OCR recognition input is used, a complete Mongolian word is required as the unit of input and output.
S22: from the entered code, use the root/stem knowledge processing field unit to obtain the root/stem that was input and its related knowledge attribute information;
After the code of a complete word has been entered and the root/stem has been obtained by an existing method, the digital computer begins a traversal starting from the Mongolian root/stem. As shown in Figure 6, it first determines whether the description of the root/stem's knowledge attribute information is complete. If it is complete, the attribute information of the root/stem is retrieved; if not, knowledge processing is carried out by the root/stem knowledge processing field unit, specifically to obtain the knowledge attribute information of the root/stem supplied in advance in Table 1.
S23: using the rule system, obtain from the root/stem found in step S22 and its related knowledge attribute information the members that the root/stem can combine with, the rules for free combination among members, and the rules for encapsulation and nesting between members;
After the knowledge attribute information of the root/stem supplied in advance in Table 1 has been obtained, the rule system is consulted for the member combination rule of that root/stem, as shown in Table 3 (this rule contains the members the root/stem can combine with, the rules for free combination among members, and the rules for encapsulation and nesting between members).
S24: according to the rules retrieved in step S23, select members from the member database among the combinable members obtained, and combine them with the root/stem obtained in step S22, following the free-combination rules among members and the encapsulation and nesting rules between members obtained in step S23, to generate the word.
S25: use the phrase knowledge description field, syntactic knowledge description field, and agent/patient knowledge description field of the attribute field unit to constrain word combination relations, and output the result for the user to select, completing the input of the Mongolian word.
If no code value can be output in step S25, an input error has occurred and should be corrected at once. If input error has been ruled out entirely, the problem lies in the knowledge base system itself: either the corresponding stem has not been provided, or not enough corresponding members have been provided, or the corresponding combination rules do not give enough coverage. After many rounds of repeated testing, the knowledge base system has reached full maturity, so such problems are rare; where they do occur, it is only under very rarely used conditions and generally not under common ones. Moreover, even if such a case arises, the natural structure and openness of the knowledge base system make correction very convenient, and any Mongolian linguistics expert responsible for the linguistic knowledge engineering can resolve it.
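The control flow of steps S22 to S24, together with the failure case just described, can be sketched as follows. This is a deliberately simplified illustration under several assumptions: the typed code is taken to be already segmented into a stem code and member codes, each rule is flattened to a set of permitted member sequences rather than a BDQ tree, the S25 filtering by the attribute field unit is omitted, and all codes and written forms are invented placeholders.

```python
from typing import Dict, List, Set, Tuple


class InputError(Exception):
    """Raised when no word can be output for the typed codes."""


# Toy stand-ins for Tables 1 to 3; every code and form is a placeholder.
STEMS: Dict[str, str] = {"ab": "STEM"}
MEMBERS: Dict[str, str] = {"g_n": "-x", "ti_tm": "-y"}
RULES: Dict[str, Set[Tuple[str, ...]]] = {
    "ab": {(), ("g_n",), ("g_n", "ti_tm")},
}


def generate_word(stem_code: str, member_codes: List[str]) -> str:
    # S22: look up the root/stem for the entered code.
    stem = STEMS.get(stem_code)
    if stem is None:
        raise InputError(f"no stem for code {stem_code!r}")
    # S23: fetch the combination rule that governs this stem.
    allowed = RULES.get(stem_code, set())
    if tuple(member_codes) not in allowed:
        raise InputError(f"combination {member_codes} not allowed for {stem_code!r}")
    # S24: pick the members from the member database and combine them.
    try:
        forms = [MEMBERS[code] for code in member_codes]
    except KeyError as exc:
        raise InputError(f"unknown member {exc.args[0]!r}") from exc
    return stem + "".join(forms)


print(generate_word("ab", ["g_n", "ti_tm"]))
try:
    generate_word("ab", ["ti_tm"])
except InputError as err:
    print("rejected:", err)
```

Because a word is only ever assembled from stored stems and members along a permitted combination, a mistyped code is rejected instead of producing a misspelled word.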
In keeping with the natural properties and characteristics of Mongolian, a Mongolian root can directly form a complete word (for example: Figure A20091008374900151), a stem combined with one member can form a complete word (for example: ), and a stem combined with several members in succession can form a complete word (for example: Figure A20091008374900153). The ways in which Mongolian words can be dynamically combined and generated are thus rich and varied. To reflect this, the generation rules of Table 3 in this embodiment allow a root/stem to combine with members from several member databases of different types, with generation following a full multi-way tree structure. Words generated in this way are controlled by the data structures of Tables 1 and 2 and governed by the combination rules of Table 3. Even if the operator makes a typing mistake, no spelling (letter-combination) error can appear in the output, just as Chinese keyboard input technology cannot produce a wrong stroke or a mismatched component inside a Chinese character. This eliminates the laborious manual proofreading otherwise carried out across each group of Mongolian alphabetic characters.
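The multi-way tree reading of a rule can be made concrete by checking a candidate member sequence against the tree and by listing every root-to-leaf path. The sketch below hand-transcribes the G2(...) group of Table 3 into nested dictionaries; the function names are hypothetical, and whether a path may legally stop at an inner node is not stated in the text, so `is_legal_path` only checks that each step is permitted.

```python
from typing import Any, Dict, Iterator, Sequence, Tuple

# Hand-transcribed from the "G2(L/S1(mv_tm/mk_n/L)/z_n/t_tm/ti_tm)" group of Table 3.
RULE_TREE: Dict[str, Any] = {
    "G2": {
        "L": {},
        "S1": {"mv_tm": {}, "mk_n": {}, "L": {}},
        "z_n": {},
        "t_tm": {},
        "ti_tm": {},
    }
}


def is_legal_path(tree: Dict[str, Any], path: Sequence[str]) -> bool:
    """True if every step of the path is a permitted child of the previous step."""
    node = tree
    for step in path:
        if step not in node:
            return False
        node = node[step]
    return True


def enumerate_paths(tree: Dict[str, Any],
                    prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield every root-to-leaf path of the rule tree."""
    for token, children in tree.items():
        here = prefix + (token,)
        if children:
            yield from enumerate_paths(children, here)
        else:
            yield here


print(is_legal_path(RULE_TREE, ["G2", "S1", "mk_n"]))
print(is_legal_path(RULE_TREE, ["G2", "mk_n"]))
for path in enumerate_paths(RULE_TREE):
    print(" -> ".join(path))
```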
The smallest phonetic unit of Mongolian is the phoneme. Phonemes are of two kinds, vowel phonemes and consonant phonemes; vowel phonemes are represented by vowel letters and consonant phonemes by consonant letters. Because Mongolian is written in a joined script, a Mongolian phoneme is generally realized by three different forms: a word-initial form (letter), a word-medial form (letter), and a word-final form (letter). To keep the joins between letters visually well formed, there are also several variant forms (letters) within the same positional form. As a result, the same Mongolian letter has several variants word-initially, several word-medially, and several word-finally. The next larger phonetic unit of Mongolian is the syllable. A Mongolian syllable contains at least one vowel phoneme; a consonant phoneme must join with a vowel phoneme to form a syllable. A Mongolian word consists of at least one syllable and generally of several. The present invention gives the phonetic information of each Mongolian word by means of Mongolian code values, split phonetic annotation, and combined phonetic annotation. "Code value" refers to the Mongolian characters. "Split phonetic annotation" means phonetic annotation carried out separately within each root, stem, and affix. "Combined phonetic annotation" means the secondary annotation made after the sound changes that occur when the components combine into a word. This method meets users' needs in learning, applying, teaching, researching, and developing Mongolian phonetic knowledge, and converts the input in real time into rich Mongolian phonetic knowledge that is computable and reusable.
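The three positional forms of a letter can be illustrated with a small lookup. This is only a sketch under stated assumptions: the `FORMS` table, its placeholder strings, and the function `positional_forms` are hypothetical, the variant forms within one position are not modeled, and a one-letter word is simply treated as word-initial.

```python
from typing import Dict, List

# Hypothetical table: letter -> positional forms.
# The strings are placeholders, not real Mongolian glyphs.
FORMS: Dict[str, Dict[str, str]] = {
    "a": {"initial": "a.ini", "medial": "a.med", "final": "a.fin"},
    "n": {"initial": "n.ini", "medial": "n.med", "final": "n.fin"},
}


def positional_forms(letters: List[str]) -> List[str]:
    """Pick the initial, medial, or final form of each letter by position."""
    result = []
    last = len(letters) - 1
    for index, letter in enumerate(letters):
        if index == 0:
            position = "initial"  # simplification: also used for one-letter words
        elif index == last:
            position = "final"
        else:
            position = "medial"
        result.append(FORMS[letter][position])
    return result


print(positional_forms(["a", "n", "a"]))
```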
As described above, the free generation of Mongolian words is governed by the data structures of Table 1 and Table 2 and by the combination rules of Table 3, which guarantees that the morphological structure of any word is natural and correct. Therefore, whenever the user wishes to know the morphological structure of any input or output word, the system displays in real time the word's root knowledge, stem knowledge, affix knowledge, the inflectional changes in its morphological structure, knowledge of the related additional components, and so on, ensuring that Mongolian morphological knowledge is computable and reusable.
As described above, the generation rules of Table 3 allow a stem to combine members from several different member databases and to generate words entirely as a multiway tree structure. In effect, this provides, in theory and in method, the technical means to exhaustively generate the dynamic variant forms of every Mongolian stem. By this means, depending on the purpose, one can exhaustively generate all the dynamic variant forms specified by the combination rules of a given stem, and can also exhaustively generate the full mass of dynamic variant forms specified by the combination rules of all Mongolian stems, making them computable and reusable. For example, the results of this dynamic word generation provide correct data-matching parameters for the various machine proofreading systems for Mongolian, which is of scientific importance for developing automatic machine proofreading systems and the like.
Owing to the natural properties and characteristics of Mongolian, the language has many words that are identical in form and sound but differ in word class, for example:
Figure A20091008374900171
(verb, meaning "to eat") //
Figure A20091008374900172
(noun, meaning "strength"). The present invention takes Mongolian roots/stems as the source from which Mongolian knowledge is generated, and each root/stem is selected separately by part of speech (without division by part of speech, the structure and meaning of Mongolian phrases and sentences could not be described later). This necessarily makes the input code, phonetic description code, and code value of such roots/stems completely identical, so that they can be distinguished only by part of speech. This phenomenon raises the probability that Mongolian keyboard input and OCR recognition produce large numbers of duplicate-code words, increases the workload of manually selecting among them for output, and thus reduces the efficiency of text input and output. With the collocation relation rules of the phrase knowledge description field of this invention (for example, the collocation words of a duplicate-code word are supplied in real time in the order "following collocation word" first and "preceding collocation word" second), no manual selection is needed when a duplicate-code word is encountered: the user continues by inputting the word that follows it, and the system completes the selection automatically by the collocation relation rules, ensuring that the correct selection rate for duplicate-code words reaches 98% or more.
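A minimal sketch of the collocation-based selection described above follows. The data layout is an assumption: the `Candidate` class, the `CANDIDATES` table, the placeholder codes, and the function `select_candidate` are hypothetical, and only the "following collocation word" check is modeled, not the later "preceding collocation word" check.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Candidate:
    """One reading of a duplicate-code word (hypothetical structure)."""
    word_id: str
    part_of_speech: str
    following: Set[str] = field(default_factory=set)  # allowed next words


# Two readings share the input code "CODE1" and differ only in word class.
CANDIDATES: Dict[str, List[Candidate]] = {
    "CODE1": [
        Candidate("CODE1_verb", "verb", {"NEXT_A"}),
        Candidate("CODE1_noun", "noun", {"NEXT_B"}),
    ]
}


def select_candidate(code: str, next_code: str) -> Optional[Candidate]:
    """Return the single reading whose collocations admit the next word.

    Returns None when no reading or more than one reading matches, in which
    case a real system would fall back to manual selection.
    """
    matches = [c for c in CANDIDATES.get(code, []) if next_code in c.following]
    return matches[0] if len(matches) == 1 else None


chosen = select_candidate("CODE1", "NEXT_B")
print(chosen.part_of_speech if chosen else "manual selection needed")
```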
The Mongolian digital knowledge base system generated in the present invention follows the inherent natural structure and attributes of Mongolian, retains rich Mongolian vocabulary information in real time, and converts it in real time into Mongolian vocabulary knowledge, so that every Mongolian word input or output through a computer keyboard or through OCR recognition is saved into the system without omission. The vocabulary saved in this way accumulates over time into a digitized mass knowledge system of Mongolian vocabulary. Its advantages are as follows. a) It records and disseminates not only the past state of Mongolian vocabulary, as a paper dictionary does, but also its present, ongoing state in real time. b) It does not record and disseminate vocabulary statically and in closed form, as a paper dictionary does, but dynamically and openly. c) It does not record and disseminate vocabulary on a limited medium, as a paper dictionary does, but on digital mass media. d) Unlike a paper dictionary, which can serve people only after publication, it can also provide real-time service through digital means such as networks, mobile phones, and PDAs. e) Unlike a paper dictionary, which is compiled by a small group of experts, it allows most experts in every field, and indeed anyone with access to known digital means such as networks, mobile phones, and PDAs, to take part in collecting and compiling vocabulary (with review and release through an online integrated resource management system). f) It supports paperless application, paperless learning, paperless teaching, paperless research, and paperless development of Mongolian vocabulary.
The above embodiments only illustrate the present invention and do not limit it. A person of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection of the present invention shall be defined by the claims.

Claims (10)

1. A method for constructing a Mongolian digital knowledge base system, characterized in that the method comprises the following steps:
S1: obtaining Mongolian roots/stems and describing the knowledge attribute information related to the roots/stems, to generate a root/stem knowledge processing field unit;
S2: obtaining the members of different forms of the Mongolian roots/stems, to form a member database;
S3: establishing a rule system that defines the members combinable with each root/stem, together with the rules for free combination among members and for encapsulation and nesting between members.
2. The method according to claim 1, characterized in that, after step S3, the method further comprises:
S4: generating an attribute field unit, composed of a Mongolian phrase knowledge description field, a syntactic knowledge description field, and an agent/patient knowledge description field, for constraining the combination relations of Mongolian words.
3. The method according to claim 1, characterized in that the member database comprises an affix database, a joined compound affix database, and a non-joined compound affix database, and step S2 comprises the sub-steps of:
collecting Mongolian affixes to form the affix database, which provides the corresponding stems with joined-attachment computation objects for generating the required words;
collecting joined additional affixes to form the joined compound affix database, which provides the corresponding stems with joined-attachment computation objects for generating the required joined additional words;
collecting non-joined compound affixes to form the non-joined compound affix database, which provides the corresponding compound roots with non-joined attachment computation objects for generating the required non-joined compound words.
4. The method according to claim 3, characterized in that the member database further comprises a terminology database, a variant additional-component database, and a user-defined database, and step S2 further comprises the sub-steps of:
collecting the Mongolian technical terms of the various disciplines, such as mathematics, physics, chemistry, medicine, biology, and computer science and technology, into the terminology database;
collecting the Mongolian variant additional components into the variant additional-component database, which provides data and rules for the knowledge processing of variant additional components;
generating the user-defined database, filled in by the user, which provides a tool for storing and generating the user's personalized words.
5. The method according to claim 4, characterized in that the affix database, the joined compound affix database, the non-joined compound affix database, the variant additional-component database, and the user-defined database can be expanded continuously as required.
6. The method according to claim 1, characterized in that each group of rules in the rule system is described in the BDQ rule description language, which is composed of computer keyboard symbols: uppercase English letter input codes denote the member-database types that can serve as infixes; lowercase English letter input codes denote the member-database types that can serve as final suffixes; the digits 0 to 9 denote the sets of member-database types that can serve as verb final suffixes; a slash denotes an "or" relation; parentheses denote the nesting relation of an embedded member database; an underscore denotes part of speech; and a # sign denotes that the description of one group of path combination rules has ended and the description of another group begins.
7. The method according to claim 6, characterized in that each group of rules in the rule system allows a stem to combine members from several member databases of different types and to generate words as a multiway tree structure.
8. The method according to claim 1, characterized in that the method further comprises the step of:
S5: generating a series of statistical tool units that take the various Mongolian language elements and their combined forms as statistical units and are used to carry out real-time statistics on Mongolian.
9. A Mongolian digital knowledge base system, characterized in that the system comprises:
a knowledge processing field unit, for providing Mongolian roots/stems to the computer and describing the knowledge attribute information related to the Mongolian roots/stems;
a member database, collecting the members of different forms of the Mongolian roots/stems;
a rule system, for defining the members combinable with each root/stem, together with the rules for free combination among members and for encapsulation and nesting between members;
an attribute field unit, composed of a Mongolian phrase knowledge description field, a syntactic knowledge description field, and an agent/patient knowledge description field, for constraining the combination relations of Mongolian words.
10. The Mongolian digital knowledge base system according to claim 9, characterized in that the system further comprises:
a series of statistical tool units, which take the various Mongolian language elements and their combined forms as statistical units and carry out real-time statistics on Mongolian.
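The symbol classes of the BDQ rule description language listed in claim 6 can be illustrated with a small tokenizer. This is a sketch only: the claim names the symbol classes but does not give a grammar, so the token names, the assumption that each code is a single character, and the sample rule string `A(B/c)1_#` are all hypothetical.

```python
from typing import List, Tuple

# Punctuation symbols and the roles claim 6 assigns to them.
# The token names themselves are invented for this example.
PUNCTUATION = {
    "/": "OR",
    "(": "NEST_OPEN",
    ")": "NEST_CLOSE",
    "_": "PART_OF_SPEECH",
    "#": "GROUP_END",
}


def tokenize_bdq(rule: str) -> List[Tuple[str, str]]:
    """Split a BDQ-style rule string into (token_kind, character) pairs."""
    tokens: List[Tuple[str, str]] = []
    for ch in rule:
        if ch in PUNCTUATION:
            tokens.append((PUNCTUATION[ch], ch))
        elif "A" <= ch <= "Z":
            tokens.append(("INFIX_DB", ch))         # uppercase: infix database type
        elif "a" <= ch <= "z":
            tokens.append(("SUFFIX_DB", ch))        # lowercase: final-suffix database type
        elif "0" <= ch <= "9":
            tokens.append(("VERB_SUFFIX_SET", ch))  # digit: verb final-suffix type set
        elif ch.isspace():
            continue                                # assumption: whitespace is ignored
        else:
            raise ValueError(f"unexpected symbol in rule: {ch!r}")
    return tokens


print(tokenize_bdq("A(B/c)1_#"))
```

A parser for the nesting and "or" relations would sit on top of this token stream; it is omitted because the section does not specify how those relations compose.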
CNA2009100837496A 2009-05-11 2009-05-11 Mongolian digital knowledge base system construction method Pending CN101576909A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2009100837496A CN101576909A (en) 2009-05-11 2009-05-11 Mongolian digital knowledge base system construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2009100837496A CN101576909A (en) 2009-05-11 2009-05-11 Mongolian digital knowledge base system construction method

Publications (1)

Publication Number Publication Date
CN101576909A true CN101576909A (en) 2009-11-11

Family

ID=41271842

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2009100837496A Pending CN101576909A (en) 2009-05-11 2009-05-11 Mongolian digital knowledge base system construction method

Country Status (1)

Country Link
CN (1) CN101576909A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102681985A (en) * 2012-05-16 2012-09-19 中国科学院计算技术研究所 Translation method and translation system oriented to morphologically-rich language
CN104615724A (en) * 2015-02-06 2015-05-13 百度在线网络技术(北京)有限公司 Establishing method of knowledge base and information search method and device based on knowledge base
CN104866607A (en) * 2015-06-04 2015-08-26 北京信息科技大学 Dongba character interpretation database building method
CN105210055A (en) * 2013-04-11 2015-12-30 微软技术许可有限责任公司 Word breaker from cross-lingual phrase table
CN105957518A (en) * 2016-06-16 2016-09-21 内蒙古大学 Mongolian large vocabulary continuous speech recognition method
CN108334502A (en) * 2017-12-29 2018-07-27 内蒙古蒙科立蒙古文化股份有限公司 A kind of method for mutually conversing of tradition Mongolian and Cyrillic Mongolian
CN109726299A (en) * 2018-12-19 2019-05-07 中国科学院重庆绿色智能技术研究院 A kind of incomplete patent automatic indexing method
CN110837564A (en) * 2019-09-25 2020-02-25 中央民族大学 Construction method of knowledge graph of multilingual criminal judgment books

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102681985A (en) * 2012-05-16 2012-09-19 中国科学院计算技术研究所 Translation method and translation system oriented to morphologically-rich language
CN105210055A (en) * 2013-04-11 2015-12-30 微软技术许可有限责任公司 Word breaker from cross-lingual phrase table
CN105210055B (en) * 2013-04-11 2018-06-12 微软技术许可有限责任公司 According to the hyphenation device across languages phrase table
CN104615724B (en) * 2015-02-06 2018-01-23 百度在线网络技术(北京)有限公司 The foundation of knowledge base and the information search method and device in knowledge based storehouse
CN104615724A (en) * 2015-02-06 2015-05-13 百度在线网络技术(北京)有限公司 Establishing method of knowledge base and information search method and device based on knowledge base
CN104866607B (en) * 2015-06-04 2018-01-12 北京信息科技大学 A kind of Dongba character textual research and explain database building method
CN104866607A (en) * 2015-06-04 2015-08-26 北京信息科技大学 Dongba character interpretation database building method
CN105957518A (en) * 2016-06-16 2016-09-21 内蒙古大学 Mongolian large vocabulary continuous speech recognition method
CN105957518B (en) * 2016-06-16 2019-05-31 内蒙古大学 A kind of method of Mongol large vocabulary continuous speech recognition
CN108334502A (en) * 2017-12-29 2018-07-27 内蒙古蒙科立蒙古文化股份有限公司 A kind of method for mutually conversing of tradition Mongolian and Cyrillic Mongolian
CN109726299A (en) * 2018-12-19 2019-05-07 中国科学院重庆绿色智能技术研究院 A kind of incomplete patent automatic indexing method
CN110837564A (en) * 2019-09-25 2020-02-25 中央民族大学 Construction method of knowledge graph of multilingual criminal judgment books
CN110837564B (en) * 2019-09-25 2023-10-27 中央民族大学 Method for constructing multi-language criminal judgment book knowledge graph


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20091111