WO2017193472A1 - Method for establishing a digitized interpretation base of Dongba classic ancient books - Google Patents

Method for establishing a digitized interpretation base of Dongba classic ancient books

Info

Publication number
WO2017193472A1
Authority
WO
WIPO (PCT)
Prior art keywords
dongba
database
interpretation
event
meaning
Prior art date
Application number
PCT/CN2016/090275
Other languages
English (en)
French (fr)
Inventor
徐小力
吴国新
王红军
李宁
蒋章雷
王少红
Original Assignee
北京信息科技大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京信息科技大学
Priority to US15/578,671 (US10380065B2)
Publication of WO2017193472A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/45 Example-based machine translation; Alignment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/55 Rule-based translation

Definitions

  • The invention relates to a method for establishing a digitized interpretation library for written characters, in particular a digitized interpretation library of Dongba classic ancient books oriented to complex pictographs.
  • The Naxi people, who live in southwestern China, are one of China's ethnic minorities.
  • The Dongba pictographs that the Naxi have used since ancient times are regarded by international scholars as the only pictographic script still in use in the world today.
  • The Dongba classic ancient books are written in Dongba pictographs, which have distinctive graphic forms and pronunciations. In 2003 UNESCO listed them as "Memory of the World" heritage, establishing the important position of Naxi Dongba culture in the cultural history of China and the world.
  • The form, sound and meaning of Dongba pictographs are extremely complex: a single character often has several forms, several pronunciations and several meanings, and different forms may share one meaning. Since ancient times Dongba culture has been passed down orally within Dongba families, and the elderly Dongba priests who can still read the classics are now over seventy. Interpreting the Dongba classics involves evident complexity and uncertainty in the graphics, pronunciation and content of the pictographs, so existing recognition and translation systems for English, Chinese characters and the like cannot be applied.
  • The object of the present invention is to provide a method for establishing a digitized interpretation library of Dongba classic ancient books, which achieves digital preservation of the Dongba classics and supports the establishment and operation of Dongba classic databases and knowledge bases.
  • The present invention adopts the following technical solution: a method for establishing a digitized interpretation library of Dongba classic ancient books, characterized in that the method comprises the following steps: 1) collecting existing Dongba classic ancient book materials and establishing a Dongba classic interpretation resource library, which includes a graphic template library, an audio template library and a video template library; 2) establishing a Dongba classic interpretation database from the interpretation resource library, the interpretation database including a word-meaning database, a sentence-meaning database and an event database; 3) establishing a Dongba classic interpretation knowledge base to manage the interpretation database: the knowledge base combines interpretation content from the three databases according to interpretation rules, and uses an inference engine to promote fusion among the word-meaning, sentence-meaning and event databases; 4) establishing a Dongba classic interpretation optimization library, in which knowledge-mining tools perform frequent pattern mining and cluster analysis on the contents of the interpretation database and the interpretation knowledge base, supporting the optimization and updating of their interpretation rules.
  • The graphic template library includes a unique graphic code, standard glyphs and variant glyphs; its graphic content covers Dongba pictographs, Dongba sentences and Dongba events.
  • The audio template library includes a unique audio code, an audio storage path and Naxi phonetic notation; its audio content likewise covers Dongba characters, Dongba sentences and Dongba events.
  • The video template library includes a unique video code and a video storage path; its video content likewise covers Dongba characters, Dongba sentences and Dongba events.
  • For the word-meaning database, Dongba pictographs are extracted from existing Dongba classics as standard glyph models, each character is encoded in Unicode, and a standard template library of Dongba pictographs is built using the existing TrueType method.
  • The existing interpretation materials for the characters in the standard template library are collated and entered. The fields of the word-meaning database include the word-meaning code (Unicode), graphic code, corresponding Chinese character, category, corresponding English, interpreter, Chinese interpretation, audio code, Naxi phonetic notation and video code.
  • The sentence-meaning database includes the sentence-meaning code, Dongba sentence, corresponding Chinese, sentence meaning, category, graphic code, audio code and video code.
  • The event database includes the event name code, event name, category, event content, graphic code, audio code and video code. The content categories include philosophy, history, religion, medicine, astronomy, geography, folklore, animals and plants, military affairs, literature and art.
  • The rule is: if the input condition attribute C1 takes a value Vc1 in the value domain V, then the decision attribute D takes the corresponding decision value d1, that is, the attributes of the corresponding fields of the records satisfying Vc1 are output; if two condition attributes C2 and C3 are input, where C2 takes a value Vc2 in V and C3 takes a value Vc3 in V, then the decision attribute D takes the decision value d2, that is, the attributes of the corresponding fields of the records satisfying both Vc2 and Vc3 are output.
  • The inference engine proceeds as follows: ① Matching: check whether the facts in the interpretation database relevant to the current problem match the condition part of a rule in the interpretation knowledge base; if they match, the rule is enabled and the process goes to step ③ to execute the action part of the rule; if the condition parts of several rules match the relevant facts at the same time, the process goes to step ②. ② Conflict resolution: priorities are assigned in advance to the condition parts of all rules; when the condition parts of several rules match the relevant facts, the rule whose condition part has the higher priority is enabled first. ③ Execution: the action part of the enabled rule is executed, new facts are obtained, and the new facts are sent to the current interpretation database.
  • The method for fusing the word-meaning database, the sentence-meaning database and the event database comprises the following steps: ① a Dongba character is uniquely determined by its code in the word-meaning database, and the graphic, audio and video corresponding to the character are retrieved synchronously by its graphic code, audio code and video code, presenting the content and meaning of the character; ② using the corresponding Chinese character and category in the word-meaning database, a fuzzy search of the sentence-meaning database finds the Dongba sentences in the same category, that is, Dongba sentences are matched from an individual Dongba character, and the graphic, audio and video of each matched sentence are then retrieved by the graphic code, audio code and video code in the sentence-meaning database.
  • ③ Using the corresponding Chinese character and category in the word-meaning database, a fuzzy search of the event database finds the names of Dongba events in the same category, that is, Dongba events are matched from an individual Dongba character, and the graphic, audio and video of each matched event are then retrieved by the graphic code, audio code and video code in the event database, thereby achieving fusion of the word-meaning, sentence-meaning and event databases.
  • The frequent pattern mining process is as follows: frequent pattern mining is performed on the word-meaning, sentence-meaning and event databases in the interpretation database to obtain the frequently used word-meaning combinations and sentence-meaning combinations in them; frequent pattern mining is performed on the rules in the interpretation knowledge base to summarize frequently used combinations of condition attributes and decision values; the frequent items are output to the interpretation knowledge base, and during interpretation of Dongba pictographs the word-meaning and sentence-meaning combinations that best match the sentence currently being interpreted are offered as interpretation options.
  • The FP-growth algorithm is used; taking the word-meaning database as an example: ① scan the word-meaning database, find the frequently used word meanings and their usage counts, and build the list L of frequent items sorted in descending order of usage count; ② scan the database again and build the FP-tree from the word meanings: set the root node of the FP-tree to null and add each word-meaning item to the branches of the FP-tree one by one; ③ build the header table, linking all identical items.
  • ④ Using the header table, find the paths ending in a given word meaning, that is, the suffix pattern of the word meaning; ⑤ the prefix paths of the word meaning form its conditional pattern base; ⑥ build the conditional FP-tree of the word meaning from the conditional pattern base to obtain the conditional frequent itemsets; ⑦ merge the conditional frequent itemsets with the suffix pattern of the word meaning to obtain the frequent itemsets of the word meaning.
  • The cluster analysis method is as follows: ① the number of events in the event database is taken as the number of clusters, the event code, event name, category, event content and graphic code are used as source data, and the fuzzy C-means method is used to compute the cluster centers of the events.
  • ② The word-meaning code, graphic code, corresponding Chinese character, category and Chinese interpretation recorded in the word-meaning database for two consecutive word meanings in the interpretation process are taken as a sample, and the membership degree of the sample in each cluster center is computed.
  • ③ The event corresponding to the largest membership degree is taken as the implied event of the word meanings, and the association between word meaning and event is output to the interpretation knowledge base to improve its fusion method.
  • By adopting the above technical solution, the invention has the following advantages:
  • 1. By establishing the Dongba classic interpretation resource library, the invention achieves digital preservation of the Dongba classics and supports the establishment and operation of Dongba classic databases and knowledge bases.
  • 2. The Dongba classic interpretation library established by the invention enables synchronous retrieval of the graphics, audio and video of complex pictographs and provides intrinsically related information such as their word meanings, sentence meanings and events, which helps with the complexity and implicit meaning that make the Dongba classics hard to interpret.
  • 3. The Dongba classic interpretation library established by the invention is capable of self-learning: through continuous mining of information, the interpretation rules can be optimized and the accuracy of interpretation improved.
  • Figure 1 is a schematic diagram of the overall flow of the present invention.
  • As shown in Figure 1, the present invention provides a method for establishing a digitized interpretation library of Dongba classic ancient books, with the following steps:
  • 1) The interpretation resource library includes a graphic template library, an audio template library and a video template library.
  • The graphic template library holds images of the Dongba classic materials captured with a non-contact ancient-book scanner and a professional digital camera; the images are digitally entered, processed and saved as JPG files.
  • The graphic template library includes unique graphic codes, standard glyphs (JPG) and variant glyphs (JPG); its graphic content covers Dongba pictographs, Dongba sentences and Dongba events.
  • The audio template library is produced by using audio editing software to clip the high-sampling-rate digital audio captured with digital recording devices.
  • The clipped audio is saved as MP3 files; the high sampling rate is given as 320 kb/s.
  • The audio template library includes a unique audio code, an audio storage path (Naxi pronunciation) and Naxi phonetic notation; its audio content likewise covers Dongba characters, Dongba sentences and Dongba events.
  • The video template library is produced by editing the collected video resources of the Dongba classic materials, adding narration audio, explanatory subtitles or a soundtrack, and saving them as WMV files.
  • The video template library includes a unique video code and a video storage path (the videos cover songs and dances, sacrificial rites and the like); its video content likewise covers Dongba characters, Dongba sentences and Dongba events.
  • 2) The interpretation database includes the word-meaning database, the sentence-meaning database and the event database.
  • Word-meaning database: Dongba pictographs are extracted from existing Dongba classics (for example Mr. Fang Guoyu's 《纳西象形文字谱》) as standard glyph models, each character is encoded in Unicode, and a standard template library of Dongba pictographs is built using the existing TrueType method.
  • The existing interpretation materials for the characters in the standard template library of Dongba pictographs are collated and entered.
  • The fields of the word-meaning database include the word-meaning code (Unicode, used as the primary key), graphic code (PId), corresponding Chinese character (chinese), category (category), corresponding English (English), interpreter (interpreter), Chinese interpretation (Translation), audio code (AId), Naxi phonetic notation (NaxiP) and video code (VId), as shown in Table 1.
  • The sentence-meaning database includes the sentence-meaning code, Dongba sentence, corresponding Chinese, sentence meaning, category, graphic code, audio code and video code.
  • The event database includes the event name code, event name, category, event content, graphic code, audio code and video code.
  • The content categories include philosophy, history, religion, medicine, astronomy, geography, folklore, animals and plants, military affairs, literature and art.
  • The rule is: if the input condition attribute C1 takes a value Vc1 in the value domain V, then the decision attribute D takes the corresponding decision value d1, that is, the attributes of the corresponding fields of the records satisfying Vc1 are output; if two condition attributes C2 and C3 are input, where C2 takes a value Vc2 in V and C3 takes a value Vc3 in V, then the decision attribute D takes the decision value d2, that is, the attributes of the corresponding fields of the records satisfying both Vc2 and Vc3 are output.
  • ① Matching: check whether the facts in the interpretation database relevant to the current problem match the condition part of a rule in the interpretation knowledge base. If they match, the rule is enabled and the process goes to step ③ to execute the action part of the rule; if the condition parts of several rules match the relevant facts at the same time, the process goes to step ②.
  • The method for fusing the word-meaning database, the sentence-meaning database and the event database includes the following steps:
  • ② Using the corresponding Chinese character and category in the word-meaning database, a fuzzy search of the sentence-meaning database finds the Dongba sentences in the same category, that is, Dongba sentences are matched from an individual Dongba character, and the graphic, audio and video of each matched sentence are then retrieved by the graphic code, audio code and video code in the sentence-meaning database.
  • ③ Using the corresponding Chinese character and category in the word-meaning database, a fuzzy search of the event database finds the names of Dongba events in the same category, that is, Dongba events are matched from an individual Dongba character.
  • The graphic, audio and video of each matched event are then retrieved by the graphic code, audio code and video code in the event database, thereby achieving fusion of the word-meaning, sentence-meaning and event databases.
  • The frequent pattern mining process is as follows: frequent pattern mining is performed on the word-meaning, sentence-meaning and event databases in the interpretation database to obtain the frequently used word-meaning combinations and sentence-meaning combinations in them. Frequent pattern mining is also performed on the rules in the interpretation knowledge base to summarize frequently used combinations of condition attributes and decision values.
  • The frequent items are output to the interpretation knowledge base, and during interpretation of Dongba pictographs the word-meaning and sentence-meaning combinations that best match the sentence currently being interpreted are offered as interpretation options.
  • The FP-growth algorithm is used, with the word-meaning database as the example for illustrating frequent pattern mining:
  • ⑤ The prefix paths of a word meaning form the conditional pattern base of that word meaning.
  • ⑥ The conditional FP-tree of the word meaning is built from the conditional pattern base, giving the conditional frequent itemsets.
  • ⑦ The conditional frequent itemsets are merged with the suffix pattern of the word meaning to obtain the frequent itemsets of the word meaning.
  • ① The number of events in the event database is used as the number of clusters for cluster analysis.
  • The event code, event name, category, event content and graphic code are used as source data, and the fuzzy C-means method is used to compute the cluster centers of the events.
  • ② The word-meaning code, graphic code, corresponding Chinese character, category, Chinese interpretation and similar information recorded in the word-meaning database for two consecutive word meanings in the interpretation process are taken as a sample.
  • The membership degree of the sample in each cluster center is then computed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

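The present invention relates to a method for establishing a digitized interpretation library of Dongba classic ancient books, comprising the steps of: collecting existing Dongba classic ancient book materials and establishing a Dongba classic interpretation resource library that includes a graphic template library, an audio template library and a video template library; establishing, from the interpretation resource library, a Dongba classic interpretation database that includes a word-meaning database, a sentence-meaning database and an event database; establishing a Dongba classic interpretation knowledge base to manage the interpretation database, wherein the knowledge base combines interpretation content from the three databases according to interpretation rules and uses an inference engine to promote fusion among the word-meaning, sentence-meaning and event databases; and establishing a Dongba classic interpretation optimization library, in which knowledge-mining tools perform frequent pattern mining and cluster analysis on the contents of the interpretation database and the interpretation knowledge base, supporting the optimization and updating of their interpretation rules.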

Description

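Method for establishing a digitized interpretation base of Dongba classic ancient books
Technical Field
The present invention relates to a method for establishing a digitized interpretation library for written characters, and in particular to a method for establishing a digitized interpretation library of Dongba classic ancient books oriented to complex pictographs.
Background Art
The Naxi, who live in southwestern China, are one of China's ethnic minorities. The Dongba pictographs they have used since ancient times are regarded by international scholars as the only pictographic script still in use in the world today. The Dongba classic ancient books are written in Dongba pictographs, which have distinctive graphic forms and pronunciations. In 2003 UNESCO listed them as "Memory of the World" heritage, establishing the important position of Naxi Dongba culture in the cultural history of China and the world.
The form, sound and meaning of Dongba pictographs are all extremely complex: a single character often has several forms, several pronunciations and several meanings, and different forms may share one meaning. Since ancient times Dongba culture has been passed down orally within Dongba families, and the elderly Dongba priests who can still interpret the classics are now over seventy. Interpreting the Dongba classics involves evident complexity and uncertainty in the graphics, pronunciation and content of the pictographs, so existing recognition and translation systems for English, Chinese characters and the like cannot be applied.
Summary of the Invention
In view of the above problems, the object of the present invention is to provide a method for establishing a digitized interpretation library of Dongba classic ancient books. The method achieves digital preservation of the Dongba classics and supports the establishment and operation of Dongba classic databases and knowledge bases.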
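To achieve the above object, the present invention adopts the following technical solution: a method for establishing a digitized interpretation library of Dongba classic ancient books, characterized in that the method comprises the following steps: 1) collecting existing Dongba classic ancient book materials and establishing a Dongba classic interpretation resource library, which includes a graphic template library, an audio template library and a video template library; 2) establishing a Dongba classic interpretation database from the interpretation resource library, the interpretation database including a word-meaning database, a sentence-meaning database and an event database; 3) establishing a Dongba classic interpretation knowledge base to manage the interpretation database: the knowledge base combines interpretation content from the three databases according to interpretation rules, and uses an inference engine to promote fusion among the word-meaning, sentence-meaning and event databases; 4) establishing a Dongba classic interpretation optimization library, in which knowledge-mining tools perform frequent pattern mining and cluster analysis on the contents of the interpretation database and the interpretation knowledge base, supporting the optimization and updating of their interpretation rules.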
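In a preferred embodiment, in step 1), the graphic template library includes a unique graphic code, standard glyphs and variant glyphs, and its graphic content covers Dongba pictographs, Dongba sentences and Dongba events; the audio template library includes a unique audio code, an audio storage path and Naxi phonetic notation, and its audio content likewise covers Dongba characters, Dongba sentences and Dongba events; the video template library includes a unique video code and a video storage path, and its video content likewise covers Dongba characters, Dongba sentences and Dongba events.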
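In a preferred embodiment, in step 2), for the word-meaning database, Dongba pictographs are extracted from existing Dongba classics as standard glyph models, each character is encoded in Unicode, and a standard template library of Dongba pictographs is built using the existing TrueType method; the existing interpretation materials for the characters in the standard template library are collated and entered. The fields of the word-meaning database include the word-meaning code (Unicode), graphic code, corresponding Chinese character, category, corresponding English, interpreter, Chinese interpretation, audio code, Naxi phonetic notation and video code. The sentence-meaning database includes the sentence-meaning code, Dongba sentence, corresponding Chinese, sentence meaning, category, graphic code, audio code and video code. The event database includes the event name code, event name, category, event content, graphic code, audio code and video code, where the content categories include philosophy, history, religion, medicine, astronomy, geography, folklore, animals and plants, military affairs, literature and art.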
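In a preferred embodiment, in step 3), the interpretation rules are as follows: 3.1) define S as the rule set, C = {C1, C2, ..., Cn} as the set of condition attributes, V = (Vc1, Vc2, ..., Vcn) as the value domain of the condition attributes and decision attributes, D as the set of decision attributes, and (d1, d2, d3, ..., dv) as the decision values; 3.2) the rule is: if the input condition attribute C1 takes a value Vc1 in the value domain V, then the decision attribute D takes the corresponding decision value d1, that is, the attributes of the corresponding fields of the records satisfying Vc1 are output; if two condition attributes C2 and C3 are input, where C2 takes a value Vc2 in V and C3 takes a value Vc3 in V, then the decision attribute D takes the decision value d2, that is, the attributes of the corresponding fields of the records satisfying both Vc2 and Vc3 are output.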
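In a preferred embodiment, in step 3), the inference engine proceeds as follows: ① Matching: check whether the facts in the interpretation database relevant to the current problem match the condition part of a rule in the interpretation knowledge base; if they match, the rule is enabled and the process goes to step ③ to execute the action part of the rule; if the condition parts of several rules match the relevant facts at the same time, the process goes to step ②. ② Conflict resolution: priorities are assigned in advance to the condition parts of all rules; when the condition parts of several rules match the relevant facts, the rule whose condition part has the higher priority is enabled first. ③ Execution: the action part of the enabled rule is executed, new facts are obtained, and the new facts are sent to the current interpretation database.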
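In a preferred embodiment, in step 3), the method for fusing the word-meaning database, the sentence-meaning database and the event database comprises the following steps: ① a Dongba character is uniquely determined by its code in the word-meaning database, and the graphic, audio and video corresponding to the character are retrieved synchronously by its graphic code, audio code and video code, presenting the content and meaning of the character; ② using the corresponding Chinese character and category in the word-meaning database, a fuzzy search of the sentence-meaning database finds the Dongba sentences in the same category, that is, Dongba sentences are matched from an individual Dongba character, and the graphic, audio and video of each matched sentence are then retrieved by the graphic code, audio code and video code in the sentence-meaning database; ③ using the corresponding Chinese character and category in the word-meaning database, a fuzzy search of the event database finds the names of Dongba events in the same category, that is, Dongba events are matched from an individual Dongba character, and the graphic, audio and video of each matched event are then retrieved by the graphic code, audio code and video code in the event database, thereby achieving fusion of the word-meaning, sentence-meaning and event databases.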
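In a preferred embodiment, in step 4), the frequent pattern mining process is as follows: frequent pattern mining is performed on the word-meaning, sentence-meaning and event databases in the interpretation database to obtain the frequently used word-meaning combinations and sentence-meaning combinations in them; frequent pattern mining is performed on the rules in the interpretation knowledge base to summarize frequently used combinations of condition attributes and decision values; the frequent items are output to the interpretation knowledge base, and during interpretation of Dongba pictographs the word-meaning and sentence-meaning combinations that best match the sentence currently being interpreted are offered as interpretation options. The FP-growth algorithm is used; taking the word-meaning database as an example: ① scan the word-meaning database, find the frequently used word meanings and their usage counts, and build the list L of frequent items sorted in descending order of usage count; ② scan the database again and build the FP-tree from the word meanings: set the root node of the FP-tree to null and add each word-meaning item to the branches of the FP-tree one by one; ③ build the header table, linking all identical items; ④ using the header table, find the paths ending in a given word meaning, that is, the suffix pattern of the word meaning; ⑤ the prefix paths of the word meaning form its conditional pattern base; ⑥ build the conditional FP-tree of the word meaning from the conditional pattern base to obtain the conditional frequent itemsets; ⑦ merge the conditional frequent itemsets with the suffix pattern of the word meaning to obtain the frequent itemsets of the word meaning.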
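In a preferred embodiment, in step 4), the cluster analysis method is as follows: ① the number of events in the event database is taken as the number of clusters, the event code, event name, category, event content and graphic code are used as source data, and the fuzzy C-means method is used to compute the cluster centers of the events; ② the word-meaning code, graphic code, corresponding Chinese character, category and Chinese interpretation recorded in the word-meaning database for two consecutive word meanings in the interpretation process are taken as a sample, and the membership degree of the sample in each cluster center is computed; ③ the event corresponding to the largest membership degree is taken as the implied event of the word meanings, and the association between word meaning and event is output to the interpretation knowledge base to improve its fusion method.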
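By adopting the above technical solution, the present invention has the following advantages: 1. By establishing the Dongba classic interpretation resource library, the invention achieves digital preservation of the Dongba classics and supports the establishment and operation of Dongba classic databases and knowledge bases. 2. The Dongba classic interpretation library established by the invention enables synchronous retrieval of the graphics, audio and video of complex pictographs and provides intrinsically related information such as their word meanings, sentence meanings and events, which helps with the complexity and implicit meaning that make the Dongba classics hard to interpret. 3. The Dongba classic interpretation library established by the invention is capable of self-learning: through continuous mining of information, the interpretation rules can be optimized and the accuracy of interpretation improved.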
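Brief Description of the Drawings
Figure 1 is a schematic diagram of the overall flow of the present invention.
Best Mode for Carrying Out the Invention
The present invention is described in detail below with reference to the drawing and embodiments.
As shown in Figure 1, the present invention provides a method for establishing a digitized interpretation library of Dongba classic ancient books, with the following steps:
1) Collect existing Dongba classic ancient book materials and establish a Dongba classic interpretation resource library, which includes a graphic template library, an audio template library and a video template library.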
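The graphic template library holds images of the Dongba classic materials captured with a non-contact ancient-book scanner and a professional digital camera; the images are digitally entered, processed and saved as JPG files. The graphic template library includes unique graphic codes, standard glyphs (JPG) and variant glyphs (JPG); its graphic content covers Dongba pictographs, Dongba sentences and Dongba events.
The audio template library is produced by using audio editing software to clip the high-sampling-rate digital audio captured with digital recording devices and saving the result as MP3 files; the high sampling rate is given as 320 kb/s. The audio template library includes a unique audio code, an audio storage path (Naxi pronunciation) and Naxi phonetic notation; its audio content likewise covers Dongba characters, Dongba sentences and Dongba events.
The video template library is produced by editing the collected video resources of the Dongba classic materials, adding narration audio, explanatory subtitles or a soundtrack, and saving them as WMV files. The video template library includes a unique video code and a video storage path (the videos cover songs and dances, sacrificial rites and the like); its video content likewise covers Dongba characters, Dongba sentences and Dongba events.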
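2) Establish a Dongba classic interpretation database from the interpretation resource library; the interpretation database includes a word-meaning database, a sentence-meaning database and an event database.
Word-meaning database: Dongba pictographs are extracted from existing Dongba classics (for example Mr. Fang Guoyu's 《纳西象形文字谱》, a glossary of Naxi pictographs) as standard glyph models, each character is encoded in Unicode, and a standard template library of Dongba pictographs is built using the existing TrueType method; the existing interpretation materials for the characters in the standard template library are collated and entered.
The fields of the word-meaning database include the word-meaning code (Unicode, used as the primary key), graphic code (PId), corresponding Chinese character (chinese), category (category), corresponding English (English), interpreter (interpreter), Chinese interpretation (Translation), audio code (AId), Naxi phonetic notation (NaxiP) and video code (VId), as shown in Table 1.
Table 1. Fields of the word-meaning database
Figure PCTCN2016090275-appb-000001
The sentence-meaning database includes the sentence-meaning code, Dongba sentence, corresponding Chinese, sentence meaning, category, graphic code, audio code and video code.
The event database includes the event name code, event name, category, event content, graphic code, audio code and video code. The content categories include philosophy, history, religion, medicine, astronomy, geography, folklore, animals and plants, military affairs, literature and art.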
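The three databases of step 2) can be pictured as three relational tables. The following is a minimal sketch using Python's built-in `sqlite3`, not the patent's actual storage design. The word-meaning columns follow the field labels the description gives for Table 1; the table names, the remaining column names and the choice of SQLite are assumptions made for illustration. The sample row uses the E900 example that the description gives for the interpretation rules.

```python
import sqlite3

# Word-meaning columns follow the Table 1 field labels (Unicode, PId, chinese,
# category, English, interpreter, Translation, AId, NaxiP, VId). Table names
# and the sentence/event column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE word_meaning (
    unicode     TEXT PRIMARY KEY,   -- word-meaning code, e.g. 'E900'
    pid         TEXT,               -- graphic code
    chinese     TEXT,               -- corresponding Chinese character
    category    TEXT,
    english     TEXT,
    interpreter TEXT,
    translation TEXT,               -- Chinese interpretation
    aid         TEXT,               -- audio code
    naxip       TEXT,               -- Naxi phonetic notation
    vid         TEXT                -- video code
);
CREATE TABLE sentence_meaning (
    sentence_code   TEXT PRIMARY KEY,
    dongba_sentence TEXT,
    chinese         TEXT,
    meaning         TEXT,
    category        TEXT,
    pid TEXT, aid TEXT, vid TEXT
);
CREATE TABLE event_record (
    event_code TEXT PRIMARY KEY,
    event_name TEXT,
    category   TEXT,
    content    TEXT,
    pid TEXT, aid TEXT, vid TEXT
);
"""


def create_interpretation_db(path=":memory:"):
    """Create the three interpretation tables and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_interpretation_db()
conn.execute(
    "INSERT INTO word_meaning (unicode, chinese, category) VALUES (?, ?, ?)",
    ("E900", "天", "天象"),
)
row = conn.execute(
    "SELECT chinese, category FROM word_meaning WHERE unicode = ?", ("E900",)
).fetchone()
print(row)
```

Keeping the graphic, audio and video codes as plain columns in every table mirrors the description, where each record points into the template libraries by code.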
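3) Establish a Dongba classic interpretation knowledge base to manage the interpretation database: the knowledge base combines interpretation content from the three databases according to interpretation rules, and uses an inference engine to promote fusion among the word-meaning, sentence-meaning and event databases.
3.1) The interpretation rules are as follows:
3.1.1) Define S as the rule set, C = {C1, C2, ..., Cn} as the set of condition attributes, V = (Vc1, Vc2, ..., Vcn) as the value domain of the condition attributes and decision attributes, D as the set of decision attributes, and (d1, d2, d3, ..., dv) as the decision values.
3.1.2) The rule is: if the input condition attribute C1 takes a value Vc1 in the value domain V, then the decision attribute D takes the corresponding decision value d1, that is, the attributes of the corresponding fields of the records satisfying Vc1 are output; if two condition attributes C2 and C3 are input, where C2 takes a value Vc2 in V and C3 takes a value Vc3 in V, then the decision attribute D takes the decision value d2, that is, the attributes of the corresponding fields of the records satisfying both Vc2 and Vc3 are output.
For example, when the input condition attribute C1 is the "word-meaning code" attribute, the rule is evaluated: if Vc1 = E900, then D is d1, that is, the attributes of the corresponding fields are output, such as Category = 天象 (celestial phenomena) and Chinese = 天 (sky), as shown in the following table:
Figure PCTCN2016090275-appb-000002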
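Read operationally, rule 3.1.2) is a lookup: one or more (condition attribute, value) pairs select records, and the decision value is the set of fields of the matching records. A minimal sketch under that reading follows. The function name `apply_rule` and the second record (E901) are hypothetical and only there to exercise the two-condition case; the E900 row comes from the example in the description.

```python
# Illustrative word-meaning records. The E900 row is the description's example;
# the E901 row is hypothetical.
WORD_MEANINGS = [
    {"Unicode": "E900", "Category": "天象", "Chinese": "天"},
    {"Unicode": "E901", "Category": "天象", "Chinese": "日"},
]


def apply_rule(records, conditions):
    """Return the records whose fields equal every (attribute, value) condition.

    `conditions` maps condition attributes (C1, C2, ...) to the required
    values (Vc1, Vc2, ...); the returned records play the role of the
    decision value d.
    """
    return [
        record
        for record in records
        if all(record.get(attr) == value for attr, value in conditions.items())
    ]


# One condition attribute: C1 = word-meaning code, Vc1 = E900.
single = apply_rule(WORD_MEANINGS, {"Unicode": "E900"})

# Two condition attributes: C2 = Category, C3 = Chinese.
double = apply_rule(WORD_MEANINGS, {"Category": "天象", "Chinese": "日"})

print(single)
print(double)
```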
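3.2) The inference engine proceeds as follows:
① Matching: check whether the facts in the interpretation database relevant to the current problem match the condition part of a rule in the interpretation knowledge base; if they match, the rule is enabled and the process goes to step ③ to execute the action part of the rule; if the condition parts of several rules match the relevant facts at the same time, the process goes to step ②;
② Conflict resolution: priorities are assigned in advance to the condition parts of all rules, that is, the priority order within the value domain V is Vc1 > Vc2 > … > Vcn; when the condition parts of several rules match the relevant facts, the rule whose condition part has the higher priority is enabled first;
③ Execution: the action part of the enabled rule is executed, new facts are obtained, and the new facts are sent to the current interpretation database.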
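The match, conflict-resolution and execute cycle of 3.2) can be sketched as a single function. This is a simplified illustration, not the patent's engine: the `Rule` class, the convention that a lower number means higher priority, and the two sample rules are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    priority: int                       # assumption: lower number = higher priority
    condition: Callable[[dict], bool]   # condition part of the rule
    action: Callable[[dict], dict]      # action part; returns the new facts


def infer_once(facts: dict, rules: list) -> tuple:
    """Run one match / conflict-resolution / execute cycle."""
    # Step 1, matching: collect every rule whose condition part fits the facts.
    matched = [rule for rule in rules if rule.condition(facts)]
    if not matched:
        return facts, None
    # Step 2, conflict resolution: enable the rule with the highest priority.
    chosen = min(matched, key=lambda rule: rule.priority)
    # Step 3, execution: run the action part and merge the new facts.
    new_facts = chosen.action(facts)
    return {**facts, **new_facts}, chosen.name


# Hypothetical rules: both match the facts below, so priority decides.
rules = [
    Rule("by_category", 2,
         lambda f: "Category" in f,
         lambda f: {"hint": "search by category"}),
    Rule("by_code", 1,
         lambda f: f.get("Unicode") == "E900",
         lambda f: {"Category": "天象", "Chinese": "天"}),
]

facts, fired = infer_once({"Unicode": "E900", "Category": "天象"}, rules)
print(fired, facts)
```

Returning the merged dictionary stands in for "sending the new facts to the current interpretation database"; a fuller engine would repeat the cycle until no rule matches.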
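3.3) The method for fusing the word-meaning database, the sentence-meaning database and the event database includes the following steps:
① A Dongba character is uniquely determined by its code in the word-meaning database, and the graphic, audio and video corresponding to the character are retrieved synchronously by its graphic code, audio code and video code, presenting the content and meaning of the character;
② Using the corresponding Chinese character and category in the word-meaning database, a fuzzy search of the sentence-meaning database finds the Dongba sentences in the same category, that is, Dongba sentences are matched from an individual Dongba character, and the graphic, audio and video of each matched sentence are then retrieved by the graphic code, audio code and video code in the sentence-meaning database;
③ Using the corresponding Chinese character and category in the word-meaning database, a fuzzy search of the event database finds the names of Dongba events in the same category, that is, Dongba events are matched from an individual Dongba character, and the graphic, audio and video of each matched event are then retrieved by the graphic code, audio code and video code in the event database, thereby achieving fusion of the word-meaning, sentence-meaning and event databases.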
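The three fusion steps of 3.3) can be sketched as one retrieval function. This sketch makes two assumptions the source does not specify: "fuzzy search" is approximated by substring containment of the corresponding Chinese character, and all media codes and sentence and event records are hypothetical sample data.

```python
# Hypothetical sample data; only the E900 / 天 / 天象 triple comes from the
# description's own example.
WORDS = {
    "E900": {"chinese": "天", "category": "天象",
             "pid": "P0001", "aid": "A0001", "vid": "V0001"},
}
SENTENCES = [
    {"code": "S0001", "chinese": "天上有星", "category": "天象",
     "pid": "P1001", "aid": "A1001", "vid": "V1001"},
    {"code": "S0002", "chinese": "山上有树", "category": "地理",
     "pid": "P1002", "aid": "A1002", "vid": "V1002"},
]
EVENTS = [
    {"code": "EV001", "name": "祭天", "category": "天象",
     "pid": "P2001", "aid": "A2001", "vid": "V2001"},
]


def media(record):
    """Pick out the graphic, audio and video codes of a record."""
    return {key: record[key] for key in ("pid", "aid", "vid")}


def related(rows, text_field, word):
    """Assumed fuzzy match: same category and the character occurs in the text."""
    return [
        row for row in rows
        if row["category"] == word["category"] and word["chinese"] in row[text_field]
    ]


def fuse(word_code):
    word = WORDS[word_code]                             # step 1: unique character
    sentences = related(SENTENCES, "chinese", word)     # step 2: matching sentences
    events = related(EVENTS, "name", word)              # step 3: matching events
    return {
        "word": media(word),
        "sentences": [(s["code"], media(s)) for s in sentences],
        "events": [(e["code"], media(e)) for e in events],
    }


result = fuse("E900")
print(result)
```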
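4) Establish a Dongba classic interpretation optimization library, in which knowledge-mining tools perform frequent pattern mining and cluster analysis on the contents of the interpretation database and the interpretation knowledge base, supporting the optimization and updating of their interpretation rules.
4.1) The frequent pattern mining process is as follows: frequent pattern mining is performed on the word-meaning, sentence-meaning and event databases in the interpretation database to obtain the frequently used word-meaning combinations and sentence-meaning combinations in them; frequent pattern mining is performed on the rules in the interpretation knowledge base to summarize frequently used combinations of condition attributes and decision values. The frequent items are output to the interpretation knowledge base, and during interpretation of Dongba pictographs the word-meaning and sentence-meaning combinations that best match the sentence currently being interpreted are offered as interpretation options.
The FP-growth algorithm is used, with the word-meaning database as the example for illustrating frequent pattern mining: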
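① Scan the word-meaning database, find the frequently used word meanings and their usage counts, and build the list L of frequent items sorted in descending order of usage count.
② Scan the database again and build the FP-tree from the word meanings: set the root node of the FP-tree to null and add each word-meaning item to the branches of the FP-tree one by one.
③ Build the header table, linking all identical items.
④ Using the header table, find the paths ending in a given word meaning, that is, the suffix pattern of the word meaning.
⑤ The prefix paths of the word meaning form its conditional pattern base.
⑥ Build the conditional FP-tree of the word meaning from the conditional pattern base to obtain the conditional frequent itemsets.
⑦ Merge the conditional frequent itemsets with the suffix pattern of the word meaning to obtain the frequent itemsets of the word meaning.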
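Steps ① to ⑦ describe the standard FP-growth procedure. A compact sketch in plain Python follows. The data is hypothetical: each transaction stands for the set of word meanings used together in one interpreted sentence, written as English glosses, and the minimum support of 2 is an arbitrary choice. The header table is kept as a list of nodes per item rather than as linked node pointers, which is a simplification.

```python
from collections import defaultdict


class Node:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_tree(transactions, min_support):
    """Build an FP-tree from (items, count) pairs; return (root, header table)."""
    # Step 1: count usage and keep only the frequent items.
    freq = defaultdict(int)
    for items, count in transactions:
        for item in set(items):
            freq[item] += count
    freq = {item: c for item, c in freq.items() if c >= min_support}
    # Step 2: second scan; the root carries no item (null), and each
    # transaction is inserted with its items in descending order of usage.
    root = Node(None, None)
    header = defaultdict(list)   # step 3: header table, item -> nodes carrying it
    for items, count in transactions:
        ordered = sorted((i for i in set(items) if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += count
            node = child
    return root, header


def fp_growth(transactions, min_support, suffix=frozenset()):
    """Yield (frequent itemset, support) pairs."""
    _, header = build_tree(transactions, min_support)
    for item, nodes in header.items():
        support = sum(node.count for node in nodes)
        itemset = suffix | {item}          # step 7: merge with the suffix pattern
        yield itemset, support
        # Steps 4 and 5: the prefix path of every node that carries `item`
        # forms the conditional pattern base of `item`.
        base = []
        for node in nodes:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        # Step 6: mine the conditional FP-tree built from that base.
        yield from fp_growth(base, min_support, itemset)


# Hypothetical usage records: word meanings that occurred together.
usage = [["sky", "sun", "moon"], ["sky", "sun"], ["sky", "star"], ["sun", "moon"]]
frequent = dict(fp_growth([(items, 1) for items in usage], min_support=2))
for itemset, support in sorted(frequent.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), support)
```

With these four records the frequent combinations are the single meanings sky, sun and moon plus the pairs sky and sun, and sun and moon; star appears only once and is pruned in the first scan.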
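4.2) The cluster analysis method is as follows:
① The number of events in the event database is taken as the number of clusters; the event code, event name, category, event content, graphic code and similar information are used as source data, and the fuzzy C-means method is used to compute the cluster centers of the events.
② The word-meaning code, graphic code, corresponding Chinese character, category, Chinese interpretation and similar information recorded in the word-meaning database for two consecutive word meanings in the interpretation process are taken as a sample, and the membership degree of the sample in each cluster center is computed.
③ The event corresponding to the largest membership degree is taken as the implied event of the word meanings, and the association between word meaning and event is output to the interpretation knowledge base to improve its fusion method.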
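The clustering steps can be sketched with a small fuzzy C-means routine in plain Python. The source does not say how the text fields (codes, names, categories) are turned into numbers, so this sketch assumes they have already been encoded as numeric feature vectors; the 2-D points, the event names, the initial centers and the fuzzifier m = 2 are all hypothetical.

```python
import math


def memberships(sample, centers, m=2.0):
    """Fuzzy C-means membership of one sample in each cluster center."""
    dists = [math.dist(sample, center) for center in centers]
    zeros = dists.count(0.0)
    if zeros:  # the sample coincides with one or more centers
        return [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** power for d_k in dists) for d_j in dists]


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of rounds."""
    dims = len(points[0])
    for _ in range(iterations):
        u = [memberships(point, centers, m) for point in points]
        centers = [
            tuple(
                sum(u[i][j] ** m * point[d] for i, point in enumerate(points))
                / sum(u[i][j] ** m for i in range(len(points)))
                for d in range(dims)
            )
            for j in range(len(centers))
        ]
    return centers


# Hypothetical events and hypothetical numeric encodings of their source data.
EVENTS = ["sacrifice to heaven", "funeral rite"]
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Step 1: as many clusters as events.
centers = fuzzy_c_means(points, [(0.0, 0.0), (10.0, 10.0)])

# Step 2: membership of a sample built from two consecutive word meanings.
u = memberships((1.0, 1.0), centers)

# Step 3: the event with the largest membership is the implied event.
implied = EVENTS[u.index(max(u))]
print(implied, [round(x, 3) for x in u])
```

The memberships of a sample always sum to 1, so the largest value can be read as the relative strength of the link between the word meanings and each event.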
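The above embodiments are only intended to illustrate the present invention; the structure, dimensions, arrangement and shape of the components may all vary, and any improvement or equivalent transformation of individual components made on the basis of the technical solution and according to the principles of the present invention shall not be excluded from the scope of protection of the present invention.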

Claims (8)

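  1. A method for establishing a digitized interpretation library of Dongba classic ancient books, characterized in that the method comprises the following steps:
    1) collecting existing Dongba classic ancient book materials and establishing a Dongba classic interpretation resource library, which includes a graphic template library, an audio template library and a video template library;
    2) establishing a Dongba classic interpretation database from the interpretation resource library, the interpretation database including a word-meaning database, a sentence-meaning database and an event database;
    3) establishing a Dongba classic interpretation knowledge base to manage the interpretation database: the knowledge base combines interpretation content from the three databases according to interpretation rules, and uses an inference engine to promote fusion among the word-meaning, sentence-meaning and event databases;
    4) establishing a Dongba classic interpretation optimization library, in which knowledge-mining tools perform frequent pattern mining and cluster analysis on the contents of the interpretation database and the interpretation knowledge base, supporting the optimization and updating of their interpretation rules.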
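  2. The method for establishing a digitized interpretation library of Dongba classic ancient books according to claim 1, characterized in that: in step 1), the graphic template library includes a unique graphic code, standard glyphs and variant glyphs, and its graphic content covers Dongba pictographs, Dongba sentences and Dongba events; the audio template library includes a unique audio code, an audio storage path and Naxi phonetic notation, and its audio content likewise covers Dongba characters, Dongba sentences and Dongba events; the video template library includes a unique video code and a video storage path, and its video content likewise covers Dongba characters, Dongba sentences and Dongba events.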
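  3. The method for establishing a digitized interpretation library of Dongba classic ancient books according to claim 1, characterized in that: in step 2), for the word-meaning database, Dongba pictographs are extracted from existing Dongba classics as standard glyph models, each character is encoded in Unicode, and a standard template library of Dongba pictographs is built using the existing TrueType method; the existing interpretation materials for the characters in the standard template library are collated and entered; the fields of the word-meaning database include the word-meaning code (Unicode), graphic code, corresponding Chinese character, category, corresponding English, interpreter, Chinese interpretation, audio code, Naxi phonetic notation and video code; the sentence-meaning database includes the sentence-meaning code, Dongba sentence, corresponding Chinese, sentence meaning, category, graphic code, audio code and video code; the event database includes the event name code, event name, category, event content, graphic code, audio code and video code, where the content categories include philosophy, history, religion, medicine, astronomy, geography, folklore, animals and plants, military affairs, literature and art.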
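  4. The method for establishing a digitized interpretation library of Dongba classic ancient books according to claim 1, characterized in that: in step 3), the interpretation rules are as follows:
    3.1) define S as the rule set, C = {C1, C2, ..., Cn} as the set of condition attributes, V = (Vc1, Vc2, ..., Vcn) as the value domain of the condition attributes and decision attributes, D as the set of decision attributes, and (d1, d2, d3, ..., dv) as the decision values;
    3.2) the rule is: if the input condition attribute C1 takes a value Vc1 in the value domain V, then the decision attribute D takes the corresponding decision value d1, that is, the attributes of the corresponding fields of the records satisfying Vc1 are output; if two condition attributes C2 and C3 are input, where C2 takes a value Vc2 in V and C3 takes a value Vc3 in V, then the decision attribute D takes the decision value d2, that is, the attributes of the corresponding fields of the records satisfying both Vc2 and Vc3 are output.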
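  5. The method for establishing a digitized interpretation library of Dongba classic ancient books according to claim 1, characterized in that: in step 3), the inference engine proceeds as follows:
    ① matching: check whether the facts in the interpretation database relevant to the current problem match the condition part of a rule in the interpretation knowledge base; if they match, the rule is enabled and the process goes to step ③ to execute the action part of the rule; if the condition parts of several rules match the relevant facts at the same time, the process goes to step ②;
    ② conflict resolution: priorities are assigned in advance to the condition parts of all rules; when the condition parts of several rules match the relevant facts, the rule whose condition part has the higher priority is enabled first;
    ③ execution: the action part of the enabled rule is executed, new facts are obtained, and the new facts are sent to the current interpretation database.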
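  6. The method for establishing a digitized interpretation library of Dongba classic ancient books according to claim 1, characterized in that: in step 3), the method for fusing the word-meaning database, the sentence-meaning database and the event database comprises the following steps:
    ① a Dongba character is uniquely determined by its code in the word-meaning database, and the graphic, audio and video corresponding to the character are retrieved synchronously by its graphic code, audio code and video code, presenting the content and meaning of the character;
    ② using the corresponding Chinese character and category in the word-meaning database, a fuzzy search of the sentence-meaning database finds the Dongba sentences in the same category, that is, Dongba sentences are matched from an individual Dongba character, and the graphic, audio and video of each matched sentence are then retrieved by the graphic code, audio code and video code in the sentence-meaning database;
    ③ using the corresponding Chinese character and category in the word-meaning database, a fuzzy search of the event database finds the names of Dongba events in the same category, that is, Dongba events are matched from an individual Dongba character, and the graphic, audio and video of each matched event are then retrieved by the graphic code, audio code and video code in the event database, thereby achieving fusion of the word-meaning, sentence-meaning and event databases.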
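  7. The method for establishing a digitized interpretation library of Dongba classic ancient books according to claim 1, characterized in that: in step 4), the frequent pattern mining process is as follows: frequent pattern mining is performed on the word-meaning, sentence-meaning and event databases in the interpretation database to obtain the frequently used word-meaning combinations and sentence-meaning combinations in them; frequent pattern mining is performed on the rules in the interpretation knowledge base to summarize frequently used combinations of condition attributes and decision values; the frequent items are output to the interpretation knowledge base, and during interpretation of Dongba pictographs the word-meaning and sentence-meaning combinations that best match the sentence currently being interpreted are offered as interpretation options; the FP-growth algorithm is used, taking the word-meaning database as an example:
    ① scan the word-meaning database, find the frequently used word meanings and their usage counts, and build the list L of frequent items sorted in descending order of usage count;
    ② scan the database again and build the FP-tree from the word meanings: set the root node of the FP-tree to null and add each word-meaning item to the branches of the FP-tree one by one;
    ③ build the header table, linking all identical items;
    ④ using the header table, find the paths ending in a given word meaning, that is, the suffix pattern of the word meaning;
    ⑤ the prefix paths of the word meaning form its conditional pattern base;
    ⑥ build the conditional FP-tree of the word meaning from the conditional pattern base to obtain the conditional frequent itemsets;
    ⑦ merge the conditional frequent itemsets with the suffix pattern of the word meaning to obtain the frequent itemsets of the word meaning.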
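  8. The method for establishing a digitized interpretation library of Dongba classic ancient books according to claim 1, characterized in that: in step 4), the cluster analysis method is as follows:
    ① the number of events in the event database is taken as the number of clusters, the event code, event name, category, event content and graphic code are used as source data, and the fuzzy C-means method is used to compute the cluster centers of the events;
    ② the word-meaning code, graphic code, corresponding Chinese character, category and Chinese interpretation recorded in the word-meaning database for two consecutive word meanings in the interpretation process are taken as a sample, and the membership degree of the sample in each cluster center is computed;
    ③ the event corresponding to the largest membership degree is taken as the implied event of the word meanings, and the association between word meaning and event is output to the interpretation knowledge base to improve its fusion method.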
PCT/CN2016/090275 2016-05-10 2016-07-18 Method for establishing a digitized interpretation base of Dongba classic ancient books WO2017193472A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/578,671 US10380065B2 (en) 2016-05-10 2016-07-18 Method for establishing a digitized interpretation base of dongba classic ancient books

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610304529.1 2016-05-10
CN201610304529.1A CN106021354A (zh) 2016-05-10 2016-05-10 Method for establishing a digitized interpretation base of Dongba classic ancient books

Publications (1)

Publication Number Publication Date
WO2017193472A1 true WO2017193472A1 (zh) 2017-11-16

Family

ID=57099823

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/090275 WO2017193472A1 (zh) 2016-05-10 2016-07-18 Method for establishing a digitized interpretation base of Dongba classic ancient books

Country Status (3)

Country Link
US (1) US10380065B2 (zh)
CN (1) CN106021354A (zh)
WO (1) WO2017193472A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
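CN106503247A (zh) * 2016-11-09 2017-03-15 天津赛因哲信息技术有限公司 Ancient book document management system and method based on knowledge discovery technology
CN106777327A (zh) * 2017-01-11 2017-05-31 铁道第三勘察设计院集团有限公司 BIM template library management method
CN110222221A (zh) * 2019-06-13 2019-09-10 谢昱 Platform and method for querying and managing ancient book page images
CN110728262B (zh) * 2019-10-24 2022-03-22 程少轩 Intelligent ancient script data acquisition system
CN113742396B (zh) * 2021-08-26 2023-10-27 华中师范大学 Method and device for mining object learning behavior patterns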

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102902809A (zh) * 2012-10-19 2013-01-30 东南大学 Novel semantic association mining method
WO2015029241A1 (en) * 2013-08-27 2015-03-05 Nec Corporation Word translation acquisition method
CN104866607A (zh) * 2015-06-04 2015-08-26 北京信息科技大学 Method for establishing a Dongba script interpretation database
CN105224689A (zh) * 2015-10-30 2016-01-06 北京信息科技大学 Dongba document classification method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5835667A (en) * 1994-10-14 1998-11-10 Carnegie Mellon University Method and apparatus for creating a searchable digital video library and a system and method of using such a library
US5664227A (en) * 1994-10-14 1997-09-02 Carnegie Mellon University System and method for skimming digital audio/video data
CN201561900U (zh) * 2009-07-24 2010-08-25 株洲时代新材料科技股份有限公司 Air spring torsion test device for rail vehicles

Also Published As

Publication number Publication date
US10380065B2 (en) 2019-08-13
CN106021354A (zh) 2016-10-12
US20190057085A1 (en) 2019-02-21

Similar Documents

Publication Publication Date Title
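CN111753099B (zh) Method and system for enhancing the association degree of archive entities based on a knowledge graph
CN111143479B (zh) Method for fusing knowledge graph relation extraction and REST service visualization based on the DBSCAN clustering algorithm
WO2017193472A1 (zh) Method for establishing a digitized interpretation base of Dongba classic ancient books
CN111177591B (zh) Knowledge-graph-based Web data optimization method for visualization requirements
WO2017193471A1 (zh) Digitized international sharing platform for the inheritance system of Dongba classic ancient books
CN103678684A (zh) Chinese word segmentation method based on navigation information retrieval
CN107145479A (zh) Discourse structure analysis method based on text semantics
CN106502991B (zh) Publication processing method and device
CN111324742A (zh) Method for constructing a digital humanities knowledge graph
CN114840677B (zh) Short text classification and intelligent analysis method for multi-granularity requirements
CN106874397B (zh) Automatic semantic annotation method for Internet of Things devices
CN113312922A (zh) Improved document-level triple information extraction method
CN114997288A (zh) Design resource association method
Tarride et al. Large-scale genealogical information extraction from handwritten Quebec parish records
CN101271448A (zh) Method and device for recognizing Chinese base noun phrases and generating the rules therefor
CN109086285B (zh) Morpheme-based intelligent Chinese processing method, system and device
CN109145089B (zh) Hierarchical topic attribute extraction method based on natural language processing
JP5112027B2 (ja) Document group presentation device and document group presentation program
CN104866607B (zh) Method for establishing a Dongba script interpretation database
Revanth et al. Nl2sql: Natural language to sql query translator
CN106649219A (zh) Method for automatically generating communication satellite design documents
KR20220068937A (ko) Korean standard industry/occupation classification method using machine learning methodology
Fuller et al. Structuring, recording, and analyzing historical networks in the china biographical database
CN113392183A (zh) Method for representing and computing knowledge of children's category graphs
CN116821376B (zh) Knowledge graph construction method and system for the coal mine safety production field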

Legal Events

Date Code Title Description
NENP Non-entry into the national phase (Ref country code: DE)
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16901427; Country of ref document: EP; Kind code of ref document: A1)
122 EP: PCT application non-entry in European phase (Ref document number: 16901427; Country of ref document: EP; Kind code of ref document: A1)