WO2012079245A1 - Knowledge acquisition device and method - Google Patents

Knowledge acquisition device and method

Info

Publication number
WO2012079245A1
Authority
WO
WIPO (PCT)
Prior art keywords
arbitrary
model
lattice
frame
unit
Prior art date
Application number
PCT/CN2010/079937
Other languages
English (en)
French (fr)
Inventor
徐金安
孟凡东
陈恰
潘栩
达珍
孟庆辰
Original Assignee
北京交通大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京交通大学 filed Critical 北京交通大学
Priority to CN201080069243.0A priority Critical patent/CN103119585B/zh
Priority to PCT/CN2010/079937 priority patent/WO2012079245A1/zh
Publication of WO2012079245A1 publication Critical patent/WO2012079245A1/zh

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis

Definitions

  • the present invention relates to the field of natural language processing research, and in particular to a knowledge acquisition device and a method
  • The English word "plant" has two senses, "plant (vegetation)" and "factory". When "plant" co-occurs with "life" or "eat" in a sentence, the probability of the "vegetation" sense is much higher than that of "factory"; but when "plant" co-occurs with "manufacturing", the "factory" sense dominates. If a computer is given the corresponding semantic-analysis knowledge, it acquires the corresponding semantic disambiguation ability.
  • Case grammar is a formal grammatical model that expresses linguistic structure through "case frames" (see Formal Models of Natural Language Processing, Feng Zhiwei, University of Science and Technology of China Press, p. 293, first edition, January 2010).
  • Case grammar was first proposed by the American linguist C. Fillmore, who defined cases such as the agentive, the experiencer (patient), the instrumental, the objective, the locative, the dative, the factitive, the benefactive, time, source, goal, and comitative.
  • Each case frame is centered on a verb or adjective and has corresponding case slots.
  • Each case slot has corresponding attribute features, such as the agentive case expressing the actor of an action (the subject of the sentence), the objective case (the object of the sentence), and attributes representing information such as time, place, and instrument.
  • Because of the diversity and complexity of language, disambiguation is one of the fundamental tasks of natural language processing research. Disambiguation tasks appear in almost every area of natural language processing, such as word segmentation, part-of-speech tagging, syntactic structure analysis, semantic analysis, and target-language generation, and disambiguation must also be solved in fields such as machine translation, speech recognition, dialogue systems, and information retrieval. Among these, the task of syntactic-structure disambiguation is particularly demanding.
  • Especially in machine translation, the syntactic structure of predicate components such as verbs is often the bridge from source-language analysis to target-language generation; it determines the correctness and fluency of the generated language and is one of the key technologies of machine translation research.
  • Syntactic-structure disambiguation is also one of the premises and key factors of semantic disambiguation.
  • The difficulty of syntactic-structure disambiguation lies in the fact that the same verb can have many different structures, which is reflected in the diversity of its case frames.
  • Traditional natural language processing systems often construct verb case frames manually; however, because the number of case-frame patterns is huge, constructing them all by hand requires enormous human effort.
  • Patent Document 1 proposes a machine learning method based on probabilistic dependency graphs to construct case frames, and Non-Patent Documents 1 and 2 propose Web-based methods for building large-scale case frames.
  • Patent Document 1 Japanese Patent No. 3353578;
  • Non-Patent Document 1: Daisuke Kawahara, Sadao Kurohashi. Construction of large-scale case frames from the Web using a high-performance computing environment; IPSJ SIG Notes, Natural Language Processing 171-12, pp. 67-73, 2006;
  • Non-Patent Document 2: Daisuke Kawahara, Sadao Kurohashi. Gradual automatic construction of a case frame dictionary, Journal of Natural Language Processing (Japan), Vol. 12, No. 2, pp. 109-131, 2005.
  • a first object of the present invention is to provide an efficient knowledge acquisition device.
  • a second object of the present invention is to propose an efficient knowledge acquisition method.
  • The present invention provides a knowledge acquisition apparatus, including: a case frame feature extraction unit for extracting the case frame elements of a predicate component in an input sentence and their attribute information; a model library for storing optional-case models; and an optional-case decision unit for performing pattern matching between the extraction result of the case frame feature extraction unit and the optional-case models, and determining the optional-case information in the case frame of the predicate component.
  • The present invention also provides a knowledge acquisition method, including: extracting the case frame elements of a predicate component in an input sentence and their attribute information; and performing pattern matching between the extraction result and stored optional-case models to determine the optional-case information in the case frame of the predicate component.
  • FIG. 1 is a flowchart of Embodiment 1 of a knowledge acquisition method according to the present invention
  • FIG. 2 is a flowchart of Embodiment 2 of the knowledge acquisition method of the present invention.
  • FIG. 3 is a flowchart of Embodiment 3 of the knowledge acquisition method of the present invention.
  • FIG. 4 is a structural diagram of Embodiment 1 of the knowledge acquisition apparatus of the present invention.
  • FIG. 5 is a structural diagram of Embodiment 2 of the knowledge acquisition apparatus of the present invention.
  • Figure 6 is a schematic diagram of the syntactic structure analysis of Japanese sentences
  • FIG. 7 is a schematic diagram of the extracted verb case frame features.
  • The embodiments of the present invention are mainly based on the idea of distinguishing the optional cases in the case frame of a predicate component, using Japanese sentences such as [彼が自転車で図書館へ行く] ("he goes to the library by bicycle") and its reorderings as examples.
  • Distinguishing optional cases simplifies the verb case frame structure and reduces the difficulty of sentence analysis, syntactic-structure disambiguation, and semantic disambiguation in application systems centered on natural language understanding, such as machine translation and dialogue systems.
  • FIG. 1 is a flowchart of Embodiment 1 of a knowledge acquisition method according to the present invention. As shown in Figure 1, this embodiment includes:
  • Step 102: Extract the case frame elements of the predicate component in the input sentence and their attribute information.
  • Step 104: Perform pattern matching between the extraction result and the stored optional-case models, and determine the optional-case information in the case frame of the predicate component.
  • Pattern matching is performed between the stored optional-case models and the case frame of the predicate component, so that the obligatory cases and optional cases of the predicate component's case frame are automatically acquired and effectively distinguished, improving the structural-disambiguation and semantic-disambiguation capability of natural language processing.
  • FIG. 2 is a flowchart of Embodiment 2 of the knowledge acquisition method of the present invention.
  • This embodiment is explained using the determination of the relationship between the obligatory cases and optional cases of Japanese verb case frames as an example; those skilled in the art will understand that the embodiments of the present invention are not limited to Japanese and can be applied to any other language. As shown in FIG. 2, this embodiment includes:
  • Step 201: Receive an input sentence, for example the sentence [彼が自転車で図書館へ行く]; in a specific implementation, the received sentence can also be read into memory;
  • Step 202: Perform lexical and syntactic analysis on the input sentence. Specifically:
  • The lexical analysis includes two steps: word segmentation and the acquisition of the attribute features of each word.
  • Word segmentation splits the sentence into its words.
  • For example, the above sentence can be segmented into [彼/が/自転車/で/図書館/へ/行く].
  • The attribute features of the words, such as the part of speech and, for verbs, the conjugation form, can be obtained from a machine-readable dictionary; a minimal segmentation sketch is given below.
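  • The following is a minimal, illustrative sketch of this lexical-analysis step. The toy dictionary and attribute names are assumptions made for illustration only; a real system would use a full morphological analyzer and a genuine machine-readable dictionary.

```python
# Toy longest-match segmenter with a hand-made machine-readable dictionary.
# All entries and attribute labels are hypothetical.
TOY_DICTIONARY = {
    "彼":    {"pos": "pronoun"},
    "が":    {"pos": "particle"},
    "自転車": {"pos": "noun"},
    "で":    {"pos": "particle"},
    "図書館": {"pos": "noun"},
    "へ":    {"pos": "particle"},
    "行く":  {"pos": "verb", "conjugation": "godan"},
}

def segment(sentence: str, dictionary: dict) -> list:
    """Greedy longest-match segmentation; returns (word, attributes) pairs."""
    tokens, i = [], 0
    while i < len(sentence):
        for j in range(len(sentence), i, -1):        # try the longest candidate first
            word = sentence[i:j]
            if word in dictionary:
                tokens.append((word, dictionary[word]))
                i = j
                break
        else:                                         # unknown character: keep it as-is
            tokens.append((sentence[i], {"pos": "unknown"}))
            i += 1
    return tokens

print(segment("彼が自転車で図書館へ行く", TOY_DICTIONARY))
```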
  • Step 203: Perform case frame feature extraction on the input sentence.
  • First, the knowledge base information is read into memory; then, for the analysis result of step 202, the semantic and conceptual information of the keywords is obtained from the knowledge base. When extracting the case frame features of a predicate component such as a verb, the feature elements of the predicate word to be extracted need to be determined in advance,
  • such as the word, part of speech, semantics, concept, and applicable domain; then, for each component of the defined feature elements, the attribute values of the corresponding feature elements are extracted from the analysis result of step 202 and from the knowledge base. For the sentence [彼が自転車で図書館へ行く], [彼], [自転車], [図書館], and [行く] can each be used as keywords to retrieve the knowledge base information read into memory, obtaining attributes (attribute information) such as the semantics and concepts of [彼], [自転車], and [図書館]; the case frame of the verb [行く] extracted from this Japanese sentence is shown in FIG. 7.
  • Specifically, the knowledge base can supply the attribute [person/animal] for [彼], the attribute [vehicle/article] for [自転車], the attribute [building/place] for [図書館], and so on. Those skilled in the art will understand that the specific knowledge base can be selected according to the input language and the chosen features: when the input language is Japanese, the EDR dictionary developed by Japan's National Institute of Information and Communications Technology can be used; for English, WordNet; for Chinese, HowNet; and so on. A feature-extraction sketch of this step is given below.
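  • As a rough illustration of this feature-extraction step, the sketch below collects the case slots of a predicate and attaches knowledge-base attributes to each slot noun. The toy knowledge base and its labels stand in for resources such as EDR, WordNet, or HowNet and are not taken from any real dictionary.

```python
# Minimal sketch of case frame feature extraction (step 203), assuming the
# tokenized output of step 202. All semantic labels are illustrative.
KNOWLEDGE_BASE = {
    "彼":    {"semantic": "person/animal"},
    "自転車": {"semantic": "vehicle/article", "concept": "交通手段"},
    "図書館": {"semantic": "building/place"},
}
CASE_PARTICLES = {"が", "を", "に", "で", "へ", "から", "まで"}

def extract_case_frame(tokens, predicate):
    """Collect (noun, particle, attributes) slots governed by the predicate."""
    frame = {"predicate": predicate, "slots": []}
    for idx, (word, attrs) in enumerate(tokens):
        if attrs.get("pos") == "particle" and word in CASE_PARTICLES and idx > 0:
            noun = tokens[idx - 1][0]
            frame["slots"].append({
                "word": noun,
                "particle": word,
                "attributes": KNOWLEDGE_BASE.get(noun, {}),
            })
    return frame

tokens = [("彼", {"pos": "pronoun"}), ("が", {"pos": "particle"}),
          ("自転車", {"pos": "noun"}), ("で", {"pos": "particle"}),
          ("図書館", {"pos": "noun"}), ("へ", {"pos": "particle"}),
          ("行く", {"pos": "verb"})]
print(extract_case_frame(tokens, "行く"))
```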
  • Step 204: Perform pattern matching between the optional-case models stored in the model library and the case frame of the predicate word extracted in step 203, and determine the optional-case information in the case frame of that predicate word; this is described briefly below and in detail in the explanation of FIG. 3;
  • Step 205: Output the determination result of step 204.
  • The determination result may also be sent to the knowledge base for use by the case frame feature extraction unit, to improve the performance and efficiency of the system's knowledge acquisition;
  • In a specific implementation, the output data may be combined in a certain format as required. The output may take the form of a file or may be stored directly in a database. For example, corresponding to the determination result of step 204, the output may be [自転車で] or [交通手段で]; that is, the determination result may be a phrase determined to be an optional case, or a fragment containing semantic information and a specific case particle.
  • To facilitate information processing and to simplify the handling of verb case frames, the optional-case pattern determined in the sentence may also be output together with the predicate component of the sentence, or the extracted optional-case phrase and the sentence with the optional-case phrase removed may be output.
  • By determining the relationship between the case frames of predicate components such as verbs and the optional cases in a sentence, this embodiment correctly distinguishes the obligatory cases from the optional cases in the case frame, making the structure of predicate components such as verbs concise. This greatly improves the coverage of verb case frames, improves the accuracy of structural and semantic disambiguation in syntactic-structure analysis and semantic analysis, and provides an efficient and reliable knowledge acquisition method for natural-language-understanding research fields such as information retrieval, machine translation, and dialogue systems.
  • FIG. 3 is a flowchart of Embodiment 3 of the knowledge acquisition method of the present invention. It mainly explains the process of building the model library with a machine learning method. Those skilled in the art will understand that the model library can be built from learning data using any of various machine learning methods; a support vector machine (SVM) is used below as an example. As shown in FIG. 3,
  • this embodiment includes: Step 301, feature extraction;
  • For the theory of support vector machines, see the following non-patent documents: [Non-Patent Document 3] Fang Ruiming, Support Vector Machine Theory and Its Application Analysis, China Electric Power Press, October 1, 2007, ISBN 9787508360379; [Non-Patent Document 4] Deng Naiyang, Tian Yingjie, Support Vector Machines: Theory, Algorithms and Extensions, Science Press, August 1, 2009, ISBN 9787030250315.
  • Many open-source machine learning modules for support vector machines are available, for example: [Non-Patent Document 5] http://www.cs.cornell.edu/People/tj/svm_light/old/svm_light_v4.00.html; [Non-Patent Document 6] http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  • According to the theory of support vector machines, different kernel functions can be used to solve linear or non-linear classification problems; in general, polynomial kernels, RBF (Radial Basis Function) kernels, Sigmoid kernels, and so on can be used. In the modules provided by Non-Patent Documents 5 and 6, the kernel function is selected by presetting the parameters of the learning command of the module being used, such as the svm_learning command of SVM Light.
  • Using a support vector machine also involves the generation of the feature vector space, feature selection, and the computation of feature weights. The feature vector space can be built from the learning data, for example by performing word segmentation on the text files, computing word frequencies or word probabilities, or the occurrence frequencies or probabilities of N-gram models, and removing some of the high-frequency words to complete feature selection.
  • There are many methods for computing feature weights, such as Boolean weights, absolute term frequency (TF), inverse document frequency (IDF), TF-IDF, TFC, ITC, entropy weights, and TF-IWF.
  • In addition, when an SVM classifier is used, the learning data must be preprocessed. Besides the generation of the feature vector space, feature selection, and the choice of the feature-weighting method described above, the learning data must be labeled in advance, for example marking correct instances as class +1 and incorrect instances as class -1.
  • Furthermore, all positive and negative examples in the learning data must be converted into a numerical format according to the elements of the feature vector space. For this format conversion, the row number of each feature element in the feature vector space can generally be used in place of the word or phrase in the learning data; for example, the positive examples are word-order variants of [彼が自転車で図書館へ行く], while the negative examples given in the original are only partially legible in this text.
  • Following this approach, word frequencies are counted and a state vector space (i.e., the set of extracted features) such as the one shown in Table 1 is assumed; this is an illustration and should not be interpreted as limiting. A format-conversion sketch is given below.
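  • The sketch below illustrates the format conversion just described, under stated assumptions: every feature element is given a hypothetical row number, and each example becomes a sparse "label index:value" line with Boolean weights, the sparse form accepted by SVM-Light and LIBSVM. The feature numbering and example sentences are invented for illustration.

```python
# Boolean-weight format conversion: words/phrases are replaced by the row
# numbers of feature elements in a (hypothetical) feature vector space.
FEATURE_SPACE = {          # feature element -> row number (all hypothetical)
    "彼": 1, "自転車": 2, "図書館": 3, "行く": 4,
    "が": 5, "で": 6, "へ": 7, "交通手段": 8, "本": 9, "読む": 10,
}

def to_svmlight_line(label, features):
    """label is +1 for a positive example, -1 for a negative example."""
    indices = sorted({FEATURE_SPACE[f] for f in features if f in FEATURE_SPACE})
    return f"{label:+d} " + " ".join(f"{i}:1" for i in indices)

positive = ["彼", "が", "自転車", "で", "図書館", "へ", "行く", "交通手段"]
negative = ["この", "本", "を", "後で", "読む"]
print(to_svmlight_line(+1, positive))   # "+1 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1"
print(to_svmlight_line(-1, negative))   # "-1 9:1 10:1"
```

The resulting lines can then be written to a training file for the learning command mentioned above.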
  • Step 302: Build the model from the features extracted above using the machine learning method. As described above, when SVM-Light is used, the svm_learning command mentioned above can be used to complete the machine learning task and obtain an SVM-based model library whose models take the form of weighted feature vectors.
  • The feature elements of the predicate word extracted in step 203 of FIG. 2 must conform to the constituent elements of the models in the model library; for example, when an SVM classifier is used,
  • the feature vector space used for SVM learning should contain the semantics, concepts, applicable domains, and other information from the knowledge base.
  • This embodiment is explained using an SVM learning method based on words and Boolean weighting. In a specific implementation, other methods can also be used, such as at least one of supervised learning, unsupervised learning, semi-supervised learning, clustering algorithms, correlation algorithms, complex feature sets with unification operations, probabilistic context-free grammars, N-gram models, hidden Markov models, naive Bayes, decision tree models, maximum entropy models, error-driven transformation methods, neural networks, conditional random fields (CRF), bootstrapping, and Co-Training.
  • FIG. 4 is a structural diagram of Embodiment 1 of the knowledge acquisition apparatus of the present invention.
  • the method embodiments shown in Figures 1-3 can be applied to this embodiment.
  • This embodiment includes: a case frame feature extraction unit 420 for extracting the case frame elements of the predicate component in the input sentence and their attribute information; a model library 4020 for storing optional-case models; and an optional-case decision unit 430 for performing pattern matching between the extraction result of the case frame feature extraction unit and the optional-case models, and determining the optional-case information in the case frame of the predicate component.
  • The input sentence memory unit 400 is configured to receive an input sentence. In a specific implementation, various general-purpose input modules can be used, such as a keyboard, a pointing device, handwritten character recognition, an optical character reader, or speech recognition, or the sentence can be input in the form of a text file or a database; the input sentence memory unit 400 may be any existing unit capable of receiving an input sentence for obtaining language information;
  • The lexical and syntactic parsing unit 410 is configured to perform word segmentation and syntactic-structure analysis on the input sentence. Word segmentation splits the input sentence and assigns each word related attribute features such as its part of speech; syntactic-structure analysis determines the structure of the input sentence, for example analyzing a Chinese sentence to determine its subject, predicate, object, attributive, adverbial, and complement. The knowledge base 4010 supplies, for the output of the lexical and syntactic parsing unit 410, attribute features such as the semantics and concepts of the words or phrases of each constituent element of the sentence, for example WordNet for English or HowNet for Chinese. The purpose of adding semantic and conceptual attribute features is to abstract the extracted case frame; for example, in the Japanese sentence [彼が自転車で図書館へ行く], the attribute of the agentive case [彼が] can represent a person, the instrumental case [自転車で] can be a means of transport, and the locative case [図書館へ] can be a place.
  • The case frame feature extraction unit 420 is configured to extract the case frame features of the target verb from the output of the lexical and syntactic parsing unit 410 and the attribute features, such as semantics and concepts, acquired from the knowledge base 4010, and it provides the data and basis for
  • the pattern matching performed between the optional-case decision unit 430 and the model library 4020. There are many feature-selection methods for the case frame feature extraction unit 420; in general, feature extraction based on document frequency, the information gain method, the χ² statistic, the mutual information method, and so on can be used. A χ² sketch is given below.
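  • As one concrete example of the feature-selection methods listed above, the following sketch scores a feature with the standard χ² statistic computed from a 2×2 contingency table; the counts are made up for illustration.

```python
# χ² (chi-square) feature selection: score how strongly the presence of a
# feature is associated with a class.
def chi_square(n11, n10, n01, n00):
    """n11: feature & class, n10: feature & not class,
       n01: no feature & class, n00: no feature & not class."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0

# Example: a semantic feature such as [交通手段] vs. the "optional case" class.
print(chi_square(n11=30, n10=5, n01=10, n00=55))   # higher score = more informative
```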
  • The model library 4020 can be obtained from learning data using statistical methods.
  • It is used to judge the case frame features extracted by the case frame feature extraction unit 420, and thereby to determine and distinguish the obligatory cases and optional cases among the case frame elements of predicate components such as verbs.
  • The models in the model library can be obtained from learning data by statistical machine learning methods, such as support vector machines, decision trees, and the like;
  • The optional-case decision unit 430 is configured to perform pattern matching between the verb case frame features extracted by the case frame feature extraction unit 420 and the model library 4020, to judge the elements of the case frame of a predicate component such as a verb, and to distinguish the obligatory cases
  • from the optional cases. Specifically, with a model library 4020 built using a support vector machine (SVM), when the model library 4020 contains an optional-case model such as [交通手段で] ("by means of transport"),
  • the word [汽車] ("train") in the sentence [彼が汽車で会社に行く] can obtain the semantic information [交通手段] from the knowledge base; since this matches the decision model in the model library 4020 in which [交通手段で] is an optional case,
  • it can be concluded that [汽車で] is an optional case, as sketched below;
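  • A minimal sketch of this optional-case decision follows. The model library is reduced here to a set of (semantic class, particle) patterns standing in for a trained statistical model such as an SVM; every entry is illustrative.

```python
# Optional-case decision by matching (semantic class, particle) patterns.
KNOWLEDGE_BASE = {"汽車": "交通手段", "自転車": "交通手段", "彼": "人", "会社": "場所"}
OPTIONAL_CASE_MODELS = {("交通手段", "で")}        # e.g. the pattern [交通手段で]

def classify_slots(case_frame):
    """Label each (noun, particle) slot as 'optional' or 'obligatory'."""
    labelled = []
    for noun, particle in case_frame["slots"]:
        semantic = KNOWLEDGE_BASE.get(noun)
        kind = "optional" if (semantic, particle) in OPTIONAL_CASE_MODELS else "obligatory"
        labelled.append((noun + particle, kind))
    return labelled

frame = {"predicate": "行く",
         "slots": [("彼", "が"), ("汽車", "で"), ("会社", "に")]}
print(classify_slots(frame))
# [('彼が', 'obligatory'), ('汽車で', 'optional'), ('会社に', 'obligatory')]
```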
  • The output unit 440 is configured to output the result of the optional-case decision unit. The output can take various forms, such as file output or display output. Corresponding to the input sentence processed by the optional-case decision unit 430, the output may be [汽車で], or [汽車で] together with [彼が会社に行く], and so on; output can also be tailored to the needs of users.
  • Preferably, the output unit 440 writes its output to the knowledge base 4010 for direct use by the case frame feature extraction unit 420, to improve the performance and efficiency of the system's knowledge acquisition.
  • Through the optional-case decision unit 430, this embodiment can successfully divide the case elements in a verb's case frame into obligatory cases and optional cases, and separate the optional cases of the verb from the verb case frame, thereby simplifying the verb
  • case frame and compressing the number of case frames. At the same time, it can reduce the difficulty of syntactic-structure disambiguation and semantic disambiguation, improve the accuracy of syntactic analysis and semantic analysis, and promote and improve related research and application fields such as machine translation, information retrieval, and speech recognition.
  • FIG. 5 is a structural diagram of a third embodiment of the knowledge acquisition apparatus of the present invention.
  • the method embodiments shown in Figures 1-3 can be applied to this embodiment.
  • The constituent units and connection relationships of this embodiment are substantially the same as those of the knowledge acquisition apparatus shown in FIG. 4; the difference is that a database 5030 for storing learning data (such as a large-scale corpus) and a machine learning unit 510 are added.
  • The machine learning unit 510 can perform machine learning on the data in the learning database 5030 using methods such as support vector machines or decision trees, thereby building the model library 4020, as explained in detail with reference to FIG. 3.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Description

Knowledge acquisition device and method

Technical Field
The present invention relates to the field of natural language processing research, and in particular to a knowledge acquisition device and method.

Background Art
The development of network technology has triggered an explosion of information. Natural language processing, as a key information-processing technology, brings convenience but still faces many technical difficulties. In particular, in areas such as information retrieval, speech recognition, and machine translation, the automatic acquisition of linguistic knowledge has been one of the important fundamental research topics of natural language understanding ever since the birth of corpora.

To give a computer high-performance natural language understanding capability, the system generally needs to be supplied with a large amount of knowledge. For example, to solve the semantic disambiguation problem in natural language processing research, the system must be given corresponding disambiguation rules, instances, or statistical models. A simple example is the English word "plant", which has two senses, "plant (vegetation)" and "factory". When "plant" co-occurs with "life" or "eat" in a sentence, the probability of the "vegetation" sense is much higher than that of "factory"; but when "plant" co-occurs with "manufacturing", the "factory" sense dominates. If a computer is given the corresponding semantic-analysis knowledge, it acquires the corresponding semantic disambiguation ability.

In the field of natural language processing research, one widely known grammatical formalism used as a basic technique for semantic disambiguation is case grammar, a formal grammatical model that expresses linguistic structure through "case frames" (see Formal Models of Natural Language Processing, Feng Zhiwei, University of Science and Technology of China Press, p. 293, first edition, January 2010). Case grammar was first proposed by the American linguist C. Fillmore, who defined cases such as the agentive, the experiencer (patient), the instrumental, the objective, the locative, the dative, the factitive, the benefactive, time, source, goal, and comitative. Each case frame is centered on a verb or adjective and has corresponding case slots; each case slot has corresponding attribute features, such as the agentive case expressing the actor of an action (the subject of the sentence), the objective case (the object of the sentence), and attributes representing information such as time, place, and instrument.

As is well known, because of the diversity and complexity of language, disambiguation is one of the fundamental tasks of natural language processing research. Disambiguation tasks appear in almost every research area of natural language processing, such as word segmentation, part-of-speech tagging, syntactic structure analysis, semantic analysis, and target-language generation, and disambiguation must also be solved in fields such as machine translation, speech recognition, dialogue systems, and information retrieval. Among disambiguation problems, syntactic-structure disambiguation is a particularly demanding task. Especially in machine translation, the syntactic structure of predicate components such as verbs is often the bridge from source-language analysis to target-language generation; it determines the correctness and fluency of the generated language and is one of the key technologies of machine translation research.
Syntactic-structure disambiguation is also one of the premises and key factors of semantic disambiguation. Its difficulty lies in the fact that the same verb can have many different structures, which is reflected in the diversity of verb case frames. The more complex the verb case frames, the harder the analysis during syntactic-structure disambiguation. Traditional natural language processing systems often construct verb case frames manually; however, because the number of case-frame patterns is huge, constructing them all by hand requires enormous human resources.

Techniques for automatically extracting verb case frames from large corpora have therefore seen some development. For example, Patent Document 1 proposes a machine learning method based on probabilistic dependency graphs to construct case frames, and Non-Patent Documents 1 and 2 propose Web-based methods for building large-scale case frames.

[Patent Document 1] Japanese Patent No. 3353578;
[Non-Patent Document 1] Daisuke Kawahara, Sadao Kurohashi: Construction of large-scale case frames from the Web using a high-performance computing environment;
IPSJ SIG Notes, Natural Language Processing 171-12, pp. 67-73, 2006;
[Non-Patent Document 2] Daisuke Kawahara, Sadao Kurohashi: Gradual automatic construction of a case frame dictionary, Journal of Natural Language Processing (Japan), Vol. 12, No. 2, pp. 109-131, 2005.

However, the above conventional techniques only solve the problem of automatically extracting verb case frames; they do not further process the extracted frames. The resulting verb case frames are highly complex and their number is not reduced, which in practice increases the difficulty of syntactic-structure disambiguation and semantic disambiguation.

How to simplify verb case frames, reduce their number, and improve their level of abstraction and quality is therefore a difficult research problem that must be solved. Solving it will reduce the difficulty of syntactic-structure disambiguation and semantic disambiguation, improve the accuracy of syntactic-structure analysis and semantic analysis, and improve the accuracy of application systems such as machine translation, information retrieval, and speech recognition.

Summary of the Invention
The first object of the present invention is to provide an efficient knowledge acquisition device.
The second object of the present invention is to provide an efficient knowledge acquisition method.

To achieve the first object, the present invention provides a knowledge acquisition apparatus, including: a case frame feature extraction unit for extracting the case frame elements of a predicate component in an input sentence and their attribute information; a model library for storing optional-case models; and an optional-case decision unit for performing pattern matching between the extraction result of the case frame feature extraction unit and the optional-case models, and determining the optional-case information in the case frame of the predicate component.

To achieve the second object, the present invention provides a knowledge acquisition method, including: extracting the case frame elements of a predicate component in an input sentence and their attribute information; and performing pattern matching between the extraction result and stored optional-case models to determine the optional-case information in the case frame of the predicate component.

In the embodiments of the present invention, pattern matching is performed between the stored optional-case models and the case frame of the predicate component, so that the obligatory cases and optional cases of the predicate component's case frame are automatically acquired and effectively distinguished, improving the structural-disambiguation and semantic-disambiguation capability of natural language processing.

Brief Description of the Drawings
The drawings are provided for a further understanding of the present invention and constitute a part of the specification; together with the embodiments of the present invention they serve to explain the invention and do not limit it. In the drawings:
FIG. 1 is a flowchart of Embodiment 1 of the knowledge acquisition method of the present invention;
FIG. 2 is a flowchart of Embodiment 2 of the knowledge acquisition method of the present invention;
FIG. 3 is a flowchart of Embodiment 3 of the knowledge acquisition method of the present invention;
FIG. 4 is a structural diagram of Embodiment 1 of the knowledge acquisition apparatus of the present invention;
FIG. 5 is a structural diagram of Embodiment 2 of the knowledge acquisition apparatus of the present invention;
FIG. 6 is a schematic diagram of the syntactic-structure analysis of a Japanese sentence;
FIG. 7 is a schematic diagram of the extracted verb case frame features.

Detailed Description
Preferred embodiments of the present invention are described below with reference to the drawings. It should be understood that the preferred embodiments described here are only intended to illustrate and explain the present invention and are not intended to limit it.
The embodiments of the present invention are mainly based on the idea of distinguishing the optional cases in the case frame of a predicate component. Consider, for example, the following Japanese sentences:
1. 彼が自転車で図書館へ行く;
2. 自転車で彼が図書館へ行く;
3. 彼が図書館へ自転車で行く;
4. 彼が図書館へ行く、自転車で.
When the case frame of the verb [行く] is extracted from the above sentences using the traditional methods of the background art, the result will be several case frames rather than one. In fact, in the above sentences, the [で] in [自転車で] marks the instrumental case of a means of transport and is an optional case. The characteristic of an optional case is that it is dispensable in the case frame of the verb [行く] and can move freely within the sentence without changing the meaning that the sentence expresses. The [が] in [彼が] is the agentive case, the subject of the sentence, and an obligatory case; the [へ] in [図書館へ] is the locative case, the object of the sentence, and an obligatory case. An obligatory case is an indispensable case in a verb's case frame, while an optional case is a dispensable one. If the above sentences are processed for verb case frame extraction with obligatory and optional cases distinguished, the resulting verb case frame will be unique. It can thus be seen that, when automatically extracting the case frames of predicate components such as verbs in sentences, distinguishing obligatory and optional cases among the case frame elements greatly reduces the number of verb case frames, simplifies the verb case frame structure, and reduces the difficulty of sentence analysis, syntactic-structure disambiguation, and semantic disambiguation in application systems centered on natural language understanding, such as machine translation and dialogue systems.
Method Embodiments
FIG. 1 is a flowchart of Embodiment 1 of the knowledge acquisition method of the present invention. As shown in FIG. 1, this embodiment includes:
Step 102: Extract the case frame elements of the predicate component in the input sentence and their attribute information;
Step 104: Perform pattern matching between the extraction result and the stored optional-case models, and determine the optional-case information in the case frame of the predicate component.
In this embodiment, pattern matching is performed between the stored optional-case models and the case frame of the predicate component, so that the obligatory cases and optional cases of the predicate component's case frame are automatically acquired and effectively distinguished, improving the structural-disambiguation and semantic-disambiguation capability of natural language processing.

FIG. 2 is a flowchart of Embodiment 2 of the knowledge acquisition method of the present invention. This embodiment is explained using the determination of the relationship between the obligatory cases and optional cases of Japanese verb case frames as an example; those skilled in the art will understand that the embodiments of the present invention are not limited to Japanese and can be applied to any other language. As shown in FIG. 2, this embodiment includes:
Step 201: Receive an input sentence, for example the sentence [彼が自転車で図書館へ行く]; in a specific implementation, the received sentence can also be read into memory;
Step 202: Perform lexical and syntactic analysis on the input sentence. Specifically:
First, lexical analysis is performed, which includes two steps: word segmentation and the acquisition of word attribute features. Word segmentation splits the sentence into words; for example, the above sentence can be segmented into [彼/が/自転車/で/図書館/へ/行く]. The attribute features of each word, such as its part of speech and, for verbs, the conjugation form, can be obtained from a machine-readable dictionary;
Second, syntactic analysis is performed. The task of syntactic analysis is to determine the structure of the sentence. FIG. 6 shows the result of the syntactic-structure analysis of the Japanese sentence [彼が自転車で図書館へ行く]; from the result shown in FIG. 6, the head of the sentence is the verb [行く], the subject is the Japanese pronoun [彼], and the object is the place noun [図書館];
Third, after the lexical and syntactic analysis is completed, the analysis result is stored. Those skilled in the art will understand that methods for lexical and syntactic analysis are existing technology and are not described further here;
Step 203: Perform case frame feature extraction on the input sentence. Specifically:
First, the knowledge base information is read into memory;
Second, for the analysis result of step 202, the semantic and conceptual information of the keywords is obtained from the knowledge base information that has been read in. When extracting the case frame features of a predicate component such as a verb, the feature elements of the predicate word to be extracted, such as the word, part of speech, semantics, concept, and applicable domain, need to be determined in advance; then, for each component of the defined feature elements, the attribute values of the corresponding feature elements are extracted from the analysis result of step 202 and from the knowledge base. For the sentence [彼が自転車で図書館へ行く], [彼], [自転車], [図書館], and [行く] can each be used as keywords to retrieve the knowledge base information read into memory, obtaining attribute features (attribute information) such as the semantics and concepts of [彼], [自転車], and [図書館]. The case frame of the verb [行く] extracted from the Japanese sentence [彼が自転車で図書館へ行く] is shown in FIG. 7;
Specifically, the knowledge base can supply the attribute [person/animal] for [彼], the attribute [vehicle/article] for [自転車], the attribute [building/place] for [図書館], and so on. Those skilled in the art will understand that the specific knowledge base can be selected according to the input language and the chosen features: when the input language is Japanese, the EDR dictionary developed by Japan's National Institute of Information and Communications Technology can be used; for English, WordNet; for Chinese, HowNet; and so on;
Step 204: Perform pattern matching between the optional-case models stored in the model library and the case frame of the predicate word extracted in step 203, and determine the optional-case information in the case frame of that predicate word. This is described briefly below; see the explanation of FIG. 3 for details;
For example, when the case frame extracted from the Japanese sentence [彼が自転車で図書館へ行く] is as shown in FIG. 7, the word [自転車] in the sentence can obtain the semantic information [交通手段] ("means of transport") from the above knowledge base; since this matches the decision model in the model library in which [交通手段で] is an optional case, it can be concluded that [自転車で] is an optional case;
Step 205: Output the determination result of step 204. Preferably, the determination result can also be sent to the knowledge base for use in the processing of the case frame feature extraction unit, to improve the performance and efficiency of the system's knowledge acquisition;
In a specific implementation, the output data can be combined in a certain format as required. The output can take the form of a file or be stored directly in a database. For example, corresponding to the determination result of step 204, the output may be [自転車で] or [交通手段で]; that is, the determination result may be a phrase determined to be an optional case, or a fragment containing semantic information and a specific case particle. To facilitate information processing and simplify the handling of verb case frames, the optional-case pattern determined in the sentence may also be output together with the predicate component of the sentence, or the extracted optional-case phrase and the sentence with the optional-case phrase removed may be output.
By determining the relationship between the case frames of predicate components such as verbs and the optional cases in a sentence, this embodiment correctly distinguishes the obligatory cases from the optional cases in the case frame, making the structure of predicate components such as verbs concise. This greatly improves the coverage of verb case frames, improves the accuracy of structural and semantic disambiguation in syntactic-structure analysis and semantic analysis, and provides an efficient and reliable knowledge acquisition method for natural-language-understanding research fields such as information retrieval, machine translation, and dialogue systems.
FIG. 3 is a flowchart of Embodiment 3 of the knowledge acquisition method of the present invention. It mainly explains the process of building the model library with a machine learning method. Those skilled in the art will understand that the model library can be built from learning data using any of various machine learning methods; a support vector machine (SVM) is used below as an example to explain building the model library with a machine learning method. As shown in FIG. 3, this embodiment includes:
Step 301: Feature extraction. For the theory of support vector machines, the following non-patent documents can be consulted:
[Non-Patent Document 3] Fang Ruiming, Support Vector Machine Theory and Its Application Analysis; China Electric Power Press, October 1, 2007, ISBN 9787508360379.
[Non-Patent Document 4] Deng Naiyang, Tian Yingjie, Support Vector Machines: Theory, Algorithms and Extensions, Science Press, August 1, 2009, ISBN 9787030250315.
At present, many open-source machine learning modules for support vector machines are available; see, for example,
[Non-Patent Document 5]
http://www.cs.cornell.edu/People/tj/svm_light/old/svm_light_v4.00.html
[Non-Patent Document 6] http://www.csie.ntu.edu.tw/~cjlin/libsvm/
According to the theoretical principles of support vector machines, different kernel functions can be used to solve linear or non-linear classification problems; in general, polynomial kernels, RBF (Radial Basis Function) kernels, Sigmoid kernels, and so on can be used. In the modules provided by Non-Patent Document 5 and Non-Patent Document 6, the kernel function can be selected by presetting the parameters of the learning command of the module being used, such as the svm_learning command of SVM Light. Using a support vector machine also involves the generation of the feature vector space, feature selection, and methods for computing feature weights. The feature vector space can be built from the learning data, for example by performing word segmentation on text files, computing word frequencies or word probabilities, or the occurrence frequencies or probabilities of N-gram models, and removing some of the high-frequency words to complete feature selection. There are many methods for computing feature weights, such as Boolean weights, absolute term frequency (TF), inverse document frequency (IDF), TF-IDF, TFC, ITC, entropy weights, and TF-IWF;
In addition, when an SVM classifier is used, the learning data must be preprocessed. Besides the generation of the feature vector space, feature selection, and the choice of the feature-weight computation method described above, the learning data must also be classified in advance, for example marking correct instances as class +1 and incorrect instances as class -1. Furthermore, all positive and negative examples in the learning data must be converted into a numerical format according to the elements of the feature vector space; for this format conversion, the row number of each feature element in the feature vector space set can generally be used in place of the word or phrase in the learning data. For example:
Positive examples: 彼が 自転車で 図書館へ 行く
自転車で 彼が 図書館へ 行く
彼が 図書館へ 自転車で 行く
[Figure imgf000009_0001: the remaining examples, including the negative examples, appear only as an image in the source and are only partially legible here.]
Following the above approach, word frequencies are counted, and the state vector space (i.e., the extracted features) shown in Table 1 is assumed to be obtained; this is given as an example and should not be interpreted as limiting;
Table 1
[Figure imgf000010_0001: Table 1, the state vector space (feature elements and their row numbers), appears as an image in the source.]
If the above positive and negative examples are format-converted with Boolean weights, sparse feature vectors of the form "row_number:1" are obtained for each example, prefixed with the class label (+1 for positive examples, -1 for negative examples); the exact vectors listed in the source are garbled in this text and are not reproduced here.
Step 302: Build the model from the features extracted above using the machine learning method. As described above, when SVM-Light is used, the svm_learning command described above can be used to complete the machine learning task and obtain an SVM-based model library; the models in the resulting model library are weighted feature vectors (for example, a vector of "row_number:1" entries followed by a weight such as +0.92411687).
Those skilled in the art will understand that, when the SVM model is used, the essence of the processing of the optional-case decision unit is to use the svm_classify module of SVM-Light to classify new data (input sentences) on the basis of the corresponding feature vector set (with format conversion where necessary), in order to judge whether an optional case is present. If an appropriate threshold is applied to the weight of the classification result, it can be determined whether the sentence contains an optional case; for example, the part [自転車で] of the sentence [この学生が自転車で学校へ行く] is determined to be an optional case. Likewise, if the feature vector space contains the semantic information [交通手段] of [自転車], it can be inferred that, when the learning data are sufficient, a model in which, for example, [交通手段で] is an optional case can be obtained and used to judge new data. A sketch of this classification step follows.
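The sketch below illustrates this classification step under stated assumptions: it assumes the SVM-Light binaries svm_learn and svm_classify are installed and that the training and test files are already in the sparse format described above. The file names and the 0.0 threshold are illustrative choices, not values taken from the patent.

```python
# Train an SVM-Light model, classify new feature vectors, and apply a
# decision threshold to the raw scores to decide optional-case presence.
import subprocess

subprocess.run(["svm_learn", "train.dat", "case.model"], check=True)
subprocess.run(["svm_classify", "test.dat", "case.model", "predictions.txt"], check=True)

THRESHOLD = 0.0   # scores above the threshold are taken to indicate an optional case
with open("predictions.txt") as fh:
    decisions = [float(line) > THRESHOLD for line in fh if line.strip()]
print(decisions)  # e.g. [True, False, ...], one decision per input vector
```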
It should be noted that there is a matching relationship between the feature elements of the predicate word extracted in step 203 of FIG. 2 and the model library: the extracted feature elements must conform to the constituent elements of the models in the model library. For example, when the model library is built with an SVM classifier and the learning data have been processed by the case frame feature extraction unit described above so that the concepts, semantics, and other information of the words or phrases in the sentences have been obtained from the knowledge base, the feature vector space used for SVM learning should contain the semantics, concepts, applicable domains, and other information from the knowledge base. At the same time, the learning data and the data to be classified can be format-converted as needed, and then the machine learning task on the learning data and the classification task on the data to be classified are carried out respectively. For detailed methods, see Non-Patent Documents 3, 4, 5, and 6.
This embodiment is explained on the basis of an SVM learning method using words and Boolean weighting. In a specific implementation, other methods can also be used, such as at least one of supervised learning, unsupervised learning, semi-supervised learning, clustering algorithms, correlation algorithms, complex feature sets with unification operations, probabilistic context-free grammars, N-gram models, hidden Markov models (HMM), naive Bayes, decision tree models, maximum entropy models, error-driven transformation methods, neural networks, conditional random fields (CRF), bootstrapping, and Co-Training.
Apparatus Embodiments
FIG. 4 is a structural diagram of Embodiment 1 of the knowledge acquisition apparatus of the present invention. The method embodiments shown in FIGS. 1-3 are all applicable to this embodiment. This embodiment includes: a case frame feature extraction unit 420 for extracting the case frame elements of the predicate component in the input sentence and their attribute information; a model library 4020 for storing optional-case models; and an optional-case decision unit 430 for performing pattern matching between the extraction result of the case frame feature extraction unit and the optional-case models, and determining the optional-case information in the case frame of the predicate component.
In a specific implementation, an input sentence memory unit 400, a lexical and syntactic parsing unit 410, a knowledge base 4010, and an output unit 440 may also be included. The modules and units in this embodiment correspond to those in FIGS. 2, 3, and 4; for example, the knowledge base in FIG. 2 corresponds to the knowledge base 4010 in this embodiment. The units are explained as follows: the input sentence memory unit 400 is configured to receive an input sentence; in a specific implementation, various general-purpose input modules can be used, such as a keyboard, a pointing device, handwritten character recognition, an optical character reader, or speech recognition, or the sentence can be input in the form of a text file or a database; the input sentence memory unit 400 may be any existing unit capable of receiving an input sentence for obtaining language information;
The lexical and syntactic parsing unit 410 is configured to perform word segmentation and syntactic-structure analysis on the input sentence. Word segmentation includes splitting the input sentence and assigning each word related attribute features such as its part of speech; syntactic-structure analysis determines the structure of the input sentence, for example analyzing a Chinese sentence to determine its subject, predicate, object, attributive, adverbial, and complement. The knowledge base 4010 is used to supply, for the output of the lexical and syntactic parsing unit 410, attribute features such as the semantics and concepts of the words or phrases of each constituent element of the sentence, for example WordNet for English or HowNet for Chinese. The purpose of adding semantic and conceptual attribute features is to abstract the extracted case frame; for example, in the Japanese sentence [彼が自転車で図書館へ行く], the attribute of the agentive case [彼が] can represent a person, the instrumental case [自転車で] can be a means of transport, and the locative case [図書館へ] can be a place, and so on;
The case frame feature extraction unit 420 is configured to extract the case frame features of the target verb from the output of the lexical and syntactic parsing unit 410 and the attribute features, such as semantics and concepts, acquired from the knowledge base 4010, providing the data and basis for the pattern matching performed between the optional-case decision unit 430 and the model library 4020. There are many feature-selection methods for the case frame feature extraction unit 420; in general, feature extraction based on document frequency, the information gain method, the χ² statistic, the mutual information method, and so on can be used. There are also many methods for computing feature weights, such as Boolean weights, absolute term frequency (TF), inverse document frequency (IDF), TF-IDF, TFC, ITC, entropy weights, and TF-IWF (a TF-IDF sketch is given below). The model library 4020 can be obtained from learning data using statistical methods and is used to judge the case frame features extracted by the case frame feature extraction unit 420, thereby determining and distinguishing the obligatory cases and optional cases among the case frame elements of predicate components such as verbs. The models in the model library can be obtained from learning data by statistical machine learning methods, such as support vector machines, decision trees, and other algorithms;
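As a concrete illustration of one of the weighting schemes listed above, the sketch below computes TF-IDF weights over a tiny, made-up corpus; other schemes such as Boolean or entropy weights would replace only the weighting function.

```python
# TF-IDF weighting over a toy corpus of tokenized sentences (illustrative data).
import math

corpus = [["彼", "が", "自転車", "で", "図書館", "へ", "行く"],
          ["彼", "が", "汽車", "で", "会社", "に", "行く"],
          ["この", "本", "を", "後で", "読む"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)                      # term frequency in the document
    df = sum(1 for d in docs if term in d)               # document frequency
    idf = math.log(len(docs) / df) if df else 0.0        # inverse document frequency
    return tf * idf

print(tf_idf("自転車", corpus[0], corpus))   # term specific to one document
print(tf_idf("行く", corpus[0], corpus))     # term shared by two documents
```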
The optional-case decision unit 430 is configured to perform pattern matching between the verb case frame features extracted by the case frame feature extraction unit 420 and the model library 4020, to judge the elements of the case frame of a predicate component such as a verb, and to distinguish the obligatory cases from the optional cases. Specifically, with a model library 4020 built using a support vector machine (SVM), when the model library 4020 contains an optional-case model such as [交通手段で], the word [汽車] ("train") in the sentence [彼が汽車で会社に行く] can obtain the semantic information [交通手段] from the knowledge base; since this matches the decision model in the model library 4020 in which [交通手段で] is an optional case, it can be concluded that [汽車で] is an optional case;
The output unit 440 is configured to output the result of the optional-case decision unit. The output can take many forms, such as file output or display output. Corresponding to the input sentence processed by the optional-case decision unit 430, the output may be [汽車で], or [汽車で] and [彼が会社に行く], and so on; output can also be produced according to the needs of users.
Preferably, the output unit 440 writes its output to the knowledge base 4010 for direct use in the processing of the case frame feature extraction unit 420, to improve the performance and efficiency of the system's knowledge acquisition.
Through the optional-case decision unit 430, this embodiment can successfully divide the case elements in a verb's case frame into obligatory cases and optional cases, and separate the optional cases of the verb from the verb case frame, thereby simplifying the verb case frame and compressing the number of case frames. At the same time, it can reduce the difficulty of syntactic-structure disambiguation and semantic disambiguation, improve the accuracy of syntactic analysis and semantic analysis, and promote and improve related research and application fields such as machine translation, information retrieval, and speech recognition.
FIG. 5 is a structural diagram of Embodiment 3 of the knowledge acquisition apparatus of the present invention. The method embodiments shown in FIGS. 1-3 are all applicable to this embodiment. As shown in FIG. 5, the constituent units and connection relationships of this embodiment are substantially the same as those of the knowledge acquisition apparatus shown in FIG. 4; the difference is that a database 5030 for storing learning data (such as a large-scale corpus) and a machine learning unit 510 are added. The machine learning unit 510 can perform machine learning on the data in the learning database 5030 using methods such as support vector machines or decision trees, thereby building the model library 4020; see the explanation of FIG. 3 for details. A schematic code skeleton of these units follows.
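The sketch below is a hedged, end-to-end skeleton of how the units of FIG. 5 could be wired together in code; each stage is a placeholder callable supplied by the caller, and all names are illustrative rather than the patent's reference numerals.

```python
# End-to-end skeleton: input sentence -> parsing -> case frame extraction ->
# optional-case decision against a model library -> output.
def knowledge_acquisition_pipeline(sentence, analyse, extract_frame,
                                   decide_optional, model_library, output):
    tokens = analyse(sentence)                          # lexical/syntactic parsing unit
    frame = extract_frame(tokens)                       # case frame feature extraction unit
    decision = decide_optional(frame, model_library)    # optional-case decision unit
    return output(decision)                             # output unit (file, display, or KB)

# Hypothetical usage, plugging in implementations such as the sketches above:
# result = knowledge_acquisition_pipeline(
#     "彼が自転車で図書館へ行く",
#     analyse=my_parser, extract_frame=my_frame_extractor,
#     decide_optional=my_decision_unit, model_library=my_models, output=print)
```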
Finally, it should be noted that the above are only preferred embodiments of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or replace some of their technical features with equivalents. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims

1. A knowledge acquisition apparatus, characterized by comprising:
a case frame feature extraction unit for extracting the case frame elements of a predicate component in an input sentence and their attribute information;
a model library for storing optional-case models; and
an optional-case decision unit for performing pattern matching between the extraction result of the case frame feature extraction unit and the optional-case models, and determining the optional-case information in the case frame of the predicate component.
2. The knowledge acquisition apparatus according to claim 1, characterized by further comprising:
a database for storing preset learning data; and
a machine learning unit for acquiring the learning data from the database, training on the learning data according to a preset machine learning method to obtain the optional-case models, and sending the optional-case models to the model library.
3. The knowledge acquisition apparatus according to claim 1 or 2, characterized by further comprising:
a knowledge base for storing attribute information of sentence constituent elements and providing the case frame feature extraction unit with the attribute information of the case frame elements of the predicate component.
4. The knowledge acquisition apparatus according to claim 3, characterized by further comprising:
an output unit for outputting the determination result of the optional-case decision unit and sending the determination result to the knowledge base.
5. The knowledge acquisition apparatus according to claim 4, characterized by further comprising:
a lexical and syntactic parsing unit for performing lexical analysis and syntactic-structure analysis on the input sentence and sending the analysis result to the case frame feature extraction unit.
6. The knowledge acquisition apparatus according to claim 5, characterized by further comprising:
an input sentence memory unit for receiving the input sentence and forwarding the input sentence to the lexical and syntactic parsing unit.
7. A knowledge acquisition method, characterized by comprising:
extracting the case frame elements of a predicate component in an input sentence and their attribute information; and
performing pattern matching between the extraction result and stored optional-case models to determine the optional-case information in the case frame of the predicate component.
8. The knowledge acquisition method according to claim 7, characterized in that, before the step of performing pattern matching between the extraction result and the stored optional-case models, the method comprises:
training on preset learning data according to a preset machine learning method to obtain the optional-case models; and
storing the optional-case models.
9. The knowledge acquisition method according to claim 7 or 8, characterized in that, after the step of determining the optional-case information in the case frame of the predicate component, the method further comprises:
outputting the determination result and sending the determination result to a knowledge base, the knowledge base being used for storing attribute information of sentence constituent elements and providing the attribute information of the case frame elements of the predicate component.
10. The knowledge acquisition method according to claim 8, characterized in that the preset machine learning method comprises at least one of: a supervised learning method, an unsupervised learning method, a semi-supervised learning method, a clustering algorithm, a correlation algorithm, complex feature sets with unification operations, a probabilistic context-free grammar, an N-gram model, a hidden Markov model, naive Bayes, a support vector machine, a decision tree model, a maximum entropy model, an error-driven transformation method, a neural network, and a conditional random field.
PCT/CN2010/079937 2010-12-17 2010-12-17 Knowledge acquisition device and method WO2012079245A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201080069243.0A CN103119585B (zh) 2010-12-17 2010-12-17 知识获取装置及方法
PCT/CN2010/079937 WO2012079245A1 (zh) 2010-12-17 2010-12-17 知识获取装置及方法

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2010/079937 WO2012079245A1 (zh) 2010-12-17 2010-12-17 知识获取装置及方法

Publications (1)

Publication Number Publication Date
WO2012079245A1 true WO2012079245A1 (zh) 2012-06-21

Family

ID=46243987

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/079937 WO2012079245A1 (zh) 2010-12-17 2010-12-17 知识获取装置及方法

Country Status (2)

Country Link
CN (1) CN103119585B (zh)
WO (1) WO2012079245A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714053A (zh) * 2013-11-13 2014-04-09 北京中献电子技术开发中心 Japanese verb recognition method for machine translation

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959240A (zh) * 2017-05-26 2018-12-07 上海醇聚信息科技有限公司 System and method for automatically generating a proprietary ontology

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1255213A (zh) * 1997-03-04 2000-05-31 石仓博 Language analysis system and method
JP2007206888A (ja) * 2006-01-31 2007-08-16 Toyota Central Res & Dev Lab Inc Response generation device, method, and program
WO2008117432A1 (ja) * 2007-03-27 2008-10-02 Fujitsu Limited Electronic document concealment program

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7689411B2 (en) * 2005-07-01 2010-03-30 Xerox Corporation Concept matching
US8301435B2 (en) * 2006-02-27 2012-10-30 Nec Corporation Removing ambiguity when analyzing a sentence with a word having multiple meanings
JP5128328B2 (ja) * 2008-03-13 2013-01-23 日本放送協会 Ambiguity evaluation device and program
KR100956794B1 (ko) * 2008-08-28 2010-05-11 한국전자통신연구원 Translation apparatus applying multi-level verb-phrase patterns, and application and extraction methods therefor
CN101887443B (zh) * 2009-05-13 2012-12-19 华为技术有限公司 Text classification method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1255213A (zh) * 1997-03-04 2000-05-31 石仓博 Language analysis system and method
JP2007206888A (ja) * 2006-01-31 2007-08-16 Toyota Central Res & Dev Lab Inc Response generation device, method, and program
WO2008117432A1 (ja) * 2007-03-27 2008-10-02 Fujitsu Limited Electronic document concealment program

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714053A (zh) * 2013-11-13 2014-04-09 北京中献电子技术开发中心 Japanese verb recognition method for machine translation

Also Published As

Publication number Publication date
CN103119585A (zh) 2013-05-22
CN103119585B (zh) 2015-12-02

Similar Documents

Publication Publication Date Title
CN108170749B (zh) Artificial-intelligence-based dialogue method and apparatus, and computer-readable medium
CN111143576A (zh) Event-oriented dynamic knowledge graph construction method and apparatus
WO2020232943A1 (zh) Knowledge graph construction method for event prediction, and event prediction method
CN108334495A (zh) Short-text similarity calculation method and system
CN110209806A (zh) Text classification method, text classification apparatus, and computer-readable storage medium
CN104391942A (zh) Short-text feature expansion method based on semantic maps
KR101627428B1 (ko) Method for building a syntactic analysis model using deep learning, and apparatus for performing the same
CN113377897B (zh) Multilingual medical terminology normalization system and method based on deep adversarial learning
CN110717341B (zh) Method and apparatus for constructing a Lao-Chinese bilingual corpus with Thai as a pivot language
CN103314369B (zh) Machine translation apparatus and method
WO2017198031A1 (zh) Semantic parsing method and apparatus
CN112420024A (zh) Fully end-to-end Chinese-English mixed air-traffic-control speech recognition method and apparatus
Alsallal et al. Intrinsic plagiarism detection using latent semantic indexing and stylometry
CN114217766A (zh) Semi-automatic requirement extraction method based on pre-trained language model fine-tuning and dependency features
Zhan et al. Survey on event extraction technology in information extraction research area
Sun et al. Multi-channel CNN based inner-attention for compound sentence relation classification
CN114722774B (zh) Data compression method and apparatus, electronic device, and storage medium
Kessler et al. Extraction of terminology in the field of construction
CN115033753A (zh) Training corpus construction method, and text processing method and apparatus
Yuwana et al. On part of speech tagger for Indonesian language
CN112632272A (zh) Microblog sentiment classification method and system based on syntactic analysis
WO2012079245A1 (zh) Knowledge acquisition device and method
CN106021225A (zh) Chinese maximal-length noun phrase recognition method based on Chinese simple noun phrases
CN113590768B (zh) Training method and apparatus for a text relevance model, and question answering method and apparatus
Wang et al. Attention-based recurrent neural model for named entity recognition in Chinese social media

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 201080069243.0

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10860718

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC DATED 07.10.2013

122 Ep: pct application non-entry in european phase

Ref document number: 10860718

Country of ref document: EP

Kind code of ref document: A1