WO2018000272A1 - Corpus generating apparatus and method - Google Patents

Corpus generating apparatus and method

Info

Publication number
WO2018000272A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
statement
corpus
sentence structure
cluster
Prior art date
Application number
PCT/CN2016/087757
Other languages
English (en)
French (fr)
Inventor
王昊奋
邱楠
杨新宇
Original Assignee
深圳狗尾草智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳狗尾草智能科技有限公司
Priority to PCT/CN2016/087757 (WO2018000272A1, zh)
Priority to CN201680001747.6A (CN107004000A, zh)
Priority to US15/694,918 (US10268678B2, en)
Publication of WO2018000272A1 (zh)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Definitions

  • The present invention relates to the field of text processing, and in particular to a corpus generating apparatus and method.
  • Language expression is rich and varied: a sentence can be formed simply by combining a few words. If a corpus is built by collecting and entering every sentence one by one, the effort required is excessive and sentences are easily missed.
  • The prior art includes edit-distance methods, which expand a corpus through operations such as deletion, shifting, and insertion, but these are cumbersome in practice.
  • The main technical problem solved by the present invention is to provide a corpus generating apparatus and method that obtain corpus entries by nesting words into expanded sentence structures. The operation is simple, saves resources, and expands the corpus substantially.
  • One technical solution adopted by the present invention is a corpus generating apparatus comprising: a word segmentation module, connected to at least one monolingual parallel corpus, for segmenting the sentences in each parallel corpus and applying knowledge-driven processing to the segmented words so that they are tagged; a classification module for identifying the sentences after knowledge-driven processing and classifying sentences that have the same meaning but different tag sequences into the same sentence cluster; a mapping module for analyzing the sentences in each sentence cluster of each monolingual parallel corpus, determining the sentence structure categories of all sentences in the cluster, and determining, recording, and storing the mapping of tag transformations between the corresponding sentence structures when one sentence structure category in a cluster is converted into another; a sentence structure generation module for finding the same first-category sentence structure in each sentence cluster of all monolingual parallel corpora and, according to the first-type mapping between the first-category sentence structure of one cluster and the other sentence structure categories in that cluster, mapping the first-category sentence structure in each of the remaining clusters to generate the corresponding sentence structure categories; and a corpus generation module for nesting, into the newly generated sentence structures, the words corresponding to the sequence tags of the sentence structures in the cluster, to generate a new monolingual parallel corpus.
  • Another technical solution adopted by the present invention is a corpus generating method comprising the steps of: segmenting each sentence in at least one monolingual parallel corpus and applying knowledge-driven processing to the segmented words so that they are tagged; identifying the sentences after knowledge-driven processing and classifying sentences that have the same meaning but different tag sequences into the same sentence cluster; analyzing the sentences in each sentence cluster of each monolingual parallel corpus, determining the sentence structure categories of all sentences in the cluster, and determining, recording, and storing the mapping of tag transformations between the corresponding sentence structures when one sentence structure category in a cluster is converted into another; finding the same first-category sentence structure in each sentence cluster of all monolingual parallel corpora and, according to the first-type mapping between the first-category sentence structure of one cluster and the other sentence structure categories in that cluster, mapping the first-category sentence structure in each of the remaining clusters to generate the corresponding sentence structure categories; and nesting, into the newly generated sentence structures, the words corresponding to the sequence tags of the sentence structures in the cluster, to generate a new monolingual parallel corpus.
  • In contrast to the prior art, the corpus generating apparatus of the present invention tags the sentences in an existing corpus, maps sentence patterns with different tag sequences onto one another according to their tags to obtain more sentence structures, and obtains more corpus entries by filling in the words corresponding to the tags.
  • The invention thus obtains corpus entries by nesting words into the expanded sentence structures. The operation is simple, saves resources, and expands the corpus substantially.
  • FIG. 1 is a schematic structural diagram of an embodiment of a corpus generating apparatus provided by the present invention.
  • FIG. 2 is a schematic flow chart of an embodiment of a corpus generating method provided by the present invention.
  • The construction of corpora is an important foundation of statistical learning methods, and in recent years the great value of corpus resources for natural language processing research has been increasingly recognized. Bilingual corpora in particular have become an indispensable resource for machine translation, machine-assisted translation, and translation knowledge acquisition research. On the one hand, the emergence of bilingual corpora has directly driven new machine translation technologies: parallel corpora provide essential training data for building statistical machine translation models, and corpus-based translation methods, such as statistics-based and example-based methods, have offered new ideas for machine translation research, effectively improved translation quality, and set off a new wave of research in the field.
  • On the other hand, the bilingual corpus is an important source of translation knowledge, from which various kinds of fine-grained translation knowledge, such as translation dictionaries and translation templates, can be mined and learned to improve traditional machine translation technology.
  • In addition, the bilingual corpus is an important basic resource for cross-language information retrieval, translation dictionary compilation, automatic extraction of bilingual terms, and multilingual comparative research.
  • Building and acquiring bilingual parallel corpora is very difficult, and countries have invested large amounts of manpower, material, and financial resources in it.
  • However, the sources of bilingual parallel corpora are concentrated in specific domains such as government reports, news, and law, which makes them unsuitable for real-world text applications.
  • At the same time, large-scale bilingual text on the Internet offers good timeliness and coverage, which provides a potential route to acquiring bilingual parallel corpora.
  • FIG. 1 is a schematic structural diagram of an embodiment of a corpus generating apparatus provided by the present invention.
  • The apparatus 100 includes a word segmentation module 110, a classification module 120, a mapping module 130, a sentence structure generation module 140, and a corpus generation module 150.
  • The word segmentation module 110 is connected to the monolingual parallel corpus 101 and is used to segment the sentences in the parallel corpus 101 and to apply knowledge-driven processing to the segmented words so that they are tagged.
  • Word segmentation is usually performed with word segmentation tool software.
  • Knowledge-driven processing is the process of tagging the sentences.
  • The word segmentation module 110 includes a word segmentation unit 111, a first tag unit 112, and a second tag unit 113.
  • The word segmentation unit 111 connects to networked word segmentation tool software and imports the sentences to be decomposed into that software for segmentation.
  • After all sentences in the existing corpus have been segmented, the first tag unit 112 adds a first tag to each sentence according to the part of speech of the words in the segmented sentence.
  • The second tag unit 113 adds a second tag to the sentence according to each word's grammatical role in the sentence.
  • In this embodiment, the sentence "Xiaohong will go to the ninth-floor conference room of the Shenzhen Science and Technology Museum tomorrow afternoon to attend a popular-science lecture" is selected from the corpus. After word segmentation it becomes "Xiaohong / tomorrow / afternoon / will / go / Shenzhen City / Science and Technology Museum / ninth floor / conference room / attend / popular science / knowledge / lecture".
  • Xiaohong, Shenzhen City, Science and Technology Museum, conference room, and lecture are nouns.
  • According to part of speech, these words receive the same first tag, usually N (noun); "tomorrow" and "afternoon" denote time and receive the first tag T (time); "attend" is a verb and receives the first tag V (verb); "will", "go", "ninth floor", and "popular-science knowledge" are additional words and may be left untagged.
  • Once the first tags are complete, second-level tagging follows.
  • Among the words whose first tag is N, Xiaohong is a noun denoting a person and is the subject of the sentence, so its second tag is NS (noun/subject).
  • Shenzhen City, Science and Technology Museum, and conference room are nouns denoting places, which usually serve as adverbial modifiers, but the places they refer to differ in extent.
  • The three run from the widest extent to the narrowest, so they can be tagged NAM1, NAM2, and NAM3. "Lecture" is the object of the sentence and is tagged NO. The time nouns "tomorrow" and "afternoon" can be tagged T1 and T2 according to the size of the time range.
  • Once tagging is complete, every sentence in the corpus can be represented by its tag sequence; for example, the sentence above can be tagged "NS T1 T2 NAM1 NAM2 NAM3 V NO".
  • In other embodiments, the word segmentation module 110 further includes a third tag unit 114, which adds a third tag, according to word meaning, to sentences that have the same tag sequence after tagging but different meanings.
  • For example, after segmentation and tagging, the sentence "Archaeologists discovered fossil teeth of Yuanmou Man in Shangnabang Village, Yuanmou County, Yunnan Province, on May Day 1965" yields exactly the same tag sequence as the sentence above, yet the two clearly differ in content and therefore need to be distinguished. In this embodiment, the order in which the three kinds of tags are added is not limited.
  • The classification module 120 identifies the sentences after knowledge-driven processing and classifies sentences that have the same meaning but different tag sequences into the same sentence cluster. Through this processing, the sentences in each monolingual parallel corpus are divided into a number of sentence clusters, each containing all the sentence structure types of sentences with one meaning; sentences with the same tag sequence have the same sentence structure.
  • The mapping module 130 analyzes the sentences in the sentence clusters of each monolingual parallel corpus, determines the sentence structure categories of all sentences in each cluster, and determines, records, and stores the mapping of tag transformations between the corresponding sentence structures when one sentence structure category in a cluster is converted into another.
  • In this embodiment, if a sentence cluster stores m sentence structures, m(m-1)/2 mapping relationships can be generated among them. The sentence structures in all sentence clusters are mapped in this way, and the resulting mappings are determined, recorded, and stored.
  • The sentence structure generation module 140 finds the same first-category sentence structure in each sentence cluster of each monolingual parallel corpus and, according to the mapping between the first-category sentence structure of one cluster and the other sentence structure categories in that cluster, maps the first-category sentence structure in each of the remaining clusters, generating the corresponding sentence structure categories in those clusters.
  • The sentence structure generation module 140 finds the same first-category sentence structure in each sentence cluster of all monolingual parallel corpora and, according to the first-type mapping between the first-category sentence structure of one cluster and the other sentence structure categories in that cluster, maps the first-category sentence structure in each of the remaining clusters to generate the corresponding sentence structure categories.
  • In this embodiment, one sentence cluster of one monolingual parallel corpus is chosen and compared one by one with all the sentence clusters in the other monolingual parallel corpora.
  • Assuming there are n monolingual parallel corpora, n(n-1)/2 comparisons are performed to find the sentence structures shared by the two clusters being compared.
  • For example, compare sentence cluster K with sentence cluster L: K contains the synonymous sentence structures a, b, c, d, and e, and L contains the synonymous sentence structures d, e, and f, so K and L share two sentence structures of the same type, d and e.
  • After processing by the mapping module 130, mappings from d and e to a, b, and c are generated and recorded in cluster K, and mappings from d and e to f are generated and recorded in cluster L.
  • Then, in cluster L, the corresponding mappings a', b', and c' are established according to how d and e map to a, b, and c in cluster K.
  • In cluster K, the corresponding mapping f" is established according to how d and e map to f in cluster L, so that each newly generated sentence cluster contains six types of sentence structure.
  • The corpus generation module 150 nests the words corresponding to the sequence tags of the sentence structures in each cluster into the sentence structures generated in all the sentence clusters, thereby obtaining corpus entries. The newly generated sentence clusters are combined into a new monolingual parallel corpus.
  • The corpus generation module 150 includes a tag identification unit 151 and a corpus generation unit 152.
  • The tag identification unit 151 identifies the tags in all sentence structures of each sentence cluster, and the corpus generation unit 152 nests the words corresponding to those tags into the sentence structures to generate corpus entries.
  • The corpus generation unit 152 performs the nesting in accordance with the tagging standard of the word segmentation module 110.
  • In this embodiment, if a sentence structure generated by the sentence structure generation module 140 contains the tag NS, a word is nested according to the meaning of the sentences in that cluster, for example "Xiaohong" or "archaeologists".
  • In contrast to the prior art, the corpus generating apparatus of the present invention tags the sentences in an existing corpus, maps sentence patterns with different tag sequences onto one another according to their tags to obtain more sentence structures, and obtains more corpus entries by filling in the words corresponding to the tags.
  • The invention thus obtains corpus entries by nesting words into the expanded sentence structures. The operation is simple, saves resources, and expands the corpus substantially.
  • FIG. 2 is a schematic flowchart of an embodiment of the corpus generating method provided by the present invention. The method includes the following steps:
  • S210: Segment each sentence in at least one monolingual parallel corpus, and apply knowledge-driven processing to the segmented words so that they are tagged.
  • A monolingual parallel corpus is connected, the sentences in it are segmented, and knowledge-driven processing is applied after segmentation to tag the sentences.
  • Word segmentation is usually performed with word segmentation tool software.
  • Networked word segmentation tool software is connected, and the sentences to be decomposed are imported into it for segmentation.
  • The first tag is added according to part of speech: the nouns Xiaohong, Shenzhen City, Science and Technology Museum, conference room, and lecture receive the same first tag, usually N (noun); "tomorrow" and "afternoon" denote time and receive T (time); "attend" is a verb and receives V (verb); "will", "go", "ninth floor", and "popular-science knowledge" are additional words and may be left untagged.
  • Once the first tags are complete, second-level tagging follows.
  • Among the words whose first tag is N, Xiaohong is a noun denoting a person and is the subject of the sentence, so its second tag is NS (noun/subject).
  • Shenzhen City, Science and Technology Museum, and conference room are nouns denoting places, which usually serve as adverbial modifiers, but the places they refer to differ in extent.
  • The three run from the widest extent to the narrowest, so they can be tagged NAM1, NAM2, and NAM3. "Lecture" is the object of the sentence and is tagged NO. The time nouns "tomorrow" and "afternoon" can be tagged T1 and T2 according to the size of the time range.
  • Once tagging is complete, every sentence in the corpus can be represented by its tag sequence; for example, the sentence above can be tagged "NS T1 T2 NAM1 NAM2 NAM3 V NO".
  • For sentences that have the same tag sequence after tagging but different meanings, a third tag is added according to word meaning.
  • For example, after segmentation and tagging, the sentence "Archaeologists discovered fossil teeth of Yuanmou Man in Shangnabang Village, Yuanmou County, Yunnan Province, on May Day 1965" yields exactly the same tag sequence as the sentence above, yet the two clearly differ in content and therefore need to be distinguished.
  • S220: Identify the sentences after knowledge-driven processing, and classify sentences that have the same meaning but different tag sequences into the same sentence cluster.
  • S230: Analyze the sentences in each sentence cluster of each monolingual parallel corpus, determine the sentence structure categories of all sentences in the cluster, and determine, record, and store the mapping of tag transformations between the corresponding sentence structures when one sentence structure category in a cluster is converted into another.
  • In this embodiment, if a sentence cluster stores m sentence structures, m(m-1)/2 mapping relationships can be generated among them. The sentence structures in all sentence clusters are mapped in this way, and the resulting mappings are determined, recorded, and stored.
  • S240: Find the same first-category sentence structure in each sentence cluster of all monolingual parallel corpora and, according to the first-type mapping between the first-category sentence structure of one cluster and the other sentence structure categories in that cluster, map the first-category sentence structure in each of the remaining clusters to generate the corresponding sentence structure categories.
  • In this embodiment, one sentence cluster is chosen and compared with the other sentence clusters one by one.
  • Assuming there are n monolingual parallel corpora, n(n-1)/2 comparisons are performed to find the sentence structures shared by the two clusters being compared.
  • For example, compare sentence cluster K with sentence cluster L: K contains the synonymous sentence structures a, b, c, d, and e, and L contains the synonymous sentence structures d, e, and f, so K and L share two sentence structures of the same type, d and e.
  • After processing by the mapping module 130, mappings from d and e to a, b, and c are generated and recorded in cluster K, and mappings from d and e to f are generated and recorded in cluster L.
  • Then, in cluster L, the corresponding mappings a', b', and c' are established according to how d and e map to a, b, and c in cluster K.
  • In cluster K, the corresponding mapping f" is established according to how d and e map to f in cluster L. At this point clusters K and L each contain six types of sentence structure, so the set of sentence structures has been expanded.
  • After all sentence clusters have been compared pairwise, every cluster has been expanded, and in the end each cluster contains as many sentence structures as the union of the sentence structures across all clusters.
  • S250: Nest, into the newly generated sentence structures, the words corresponding to the sequence tags of the sentence structures in the cluster, to generate a new monolingual parallel corpus.
  • The step of obtaining corpus entries includes:
  • The tags in all sentence structures of each sentence cluster are identified, and the sentence structures are nested in accordance with the tagging standard.
  • The words corresponding to the tags in all sentence structures of each sentence cluster are nested into the sentence structures to generate new sentence clusters. The newly generated sentence clusters are combined into a new monolingual parallel corpus.
  • In this embodiment, if a generated sentence structure contains the tag NS, a word is nested according to the meaning of the sentences in that cluster, for example "Xiaohong" or "archaeologists".
  • In contrast to the prior art, the corpus generating method of the present invention tags the sentences in an existing corpus, maps sentence patterns with different tag sequences onto one another according to their tags to obtain more sentence structures, and obtains more corpus entries by filling in the words corresponding to the tags.
  • The invention thus obtains corpus entries by nesting words into the expanded sentence structures. The operation is simple, saves resources, and expands the corpus substantially.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

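A corpus generating apparatus and method. The apparatus (100) comprises: a word segmentation module (110), connected to at least one monolingual parallel corpus, for segmenting sentences and applying knowledge-driven processing to the segmented words; a classification module (120) for classifying sentences that have the same meaning but different tag sequences into the same sentence cluster; a mapping module (130) for determining the sentence structure categories of all sentences in a sentence cluster and recording and storing the mapping of tag transformations between sentence structures when one sentence structure category in a cluster is converted into another; a sentence structure generation module (140) for generating sentence structure categories according to the first-type mapping between the first-category sentence structure of one sentence cluster and the other sentence structure categories in that cluster; and a corpus generation module (150) for nesting the words corresponding to the sequence tags to generate a new monolingual parallel corpus. The apparatus and method obtain corpus entries by nesting words into the expanded sentence structures; the operation is simple, saves resources, and expands the corpus substantially.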

Description

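Corpus generating apparatus and method
Technical Field
The present invention relates to the field of text processing, and in particular to a corpus generating apparatus and method.
Background
With the development of the Internet, the demand for web search keeps growing, so more keywords and corpus entries need to be stored in cloud-based corpora for users to draw on when searching online.
However, language expression is rich and varied: a sentence may be formed simply by combining a few words. If a corpus is built by collecting and entering every sentence one by one, the effort required is excessive and sentences are easily missed. The prior art includes edit-distance methods, which expand a corpus through operations such as deletion, shifting, and insertion, but the procedure is cumbersome in practice.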
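Summary of the Invention
The main technical problem solved by the present invention is to provide a corpus generating apparatus and method that obtain corpus entries by nesting words into expanded sentence structures. The operation is simple, saves resources, and expands the corpus substantially.
To solve the above technical problem, one technical solution adopted by the present invention is a corpus generating apparatus comprising: a word segmentation module, connected to at least one monolingual parallel corpus, for segmenting the sentences in each parallel corpus and applying knowledge-driven processing to the segmented words so that they are tagged; a classification module for identifying the sentences after knowledge-driven processing and classifying sentences that have the same meaning but different tag sequences into the same sentence cluster; a mapping module for analyzing the sentences in each sentence cluster of each monolingual parallel corpus, determining the sentence structure categories of all sentences in the cluster, and determining, recording, and storing the mapping of tag transformations between the corresponding sentence structures when one sentence structure category in a cluster is converted into another; a sentence structure generation module for finding the same first-category sentence structure in each sentence cluster of all monolingual parallel corpora and, according to the first-type mapping between the first-category sentence structure of one cluster and the other sentence structure categories in that cluster, mapping the first-category sentence structure in each of the remaining clusters to generate the corresponding sentence structure categories; and a corpus generation module for nesting, into the newly generated sentence structures, the words corresponding to the sequence tags of the sentence structures in the cluster, to generate a new monolingual parallel corpus.
To solve the above technical problem, another technical solution adopted by the present invention is a corpus generating method comprising the steps of: segmenting each sentence in at least one monolingual parallel corpus and applying knowledge-driven processing to the segmented words so that they are tagged; identifying the sentences after knowledge-driven processing and classifying sentences that have the same meaning but different tag sequences into the same sentence cluster; analyzing the sentences in each sentence cluster of each monolingual parallel corpus, determining the sentence structure categories of all sentences in the cluster, and determining, recording, and storing the mapping of tag transformations between the corresponding sentence structures when one sentence structure category in a cluster is converted into another; finding the same first-category sentence structure in each sentence cluster of all monolingual parallel corpora and, according to the first-type mapping between the first-category sentence structure of one cluster and the other sentence structure categories in that cluster, mapping the first-category sentence structure in each of the remaining clusters to generate the corresponding sentence structure categories; and nesting, into the newly generated sentence structures, the words corresponding to the sequence tags of the sentence structures in the cluster, to generate a new monolingual parallel corpus.
In contrast to the prior art, the corpus generating apparatus of the present invention tags the sentences in an existing corpus, maps sentence patterns with different tag sequences onto one another according to their tags to obtain more sentence structures, and obtains more corpus entries by filling in the words corresponding to the tags. The invention thus obtains corpus entries by nesting words into the expanded sentence structures; the operation is simple, saves resources, and expands the corpus substantially.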
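Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of an embodiment of the corpus generating apparatus provided by the present invention;
FIG. 2 is a schematic flowchart of an embodiment of the corpus generating method provided by the present invention.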
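Detailed Description
The technical solutions of the present invention are described in further detail below with reference to specific embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
The construction of corpora is an important foundation of statistical learning methods, and in recent years the great value of corpus resources for natural language processing research has been increasingly recognized. Bilingual corpora in particular have become an indispensable resource for machine translation, machine-assisted translation, and translation knowledge acquisition research. On the one hand, the emergence of bilingual corpora has directly driven new machine translation technologies: parallel corpora provide essential training data for building statistical machine translation models, and corpus-based translation methods, such as statistics-based and example-based methods, have offered new ideas for machine translation research, effectively improved translation quality, and set off a new wave of research in the field. On the other hand, the bilingual corpus is an important source of translation knowledge, from which various kinds of fine-grained translation knowledge, such as translation dictionaries and translation templates, can be mined and learned to improve traditional machine translation technology. In addition, the bilingual corpus is an important basic resource for cross-language information retrieval, translation dictionary compilation, automatic extraction of bilingual terms, and multilingual comparative research. Building and acquiring bilingual parallel corpora is very difficult, and countries have invested large amounts of manpower, material, and financial resources in it, yet the sources of bilingual parallel corpora are concentrated in specific domains such as government reports, news, and law, which makes them unsuitable for real-world text applications. At the same time, large-scale bilingual text on the Internet offers good timeliness and coverage, which provides a potential route to acquiring bilingual parallel corpora.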
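Referring to FIG. 1, FIG. 1 is a schematic structural diagram of an embodiment of the corpus generating apparatus provided by the present invention. The apparatus 100 includes a word segmentation module 110, a classification module 120, a mapping module 130, a sentence structure generation module 140, and a corpus generation module 150.
The word segmentation module 110 is connected to the monolingual parallel corpus 101 and is used to segment the sentences in the parallel corpus 101 and to apply knowledge-driven processing to the segmented words so that they are tagged. Word segmentation is usually performed with word segmentation tool software. Knowledge-driven processing is the process of tagging the sentences.
The word segmentation module 110 includes a word segmentation unit 111, a first tag unit 112, and a second tag unit 113. The word segmentation unit 111 connects to networked word segmentation tool software and imports the sentences to be decomposed into that software for segmentation. After all sentences in the existing corpus have been segmented, the first tag unit 112 adds a first tag to each sentence according to the part of speech of the words in the segmented sentence. The second tag unit 113 adds a second tag according to each word's grammatical role in the sentence. In this embodiment, the sentence "小红明天下午要去深圳市科技馆九层会议室参加科普知识讲座" ("Xiaohong will go to the ninth-floor conference room of the Shenzhen Science and Technology Museum tomorrow afternoon to attend a popular-science lecture") is selected from the corpus, and after segmentation it becomes "小红/明天/下午/要/去/深圳市/科技馆/九层/会议室/参加/科普/知识/讲座". Here 小红 (Xiaohong), 深圳市 (Shenzhen City), 科技馆 (Science and Technology Museum), 会议室 (conference room), and 讲座 (lecture) are nouns and, according to part of speech, receive the same first tag, usually N (noun); 明天 (tomorrow) and 下午 (afternoon) denote time and receive the first tag T (time); 参加 (attend) is a verb and receives the first tag V (verb); 要 (will), 去 (go), 九层 (ninth floor), and 科普知识 (popular-science knowledge) are additional words and may be left untagged. Once the first tags are complete, second-level tagging follows. Among the words whose first tag is N, 小红 is a noun denoting a person and is the subject of the sentence, so its second tag is NS (noun/subject). 深圳市, 科技馆, and 会议室 are nouns denoting places, which usually serve as adverbial modifiers, but the places they refer to differ in extent: the three run from the widest extent to the narrowest, so they can be tagged NAM1, NAM2, and NAM3. 讲座 is the object of the sentence and is tagged NO. The time nouns 明天 and 下午 can be tagged T1 and T2 according to the size of the time range. Once tagging is complete, every sentence in the corpus can be represented by its tag sequence; for example, the sentence above can be tagged "NS T1 T2 NAM1 NAM2 NAM3 V NO".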
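The two-level tagging described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the lookup tables `FIRST_TAG` and `SECOND_TAG` and the function `tag_sequence` are hypothetical names, the tags are hand-coded from the example sentence, and the segmentation is given as a literal string rather than produced by a segmentation tool.

```python
# Hypothetical sketch: the tables below hand-code the tags of the example
# sentence; a real system would obtain them from a tagger and a parser.

# First tag: part of speech (N = noun, T = time, V = verb).
FIRST_TAG = {
    "小红": "N", "深圳市": "N", "科技馆": "N", "会议室": "N", "讲座": "N",
    "明天": "T", "下午": "T",
    "参加": "V",
}

# Second tag: role in the sentence, refining the first tag.
SECOND_TAG = {
    "小红": "NS",      # noun / subject
    "深圳市": "NAM1",  # noun / adverbial modifier, widest place
    "科技馆": "NAM2",
    "会议室": "NAM3",  # narrowest place
    "讲座": "NO",      # noun / object
    "明天": "T1",      # wider time range
    "下午": "T2",      # narrower time range
    "参加": "V",       # the verb keeps its first tag
}


def tag_sequence(tokens):
    """Return the tag sequence of a segmented sentence, skipping untagged words."""
    return [SECOND_TAG[token] for token in tokens if token in FIRST_TAG]


tokens = "小红/明天/下午/要/去/深圳市/科技馆/九层/会议室/参加/科普/知识/讲座".split("/")
print(" ".join(tag_sequence(tokens)))  # NS T1 T2 NAM1 NAM2 NAM3 V NO
```

Words without a first tag (要, 去, 九层, 科普, 知识) are simply skipped, which mirrors the text's remark that additional words may be left untagged.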
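In other embodiments, the word segmentation module 110 further includes a third tag unit 114, which adds a third tag, according to word meaning, to sentences that have the same tag sequence after tagging but different meanings. For example, after segmentation and tagging, the sentence "考古学家于1965年五一节在云南省元谋县上那蚌村发现了元谋人牙齿化石" ("Archaeologists discovered fossil teeth of Yuanmou Man in Shangnabang Village, Yuanmou County, Yunnan Province, on May Day 1965") yields exactly the same tag sequence as the sentence above, yet the two clearly differ in content and therefore need to be distinguished. In this embodiment, the order in which the three kinds of tags are added is not limited.
The classification module 120 identifies the sentences after knowledge-driven processing and classifies sentences that have the same meaning but different tag sequences into the same sentence cluster. Through this processing, the sentences in each monolingual parallel corpus are divided into a number of sentence clusters, each containing all the sentence structure types of sentences with one meaning; sentences with the same tag sequence have the same sentence structure.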
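As a rough sketch of the clustering step, the snippet below groups tagged sentences by meaning and keeps one entry per distinct tag sequence. The source does not say how "same meaning" is detected, so the `meaning_id` key is an assumption made for illustration (it could, for instance, be derived from the third tag); `build_clusters` is a hypothetical helper name.

```python
from collections import defaultdict


def build_clusters(tagged_sentences):
    """Group (meaning_id, tag_sequence) pairs into clusters keyed by meaning.

    Each cluster is the set of distinct tag sequences, i.e. the sentence
    structure categories, observed for that meaning.
    """
    clusters = defaultdict(set)
    for meaning_id, tags in tagged_sentences:
        clusters[meaning_id].add(tuple(tags))
    return dict(clusters)


# meaning_id is an assumed label; the tag sequences reuse the example tags.
sentences = [
    ("lecture", ["NS", "T1", "T2", "NAM1", "NAM2", "NAM3", "V", "NO"]),
    ("lecture", ["T1", "T2", "NS", "NAM1", "NAM2", "NAM3", "V", "NO"]),
    ("lecture", ["NS", "T1", "T2", "NAM1", "NAM2", "NAM3", "V", "NO"]),  # repeats the first structure
    ("fossil", ["NS", "T1", "T2", "NAM1", "NAM2", "NAM3", "V", "NO"]),
]

clusters = build_clusters(sentences)
for meaning_id, structures in clusters.items():
    print(meaning_id, len(structures))
```

Using a set per cluster means that sentences with the same tag sequence collapse into one sentence structure, while the same tag sequence under a different meaning lands in a different cluster.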
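The mapping module 130 analyzes the sentences in the sentence clusters of each monolingual parallel corpus, determines the sentence structure categories of all sentences in each cluster, and determines, records, and stores the mapping of tag transformations between the corresponding sentence structures when one sentence structure category in a cluster is converted into another. In this embodiment, if a sentence cluster stores m sentence structures, m(m-1)/2 mapping relationships can be generated among them. The sentence structures in all sentence clusters are mapped in this way, and the resulting mappings are determined, recorded, and stored.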
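One way to picture the recorded mappings is as reorderings of tags. The sketch below assumes, purely for illustration, that two sentence structures in a cluster contain the same tags in different orders, so a mapping can be stored as a tuple of source positions; the source does not specify the representation. `tag_mapping` and `all_mappings` are hypothetical names.

```python
from itertools import combinations


def tag_mapping(src, dst):
    """Describe dst as a reordering of src, so that dst[i] == src[mapping[i]]."""
    index = {tag: i for i, tag in enumerate(src)}
    return tuple(index[tag] for tag in dst)


def all_mappings(structures):
    """Record one mapping per unordered pair of structures in a cluster."""
    return {(a, b): tag_mapping(a, b) for a, b in combinations(structures, 2)}


# Three assumed sentence structures of one cluster (m = 3).
structures = [
    ("NS", "T1", "T2", "V", "NO"),
    ("T1", "T2", "NS", "V", "NO"),
    ("NS", "V", "NO", "T1", "T2"),
]

mappings = all_mappings(structures)
m = len(structures)
print(len(mappings), m * (m - 1) // 2)  # both counts are 3
print(mappings[(structures[0], structures[1])])  # (1, 2, 0, 3, 4)
```

`itertools.combinations` enumerates each unordered pair once, which is where the m(m-1)/2 count in the text comes from.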
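The sentence structure generation module 140 finds the same first-category sentence structure in each sentence cluster of each monolingual parallel corpus and, according to the mapping between the first-category sentence structure of one cluster and the other sentence structure categories in that cluster, maps the first-category sentence structure in each of the remaining clusters, generating the corresponding sentence structure categories in those clusters.
The sentence structure generation module 140 finds the same first-category sentence structure in each sentence cluster of all monolingual parallel corpora and, according to the first-type mapping between the first-category sentence structure of one cluster and the other sentence structure categories in that cluster, maps the first-category sentence structure in each of the remaining clusters to generate the corresponding sentence structure categories. In this embodiment, one sentence cluster of one monolingual parallel corpus is chosen and compared one by one with all the sentence clusters in the other monolingual parallel corpora. Assuming there are n monolingual parallel corpora, n(n-1)/2 comparisons are performed to find the sentence structures shared by the two clusters being compared. For example, compare sentence cluster K with sentence cluster L: K contains the synonymous sentence structures a, b, c, d, and e, and L contains the synonymous sentence structures d, e, and f, so K and L share two sentence structures of the same type, d and e. After processing by the mapping module 130, mappings from d and e to a, b, and c are generated and recorded in cluster K, and mappings from d and e to f are generated and recorded in cluster L. Then, in cluster L, the corresponding mappings a', b', and c' are established according to how d and e map to a, b, and c in cluster K; in cluster K, the corresponding mapping f" is established according to how d and e map to f in cluster L. Each newly generated sentence cluster then contains six types of sentence structure.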
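The K and L example can be replayed in code. The sketch below is an assumption-laden illustration: sentence structures are modeled as orderings of the same four tags, a mapping is a reordering recorded in the donor cluster, and the concrete orderings chosen for a to f are invented for the demonstration. `expand`, `tag_mapping`, and `apply_mapping` are hypothetical names.

```python
def tag_mapping(src, dst):
    """Record how dst reorders src: dst[i] == src[mapping[i]]."""
    index = {tag: i for i, tag in enumerate(src)}
    return tuple(index[tag] for tag in dst)


def apply_mapping(src, mapping):
    """Replay a recorded reordering on a structure."""
    return tuple(src[i] for i in mapping)


def expand(target, donor):
    """Add to `target` every donor structure reachable from a shared structure."""
    shared = target & donor
    if not shared:
        return set(target)
    anchor = min(shared)  # any shared structure works; min() keeps the run deterministic
    result = set(target)
    for other in donor - target:
        mapping = tag_mapping(anchor, other)        # recorded in the donor cluster
        result.add(apply_mapping(anchor, mapping))  # replayed in the target cluster
    return result


# Invented orderings standing in for the structures a to f of the text.
a = ("NS", "T1", "V", "NO")
b = ("T1", "NS", "V", "NO")
c = ("NS", "V", "NO", "T1")
d = ("NS", "V", "T1", "NO")
e = ("T1", "NO", "NS", "V")
f = ("NO", "NS", "T1", "V")

K = {a, b, c, d, e}
L = {d, e, f}

new_L = expand(L, K)  # gains a', b', c'
new_K = expand(K, L)  # gains f"
print(len(new_K), len(new_L))  # 6 6
```

At the level of structure categories the result is the union of the two clusters; the point of recording and replaying the mapping is that the same reordering can later be applied to the words of the other cluster.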
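The corpus generation module 150 nests the words corresponding to the sequence tags of the sentence structures in each cluster into the sentence structures generated in all the sentence clusters, thereby obtaining corpus entries. The newly generated sentence clusters are combined into a new monolingual parallel corpus. The corpus generation module 150 includes a tag identification unit 151 and a corpus generation unit 152: the tag identification unit 151 identifies the tags in all sentence structures of each sentence cluster, and the corpus generation unit 152 nests the words corresponding to those tags into the sentence structures to generate corpus entries. The corpus generation unit 152 performs the nesting in accordance with the tagging standard of the word segmentation module 110. In this embodiment, if a sentence structure generated by the sentence structure generation module 140 contains the tag NS, a word is nested according to the meaning of the sentences in that cluster, for example "小红" (Xiaohong) or "考古学家" (archaeologists).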
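The nesting step amounts to filling each tag slot of a sentence structure with the word that the cluster associates with that tag. The sketch below is a simplified illustration: `nest_words` and the two word tables are hypothetical, only a subset of the example tags is used, and untagged function words (such as 要, 去, 于, 在) are ignored.

```python
def nest_words(structure, words_by_tag):
    """Fill every tag of a sentence structure with the cluster's word for that tag."""
    return "".join(words_by_tag[tag] for tag in structure)


# Assumed tag-to-word tables for two clusters, taken from the example sentences.
lecture_words = {"NS": "小红", "T1": "明天", "T2": "下午", "V": "参加", "NO": "讲座"}
fossil_words = {"NS": "考古学家", "T1": "1965年", "T2": "五一节", "V": "发现了", "NO": "元谋人牙齿化石"}

# A generated sentence structure that puts the time expressions first.
structure = ("T1", "T2", "NS", "V", "NO")

print(nest_words(structure, lecture_words))  # 明天下午小红参加讲座
print(nest_words(structure, fossil_words))   # 1965年五一节考古学家发现了元谋人牙齿化石
```

The same structure produces a different sentence in each cluster because the NS slot is filled with 小红 in one and 考古学家 in the other, as the text describes.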
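In contrast to the prior art, the corpus generating apparatus of the present invention tags the sentences in an existing corpus, maps sentence patterns with different tag sequences onto one another according to their tags to obtain more sentence structures, and obtains more corpus entries by filling in the words corresponding to the tags. The invention thus obtains corpus entries by nesting words into the expanded sentence structures; the operation is simple, saves resources, and expands the corpus substantially.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of an embodiment of the corpus generating method provided by the present invention. The method includes the following steps:
S210: Segment each sentence in at least one monolingual parallel corpus, and apply knowledge-driven processing to the segmented words so that they are tagged.
A monolingual parallel corpus is connected so that the sentences in it can be segmented, and knowledge-driven processing is applied after segmentation to tag the sentences. Word segmentation is usually performed with word segmentation tool software.
The segmentation and tagging step includes:
S211: Segment the sentences in the parallel corpus.
Networked word segmentation tool software is connected, and the sentences to be decomposed are imported into it for segmentation.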
S212:按照词语的词性对分词处理后的所述语句添加第一标签。
选定语料库中的语句“小红明天下午要去深圳市科技馆九层会议室参加科普知识讲座”,经过分词操作后,得到“小红/明天/下午/要/去/深圳市/科技馆/九层/会议室/参加/科普/知识/讲座”。其中,小红、深圳市、科技馆、会议室和讲座为名词,按照词性为上述7个词语标注相同的第一标签,通常标注为N(noun);明天和下午表征时间,第一标签标注为T(time);参加为动词,第一标签标注为V(verb);要、去、九层和科普知识为附加词语,可省略不做标记。
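S213: Add a second tag to the sentence already carrying the first tag, according to each word's grammatical role in the sentence.
After the first tags are assigned, second-level tagging is performed. Among the words whose first tag is N, “小红” is a noun denoting a person and is the subject of the sentence, so its second tag is NS (noun/subject). “深圳市”, “科技馆” and “会议室” are nouns denoting places and normally act as adverbial modifiers, but the places differ in scope, running from largest to smallest, so they can be tagged NAM1, NAM2 and NAM3. “讲座” is the object of the sentence and is tagged NO. The time words “明天” and “下午” can be tagged T1 and T2 according to the size of the time span. Once tagging is complete, every sentence in the corpus can be represented by a tag sequence; the sentence above, for instance, can be tagged “NS T1 T2 NAM1 NAM2 NAM3 V NO”.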
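The two-level tagging in S212 and S213 can be sketched as a simple lookup. This is only an illustration: the dictionary `SECOND_TAG` and the function `tag_sequence` are hypothetical names, and the patent does not specify how tags are actually assigned (it only says a knowledge-driven process is used). Words missing from the table are treated as untagged auxiliary words.

```python
# Hypothetical lookup table for the example sentence; a real system would
# derive these tags from a knowledge base rather than a hand-written dict.
SECOND_TAG = {
    "小红": "NS",      # noun, subject
    "明天": "T1",      # time, larger span
    "下午": "T2",      # time, smaller span
    "深圳市": "NAM1",  # place noun, largest scope
    "科技馆": "NAM2",  # place noun, medium scope
    "会议室": "NAM3",  # place noun, smallest scope
    "参加": "V",       # verb
    "讲座": "NO",      # noun, object
}

def tag_sequence(tokens):
    """Return the tag sequence of a segmented sentence, skipping untagged words."""
    return [SECOND_TAG[t] for t in tokens if t in SECOND_TAG]

# Segmented form of the example sentence from the description.
tokens = "小红/明天/下午/要/去/深圳市/科技馆/九层/会议室/参加/科普/知识/讲座".split("/")
sequence = " ".join(tag_sequence(tokens))
print(sequence)
```

Under these assumptions the printed sequence matches the one given in the description, “NS T1 T2 NAM1 NAM2 NAM3 V NO”.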
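S214: For sentences that have identical tag sequences after tagging but different meanings, add a third tag to the sentences according to word meaning.
Sentences that have identical tag sequences after tagging but different meanings receive a third tag according to word meaning. For example, the sentence “考古学家于1965年五一节在云南省元谋县上那蚌村发现了元谋人牙齿化石” (“On May Day 1965, archaeologists discovered fossil teeth of Yuanmou Man in Shangnabang Village, Yuanmou County, Yunnan Province”) yields, after word segmentation and tagging, exactly the same tag sequence as the preceding sentence, yet the two sentences clearly differ in content and therefore need to be distinguished.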
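S220: Identify the sentences after knowledge-driven processing, and classify sentences that have the same meaning but different tag sequences into the same sentence cluster.
The segmented sentences are identified, and sentences that have the same meaning but different tag sequences are classified into the same sentence cluster. The sentences in the corpus are divided into multiple sentence clusters, each of which contains all the sentence structure types of the sentences sharing one meaning. At this point, sentences with the same tag sequence have the same sentence structure.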
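A minimal sketch of the clustering in S220, assuming each tagged sentence already carries a meaning identifier (the hypothetical `meaning_id` below; the patent does not say how sameness of meaning is detected). Each cluster stores its distinct tag sequences, so sentences with the same tag sequence collapse into one sentence structure.

```python
from collections import defaultdict

def build_clusters(tagged_sentences):
    """Group (meaning_id, tag_list) pairs into clusters of distinct sentence structures."""
    clusters = defaultdict(set)
    for meaning_id, tags in tagged_sentences:
        clusters[meaning_id].add(tuple(tags))  # identical tag sequences count once
    return dict(clusters)

tagged = [
    ("lecture", ["NS", "T1", "V", "NO"]),
    ("lecture", ["T1", "NS", "V", "NO"]),  # same meaning, different tag order
    ("lecture", ["NS", "T1", "V", "NO"]),  # same tag sequence, same structure
    ("fossil", ["NS", "T1", "V", "NO"]),   # same tags, different meaning
]
clusters = build_clusters(tagged)
print({k: sorted(v) for k, v in clusters.items()})
```

The last entry shows why the third tag of S214 matters: without a meaning distinction, the "fossil" sentence would be indistinguishable from the "lecture" sentences by tag sequence alone.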
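S230: Analyze the sentences in each sentence cluster of each monolingual parallel corpus, determine the sentence structure categories of all sentences in the cluster, and determine and store, for transformations between different sentence structure categories within the same cluster, the mapping mode of the tag transformation between the corresponding sentence structures.
The sentences in each sentence cluster are analyzed and the sentence structure categories of all sentences in the cluster are determined. For transformations between different sentence structure categories within the same cluster, the mapping mode of the tag transformation between the corresponding sentence structures is determined and stored. In this embodiment, if a sentence cluster stores m sentence structures, the m structures can generate m(m-1)/2 mapping relations among themselves. The sentence structures in all sentence clusters are mapped, and the generated mapping modes are determined and stored.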
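One way to picture the mapping modes of S230 is as tag permutations. The sketch below makes the simplifying assumption that all structures in a cluster contain the same distinct tags, so a mapping is just a reordering; `tag_mapping` and `cluster_mappings` are hypothetical names, and the patent does not prescribe this representation. Enumerating unordered pairs gives the m(m-1)/2 mappings mentioned above.

```python
from itertools import combinations

def tag_mapping(src, dst):
    """Return the permutation p with dst[i] == src[p[i]].

    Assumes src and dst contain the same distinct tags.
    """
    position = {tag: i for i, tag in enumerate(src)}
    return tuple(position[tag] for tag in dst)

def cluster_mappings(structures):
    """Record one mapping for every unordered pair of structures in a cluster."""
    return {(a, b): tag_mapping(a, b) for a, b in combinations(sorted(structures), 2)}

structures = {
    ("NS", "T1", "V", "NO"),
    ("T1", "NS", "V", "NO"),
    ("NS", "V", "NO", "T1"),
}
mappings = cluster_mappings(structures)
print(len(mappings))  # m = 3 structures give 3 * 2 / 2 pairs
```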
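S240: Search every sentence cluster in all monolingual parallel corpora for an identical first-category sentence structure and, according to the first-type mapping mode between the first-category sentence structure and the other categories of sentence structure within one of the sentence clusters, map the first-category sentence structure in each of the remaining sentence clusters according to that mapping mode, generating the corresponding sentence structure categories.
Each sentence cluster is analyzed and the sentence structures in all sentence clusters are queried. In this embodiment, one sentence cluster is chosen and compared one by one with the other sentence clusters. Assuming there are n monolingual parallel corpora, n(n-1)/2 comparisons are performed to find the sentence structures shared by each compared pair of sentence clusters. For example, compare sentence cluster K with sentence cluster L: cluster K contains the synonymous sentence structures a, b, c, d and e, and cluster L contains the synonymous sentence structures d, e and f, so K and L share two sentence structures of the same type, d and e. After processing by the mapping module 130, structures d and e in cluster K each have mapping relations with a, b and c, with the mapping modes recorded; in cluster L they have mapping relations with structure f, again with the mapping modes recorded. Then, in cluster L, the corresponding mapped structures a', b' and c' are established according to how d and e map to a, b and c in cluster K; in cluster K, the corresponding mapped structure f'' is established according to how d and e map to f in cluster L. Clusters K and L then each contain six types of sentence structure (a, b, c, d, e and f), so the set of sentence structures has been extended.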
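After all sentence clusters have been compared pairwise, every cluster has been extended, and each cluster finally contains as many sentence structures as the union of the sentence structures of all clusters.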
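The extension of clusters K and L can be sketched as follows, again under the assumption that structures are tag tuples over the same distinct tags and that a mapping is a permutation. The function names and the concrete tag orders chosen for a to f are hypothetical; only the cluster contents (K holds a to e, L holds d to f) come from the description.

```python
def permutation(src, dst):
    """Permutation p with dst[i] == src[p[i]]; assumes identical distinct tags."""
    position = {tag: i for i, tag in enumerate(src)}
    return tuple(position[tag] for tag in dst)

def apply_permutation(src, perm):
    """Reorder the tags of src according to perm."""
    return tuple(src[i] for i in perm)

def extend(target, source):
    """Replay on each shared structure the mappings recorded in `source`,
    adding the resulting structures to a copy of `target`."""
    extended = set(target)
    for shared in target & source:
        for other in source - target:
            mapping = permutation(shared, other)  # recorded in the source cluster
            extended.add(apply_permutation(shared, mapping))
    return extended

# Hypothetical tag orders standing in for structures a to f.
a = ("NS", "T1", "V", "NO")
b = ("T1", "NS", "V", "NO")
c = ("NS", "V", "NO", "T1")
d = ("T1", "V", "NS", "NO")
e = ("NO", "NS", "T1", "V")
f = ("NO", "T1", "NS", "V")

K, L = {a, b, c, d, e}, {d, e, f}
new_K, new_L = extend(K, L), extend(L, K)
print(len(new_K), len(new_L))
```

In this sketch both extended clusters end up equal to the union of K and L, which is the end state the description gives for pairwise comparison of all clusters.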
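S250: Nest, into the newly generated sentence structures, the words corresponding to the sequence tags of the sentence structures in the sentence cluster, generating a new monolingual parallel corpus.
The step of obtaining the corpus includes:
S251: Identify the tags in all sentence structures of each sentence cluster in all monolingual parallel corpora.
The tags in all sentence structures of each sentence cluster are identified, and the sentence structures are nested according to the tagging standard.
S252: Nest the words corresponding to the tags in all sentence structures of each sentence cluster into the sentence structures, generating a new monolingual parallel corpus.
The words corresponding to the tags in all sentence structures of each sentence cluster are nested into the sentence structures to generate new sentence clusters. The newly generated sentence clusters are combined to form a new monolingual parallel corpus. In this embodiment, if a generated sentence structure contains the tag NS, a word is nested according to the meaning of the sentences in that cluster, for example “小红” (Xiaohong) or “考古学家” (archaeologist).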
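The nesting in S252 amounts to filling each tag slot of a sentence structure with the word the cluster associates with that tag. The sketch below assumes a per-cluster table `words` from tag to word; both that table and `fill_structure` are hypothetical. Untagged auxiliary words are left out, so the generated strings are skeletons rather than polished sentences.

```python
def fill_structure(structure, words):
    """Replace every tag in a sentence structure with its word for this cluster."""
    return "".join(words[tag] for tag in structure)

# Hypothetical tag-to-word table for the "lecture" cluster.
words = {"NS": "小红", "T1": "明天", "V": "参加", "NO": "讲座"}

structures = [
    ("NS", "T1", "V", "NO"),
    ("T1", "NS", "V", "NO"),
]
generated = [fill_structure(s, words) for s in structures]
print(generated)
```

Swapping the table for another cluster (for instance one whose NS entry is “考古学家”) reuses the same structures to produce sentences with a different meaning.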
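Unlike the prior art, the corpus generation method of the present invention tags the sentences in an existing corpus, maps sentence structures with different tag sequences onto one another according to their tags to obtain more sentence structures, and fills in the words corresponding to the nested tags to obtain more corpus data. The invention thus obtains corpus data by nesting words into the expanded set of sentence structures. It is simple to operate, saves resources, and considerably expands the corpus.
The above are merely embodiments of the present invention and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.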

Claims (10)

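  1. A knowledge-driven corpus generation device, characterized by comprising:
    a word segmentation module, connected to at least one monolingual parallel corpus, configured to perform word segmentation on the sentences in each said parallel corpus and to apply knowledge-driven processing to the segmented words to tag them;
    a classification module, configured to identify the sentences after knowledge-driven processing and to classify sentences that have the same meaning but different tag sequences into the same sentence cluster;
    a mapping module, configured to analyze the sentences in each said sentence cluster of each said monolingual parallel corpus, determine the sentence structure categories of all sentences in said sentence cluster, and determine and store, for transformations between different said sentence structure categories within the same said sentence cluster, the mapping mode of the tag transformation between the corresponding said sentence structures;
    a sentence structure generation module, configured to search every said sentence cluster in all monolingual parallel corpora for an identical first-category sentence structure and, according to the first-type mapping mode between said first-category sentence structure and the other categories of sentence structure within one of said sentence clusters, to map said first-category sentence structure in each of the remaining said sentence clusters according to said mapping mode, generating the corresponding sentence structure categories; and
    a corpus generation module, configured to nest, into the newly generated said sentence structures, the words corresponding to the sequence tags of the sentence structures in said sentence cluster, generating a new monolingual parallel corpus.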
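  2. The corpus generation device according to claim 1, characterized in that the word segmentation module comprises:
    a word segmentation unit, configured to perform word segmentation on said sentences in all monolingual parallel corpora;
    a first tag unit, configured to add a first tag to the segmented said sentences according to the part of speech of each word;
    a second tag unit, configured to add a second tag to the segmented said sentences according to each word's grammatical role in the sentence.
  3. The corpus generation device according to claim 2, characterized in that the word segmentation module further comprises a third tag unit;
    the third tag unit is configured to add, for sentences that have identical tag sequences after tagging but different meanings, a third tag to said sentences according to word meaning.
  4. The corpus generation device according to claim 1, characterized in that the corpus generation module comprises:
    a tag identification unit, configured to identify the tags in all sentence structures of each said sentence cluster in all monolingual parallel corpora;
    a corpus generation unit, configured to nest the words corresponding to the tags in all sentence structures of each said sentence cluster into said sentence structures, generating a new monolingual parallel corpus.
  5. The corpus generation device according to claim 4, characterized in that the corpus generation unit nests the newly generated said sentence structures according to said knowledge-driven standard of the word segmentation module.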
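  6. A corpus generation method, characterized by comprising:
    performing word segmentation on each sentence in at least one monolingual parallel corpus, and applying knowledge-driven processing to the segmented words to tag them;
    identifying the sentences after knowledge-driven processing, and classifying sentences that have the same meaning but different tag sequences into the same sentence cluster;
    analyzing the sentences in each said sentence cluster of each said monolingual parallel corpus, determining the sentence structure categories of all sentences in said sentence cluster, and determining and storing, for transformations between different said sentence structure categories within the same said sentence cluster, the mapping mode of the tag transformation between the corresponding said sentence structures;
    searching every said sentence cluster in all monolingual parallel corpora for an identical first-category sentence structure and, according to the first-type mapping mode between said first-category sentence structure and the other categories of sentence structure within one of said sentence clusters, mapping said first-category sentence structure in each of the remaining said sentence clusters according to said mapping mode, generating the corresponding sentence structure categories;
    nesting, into the newly generated said sentence structures, the words corresponding to the sequence tags of the sentence structures in said sentence cluster, generating a new monolingual parallel corpus.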
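  7. The corpus generation method according to claim 6, characterized in that the step of performing word segmentation on the sentences in the parallel corpus comprises the steps of:
    performing word segmentation on the sentences in all monolingual parallel corpora;
    adding a first tag to the segmented said sentences according to the part of speech of each word;
    adding a second tag to the segmented said sentences according to each word's grammatical role in the sentence.
  8. The corpus generation method according to claim 7, characterized in that the step of performing word segmentation on the sentences in the parallel corpus further comprises the step of:
    adding, for sentences that have identical tag sequences after tagging but different meanings, a third tag to said sentences according to word meaning.
  9. The corpus generation method according to claim 6, characterized in that the step of nesting, into said sentence structures generated in all said sentence clusters, the words corresponding to the sequence tags of the sentence structures in said sentence cluster comprises the steps of:
    identifying the tags in all sentence structures of each said sentence cluster in all monolingual parallel corpora;
    nesting the words corresponding to the tags in all sentence structures of each said sentence cluster into said sentence structures, generating the corpus.
  10. The corpus generation method according to claim 9, characterized in that, in the step of nesting the words corresponding to said tags into said sentence structures, the newly generated said sentence structures are nested according to said knowledge-driven standard.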
PCT/CN2016/087757 2016-06-29 2016-06-29 Corpus generation device and method WO2018000272A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2016/087757 WO2018000272A1 (zh) 2016-06-29 2016-06-29 Corpus generation device and method
CN201680001747.6A CN107004000A (zh) 2016-06-29 2016-06-29 Corpus generation device and method
US15/694,918 US10268678B2 (en) 2016-06-29 2017-09-04 Corpus generation device and method, human-machine interaction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/087757 WO2018000272A1 (zh) 2016-06-29 2016-06-29 Corpus generation device and method

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/694,918 Continuation-In-Part US10268678B2 (en) 2016-06-29 2017-09-04 Corpus generation device and method, human-machine interaction system

Publications (1)

Publication Number Publication Date
WO2018000272A1 true WO2018000272A1 (zh) 2018-01-04

Family

ID=59431023

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/087757 WO2018000272A1 (zh) 2016-06-29 2016-06-29 Corpus generation device and method

Country Status (3)

Country Link
US (1) US10268678B2 (zh)
CN (1) CN107004000A (zh)
WO (1) WO2018000272A1 (zh)

Also Published As

Publication number Publication date
US10268678B2 (en) 2019-04-23
US20180004730A1 (en) 2018-01-04
CN107004000A (zh) 2017-08-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16906673

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16906673

Country of ref document: EP

Kind code of ref document: A1