WO2017097166A1 - Method and apparatus for recognizing domain named entities - Google Patents

Method and apparatus for recognizing domain named entities

Info

Publication number
WO2017097166A1
Authority
WO
WIPO (PCT)
Prior art keywords
domain
label
text
word segmentation
entity
Prior art date
Application number
PCT/CN2016/108426
Other languages
English (en)
French (fr)
Inventor
徐文斌
何鑫
Original Assignee
北京国双科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京国双科技有限公司
Priority to US16/060,952 (US10650192B2)
Publication of WO2017097166A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Definitions

  • The present invention relates to the field of natural language processing, and in particular to a method and apparatus for recognizing domain named entities.
  • Named entity recognition (NER), also known as "proper name recognition", refers to identifying entities that have a specific meaning in a text.
  • These entities mainly include person names, place names, organization names, proper nouns, and the like.
  • Named entity recognition is an important basic tool for information extraction, question answering systems, syntactic analysis, machine translation, and metadata annotation for the Semantic Web. It plays an important role in making natural language processing technology practical.
  • Named entity recognition is currently implemented, in general, as follows: construct a set of named entities, or specify entity extraction rules; segment each sentence into words, and construct a dictionary tree or a rule tree; traverse the segmentation result and match it against the dictionary or the rules; if some content matches, mark the position of the matching content; if nothing matches, move on to the next sentence; continue until all sentences have been traversed, then output the final tagging result.
  • The inventors found that the current technical solution has at least the following problems. In domain-specific named entity recognition for Chinese, words are not separated by spaces as they are in English, so a wrong segmentation can lead to inaccurate named entity boundaries and therefore inaccurate recognition. Moreover, the accuracy of current named entity recognition depends entirely on the completeness of the dictionary or rules, so it does not cope well with a changing range of entities.
  • The present invention provides a method and apparatus for recognizing domain named entities.
  • The main purpose of the invention is to locate named entity boundaries precisely by means of tagging, thereby reducing the influence of the word segmentation result on domain named entity recognition and improving recognition accuracy.
  • The present invention provides the following technical solutions:
  • The present invention provides a method for recognizing domain named entities, comprising:
  • the tag set includes a base tag set for domain named entity recognition and a tag set specific to the corresponding domain, wherein the base tag set contains position tags for words related to the composition of a domain named entity;
  • the extracted segmented words are combined into a domain named entity.
  • The present invention also provides an apparatus for recognizing domain named entities, including:
  • a tagging unit, configured to tag each segmented word in the text to be recognized according to the tag set of the domain to which the text belongs, where the tag set includes a base tag set for domain named entity recognition and a tag set specific to the corresponding domain, and the base tag set contains position tags for words related to the composition of a domain named entity;
  • an extraction unit, configured to extract the tagged words according to a domain named entity extraction rule;
  • a word-combining unit, configured to combine the extracted words into a domain named entity.
  • With the method and apparatus provided by the present invention, when domain named entities in a text need to be recognized, each segmented word in the text is first tagged according to the preset tag set of the domain to which the text belongs, that is, according to the position tags for words related to the composition of a domain named entity. The tagged words are then extracted according to the domain named entity extraction rule, and the extracted words are combined into a domain named entity. Compared with the prior art, which relies on a dictionary or rules to recognize domain named entities, the entity boundary is no longer determined by matching segmented words against a dictionary but by the tags. Named entity boundaries can therefore be located precisely, which reduces the influence of the word segmentation result on domain named entity recognition and improves recognition accuracy.
  • FIG. 1 is a flowchart of a method for identifying a domain named entity in an embodiment of the present invention
  • FIG. 2 is a schematic diagram showing a hidden Markov model in an embodiment of the present invention.
  • FIG. 3 is a block diagram showing the structure of a device for identifying a domain named entity in an embodiment of the present invention
  • FIG. 4 is a block diagram showing the composition of another device for identifying a domain named entity in the embodiment of the present invention.
  • FIG. 5 is a block diagram showing the structure of another device for identifying a domain named entity in the embodiment of the present invention.
  • An embodiment of the present invention provides a method for recognizing domain named entities. As shown in FIG. 1, the method includes:
  • Word segmentation of the text to be recognized may be implemented by any currently available method; the embodiment places no limitation on this.
  • The tag set includes a base tag set for domain named entity recognition and a tag set specific to the corresponding domain, wherein the base tag set contains position tags for words related to the composition of a domain named entity.
  • The tag set of each domain includes the base tag set for domain named entity recognition and the tag set specific to that domain.
  • The domain-specific tag set is the set of tags particular to a given domain. For example, a group tag set can be added for the automotive domain, and a surname tag set can be added for person name recognition.
  • The base tag set contains position tags for words related to the composition of a domain named entity. The position tags may be, but are not limited to, the following: TS, the word is at the head of the entity; TM, the word is in the middle of the entity; TE, the word is at the tail of the entity; TSN/TEN, the word is before/after the entity; TN, the word is unrelated to entities of the domain; TT, two entities are in a parallel (coordinate) relationship; TSX/TEX/TXS/TEX, word segmentation error; and the like.
  • The specific meaning of each tag is described in Table 1.
  • The tag set of each domain must first be obtained. This may be done by, but is not limited to, the following method: obtain the base tag set for domain named entity recognition and the domain-specific tag set of each domain; then take the union of each domain-specific tag set and the base tag set as the tag set of that domain.
  • Tagging may be performed with, but is not limited to, a trained model.
  • The trained model may be, but is not limited to, a hidden Markov model; a conditional random field model or a neural network model may also be used to tag the text to be recognized. Because the hidden Markov model takes the context of each word into account, it overcomes the limitation of existing entity recognition methods to the size of the dictionary. The embodiment therefore preferably uses a hidden Markov model to tag the text to be recognized.
  • The hidden Markov model is briefly described below with reference to FIG. 2.
  • The words of the example sentence "Shanghai Volkswagen automobile recall case" correspond to the K nodes; they are the words of the sentence to be tagged and form the observation layer of the model. The corresponding tag layer consists of the S nodes; each node is assigned a tag in this step, and these nodes form the hidden layer of the model.
  • A is the state transition matrix, which records the probability of the next state given the previous state.
  • B is the observation probability matrix, which gives the probability of a value (word) observed in the observation layer given a state (tag) in the hidden layer.
  • The domain named entity extraction rule differs from task to task and can be set according to the entities required.
  • For example, the domain named entity extraction rule is "*/TS+*/TE".
  • The extracted segmented words are combined into a domain named entity.
  • The entity boundary is no longer determined by matching segmented words against a dictionary but by the tags. Named entity boundaries can therefore be located precisely, which reduces the influence of the word segmentation result on domain named entity recognition and improves recognition accuracy.
  • The embodiment further provides a word segmentation error correction mechanism, which corrects wrongly segmented words when a segmentation error is found.
  • This may be implemented by, but is not limited to, the following method:
  • The sentence containing the segmentation-error tag is subjected to segmentation error correction to obtain a new segmentation.
  • This correction may be implemented by, but is not limited to, the following enumeration method:
  • The sentence containing the segmentation-error tag is split into individual characters, and the characters are recombined to obtain new segmented words.
  • Segmentation error correction for the sentence containing the segmentation-error tag is not limited to the above; other methods, such as direct correction or dictionary-based correction, may also be used. For these, reference may be made to the prior art; they are not described here.
  • The enumeration method is described as an example. If the word ABCD/TSX is detected, it is first split into the characters A, B, C, D, and the segmentations that these four characters can form are then enumerated.
  • After the segmented words are tagged, a check is made for the specific tags that indicate a segmentation error. If such a tag is detected, error correction is performed in this step: the sentence containing the tag is re-segmented, and the re-segmented text is fed back into the tagging model as input, until no error tag appears in the tagging result.
  • This correction scheme keeps segmentation errors from affecting the result of domain named entity recognition and further ensures the accuracy of the recognized domain named entities.
  • The embodiment of the present invention further provides an apparatus for recognizing domain named entities.
  • The apparatus includes:
  • The word segmentation unit 21 is configured to segment the text to be recognized into words.
  • The tagging unit 22 is configured to tag each segmented word in the text to be recognized according to the tag set of the domain to which the text belongs, where the tag set includes a base tag set for domain named entity recognition and a tag set specific to the corresponding domain, and the base tag set contains position tags for words related to the composition of a domain named entity.
  • The tag set of each domain includes the base tag set for domain named entity recognition and the tag set specific to that domain.
  • The domain-specific tag set is the set of tags particular to a given domain. For example, a group tag set can be added for the automotive domain, and a surname tag set can be added for person name recognition.
  • The base tag set contains position tags for words related to the composition of a domain named entity. The position tags may be, but are not limited to, the following: TS, the word is at the head of the entity; TM, the word is in the middle of the entity; TE, the word is at the tail of the entity; TSN/TEN, the word is before/after the entity; TN, the word is unrelated to entities of the domain; TT, two entities are in a parallel (coordinate) relationship; TSX/TEX/TXS/TEX, word segmentation error; and the like.
  • The extraction unit 23 is configured to extract the tagged words according to the domain named entity extraction rule.
  • The domain named entity extraction rule differs from task to task and can be set according to the entities required.
  • For example, the domain named entity extraction rule is "*/TS+*/TE".
  • The word-combining unit 24 is configured to combine the extracted words into a domain named entity.
  • The apparatus further includes:
  • A detection unit 25, configured to detect, after the tagging unit 22 has tagged each segmented word in the text to be recognized according to the tag set of the domain to which the text belongs, whether the tagged text contains a segmentation-error tag.
  • The word segmentation unit 21 is further configured to perform segmentation error correction on the sentence containing the segmentation-error tag, to obtain a new segmentation, when the detection unit 25 detects such a tag.
  • Specifically, the word segmentation unit 21 splits the sentence containing the segmentation-error tag into individual characters and recombines them to obtain new segmented words.
  • For further details, reference may be made to the corresponding description of the method embodiment; they are not repeated here.
  • The tagging unit 22 is further configured to tag each word of the new segmentation according to the tag set, until no segmentation-error tag appears in the tagged text.
  • The apparatus further includes:
  • The acquisition unit 26 is configured to obtain the base tag set for domain named entity recognition and the domain-specific tag set of each domain before the tagging unit 22 tags each segmented word in the text to be recognized according to the tag set of the domain to which the text belongs.
  • The acquisition unit 26 is further configured to take the union of each domain-specific tag set and the base tag set for domain named entity recognition as the tag set of that domain.
  • With the method and apparatus provided by the present invention, when domain named entities in a text need to be recognized, each segmented word in the text is first tagged according to the preset tag set of the domain to which the text belongs, that is, according to the position tags for words related to the composition of a domain named entity. The tagged words are then extracted according to the domain named entity extraction rule, and the extracted words are combined into a domain named entity. Compared with the prior art, which relies on a dictionary or rules to recognize domain named entities, the entity boundary is no longer determined by matching segmented words against a dictionary but by the tags. Named entity boundaries can therefore be located precisely, which reduces the influence of the word segmentation result on domain named entity recognition and improves recognition accuracy.
  • After the segmented words are tagged, a check is made for the specific tags that indicate a segmentation error. If such a tag is detected, error correction is performed in this step: the sentence containing the tag is re-segmented, and the re-segmented text is fed back into the tagging model as input, until no error tag appears in the tagging result.
  • This step keeps segmentation errors from affecting the result of domain named entity recognition and further ensures the accuracy of the recognized domain named entities.
  • The apparatus for recognizing domain named entities includes a processor and a memory. The word segmentation unit, tagging unit, extraction unit, word-combining unit, detection unit, and acquisition unit are all stored in the memory as program units, and the processor executes these program units to implement the corresponding functions.
  • The processor contains a kernel, which retrieves the corresponding program units from the memory. One or more kernels may be provided. By adjusting kernel parameters, named entity boundaries are located precisely by means of tagging, which reduces the influence of the word segmentation result on domain named entity recognition and improves recognition accuracy.
  • The memory may include non-persistent memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
  • The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: segmenting the text to be recognized into words; tagging each segmented word in the text according to the tag set of the domain to which the text belongs, the tag set including a base tag set for domain named entity recognition and a tag set specific to the corresponding domain, wherein the base tag set contains position tags for words related to the composition of a domain named entity; extracting the tagged words according to the domain named entity extraction rule; and combining the extracted words into a domain named entity.
  • Embodiments of the present application can be provided as a method, a system, or a computer program product.
  • The present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware.
  • The application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • The computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction device. The instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • A computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • The memory may include non-persistent memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory.
  • Memory is an example of a computer-readable medium.
  • Computer-readable media include both permanent and non-permanent, removable and non-removable media.
  • Information storage can be implemented by any method or technology.
  • The information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • Computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)

Abstract

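A method and apparatus for recognizing domain named entities, relating to the field of natural language processing. The main purpose is to reduce the influence of the word segmentation result on domain named entity recognition and to improve recognition accuracy. The method comprises: segmenting the text to be recognized into words (101); tagging each segmented word in the text according to the tag set of the domain to which the text belongs (102), the tag set comprising a base tag set for domain named entity recognition and a tag set specific to the corresponding domain, wherein the base tag set contains position tags for words related to the composition of a domain named entity; extracting the tagged words according to a domain named entity extraction rule (103); and combining the extracted words into a domain named entity (104).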

Description

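Method and apparatus for recognizing domain named entities
This application is based on, and claims priority to, Chinese patent application No. 201510921228.9, filed on December 11, 2015, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of natural language processing, and in particular to a method and apparatus for recognizing domain named entities.
Background
Named entity recognition (NER), also called "proper name recognition", refers to identifying entities with a specific meaning in text, mainly person names, place names, organization names, and other proper nouns. NER is an important basic tool for applications such as information extraction, question answering systems, syntactic analysis, machine translation, and metadata annotation for the Semantic Web, and it plays an important role in making natural language processing technology practical.
At present, named entity recognition is generally implemented as follows: build a set of named entities, or specify entity extraction rules; segment each sentence into words, and build a dictionary tree (trie) or a rule tree; traverse the segmentation result and match it against the dictionary or the rules; if some content matches, mark the position of the matching content; if nothing matches, move on to the next sentence; continue until all sentences have been traversed, then output the final tagging result.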
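The dictionary-driven baseline described above can be sketched as follows. This is a minimal illustration, not the prior-art implementation itself: the function name `dictionary_match`, the `max_len` parameter, and the use of a plain Python set instead of a dictionary tree are all assumptions made for brevity.

```python
def dictionary_match(tokens, dictionary, max_len=4):
    """Return (start, end, text) spans whose joined tokens appear in the dictionary."""
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first (forward maximum matching over tokens).
        for j in range(min(len(tokens), i + max_len), i, -1):
            candidate = "".join(tokens[i:j])
            if candidate in dictionary:
                spans.append((i, j, candidate))
                i = j
                matched = True
                break
        if not matched:
            i += 1
    return spans


dictionary = {"上海大众"}  # hypothetical one-entry entity dictionary

# A segmentation whose word boundaries line up with the entity is matched.
print(dictionary_match(["上海", "大众", "汽车", "召回案"], dictionary))

# A segmentation that merges "大众" and "汽车" hides the entity boundary.
print(dictionary_match(["上海", "大众汽车", "召回案"], dictionary))
```

The second call finds nothing, which illustrates why this approach depends on both the segmentation and the completeness of the dictionary.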
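In carrying out the above approach, the inventors found at least the following problems in current technical solutions. In domain-specific named entity recognition for Chinese, words are not separated by spaces as they are in English, so a wrong segmentation can lead to inaccurate named entity boundaries and therefore inaccurate recognition. In addition, the accuracy of current named entity recognition depends entirely on the completeness of the dictionary or rules, so it does not cope well with a changing range of entities.
Summary of the Invention
In view of this, the present invention provides a method and apparatus for recognizing domain named entities. The main purpose is to locate named entity boundaries precisely by means of tagging, thereby reducing the influence of the word segmentation result on domain named entity recognition and improving recognition accuracy.
To achieve the above purpose, the present invention provides the following technical solutions:
In one aspect, the present invention provides a method for recognizing domain named entities, comprising:
segmenting the text to be recognized into words;
tagging each segmented word in the text to be recognized according to the tag set of the domain to which the text belongs, the tag set comprising a base tag set for domain named entity recognition and a tag set specific to the corresponding domain, wherein the base tag set contains position tags for words related to the composition of a domain named entity;
extracting the tagged words according to a domain named entity extraction rule;
combining the extracted words into a domain named entity.
In another aspect, the present invention further provides an apparatus for recognizing domain named entities, comprising:
a word segmentation unit, configured to segment the text to be recognized into words;
a tagging unit, configured to tag each segmented word in the text to be recognized according to the tag set of the domain to which the text belongs, the tag set comprising a base tag set for domain named entity recognition and a tag set specific to the corresponding domain, wherein the base tag set contains position tags for words related to the composition of a domain named entity;
an extraction unit, configured to extract the tagged words according to a domain named entity extraction rule;
a word-combining unit, configured to combine the extracted words into a domain named entity.
With the method and apparatus provided by the present invention, when domain named entities in a text need to be recognized, each segmented word in the text is first tagged according to the preset tag set of the domain to which the text belongs, that is, according to the position tags for words related to the composition of a domain named entity. The tagged words are then extracted according to the domain named entity extraction rule, and the extracted words are combined into a domain named entity. Compared with the prior art, which relies on a dictionary or rules to recognize domain named entities, the entity boundary is no longer determined by matching segmented words against a dictionary but by the tags. Named entity boundaries can therefore be located precisely, which reduces the influence of the word segmentation result on domain named entity recognition and improves recognition accuracy.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other purposes, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.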
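Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 is a flowchart of a method for recognizing domain named entities in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a hidden Markov model in an embodiment of the present invention;
FIG. 3 is a block diagram of an apparatus for recognizing domain named entities in an embodiment of the present invention;
FIG. 4 is a block diagram of another apparatus for recognizing domain named entities in an embodiment of the present invention;
FIG. 5 is a block diagram of yet another apparatus for recognizing domain named entities in an embodiment of the present invention.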
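Detailed Description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
An embodiment of the present invention provides a method for recognizing domain named entities. As shown in FIG. 1, the method includes:
101. Segment the text to be recognized into words.
Word segmentation of the text to be recognized may be implemented in any currently available manner; the embodiment places no limitation on this.
102. Tag each segmented word in the text to be recognized according to the tag set of the domain to which the text belongs. The tag set comprises a base tag set for domain named entity recognition and a tag set specific to the corresponding domain, wherein the base tag set contains position tags for words related to the composition of a domain named entity.
It should be noted that named entities in different domains have different internal features, and no single unified model can characterize the internal features of all named entities. Therefore, when recognizing named entities in different domains, the embodiment uses a different tag set for each domain. As described above, the tag set of each domain comprises the base tag set for domain named entity recognition and the tag set specific to that domain. The domain-specific tag set is the set of tags particular to a given domain. For example, a group tag set can be added for the automotive domain, and a surname tag set can be added for person name recognition.
The base tag set contains position tags for words related to the composition of a domain named entity. The position tags may be, but are not limited to, the following: TS, the word is at the head of the entity; TM, the word is in the middle of the entity; TE, the word is at the tail of the entity; TSN/TEN, the word is before/after the entity; TN, the word is unrelated to entities of the domain; TT, two entities are in a parallel (coordinate) relationship; TSX/TEX/TXS/TEX, word segmentation error; and so on. The specific meaning of each tag is described in Table 1 below.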
Table 1 (provided as image PCTCN2016108426-appb-000001 in the original publication; not reproduced here)
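Further, before the embodiment is carried out, the tag set of each domain must first be obtained. This may be done by, but is not limited to, the following method: obtain the base tag set for domain named entity recognition and the domain-specific tag set of each domain; then take the union of each domain-specific tag set and the base tag set as the tag set of that domain.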
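Building the per-domain tag sets amounts to a set union, which can be sketched as follows. The base tags are the ones named in the text; the domain-specific tag names `GROUP` and `SURNAME`, the domain keys, and the function name `build_tag_sets` are hypothetical, since the source only says that group and surname tag sets "can be added".

```python
# Position tags listed in the description (Table 1 itself is not reproduced here).
BASE_TAGS = frozenset({"TS", "TM", "TE", "TSN", "TEN", "TN", "TT", "TSX", "TEX", "TXS"})

# Hypothetical domain-specific tags, for illustration only.
DOMAIN_TAGS = {
    "automotive": frozenset({"GROUP"}),
    "person_name": frozenset({"SURNAME"}),
}


def build_tag_sets(base, domain_specific):
    """Tag set of each domain = base tag set united with that domain's own tags."""
    return {domain: base | tags for domain, tags in domain_specific.items()}


TAG_SETS = build_tag_sets(BASE_TAGS, DOMAIN_TAGS)
print(sorted(TAG_SETS["automotive"]))
```

Keeping the base set separate means a new domain only has to declare its own extra tags.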
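Further, when tagging each segmented word in the text to be recognized according to the tag set of the domain to which the text belongs, the embodiment may use, but is not limited to, a trained model. The model may be, but is not limited to, a hidden Markov model; a conditional random field model or a neural network model may also be used to tag the text. Because the hidden Markov model takes the context of each word into account, it overcomes the limitation of existing entity recognition methods to the size of the dictionary, so the embodiment preferably uses a hidden Markov model for tagging. For example, when the segmented sentence "上海 大众 汽车 召回案" ("Shanghai / Volkswagen / automobile / recall case") is passed to the trained tagging model, the model outputs "上海/TS 大众/TE 汽车/TEN 召回案/TN".
The hidden Markov model is briefly described below with reference to FIG. 2. In the example above, the four words of "上海大众汽车召回案" correspond to the K nodes; they are the words of the sentence to be tagged and form the observation layer of the model. The corresponding tag layer consists of the S nodes; each node is assigned a tag in this step, and these nodes form the hidden layer of the model. A is the state transition matrix, which records the probability of the next state given the previous state. B is the observation probability matrix, which gives the probability of a value (word) observed in the observation layer given a state (tag) in the hidden layer.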
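To make the roles of A and B concrete, the following sketch decodes the most probable tag sequence with the standard Viterbi algorithm. The source does not say which decoding procedure is used, so Viterbi is an assumption here, and every probability below is an invented toy value rather than a trained parameter. The input is assumed to be non-empty and to contain only words present in the toy emission table.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Most probable hidden tag sequence for the observed words."""
    # best[t][s] = (probability of the best path ending in state s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s][words[0]], None) for s in states}]
    for word in words[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            column[s] = max(
                (prev[r][0] * trans_p[r][s] * emit_p[s][word], r) for r in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


STATES = ["TS", "TE", "TEN", "TN"]
WORDS = ["上海", "大众", "汽车", "召回案"]

# Toy parameters (assumptions, not from the patent).
START = {"TS": 0.7, "TE": 0.1, "TEN": 0.1, "TN": 0.1}
A = {  # state transition matrix: A[previous_tag][next_tag]
    "TS": {"TS": 0.1, "TE": 0.7, "TEN": 0.1, "TN": 0.1},
    "TE": {"TS": 0.1, "TE": 0.1, "TEN": 0.6, "TN": 0.2},
    "TEN": {"TS": 0.1, "TE": 0.1, "TEN": 0.1, "TN": 0.7},
    "TN": {"TS": 0.2, "TE": 0.1, "TEN": 0.1, "TN": 0.6},
}
B = {  # observation probability matrix: B[tag][word]
    tag: {w: (0.7 if i == j else 0.1) for j, w in enumerate(WORDS)}
    for i, tag in enumerate(STATES)
}

tags = viterbi(WORDS, STATES, START, A, B)
print(" ".join(f"{w}/{t}" for w, t in zip(WORDS, tags)))
```

With these toy values the decoder reproduces the tagging from the example, "上海/TS 大众/TE 汽车/TEN 召回案/TN". A real system would estimate the start, transition, and observation probabilities from a tagged corpus and would need smoothing for unseen words.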
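103. Extract the tagged words according to the domain named entity extraction rule.
In the embodiment, given the tag set of a particular domain, the domain named entity extraction rule differs from task to task and can be set according to the entities required. For example, for the automotive domain the extraction rule is "*/TS+*/TE". When this step is executed on the tagging result of step 102, "上海/TS 大众/TE 汽车/TEN 召回案/TN", the fragment "上海/TS 大众/TE" is found to satisfy the rule "*/TS+*/TE", so the two words "上海" and "大众" are extracted.
104. Combine the extracted words into a domain named entity.
The two extracted words are combined into the domain named entity "上海大众" (Shanghai Volkswagen).
After the extracted words have been combined into a domain named entity, if output is required, the entity can be marked with a tag such as "entity", and the final output is "上海大众/entity汽车召回案".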
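Steps 103 and 104 can be sketched as follows. The sketch reads the rule "*/TS+*/TE" as "any word tagged TS immediately followed by any word tagged TE", which is an interpretation rather than something the source spells out. The helper names and the whitespace-separated `word/TAG` input format are likewise assumptions.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in text.split()]


def extract_entities(tagged, rule=("TS", "TE")):
    """Return (start, end) index spans whose consecutive tags equal the rule."""
    n = len(rule)
    tags = [tag for _, tag in tagged]
    return [
        (i, i + n)
        for i in range(len(tagged) - n + 1)
        if tuple(tags[i:i + n]) == tuple(rule)
    ]


def render(tagged, spans):
    """Join each extracted span into one entity and mark it with '/entity'."""
    ends = dict(spans)
    out, i = [], 0
    while i < len(tagged):
        if i in ends:
            out.append("".join(word for word, _ in tagged[i:ends[i]]) + "/entity")
            i = ends[i]
        else:
            out.append(tagged[i][0])
            i += 1
    return "".join(out)


tagged = parse_tagged("上海/TS 大众/TE 汽车/TEN 召回案/TN")
spans = extract_entities(tagged)
print(render(tagged, spans))
```

For the example sentence this prints "上海大众/entity汽车召回案". A rule with optional middle words (TM) would need a small pattern matcher instead of the fixed-length comparison used here.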
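In the embodiment, when domain named entities in a text need to be recognized, each segmented word in the text is first tagged according to the preset tag set of the domain to which the text belongs, that is, according to the position tags for words related to the composition of a domain named entity. The tagged words are then extracted according to the domain named entity extraction rule, and the extracted words are combined into a domain named entity. Compared with the prior art, which relies on a dictionary or rules to recognize domain named entities, the entity boundary is no longer determined by matching segmented words against a dictionary but by the tags. Named entity boundaries can therefore be located precisely, which reduces the influence of the word segmentation result on domain named entity recognition and improves recognition accuracy.
Further, to ensure the accuracy of word segmentation, the embodiment also provides a segmentation error correction mechanism that corrects wrongly segmented words when a segmentation error is found. This may be implemented by, but is not limited to, the following method:
1. Detect whether the tagged text contains a segmentation-error tag.
Specifically, the embodiment may detect whether any word in the tagged text carries a TSX/TEX/TXS/TEX tag; if so, a segmentation-error tag is determined to exist and step 2 is executed.
2. If a segmentation-error tag exists, perform segmentation error correction on the sentence containing the tag to obtain a new segmentation.
This correction may be implemented by, but is not limited to, the following enumeration method:
split the sentence containing the segmentation-error tag into individual characters; recombine the characters to obtain new segmented words.
Of course, segmentation error correction for the sentence containing the segmentation-error tag is not limited to the above; other approaches, such as direct correction or dictionary-based correction, may also be used. For these, reference may be made to the prior art; they are not described further here.
3. Tag each word of the new segmentation according to the tag set, and return to step 1, until no segmentation-error tag appears in the tagged text.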
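To explain the segmentation error correction more clearly, the enumeration method is described as an example. Suppose the word ABCD/TSX is detected. The word is first split into the characters A, B, C, D, and the segmentations that these four characters can form are then enumerated as follows:
1. A, B, C, D
2. AB, C, D
3. A, BC, D
4. A, B, CD
5. AB, CD
6. ABC, D
7. A, BCD
Each of these segmentations in turn replaces the word ABCD in the original segmentation, and the resulting sentence is tagged again. If the new tagging contains no TSX, TEX, TXE, TXS, or similar tag, the tagging result of the sentence is output and the re-segmentation procedure ends.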
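The seven candidates listed above are exactly the ways of cutting a four-character word at one or more of its three internal positions. A minimal sketch of that enumeration follows; the function name `resegmentations` is hypothetical.

```python
from itertools import product


def resegmentations(word):
    """All ways to cut `word` into contiguous pieces, excluding the uncut word."""
    results = []
    # One boolean per gap between adjacent characters: cut there or not.
    for cuts in product([False, True], repeat=len(word) - 1):
        if not any(cuts):
            continue  # skip the original, uncut word
        pieces, start = [], 0
        for pos, cut in enumerate(cuts, start=1):
            if cut:
                pieces.append(word[start:pos])
                start = pos
        pieces.append(word[start:])
        results.append(pieces)
    return results


for pieces in resegmentations("ABCD"):
    print(", ".join(pieces))
```

A word of n characters yields 2^(n-1) - 1 candidates, so the enumeration stays cheap for the short words a segmenter typically produces.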
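The above correction method can be illustrated with an example from person name recognition. For the sentence "邓颖超生前和刘晓辉同学合影" (roughly, "Deng Yingchao, during her lifetime, had a photo taken with her classmate Liu Xiaohui"), the segmenter outputs "邓颖 超生 前 和 刘晓辉 同学 合影", and the tagging model then yields "邓颖/TSE 超生/TSX 前/TN 和/TT 刘晓辉/TSE 同学/TEN 合影/TN". Here "超生/TSX" indicates that this word is a segmentation error and must be re-segmented. The new segmentation is "邓颖 超 生 前 和 刘晓辉 同学 合影", which is re-tagged as "邓颖/TS 超/TE 生/TEN 前/TN 和/TT 刘晓辉/TSE 同学/TEN 合影/TN". No error tag appears any more, so the correction step terminates.
In the embodiment, after the segmented words are tagged, a check is made for the specific tags that indicate a segmentation error. If such a tag is detected, error correction is performed in this step: the sentence containing the tag is re-segmented, and the re-segmented text is fed back into the tagging model as input, until no error tag appears in the tagging result. This correction scheme keeps segmentation errors from affecting the result of domain named entity recognition and further ensures the accuracy of the recognized domain named entities.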
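The detect, re-segment, re-tag loop can be sketched as below. This is one possible arrangement under stated assumptions: `tag_fn` stands in for the trained tagging model, `split_fn` supplies candidate re-segmentations of a flagged word, and `toy_tagger` is a hypothetical lookup table that simply replays the two taggings from the example above.

```python
ERROR_TAGS = {"TSX", "TEX", "TXS", "TXE"}


def correct(tokens, tag_fn, split_fn):
    """Re-segment flagged words until tag_fn reports no segmentation-error tag.

    Returns a list of (word, tag) pairs, or None if no candidate removes the error.
    """
    tags = tag_fn(tokens)
    for i, tag in enumerate(tags):
        if tag not in ERROR_TAGS:
            continue
        for pieces in split_fn(tokens[i]):
            if len(pieces) < 2:
                continue  # a candidate must actually split the word
            fixed = correct(tokens[:i] + pieces + tokens[i + 1:], tag_fn, split_fn)
            if fixed is not None:
                return fixed
        return None  # no re-segmentation removed the error tag
    return list(zip(tokens, tags))


def toy_tagger(tokens):
    """Hypothetical stand-in for the trained model: replays the example taggings."""
    known = {
        ("邓颖", "超生", "前", "和", "刘晓辉", "同学", "合影"):
            ["TSE", "TSX", "TN", "TT", "TSE", "TEN", "TN"],
        ("邓颖", "超", "生", "前", "和", "刘晓辉", "同学", "合影"):
            ["TS", "TE", "TEN", "TN", "TT", "TSE", "TEN", "TN"],
    }
    return known[tuple(tokens)]


result = correct(
    ["邓颖", "超生", "前", "和", "刘晓辉", "同学", "合影"],
    toy_tagger,
    lambda word: [list(word)],  # simplest candidate: split into single characters
)
print(" ".join(f"{w}/{t}" for w, t in result))
```

Because every accepted candidate splits a word into strictly shorter pieces, the recursion always terminates. Passing a fuller enumeration as `split_fn` would try every recombination of characters rather than only the single-character split used here.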
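Based on the above method embodiment, an embodiment of the present invention further provides an apparatus for recognizing domain named entities. As shown in FIG. 3, the apparatus includes:
a word segmentation unit 21, configured to segment the text to be recognized into words.
a tagging unit 22, configured to tag each segmented word in the text to be recognized according to the tag set of the domain to which the text belongs, the tag set comprising a base tag set for domain named entity recognition and a tag set specific to the corresponding domain, wherein the base tag set contains position tags for words related to the composition of a domain named entity.
It should be noted that named entities in different domains have different internal features, and no single unified model can characterize the internal features of all named entities. Therefore, when recognizing named entities in different domains, the embodiment uses a different tag set for each domain. As described above, the tag set of each domain comprises the base tag set for domain named entity recognition and the tag set specific to that domain. The domain-specific tag set is the set of tags particular to a given domain. For example, a group tag set can be added for the automotive domain, and a surname tag set can be added for person name recognition.
The base tag set contains position tags for words related to the composition of a domain named entity. The position tags may be, but are not limited to, the following: TS, the word is at the head of the entity; TM, the word is in the middle of the entity; TE, the word is at the tail of the entity; TSN/TEN, the word is before/after the entity; TN, the word is unrelated to entities of the domain; TT, two entities are in a parallel (coordinate) relationship; TSX/TEX/TXS/TEX, word segmentation error; and so on.
an extraction unit 23, configured to extract the tagged words according to the domain named entity extraction rule. In the embodiment, given the tag set of a particular domain, the domain named entity extraction rule differs from task to task and can be set according to the entities required. For example, for the automotive domain the extraction rule is "*/TS+*/TE".
a word-combining unit 24, configured to combine the extracted words into a domain named entity.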
进一步的,如图4所示,该装置还包括:
检测单元25,用于在所述标注单元22根据所述待识别文本对应领域的标签集合,对待识别文本中的每个分词进行标签标注之后,检测所述标签标注的文本中是否存在分词错误的标签。
所述分词单元21还用于,当所述检测单元25检测到存在分词错误的标签时,对分词错误的标签所在的文本语句进行分词纠错处理得到新的分词。其中,所述分词单元21对分词错误的标签所在的文本语句进行分词纠错处理得到新的分词时,具体为:对分词错误的标签所在的文本语句按字拆分并重新组合得到新的分词。关于该分词单元对分词错误的标签所在的文本语句按字拆分并重新组合得到新的分词的相关描述,本发明实施例此处将不再赘述,相关描述可以参考方法实施例的对应描述。
所述标注单元22还用于,根据所述标签集合对所述新的分词中的每个分 词进行标签标注,直到进行标签标注的文本中不再出现分词错误的标签为止。
进一步的,如图5所示,该装置还包括:
获取单元26,用于在所述标注单元22根据所述待识别文本对应领域的标签集合,对待识别文本中的每个分词进行标签标注之前,获取领域命名实体识别的基础标签集合和各领域所属标签集合。
所述获取单元26还用于,取所述各领域所属标签集合与所述领域命名实体识别的基础标签集合的合集作为各领域的标签集合。
需要说明的是,本发明实施例中涉及的各功能单元及功能模块的其他描述,可以参考方法实施例中的对应描述,本发明实施例此处将不再赘述。
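With the method and device for recognizing a domain named entity provided by the present invention, when domain named entities in a text need to be recognized, each segmented word in the text to be recognized is first labeled according to the preset label set of the domain corresponding to the text, that is, according to the position labels of the words related to forming a domain named entity. The labeled segmented words are then extracted according to the domain named entity extraction rule, and the extracted segmented words are combined into a domain named entity. Compared with the prior art, which relies on dictionaries or rules to recognize domain named entities, the boundary of a domain named entity is no longer determined only by matching segmented words against a dictionary but by labeling. Named entity boundaries can therefore be located precisely, which effectively reduces the influence of segmentation results on domain named entity recognition and improves the accuracy of named entity recognition.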
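In addition, after the segmented words are labeled, the labeled text is checked for the specific labels that indicate a segmentation error. If such a label is detected, error correction is performed in this step: the text sentence to be recognized that corresponds to the label is re-segmented, and the re-segmented text is fed back into the labeling model as input, until no error-correction label appears among the labels. This step effectively prevents segmentation errors from affecting the result of domain named entity recognition and further ensures the accuracy of the recognized domain named entities.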
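The device for recognizing a domain named entity includes a processor and a memory. The segmentation unit, labeling unit, extraction unit, word-forming unit, detection unit, acquisition unit and so on are all stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.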
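As a rough picture of how such program units might be wired together, the sketch below models four of the units as plain callables held by one object. It is not the disclosed implementation: the class `DomainNerDevice`, its field names, and the toy stand-ins (whitespace segmentation, a lookup-table labeler, a single “*/TS+*/TE” rule) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Tagged = List[Tuple[str, str]]  # (segmented word, label) pairs


@dataclass
class DomainNerDevice:
    """Hypothetical container wiring the program units together."""
    segment: Callable[[str], List[str]]           # segmentation unit
    label: Callable[[List[str]], Tagged]          # labeling unit
    extract: Callable[[Tagged], List[List[str]]]  # extraction unit
    compose: Callable[[List[str]], str]           # word-forming unit

    def recognize(self, text: str) -> List[str]:
        tagged = self.label(self.segment(text))
        return [self.compose(words) for words in self.extract(tagged)]


# Toy stand-ins (assumptions): a lookup-table labeler and one extraction rule.
TOY_LABELS = {"邓颖": "TS", "超": "TE", "生": "TEN", "和": "TT",
              "刘晓辉": "TSE", "同学": "TEN"}


def toy_label(words):
    return [(w, TOY_LABELS.get(w, "TN")) for w in words]


def toy_extract(tagged):
    # Rule "*/TS+*/TE": a head word directly followed by a tail word.
    return [[w1, w2] for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
            if t1 == "TS" and t2 == "TE"]


device = DomainNerDevice(str.split, toy_label, toy_extract, "".join)
recognized = device.recognize("邓颖 超 生 前 和 刘晓辉 同学 合影")
```

Keeping each unit as a separate callable mirrors the description's division into units and lets any one of them be swapped without touching the others.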
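The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, the label-based method is used to locate named entity boundaries precisely, which effectively reduces the influence of segmentation results on domain named entity recognition and improves the accuracy of named entity recognition.
The memory may include non-persistent memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.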
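The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: performing word segmentation on a text to be recognized; labeling each segmented word in the text to be recognized according to the label set of the domain corresponding to the text to be recognized, where the label set includes a base label set for domain named entity recognition and a label set belonging to the corresponding domain, and the base label set includes position labels of the words related to forming a domain named entity; extracting the labeled segmented words according to a domain named entity extraction rule; and combining the extracted segmented words into a domain named entity.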
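Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.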
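In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-persistent memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The above are merely embodiments of the present application and are not intended to limit the present application. Various modifications and changes to the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.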

Claims (10)

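  1. A method for recognizing a domain named entity, characterized by comprising:
    performing word segmentation on a text to be recognized;
    labeling each segmented word in the text to be recognized according to a label set of the domain corresponding to the text to be recognized, wherein the label set comprises a base label set for domain named entity recognition and a label set belonging to the corresponding domain, and the base label set comprises position labels of the words related to forming a domain named entity;
    extracting the labeled segmented words according to a domain named entity extraction rule;
    combining the extracted segmented words into a domain named entity.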
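  2. The method according to claim 1, characterized in that the position labels comprise: segmentation error; and after labeling each segmented word in the text to be recognized according to the label set of the domain corresponding to the text to be recognized, the method further comprises:
    detecting whether a segmentation-error label exists in the labeled text;
    if a segmentation-error label exists, performing segmentation error correction on the text sentence containing the segmentation-error label to obtain new segmented words;
    labeling each of the new segmented words according to the label set, until no segmentation-error label appears in the labeled text.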
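  3. The method according to claim 2, characterized in that performing segmentation error correction on the text sentence containing the segmentation-error label to obtain new segmented words comprises:
    splitting the text sentence containing the segmentation-error label into characters and recombining them to obtain the new segmented words.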
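  4. The method according to any one of claims 1 to 3, characterized in that before labeling each segmented word in the text to be recognized according to the label set of the domain corresponding to the text to be recognized, the method further comprises:
    acquiring the base label set for domain named entity recognition and the label set belonging to each domain;
    taking the union of the label set belonging to each domain and the base label set for domain named entity recognition as the label set of that domain.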
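  5. The method according to claim 2 or 3, characterized in that the position labels further comprise:
    the word is at the head of the entity, the word is in the middle of the entity, the word is at the tail of the entity, the word is before/after the entity, the word is unrelated to entities of the domain, and two entities are in a coordinate relation.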
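  6. A device for recognizing a domain named entity, characterized by comprising:
    a segmentation unit, configured to perform word segmentation on a text to be recognized;
    a labeling unit, configured to label each segmented word in the text to be recognized according to a label set of the domain corresponding to the text to be recognized, wherein the label set comprises a base label set for domain named entity recognition and a label set belonging to the corresponding domain, and the base label set comprises position labels of the words related to forming a domain named entity;
    an extraction unit, configured to extract the labeled segmented words according to a domain named entity extraction rule;
    a word-forming unit, configured to combine the extracted segmented words into a domain named entity.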
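  7. The device according to claim 6, characterized in that the position labels comprise: segmentation error; and the device further comprises:
    a detection unit, configured to detect, after the labeling unit labels each segmented word in the text to be recognized according to the label set of the domain corresponding to the text to be recognized, whether a segmentation-error label exists in the labeled text;
    the segmentation unit is further configured to, when the detection unit detects a segmentation-error label, perform segmentation error correction on the text sentence containing the segmentation-error label to obtain new segmented words;
    the labeling unit is further configured to label each of the new segmented words according to the label set, until no segmentation-error label appears in the labeled text.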
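  8. The device according to claim 7, characterized in that, when performing segmentation error correction on the text sentence containing the segmentation-error label to obtain new segmented words, the segmentation unit is specifically configured to:
    split the text sentence containing the segmentation-error label into characters and recombine them to obtain the new segmented words.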
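  9. The device according to any one of claims 6 to 8, characterized by further comprising:
    an acquisition unit, configured to acquire, before the labeling unit labels each segmented word in the text to be recognized according to the label set of the domain corresponding to the text to be recognized, the base label set for domain named entity recognition and the label set belonging to each domain;
    the acquisition unit is further configured to take the union of the label set belonging to each domain and the base label set for domain named entity recognition as the label set of that domain.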
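  10. The device according to claim 7 or 8, characterized in that the position labels further comprise:
    the word is at the head of the entity, the word is in the middle of the entity, the word is at the tail of the entity, the word is before/after the entity, the word is unrelated to entities of the domain, and two entities are in a coordinate relation.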
PCT/CN2016/108426 2015-12-11 2016-12-02 Method and device for recognizing domain named entity WO2017097166A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/060,952 US10650192B2 (en) 2015-12-11 2016-12-02 Method and device for recognizing domain named entity

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510921228.9A CN106874256A (zh) 2015-12-11 2015-12-11 Method and device for recognizing domain named entity
CN201510921228.9 2015-12-11

Publications (1)

Publication Number Publication Date
WO2017097166A1 WO2017097166A1 (zh) 2017-06-15

Family

ID=59012688

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/108426 WO2017097166A1 (zh) 2015-12-11 2016-12-02 Method and device for recognizing domain named entity

Country Status (3)

Country Link
US (1) US10650192B2 (zh)
CN (1) CN106874256A (zh)
WO (1) WO2017097166A1 (zh)

Also Published As

Publication number Publication date
CN106874256A (zh) 2017-06-20
US20180365211A1 (en) 2018-12-20
US10650192B2 (en) 2020-05-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 16872357; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 16872357; Country of ref document: EP; Kind code of ref document: A1)