WO2016127677A1 - Address structuring method and apparatus - Google Patents

Address structuring method and apparatus

Info

Publication number
WO2016127677A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
word
dependency
relationship
address word
Prior art date
Application number
PCT/CN2015/094371
Inventor
茹旷
边旭
吴颖徽
马帅
贾西贝
Original Assignee
深圳市华傲数据技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市华傲数据技术有限公司
Publication of WO2016127677A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to the field of data processing technologies, and in particular, to an address structuring method and apparatus.
  • System design research on a new generation of quantity-and-quality fusion data management software, which aims at reliable, efficient, general-purpose, and automatic processing of complex and distributed data, covers how to capture meaningful matches in graph queries, how to cope with the dynamic nature of graph data and the complexity of queries, and how to query distributed graph data.
  • Google search added the Google Knowledge Graph, a Google knowledge base that uses semantic retrieval to gather information from multiple sources to improve the quality of Google search.
  • the Knowledge Graph provides structured and detailed information about the topic. The goal is that users will be able to use the information provided by this feature to resolve their queries without having to navigate to other sites and aggregate the information themselves.
  • the Knowledge Graph is a large semantic network whose nodes represent entities or concepts, and edges represent various semantic relationships between entities/concepts.
  • "Graph" here refers to a diagram that has been systematically compiled and drawn from real objects. This technology is an important, even key, link in automatically building an address knowledge base system. Its basic task is to determine the syntactic structure of a sentence or the relations between the words in it. In general, however, structuring addresses is not the final goal of an address knowledge base processing task.
  • the technology includes, but is not limited to, the following technologies: automatic word segmentation, part-of-speech tagging, syntactic analysis, and entity relationship extraction.
  • Word segmentation based on the hidden Markov model (HMM) is a typical segmentation method based on a statistical model.
  • part-of-speech is the basic grammatical attribute of vocabulary.
  • Part-of-speech tagging is the process of determining the grammatical category of each word in a given sentence, determining its part of speech and labeling it.
  • Part-of-speech tagging is an important research direction in natural language processing.
  • There are many methods for part-of-speech tagging, which fall roughly into two categories: rule-based methods and statistical methods.
  • Part-of-speech tagging based on the hidden Markov model (HMM) is a typical example of the statistical methods.
  • Dependency grammar is a framework that describes the structure of language through the dependency relations between words. It was first proposed by the French linguist L. Tesnière, and it analyzes a sentence into a dependency tree that describes the dependencies between its words.
  • Existing dependency parsing algorithms fall roughly into generative methods, discriminative methods, deterministic (decision-based) methods, and methods based on constraint satisfaction.
  • In the prior art, a dependency tree is generally used to express dependency relations when natural language is analyzed with dependency grammar, and the relations are analyzed mainly according to linguistic features such as subject, predicate, and object.
  • No dedicated address structuring research has targeted the data structure characteristics of addresses, and a plain tree structure cannot represent the complex relations in an address.
  • Another object of the present invention is to provide an address structuring apparatus for generating a dependency syntax diagram structure to represent a dependency relationship between words in an address text.
  • an address structuring method including:
  • Step 10: Segment the address text into an address word sequence;
  • Step 20: Perform part-of-speech tagging on each address word in the sequence according to a predefined part-of-speech tag set that reflects the characteristics of address words;
  • Step 30: Perform dependency parsing on the tagged address word sequence according to predefined address word dependency rules, using entity address words as nodes and the dependency relations between entity address words as edges, to generate a dependency syntax graph structure that reflects the address structure.
  • the address is a Chinese address.
  • step 10 the address text is segmented based on the hidden Markov model.
  • the part of speech tagging is performed based on the hidden Markov model in step 20.
  • step 20 the part-of-speech tagging result is also corrected by using a predefined tagging rule.
  • the part of speech tag set includes a tag representing a space occupied by an entity address word.
  • the label representing the space occupied by the entity address word is a country, a province, a city, a district, a street, a community, an area, a road, a house number, a building, a room, a junction, or a subway line.
  • the predefined dependency rule is an inclusion relationship, a house number pointing relationship, an adjacency relationship, or a same name relationship.
  • the invention also provides an address structuring device, comprising:
  • An address text segmentation module for dividing an address text into address word sequences
  • An address word labeling module configured to perform part-of-speech tagging on each address word in the address word sequence according to a predefined part of speech tagging that reflects the characteristics of the address word;
  • A dependency parsing module, configured to perform dependency parsing on the tagged address word sequence according to predefined address word dependency rules, using entity address words as nodes and the dependency relations between entity address words as edges, to generate a dependency syntax graph structure that reflects the address structure.
  • the address is a Chinese address.
  • the address structuring method and apparatus of the present invention can efficiently and automatically generate a dependency syntax diagram structure to represent a dependency relationship between words in an address text; a manual intervention strategy is simple, and does not require a large amount of background knowledge; The invention expands the structure of the dependency tree so that it can express the relationship between address words in the form of graphs; effectively assists the manual operation and simplifies the difficulty of obtaining the address knowledge.
  • FIG. 1 is a flow chart of a preferred embodiment of an address structuring method of the present invention
  • FIG. 2 is a dependency syntax diagram structure of an example address text in an embodiment of an address structuring method according to the present invention
  • FIG. 3 is a block diagram of an address structuring apparatus of the present invention.
  • FIG. 1 is a flowchart of a preferred embodiment of an address structuring method of the present invention.
  • the method mainly includes:
  • Step 10: segmenting the address text into an address word sequence; Step 20: performing part-of-speech tagging on each address word in the sequence according to the predefined part-of-speech tag set that reflects the characteristics of address words; Step 30: performing dependency parsing on the tagged address word sequence according to the predefined address word dependency rules.
  • Entity address words are used as nodes and the dependency relations between entity address words as edges, to generate the dependency syntax graph structure that reflects the address structure.
  • To finally provide the dependency syntax graph structure, the present invention needs to solve two main problems: segmenting and tagging arbitrary addresses, and then generating the dependency syntax graph structure from the segmented and tagged result.
  • Address segmentation and tagging are first performed by steps 10 and 20. Take "深圳市南山区高新中一道软件大厦713" (Room 713, Software Building, Gaoxin Zhong 1st Road, Nanshan District, Shenzhen) as an example. Segmentation first turns the address into the address word sequence "深圳市 - 南山区 - 高新中一道 - 软件大厦 - 713"; tagging then turns it into the tagged sequence "深圳市/city - 南山区/district - 高新中一道/road - 软件大厦/building - 713/room".
  • the Chinese address is taken as an example to illustrate the present invention.
  • the Chinese address referred to in the present invention is composed of characters included in the CJK character set in Unicode, and contains most of the Chinese characters and a small number of non-Chinese characters.
  • address segmentation, also called address word segmentation, cuts out the "words" in a Chinese address. Because the concept of an address word is not clearly defined anywhere, there is rarely a single definitive answer. Surveys by experts indicate that native Chinese speakers agree on the words appearing in a Chinese text only about 70% of the time. Encountering a segmentation ambiguity therefore does not mean that the system or method is unreliable or that one of the candidate segmentations must be wrong.
  • the invention follows two basic segmentation principles to ensure that address words are unambiguous under normal circumstances: the minimum-unit principle (a segmented word keeps its meaning and adds no ambiguity) and the no-ripple-effect principle (segmenting one word does not affect the meaning of other words).
  • a purely statistical model, the HMM, is first used to segment the address words. This is a common word segmentation method and is not described further. Address tagging is then performed according to a predefined tag set.
  • the address tagging task of the present invention is very similar to the usual part-of-speech tagging, except that the actual physical space category of each word is judged, which category is given by the address tagging system of the present invention.
  • the process of performing part-of-speech tagging in the present invention is the same as the general natural-language part-of-speech tagging process, but the part-of-speech tagging set of the present invention mainly focuses on the physical space category represented by the address word, instead of the noun, verb, adjective or Other part of speech.
  • To make addresses easier to process, the present invention proposes the backbone of the tag set according to the characteristics of the space that address words occupy. Then, to accommodate other parts of the sentence, some ordinary part-of-speech tags are introduced, such as the "and" tag.
  • A tag represents only the nature of the space occupied by the address word; there is no mandatory hierarchical containment between tags. In Singapore, for example, "country" and "city" are the same space. In the Vatican, the "country" is spatially subordinate to the "city" of Rome. When tagging, attend to the nature of the space, not its size. Table 1 below details the tagging system of a preferred embodiment of the present invention.
  • An address word that corresponds to an address entity is called an entity address word. Entity address words can be tagged with labels such as country, province, city, district, street, community, area, road, house number, building, room, junction, or subway line.
  • the address word labeling is the same as the word segmentation, which is an important basic problem for address information processing, and the two have a close relationship.
  • the method of combining rules and statistics is used for labeling.
  • the rule-based labeling method is an early labeling method. The basic idea is to construct the word class disambiguation rules according to the collocation relationship and context. The strategy of manual intervention is simple and does not require a lot of background knowledge.
  • the HMM statistical model is first used for a coarse first-pass tagging, and the coarse tagging result is then corrected by a dedicated, predefined rule system.
  • the reason why not only a pure statistical model is selected is based on the following considerations:
  • the parameter estimation of the model is a key issue.
  • the present invention can randomly initialize all parameters of the HMM, but this will make the labeling problem too unrestricted;
  • the preferred embodiment adds a correction to the results by a manually maintained rule system.
  • the method of the invention combines the statistical and regular methods, and has two main advantages: on the one hand, using the labeled corpus to perform parameter training on the statistical model, different parameters required for statistical disambiguation can be obtained; on the other hand, the machine is automatically labeled The results are compared with the results of the manual rules, and the errors that are automatically processed can be found, and a large amount of useful information is summarized to supplement and adjust the contents of the rule base.
  • the segmentation and labeling of the addresses are selected based on a hidden Markov model.
  • other appropriate word segmentation/labeling methods can also be selected for address segmentation/labeling. See Chinese patent application CN103440311A and CN102298585A.
  • step 30 the relationship between words and words is obtained through the rule system, and a dependency syntax diagram structure reflecting the address structure is generated.
  • the present invention proposes the following necessary and sufficient conditions that the address dependency graph structure should satisfy:
  • 1) Single head node: a sentence can have only one head node, that is, a node with outgoing edges only and no incoming edges.
  • 2) Connected: the dependency structure formed by a sentence must remain connected.
  • 3) Acyclic: no dependency relation in the sentence may form a cycle between components.
  • 4) Projective: if component A is directly subordinate to B, and component C is located between A and B in the sentence, then C is subordinate to A, to B, or to some component between A and B.
  • To keep the dependency syntax of addresses reasonable, the present invention proposes the following address word dependency rules.
  • 1) the containment relation (CONTAIN) indicates spatial containment between address words.
  • 2) the house number pointing relation (NUMBER) indicates that a road's house number system points to a space.
  • 3) the adjacency relation (SIDE) is mainly used to indicate adjacency to a road.
  • 4) the same-name relation (ALIAS), also called the alias relation, points from the primary name to the alias entity.
  • the deterministic dependency analysis method takes one word to be analyzed at a time, in a fixed direction, and produces a single analysis result for each input word until the last word of the sequence. At each step, such an algorithm must make a decision based on the current analysis state (for example, whether the word has a dependency relation with the previous word). This method is therefore also called a decision-based analysis method.
  • the present invention obtains a unique syntactic representation, namely a dependency graph, through a determined sequence of analysis actions (backtracking and patching may sometimes occur). This is the basic idea of the method used in the present invention.
  • the specific analysis process is similar to the prior-art process of analyzing natural sentences with dependency grammar, except that address words replace the subject, predicate, object, and so on, and the dependency relations are replaced by dependency relations between address words.
  • the analysis results are relations such as "深圳市"-[CONTAIN]->"南山区" (Shenzhen contains Nanshan District) and "高新中一道"-[SIDE]->"软件大厦" (Gaoxin Zhong 1st Road is adjacent to the Software Building).
  • the relations between word pairs constitute the address dependency graph structure.
  • FIG. 2 is the dependency syntax graph structure of an example address text in an embodiment of the address structuring method of the present invention.
  • By performing dependency parsing on the tagged address word sequence "深圳市/city - 南山区/district - 高新中一道/road - 软件大厦/building - 713/room" according to predefined address word dependency rules such as [CONTAIN] and [SIDE],
  • the dependency syntax graph structure of the example address text "深圳市南山区高新中一道软件大厦713" can be obtained.
  • the present invention also proposes a rule description syntax for addresses.
  • Both the predefined labeling rules and the address word dependency rules of the present invention can use the same logical and grammatical notation.
  • A statement beginning with 'if:' starts the condition clause; each condition is on its own line, and the conditions are combined with AND.
  • Each statement consists of two parts separated by ":".
  • The part before the colon gives the concept (Notion, also called the tag) or the value (Value, also called the word) at relative position i; the part after it lists the conditions to be satisfied, which are combined with OR.
  • 'then:' starts the execution clause.
  • In a condition clause, a leading 'N' denotes a concept and a leading 'V' denotes a value.
  • The number after the letter is the relative position: the current check position is 0, -1 denotes the previous word, and 1 denotes the next word.
  • N-1: building, house number
  • FIG. 3 is a block diagram of the address structuring apparatus of the present invention.
  • the present invention further provides an address structuring device, which mainly includes:
  • the address text segmentation module 1 is configured to divide the address text into address word sequences
  • the address word labeling module 2 is configured to perform part-of-speech tagging on each address word in the address word sequence according to a predefined part of speech tagging that reflects the characteristics of the address word;
  • the dependency parsing module 3 is configured to perform dependency parsing on the tagged address word sequence according to predefined address word dependency rules, using entity address words as nodes and the dependency relations between entity address words as edges, to generate a dependency syntax graph structure that reflects the address structure.
  • the address can be specifically a Chinese address.
  • the address structuring method and apparatus of the present invention can efficiently and automatically generate a dependency syntax diagram structure to represent a dependency relationship between words in an address text; a manual intervention strategy is simple, and does not require a large amount of background knowledge; The invention expands the structure of the dependency tree so that it can express the relationship between address words in the form of graphs; effectively assists the manual operation and simplifies the difficulty of obtaining the address knowledge.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

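The present invention relates to an address structuring method and apparatus. The address structuring method comprises: step 10, segmenting address text into an address word sequence; step 20, performing part-of-speech tagging on each address word in the sequence according to a predefined part-of-speech tag set that reflects the characteristics of address words; and step 30, performing dependency parsing on the tagged address word sequence according to predefined address word dependency rules, with entity address words as nodes and the dependency relations between entity address words as edges, to generate a dependency syntax graph structure that reflects the address structure. The invention also provides an address structuring apparatus. The method and apparatus can efficiently and automatically generate a dependency syntax graph structure that represents the dependency relations between the words in address text; the manual intervention strategy is simple and requires little background knowledge; and the invention extends the dependency tree structure so that the relations between address words can be expressed as a graph.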

Description

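Address structuring method and apparatus
Technical Field
The present invention relates to the field of data processing technology, and in particular to an address structuring method and apparatus.
Background
System design research on a new generation of quantity-and-quality fusion data management software, which aims at reliable, efficient, general-purpose, and automatic processing of complex and distributed data, includes research on how to capture meaningful matches in graph queries, how to cope with the dynamic nature of graph data and the complexity of queries, and how to query distributed graph data.
As of 2012, the size of data sets that could technically be analyzed and processed within a reasonable time was on the order of exabytes. In many fields, data sets are so large that scientists often run into limits and obstacles in analysis and processing. For ordinary people, too, picking out the knowledge they need from large amounts of data is becoming increasingly difficult. In 2012, Google therefore added the Google Knowledge Graph to Google Search. It is a Google knowledge base that uses semantic retrieval to gather information from multiple sources in order to improve the quality of Google Search. In addition to showing a list of links to other sites, the Knowledge Graph provides structured, detailed information about the topic. The goal is that users can use the information provided by this feature to resolve their queries without navigating to other sites and assembling the information themselves.
A knowledge graph is a large semantic network whose nodes represent entities or concepts and whose edges represent the various semantic relations between entities/concepts. "Graph" here refers to a diagram that has been systematically compiled and drawn from real objects. This technology is an important, even key, link in automatically building an address knowledge base system. Its basic task is to determine the syntactic structure of a sentence or the relations between the words in a sentence. In general, however, structuring addresses is not the final goal of an address knowledge base processing task. The technology includes, but is not limited to, automatic word segmentation, part-of-speech tagging, syntactic parsing, and entity relation extraction.
In linguistics, a word is the smallest unit of language that can be used independently. Chinese, as an isolating language, together with many agglutinative languages (such as Japanese), has no explicit word boundaries in text, unlike the text of Western inflectional languages such as English. Automatic word segmentation is therefore the first basic task a computer faces when processing isolating and agglutinative text, and it is an indispensable part of many application systems. Since the problem of automatic Chinese word segmentation was raised, experts have proposed many segmentation methods, including forward maximum matching (FMM), backward maximum matching (BMM), bidirectional scanning, and word-by-word traversal; most of these were proposed in the 1980s or earlier. Because most of them rely on a word list, they are generally referred to as dictionary-based segmentation methods. With the rapid development of statistical methods, a number of segmentation methods based on statistical models, as well as techniques that combine rule-based and statistical methods, have been proposed, and Chinese word segmentation has been studied in greater depth. Segmentation based on the hidden Markov model (HMM) is a typical segmentation method based on a statistical model.
In linguistics, part of speech is the basic grammatical attribute of a word. Part-of-speech tagging is the process of determining the grammatical category of each word in a given sentence, deciding its part of speech, and labeling it. Part-of-speech tagging is an important research direction in natural language processing. There are many tagging methods, which fall roughly into two categories: rule-based methods and statistical methods. Part-of-speech tagging based on the hidden Markov model (HMM) is a typical example of the statistical methods.
As for address segmentation and tagging specifically, dictionary-based segmentation methods in the prior art are described in Chinese patent applications CN103440311A and CN102298585A.
Separately, the framework that describes the structure of language through the dependency relations between words is called dependency grammar. It was first proposed by the French linguist L. Tesnière, and it analyzes a sentence into a dependency tree that describes the dependencies between its words. Existing dependency parsing algorithms fall roughly into generative methods, discriminative methods, deterministic (decision-based) methods, and methods based on constraint satisfaction.
Natural language processing based on dependency grammar continues to develop and improve. In the prior art, however, a dependency tree is generally used to express dependency relations when natural language is analyzed with dependency grammar, and the relations are analyzed mainly according to linguistic features such as subject, predicate, and object. No dedicated address structuring research has targeted the data structure characteristics of addresses, and a plain tree structure cannot represent the complex relations in an address.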
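Summary of the Invention
An object of the present invention is to provide an address structuring method that generates a dependency syntax graph structure representing the dependency relations between the words in address text.
Another object of the present invention is to provide an address structuring apparatus for generating a dependency syntax graph structure that represents the dependency relations between the words in address text.
To achieve the above objects, the present invention provides an address structuring method, comprising:
Step 10: segmenting address text into an address word sequence;
Step 20: performing part-of-speech tagging on each address word in the address word sequence according to a predefined part-of-speech tag set that reflects the characteristics of address words;
Step 30: performing dependency parsing on the tagged address word sequence according to predefined address word dependency rules, with entity address words as nodes and the dependency relations between entity address words as edges, to generate a dependency syntax graph structure that reflects the address structure.
The address is a Chinese address.
In step 10, the address text is segmented based on a hidden Markov model.
In step 20, part-of-speech tagging is performed based on a hidden Markov model.
In step 20, the part-of-speech tagging result is also corrected using predefined tagging rules.
The part-of-speech tag set includes labels that represent the space occupied by an entity address word.
The label representing the space occupied by an entity address word is country, province, city, district, street, community, area, road, house number, building, room, junction, or subway line.
The predefined dependency rule is a containment relation, a house number pointing relation, an adjacency relation, or a same-name relation.
The present invention also provides an address structuring apparatus, comprising:
an address text segmentation module, configured to segment address text into an address word sequence;
an address word tagging module, configured to perform part-of-speech tagging on each address word in the address word sequence according to a predefined part-of-speech tag set that reflects the characteristics of address words;
a dependency parsing module, configured to perform dependency parsing on the tagged address word sequence according to predefined address word dependency rules, with entity address words as nodes and the dependency relations between entity address words as edges, to generate a dependency syntax graph structure that reflects the address structure.
The address is a Chinese address.
In summary, the address structuring method and apparatus of the present invention can efficiently and automatically generate a dependency syntax graph structure that represents the dependency relations between the words in address text; the manual intervention strategy is simple and requires little background knowledge; the invention extends the dependency tree structure so that the relations between address words can be expressed as a graph; and it effectively assists manual work and makes address knowledge easier to acquire.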
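Brief Description of the Drawings
FIG. 1 is a flowchart of a preferred embodiment of the address structuring method of the present invention;
FIG. 2 is the dependency syntax graph structure of an example address text in an embodiment of the address structuring method of the present invention;
FIG. 3 is a block diagram of the address structuring apparatus of the present invention.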
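Detailed Description
The technical solutions of the present invention and their beneficial effects will become apparent from the following detailed description of specific embodiments, given with reference to the drawings.
FIG. 1 is a flowchart of a preferred embodiment of the address structuring method of the present invention. The method mainly comprises:
Step 10: segmenting address text into an address word sequence; step 20: performing part-of-speech tagging on each address word in the sequence according to a predefined part-of-speech tag set that reflects the characteristics of address words; step 30: performing dependency parsing on the tagged address word sequence according to predefined address word dependency rules, with entity address words as nodes and the dependency relations between entity address words as edges, to generate a dependency syntax graph structure that reflects the address structure. To finally provide the dependency syntax graph structure, the invention needs to solve two main problems: segmenting and tagging arbitrary addresses, and then generating the dependency syntax graph structure from the segmented and tagged result.
Address segmentation and tagging are first performed by steps 10 and 20. Take "深圳市南山区高新中一道软件大厦713" (Room 713, Software Building, Gaoxin Zhong 1st Road, Nanshan District, Shenzhen) as an example. Segmentation first turns the address into the address word sequence "深圳市 - 南山区 - 高新中一道 - 软件大厦 - 713"; tagging then turns it into the tagged sequence "深圳市/city - 南山区/district - 高新中一道/road - 软件大厦/building - 713/room".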
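Because inflectional languages come with word delimiters, address segmentation for them is relatively simple; the invention is therefore illustrated below using Chinese addresses only. A Chinese address, as referred to in the present invention, consists of characters included in the CJK character set of Unicode, which covers the vast majority of Chinese characters as well as a small number of non-Chinese characters. In this task, address segmentation, also called address word segmentation, cuts out the "words" in a Chinese address. Because the concept of an address word is not clearly defined anywhere, there is rarely a single definitive answer. Surveys by experts indicate that native Chinese speakers agree on the words appearing in a Chinese text only about 70% of the time. Encountering a segmentation ambiguity therefore does not mean that the system or method is unreliable or that one of the candidate segmentations must be wrong. The invention follows two basic segmentation principles to ensure that address words are unambiguous under normal circumstances:
1) The minimum-unit principle: a segmented word should keep its meaning and add no ambiguity.
2) The no-ripple-effect principle: segmenting one word should not affect the meaning of other words.
In a preferred embodiment of the invention, a purely statistical model, the HMM, is first used to segment the address words. This is a common word segmentation method and is not described further. Address tagging is then performed according to a predefined part-of-speech tag set (tagging set). The address tagging task of the invention is very similar to ordinary part-of-speech tagging, except that what is judged is the actual physical space category of each word, and this category is given by the address tagging system of the invention. In other words, the tagging process is the same as for general natural language, but the tag set focuses on the physical space category that an address word denotes rather than on whether the word is a noun, verb, adjective, or other part of speech.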
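The HMM segmentation step can be illustrated with a minimal sketch. The patent does not give the state set or any model parameters, so everything here is an assumption made for illustration: the common B/M/E/S character labelling (begin, middle, end, single), the toy probabilities, and the idea that characters such as 市, 区, 道, and 厦 tend to end an address word. Viterbi decoding then picks the most probable label sequence, and words are cut after every E or S.

```python
import math

# Character labels (assumed scheme): B = begins a word, M = middle,
# E = ends a word, S = single-character word.
STATES = "BMES"
START = {"B": 0.6, "S": 0.4}                      # toy values, not from the patent
TRANS = {
    "B": {"M": 0.5, "E": 0.5},
    "M": {"M": 0.5, "E": 0.5},
    "E": {"B": 0.7, "S": 0.3},
    "S": {"B": 0.7, "S": 0.3},
}
SUFFIX_CHARS = set("市区道厦")                    # assumed typical word-final characters


def emission(state, ch):
    """Toy emission score: suffix characters strongly prefer the E state."""
    if ch in SUFFIX_CHARS:
        return 0.8 if state == "E" else 0.05
    return {"B": 0.4, "M": 0.4, "E": 0.05, "S": 0.15}[state]


def safe_log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(text):
    """Return the most probable B/M/E/S label string for text."""
    if not text:
        return ""
    score = {s: safe_log(START.get(s, 0.0)) + safe_log(emission(s, text[0]))
             for s in STATES}
    back = []
    for ch in text[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + safe_log(TRANS[p].get(s, 0.0)))
            new_score[s] = (score[prev] + safe_log(TRANS[prev].get(s, 0.0))
                            + safe_log(emission(s, ch)))
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    last = max("ES", key=lambda s: score[s])      # a segmentation must end on E or S
    labels = [last]
    for pointers in reversed(back):
        labels.append(pointers[labels[-1]])
    return "".join(reversed(labels))


def segment(text):
    """Cut text into words after every E or S label."""
    words, current = [], ""
    for ch, label in zip(text, viterbi(text)):
        current += ch
        if label in "ES":
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words
```

With these toy parameters, `segment("深圳市南山区高新中一道软件大厦")` yields the four words 深圳市, 南山区, 高新中一道, and 软件大厦. A real system would estimate the start, transition, and emission probabilities from a tagged address corpus instead of hard-coding them.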
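To make addresses easier to process, the invention proposes the backbone of the tag set according to the characteristics of the space that address words occupy. Then, to accommodate other parts of the sentence, some ordinary part-of-speech tags are introduced, such as the "and" (与) tag. Note that a tag represents only the nature of the space occupied by an address word; there is no mandatory hierarchical containment between tags. In Singapore, for example, "country" and "city" are the same space. In the Vatican, the "country" is spatially subordinate to the "city" of Rome. When tagging, attend to the nature of the space, not its size. Table 1 below details the tagging system of a preferred embodiment of the invention. The invention calls an address word that corresponds to an address entity an entity address word; entity address words can be tagged with labels from Table 1 such as country, province, city, district, street, community, area, road, house number, building, room, junction, or subway line.
Table 1. Address tagging system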
Figure PCTCN2015094371-appb-000001
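Like word segmentation, address word tagging is an important basic problem in address information processing, and the two are closely related.
In this preferred embodiment, tagging combines rules and statistics. Rule-based tagging is one of the earliest tagging methods; its basic idea is to build word-class disambiguation rules from the collocations of multi-category words and their context. The manual intervention strategy is simple and requires little background knowledge. In this preferred embodiment, an HMM statistical model is first used for a coarse first-pass tagging, and the coarse tagging result is then corrected by a dedicated, predefined rule system.
A purely statistical model was not chosen on its own for the tagging process of this preferred embodiment, for the following reasons:
1) When implementing HMM-based tagging, parameter estimation is a key issue. The invention could initialize all HMM parameters randomly, but this would leave the tagging problem under-constrained;
2) Another issue is how well the HMM parameters adapt to the training corpus. Because probabilities differ between corpora, the HMM parameters should change as the corpus changes. In the classical HMM framework, however, once the model has been initialized with a tagged corpus, the tagged corpus can hardly play any further role.
Because of these issues, this preferred embodiment adds a manually maintained rule system that corrects the results. Combining statistical and rule-based methods has two main benefits. On the one hand, training the statistical model on a tagged corpus yields the various parameters needed for statistical disambiguation. On the other hand, comparing the machine's automatic tagging results with the results of the manual rules reveals where automatic processing goes wrong, and a large amount of useful information can be distilled from these errors to supplement and adjust the rule base.
In this preferred embodiment, both segmentation and tagging of addresses are based on the hidden Markov model. In practice, other suitable segmentation or tagging methods can also be used; see Chinese patent applications CN103440311A and CN102298585A.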
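The two-stage tagging (coarse statistical tags, then rule-based correction) might look like the sketch below. The coarse tags are simply given as input here rather than produced by a real HMM, and the single correction rule (digits directly after a building denote a room, not a house number) is a hypothetical example, not a rule taken from the patent. Tag names are written in English for readability.

```python
def correct_tags(tagged):
    """Apply hand-written correction rules to coarse (word, tag) pairs.

    `tagged` is the output of a statistical tagger; the rule below is a
    made-up illustration of what a manually maintained rule might do.
    """
    fixed = list(tagged)
    for i, (word, tag) in enumerate(fixed):
        prev_tag = fixed[i - 1][1] if i > 0 else None
        # Hypothetical rule: a purely numeric word right after a building is a room.
        if word.isdigit() and prev_tag == "building" and tag != "room":
            fixed[i] = (word, "room")
    return fixed


coarse = [("软件大厦", "building"), ("713", "house_number")]
print(correct_tags(coarse))
```

Keeping the rules as a separate pass means that a disagreement between the statistical tag and the rule output can be logged and reviewed, which matches the feedback loop described above for maintaining the rule base.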
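Next, in step 30, the relations between words are obtained through the rule system, and a dependency syntax graph structure that reflects the address structure is generated.
In an address, "dependency" refers to the relation between words of governing and being governed. This relation is not symmetric; it is directed. The governing component is called the governor (governor, regent, head), and the governed component is called the dependent (modifier, subordinate, dependency).
Given the data characteristics of address text (sentences), and unlike general natural-language dependency parsing, the invention proposes the following necessary and sufficient conditions that an address dependency graph structure should satisfy in order to be formed:
1) Single head node: a sentence can have only one head node, that is, a node with outgoing edges only and no incoming edges.
2) Connected: the dependency structure formed by a sentence must remain connected.
3) Acyclic: no dependency relation in the sentence may form a cycle between components.
4) Projective: if component A is directly subordinate to B, and component C lies between A and B in the sentence, then C is subordinate to A, to B, or to some component between A and B.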
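The first three conditions can be checked mechanically on a candidate graph; the sketch below does so for edges given as (governor, dependent) pairs. It is only an illustration under stated assumptions: the function name and return format are invented here, the projectivity condition is left out because it also needs word positions, and two of the example edges (南山区 to 高新中一道, and 软件大厦 to 713) are assumed for the demonstration rather than quoted from the patent.

```python
from collections import defaultdict


def check_structure(nodes, edges):
    """Check single-head, connectivity, and acyclicity for a dependency graph.

    nodes: list of address words; edges: list of (governor, dependent) pairs.
    """
    incoming = {n: 0 for n in nodes}
    children = defaultdict(list)
    neighbours = defaultdict(set)
    for governor, dependent in edges:
        incoming[dependent] += 1
        children[governor].append(dependent)
        neighbours[governor].add(dependent)
        neighbours[dependent].add(governor)

    # 1) Single head: exactly one node without incoming edges.
    heads = [n for n in nodes if incoming[n] == 0]

    # 2) Connected: depth-first search that ignores edge direction.
    seen, stack = set(), list(nodes[:1])
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(neighbours[n] - seen)

    # 3) Acyclic: Kahn's algorithm removes every node only if there is no cycle.
    remaining = dict(incoming)
    ready = [n for n in nodes if remaining[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for m in children[n]:
            remaining[m] -= 1
            if remaining[m] == 0:
                ready.append(m)

    return {
        "single_head": len(heads) == 1,
        "connected": len(seen) == len(nodes),
        "acyclic": removed == len(nodes),
    }


nodes = ["深圳市", "南山区", "高新中一道", "软件大厦", "713"]
edges = [("深圳市", "南山区"), ("南山区", "高新中一道"),
         ("高新中一道", "软件大厦"), ("软件大厦", "713")]
print(check_structure(nodes, edges))
```

Because a node may have more than one governor in a graph (unlike in a tree), the single-head check counts nodes with no incoming edge rather than requiring every other node to have exactly one.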
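To keep the dependency syntax of addresses reasonable, the invention proposes the following address word dependency rules.
1) The containment relation (CONTAIN) indicates spatial containment between address words.
2) The house number pointing relation (NUMBER) indicates that a road's house number system points to a space.
3) The adjacency relation (SIDE) is mainly used to indicate adjacency to a road.
4) The same-name relation (ALIAS), also called the alias relation, points from the primary name to the alias entity.
Because dependency parsing is existing technology, the invention uses only a rule-based deterministic dependency analysis method as a concrete example here. A deterministic method takes one word to be analyzed at a time, in a fixed direction, and produces a single analysis result for each input word until the last word of the sequence. At each step, such an algorithm must make a decision based on the current analysis state (for example, whether the word has a dependency relation with the previous word); the method is therefore also called a decision-based analysis method.
The invention obtains a unique syntactic representation, namely a dependency graph, through a determined sequence of analysis actions (backtracking and patching may sometimes occur). This is the basic idea of the method used. The specific analysis process is similar to the prior-art process of analyzing natural sentences with dependency grammar, except that address words replace the subject, predicate, object, and so on, and the dependency relations are replaced by dependency relations between address words. For example, the analysis results are relations such as "深圳市"-[CONTAIN]->"南山区" and "高新中一道"-[SIDE]->"软件大厦". The relations between word pairs constitute the address dependency graph structure.
FIG. 2 shows the dependency syntax graph structure of an example address text in an embodiment of the address structuring method of the invention. By performing dependency parsing on the tagged address word sequence "深圳市/city - 南山区/district - 高新中一道/road - 软件大厦/building - 713/room" according to predefined address word dependency rules such as [CONTAIN] and [SIDE], the dependency syntax graph structure of the example address text "深圳市南山区高新中一道软件大厦713" can be obtained. In FIG. 2, for cases such as A-[CONTAIN]->B, B-[CONTAIN]->C, and A-[CONTAIN]->C, the relation A-[CONTAIN]->C can be derived automatically from A-[CONTAIN]->B and B-[CONTAIN]->C, so its label is omitted from the dependency syntax graph structure shown in FIG. 2.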
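Omitting the derivable A-[CONTAIN]->C edges amounts to a transitive reduction of the CONTAIN relation. The sketch below is one way to do that, assuming the CONTAIN edges form a directed acyclic graph; the function name is invented for illustration and the patent does not prescribe any particular algorithm.

```python
def drop_derivable_contains(edges):
    """Remove every CONTAIN edge (a, c) that is implied by a longer path a -> ... -> c.

    edges: iterable of (container, contained) pairs, assumed to be acyclic.
    """
    edges = set(edges)
    successors = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)

    def reachable_without(start, goal, skipped_edge):
        # Depth-first search that is not allowed to use the direct edge itself.
        stack, seen = [start], set()
        while stack:
            n = stack.pop()
            for m in successors.get(n, ()):
                if (n, m) == skipped_edge or m in seen:
                    continue
                if m == goal:
                    return True
                seen.add(m)
                stack.append(m)
        return False

    return {e for e in edges if not reachable_without(e[0], e[1], e)}


contains = {("A", "B"), ("B", "C"), ("A", "C")}
print(sorted(drop_derivable_contains(contains)))
```

The omitted edges are not lost: they can be recovered at query time by following CONTAIN edges transitively.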
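A simple form of the algorithm that performs deterministic dependency analysis on the tagged address word sequence through the rule system is as follows:
for wordi in sentence:
    for wordj in sentence:
        if satisfy(wordi, wordj):
            # Rule-system constraint met: when address words wordi and wordj satisfy a predefined address word dependency rule, establish the corresponding dependency relation between wordi and wordj.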
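A runnable version of this loop could look like the sketch below. The rule table is an assumption for illustration: only the city/district CONTAIN pair and the road/building SIDE pair come from the example in the text, the other two entries are invented, and restricting candidates to adjacent words is a simplification that the patent does not state.

```python
# (governor tag, dependent tag) -> relation name. Entries marked "assumed"
# are illustrative and do not come from the patent text.
RULES = {
    ("city", "district"): "CONTAIN",
    ("district", "road"): "CONTAIN",      # assumed
    ("road", "building"): "SIDE",
    ("building", "room"): "CONTAIN",      # assumed
}


def satisfy(tagged, i, j):
    """Return the relation under which word i governs word j, or None."""
    if j != i + 1:                        # simplification: only adjacent words are linked
        return None
    return RULES.get((tagged[i][1], tagged[j][1]))


def analyze(tagged):
    """Build (governor, relation, dependent) edges for a tagged address."""
    edges = []
    for i in range(len(tagged)):
        for j in range(len(tagged)):
            relation = satisfy(tagged, i, j)
            if relation:
                edges.append((tagged[i][0], relation, tagged[j][0]))
    return edges


tagged = [("深圳市", "city"), ("南山区", "district"), ("高新中一道", "road"),
          ("软件大厦", "building"), ("713", "room")]
for edge in analyze(tagged):
    print(edge)
```

The double loop mirrors the pseudocode and is quadratic in the number of address words, which is acceptable because a single address rarely contains more than a dozen words.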
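In a specific implementation, to express rules more conveniently, the invention also proposes a rule description syntax for addresses. Both the predefined tagging rules and the address word dependency rules can use the same logic and syntax notation. A statement beginning with 'if:' starts the condition clause; each condition is on its own line, and the conditions are combined with AND. Each statement consists of two parts separated by ":". The part before the colon gives the concept (Notion, also called the tag) or the value (Value, also called the word) at relative position i; the part after it lists the conditions to be satisfied, which are combined with OR.
'then:' starts the execution clause. In a condition clause, a leading 'N' denotes a concept and a leading 'V' denotes a value.
The number after the letter is the relative position: the current check position is 0, a relative position of -1 denotes the previous word, and a relative position of 1 denotes the next word.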
1:(.*公司)(前.*):公司,楼栋
if:
N0:市,省
N-1:楼栋,门牌号
N1:公司
thenMerge:
0<>1:公司
Thenconnect
-1-c->2
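Simply put, the statements above mean the following:
If the current concept is '市' (city), the previous concept is '楼栋' (building), and the next concept is '公司' (company), then the current value and the next value can be merged, and the new value is given the concept '公司' (company). Finally, a connection is established between the word at relative position -1 and the word at relative position 2.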
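How such a rule might be evaluated can be sketched as follows. This is not the patent's rule engine: the dictionary representation of the 'if:' block, the function names, and the sample words are all hypothetical, English tag names stand in for the Chinese concept names, and the final connect step is left out because its exact semantics are not spelled out in the text.

```python
def matches(tagged, pos, conditions):
    """Check an 'if:' block at position pos.

    conditions maps a relative offset to the set of allowed concepts:
    the lines are ANDed, the options on one line are ORed.
    """
    for offset, allowed in conditions.items():
        k = pos + offset
        if not (0 <= k < len(tagged)) or tagged[k][1] not in allowed:
            return False
    return True


def merge_with_next(tagged, pos, new_concept):
    """Merge the value at pos with the next value and assign a new concept."""
    merged = (tagged[pos][0] + tagged[pos + 1][0], new_concept)
    return tagged[:pos] + [merged] + tagged[pos + 2:]


# N0: city or province; N-1: building or house number; N1: company
conditions = {0: {"city", "province"}, -1: {"building", "house_number"}, 1: {"company"}}

# Made-up word sequence, used only to exercise the rule.
tagged = [("软件大厦", "building"), ("深圳市", "city"), ("某某公司", "company")]
if matches(tagged, 1, conditions):
    tagged = merge_with_next(tagged, 1, "company")
print(tagged)
```

Representing each condition line as a set makes the AND/OR structure of the syntax explicit: every offset must match, and any member of its set is enough.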
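FIG. 3 is a block diagram of the address structuring apparatus of the invention. Corresponding to the address structuring method, the invention also provides an address structuring apparatus, which mainly comprises:
an address text segmentation module 1, configured to segment address text into an address word sequence;
an address word tagging module 2, configured to perform part-of-speech tagging on each address word in the address word sequence according to a predefined part-of-speech tag set that reflects the characteristics of address words;
a dependency parsing module 3, configured to perform dependency parsing on the tagged address word sequence according to predefined address word dependency rules, with entity address words as nodes and the dependency relations between entity address words as edges, to generate a dependency syntax graph structure that reflects the address structure.
The address may specifically be a Chinese address.
In summary, the address structuring method and apparatus of the invention can efficiently and automatically generate a dependency syntax graph structure that represents the dependency relations between the words in address text; the manual intervention strategy is simple and requires little background knowledge; the invention extends the dependency tree structure so that the relations between address words can be expressed as a graph; and it effectively assists manual work and makes address knowledge easier to acquire.
The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.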

Claims (10)

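  1. An address structuring method, characterized by comprising:
    step 10: segmenting address text into an address word sequence;
    step 20: performing part-of-speech tagging on each address word in the address word sequence according to a predefined part-of-speech tag set that reflects the characteristics of address words;
    step 30: performing dependency parsing on the tagged address word sequence according to predefined address word dependency rules, with entity address words as nodes and the dependency relations between entity address words as edges, to generate a dependency syntax graph structure that reflects the address structure.
  2. The address structuring method according to claim 1, characterized in that the address is a Chinese address.
  3. The address structuring method according to claim 2, characterized in that in step 10 the address text is segmented based on a hidden Markov model.
  4. The address structuring method according to claim 1, characterized in that in step 20 part-of-speech tagging is performed based on a hidden Markov model.
  5. The address structuring method according to claim 4, characterized in that in step 20 the part-of-speech tagging result is also corrected using predefined tagging rules.
  6. The address structuring method according to claim 1, characterized in that the part-of-speech tag set includes labels that represent the space occupied by an entity address word.
  7. The address structuring method according to claim 6, characterized in that the label representing the space occupied by an entity address word is country, province, city, district, street, community, area, road, house number, building, room, junction, or subway line.
  8. The address structuring method according to claim 1, characterized in that the predefined dependency rule is a containment relation, a house number pointing relation, an adjacency relation, or a same-name relation.
  9. An address structuring apparatus, characterized by comprising:
    an address text segmentation module, configured to segment address text into an address word sequence;
    an address word tagging module, configured to perform part-of-speech tagging on each address word in the address word sequence according to a predefined part-of-speech tag set that reflects the characteristics of address words;
    a dependency parsing module, configured to perform dependency parsing on the tagged address word sequence according to predefined address word dependency rules, with entity address words as nodes and the dependency relations between entity address words as edges, to generate a dependency syntax graph structure that reflects the address structure.
  10. The address structuring apparatus according to claim 9, characterized in that the address is a Chinese address.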
PCT/CN2015/094371 2015-02-13 2015-11-12 Address structuring method and apparatus WO2016127677A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510080522.1A CN104679850B (zh) 2015-02-13 2015-02-13 Address structuring method and apparatus
CN201510080522.1 2015-02-13

Publications (1)

Publication Number Publication Date
WO2016127677A1 WO2016127677A1 (zh) 2016-08-18

Family

ID=53314892

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/094371 WO2016127677A1 (zh) 2015-02-13 2015-11-12 Address structuring method and apparatus

Country Status (2)

Country Link
CN (1) CN104679850B (zh)
WO (1) WO2016127677A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060112133A1 (en) * 2001-11-14 2006-05-25 Ljubicich Philip A System and method for creating and maintaining data records to improve accuracy thereof
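CN102298585A (zh) * 2010-06-24 2011-12-28 高德软件有限公司 Address segmentation and level tagging method, and address segmentation and level tagging apparatus
CN104239355A (zh) * 2013-06-21 2014-12-24 高德软件有限公司 Search-engine-oriented data processing method and apparatus
CN104679850A (zh) * 2015-02-13 2015-06-03 深圳市华傲数据技术有限公司 Address structuring method and apparatus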

Also Published As

Publication number Publication date
CN104679850A (zh) 2015-06-03
CN104679850B (zh) 2018-05-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 15881826; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: pct application non-entry in european phase. Ref document number: 15881826; Country of ref document: EP; Kind code of ref document: A1