CN112347222B - Method and system for converting non-standard address into standard address based on knowledge base reasoning

Info

Publication number: CN112347222B
Application number: CN202011141247.7A
Authority: CN
Inventors: 吕晓宝; 叶恺翔; 王元兵; 王海荣
Original assignee: Sugon Nanjing Research Institute Co ltd
Current assignee: Sugon Nanjing Research Institute Co ltd
Priority date: 2020-10-22
Filing date: 2020-10-22
Publication date: 2022-03-18
Anticipated expiration: 2040-10-22
Also published as: CN112347222A

Abstract

The invention discloses a method and a system for converting a non-standard address into a standard address based on knowledge base reasoning. The method comprises the following steps: first, setting the ontology of an address knowledge base; second, constructing a standard address knowledge base, in which entities are built from a traditional standard address library and word vectors of the standard addresses are constructed; third, comparing the word vectors by a cosine similarity algorithm and mapping them to entities in the knowledge base; and finally, extracting the address elements and orientation relation description information from the original text by named entity recognition, and searching the standard address knowledge base for the entities matching the address elements by using a semantic similarity algorithm based on address names. Through natural language processing and knowledge graph processing, non-standardized address text data are automatically mapped to standard addresses by an algorithm, completing the cleaning and governance of the address data.

Description

Method and system for converting non-standard address into standard address based on knowledge base reasoning

Technical Field

The invention relates to an address conversion technology, in particular to a method and a system for converting a non-standard address into a standard address based on knowledge base reasoning.

Background

With the progress of informatization in digital cities and smart cities across regions, the business information of different departments is gradually being brought into the scope of informatization. However, most of the addresses that express spatial positions in this information are semantic place-name addresses described in natural language, whereas in the information world the relative position of a spatial subject is determined by spatial geographic coordinates, which are the main index for the spatialization of all kinds of information. Address spatialization is one of the core technologies of location-based application service systems. How to associate and match addresses with spatial geographic coordinates is the key to spatializing address information and the basis for the spatial management of large volumes of business data.

At present, non-standard address mapping algorithms basically calculate the similarity between the non-standard address and each address text in the standard address library, and then select the most similar address as the output. The similarity algorithms generally adopted are: 1. keyword-based matching; 2. cosine similarity based on short text vectors; 3. string edit distance; 4. big data recommendation based on user click behavior; 5. treating the mapping process as a text classification task and learning it automatically with naive Bayes and neural network models. These similarity algorithms basically meet the requirements of non-standard address mapping but lack reasoning capability.

Disclosure of Invention

The purpose of the invention is as follows: to provide a method and a system for converting a non-standard address into a standard address based on knowledge base reasoning, so as to solve the above problem.

The technical scheme is as follows: a method for converting a non-standard address into a standard address based on knowledge base reasoning comprises:

step 1: setting an ontology of an address knowledge base;

step 2: constructing a standard address knowledge base;

step 3: comparing by a cosine similarity algorithm;

step 4: extracting the address information of the original text.

According to one aspect of the invention, the ontology of the address knowledge base in step 1 comprises a knowledge graph ontology, uuids of entities, entity attributes and relationships among the entities. The knowledge graph ontology comprises six levels: province, city, district/county, street/town, road section and address unit. The entities are the standard addresses of the different levels and are distinguished by globally unique identifiers. The uuid of an entity consists of three parts: its knowledge graph ontology, its name and its number in the knowledge base, where the number is an administrative division number or an address number. The entity attributes comprise name, type, label, center-point longitude and latitude, boundary longitude and latitude sequence and remarks; the label is the social attribute of the address entity.

According to an aspect of the present invention, the step 2 is further:

step 21, constructing a standard address knowledge base: constructing word vectors of the standard addresses, constructing relationships among entities, calculating relationships among entities, and acquiring hidden relationships, wherein the sources of the standard address knowledge base comprise a traditional standard address library and unstructured text data;

step 22, building entities from the traditional standard address library, wherein the traditional standard address library comprises place names, longitude and latitude, address types and address labels; when a standard address is brought into the knowledge graph, each standard address forms an entity according to the entity uuid defined in step 1, and its field values are normalized into the corresponding attribute values according to the mapping between fields and entity attributes;

step 23, building word vectors of the standard addresses from the standard knowledge base: the address character string is cut with a sliding window of length 2 and step length 1, producing a group of character strings of length 2 that serve as the vector basis, and the value of each vector component is the number of times the corresponding basis string appears in the address character string;

step 24, constructing relationships between entities from structured administrative division information: the existing administrative division information is used to construct directly the belonging relationship between lower-level and upper-level addresses, and the equal relationship for the same address written under different names;

step 25, calculating relationships between entities from longitude and latitude: the distance and orientation between every two entities are calculated, with 1 kilometer as the truncation radius of the adjacent relationship; each of the four orientations east, west, south and north takes the interval of 45 degrees to the left and right of its standard angle as its direction interval; for address unit entities on the same road section, the actual travel distance along the road section is used as the distance attribute value of the orientation relationship;

step 26, extracting hidden relationships between existing entities in the knowledge base from the unstructured text data, namely the hidden relationship between an orally described address and the corresponding manually calibrated standard address;

for each piece of unstructured text data, the address elements in the text are first extracted by named entity recognition; the word vectors of the extracted address elements are then compared, by the cosine similarity algorithm, with the address word vectors of the entities in the knowledge base, and the address elements are mapped to an entity A in the knowledge base.
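The word-vector construction of step 23 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name `bigram_vector` and the sample address string are assumptions, and a `Counter` is used as a sparse vector whose keys are the length-2 basis strings.

```python
from collections import Counter


def bigram_vector(address: str) -> Counter:
    """Slide a window of length 2 over the string with step 1 and count each piece."""
    return Counter(address[i:i + 2] for i in range(len(address) - 1))


# Hypothetical address string; every distinct 2-character piece is one basis element,
# and its count is the value of the vector on that basis element.
vector = bigram_vector("南京市南京路")
print(vector)
```

A sparse mapping avoids fixing a global basis in advance; the basis of each vector is simply its set of keys.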

According to an aspect of the present invention, step 3 is further: the word vectors are compared by a cosine similarity algorithm. The word vector obtained by segmenting the non-standard address character string is denoted as vector $\vec{a}$ and the word vector of the standard address as vector $\vec{b}$. Because their bases differ, the two vectors lie in different vector spaces and must be converted into the same vector space: the union of the bases of $\vec{a}$ and $\vec{b}$ is taken to form a merged basis, and both vectors are converted into the new merged vector space spanned by the merged basis. The similarity between the non-standard address word vector $\vec{a}$ and the standard address word vector $\vec{b}$ is then calculated with the cosine similarity formula as follows:

step 31, splicing the bases of the two word vectors to form the union of the vector bases and obtaining the new word vector values, the new vectors generated being (1,1,0,0) and (0,1,1,1);

step 32, obtaining the following according to the cosine similarity algorithm:

$$\cos\theta = \frac{\vec{a}\cdot\vec{b}}{\lVert\vec{a}\rVert\,\lVert\vec{b}\rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$

where $\vec{a}$ and $\vec{b}$ both denote vectors;

let vector $\vec{a} = (1,1,0,0)$ and vector $\vec{b} = (0,1,1,1)$; substituting them into the cosine similarity algorithm gives:

$$\cos\theta = \frac{1\cdot 0 + 1\cdot 1 + 0\cdot 1 + 0\cdot 1}{\sqrt{1^2+1^2+0^2+0^2}\,\sqrt{0^2+1^2+1^2+1^2}} = \frac{1}{\sqrt{2}\,\sqrt{3}} = \frac{1}{\sqrt{6}} \approx 0.408$$

By the above method, the standard addresses with the highest cosine similarity are extracted for each non-standard address to form the candidate set of standard addresses for the queried non-standard address, and entity B is further obtained from the recorded manually verified standard address.
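The comparison of step 3 can be illustrated with a short sketch that merges the two bases and ranks standard addresses by cosine similarity. The helper names (`bigram_vector`, `cosine_similarity`, `best_candidates`), the `top_k` parameter and the sample strings are assumptions made for illustration; the strings "ABC" and "BCDE" are chosen so that the merged basis yields the vectors (1,1,0,0) and (0,1,1,1).

```python
import math
from collections import Counter


def bigram_vector(address: str) -> Counter:
    """Window length 2, step 1: count each 2-character piece."""
    return Counter(address[i:i + 2] for i in range(len(address) - 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Project both sparse vectors onto the union of their bases, then take the cosine."""
    basis = sorted(set(a) | set(b))          # merged basis
    va = [a[k] for k in basis]               # a Counter returns 0 for a missing key
    vb = [b[k] for k in basis]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0


def best_candidates(query: str, standard_addresses, top_k: int = 3):
    """Return the top_k (similarity, address) pairs for a non-standard query string."""
    q = bigram_vector(query)
    scored = [(cosine_similarity(q, bigram_vector(s)), s) for s in standard_addresses]
    return sorted(scored, reverse=True)[:top_k]


# "ABC" -> {AB, BC}; "BCDE" -> {BC, CD, DE}; merged basis (AB, BC, CD, DE)
print(cosine_similarity(bigram_vector("ABC"), bigram_vector("BCDE")))
```

The printed value corresponds to 1/√6, in agreement with the worked vectors of step 31.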

According to one aspect of the invention, entity B is obtained from the recorded manually verified standard address, and it is determined whether a relationship exists between entity A and entity B as mapped into the knowledge base. The additional relation orientation description in the text address is extracted by combining regular expressions with a part-of-speech tagging algorithm and mapped to the corresponding relationship type and attributes in the knowledge base ontology, after which the corresponding relationship from entity A to entity B is established. The additional relation orientation description is extracted in the following steps:

step 1, performing part-of-speech tagging on the text with an open-source word segmentation tool, filtering place names, proper nouns, verbs, adjectives and time words, and segmenting the text into several semantic segments;

step 2, judging by regular expression matching whether each segment is a relation orientation description;

step 3, describing the orientation semantic segments with a regular expression;

a relationship exists between entity A and entity B, and the occurrence probabilities of entity A and entity B influence each other, that is:

searching the standard address knowledge base for the entity matching the address elements gives the following:

the following is further derived from the relationship between the two:

where $A$ and $B$ denote independent entity vector events.

According to an aspect of the present invention, step 4 is further: the address information of the original text is extracted in the following steps:

step 41, identifying address and direction description information;

step 42, matching address entities, and if no azimuth description exists, ending the process;

step 43, matching the orientation description into a standard relationship, and screening the relationship conforming to the orientation description from all the relationships of the matched address entities;

step 44, deducing a tail entity according to the relationship, and if the attribute description of the tail entity exists, further screening the tail entity;

step 45, if a plurality of relationships are continuously inferred, the existence of the intermediate entity corresponding to each relationship needs to be confirmed;

step 46, screening the uniqueness of the address description information jointly according to the head and tail entities and the relationship attributes;

Further, in step 4, starting from entity A as mapped into the knowledge base and entity B as obtained in step 3 from the recorded manually verified standard address, the address elements and orientation relation description information in the original text are extracted by named entity recognition, and the entity matching the address elements is then searched for in the standard address knowledge base by using a semantic similarity algorithm based on address names. If the matched entity is unique and there is no relation orientation description information, the standard address conversion is complete and no subsequent steps are needed. If the matched entity is unique and relation orientation description information exists, the relation information is matched against the standard relationships in the knowledge base ontology by the semantic similarity algorithm to determine the relationship type and corresponding attributes; among all the relationships connected to entity A, the relationship whose attributes are most similar is searched for, and entity B is obtained. For distance accuracy, the allowable error range is set to float within 30% of the distance attribute value. When an exact description of an address attribute appears, it is used to screen the tail entity. When multi-hop relationship reasoning is involved, the several orientation relationships are inferred step by step and the existence of each intermediate entity is confirmed in turn until the tail entity is finally confirmed to exist. When the matched entity is not unique, the orientation relationships and attribute information are screened jointly, and the standard address of the tail entity is extracted.
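The single-hop case described above (unique head entity, one orientation relationship, optional attribute screening) might look like the following sketch. The `Adjacent` record, the function `infer_tail`, the sample knowledge base and the reading of the 30% tolerance as a symmetric band around the stored distance attribute value are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Adjacent:
    """One 'adjacent' triple: head entity, orientation/distance attributes, tail entity."""
    head: str
    orientation: str
    distance_m: float
    tail: str
    tail_label: Optional[str] = None


def infer_tail(relations: List[Adjacent], head: str, orientation: str,
               distance_m: float, label: Optional[str] = None,
               tolerance: float = 0.30) -> List[str]:
    """Return tail entities whose relationship fits the described orientation and distance."""
    matches = []
    for r in relations:
        if r.head != head or r.orientation != orientation:
            continue
        # Assumed reading: the described distance may deviate by up to 30%
        # of the distance attribute value stored in the knowledge base.
        if abs(distance_m - r.distance_m) > tolerance * r.distance_m:
            continue
        if label is not None and r.tail_label != label:
            continue
        matches.append(r.tail)
    return matches


# Hypothetical knowledge base fragment.
KB = [
    Adjacent("Building X", "east", 100.0, "Shop Y", "shop"),
    Adjacent("Building X", "east", 110.0, "Clinic W", "hospital"),
    Adjacent("Building X", "east", 400.0, "Shop Z", "shop"),
]
print(infer_tail(KB, "Building X", "east", 120.0, label="shop"))
```

Passing `label=None` shows why attribute screening matters: both "Shop Y" and "Clinic W" fall inside the distance band, and only the label separates them.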

According to one aspect of the present invention, the relationship conforming to the orientation description is screened from all the relationships of the matching address entity, and the following steps are obtained:

43.1, establishing a non-standard address library and a standard address library which are independent of each other;

43.2, preprocessing the non-standard address base and performing first-level address matching with the standard address base;

43.3, splitting the address of the pre-processed non-standard address library and the standard address library to form an independent address library, and completing the allocation of the non-standard address library and the standard address library;

43.4, matching the addresses of the second level formed by the non-standard address base and the standard address base;

43.5, after the second-level addresses are matched, judging whether the traversal of the level combination modes is finished; if so, the matching of the address libraries is complete; if not, step 43.4 is performed again so that the addresses of the non-standard address library and the standard address library are matched.

Advantageous effects: the invention provides a method and a system for converting a non-standard address into a standard address based on knowledge base reasoning. The standard addresses and their mutual relationship attributes are built into a knowledge base in the form of (head entity, directed relationship, tail entity) triples, and the knowledge base is stored as a knowledge graph. By extracting the standard address contained in an unstructured address together with the related orientation and attribute elements, the head entity and the directed relationship of the triple are determined, which fixes the knowledge graph query condition and yields the tail entity address inferred from the standard address. The method applies effectively to scenarios in which a user describes an address orally and helps the system locate quickly and accurately the real address the user means. Compared with traditional standard address mapping algorithms, the method can automatically construct and update the knowledge base from existing structured and unstructured data and perform logical reasoning, which matches actual business scenarios.

Drawings

FIG. 1 is a flow chart of the standard address repository construction of the present invention.

FIG. 2 is a flow chart of converting a non-standard address using the knowledge base according to the present invention.

FIG. 3 is an address matching flow diagram of the present invention.

Detailed Description

In this embodiment, as shown in fig. 1, a method for converting a non-standard address into a standard address based on knowledge base reasoning includes:

step 1: setting an ontology of an address knowledge base;

step 2: constructing a standard address knowledge base;

step 3: comparing by a cosine similarity algorithm;

step 4: extracting the address information of the original text.

In a further embodiment, the ontology of the address knowledge base in step 1 comprises a knowledge graph ontology, uuids of entities, entity attributes and relationships among the entities. The knowledge graph ontology comprises six levels: province, city, district/county, street/town, road section and address unit. The entities are the standard addresses of the different levels and are distinguished by globally unique identifiers. The uuid of an entity consists of three parts: its knowledge graph ontology, its name and its number in the knowledge base, where the number is an administrative division number or an address number. The entity attributes comprise name, type, label, center-point longitude and latitude, boundary longitude and latitude sequence and remarks; the label is the social attribute of the address entity.

In a further embodiment, step 2 further comprises steps 21 to 26 as set out above.

In a further embodiment, the step 3 is further:

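the word vectors are compared by a cosine similarity algorithm. The word vector obtained by segmenting the non-standard address character string is denoted as vector $\vec{a}$ and the word vector of the standard address as vector $\vec{b}$; the union of the bases of $\vec{a}$ and $\vec{b}$ is taken to form a merged basis, and the cosine similarity is

$$\cos\theta = \frac{\vec{a}\cdot\vec{b}}{\lVert\vec{a}\rVert\,\lVert\vec{b}\rVert}$$

where $\vec{a}$ and $\vec{b}$ both denote vectors; for example, let vector $\vec{a} = (1,1,0,0)$ and vector $\vec{b} = (0,1,1,1)$.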

In a further embodiment, entity B is obtained from the recorded manually verified standard address, and it is determined whether a relationship exists between entity A and entity B as mapped into the knowledge base. The additional relation orientation description in the text address is extracted by combining regular expressions with a part-of-speech tagging algorithm and mapped to the corresponding relationship type and attributes in the knowledge base ontology, after which the corresponding relationship from entity A to entity B is established.

The following is further derived from the relationship between the two:

where $A$ and $B$ denote independent entity vector events.

In a further embodiment, step 4 is further: the address information of the original text is extracted in steps 41 to 46 set out above, beginning with:

step 41, identifying the address and orientation description information.

In a further embodiment, the relationships conforming to the orientation description are screened from all the relationships of the matched address entity in steps 43.1 to 43.5 set out above.

In a further embodiment, the label is a social attribute of the address entity, such as "store", "supermarket", "school", "hospital", "institution", "enterprise" or "residential district". Entities of different types have different attribute sets: "face" type entities, such as provinces, cities, counties, towns, communities and districts, should include all attributes; "point" type entities, such as address units, need not include the boundary longitude and latitude sequence; "line" type entities, such as road sections, need not include the center-point longitude and latitude.

In further embodiments, the relationships between entities are classified into four types: "belong to", "equal", "adjacent" and "cross":

the belonging relationship is the spatial containment relationship in which a lower-level entity belongs to an upper-level entity among the six levels of entities. Normally a lower-level entity has this relationship only with its nearest upper-level entity. There are exceptions, however: an "address unit" entity that corresponds to an intersection may belong to several "road section" entities, and a "road section" entity that spans different areas may belong to different streets/towns;

the equal relationship means that different place entities actually correspond to the same place, because of different naming or spatial overlap;

the adjacent relationship has two attributes, "orientation" and "distance": the orientation takes discrete values such as "south", "north", "east", "west", "opposite" and "near"; the distance is a specific numerical value in meters;

the cross relationship applies to line-type address entities, for example the intersection produced where road sections (streets, roads, lanes) cross, the intersection itself also corresponding to an address unit entity. The head entity and the tail entity of the cross relationship are both road section entities. The relationship has two attributes, intersection type and intersection entity: the intersection type attribute records the kind of intersection, and the value of the intersection entity attribute is the uuid of the intersection entity produced by the crossing.
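The entity side of the ontology can be pictured with a small data class. This is only a sketch: the field names, the colon-separated uuid layout and the `missing_attributes` helper are assumptions; the source states only that the uuid combines the ontology level, the name and the number, and that face, line and point entities require different attributes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# The six ontology levels named in the description.
LEVELS = ("province", "city", "district_county", "street_town",
          "road_section", "address_unit")


@dataclass
class Entity:
    level: str                      # one of LEVELS
    name: str
    number: str                     # administrative division number or address number
    geometry: str                   # "face", "line" or "point"
    label: Optional[str] = None     # social attribute, e.g. "supermarket"
    center: Optional[Tuple[float, float]] = None          # (latitude, longitude)
    boundary: Optional[List[Tuple[float, float]]] = None  # boundary coordinate sequence

    @property
    def uuid(self) -> str:
        # Assumed layout: level, name and number joined by colons.
        return f"{self.level}:{self.name}:{self.number}"

    def missing_attributes(self) -> List[str]:
        """List the geometry attributes this entity type needs but does not have."""
        required = {"face": ("center", "boundary"),
                    "line": ("boundary",),
                    "point": ("center",)}[self.geometry]
        return [a for a in required if getattr(self, a) is None]


unit = Entity("address_unit", "Huawei Building", "0001", "point", center=(32.0, 118.8))
print(unit.uuid, unit.missing_attributes())
```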

In a further embodiment, the relationships between entities are calculated from longitude and latitude. For example, if the longitude and latitude show that "new street crossing" lies 100 m ahead of "Huawei Building" in a direction 10° off due east, an adjacent relationship with the attributes "orientation: east; distance: 100 meters" can be added from the "Huawei Building" entity to the "new street crossing" entity.
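One way to derive the adjacent relationship from coordinates is sketched below. The haversine distance and initial-bearing formulas are a substitution chosen for illustration, since the source does not say how distance and orientation are computed; the function names and the sample coordinates are also assumptions. The sketch uses straight-line distance only and does not cover the along-road travel distance used for address units on the same road section.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def distance_and_bearing(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters and initial bearing in degrees (0 = north, 90 = east)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = p2 - p1, math.radians(lon2 - lon1)
    h = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    dist = 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))
    y = math.sin(dlon) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlon)
    bearing = (math.degrees(math.atan2(y, x)) + 360) % 360
    return dist, bearing


def orientation(bearing: float) -> str:
    """Each of the four orientations covers 45 degrees either side of its standard angle."""
    return ("north", "east", "south", "west")[int(((bearing + 45) % 360) // 90)]


def adjacency(lat1, lon1, lat2, lon2, radius_m: float = 1000.0):
    """Return the adjacent-relationship attributes, or None beyond the 1 km truncation radius."""
    dist, bearing = distance_and_bearing(lat1, lon1, lat2, lon2)
    if dist > radius_m:
        return None
    return {"orientation": orientation(bearing), "distance_m": round(dist)}


# Hypothetical coordinates: the second point is a short distance due east of the first.
print(adjacency(32.0, 118.0, 32.0, 118.001))
```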

In a further embodiment, the regular expression is (near the opposite|east|south|west|north|side|adjacent|side|next|partition)[\u4E00-\u9FA5|0-9]{0,3}$, which means that a semantic segment conforming to an orientation description must contain one of the words "opposite", "east" and so on, at no more than three characters from the end of the string. For example, for a text stating that something happened across the street from Huawei Building at eight in the morning, the entity A "Huawei Building" is identified and extracted by named entity recognition; the orientation semantic description "across the street" is then extracted with the regular expression template and the part-of-speech tagging algorithm and mapped by the semantic similarity algorithm to the standard relationship type "opposite"; combined with the manually verified address of the text, an "opposite" relationship can be added between entity A "Huawei Building" and entity B "Guangxi Building".
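The regular-expression stage can be tried out with Python's `re` module. The Chinese keyword list below is an assumption, because the source gives the alternatives only in translation; the trailing character class and the `{0,3}$` anchor follow the expression quoted above. Part-of-speech tagging is not shown.

```python
import re

# Assumed keywords (opposite, nearby, next door, east, south, west, north, beside, adjacent, side).
ORIENTATION = re.compile(r"(对面|附近|隔壁|东|南|西|北|旁|邻|侧)[\u4E00-\u9FA5|0-9]{0,3}$")


def is_orientation_segment(segment: str) -> bool:
    """True if an orientation keyword occurs at most three characters before the end."""
    return ORIENTATION.search(segment) is not None


for segment in ("街对面", "大厦东侧", "东方明珠电视塔", "上午八点"):
    print(segment, is_orientation_segment(segment))
```

The third sample shows the purpose of the end anchor: the keyword 东 appears, but too far from the end of the segment to count as an orientation description.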

In a further embodiment, named entity recognition extracts the address elements and the orientation relation description information in the original text; for example, from "the shop 120 m east of the building" it extracts "120 m east", "shop" and so on.

In a further embodiment, when the matched entity is unique, the orientation description is mapped to a standard relationship; for example, "120 meters east" is mapped to an adjacent relationship with the attributes "orientation: east; distance: 120 m".

In a further embodiment, when a precise description of an address attribute appears, such as "shop" in "the shop 120 m east of the building", the address attribute "shop" is used as a precondition of the query for the tail entity B.

In a further embodiment, when multi-hop relationship reasoning is involved, the several orientation relationships are inferred step by step and the existence of each intermediate entity is confirmed in turn until the tail entity is confirmed to exist. Take "opposite the Suguo supermarket 50 meters east of the intersection of Yellow Mountain Road and Level Road": first, all cross relationships of the "Yellow Mountain Road" entity are searched to find the one whose other end is the "Level Road" entity, and the uuid of the intersection entity is read from the attributes of that relationship to locate the intersection entity; next, the tail entity reached from the intersection entity by the relationship "orientation: east; distance: 50 meters" is found, and its label attribute is confirmed to be "supermarket"; finally, the address entity in an "opposite" relationship with that tail entity is searched for. The process must confirm the existence of every intermediate entity; if one does not exist, the conversion fails.
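The multi-hop example can be traced with a toy triple store. Everything named here is hypothetical: the tuple layout `(head, relationship type, attributes, tail)`, the identifiers, the helper functions and the ±30% distance band are assumptions that mirror the description rather than the patented implementation.

```python
# Toy knowledge graph: (head, relationship type, attributes, tail).
TRIPLES = [
    ("road:yellow-mountain", "cross", {"intersection_entity": "unit:crossing"}, "road:level"),
    ("unit:crossing", "adjacent", {"orientation": "east", "distance_m": 50}, "unit:suguo"),
    ("unit:suguo", "adjacent", {"orientation": "opposite", "distance_m": 20}, "unit:target"),
]
LABELS = {"unit:suguo": "supermarket"}


def intersection_of(triples, road_a, road_b):
    """Read the intersection entity uuid from the cross relationship between two roads."""
    for head, kind, attrs, tail in triples:
        if kind == "cross" and head == road_a and tail == road_b:
            return attrs["intersection_entity"]
    return None


def follow(triples, labels, head, orientation, distance_m=None, label=None, tolerance=0.30):
    """Follow one adjacent relationship from head, screening by distance and label."""
    tails = []
    for h, kind, attrs, t in triples:
        if h != head or kind != "adjacent" or attrs["orientation"] != orientation:
            continue
        if distance_m is not None and abs(attrs["distance_m"] - distance_m) > tolerance * attrs["distance_m"]:
            continue
        if label is not None and labels.get(t) != label:
            continue
        tails.append(t)
    return tails


def resolve(triples, labels, road_a, road_b, hops):
    """Infer hop by hop; return None as soon as an intermediate entity cannot be confirmed."""
    start = intersection_of(triples, road_a, road_b)
    frontier = [start] if start else []
    for hop in hops:
        frontier = [t for h in frontier for t in follow(triples, labels, h, **hop)]
        if not frontier:
            return None  # conversion fails
    return frontier


hops = [{"orientation": "east", "distance_m": 50, "label": "supermarket"},
        {"orientation": "opposite"}]
print(resolve(TRIPLES, LABELS, "road:yellow-mountain", "road:level", hops))
```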

In a further embodiment, when the matched entity is not unique, the orientation relationship, attributes and other information must be screened jointly. For example, for "the Suguo supermarket opposite the building", all address entities in the city whose names contain "building" are searched, and those whose "opposite" relationship has a tail entity named "Suguo supermarket" are retained; the head and tail entities satisfying the conditions can then be determined uniquely, and the standard address of the tail entity is extracted.
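Joint screening over several candidate head entities can be sketched as a single filter over relationships. The entity records, identifiers, addresses and the function `joint_screen` are hypothetical, and substring matching on names stands in for the semantic similarity matching described in the text.

```python
# Hypothetical entities and "opposite" relationships.
ENTITIES = {
    "u1": {"name": "Tower A Building", "address": "1 First Road"},
    "u2": {"name": "Tower B Building", "address": "9 Second Road"},
    "u3": {"name": "Suguo supermarket", "address": "2 First Road"},
    "u4": {"name": "Bank", "address": "10 Second Road"},
}
RELATIONS = [("u1", "opposite", "u3"), ("u2", "opposite", "u4")]


def joint_screen(entities, relations, head_keyword, orientation, tail_keyword):
    """Keep (head, tail) pairs where head name, orientation and tail name all fit."""
    return [(h, t) for h, o, t in relations
            if o == orientation
            and head_keyword in entities[h]["name"]
            and tail_keyword in entities[t]["name"]]


pairs = joint_screen(ENTITIES, RELATIONS, "Building", "opposite", "Suguo supermarket")
if len(pairs) == 1:                      # head and tail are uniquely determined
    head, tail = pairs[0]
    print(ENTITIES[tail]["address"])     # standard address of the tail entity
```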

In a further embodiment, a system implementing the method for converting a non-standard address into a standard address based on knowledge base reasoning comprises the following modules:

the hierarchical distribution module is used for setting the ontology of the address knowledge base; the hierarchical distribution module comprises a knowledge graph ontology, uuids of entities, entity attributes and relationships among the entities, wherein the knowledge graph ontology comprises six levels, namely province, city, district/county, street/town, road section and address unit; the entities are the standard addresses of the different levels and are distinguished by globally unique identifiers; the uuid of an entity consists of three parts, namely its knowledge graph ontology, its name and its number in the knowledge base; the number is an administrative division number or an address number; the entity attributes comprise name, type, label, center-point longitude and latitude, boundary longitude and latitude sequence and remarks, and the label is the social attribute of the address entity;

the standard address construction module is used for constructing a standard address knowledge base; the standard address construction module is further:

for each piece of unstructured text data, the address elements in the text are first extracted by named entity recognition; the word vectors of the extracted address elements are then compared, by the cosine similarity algorithm, with the address word vectors of the entities in the knowledge base, and the address elements are mapped to an entity A in the knowledge base;

the vector comparison module is used for comparing by a cosine similarity algorithm; the vector comparison module further compares the word vectors by the cosine similarity algorithm. The word vector obtained by segmenting the non-standard address character string is denoted as vector $\vec{a}$ and the word vector of the standard address as vector $\vec{b}$. Because their bases differ, the two vectors lie in different vector spaces and must be converted into the same vector space: the union of the bases of $\vec{a}$ and $\vec{b}$ is taken to form a merged basis, and the cosine similarity is computed in the merged space:

$$\cos\theta = \frac{\vec{a}\cdot\vec{b}}{\lVert\vec{a}\rVert\,\lVert\vec{b}\rVert}$$

where $\vec{a}$ and $\vec{b}$ both denote vectors;

let vector $\vec{a} = (1,1,0,0)$ and vector $\vec{b} = (0,1,1,1)$; substituting them into the cosine similarity algorithm gives:

$$\cos\theta = \frac{1}{\sqrt{2}\,\sqrt{3}} = \frac{1}{\sqrt{6}} \approx 0.408$$

by the above method, the standard addresses with the highest cosine similarity are extracted for each non-standard address to form the candidate set of standard addresses for the queried non-standard address, and entity B is further obtained from the recorded manually verified standard address;

the address information screening module is used for extracting the address information of the original text; the address information screening module extracts the address information of the original text in the following steps:

step 41, identifying address and direction description information;

In summary, the present invention has the following advantages: the standard addresses and their mutual relationship attributes are built into a knowledge base in the form of (head entity, directed relationship, tail entity) triples, and the knowledge base is stored as a knowledge graph; by extracting the standard address contained in an unstructured address together with the related orientation and attribute elements, the head entity and the directed relationship of the triple are determined, which fixes the knowledge graph query condition and yields the tail entity address inferred from the standard address. Non-standardized geographic position information described orally by a user can thus be processed and converted into standard address information that a machine can handle.

It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. To avoid unnecessary repetition, the possible combinations are not described separately.

Claims

1. A method for converting a non-standard address into a standard address based on knowledge base reasoning is characterized by comprising the following steps:

step 1: setting an ontology of an address knowledge base;

step 2: constructing a standard address knowledge base;

step 3: comparing by a cosine similarity algorithm;

step 4: extracting the address information of the original text.

2. The method for converting a non-standard address into a standard address based on knowledge base reasoning according to claim 1, wherein the ontology of the address knowledge base in step 1 comprises a knowledge graph ontology, uuids of entities, entity attributes and relationships among the entities, wherein the knowledge graph ontology comprises six levels, namely province, city, district/county, street/town, road section and address unit, and the entities are the standard addresses of the different levels and are distinguished by globally unique identifiers; the uuid of an entity consists of three parts, namely its knowledge graph ontology, its name and its number in the knowledge base; the number is an administrative division number or an address number; the entity attributes comprise name, type, label, center-point longitude and latitude, boundary longitude and latitude sequence and remarks, and the label is the social attribute of the address entity.

3. The method for converting non-standard address into standard address based on knowledge base inference as claimed in claim 1, wherein said step 3 is further:

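the word vector obtained by segmenting the non-standard address character string is denoted as vector $\vec{a}$ and the word vector of the standard address as vector $\vec{b}$; the union of the bases of $\vec{a}$ and $\vec{b}$ is taken to form a merged basis, and the cosine similarity is

$$\cos\theta = \frac{\vec{a}\cdot\vec{b}}{\lVert\vec{a}\rVert\,\lVert\vec{b}\rVert}$$

where $\vec{a}$ and $\vec{b}$ both denote vectors; let vector $\vec{a} = (1,1,0,0)$ and vector $\vec{b} = (0,1,1,1)$.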

4. The method for converting a non-standard address into a standard address based on knowledge base reasoning according to claim 3, wherein entity B is obtained from the recorded manually verified standard address, and it is determined that a relationship exists between entity A and entity B as mapped into the knowledge base; the additional relation orientation description in the text address is extracted by combining regular expressions with a part-of-speech tagging algorithm and mapped to the corresponding relationship type and attributes in the knowledge base ontology, and the corresponding relationship from entity A to entity B is then established;

the following is further derived from the relationship between the two:

where $A$ and $B$ denote independent entity vector events.

5. The method for converting non-standard address into standard address based on knowledge base inference as claimed in claim 1, wherein said step 4 is further:

the address information of the original text is extracted in the following steps:

step 41, identifying address and direction description information;

6. The method for converting non-standard address into standard address based on knowledge base inference as claimed in claim 5, wherein the relationship conforming to the orientation description is selected from all the relationships of the matching address entities, and the following steps are obtained:

7. A system for converting a non-standard address into a standard address based on knowledge base reasoning is characterized by comprising the following modules:

the hierarchical distribution module is used for setting an address knowledge base body;

the standard address construction module is used for constructing a standard address knowledge base;

the vector comparison module is used for comparing by a cosine similarity algorithm;

the address information screening module is used for extracting the address information of the original text;

the hierarchical distribution module comprises a knowledge graph ontology, uuids of entities, entity attributes and relationships among the entities, wherein the knowledge graph ontology comprises six levels, namely province, city, district/county, street/town, road section and address unit; the entities are the standard addresses of the different levels and are distinguished by globally unique identifiers; the uuid of an entity consists of three parts, namely its knowledge graph ontology, its name and its number in the knowledge base; the number is an administrative division number or an address number; the entity attributes comprise name, type, label, center-point longitude and latitude, boundary longitude and latitude sequence and remarks, and the label is the social attribute of the address entity;

the standard address construction module is further:

8. The system of claim 7, wherein the vector matching module further performs matching by cosine similarity algorithm, and the word vector after segmentation of the non-standard address character string is recorded as a vector