CN116680356A - Address data processing method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN116680356A
Authority
CN
China
Prior art keywords
word
words
address data
administrative division
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310701692.1A
Other languages
Chinese (zh)
Inventor
李金坤
刘桐宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Cennavi Technologies Co Ltd
Original Assignee
Beijing Cennavi Technologies Co Ltd
Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application provides an address data processing method and device, an electronic device and a storage medium, relates to the technical field of data processing, and aims to convert address data into more accurate and complete high-value data. The method comprises the following steps: acquiring source address data, and performing word segmentation on the source address data based on a preset noise word stock, a preset administrative division word stock, a preset azimuth word stock and a pre-trained semantic model to obtain a word segmentation result; determining a target interest point from a first geographic information database based on the source address data and the word segmentation result, and acquiring a standard administrative division word corresponding to the target interest point; and determining the similarity between the standard administrative division words and the administrative division words in the word segmentation result, and, in the case that the similarity is greater than or equal to a preset threshold value, obtaining target address data based on the standard administrative division words and the noise words, azimuth words and address guide words in the word segmentation result.

Description

Address data processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and apparatus for processing address data, an electronic device, and a storage medium.
Background
With the development of navigation technology and search engines, a large amount of address data is generated in a network, and the address data is complex and expressed differently, which makes management and application of the address data very difficult. Therefore, it is important to clean address data.
In the prior art, address data is typically cleaned by processes such as keyword query and paraphrase substitution, and the resulting cleaned data is not ideal. For example, the cleaned address data may still be incomplete and inaccurate, so its practical value is low.
Disclosure of Invention
The application provides an address data processing method, an address data processing device, an electronic device and a storage medium, which aim to convert address data into more accurate and complete high-value data.
In order to achieve the above purpose, the application adopts the following technical scheme:
In a first aspect, there is provided an address data processing method, the method comprising: acquiring source address data, and performing word segmentation on the source address data based on a preset noise word stock, a preset administrative division word stock, a preset azimuth word stock and a pre-trained semantic model to obtain a word segmentation result; the semantic model is used for identifying address guide words; in the case that noise words, administrative division words, azimuth words and address guide words exist in the word segmentation result, determining a target interest point from a first geographic information database based on the source address data and the word segmentation result, and acquiring a standard administrative division word corresponding to the target interest point; and determining the similarity between the standard administrative division words and the administrative division words in the word segmentation result, and, in the case that the similarity is greater than or equal to a preset threshold value, obtaining target address data based on the standard administrative division words and the noise words, azimuth words and address guide words in the word segmentation result.
Optionally, the preset administrative division word stock includes a plurality of preset administrative division words; the word segmentation result comprises a first sub-result, wherein the first sub-result is obtained by performing word segmentation on the source address data based on the preset administrative division word stock; performing word segmentation on the source address data based on the preset administrative division word stock to obtain the first sub-result comprises the following steps: dividing each preset administrative division word to obtain a first administrative division word and a second administrative division word of each preset administrative division word; the division level of the first administrative division word is greater than or equal to a preset level, and the division level of the second administrative division word is less than the preset level; constructing a prefix search tree by taking the first administrative division word as a head node and the second administrative division word as a child node; and performing identification in the source address data based on the prefix search tree to obtain the first sub-result.
Optionally, in the case that the noise word, the administrative division word, the azimuth word and the address guide word exist in the word segmentation result, determining the target interest point from the first geographic information database based on the source address data and the word segmentation result includes: under the condition that noise words, administrative division words, azimuth words and address guide words exist in the word segmentation result, performing first search in a first geographic information database based on source address data to obtain a first search result; performing second search in the first geographic information database based on the word segmentation result to obtain a second search result; under the condition that the first search result comprises at least one first interest point and the matching degree corresponding to each first interest point and/or the second search result comprises at least one second interest point and the matching degree corresponding to each second interest point, obtaining a plurality of interest points according to the first search result and the second search result, and selecting the interest point with the highest matching degree from the plurality of interest points as a target interest point.
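The selection step above can be sketched as follows. This is only a minimal illustration: the `Poi` type, the `matching_degree` field and the sample scores are hypothetical, and the searches against the first geographic information database are assumed to have already run.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Poi:
    """Hypothetical interest point record returned by a search."""
    name: str
    matching_degree: float  # assumed score; higher means a better match


def select_target_poi(first_result: List[Poi], second_result: List[Poi]) -> Optional[Poi]:
    """Merge both search results and return the POI with the highest matching degree."""
    candidates = first_result + second_result
    if not candidates:
        # Both results are empty: the caller would fall back to the second database.
        return None
    return max(candidates, key=lambda poi: poi.matching_degree)


# First search uses the raw source address, second search uses the segmentation result.
first = [Poi("Park A (north gate)", 0.72)]
second = [Poi("Park A", 0.91), Poi("Park A parking lot", 0.40)]
target = select_target_poi(first, second)
```

Returning `None` for two empty results keeps the fallback to the second geographic information database as a separate decision for the caller.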
Optionally, obtaining a standard administrative division word corresponding to the target interest point includes: and acquiring the position information of the target interest point, and retrieving and obtaining the standard administrative division words corresponding to the target interest point according to the inverse geographic service.
Optionally, the method further comprises: under the condition that the first search result and the second search result are null values, carrying out index matching in a second geographic information database based on the source address data to obtain a matching result; the data volume of the second geographic information database is larger than that of the first geographic information database; the matching result comprises administrative division words, azimuth words and address guide words corresponding to the source address data.
Optionally, the method further comprises: under the condition that noise words, administrative division words, azimuth words and address guide words do not exist in the word segmentation result, index matching is carried out in a second geographic information database based on source address data, and a matching result is obtained; the data volume of the second geographic information database is larger than that of the first geographic information database; the matching result comprises administrative division words, azimuth words and address guide words corresponding to the source address data.
In a second aspect, an address data processing apparatus is provided, the apparatus including an acquisition unit, a processing unit, and a determination unit; the acquisition unit is used for acquiring source address data; the processing unit is used for performing word segmentation on the source address data based on a preset noise word stock, a preset administrative division word stock, a preset azimuth word stock and a pre-trained semantic model to obtain a word segmentation result; the semantic model is used for identifying address guide words; the determining unit is used for, in the case that a noise word, an administrative division word, an azimuth word and an address guide word exist in the word segmentation result, determining a target interest point from the first geographic information database based on the source address data and the word segmentation result and acquiring a standard administrative division word corresponding to the target interest point; the determining unit is further used for determining the similarity between the standard administrative division words and the administrative division words in the word segmentation result, and, in the case that the similarity is greater than or equal to a preset threshold value, obtaining target address data based on the standard administrative division words and the noise words, azimuth words and address guide words in the word segmentation result.
Optionally, the preset administrative division word stock includes a plurality of preset administrative division words; the word segmentation result comprises a first sub-result, wherein the first sub-result is obtained by performing word segmentation on the source address data based on the preset administrative division word stock; the processing unit is specifically used for: dividing each preset administrative division word to obtain the highest administrative division word and the non-highest administrative division words of each preset administrative division word; constructing a prefix search tree by taking the highest administrative division word as a head node and the non-highest administrative division words as child nodes; and indexing in the source address data based on the prefix search tree to obtain the first sub-result.
Optionally, the determining unit is specifically configured to: in the case that noise words, administrative division words, azimuth words and address guide words exist in the word segmentation result, perform a first search in the first geographic information database based on the source address data to obtain a first search result, and perform a second search in the first geographic information database based on the word segmentation result to obtain a second search result; in the case that the first search result comprises at least one first interest point and the matching degree corresponding to each first interest point and/or the second search result comprises at least one second interest point and the matching degree corresponding to each second interest point, obtain a plurality of interest points from the first search result and the second search result, and select the interest point with the highest matching degree from the plurality of interest points as the target interest point.
Optionally, the determining unit is specifically configured to: and acquiring the position information of the target interest point, and retrieving and obtaining the standard administrative division words corresponding to the target interest point according to the inverse geographic service.
Optionally, the determining unit is further configured to: under the condition that the first search result and the second search result are null values, carrying out index matching in a second geographic information database based on the source address data to obtain a matching result; the data volume of the second geographic information database is larger than that of the first geographic information database; the matching result comprises administrative division words, azimuth words and address guide words corresponding to the source address data.
Optionally, the determining unit is further configured to: under the condition that noise words, administrative division words, azimuth words and address guide words do not exist in the word segmentation result, index matching is carried out in a second geographic information database based on source address data, and a matching result is obtained; the data volume of the second geographic information database is larger than that of the first geographic information database; the matching result comprises administrative division words, azimuth words and address guide words corresponding to the source address data.
In a third aspect, there is provided an electronic device comprising: a processor, a memory for storing instructions executable by the processor; wherein the processor is configured to execute instructions to implement the address data processing method of the first aspect described above.
In a fourth aspect, there is provided a computer readable storage medium having instructions stored thereon, which when executed by a processor of an electronic device, enable the electronic device to perform the address data processing method of the first aspect as described above.
The technical scheme provided by the application has at least the following beneficial effects: the address processing device acquires source address data, and performs word segmentation on the source address data based on a preset noise word stock, a preset administrative division word stock, a preset azimuth word stock and a pre-trained semantic model to obtain a word segmentation result; the semantic model is used to identify address guide words. Because the application refers to the noise word stock, the administrative division word stock, the azimuth word stock and the semantic model during word segmentation, the segmentation of the source address data is more targeted: it aims to obtain the noise words, administrative division words, azimuth words and address guide words corresponding to the source address data. In the case that noise words, administrative division words, azimuth words and address guide words exist in the word segmentation result, the address processing device determines a target interest point from the first geographic information database based on the source address data and the word segmentation result, and acquires the standard administrative division words corresponding to the target interest point, which lays the groundwork for standardizing the address data. Further, the address processing device determines the similarity between the standard administrative division words and the administrative division words in the word segmentation result, and, in the case that the similarity is greater than or equal to a preset threshold value, obtains target address data based on the standard administrative division words and the noise words, azimuth words and address guide words in the word segmentation result.
In this way, the target address data obtained through this processing covers standard administrative division words, noise words, azimuth words and address guide words, so its information is more complete and, compared with the source address data, more standardized. Because the noise words in the target address data are explicitly identified, they can be excluded in subsequent use, which eliminates their interference and improves the accuracy of the target address data.
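The final threshold check and assembly of the target address data might look like the sketch below. The text does not specify a similarity measure, so `difflib.SequenceMatcher` is used purely as a stand-in; the threshold value of 0.8, the dictionary keys and the function names are likewise assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Union

Segmentation = Dict[str, Union[str, List[str]]]


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the actual measure is not specified in the text."""
    return SequenceMatcher(None, a, b).ratio()


def build_target_address(standard_division: str,
                         segmentation: Segmentation,
                         threshold: float = 0.8) -> Optional[Segmentation]:
    """Return target address data only if the segmented division is close to the standard one."""
    score = similarity(standard_division, str(segmentation["division"]))
    if score < threshold:
        return None  # similarity below the preset threshold: no target address data
    return {
        "division": standard_division,       # standard word replaces the segmented one
        "noise": segmentation["noise"],      # kept so later consumers can exclude it
        "azimuth": segmentation["azimuth"],
        "guide": segmentation["guide"],
    }


segmented = {
    "division": "A province B cty",          # contains a typo
    "noise": ["I want to go to"],
    "azimuth": ["north"],
    "guide": ["walk 100 meters"],
}
target = build_target_address("A province B city", segmented)
rejected = build_target_address("A province B city", {**segmented, "division": "Z county"})
```

Keeping the noise words in the output, rather than discarding them, mirrors the stated benefit that downstream users can identify and exclude them explicitly.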
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a data processing system according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for processing address data according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a prefix search tree according to an embodiment of the present application;
FIG. 4 is a second flowchart of an address data processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an address data processing apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that, in the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
It should be noted that, in the embodiments of the present application, the terms "of", "relevant" and "corresponding" may sometimes be used interchangeably; where the distinction is not emphasized, the intended meaning is the same.
In order to clearly describe the technical solution of the embodiments of the present application, in the embodiments of the present application, the terms "first", "second", etc. are used to distinguish the same item or similar items having substantially the same function and effect, and those skilled in the art will understand that the terms "first", "second", etc. are not limited in number and execution order.
Before explaining the embodiments of the present application in detail, some related arts related to the embodiments of the present application will be described.
With the development of navigation technology and search engines, a large amount of address data is generated in a network, and the address data is complex and expressed differently, which makes management and application of the address data very difficult. Therefore, it is important to clean address data.
Data cleaning refers to the final procedure for finding and correcting identifiable errors in a data file, including checking data consistency and handling invalid and missing values. Unlike questionnaire review, the cleaning of entered data is typically done by a computer rather than manually.
In practical application, address data which are complex and expressed differently can be converted into data meeting the data quality requirement through data cleaning.
When cleaning address data, the related technology generally performs only keyword query and paraphrase substitution to obtain the cleaned address data. This cleans noise words (interference words) poorly, and the cleaned address data may still be incomplete and inaccurate, so its practical value is low.
In view of this, the embodiment of the application provides an address data processing method, which aims to convert address data that suffers from numerous errors, inconsistent names, incomplete information and similar problems into more accurate and complete high-value data through data processing.
The method provided by the embodiment of the application is described in detail below with reference to the accompanying drawings.
FIG. 1 illustrates an exemplary application scenario diagram provided by an embodiment of the present application. As shown in FIG. 1, the address data processing method provided in the embodiment of the present application may be applied to a data processing system 10. The data processing system 10 includes an address data processing device (hereinafter referred to as a data processing device) 11 and an electronic device 12. The data processing device 11 is connected to the electronic device 12. The connection may be wired or wireless, which is not limited in the embodiment of the present disclosure.
The electronic device 12 is used to store source address data. For example, the electronic device 12 has a database disposed therein, and the electronic device 12 stores source address data in the database.
The data processing device 11 is configured to obtain source address data, and perform word segmentation on the source address data based on a preset noise word stock, a preset administrative division word stock, a preset azimuth word stock and a pre-trained semantic model, so as to obtain a word segmentation result. The data processing device 11 is further configured to determine, in the case where noise words, administrative division words, azimuth words and address guide words exist in the word segmentation result, a target interest point from the first geographic information database based on the source address data and the word segmentation result, and acquire a standard administrative division word corresponding to the target interest point. The data processing device 11 is further configured to determine the similarity between the standard administrative division words and the administrative division words in the word segmentation result, and, if the similarity is greater than or equal to a preset threshold, obtain the target address data based on the standard administrative division words and the noise words, azimuth words and address guide words in the word segmentation result.
The electronic device 12 is also used to store the target address data in the database.
In different application scenarios, the data processing apparatus 11 and the electronic device 12 may be independent devices, or may be integrated in the same device, which is not specifically limited in the embodiments of the present disclosure.
When the data processing apparatus 11 and the electronic device 12 are integrated in the same device, data is transmitted between them as data transmission between internal modules of that device. In this case, the data transfer flow between them is the same as in the case where the data processing apparatus 11 and the electronic device 12 are independent of each other.
In the following embodiments provided in the embodiments of the present disclosure, description will be given taking an example in which the data processing apparatus 11 and the electronic device 12 are provided independently of each other.
FIG. 2 is a flow diagram illustrating a method of address data processing according to some example embodiments. In some embodiments, the address data processing method described above may be applied to a data processing apparatus, an electronic device, and the like as shown in fig. 1, and may also be applied to other similar devices.
As shown in fig. 2, the address data processing method provided in the embodiment of the present disclosure includes the following S201 to S203.
S201, the data processing device acquires source address data, and performs word segmentation on the source address data based on a preset noise word stock, a preset administrative division word stock, a preset azimuth word stock and a pre-trained semantic model to obtain a word segmentation result.
The semantic model is used to identify address guide words.
As one possible implementation manner, the data processing device obtains source address data from the electronic device, and performs word segmentation processing on the source address data based on a preset noise word stock, a preset administrative division word stock, a preset azimuth word stock and a pre-trained semantic model to obtain a word segmentation result.
It should be noted that the source address data is data that has not been processed or cleaned. It contains address information, but that information may be inaccurate or nonstandard. For example, the source address data may contain wrongly written words.
In some embodiments, when the source address data is subjected to word segmentation, the data processing device may refer to the noise word in the preset noise word library, and determine whether the same or similar noise word exists in the source address data.
The noise word is also referred to as interference data, and refers to a word in the source address data that is not related to the address information. For example, for source address data "I want to go to park A," where "I want to go to" is independent of the specific address, it may be considered a noise word.
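As a rough illustration of the lexicon lookup described above, the sketch below separates noise words from the rest of the source address. It handles exact matches only (the "similar" case is not covered), and the function name and the sample lexicon are hypothetical.

```python
from typing import List, Set, Tuple


def split_noise(source: str, noise_lexicon: Set[str]) -> Tuple[List[str], str]:
    """Return (noise words found, remaining text), trying longer phrases first."""
    found: List[str] = []
    remaining = source
    # Longest-first so that "I want to go to" wins over a shorter overlapping phrase.
    for phrase in sorted(noise_lexicon, key=lambda p: (-len(p), p)):
        if phrase in remaining:
            found.append(phrase)
            remaining = remaining.replace(phrase, " ")
    # Collapse the whitespace left behind by removed phrases.
    return found, " ".join(remaining.split())


noise_words, address_part = split_noise(
    "I want to go to park A",
    {"I want to go to", "go to", "please"},
)
```

Here the noise words are returned alongside the remaining address text rather than discarded, since the method keeps them as part of the word segmentation result.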
In some embodiments, to obtain the noise word stock, the data processing device may collect search address data from the network through event-tracking (buried point) analysis, and the search address data is manually labeled to mark the noise words. Further, the data processing device takes the labeled data as samples, takes a large amount of noise-free data in a standard electronic map as a reference, and trains a conditional random field (CRF) model to obtain a noise word extraction model. Further, the data processing device uses the noise word extraction model to extract noise words from a large amount of unlabeled search address data, thereby obtaining the noise word stock.
Similarly, the data processing device can refer to the azimuth words in the preset azimuth word library to judge whether the same or similar azimuth words exist in the source address data.
The azimuth word stock may be constructed in advance by operation and maintenance personnel, or may be extracted by the data processing device through model training; the embodiment of the application does not limit this.
In some embodiments, when the source address data is subjected to word segmentation, the data processing device may input the source address data into a pre-trained semantic model to extract address guide words in the source address data.
For example, in the source address data "walk 100 meters north to park A", "north" is an azimuth word and "walk 100 meters" is an address guide word.
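The example above can be reproduced with the sketch below. The azimuth lookup follows the lexicon idea directly, but the pre-trained semantic model is replaced by a simple regular expression, which is only a stand-in and not the model described here; all names and the lexicon contents are hypothetical.

```python
import re
from typing import Callable, Dict, List, Set


def regex_guide_model(text: str) -> List[str]:
    """Stand-in for the pre-trained semantic model (an assumption for illustration)."""
    return re.findall(r"\b(?:walk|drive|go) \d+ (?:meters|kilometers)\b", text)


def tag_azimuth_and_guide(source: str,
                          azimuth_lexicon: Set[str],
                          guide_model: Callable[[str], List[str]]) -> Dict[str, List[str]]:
    """Look up azimuth words in the lexicon and delegate guide words to the model."""
    tokens = re.findall(r"\w+", source)
    return {
        "azimuth": [t for t in tokens if t.lower() in azimuth_lexicon],
        "guide": guide_model(source),
    }


result = tag_azimuth_and_guide(
    "walk 100 meters north to park A",
    {"north", "south", "east", "west"},
    regex_guide_model,
)
```

Passing the model in as a callable keeps the lexicon lookup independent of whichever semantic model is actually used.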
In some embodiments, the pre-set administrative division word library comprises a plurality of pre-set administrative division words; the word segmentation result comprises a first sub-result, wherein the first sub-result is obtained by carrying out word segmentation processing on the source address data based on a preset administrative division word stock.
In some embodiments, performing word segmentation on the source address data based on the preset administrative division word library to obtain the first sub-result includes the following. The data processing apparatus divides each preset administrative division word to obtain a first administrative division word and a second administrative division word of each preset administrative division word. The data processing apparatus constructs a prefix search tree with the first administrative division word as a head node and the second administrative division word as a child node. Further, the data processing apparatus indexes into the source address data based on the prefix search tree to obtain the first sub-result.
It should be noted that an administrative division is an area that the relevant departments divide level by level for the convenience of administrative management. For example, the administrative division may be a five-level administrative division (provincial level + municipal level + county level + township level + community level). The division level of the first administrative division word is greater than or equal to a preset level, and the division level of the second administrative division word is less than the preset level. The preset level may be set in advance by operation and maintenance personnel. In practical applications, the first administrative division word may be the administrative division word with the highest division level, and the second administrative division words are the remaining administrative division words below the highest division level.
For example, for administrative division 1 "A province, B city, C county, D county" and administrative division 2 "A province, E city, F community", when constructing the prefix search tree the data processing apparatus takes "A province" as the highest-level administrative division word and takes "B city", "C county", "D county", "E city" and "F community" as non-highest-level administrative division words.
In other embodiments, the data processing apparatus may further perform word segmentation on the address data according to the keyword, to obtain the first sub-result. Each keyword corresponds to a different grade, and the data processing device can use the grade of the keyword matched with the preset administrative division word as the grade of the preset administrative division word.
For example, the keywords include "province", "city" and "district", and the corresponding levels are level 1, level 2 and level 3, respectively. For the source address data "×× province ×× city ×× district", the keyword "province" exists in "×× province", so "×× province" can be used as an administrative division word whose level is level 1; the keyword "city" exists in "×× city", so "×× city" can be used as an administrative division word whose level is level 2; and the keyword "district" exists in "×× district", so "×× district" can be used as an administrative division word whose level is level 3. Fig. 3 shows an exemplary structure of a prefix search tree, in which "A province" is the head node and "B city", "C county", "D county", "E city" and "F community" are child nodes connected below the head node.
In practical applications, the data processing apparatus may index into the source address data according to the prefix search tree to search out possible administrative division words in the source address data.
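The construction and lookup of the prefix search tree can be illustrated with a minimal Python sketch. The names `DivisionNode`, `build_tree` and `find_division_words` are hypothetical and do not appear in the application; the sketch also assumes that child nodes are only checked once the head node has been found in the source address data, and that matching is plain substring matching.

```python
from dataclasses import dataclass, field


@dataclass
class DivisionNode:
    """One tree node: an administrative division word and its child nodes."""
    word: str
    children: dict = field(default_factory=dict)


def build_tree(divisions):
    """Build one head node per highest-level word; lower-level words become children.

    `divisions` is a list of word lists ordered from highest to lowest level,
    e.g. ["A province", "B city", "C county"] (hypothetical input format).
    """
    heads = {}
    for words in divisions:
        head_word, lower_words = words[0], words[1:]
        head = heads.setdefault(head_word, DivisionNode(head_word))
        for word in lower_words:
            head.children.setdefault(word, DivisionNode(word))
    return heads


def find_division_words(heads, source):
    """Return the division words of the tree that occur in `source`, in text order."""
    found = []
    for head in heads.values():
        if head.word not in source:
            continue  # assumption: children are searched only under a matched head
        found.append(head.word)
        found.extend(word for word in head.children if word in source)
    return sorted(found, key=source.index)


divisions = [
    ["A province", "B city", "C county", "D county"],
    ["A province", "E city", "F community"],
]
heads = build_tree(divisions)
print(find_division_words(heads, "go to A province B city C county park"))
```

With the two administrative divisions of the example above, the sketch builds a single head node "A province" with five child nodes, which corresponds to the structure described for fig. 3.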
S202, under the condition that noise words, administrative division words, azimuth words and address guide words exist in word segmentation results, the data processing device determines target interest points from the first geographic information database based on source address data and word segmentation results, and obtains standard administrative division words corresponding to the target interest points.
As a possible implementation manner, after the source address data is subjected to word segmentation, if noise words, administrative division words, azimuth words and address guide words exist in the obtained word segmentation result, the data processing device searches from the first geographic information database based on the source address data and the word segmentation result, and if a corresponding point of interest (POI) is searched, the data processing device determines a target point of interest from the search result and acquires a standard administrative division word corresponding to the target point of interest.
As another possible implementation manner, in the case that the word segmentation result includes a noise word, an administrative division word, an azimuth word and an address guide word, the data processing apparatus retrieves from the first geographic information database based on the source address data and the word segmentation result, and if a corresponding point of interest (POI) is retrieved, obtains a plurality of POIs. Further, the data processing device calculates the similarity between each POI and the source address data respectively, and the POI with the largest similarity is determined as the target interest point. It should be noted that the first geographic information database stores geographic information. For example, the first geographic information database may be a geographic information system (Geographic Information System or Geo-Information system, GIS or GEO) database.
In some embodiments, to obtain the target point of interest, the data processing apparatus may perform a first search in the first geographic information database based on the source address data to obtain a first search result, and further perform a second search in the first geographic information database based on the word segmentation result to obtain a second search result.
In practical application, the data processing device may call the GEO service to retrieve the point of interest.
For example, when the data processing device calls the GEO service with the source address data as the input content, if points of interest corresponding to the input content exist in the first geographic information database, the GEO service outputs the retrieved first points of interest and the matching degree corresponding to each first point of interest. Similarly, when the data processing device calls the GEO service with the word segmentation result as the input content, if points of interest corresponding to the input content exist in the first geographic information database, the GEO service outputs the retrieved second points of interest and the matching degree corresponding to each second point of interest. Further, the data processing device combines the first search result and the second search result to obtain a plurality of points of interest, and selects the point of interest with the highest matching degree from the plurality of points of interest as the target point of interest.
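The selection of the target point of interest from the two search results can be sketched as follows. This is an illustration only: the `PoiHit` record and its fields, and the function `select_target_poi`, are hypothetical, and the actual output format of the GEO service is not specified in the application.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PoiHit:
    """A retrieved point of interest with its matching degree (assumed fields)."""
    name: str
    lon: float
    lat: float
    match_degree: float


def select_target_poi(first_result: Iterable[PoiHit],
                      second_result: Iterable[PoiHit]) -> Optional[PoiHit]:
    """Combine both search results and return the hit with the highest matching degree.

    Returns None when both results are empty, which is the case handled by the
    fallback to the second geographic information database.
    """
    candidates = list(first_result) + list(second_result)
    if not candidates:
        return None
    return max(candidates, key=lambda hit: hit.match_degree)


first = [PoiHit("Park A", 116.40, 39.90, 0.82)]
second = [PoiHit("Park A (north gate)", 116.41, 39.91, 0.91),
          PoiHit("Park B", 116.50, 39.80, 0.40)]
target = select_target_poi(first, second)
print(target.name)
```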
In some embodiments, the data processing apparatus may acquire the standard administrative division word corresponding to the target point of interest by using a reverse geocoding technique.
Illustratively, the data processing apparatus obtains location information (e.g., longitude and latitude) of the target point of interest. Further, the data processing apparatus queries a reverse geocoding service with the longitude and latitude to retrieve the corresponding standard administrative division word.
It can be understood that, once specific location information is determined, the standard administrative division word is retrieved according to that location information. Compared with the administrative division word obtained by word segmentation, the standard administrative division word is more accurate and corresponds uniquely to the location information.
S203, the data processing device determines the similarity between the standard administrative division word and the administrative division word in the word segmentation result, and under the condition that the similarity is greater than or equal to a preset threshold value, the data processing device obtains target address data based on the standard administrative division word and the noise word, azimuth word and address guide word in the word segmentation result.
As a possible implementation manner, the data processing device calculates the similarity between the standard administrative division words and the administrative division words in the word segmentation result through a similarity formula. Further, the data processing device compares the calculated similarity with a preset threshold value, and under the condition that the similarity is larger than or equal to the preset threshold value, the data processing device obtains target address data based on the standard administrative division words, noise words, azimuth words and address guide words in the word segmentation result.
The target address data format may be, for example, standard administrative division words + azimuth words + address guide words + noise words.
In some embodiments, the data processing apparatus obtains the target address data based on the administrative division word, the noise word in the word segmentation result, the azimuth word, and the address guide word when the similarity is smaller than a preset threshold.
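The threshold decision in S203 can be sketched as follows. The application does not specify the similarity formula, so the sketch substitutes the ratio computed by Python's standard `difflib.SequenceMatcher`; the threshold value 0.8, the dictionary keys and the function names are likewise assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not the formula used in the application."""
    return SequenceMatcher(None, a, b).ratio()


def build_target_address(standard_division: str, segmented: dict,
                         threshold: float = 0.8) -> dict:
    """Assemble target address data from the word segmentation result.

    `segmented` is assumed to hold the keys "division", "azimuth", "guide"
    and "noise". The standard administrative division word replaces the
    segmented one only when the similarity reaches the preset threshold.
    """
    sim = similarity(standard_division, segmented["division"])
    division = standard_division if sim >= threshold else segmented["division"]
    return {
        "division": division,
        "azimuth": segmented["azimuth"],
        "guide": segmented["guide"],
        "noise": segmented["noise"],
    }


segmented = {
    "division": "A province B cty",  # segmented word with a typo
    "azimuth": "north",
    "guide": "walk 100 meters",
    "noise": "I want to go to",
}
result = build_target_address("A province B city", segmented)
print(result["division"])
```

Keeping the noise word as a separate field, rather than discarding it, matches the target address data format given above and lets later consumers decide whether to drop it.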
The technical scheme provided by the embodiments of the application has at least the following beneficial effects. The address processing device acquires source address data, and performs word segmentation on the source address data based on a preset noise word library, a preset administrative division word library, a preset azimuth word library and a pre-trained semantic model to obtain a word segmentation result; the semantic model is used to identify address guide words. Because the noise word library, the administrative division word library, the azimuth word library and the semantic model are referenced during word segmentation, the segmentation of the source address data is more targeted, and aims to obtain the noise words, administrative division words, azimuth words and address guide words corresponding to the source address data. Under the condition that noise words, administrative division words, azimuth words and address guide words exist in the word segmentation result, the address processing device determines a target interest point from the first geographic information database based on the source address data and the word segmentation result, and acquires the standard administrative division word corresponding to the target interest point, which lays the groundwork for standardizing the address data. Further, the address processing device determines the similarity between the standard administrative division word and the administrative division word in the word segmentation result, and under the condition that the similarity is greater than or equal to a preset threshold value, the address processing device obtains target address data based on the standard administrative division word and the noise word, azimuth word and address guide word in the word segmentation result.
In this way, the target address data obtained through the above processing covers the standard administrative division words, noise words, azimuth words and address guide words, so its information is more complete and, compared with the source address data, more standardized. In subsequent use, the noise words in the target address data are explicitly identified and can be removed, which eliminates their interference and improves the accuracy of the target address data.
In some embodiments, in order to ensure that the source address data can be cleaned, in the case that no noise word, no administrative division word, no azimuth word, and no address guide word exist in the word segmentation result, the address processing device performs index matching in the second geographic information database based on the source address data to obtain a matching result; wherein the data volume of the second geographic information database is larger than that of the first geographic information database; the matching result comprises administrative division words, azimuth words and address guide words corresponding to the source address data.
Similarly, under the condition that the first search result and the second search result are null values, the address processing device performs index matching in the second geographic information database based on the source address data to obtain a matching result.
In some embodiments, as shown in fig. 4, the address processing device performs word segmentation on the source address data to obtain a word segmentation result, if there are a noise word, an administrative division word, an azimuth word and an address guide word in the word segmentation result, the address processing device uses the word segmentation result to call the GEO service in the first geographic information database to obtain a first search result, and uses the source address data to call the GEO service to obtain a second search result. Otherwise, the address processing device uses the source address data to call the GEO service in the geographic information database to obtain a third retrieval result. If there is an interest point in the first search result and/or the second search result, the address processing device selects the interest point with the highest matching degree as the target interest point, and performs the subsequent processing flow (refer to S202-S203). If no interest point exists in the first search result and the second search result, the address processing device uses the source address data to call the full text search service inquiry in the second geographic information database to obtain a matching result. Similarly, if the third search result does not have the interest point, the address processing device uses the source address data to call the full-text search service query in the second geographic information database to obtain a matching result.
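The branching of the flow in fig. 4 can be summarized in a short Python sketch. The callables `geo_search` and `full_text_search` stand in for the GEO service on the first geographic information database and the full-text search service on the second geographic information database; their signatures, the tuple format of a hit and the dictionary keys are assumptions made for illustration only.

```python
REQUIRED_KINDS = ("noise", "division", "azimuth", "guide")


def resolve_address(source, segmented, geo_search, full_text_search):
    """Follow the described flow and return ("poi", hit) or ("match", result).

    geo_search(query) -> list of (poi_name, match_degree) tuples  (hypothetical)
    full_text_search(source) -> matching result                   (hypothetical)
    """
    if all(segmented.get(kind) for kind in REQUIRED_KINDS):
        # All four word types are present: search with both inputs.
        hits = list(geo_search(segmented)) + list(geo_search(source))
    else:
        # Otherwise search with the source address data only (third result).
        hits = list(geo_search(source))
    if hits:
        return "poi", max(hits, key=lambda hit: hit[1])
    # No point of interest found: fall back to the larger second database.
    return "match", full_text_search(source)


def fake_geo(query):
    """Stub GEO service used only to exercise the control flow."""
    if isinstance(query, dict):
        return [("Park A", 0.9)]
    return [("Park A gate", 0.7)] if "park" in query else []


def fake_full_text(source):
    """Stub full-text search service."""
    return {"division": "A province"}


seg = {"noise": "I want to go to", "division": "A province",
       "azimuth": "north", "guide": "walk 100 meters"}
print(resolve_address("park A", seg, fake_geo, fake_full_text))
```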
The foregoing embodiments mainly describe the solutions provided by the embodiments of the present application from the perspective of the apparatus (device). It will be appreciated that, in order to implement the above-mentioned method, the apparatus or device includes hardware structures and/or software modules corresponding to each of the method flows, and these hardware structures and/or software modules may constitute an address data processing apparatus. Those of skill in the art will readily appreciate that the various illustrative algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is implemented as hardware or as computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application can divide the functional modules of the device or the equipment according to the method example, for example, the device or the equipment can divide each functional module corresponding to each function, or two or more functions can be integrated in one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation.
Fig. 5 is a schematic diagram showing a structure of an address data processing apparatus according to an exemplary embodiment. Referring to fig. 5, an address data processing apparatus 30 provided in an embodiment of the present application includes an acquisition unit 301, a processing unit 302, and a determination unit 303.
The acquiring unit 301 is configured to acquire source address data. The processing unit 302 is configured to perform word segmentation on the source address data based on a preset noise word library, a preset administrative division word library, a preset azimuth word library and a pre-trained semantic model, so as to obtain a word segmentation result; the semantic model is used for identifying address guide words. The determining unit 303 is configured to determine, in the case where a noise word, an administrative division word, an azimuth word and an address guide word exist in the word segmentation result, a target interest point from the first geographic information database based on the source address data and the word segmentation result, and obtain a standard administrative division word corresponding to the target interest point. The determining unit 303 is further configured to determine a similarity between the standard administrative division word and the administrative division word in the word segmentation result, and obtain the target address data based on the standard administrative division word and the noise word, azimuth word and address guide word in the word segmentation result when the similarity is greater than or equal to a preset threshold.
Optionally, the preset administrative division word library includes a plurality of preset administrative division words; the word segmentation result comprises a first sub-result, wherein the first sub-result is obtained by performing word segmentation on the source address data based on the preset administrative division word library; the processing unit 302 is specifically configured to: divide each preset administrative division word to obtain the highest-level administrative division word and the non-highest-level administrative division word of each preset administrative division word; construct a prefix search tree by taking the highest-level administrative division word as a head node and the non-highest-level administrative division word as a child node; and index into the source address data based on the prefix search tree to obtain the first sub-result.
Optionally, the determining unit 303 is specifically configured to: under the condition that noise words, administrative division words, azimuth words and address guide words exist in the word segmentation result, perform a first search in the first geographic information database based on the source address data to obtain a first search result, and perform a second search in the first geographic information database based on the word segmentation result to obtain a second search result; and under the condition that the first search result comprises at least one first interest point and the matching degree corresponding to each first interest point and/or the second search result comprises at least one second interest point and the matching degree corresponding to each second interest point, combine the first search result and the second search result to obtain a plurality of interest points, and select the interest point with the highest matching degree from the plurality of interest points as the target interest point.
Optionally, the determining unit 303 is specifically configured to: acquire the position information of the target interest point, and retrieve the standard administrative division word corresponding to the target interest point through a reverse geocoding service.
Optionally, the determining unit 303 is further configured to: under the condition that the first search result and the second search result are null values, carrying out index matching in a second geographic information database based on the source address data to obtain a matching result; the data volume of the second geographic information database is larger than that of the first geographic information database; the matching result comprises administrative division words, azimuth words and address guide words corresponding to the source address data.
Optionally, the determining unit 303 is further configured to: under the condition that noise words, administrative division words, azimuth words and address guide words do not exist in the word segmentation result, index matching is carried out in a second geographic information database based on source address data, and a matching result is obtained; the data volume of the second geographic information database is larger than that of the first geographic information database; the matching result comprises administrative division words, azimuth words and address guide words corresponding to the source address data.
Fig. 6 is a schematic structural diagram of an electronic device according to the present application. As shown in fig. 6, the electronic device 40 may include at least one processor 401 and a memory 402 for storing processor executable instructions, wherein the processor 401 is configured to execute the instructions in the memory 402 to implement the address data processing method in the above-described embodiment.
In addition, the electronic device 40 may also include a communication bus 403 and at least one communication interface 404.
The processor 401 may be a central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the present application.
Communication bus 403 may include a pathway to transfer information between the aforementioned components.
The communication interface 404 uses any transceiver-like device for communicating with other devices or communication networks, such as Ethernet, a radio access network (RAN), a wireless local area network (WLAN), etc.
The memory 402 may be, but is not limited to, a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other compact disc storage, optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be implemented separately and coupled to the processor 401 via a bus. The memory may also be integrated with the processor 401.
The memory 402 is used for storing instructions for executing the scheme of the present application, and the processor 401 controls the execution. The processor 401 is arranged to execute instructions stored in the memory 402 in order to carry out the functions of the method of the application.
As an example, in connection with fig. 5, the acquisition unit 301, the processing unit 302, and the determination unit 303 in the address data processing apparatus 30 realize the same functions as those of the processor 401 in fig. 6.
In a particular implementation, as one embodiment, processor 401 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 6.
In a particular implementation, electronic device 40 may include multiple processors, such as processor 401 and processor 407 in FIG. 6, as one embodiment. Each of these processors may be a single-core (single-CPU) processor or may be a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
In a particular implementation, electronic device 40 may also include an output device 405 and an input device 406, as one embodiment. The output device 405 communicates with the processor 401 and may display information in a variety of ways. For example, the output device 405 may be a liquid crystal display (liquid crystal display, LCD), a light emitting diode (light emitting diode, LED) display device, a Cathode Ray Tube (CRT) display device, or a projector (projector), or the like. The input device 406 is in communication with the processor 401 and may accept input of a user object in a variety of ways. For example, the input device 406 may be a mouse, keyboard, touch screen device, or sensing device, among others.
Those skilled in the art will appreciate that the structure shown in fig. 6 is not limiting of the electronic device 40 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
In addition, the present application also provides a computer-readable storage medium; when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the address data processing method provided in the above-described embodiments.
In addition, the application also provides a computer program product comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the address data processing method as provided in the above embodiments.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

Claims (10)

1. A method of address data processing, the method comprising:
acquiring source address data, and performing word segmentation processing on the source address data based on a preset noise word stock, a preset administrative division word stock, a preset azimuth word stock and a pre-trained semantic model to obtain a word segmentation result; the semantic model is used for identifying address guide words;
under the condition that noise words, administrative division words, azimuth words and address guide words exist in the word segmentation result, determining a target interest point from a first geographic information database based on the source address data and the word segmentation result, and acquiring a standard administrative division word corresponding to the target interest point;
and determining the similarity between the standard administrative division words and the administrative division words in the word segmentation result, and obtaining target address data based on the standard administrative division words and the noise words, azimuth words and address guide words in the word segmentation result under the condition that the similarity is greater than or equal to a preset threshold value.
2. The address data processing method according to claim 1, wherein the preset administrative division word library includes a plurality of preset administrative division words; the word segmentation result comprises a first sub-result, and the first sub-result is obtained by carrying out word segmentation on the source address data based on the preset administrative division word stock;
The word segmentation processing is performed on the source address data based on the preset administrative division word stock to obtain the first sub-result, including:
dividing each preset administrative division word to obtain a first administrative division word and a second administrative division word of each preset administrative division word; the division level of the first administrative division word is greater than or equal to a preset level, and the division level of the second administrative division word is smaller than the preset level;
constructing a prefix search tree by taking the first administrative division word as a head node and the second administrative division word as a child node;
and identifying in the source address data based on the prefix search tree to obtain the first sub-result.
3. The method according to claim 1, wherein in the case where the noise word, the administrative division word, the azimuth word, and the address guide word are present in the word segmentation result, determining the target interest point from the first geographic information database based on the source address data and the word segmentation result includes:
under the condition that noise words, administrative division words, azimuth words and address guide words exist in the word segmentation result, first search is conducted in the first geographic information database based on the source address data, and a first search result is obtained;
Performing second retrieval in the first geographic information database based on the word segmentation result to obtain a second retrieval result;
under the condition that the first search result comprises at least one first interest point and the matching degree corresponding to each first interest point and/or the second search result comprises at least one second interest point and the matching degree corresponding to each second interest point, obtaining a plurality of interest points according to the first search result and the second search result, and selecting the interest point with the highest matching degree from the plurality of interest points as the target interest point.
4. The method for processing address data according to claim 3, wherein the obtaining the standard administrative division word corresponding to the target point of interest comprises:
acquiring the position information of the target interest point, and retrieving the standard administrative division words corresponding to the target interest point through a reverse geocoding service.
5. A method of address data processing according to claim 3, wherein the method further comprises:
under the condition that the first search result and the second search result are null values, index matching is carried out in a second geographic information database based on the source address data, and a matching result is obtained; the data volume of the second geographic information database is larger than the data volume of the first geographic information database; the matching result comprises administrative division words, azimuth words and address guide words corresponding to the source address data.
6. The address data processing method according to claim 1, characterized in that the method further comprises:
under the condition that noise words, administrative division words, azimuth words and address guide words do not exist in the word segmentation result, index matching is carried out in a second geographic information database based on the source address data, and a matching result is obtained; the data volume of the second geographic information database is larger than that of the first geographic information database; the matching result comprises administrative division words, azimuth words and address guide words corresponding to the source address data.
7. An address data processing apparatus, characterized in that the apparatus comprises an acquisition unit, a processing unit and a determination unit;
the acquisition unit is used for acquiring source address data;
the processing unit is used for performing word segmentation on the source address data based on a preset noise word stock, a preset administrative division word stock, a preset azimuth word stock and a pre-trained semantic model to obtain a word segmentation result; the semantic model is used for identifying address guide words;
the determining unit is used for determining a target interest point from a first geographic information database based on the source address data and the word segmentation result and acquiring a standard administrative division word corresponding to the target interest point under the condition that a noise word, an administrative division word, an azimuth word and an address guide word exist in the word segmentation result;
the determining unit is further configured to determine a similarity between the standard administrative division word and the administrative division word in the word segmentation result, and to obtain target address data based on the standard administrative division word and the noise word, azimuth word and address guide word in the word segmentation result when the similarity is greater than or equal to a preset threshold.
8. The address data processing apparatus according to claim 7, wherein the preset administrative division word stock includes a plurality of preset administrative division words; the word segmentation result comprises a first sub-result, and the first sub-result is obtained by performing word segmentation on the source address data based on the preset administrative division word stock; the processing unit is specifically configured to:
dividing each preset administrative division word to obtain a first administrative division word and a second administrative division word of each preset administrative division word; the division level of the first administrative division word is greater than or equal to a preset level, and the division level of the second administrative division word is smaller than the preset level;
constructing a prefix search tree by taking the first administrative division word as a head node and the second administrative division word as a child node;
and indexing in the source address data based on the prefix search tree to obtain the first sub-result.
9. An apparatus, comprising: a processor, a memory for storing instructions executable by the processor; wherein the processor is configured to execute instructions to implement the address data processing method of any of claims 1-6.
10. A computer-readable storage medium having instructions or a computer program stored thereon, and/or a computer program, characterized in that the instructions or the computer program, when executed by a processor of an electronic device, enable the electronic device to perform the address data processing method according to any one of claims 1-6.
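As an illustration only, the prefix search tree of claim 8 and the selection of the target interest point in claim 3 might be sketched in Python as follows. The level encoding (a larger number meaning a higher administrative level), the (name, level) representation of a preset administrative division word, the example place names, and all function names are assumptions introduced for this sketch; the claims do not prescribe them.

```python
from collections import defaultdict

# Hypothetical level encoding: a larger number means a higher administrative level.
PROVINCE, CITY, DISTRICT = 3, 2, 1


def split_division_word(parts, preset_level):
    """Split one preset division word, given as [(name, level), ...], into a
    first word (levels >= preset_level) and a second word (levels < preset_level)."""
    first = "".join(name for name, level in parts if level >= preset_level)
    second = "".join(name for name, level in parts if level < preset_level)
    return first, second


def build_prefix_tree(preset_words, preset_level):
    """Map each first word (head node) to the set of its second words (child nodes)."""
    tree = defaultdict(set)
    for parts in preset_words:
        first, second = split_division_word(parts, preset_level)
        if not first:
            continue  # no component at or above the preset level
        children = tree[first]  # creates the head node if it is missing
        if second:
            children.add(second)
    return tree


def index_address(source, tree):
    """Return (head, child) for the division words found in the source address.
    child is None when only a head node matches; the result is None when
    no head node occurs in the source at all."""
    for head in sorted(tree, key=len, reverse=True):  # try longer heads first
        pos = source.find(head)
        if pos == -1:
            continue
        rest = source[pos + len(head):]
        for child in sorted(tree[head], key=len, reverse=True):
            if rest.startswith(child):
                return head, child
        return head, None
    return None


def select_target_poi(first_result, second_result):
    """Merge two search results, each a list of (poi, matching_degree) pairs,
    and return the POI with the highest matching degree, or None if both are empty."""
    candidates = list(first_result) + list(second_result)
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    preset_words = [
        [("江苏省", PROVINCE), ("南京市", CITY), ("玄武区", DISTRICT)],
        [("江苏省", PROVINCE), ("南京市", CITY), ("鼓楼区", DISTRICT)],
    ]
    tree = build_prefix_tree(preset_words, preset_level=CITY)
    print(index_address("江苏省南京市玄武区中山路1号", tree))
    print(select_target_poi([("POI-A", 0.6)], [("POI-B", 0.9)]))
```

In this sketch the tree has only two layers, a head node and its child nodes, because claim 8 splits each preset word at a single preset level; a deeper hierarchy would need one layer per level. An empty result from `select_target_poi` corresponds to the case in claim 5 where both search results are null and matching falls back to the second geographic information database.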
CN202310701692.1A 2023-06-13 2023-06-13 Address data processing method and device, electronic equipment and storage medium Pending CN116680356A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310701692.1A CN116680356A (en) 2023-06-13 2023-06-13 Address data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310701692.1A CN116680356A (en) 2023-06-13 2023-06-13 Address data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116680356A true CN116680356A (en) 2023-09-01

Family

ID=87788823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310701692.1A Pending CN116680356A (en) 2023-06-13 2023-06-13 Address data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116680356A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874371A (en) * 2024-03-11 2024-04-12 园测信息科技股份有限公司 Method, system, medium and equipment for inquiring point of interest storage under administrative division
CN117874371B (en) * 2024-03-11 2024-05-31 园测信息科技股份有限公司 Method, system, medium and equipment for inquiring point of interest storage under administrative division

Similar Documents

Publication Publication Date Title
CN110609902B (en) Text processing method and device based on fusion knowledge graph
CN108388559B (en) Named entity identification method and system under geographic space application and computer program
Xavier et al. A survey of measures and methods for matching geospatial vector datasets
Gao et al. Efficient collective spatial keyword query processing on road networks
Al-Bakri et al. Assessing similarity matching for possible integration of feature classifications of geospatial data from official and informal sources
Safra et al. Ad hoc matching of vectorial road networks
CN109033314B (en) Real-time query method and system for large-scale knowledge graph under condition of limited memory
CN112328891B (en) Method for training search model, method for searching target object and device thereof
Lehmann et al. Deqa: deep web extraction for question answering
CN109086356B (en) Method for diagnosing and correcting error connection relation of large-scale knowledge graph
JP2018537760A (en) Method and apparatus for account mapping based on address information
Chen et al. Georeferencing places from collective human descriptions using place graphs
Abdolmajidi et al. Matching authority and VGI road networks using an extended node-based matching algorithm
WO2018188509A1 (en) Estate information processing method and apparatus, computer device and storage medium
Zhang et al. An improved probabilistic relaxation method for matching multi-scale road networks
Qu et al. Integrating non-spatial preferences into spatial location queries
JP2009110508A (en) Method and system for calculating competitiveness metric between objects
Xu Formalizing natural‐language spatial relations between linear objects with topological and metric properties
Luo et al. Efficient reverse spatial and textual k nearest neighbor queries on road networks
Honarparvar et al. Improvement of a location-aware recommender system using volunteered geographic information
CN111104503A (en) Construction engineering quality acceptance standard question-answering system and construction method thereof
CN114201480A (en) Multi-source POI fusion method and device based on NLP technology and readable storage medium
van Erp et al. Georeferencing animal specimen datasets
CN112818072A (en) Tourism knowledge map updating method, system, equipment and storage medium
CN116680356A (en) Address data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination