WO2021017679A1 - Address information parsing method, device, system, and data acquisition method
- Publication number
- WO2021017679A1 (application PCT/CN2020/096989)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- address information
- data
- feature
- geographic
- geocoding
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9537—Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Definitions
- This application relates to the field of address resolution, in particular to address information resolution methods, devices, systems, and data acquisition methods.
- Modern retail companies generate massive amounts of sales data every day and analyze it as a basis for, or an aid to, corporate decision-making.
- Address data in the sales data is the basic data for smart retail analysis and decision-making.
- Small-shop site selection, logistics resource allocation, sales analysis by geographic dimension, and similar tasks all rely on parsing the address data in sales data, so the efficiency and accuracy of address parsing are very important.
- In the prior art, massive address data is parsed into standard geocodes using rule-based cleaning. Specifically, all standard administrative geographic data is first built into a dictionary library containing rules, and the geographic data in the original data is then extracted with regular expressions. The extracted geographic data is matched against the dictionary library to obtain geographic data in standard form. Finally, the geographic data is converted into geocodes locally and provided to the various upper-level retail decision-making applications.
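The rule-based approach described above can be pictured with a minimal Python sketch. The dictionary contents, the level marker "DI", and the function name `rule_based_extract` are hypothetical and chosen only for illustration; only the marker "CI" for a city appears in this text.

```python
import re

# Hypothetical rule dictionary: alias -> (standard name, level marker).
# Only "CI" (city) is a marker named in the text; "DI" is invented here.
RULES = {
    "Tianjin": ("Tianjin", "CI"),
    "Tianjin City": ("Tianjin", "CI"),
    "Binhai New Area": ("Binhai New Area", "DI"),
}

# Try longer aliases first so "Tianjin City" wins over "Tianjin".
PATTERN = re.compile("|".join(
    re.escape(alias) for alias in sorted(RULES, key=len, reverse=True)))


def rule_based_extract(address):
    """Return standard (name, level) pairs found by regex plus dictionary lookup."""
    return [RULES[match.group(0)] for match in PATTERN.finditer(address)]


regular = rule_based_extract("Tianjin City, Binhai New Area, No. 18")
irregular = rule_based_extract("tianjin binhai")  # no rule matches this input
```

The second call returns an empty list, which illustrates the weakness of this approach: input that does not follow the rules yields nothing.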
- However, the address information in sales data is mostly filled in manually by users and contains many irregularities, so some data cannot be converted into codes and the accuracy of the parsing results is low.
- The present application provides an address information parsing method, device, system, and data acquisition method to solve the prior-art problem that address resolution occupies a large amount of resources and takes a long time.
- a method for parsing address information includes:
- the method further includes:
- the historical address information analysis record includes historical address information and corresponding historical geocoding data
- Using natural language processing technology to extract features from the address information to be resolved includes: if it has not been resolved, then using natural language processing technology to extract features from the address information to be resolved.
- the method further includes:
- the geographic location tree dictionary is formed according to the hierarchical division of administrative regions;
- encoding the standard array to obtain the geocoding result includes encoding the completed standard array to obtain the geocoding result.
- encoding the standard array to obtain a geocoding result includes:
- the method further includes the step of constructing the preset model in advance:
- the sample feature vector is used as input and the corresponding sample array as output, and a neural network and a conditional random field algorithm are used for training to obtain the preset model.
- using natural language processing technology to extract primary features of the address data in the sample set, determining the primary features that meet certain conditions as target features, and vectorizing the target features to obtain sample feature vectors includes:
- the target features are vectorized according to the weighting matrix to obtain a sample feature vector.
- the method further includes: associating and storing the geocoding result with the original data.
- the prediction model is deployed in a Spark computing engine, and the geocoding result is associated with the original data and stored in an Elasticsearch search engine.
- Another aspect of the present application provides a data acquisition method, which includes:
- calculation is performed in the association table between the prestored geocoding result and the original data, and the geocoding result and the corresponding original data within the preset geographic range are obtained.
- An address information parsing device, which includes:
- the to-be-resolved address information obtaining unit, configured to obtain the address information to be resolved from the original data;
- the feature extraction unit, configured to extract features from the address information to be resolved using natural language processing technology, select among the extracted features, and vectorize the selected features to obtain a feature vector;
- the model prediction unit, configured to input the feature vector into a preset model to obtain an initial array including geographic entities and the administrative division levels corresponding to the geographic entities; the preset model is trained based on a combination of a recurrent neural network and a conditional random field algorithm;
- the sorting unit, configured to sort the geographic entities in the initial array according to the administrative division level and remove duplicates to obtain a standard array;
- the geocoding unit, configured to encode the standard array to obtain a geocoding result.
- Another aspect of this application provides a computer system, including:
- one or more processors; and
- a memory associated with the one or more processors, where the memory is used to store program instructions that, when read and executed by the one or more processors, perform the following operations:
- The technical solution of this application uses natural language processing technology to extract and select features from address information and vectorize them into a feature vector to be identified. The feature vector is then used as model input to predict an initial array of geographic entities and their corresponding administrative division levels; after sorting and de-duplication, geocoding is performed to obtain the parsing result.
- This process eliminates the need to build a full dictionary library containing rules, reduces the occupation of hardware resources, and places lower requirements on the deployment environment.
- Standard geographic data is extracted from massive address information through model prediction, which is not affected by the input format of the address information, adapts to various data changes, requires no manual maintenance, and improves the efficiency of geographic data extraction.
- The prediction model, optimized by the feature selection algorithm of this solution, discards cluttered features that have low correlation with the administrative division level. As a result, geographic information is extracted more accurately than with traditional rule matching, and the model computes faster.
- The address information encoding function can be encapsulated as a batch parsing interface and placed on an independent external server, so that it does not occupy the computing resources used for geographic data extraction, which improves encoding efficiency and makes data processing closer to real time.
- The solution can also complete administrative geographic information missing from the address information, making the parsing result more accurate.
- FIG. 1 is a system structure diagram provided by an embodiment of the present application.
- FIG. 2 is a flowchart of a specific address information parsing process provided by an embodiment of the present application.
- FIG. 3 is a flowchart of an address resolution method provided by an embodiment of the present application.
- FIG. 4 is a structural diagram of an apparatus provided by an embodiment of the present application.
- FIG. 5 is an architecture diagram of a computer system provided by an embodiment of the present application.
- The purpose of this application is to provide an address information parsing method. The method uses natural language processing technology to extract features from address information, selects the features with high correlation, and vectorizes them into feature vectors. A pre-built model then predicts, from the feature vectors, the geographic entities and their corresponding administrative division levels. These are sorted and de-duplicated to obtain geographic data in standard form, which is geocoded to obtain coordinates and complete the parsing of the address information. Because feature extraction, selection, and vectorization retain the features that correlate most strongly with the administrative division level, the subsequent model predicts faster and more accurately. At the same time, model prediction does not require building a full dictionary library containing rules, which reduces the occupation of hardware resources.
- The system architecture of this application includes an original data system, an address information processing system, and an encoding system, which can exist independently of each other in hardware.
- The original data system provides the original data; it may be, for example, an external system or an OMS (order management system).
- The address information processing system obtains original data, such as order information, from the original data system and performs a series of processing steps on the address information in the original data to obtain geographic data in standard form.
- The encoding system encodes the geographic data in standard form to obtain a geocoding result (usually coordinates).
- The encoding system encapsulates a batch parsing interface, and the address information processing system can encode the geographic data in standard form by calling this interface.
- The address information processing system may also associate the geocoding result obtained from the encoding system with the corresponding original data and store the association in the Elasticsearch search engine for subsequent searches of related data.
- The address information processing system may also associate the resolved address information with the corresponding geocoding result as a historical resolution record and store it in the address resolution history table.
- After the address information processing system obtains the address information, it first looks for a match in the address resolution history table. If the same address information is found, the corresponding geocoding result is obtained directly without any subsequent processing, and the result does not need to be saved to the address resolution history table again. If no match is found, the address information is considered to be resolved for the first time: the address information processing system follows the normal processing procedure, parses and encodes the address information together with the encoding system, and stores the resulting geocoding result in the address resolution history table.
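The history-table lookup described above behaves like a cache in front of the parsing pipeline. The following Python sketch assumes the history table is an in-memory dictionary and that `parse_and_encode` stands in for the full model and encoding pipeline; both names, and the coordinates returned by the stub, are hypothetical.

```python
def resolve_with_history(address, history, parse_and_encode):
    """Return (geocode, was_cached) for one address string."""
    if address in history:
        # Already resolved: reuse the stored result and do not store it again.
        return history[address], True
    # First resolution: run the normal pipeline, then record the result.
    result = parse_and_encode(address)
    history[address] = result
    return result, False


calls = []


def fake_parse_and_encode(address):
    """Stub for the real pipeline; returns made-up coordinates."""
    calls.append(address)
    return (117.70, 39.00)


history = {}
first = resolve_with_history("Tianjin Binhai", history, fake_parse_and_encode)
second = resolve_with_history("Tianjin Binhai", history, fake_parse_and_encode)
```

In this sketch the second call is served from `history`, so the stub pipeline runs only once.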
- The original data system and the address information processing system can share the same server.
- The encoding system can also share the same server with the address information processing system.
- Preferably, the encoding system is placed on an independent server and completes the encoding task through the encapsulated batch parsing interface. Since it does not occupy the computing resources that the address information processing system uses for address parsing, encoding efficiency is improved and data processing is closer to real time.
- The following description takes as an example the case in which the encoding system and the address information processing system are located on different servers and the original data is order data.
- The address information processing system first needs to convert the address information into standard geographic data.
- For example, the address information is "Mr. Li, Binhai New District, No. 18, Xingang Road, Tianjin".
- Because the address information contains non-geographic data, it needs to be converted into geographic data in standard form, namely "Tianjin
- To convert the unprocessed address information into geographic data in standard form, this application first extracts the geographic entities in the address information and the administrative division levels corresponding to those entities.
- For example, the geographic entities are Tianjin, Binhai, Tanggu, etc.
- The administrative division levels are country, province, urban area, county, etc.
- In the prior art, regular expressions are used to extract geographic entities and the corresponding administrative division levels from strings that follow certain rules. This requires not only building a rule database but also that the strings representing addresses comply with the rules. For strings that do not follow the rules, the extraction cannot be completed.
- This application provides an administrative geographic entity relationship recognition model optimized by a feature selection algorithm. Natural language processing (NLP) technology is used to perform feature selection on the address information and to calculate a feature vector.
- The trained administrative geographic entity relationship recognition model is used to obtain the prediction result, which is a binary geographic entity relationship array, political relation, composed of geographic entities and their corresponding administrative division levels.
- e1...en represent the identified geographic entities;
- t1...tn represent the corresponding administrative levels.
- The level classification is shown in Table 1.
- The administrative level in the binary array can be replaced by the marker words in Table 1.
- For example, a city can be represented by CI.
- Information that is neither a geographic entity nor an administrative division level is classified as redundant information. Repeated geographic information is also classified as redundant information.
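Removing redundant information and ordering the remaining pairs by administrative level can be sketched in Python as follows. Table 1 is not reproduced in this text, so the marker set and its ordering in `LEVEL_ORDER` are assumptions made for illustration (only "CI" for a city is named above), and the input pairs are made-up model output.

```python
# Hypothetical level markers ordered from broadest to narrowest.
# Only "CI" comes from the text; the other markers are invented.
LEVEL_ORDER = {"PR": 0, "CI": 1, "DI": 2, "ST": 3, "RD": 4}


def sort_and_dedupe(pairs):
    """Drop redundant pairs, then sort the rest by administrative level."""
    seen = set()
    kept = []
    for entity, level in pairs:
        if level not in LEVEL_ORDER:   # non-geographic noise, e.g. a person's name
            continue
        if (entity, level) in seen:    # repeated geographic information
            continue
        seen.add((entity, level))
        kept.append((entity, level))
    return sorted(kept, key=lambda pair: LEVEL_ORDER[pair[1]])


initial_array = [
    ("Xingang No. 2 Road", "RD"),
    ("Mr. Li", "XX"),                  # "XX" is a made-up marker for redundant info
    ("Tianjin", "CI"),
    ("Binhai New Area", "DI"),
    ("Tianjin", "CI"),
]
standard_array = sort_and_dedupe(initial_array)
```

The result keeps one entry per geographic entity, ordered from the broadest level to the narrowest.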
- The sorted binary array is matched against the tree dictionary to determine whether any geographic information is missing from the binary array.
- A recursive method can be used to check and complete the array. For example, the geographic information of Tanggu Street is missing between Binhai New Area and Xingang No. 2 Road in the above binary array.
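The recursive check-and-complete step can be illustrated with a small Python sketch. It assumes the tree dictionary is stored as a child-to-parent map; the `PARENT` contents, the level markers other than "CI", and the function names are hypothetical.

```python
# Hypothetical tree dictionary stored as entity -> (parent entity, entity's level).
PARENT = {
    "Tianjin": (None, "CI"),
    "Binhai New Area": ("Tianjin", "DI"),
    "Tanggu Street": ("Binhai New Area", "ST"),
    "Xingang No. 2 Road": ("Tanggu Street", "RD"),
}


def chain_to_root(entity):
    """Recursively build the path from the root of the tree down to `entity`."""
    if entity is None or entity not in PARENT:
        return []
    parent, level = PARENT[entity]
    return chain_to_root(parent) + [(entity, level)]


def complete(standard_array):
    """Insert ancestors that are missing above the most specific known entity."""
    known = [entity for entity, _ in standard_array if entity in PARENT]
    if not known:
        return list(standard_array)
    chain = chain_to_root(known[-1])
    # Keep any pairs the tree does not know about, after the completed chain.
    extras = [pair for pair in standard_array if pair not in chain]
    return chain + extras


completed = complete([
    ("Tianjin", "CI"),
    ("Binhai New Area", "DI"),
    ("Xingang No. 2 Road", "RD"),
])
```

Here the missing Tanggu Street entry is restored between Binhai New Area and Xingang No. 2 Road.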
- the aforementioned coding technique can be used to encode the geographic data to obtain the geocoding result.
- This application provides an administrative geographic entity relationship recognition model optimized by a feature selection algorithm. The construction and training of this model are described next:
- First, natural language processing (NLP) technology is used to extract and select features from the sample address information and to calculate the sample feature vector. The specific steps are as follows:
- The original address information corpus obtained from the original data system can be divided into three categories: data for which the coordinate parsing program cannot obtain coordinates, data for which the obtained coordinates are incorrect, and data for which the coordinates are obtained correctly. An equal portion of each category is then selected from the original corpus as the basic corpus. Word segmentation is performed on the selected corpus, and each segment is labeled with its sample geographic entity and the administrative division (administrative geographic identifier) corresponding to that entity. A certain proportion of the labeled data is randomly selected for model training, and a certain proportion is reserved for model verification.
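The sampling and splitting of the corpus can be sketched in Python as follows. The category names, the per-category sample size, and the 80/20 split are assumptions for illustration; the text only says that equal portions and "a certain proportion" are used. Labeling is omitted, so the records here stand in for already labeled data.

```python
import random


def sample_and_split(corpus_by_category, per_category, train_ratio, seed=0):
    """Take an equal sample from each category, then split into train and validation."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    selected = []
    for records in corpus_by_category.values():
        selected.extend(rng.sample(records, per_category))
    rng.shuffle(selected)
    cut = int(len(selected) * train_ratio)
    return selected[:cut], selected[cut:]


# Made-up corpus: three categories of ten placeholder records each.
corpus = {
    "no_coordinates": [f"none-{i}" for i in range(10)],
    "wrong_coordinates": [f"wrong-{i}" for i in range(10)],
    "correct_coordinates": [f"ok-{i}" for i in range(10)],
}
train_set, validation_set = sample_and_split(corpus, per_category=5, train_ratio=0.8)
```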
- Feature frequency FC represents the number of times a feature appears in the address information text, as in formula (1), where N_i represents the total number of features appearing in the address information.
- EX_ik is the number of texts in which the feature pw appears at levels other than the geographic administrative division level t;
- UN_ik is the number of texts at the geographic administrative division level t in which the feature pw does not appear;
- S is the total number of entity texts across all administrative geographic entity classes.
- Each selected target feature obtains x correlation degrees, and the average of these x correlation degrees is taken as the weight of that word.
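The averaging step can be shown with a short Python sketch. The correlation formulas themselves are not reproduced in this text, so the correlation degrees below are made-up inputs, and the function name `feature_weights` is hypothetical.

```python
def feature_weights(correlations):
    """Map each feature to the mean of its x correlation degrees."""
    return {feature: sum(values) / len(values)
            for feature, values in correlations.items()}


# Made-up correlation degrees, three per feature (x = 3 in this sketch).
weights = feature_weights({
    "Tianjin": [1.0, 0.0, 0.5],
    "Road": [0.25, 0.25, 0.25],
})
```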
- The weighted, normalized sample feature vector is constructed from formulas (2), (5), (6), (7), and (8), as shown in formula (9):
- The obtained sample feature vector v is used as the vectorized input for model training. The vectorized training corpus is trained with a neural network and a conditional random field algorithm, such as an RNN (recurrent neural network) and the CRF (conditional random field) algorithm, to obtain the administrative geographic entity relationship recognition model.
- the final output of the model is a binary geographic entity relationship group as follows:
- The selected target features have a high correlation with the administrative division level. Discarding cluttered features that have low correlation with the administrative division level reduces their adverse effect on the results and reduces the amount of data input to the model.
- With the feature selection described above, the parameters input to the model are no longer messy address information but feature vectors that have been selected and optimized. This improves the correlation between the input parameters and the geographic entities and corresponding administrative divisions, which speeds up model calculation and improves the accuracy of the recognition result.
- This method solves the problem of low-quality geographic data, increases the effective resolution rate of address resolution, and provides a more accurate data basis for upper-level decision-making. The following notation is used:
- R represents the set of records for which address resolution obtained the correct coordinates;
- G(wr)_i represents the result set of the i-th type of parsing error;
- the main error type is deviation of the parsed coordinates;
- T represents the total number of addresses that need to be resolved;
- S represents the set of records for which address resolution succeeded and coordinates were obtained;
- E represents the set of failed records for which no coordinates were obtained after address resolution.
- The correct rate of the final address resolution is given by equation (10),
- the resolution rate by equation (11),
- and the effective resolution rate by equation (12).
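Equations (10) to (12) are not reproduced in this text. Purely as an illustration, the Python sketch below computes the three rates under one plausible reading, which is an assumption and not confirmed by the source: correct rate = |R| / |S|, resolution rate = |S| / T, and effective resolution rate = |R| / T, with S taken as the union of R and all error sets G(wr)_i. The counts are made up.

```python
def resolution_rates(total, correct, wrong_by_type, failed):
    """Compute the three rates under the assumed definitions stated above."""
    succeeded = correct + sum(wrong_by_type.values())   # |S| = |R| + sum of |G(wr)_i|
    if succeeded + failed != total:
        raise ValueError("counts do not add up to the total T")
    if succeeded == 0:
        raise ValueError("no address was resolved, so the correct rate is undefined")
    return {
        "correct_rate": correct / succeeded,             # assumed form of equation (10)
        "resolution_rate": succeeded / total,            # assumed form of equation (11)
        "effective_resolution_rate": correct / total,    # assumed form of equation (12)
    }


# Made-up counts: T = 100, |R| = 80, one error type with 10 records, |E| = 10.
rates = resolution_rates(total=100, correct=80,
                         wrong_by_type={"coordinate_deviation": 10}, failed=10)
```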
- the feature selection algorithm is used to optimize the administrative geographic entity relationship recognition model.
- the accuracy of extracting geographic information is higher than that of traditional rule matching, and the extracted geographic data is more correct.
- The following is a specific implementation of Embodiment 1 of this application:
- The parsing task cluster is based on Spark, with data processing tasks developed in Java to handle task scheduling and distribution.
- The pre-trained administrative geographic entity relationship recognition model is deployed in the parsing task cluster to identify the relationship between administrative division levels and geographic entities in low-quality address information and to extract the effective information.
- The core administrative geographic entity relationship recognition model is implemented in Python and trained with an RNN (recurrent neural network) and the CRF (conditional random field) algorithm; it embeds the administrative geographic entity feature optimization algorithm and reduces noise from human-introduced interference. An administrative-level sorting algorithm then sorts and reorganizes the administrative geographic entities, and the tree dictionary constructed as described above is used to check the data, yielding standard geographic data and providing higher-quality address information for subsequent encoding.
- The geocoding function can be scheduled concurrently in the Spark task cluster. A RESTful HTTP batch address resolution interface developed in Java encodes the address information completed after model extraction to obtain standard geocoding information.
- Concurrent task scheduling can be used, with data submitted in single batches for batch parsing and encoding, to improve parsing and encoding throughput without increasing cluster pressure.
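Batch submission with concurrent scheduling can be sketched as follows. The text describes a Spark cluster calling a Java HTTP interface; this Python sketch swaps in a thread pool and a stub `fake_encode_batch` in place of the real interface, so the batch size, worker count, and all function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(items, size):
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def encode_all(addresses, encode_batch, batch_size=100, workers=4):
    """Submit each batch as one call and return results in input order."""
    batches = list(chunks(addresses, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order the batches were submitted.
        results = list(pool.map(encode_batch, batches))
    return [code for batch in results for code in batch]


def fake_encode_batch(batch):
    """Stub for the batch interface; returns a placeholder code per address."""
    return [len(address) for address in batch]


codes = encode_all(["a", "bb", "ccc", "dddd", "eeeee"], fake_encode_batch,
                   batch_size=2, workers=2)
```

One call per batch keeps the number of requests low, while the pool lets several batches be in flight at once.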
- Embodiment 2 of the present application provides an address information resolution method. As shown in FIG. 3, the method includes:
- S32: use natural language processing technology to perform feature extraction and selection on the address information to be parsed, and vectorize the selected features to obtain the feature vector to be identified; for the specific method, refer to the feature extraction, selection, and vectorization steps in model training;
- S33: input the feature vector to be identified into a preset model to obtain an initial array including geographic entities and the administrative division levels corresponding to the geographic entities;
- S34: sort the geographic entities in the initial array according to the administrative division level and remove duplicates to obtain a standard array;
- S35: encode the standard array to obtain a geocoding result.
- the encoding interface of the external server can be called to encode the standard array to obtain the geocoding result.
- the method further includes:
- the historical address information analysis record includes historical address information and corresponding historical geocoding data
- feature extraction is performed on the address information to be parsed using natural language processing technology.
- the method further includes:
- the geographic location tree dictionary is formed according to the hierarchical division of administrative regions;
- encoding the standard array to obtain the geocoding result includes encoding the completed standard array to obtain the geocoding result.
- the method of this application also includes the step of pre-constructing the preset model:
- the sample feature vector is used as input and the corresponding sample array as output, and a neural network and a conditional random field algorithm are used for training to obtain the preset model.
- using natural language processing technology to extract primary features of the address data in the sample set, determining the primary features that meet certain conditions as target features, and vectorizing the target features to obtain sample feature vectors includes:
- the target features are vectorized according to the weighting matrix to obtain a sample feature vector.
- the aforementioned geocoding result can be combined with other data to provide a data basis for subsequent application decision-making. For this reason, in this application, the aforementioned geocoding result can be associated and stored with the original data corresponding to the result.
- the geocoding result can be stored in association with the corresponding original data, and then the product sales in a certain geographic location can be obtained.
- The associated information can be stored in the Elasticsearch search engine.
- Embodiment 3 of this application provides a data acquisition method, including:
- calculation is performed in the association table between the prestored geocoding result and the original data, and the geocoding result and the corresponding original data within the preset geographic range are obtained.
- the original data within a certain geographic range can be obtained by using the geocoding result, which provides a data basis for subsequent sales and promotion decisions.
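A query over the association table for a preset geographic range can be sketched in Python as follows. The sketch assumes that the geocoding result is a latitude and longitude pair and that the range is a circle around a center point; the table layout, the field names, and the coordinates are all made up for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0                 # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def records_in_range(association_table, center, radius_km):
    """Return the rows whose geocode lies within radius_km of the center."""
    lat0, lon0 = center
    return [row for row in association_table
            if haversine_km(lat0, lon0, row["lat"], row["lon"]) <= radius_km]


# Made-up association table: geocoding result plus the original order data.
table = [
    {"order": "A", "lat": 39.00, "lon": 117.70},
    {"order": "B", "lat": 39.01, "lon": 117.71},
    {"order": "C", "lat": 31.23, "lon": 121.47},
]
nearby = records_in_range(table, center=(39.00, 117.70), radius_km=5.0)
```

A production system would more likely push this filter into the search engine that stores the association, which is outside the scope of this sketch.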
- Embodiment 4 of the present application provides an address information parsing device. As shown in FIG. 4, the device includes:
- the to-be-resolved address information obtaining unit 41, configured to obtain the address information to be resolved from the original data;
- the first feature vectorization unit 42, configured to use natural language processing technology to perform feature extraction, selection, and vectorization on the address information to be resolved to obtain a feature vector;
- the model prediction unit 43, configured to input the feature vector into a preset model to obtain an initial array including geographic entities and the administrative division levels corresponding to the geographic entities; the preset model is trained based on a combination of a recurrent neural network and a conditional random field algorithm;
- the sorting unit 44, configured to sort the geographic entities in the initial array according to the administrative division level and remove duplicates to obtain a standard array;
- the geocoding unit 45, configured to encode the standard array to obtain a geocoding result.
- the device further includes:
- the resolution record judging unit 46 is connected to the to-be-resolved address information obtaining unit 41 and is configured to judge, according to the pre-stored historical address information resolution records, whether the address information to be resolved has already been resolved; a historical address information resolution record includes historical address information and the corresponding historical geocoding data;
- the resolution record obtaining unit 47 is connected to the resolution record judging unit 46 and is configured to obtain the corresponding historical geocoding data as the geocoding result when it is determined that the address information to be resolved has already been resolved.
- the first feature vectorization unit 42 is specifically configured to perform feature extraction on the address information to be resolved using natural language processing technology when it is determined that the address information to be resolved has not been resolved.
- the device further includes:
- the completion unit 48, configured to match the standard array obtained by the sorting unit 44 against a pre-stored geographic location tree dictionary, determine whether anything is missing from the standard array, and, if so, complete the standard array based on the geographic location tree dictionary; the geographic location tree dictionary is formed according to the hierarchical division of administrative regions;
- the geocoding unit 45 is specifically configured to encode the completed standard array to obtain a geocoding result.
- The device of this application also includes units for building the preset model in advance, specifically including:
- the second feature vectorization unit, configured to extract features from the address data in the sample set using natural language processing technology, perform feature selection, and vectorize the selected features to obtain the sample feature vector; for the specific process of this step, refer to the related description in Embodiment 1.
- The second feature vectorization unit and the first feature vectorization unit may be the same or different.
- the sample administrative entity relationship unit, configured to label the address data in the sample set to obtain a sample array consisting of sample geographic entities and the sample administrative division levels corresponding to the sample geographic entities;
- the model training unit, configured to use the sample feature vector as input and the sample array as output, and to train with an RNN (recurrent neural network) and the CRF (conditional random field) algorithm to construct the preset model.
- the aforementioned geocoding result can be combined with other data to provide a data basis for subsequent application decision-making.
- the aforementioned device of the present application further includes an associated storage unit for associating and storing the aforementioned geocoding result with the original data corresponding to the result.
- the geocoding result can be stored in association with the corresponding original data, and then the product sales in a certain geographic location can be obtained.
- The associated information can be stored in the Elasticsearch search engine.
- Embodiment 5 of the present application provides a computer system, including:
- one or more processors; and
- a memory associated with the one or more processors, where the memory is used to store program instructions that, when read and executed by the one or more processors, perform the following operations:
- FIG. 5 exemplarily shows the architecture of the computer system, which may specifically include a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520.
- the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520 may be communicatively connected through the communication bus 1530.
- The processor 1510 may be implemented as a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute the relevant programs to realize the technical solutions provided in this application.
- the memory 1520 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc.
- the memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500 and a basic input output system (BIOS) for controlling the low-level operation of the computer system 1500.
- BIOS basic input output system
- In addition, a web browser 1523, a data storage management system 1524, and an icon font processing system 1525 may also be stored in the memory 1520.
- The aforementioned icon font processing system 1525 may be an application program that specifically implements the foregoing steps in the embodiment of the present application. In short, when the technical solution provided by the present application is implemented through software or firmware, the related program code is stored in the memory 1520 and is called and executed by the processor 1510.
- The input/output interface 1513 is used to connect an input/output module to realize information input and output.
- The input/output module can be configured in the device as a component (not shown in the figure), or can be externally connected to the device to provide corresponding functions.
- An input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and an output device may include a display, a speaker, a vibrator, an indicator light, and the like.
- The network interface 1514 is used to connect a communication module (not shown in the figure) to realize communication between this device and other devices.
- The communication module can communicate through wired means (such as USB or a network cable) or through wireless means (such as a mobile network, Wi-Fi, or Bluetooth).
- The bus 1530 includes a path for transmitting information between the various components of the device (such as the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520).
- The computer system 1500 can also obtain information about specific receiving conditions from the virtual resource object receiving condition information database 1541 for condition judgment, and so on.
- Although the above device only shows the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, etc., in a specific implementation the device may also include other components necessary for normal operation.
- Conversely, the above-mentioned device may include only the components necessary for implementing the solution of the present application, and not necessarily all the components shown in the figure.
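One of the operations recited for the stored program, sorting the geographic entities returned by the model by administrative division level and removing duplicates to form a standard array, can be illustrated with a short sketch. This is a minimal Python illustration under stated assumptions, not the application's implementation: the level names in `LEVEL_ORDER`, the `(entity, level)` pair layout, and the function name `to_standard_array` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical ranking of administrative division levels, coarse to fine.
LEVEL_ORDER = {"province": 0, "city": 1, "district": 2, "street": 3}


def to_standard_array(initial: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop repeated (entity, level) pairs, then sort by administrative level."""
    seen = set()
    unique = []
    for entity, level in initial:
        if (entity, level) not in seen:
            seen.add((entity, level))
            unique.append((entity, level))
    # sorted() is stable, so pairs at the same level keep their input order.
    return sorted(unique, key=lambda pair: LEVEL_ORDER[pair[1]])


# Assumed shape of the model output: entities in the order they were recognized.
initial = [
    ("Nanjing", "city"),
    ("Jiangsu", "province"),
    ("Xuanwu", "district"),
    ("Nanjing", "city"),
]
print(to_standard_array(initial))
```

A level name that is missing from `LEVEL_ORDER` raises a `KeyError` in this sketch; a real system would need to decide how to treat unrecognized levels.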
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Remote Sensing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Claims (10)
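- An address information parsing method, characterized in that the method comprises: obtaining address information to be parsed from original data; extracting features from the address information to be parsed using natural language processing technology, selecting among the extracted features, and vectorizing the selected features to obtain a feature vector to be recognized; inputting the feature vector to be recognized into a preset model to obtain an initial array comprising geographic entities and the administrative division level corresponding to each geographic entity; sorting and de-duplicating the geographic entities in the initial array by administrative division level to obtain a standard array; and encoding the standard array to obtain a geocoding result.
- The address information parsing method according to claim 1, characterized in that, before extracting features from the address information to be parsed using natural language processing technology, the method further comprises: determining, according to pre-stored historical address information parsing records, whether the address information to be parsed has been parsed before, the historical address information parsing records comprising historical address information and corresponding historical geocoding data; and if it has been parsed, obtaining the corresponding historical geocoding data as the geocoding result; wherein extracting features from the address information to be parsed using natural language processing technology comprises: if it has not been parsed, extracting features from the address information to be parsed using natural language processing technology.
- The address information parsing method according to claim 1, characterized in that, before encoding the standard array to obtain the geocoding result, the method further comprises: matching the standard array against a pre-stored geographic location tree dictionary to determine whether the standard array has missing entries, the geographic location tree dictionary being formed by dividing administrative regions level by level; and if entries are missing, completing the standard array according to the geographic location tree dictionary; wherein encoding the standard array to obtain the geocoding result comprises encoding the completed standard array to obtain the geocoding result.
- The address information parsing method according to claim 1, characterized in that encoding the standard array to obtain the geocoding result comprises: calling an encoding interface of an external server to encode the standard array and obtain the geocoding result.
- The address information parsing method according to any one of claims 1-4, characterized in that the method further comprises a step of constructing the preset model in advance: performing corpus annotation on address data in a sample set to obtain sample arrays annotated with sample geographic entities and the administrative divisions corresponding to the sample geographic entities; extracting primary features of the address data in the sample set using natural language processing technology, determining primary features that meet certain conditions as target features, and vectorizing the target features to obtain sample feature vectors; and, taking the sample feature vectors as input and the corresponding sample arrays as output, training with a neural network and a conditional random field algorithm to obtain the preset model.
- The address information parsing method according to claim 5, characterized in that extracting primary features of the address data in the sample set using natural language processing technology, determining primary features that meet certain conditions as target features, and vectorizing the target features to obtain sample feature vectors comprises: calculating the frequency with which each extracted primary feature appears in the address text; calculating, according to the frequency, the correlation between each primary feature and each administrative division level as a feature weight; selecting the primary features whose correlation and/or frequency satisfy a preset condition as the target features; calculating the correlation between each selected target feature and each administrative division level, taking the average correlation of each target feature as the weight of that target feature, and constructing a weighting matrix according to the weights; and vectorizing the target features according to the weighting matrix to obtain the sample feature vectors.
- The address information parsing method according to any one of claims 1-4, characterized in that the method further comprises: the preset model being deployed on the Spark computing engine, and the geocoding result being stored in association with the original data in the Elasticsearch search engine.
- A data acquisition method, characterized in that the method comprises: receiving candidate address information; parsing the candidate address information according to the method of claim 7 to obtain parsed candidate geocoding data; and, according to the candidate geocoding data and a preset geographic range, performing a calculation in a pre-stored association table of geocoding results and original data to obtain the geocoding results within the preset geographic range and the corresponding original data.
- An address information parsing apparatus, characterized in that the apparatus comprises: an address information obtaining unit, configured to obtain address information to be parsed from original data; a feature extraction unit, configured to extract features from the address information to be parsed using natural language processing technology, select among the extracted features, and vectorize the selected features to obtain a feature vector to be recognized; a model prediction unit, configured to input the feature vector to be recognized into a preset model to obtain an initial array comprising geographic entities and the administrative division level corresponding to each geographic entity, the preset model being trained by combining a recurrent neural network with a conditional random field algorithm; a sorting unit, configured to sort and de-duplicate the geographic entities in the initial array by administrative division level to obtain a standard array; and a geocoding unit, configured to encode the standard array to obtain a geocoding result.
- A computer system, characterized by comprising: one or more processors; and a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, perform the following operations: obtaining address information to be parsed from original data; extracting features from the address information to be parsed using natural language processing technology, selecting among the extracted features, and vectorizing the selected features to obtain a feature vector to be recognized; inputting the feature vector to be recognized into a preset model to obtain an initial array comprising geographic entities and the administrative division level corresponding to each geographic entity; sorting and de-duplicating the geographic entities in the initial array by administrative division level to obtain a standard array; and encoding the standard array to obtain a geocoding result.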
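The completion step of claim 3, in which the standard array is matched against a tree-shaped geographic location dictionary and missing administrative levels are filled in, can be sketched as follows. This is a minimal Python illustration, not the claimed implementation: the contents of `GEO_TREE`, the helper names `find_path` and `complete`, and the choice to look up only the most specific entity are assumptions made for the example.

```python
from typing import List, Optional

# Hypothetical tree dictionary: each place name maps to its child regions,
# nested by administrative level (province -> city -> district).
GEO_TREE: dict = {
    "Jiangsu": {"Nanjing": {"Xuanwu": {}, "Gulou": {}}, "Suzhou": {"Gusu": {}}},
    "Zhejiang": {"Hangzhou": {"Xihu": {}}},
}


def find_path(tree: dict, name: str, prefix: Optional[List[str]] = None) -> Optional[List[str]]:
    """Depth-first search returning the root-to-node path of the first node named `name`."""
    prefix = prefix or []
    for key, children in tree.items():
        path = prefix + [key]
        if key == name:
            return path
        found = find_path(children, name, path)
        if found is not None:
            return found
    return None


def complete(standard: List[str]) -> List[str]:
    """Fill in missing ancestor levels using the most specific (last) entity."""
    if not standard:
        return standard
    path = find_path(GEO_TREE, standard[-1])
    # Leave the array unchanged when the entity is not in the dictionary.
    return path if path is not None else standard


print(complete(["Jiangsu", "Xuanwu"]))  # the city level is missing in the input
```

The sketch takes the first node whose name matches and ignores the ancestors already present in the array. District names can repeat across cities, so a fuller version would use the existing ancestors to disambiguate.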
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA3145918A CA3145918A1 (en) | 2019-07-26 | 2020-06-19 | Address information parsing method and apparatus, system and data acquisition method |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910684395.4A CN110569322A (zh) | 2019-07-26 | 2019-07-26 | Address information parsing method, apparatus, system and data acquisition method |
CN201910684395.4 | 2019-07-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021017679A1 (zh) | 2021-02-04 |
Family
ID=68773824
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/096989 WO2021017679A1 (zh) | 2019-07-26 | 2020-06-19 | Address information parsing method, apparatus, system and data acquisition method |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN110569322A (zh) |
CA (1) | CA3145918A1 (zh) |
WO (1) | WO2021017679A1 (zh) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
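CN113438280A (zh) * | 2021-06-03 | 2021-09-24 | 多点生活(成都)科技有限公司 | Vehicle start control method and device |
CN113988949A (zh) * | 2021-11-15 | 2022-01-28 | 北京有竹居网络技术有限公司 | Promotion information processing method, apparatus, device, medium and program product |
CN114463053A (zh) * | 2022-01-21 | 2022-05-10 | 浪潮卓数大数据产业发展有限公司 | Method and system for enterprise attribution classification |
CN114513550A (zh) * | 2021-12-30 | 2022-05-17 | 天翼云科技有限公司 | Method and apparatus for processing geographic location information, and electronic device |
CN115174638A (zh) * | 2022-09-06 | 2022-10-11 | 广东邦盛新能源科技发展有限公司 | Networking method and system for photovoltaic panel data acquisition equipment |
CN115248837A (zh) * | 2022-09-21 | 2022-10-28 | 中科雨辰科技有限公司 | Data processing system for obtaining geographic entities from text |
CN116501827A (zh) * | 2023-06-26 | 2023-07-28 | 北明成功软件(山东)有限公司 | BIM-based method for matching and locating market entities and building addresses |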
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
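CN110569322A (zh) * | 2019-07-26 | 2019-12-13 | 苏宁云计算有限公司 | Address information parsing method, apparatus, system and data acquisition method |
CN113076746B (zh) * | 2020-01-06 | 2024-05-31 | 阿里巴巴集团控股有限公司 | Data processing method and system, storage medium and computing device |
CN113111230B (zh) * | 2020-02-13 | 2024-04-12 | 北京明亿科技有限公司 | Regular-expression-based method and device for extracting registered-residence addresses from police call handling text |
CN113111229B (zh) * | 2020-02-13 | 2024-04-12 | 北京明亿科技有限公司 | Regular-expression-based method and device for extracting track-location addresses from police call handling text |
CN111523647B (zh) * | 2020-04-26 | 2023-11-14 | 南开大学 | Network model training method and device, and feature selection model, method and device |
CN111901450B (zh) * | 2020-07-15 | 2023-04-18 | 安徽淘云科技股份有限公司 | Entity address determination method, apparatus, device and storage medium |
CN112148819A (zh) * | 2020-08-17 | 2020-12-29 | 北京来也网络科技有限公司 | Address recognition method and device combining RPA and AI |
CN112269861A (zh) * | 2020-10-09 | 2021-01-26 | 和美(深圳)信息技术股份有限公司 | Corpus generation method and system for intelligent robots |
CN112257413B (zh) * | 2020-10-30 | 2022-05-17 | 深圳壹账通智能科技有限公司 | Address parameter processing method and related device |
CN112488200A (zh) * | 2020-11-30 | 2021-03-12 | 上海寻梦信息技术有限公司 | Logistics address feature extraction method, system, device and storage medium |
CN112559661B (zh) * | 2020-12-09 | 2024-03-01 | 北京百度网讯科技有限公司 | Method and apparatus for retrieving address types, and electronic device |
CN113610157A (zh) * | 2021-01-20 | 2021-11-05 | 廖彩红 | Artificial-intelligence-based business big data feature collection method and server |
CN112818685A (zh) * | 2021-01-29 | 2021-05-18 | 上海寻梦信息技术有限公司 | Address matching method and apparatus, electronic device and storage medium |
CN112989166A (zh) * | 2021-03-26 | 2021-06-18 | 杭州有数金融信息服务有限公司 | Method for calculating an enterprise's actual place of business |
CN113138985B (zh) * | 2021-04-22 | 2023-05-02 | 重庆长安汽车股份有限公司 | GPS data parsing method and system |
CN113255346B (zh) * | 2021-07-01 | 2021-09-14 | 湖南工商大学 | Address element recognition method based on graph embedding and CRF knowledge integration |
CN113592037B (zh) * | 2021-08-26 | 2023-11-24 | 吉奥时空信息技术股份有限公司 | Address matching method based on natural language inference |
CN113642313B (zh) * | 2021-09-02 | 2024-03-29 | 阿里巴巴达摩院(杭州)科技有限公司 | Address text processing method, apparatus, device, storage medium and program product |
CN114301629A (zh) * | 2021-11-26 | 2022-04-08 | 北京六方云信息技术有限公司 | IP detection method and apparatus, terminal device and storage medium |
CN114138923B (zh) * | 2021-12-03 | 2024-06-07 | 吉林大学 | Method for constructing a geological map knowledge graph |
CN115577065B (zh) * | 2022-12-09 | 2023-06-09 | 中信证券股份有限公司 | Address parsing method and device |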
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102955833A (zh) * | 2011-08-31 | 2013-03-06 | 深圳市华傲数据技术有限公司 | Method for recognizing and standardizing correspondence addresses |
WO2014163977A1 (en) * | 2013-03-13 | 2014-10-09 | Google Inc. | Systems, methods and computer-readable media for interpreting geographical search queries |
CN109933797A (zh) * | 2019-03-21 | 2019-06-25 | 东南大学 | Geocoding method and system based on Jieba word segmentation and an address lexicon |
CN110019617A (zh) * | 2017-12-05 | 2019-07-16 | 腾讯科技(深圳)有限公司 | Method and device for determining an address identifier, storage medium, and electronic device |
CN110569322A (zh) * | 2019-07-26 | 2019-12-13 | 苏宁云计算有限公司 | Address information parsing method, apparatus, system and data acquisition method |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8732435B1 (en) * | 2008-07-30 | 2014-05-20 | Altera Corporation | Single buffer multi-channel de-interleaver/interleaver |
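CN102955832B (zh) * | 2011-08-31 | 2015-11-25 | 深圳市华傲数据技术有限公司 | System for recognizing and standardizing correspondence addresses |
CN109960795B (zh) * | 2019-02-18 | 2024-05-07 | 平安科技(深圳)有限公司 | Address information standardization method and apparatus, computer device and storage medium |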
2019
- 2019-07-26 CN CN201910684395.4A patent/CN110569322A/zh active Pending
2020
- 2020-06-19 CA CA3145918A patent/CA3145918A1/en active Pending
- 2020-06-19 WO PCT/CN2020/096989 patent/WO2021017679A1/zh active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102955833A (zh) * | 2011-08-31 | 2013-03-06 | 深圳市华傲数据技术有限公司 | Method for recognizing and standardizing correspondence addresses |
WO2014163977A1 (en) * | 2013-03-13 | 2014-10-09 | Google Inc. | Systems, methods and computer-readable media for interpreting geographical search queries |
CN110019617A (zh) * | 2017-12-05 | 2019-07-16 | 腾讯科技(深圳)有限公司 | Method and device for determining an address identifier, storage medium, and electronic device |
CN109933797A (zh) * | 2019-03-21 | 2019-06-25 | 东南大学 | Geocoding method and system based on Jieba word segmentation and an address lexicon |
CN110569322A (zh) * | 2019-07-26 | 2019-12-13 | 苏宁云计算有限公司 | Address information parsing method, apparatus, system and data acquisition method |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
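CN113438280A (zh) * | 2021-06-03 | 2021-09-24 | 多点生活(成都)科技有限公司 | Vehicle start control method and device |
CN113988949A (zh) * | 2021-11-15 | 2022-01-28 | 北京有竹居网络技术有限公司 | Promotion information processing method, apparatus, device, medium and program product |
CN114513550A (zh) * | 2021-12-30 | 2022-05-17 | 天翼云科技有限公司 | Method and apparatus for processing geographic location information, and electronic device |
CN114513550B (zh) * | 2021-12-30 | 2024-03-08 | 天翼云科技有限公司 | Method and apparatus for processing geographic location information, and electronic device |
CN114463053A (zh) * | 2022-01-21 | 2022-05-10 | 浪潮卓数大数据产业发展有限公司 | Method and system for enterprise attribution classification |
CN115174638A (zh) * | 2022-09-06 | 2022-10-11 | 广东邦盛新能源科技发展有限公司 | Networking method and system for photovoltaic panel data acquisition equipment |
CN115248837A (zh) * | 2022-09-21 | 2022-10-28 | 中科雨辰科技有限公司 | Data processing system for obtaining geographic entities from text |
CN115248837B (zh) * | 2022-09-21 | 2022-12-23 | 中科雨辰科技有限公司 | Data processing system for obtaining geographic entities from text |
CN116501827A (zh) * | 2023-06-26 | 2023-07-28 | 北明成功软件(山东)有限公司 | BIM-based method for matching and locating market entities and building addresses |
CN116501827B (zh) * | 2023-06-26 | 2023-09-12 | 北明成功软件(山东)有限公司 | BIM-based method for matching and locating market entities and building addresses |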
Also Published As
Publication number | Publication date |
---|---|
CA3145918A1 (en) | 2021-02-04 |
CN110569322A (zh) | 2019-12-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
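WO2021017679A1 (zh) | Address information parsing method, apparatus, system and data acquisition method | |
US11816544B2 (en) | Composite machine learning system for label prediction and training data collection | |
CN106651057B (zh) | Mobile user age prediction method based on installation package sequence lists | |
CN106919957B (zh) | Method and device for processing data | |
CN110688536A (zh) | Label prediction method, apparatus, device and storage medium | |
CN111680506A (zh) | Foreign key mapping method and apparatus for database tables, electronic device and storage medium | |
JP2023536773A (ja) | Text quality evaluation model training method and text quality determination method, apparatus, electronic device, storage medium and computer program | |
CN113516417A (zh) | Business evaluation method and apparatus based on intelligent modeling, electronic device and medium | |
CN113628043B (zh) | Complaint validity judgment method, apparatus, device and medium based on data classification | |
CN117235608B (zh) | Risk detection method and apparatus, electronic device and storage medium | |
CN110019193B (zh) | Similar account identification method, apparatus, device, system and readable medium | |
CN113591881A (zh) | Intent recognition method and apparatus based on model fusion, electronic device and medium | |
CN111581197B (zh) | Method and device for sampling and verifying data tables in a data set | |
CN113177644A (zh) | Automatic modeling system based on word embedding and deep time-series models | |
CN114036921A (zh) | Policy information matching method and device | |
CN111738290A (zh) | Image detection method, model construction and training method, apparatus, device and medium | |
CN113705201B (zh) | Text-based event probability prediction and evaluation algorithm, electronic device and storage medium | |
CN116340781A (zh) | Similarity determination method, and similarity prediction model training method and device | |
CN117296064A (zh) | Interpretable artificial intelligence in computing environments | |
CN115309705A (zh) | Data integration and classification system for automatically identifying basic data elements of a city information model platform, and classification method thereof | |
CN114297235A (zh) | Risk address identification method and system, and electronic device | |
CN109919811B (zh) | Big-data-based method for generating insurance agent training plans, and related device | |
CN109308565B (zh) | Crowd performance level identification method and apparatus, storage medium and computer device | |
CN112800112A (zh) | Data processing system and data mining method | |
CN112765238A (zh) | Data processing system and data mining method |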
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20846832 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 3145918 Country of ref document: CA |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20846832 Country of ref document: EP Kind code of ref document: A1 |