WO2021017679A1 - Address information parsing method, device, system, and data acquisition method - Google Patents

Address information parsing method, device, system, and data acquisition method

Info

Publication number
WO2021017679A1
Authority
WO
WIPO (PCT)
Prior art keywords
address information
data
feature
geographic
geocoding
Prior art date
Application number
PCT/CN2020/096989
Other languages
English (en)
French (fr)
Inventor
李男一
徐亮
Original Assignee
苏宁易购集团股份有限公司
苏宁云计算有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏宁易购集团股份有限公司, 苏宁云计算有限公司
Priority to CA3145918A (CA3145918A1)
Publication of WO2021017679A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Definitions

  • This application relates to the field of address resolution, in particular to address information resolution methods, devices, systems, and data acquisition methods.
  • Modern retail companies generate massive amounts of sales data every day, and retail companies will analyze sales data as a basis for corporate decision-making or auxiliary decision-making.
  • Address data in the sales data is the basic data for smart retail analysis and decision-making.
  • Small-shop location decisions, logistics resource allocation, sales data analysis along the geographic dimension, and so on all rely on the analysis of address data in sales data, so the efficiency and accuracy of address data analysis are very important.
  • In the prior art, massive address data is parsed into standard geocodes using rule-based cleaning technology. Specifically, all standard administrative geographic data is first built into a dictionary library containing rules, and the geographic data in the original data is extracted by regular expressions. The extracted geographic data is then matched against the dictionary library to obtain geographic data in standard form. Finally, the geographic data is converted into geocodes locally and provided to various upper-level retail decision-making applications.
  • However, the address information in the sales data is mostly filled in manually by users and contains many irregularities, so some data cannot be converted into codes and the accuracy of the analysis results is low.
  • The present application provides an address information analysis method, device, system, and data acquisition method, which solve the prior-art problems of address resolution occupying a lot of resources and taking a long time.
  • a method for parsing address information includes:
  • the method further includes:
  • the historical address information analysis record includes historical address information and corresponding historical geocoding data
  • Using natural language processing technology to extract features from the address information to be resolved includes: if it has not been resolved, then using natural language processing technology to extract features from the address information to be resolved.
  • the method further includes:
  • the geographic location tree dictionary is formed according to administrative regions hierarchically divided;
  • the encoding the standard array to obtain the geocoding result includes encoding the completed standard array to obtain the geocoding result.
  • said encoding the standard array to obtain a geocoding result includes:
  • the method further includes the step of constructing the preset model in advance:
  • the sample feature vector is used as an input, and the corresponding sample array is used as an output, and a neural network and a conditional random field algorithm are used for training to obtain the preset model.
  • said using natural language processing technology to extract primary features of address data in said sample set and determine primary features that meet certain conditions as target features, and vectorizing said target features to obtain sample feature vectors includes:
  • the target feature is vectorized according to the weighting matrix to obtain a sample feature vector.
  • the method further includes: associating and storing the geocoding result with the original data.
  • the prediction model is deployed in a Spark computing engine, and the geocoding result is associated with the original data and stored in an Elasticsearch search engine.
  • Another aspect of the present application also provides a data acquisition method, the method includes
  • calculation is performed in the association table between the prestored geocoding result and the original data, and the geocoding result and the corresponding original data within the preset geographic range are obtained.
  • an address information parsing device which includes:
  • the address information obtaining unit to be resolved is used to obtain the address information to be resolved in the original data
  • the feature extraction unit is configured to extract features from the address information to be resolved using natural language processing technology, select the extracted features, and vectorize the selected features to obtain a feature vector;
  • the model prediction unit is configured to input the feature vector into a preset model to obtain an initial array including geographic entities and administrative division levels corresponding to the geographic entities; the preset model is trained based on a combination of recurrent neural networks and conditional random field algorithms;
  • the sorting unit is used to sort the geographic entities in the initial array according to the administrative division level to remove duplicates to obtain a standard array;
  • the geocoding unit is used to code the standard array to obtain a geocoding result.
  • Another aspect of this application provides a computer system, including:
  • one or more processors; and
  • a memory associated with the one or more processors where the memory is used to store program instructions, and when the program instructions are read and executed by the one or more processors, perform the following operations:
  • The technical solution of this application uses natural language processing technology to perform feature extraction and selection on address information and vectorizes the result to obtain the feature vector to be identified. The feature vector is then used as the model input to predict an initial array including geographic entities and their corresponding administrative division levels. After sorting and removing duplicates, geocoding is performed to obtain the analysis result.
  • This process eliminates the need to build a full dictionary library containing rules, reduces the occupation of hardware resources, and places lower requirements on the deployment environment.
  • Standard geographic data extraction is performed on massive address information through model prediction, which is not affected by the address information input format, adapts to various data changes, does not require human maintenance, and improves the extraction efficiency of geographic data.
  • The prediction model is optimized by the feature selection algorithm of this solution, which discards cluttered features that have low correlation with the administrative division level. As a result, geographic information is extracted more accurately than with traditional rule matching, and the model computes faster.
  • The address information encoding function can be encapsulated as a batch parsing interface and placed on an external independent server, so that it does not occupy the computing resources used for geographic data extraction and analysis, which improves encoding efficiency and makes data processing closer to real time.
  • The solution can also complete missing administrative geographic information in the address information to make the analysis result more accurate.
  • Figure 1 is a system structure diagram provided by an embodiment of the present application.
  • Figure 2 is a flow chart of specific address information analysis provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of an address resolution method provided by an embodiment of the present application.
  • Figure 4 is a structural diagram of an apparatus provided by an embodiment of the present application.
  • Fig. 5 is an architecture diagram of a computer system provided by an embodiment of the present application.
  • The purpose of this application is to provide an address information parsing method. The method uses natural language processing technology to extract features from address information, selects the features with high correlation, and vectorizes them to obtain feature vectors. A pre-built model then predicts, from the feature vectors, the geographic entities and their corresponding administrative division levels. These are sorted and de-duplicated to obtain geographic data in standard form, which is then geocoded to obtain coordinates, completing the analysis of the address information. Because feature extraction, selection, and vectorization are performed on the address information, only the features that correlate strongly with the administrative division level are kept, which speeds up the subsequent model prediction and improves its accuracy. At the same time, using model prediction removes the need to build a full dictionary library containing rules, reducing the occupation of hardware resources.
  • the system architecture diagram of this application includes an original data system, an address information processing system, and an encoding system that can exist independently of each other on hardware.
  • the original data system is used to provide the original data; it may be, for example, an external system or an OMS (order management system).
  • the address information processing system is used to obtain original data such as order information from the original data system, and perform a series of processing on the address information of the original data to obtain standard form of geographic data.
  • the encoding system is used to encode the geographic data in the standard form to obtain a geocoding result (usually coordinates).
  • the coding system is encapsulated with a batch parsing interface, and the address information processing system can complete the coding of the standard form of geographic data by calling the batch parsing interface of the coding system.
  • the address information processing system may also associate the geocoding result obtained from the coding system with the original data corresponding to the geocoding result and store it in the Elasticsearch search engine for subsequent searches for related data.
  • the address information processing system may also associate the resolved address information and the corresponding geocoding result as a historical resolution record and store it in the address resolution history table.
  • After the address information processing system obtains the address information, it first looks for a match in the address resolution history table. If the same address information is matched, the corresponding geocoding result is obtained directly without any subsequent processing, and the result does not need to be saved to the address resolution history table again. If the same address information cannot be matched, the address information is considered to be resolved for the first time: the address information processing system follows the normal processing procedure, works with the coding system to parse and encode the address information, and stores the resulting geocoding result in the address resolution history table.
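The history-table lookup described above behaves like a cache placed in front of the parsing pipeline. A minimal sketch follows, assuming the history table can be modeled as an in-memory dictionary keyed by the raw address string; the names `resolve_with_history` and `full_pipeline`, and the coordinate-pair result shape, are hypothetical and not taken from the application.

```python
from typing import Callable, Dict, Tuple

# Assumed result shape: a (longitude, latitude) pair.
Coordinates = Tuple[float, float]


def resolve_with_history(
    address: str,
    history: Dict[str, Coordinates],
    full_pipeline: Callable[[str], Coordinates],
) -> Coordinates:
    """Return a stored geocoding result if this address was resolved before."""
    cached = history.get(address)
    if cached is not None:
        # Same address seen before: reuse the stored result and do not re-save it.
        return cached
    # First-time address: run the normal parse-and-encode flow, then record it.
    result = full_pipeline(address)
    history[address] = result
    return result
```

In a deployment the dictionary would be replaced by whatever store holds the address resolution history table; the control flow stays the same.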
  • the original data system and the address information processing system can share the same server.
  • the encoding system can also share the same server with the address information processing system.
  • Alternatively, the encoding system is placed on an independent server, and the encoding task is completed through the encapsulated batch parsing interface. Because encoding then does not occupy the computing resources that the address information processing system uses for address information analysis, encoding efficiency improves and data processing is closer to real time.
  • the encoding system and the address information processing system are located on different servers, and the original data is order data as an example for description.
  • the address information processing system needs to first convert the address information into standard geographic data.
  • For example, suppose the address information is "Mr. Li, Binhai New District, No. 18, Xingang Road, Tianjin".
  • Because this address information contains non-geographic data, it needs to be converted into geographic data in standard form, namely "Tianjin
  • this application In order to convert the unprocessed address information into standard form of geographic data, this application first extracts the geographic entities in the address information and the administrative division levels corresponding to the geographic entities.
  • the geographical entities are Tianjin, Binhai, Tanggu, etc.
  • the administrative division levels are the country, province, urban area, county, etc.
  • With the rule-based approach, regular expressions are used to extract geographic entities and corresponding administrative division levels from strings that meet certain rules. This requires not only building a rule database, but also that the character strings representing addresses comply with certain rules. For strings that do not meet the rules, the extraction cannot be completed.
  • this application provides an administrative geographic entity relationship recognition model optimized by a feature selection algorithm, which uses natural language processing technology (NLP) to perform feature selection on address information, and calculates a feature vector.
  • the trained administrative geographic entity relationship recognition model is used to obtain the prediction result, which is a binary geographic entity relationship array (political relation) composed of geographic entities and their corresponding administrative division levels.
  • e1...en represents the identified geographic entity
  • t1...tn represents the administrative level
  • level classification is shown in Table 1.
  • the administrative level in the binary array can be replaced by the marker words in Table 1.
  • For example, the city level can be represented by CI.
  • Information that is neither a geographic entity nor an administrative division level is classified as redundant information. Repeated geographic information is also classified as redundant information.
  • The sorted binary array is matched against the tree dictionary to determine whether any geographic information is missing from the binary array.
  • a recursive method can be used to check and complete. For example, the geographic information of Tanggu Street is missing between Binhai New Area and Xingang No. 2 Road in the above binary array.
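The recursive check-and-complete step can be sketched as follows. This is an illustration only: it assumes the tree dictionary is available as a child-to-parent mapping and works on entity names alone rather than on the full (entity, level) pairs. The `PARENT` table and the function names are hypothetical; the sample entries follow the Tanggu Street example above.

```python
from typing import Dict, List, Optional

# Hypothetical child -> parent tree dictionary; a real one would cover all regions.
PARENT: Dict[str, Optional[str]] = {
    "Tianjin": None,
    "Binhai New Area": "Tianjin",
    "Tanggu Street": "Binhai New Area",
    "Xingang No. 2 Road": "Tanggu Street",
}


def ancestors_chain(entity: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Recursively build the path from the top-level region down to `entity`."""
    up = parent.get(entity)
    if up is None:
        # Top-level region, or an entity the dictionary does not know about.
        return [entity]
    return ancestors_chain(up, parent) + [entity]


def complete(entities: List[str], parent: Dict[str, Optional[str]]) -> List[str]:
    """Fill in missing intermediate levels of a sorted entity list."""
    completed: List[str] = []
    for entity in entities:
        for node in ancestors_chain(entity, parent):
            if node not in completed:
                completed.append(node)
    return completed
```

Calling `complete(["Tianjin", "Binhai New Area", "Xingang No. 2 Road"], PARENT)` inserts "Tanggu Street" between the last two entries.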
  • the aforementioned coding technique can be used to encode the geographic data to obtain the geocoding result.
  • this application provides an administrative geographic entity relationship recognition model optimized by feature selection algorithm. Next, the construction and training process of this model will be described:
  • the first is to use natural language processing technology (NLP) to extract and select the features of the sample address information, and calculate the sample feature vector. Specific steps are as follows:
  • This application can divide the original address information corpus obtained from the original data system into three categories: data whose coordinates cannot be obtained by the coordinate analysis program, data whose obtained coordinates are incorrect, and data whose coordinates are obtained correctly. An equal share of each category is then selected from the original address information corpus as the basic corpus. Next, word segmentation is performed on the selected corpus, and for each segment the sample geographic entity and its corresponding administrative division (administrative geographic identifier) are labeled. A certain proportion of the labeled data is randomly selected for model training, and a certain proportion is reserved for model verification.
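The corpus selection and split described above can be sketched as below. The sketch assumes the corpus is already grouped by category and skips the word segmentation and labeling that the text places between selection and splitting. The function name, the per-category count, and the split ratio are hypothetical parameters; the application does not state concrete values.

```python
import random
from typing import Dict, List, Tuple


def build_corpus(
    by_category: Dict[str, List[str]],
    per_category: int,
    train_ratio: float,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Pick an equal share from each category, then split into train/validation."""
    rng = random.Random(seed)
    selected: List[str] = []
    for addresses in by_category.values():
        # Equal share from each category (fewer if the category is small).
        k = min(per_category, len(addresses))
        selected.extend(rng.sample(addresses, k))
    rng.shuffle(selected)
    cut = int(len(selected) * train_ratio)
    return selected[:cut], selected[cut:]
```

Fixing the seed makes the split reproducible, which helps when comparing models trained on the same corpus.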
  • Feature frequency FC represents the number of times the feature appears in the address information text, as in formula (1), where Ni represents the total number of features appearing in the address information.
  • EX_ik is the number of texts in which the feature pw appears at levels other than the geographic administrative division level t;
  • UN_ik is the number of texts at the geographic administrative division level t in which the feature pw does not appear;
  • S is the total number of entity texts across all administrative geographic entity classifications.
  • each selected target feature will get x correlation degrees, and the average of these x correlation degrees is taken as the weight of each word.
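The averaging step can be written out directly. This sketch assumes the x correlation degrees of each feature have already been computed (the formulas that produce them are not reproduced in this text) and only shows how they collapse into one weight per feature; `feature_weights` is a hypothetical name.

```python
from statistics import mean
from typing import Dict, List


def feature_weights(correlations: Dict[str, List[float]]) -> Dict[str, float]:
    """Average each feature's correlation degrees into a single weight."""
    return {feature: mean(values) for feature, values in correlations.items()}
```

For instance, a feature with correlation degrees `[1.0, 0.5, 0.0]` receives the weight `0.5`.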
  • the weighted normalized sample feature vector is constructed by formula (2)(5)(6)(7)(8) as shown in formula (9):
  • the obtained sample feature vector v is used as the vectorized input for model training. The vectorized training corpus is trained with neural network and conditional random field algorithms, such as an RNN (recurrent neural network) and a CRF (conditional random field), to obtain the administrative geographic entity relationship recognition model.
  • the final output of the model is a binary geographic entity relationship group as follows:
  • The selected target features correlate strongly with the administrative division level, while cluttered features with low correlation are discarded. This reduces the adverse effect of cluttered features on the results and reduces the amount of data input to the model. Because of this feature selection, the model input is not messy address information but feature vectors after selection and optimization, which improves the correlation between the input parameters and the geographic entities and their administrative divisions, speeds up model computation, and improves the accuracy of the recognition result.
  • this method solves the problem of low-quality geographic data, increases the effective resolution rate of address resolution, and provides a more accurate data basis for upper-level decision-making:
  • R represents the set of records where the address resolution has obtained the correct coordinates
  • G(wr) i represents a certain type of analysis error result set i
  • the main error type is the deviation of the analysis coordinates
  • T represents the total number of addresses that need to be resolved
  • S represents the set of records for which the address was successfully resolved and coordinates were obtained.
  • E represents the set of failed records for which no coordinates were obtained after address resolution.
  • the correct rate of the final address resolution is shown in equation (10)
  • the resolution rate is shown in equation (11)
  • the effective resolution rate is shown in equation (12).
  • the feature selection algorithm is used to optimize the administrative geographic entity relationship recognition model.
  • the accuracy of extracting geographic information is higher than that of traditional rule matching, and the extracted geographic data is more correct.
  • The following is a specific implementation of Embodiment 1 of this application:
  • the analysis task cluster is based on Spark and uses Java to develop data processing tasks, implementing task scheduling and distribution.
  • the pre-trained administrative geographic entity relationship recognition model is deployed in the analysis task cluster to identify the administrative division level and geographic entity relationship of low-quality address information, and extract effective information.
  • the core administrative geographic entity relationship recognition model is implemented in Python. It is trained with an RNN (recurrent neural network) and a CRF (conditional random field), embeds the administrative geographic entity feature optimization algorithm, and reduces the noise introduced by human-entered interference information. The administrative hierarchical sorting algorithm is then used to sort and reorganize the administrative geographic entities, and the tree dictionary constructed as described above is used to check the data, yielding standard geographic data and providing address information of improved quality for subsequent coding.
  • the geocoding function can be scheduled concurrently in the Spark task cluster. A RESTful HTTP batch address resolution interface developed in Java is used to encode the address information completed after model extraction, yielding standard geocoding information.
  • concurrent task scheduling can be used, and a single batch submission method is used to perform batch parsing and encoding of data to improve parsing and encoding throughput without increasing cluster pressure.
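The single-batch submission idea can be illustrated with a small chunking helper. This is a sequential sketch under stated assumptions: `encode_batch` is a hypothetical stand-in for a call to the batch resolution interface, and the concurrency that the Spark cluster would provide is left out.

```python
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def encode_all(
    addresses: Sequence[str],
    encode_batch: Callable[[List[str]], List[object]],
    batch_size: int,
) -> List[object]:
    """Submit addresses batch by batch and collect results in input order."""
    results: List[object] = []
    for chunk in batches(addresses, batch_size):
        # One request per chunk instead of one request per address.
        results.extend(encode_batch(chunk))
    return results
```

Sending one request per chunk rather than per address reduces the number of round trips, which is the throughput benefit the text refers to.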
  • Embodiment 2 of the present application provides an address information resolution method. As shown in FIG. 3, the method includes:
  • S32: use natural language processing technology to perform feature extraction and selection on the address information to be parsed, and vectorize the selected features to obtain the feature vector to be recognized. For the specific method, refer to the feature extraction, selection, and vectorization steps described for model training.
  • S33: input the feature vector to be recognized into a preset model to obtain an initial array including geographic entities and the administrative division levels corresponding to the geographic entities;
  • S34: sort the geographic entities in the initial array according to the administrative division level and remove duplicates to obtain a standard array;
  • S35: encode the standard array to obtain a geocoding result.
  • the encoding interface of the external server can be called to encode the standard array to obtain the geocoding result.
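Step S34 (sorting by administrative division level and removing duplicates) can be sketched as follows. The level names and their ordering in `LEVEL_RANK` are hypothetical placeholders, since Table 1 with the actual marker words is not reproduced in this text, and the initial array is modeled as a list of (entity, level) pairs.

```python
from typing import Dict, List, Tuple

# Hypothetical ranking of administrative levels, from broadest to narrowest.
LEVEL_RANK: Dict[str, int] = {
    "province": 0,
    "city": 1,
    "district": 2,
    "street": 3,
    "road": 4,
}


def to_standard_array(initial: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop repeated (entity, level) pairs, then order them by level."""
    seen = set()
    unique: List[Tuple[str, str]] = []
    for pair in initial:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    # sorted() is stable, so entities at the same level keep their input order.
    # Unknown levels are placed last.
    return sorted(unique, key=lambda pair: LEVEL_RANK.get(pair[1], len(LEVEL_RANK)))
```

The result is the standard array that step S35 passes on for encoding.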
  • the method further includes:
  • the historical address information analysis record includes historical address information and corresponding historical geocoding data
  • feature extraction is performed on the address information to be parsed using natural language processing technology.
  • the method further includes:
  • the geographic location tree dictionary is formed according to administrative regions hierarchically divided;
  • the encoding the standard array to obtain the geocoding result includes encoding the completed standard array to obtain the geocoding result.
  • the method of this application also includes the step of pre-constructing the preset model:
  • the sample feature vector is used as an input, and the corresponding sample array is used as an output, and a neural network and a conditional random field algorithm are used for training to obtain the preset model.
  • said using natural language processing technology to extract primary features of address data in said sample set and determine primary features that meet certain conditions as target features, and vectorizing said target features to obtain sample feature vectors includes:
  • the target feature is vectorized according to the weighting matrix to obtain a sample feature vector.
  • the aforementioned geocoding result can be combined with other data to provide a data basis for subsequent application decision-making. For this reason, in this application, the aforementioned geocoding result can be associated and stored with the original data corresponding to the result.
  • the geocoding result can be stored in association with the corresponding original data, and then the product sales in a certain geographic location can be obtained.
  • the associated information can be stored in the Elasticsearch search engine.
  • Embodiment 3 of this application provides a data acquisition method, including:
  • calculation is performed in the association table between the prestored geocoding result and the original data, and the geocoding result and the corresponding original data within the preset geographic range are obtained.
  • the original data within a certain geographic range can be obtained by using the geocoding result, which provides a data basis for subsequent sales and promotion decisions.
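A range query over the association table can be sketched as below. The sketch makes several assumptions that are not in the application: the table is an in-memory list of rows rather than an Elasticsearch index, each row stores its geocoding result under a hypothetical `coords` key as (latitude, longitude) in degrees, and the preset geographic range is a circle given by a center and a radius. Distance is computed with the standard haversine formula.

```python
import math
from typing import Any, Dict, List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def within_range(
    table: List[Dict[str, Any]],
    center: Tuple[float, float],
    radius_km: float,
) -> List[Dict[str, Any]]:
    """Return the rows whose geocoding result lies inside the circular range."""
    return [row for row in table if haversine_km(row["coords"], center) <= radius_km]
```

Each returned row carries both the geocoding result and the original data associated with it, so downstream analysis can work on the orders that fall inside the range.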
  • the fourth embodiment of the present invention provides an address information parsing device. As shown in FIG. 4, the device includes:
  • the to-be-resolved address information obtaining unit 41 is configured to obtain the to-be-resolved address information in the original data
  • the first feature vectorization unit 42 is configured to use natural language processing technology for feature extraction and selection and vectorization of the address information to be parsed to obtain a feature vector;
  • the model prediction unit 43 is configured to input the feature vector into a preset model to obtain an initial array including geographic entities and administrative division levels corresponding to the geographic entities; the preset model is trained based on a combination of recurrent neural networks and conditional random field algorithms;
  • the sorting unit 44 is configured to sort the geographic entities in the initial array and remove duplicates according to the administrative division level to obtain a standard array;
  • the geocoding unit 45 is configured to code the standard array to obtain a geocoding result.
  • the device further includes:
  • the resolution record judging unit 46 is connected to the to-be-resolved address information obtaining unit 41, and is used to judge, according to the pre-stored historical address information resolution records, whether the address information to be resolved has already been resolved; a historical address information resolution record includes historical address information and the corresponding historical geocoding data;
  • the resolution record obtaining unit 47 is connected to the resolution record judging unit 46, and is used to obtain the corresponding historical geocoding data as the geocoding result when it is determined that the address information to be resolved has already been resolved.
  • the first feature vectorization unit 42 is specifically configured to perform feature extraction on the address information to be resolved using natural language processing technology when it is determined that the address information to be resolved has not been parsed.
  • the device further includes:
  • the completion unit 48 is configured to match the standard array obtained by the sorting unit 44 against a pre-stored geographic location tree dictionary, determine whether the standard array has missing entries, and, if so, complete the standard array based on the geographic location tree dictionary; the geographic location tree dictionary is formed according to the hierarchical division of administrative regions;
  • the geocoding unit 45 is specifically configured to code the completed standard array to obtain a geocoding result.
  • the device of this application also includes a unit for pre-building the preset model, specifically including
  • the second feature vectorization unit is used to extract features from the address data in the sample set using natural language processing technology and perform feature selection, and vectorize the selected features to obtain the sample feature vector; for the specific process of this step, please refer to the first embodiment Related description in.
  • the second feature vectorization unit and the first feature vectorization unit may be the same or different.
  • the sample administrative entity relationship unit is used to label the address data in the sample set to obtain a sample array consisting of sample geographic entities and sample administrative division levels corresponding to the sample geographic entities;
  • the model training unit is configured to use the sample feature vector as input and the sample array as output, and train through an RNN (recurrent neural network) and a CRF (conditional random field) algorithm to construct the preset model.
  • the aforementioned geocoding result can be combined with other data to provide a data basis for subsequent application decision-making.
  • the aforementioned device of the present application further includes an associated storage unit for associating and storing the aforementioned geocoding result with the original data corresponding to the result.
  • the geocoding result can be stored in association with the corresponding original data, and then the product sales in a certain geographic location can be obtained.
  • the associated information can be stored in the Elasticsearch search engine.
  • Embodiment 5 of the present application provides a computer system, including:
  • one or more processors; and
  • a memory associated with the one or more processors where the memory is used to store program instructions, and when the program instructions are read and executed by the one or more processors, perform the following operations:
  • FIG. 5 exemplarily shows the architecture of the computer system, which may specifically include a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520.
  • the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520 may be communicatively connected through the communication bus 1530.
  • the processor 1510 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute the relevant programs to realize the technical solutions provided in this application.
  • the memory 1520 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc.
  • the memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500 and a basic input output system (BIOS) for controlling the low-level operation of the computer system 1500.
  • web browser 1523, data storage management system 1524, and icon font processing system 1525 can also be stored.
  • the aforementioned icon font processing system 1525 may be an application program that specifically implements the foregoing steps in the embodiment of the present application. In short, when the technical solution provided by the present application is implemented through software or firmware, the related program code is stored in the memory 1520 and is called and executed by the processor 1510.
  • the input/output interface 1513 is used to connect the input/output module to realize information input and output.
  • the input/output module can be configured in the device as a component (not shown in the figure), or can be connected to the device externally to provide the corresponding functions.
  • the input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and an output device may include a display, a speaker, a vibrator, an indicator light, and the like.
  • the network interface 1514 is used to connect a communication module (not shown in the figure) to realize the communication interaction between the device and other devices.
  • the communication module can realize communication through wired means (such as USB, network cable, etc.), or through wireless means (such as mobile network, WIFI, Bluetooth, etc.).
  • the bus 1530 includes a path for transmitting information between various components of the device (such as the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520).
  • the computer system 1500 can also obtain information about specific receiving conditions from the virtual resource object receiving condition information database 1541 for condition judgment, and so on.
  • although the above device only shows the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, etc., in a specific implementation the device may also include other components necessary for normal operation.
  • the above-mentioned device may also include only the components necessary for implementing the solution of the present application, and not necessarily all the components shown in the figure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Remote Sensing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

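An address information parsing method, apparatus, and system, and a data acquisition method. The address information parsing method includes: obtaining the address information to be parsed from original data; extracting features from the address information to be parsed using natural language processing techniques, selecting among the extracted features, and vectorizing the selected features to obtain a feature vector; inputting the feature vector into a preset model to obtain an initial array comprising geographic entities and the administrative division level corresponding to each geographic entity; sorting and deduplicating the geographic entities in the initial array by administrative division level to obtain a standard array; and encoding the standard array to obtain a geocoding result. Because the geographic entities and administrative divisions in the address information are recognized by a model, no rule base needs to be built and fewer resources are occupied. The prediction model is optimized by a feature selection algorithm, which improves prediction accuracy and computation speed.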

Description

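Address information parsing method, apparatus, system and data acquisition method
Technical Field
This application relates to the field of address parsing, and in particular to an address information parsing method, apparatus, and system, and a data acquisition method.
Background
Modern retail enterprises generate massive amounts of sales data every day, and they parse that data as a basis for making or supporting business decisions. The address data within the sales data is especially important: it is the foundational data for smart-retail analysis and decision-making. Store site selection, allocation of logistics resources, and analysis of sales by geography all depend on parsing the address data in the sales data, so that parsing must be efficient and accurate.
At present, massive address data is parsed into standard geocodes using rule-based cleaning. Specifically, all standard administrative geographic data is first built into a dictionary library containing rules; regular expressions are then used to extract the geographic data from the original data; the extracted geographic data is matched against the dictionary library to obtain geographic data in standard form; finally, the geographic data is converted into geocodes locally and provided to the various upper-layer retail decision applications.
However, this approach requires building all standard administrative geographic data into a rule-bearing dictionary library, which consumes a large amount of hardware resources. Because the volume of sales data is huge, parsing also takes a long time.
In addition, most of the address information in sales data is filled in manually by users and is often non-standard, so some data cannot be converted into codes and the accuracy of the parsed results is low.
The same problems arise in address data parsing in other business domains.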
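Summary
This application provides an address information parsing method, apparatus, and system, and a data acquisition method, to solve the problems in the prior art that address parsing occupies many resources and takes a long time.
This application provides the following solutions:
In one aspect, an address information parsing method is provided, the method including:
obtaining the address information to be parsed from original data;
extracting features from the address information to be parsed using natural language processing techniques, selecting among the extracted features, and vectorizing the selected features to obtain a feature vector to be recognized;
inputting the feature vector to be recognized into a preset model to obtain an initial array comprising geographic entities and the administrative division level corresponding to each geographic entity;
sorting and deduplicating the geographic entities in the initial array by administrative division level to obtain a standard array;
encoding the standard array to obtain a geocoding result.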
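Preferably, before features are extracted from the address information to be parsed using natural language processing techniques, the method further includes:
determining, from pre-stored historical address parsing records, whether the address information to be parsed has been parsed before; the historical address parsing records include historical address information and the corresponding historical geocoding data;
if it has been parsed before, obtaining the corresponding historical geocoding data as the geocoding result;
extracting features from the address information to be parsed using natural language processing techniques includes: if it has not been parsed before, extracting features from the address information to be parsed using natural language processing techniques.
Preferably, before the standard array is encoded to obtain the geocoding result, the method further includes:
matching the standard array against a pre-stored geographic location tree dictionary to determine whether anything is missing from the standard array; the geographic location tree dictionary is formed by dividing administrative regions level by level;
if something is missing, completing the standard array according to the geographic location tree dictionary;
encoding the standard array to obtain the geocoding result includes encoding the completed standard array to obtain the geocoding result.
Preferably, encoding the standard array to obtain the geocoding result includes:
calling the encoding interface of an external server to encode the standard array and obtain the geocoding result.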
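Preferably, the method further includes a step of constructing the preset model in advance:
annotating the address data in a sample set to obtain sample arrays labeled with sample geographic entities and the administrative division corresponding to each sample geographic entity;
extracting primary features of the address data in the sample set using natural language processing techniques, determining the primary features that satisfy certain conditions as target features, and vectorizing the target features to obtain sample feature vectors;
taking the sample feature vectors as input and the corresponding sample arrays as output, and training with a neural network and a conditional random field algorithm to obtain the preset model.
Preferably, extracting the primary features of the address data in the sample set using natural language processing techniques, determining the primary features that satisfy certain conditions as target features, and vectorizing the target features to obtain sample feature vectors includes:
calculating the frequency with which each extracted primary feature appears in the address text;
calculating, from the frequency, the relevance of each primary feature to each administrative division level as the feature weight;
selecting as the target features the primary features whose relevance and/or frequency satisfy a preset condition;
calculating the relevance of each selected target feature to each administrative division level, taking the average relevance of each target feature as its weight, and constructing a weighting matrix from the weights;
vectorizing the target features according to the weighting matrix to obtain the sample feature vectors.
Preferably, the method further includes: storing the geocoding result in association with the original data.
Preferably, the prediction model is deployed in the Spark computing engine, and the geocoding result is stored in association with the original data in the Elasticsearch search engine.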
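In another aspect, this application provides a data acquisition method, the method including:
receiving candidate address information;
parsing the candidate address information according to the method described above to obtain parsed candidate geocoding data;
performing a calculation, based on the candidate geocoding data and a preset geographic range, on a pre-stored association table of geocoding results and original data, to obtain the geocoding results within the preset geographic range and the corresponding original data.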
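In yet another aspect, this application provides an address information parsing apparatus, the apparatus including:
a to-be-parsed address information obtaining unit, configured to obtain the address information to be parsed from original data;
a feature extraction unit, configured to extract features from the address information to be parsed using natural language processing techniques, select among the extracted features, and vectorize the selected features to obtain a feature vector;
a model prediction unit, configured to input the feature vector into a preset model to obtain an initial array comprising geographic entities and the administrative division level corresponding to each geographic entity; the preset model is trained by combining a recurrent neural network with a conditional random field algorithm;
a sorting unit, configured to sort and deduplicate the geographic entities in the initial array by administrative division level to obtain a standard array;
a geocoding unit, configured to encode the standard array to obtain a geocoding result.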
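In a further aspect, this application provides a computer system, including:
one or more processors; and
a memory associated with the one or more processors, the memory being used to store program instructions which, when read and executed by the one or more processors, perform the following operations:
obtaining the address information to be parsed from original data;
extracting features from the address information to be parsed using natural language processing techniques, selecting among the extracted features, and vectorizing the selected features to obtain a feature vector to be recognized;
inputting the feature vector to be recognized into a preset model to obtain an initial array comprising geographic entities and the administrative division level corresponding to each geographic entity;
sorting and deduplicating the geographic entities in the initial array by administrative division level to obtain a standard array;
encoding the standard array to obtain a geocoding result.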
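According to the specific embodiments provided in this application, this application discloses the following technical effects:
In the technical solution of this application, features are extracted from the address information by natural language processing techniques, selected, and vectorized to obtain the feature vector to be recognized; that vector is then used as the model input to predict an initial array comprising geographic entities and their administrative division levels; the array is then sorted, deduplicated, and geocoded to obtain the parsing result. This process does not require building a full dictionary library containing rules, which reduces hardware resource usage and lowers the requirements on the deployment environment. Extracting standard geographic data from massive address information by model prediction is not affected by the format in which the address information was entered, adapts to data variations, requires no manual maintenance, and improves the efficiency of geographic data extraction. Furthermore, because the prediction model optimized by the feature selection algorithm of this solution discards noisy features that have low relevance to administrative division levels, it extracts geographic information more accurately than traditional rule matching and also computes faster.
Furthermore, the address encoding function can be packaged as a batch parsing interface hosted on an external, independent server, so that it does not occupy the computing resources used for analyzing and extracting geographic data; this improves encoding efficiency and makes data processing closer to real time. In addition, the solution can fill in administrative geographic information missing from the address information, making the parsing result more accurate.
Of course, a product implementing this application does not necessarily need to achieve all of the above advantages at the same time.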
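Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a system structure diagram provided by an embodiment of this application;
FIG. 2 is a flowchart of a specific address information parsing example provided by an embodiment of this application;
FIG. 3 is a flowchart of the address parsing method provided by an embodiment of this application;
FIG. 4 is a structure diagram of the apparatus provided by an embodiment of this application;
FIG. 5 is an architecture diagram of the computer system provided by an embodiment of this application.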
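Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application fall within the scope of protection of this application.
This application aims to provide an address information parsing method. Features are extracted from the address information by natural language processing techniques, and the features with high relevance are selected and vectorized to obtain a feature vector. A pre-built model uses the feature vector to predict the geographic entities and their administrative division levels; these are then sorted and deduplicated to obtain geographic data in standard form, which is geocoded into coordinates, completing the parsing of the address information. Because the address information undergoes feature extraction, selection, and vectorization, and only features highly relevant to administrative division levels are kept, the subsequent model predicts faster and more accurately. Using model prediction also removes the need to build a full dictionary library containing rules, which reduces hardware resource usage.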
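Embodiment 1
FIG. 1 shows the system architecture of this application, which comprises an original data system, an address information processing system, and an encoding system that can exist independently of one another in hardware. The original data system provides the original data; it may be, for example, an external system or an OMS (order management system). The address information processing system obtains original data such as order information from the original data system and performs a series of processing steps on the address information in it to obtain geographic data in standard form. The encoding system encodes the standard-form geographic data to obtain the geocoding result (usually coordinates). A batch parsing interface is packaged in the encoding system, and the address information processing system can encode the standard-form geographic data by calling that interface.
The address information processing system can also associate the geocoding result obtained from the encoding system with the original data corresponding to it and store the pair in the Elasticsearch search engine for later searches over the related data.
As shown in FIG. 1, the address information processing system can also associate address information that has already been parsed with the corresponding geocoding result and store the pair as a historical parsing record in an address parsing history table. When the address information processing system receives address information, it first looks for a match in the address parsing history table. If identical address information is found, the corresponding geocoding result is obtained directly; no further processing is needed, and the result is not written to the history table again. If no identical address information is found, the address information is considered to be parsed for the first time; the address information processing system follows the normal processing flow, working with the encoding system to parse and encode the address information, and stores the resulting geocoding result in the address parsing history table.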
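The history-table lookup described above can be sketched as a simple cache in front of the full parsing flow. This is a minimal illustration, not the patent's implementation: the history table is modeled as an in-memory `dict`, and `parse_fn` is a hypothetical callable standing in for the model prediction plus encoding steps.

```python
from typing import Callable, Dict


def parse_with_history(address: str,
                       history: Dict[str, str],
                       parse_fn: Callable[[str], str]) -> str:
    """Return the geocoding result, reusing the history table when possible."""
    cached = history.get(address)
    if cached is not None:
        # Already parsed before: reuse the stored result and do not
        # write it to the history table again.
        return cached
    # First time this address is seen: run the normal parsing flow
    # (hypothetical parse_fn) and record the result.
    result = parse_fn(address)
    history[address] = result
    return result
```

In this sketch the match is an exact string comparison, which mirrors the "identical address information" condition in the text; any fuzzier matching would be a separate design decision.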
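In the system structure of another embodiment, the original data system may share a server with the address information processing system, and the encoding system may also share a server with the address information processing system. By comparison, however, placing the encoding system on an independent server and completing the encoding task through a packaged batch parsing interface does not occupy the computing resources that the address information processing system uses to analyze and extract address information; this improves encoding efficiency and makes data processing closer to real time.
The following embodiments are described using the example in which the encoding system and the address information processing system are placed on different servers and the original data is order data.
Order data contains fields that represent different attributes of the information, such as the orderer, price, and address; these fields make it possible to locate the address information quickly. Because most address information in the original data is filled in by hand, it contains various errors and irregularities, so the address information processing system first needs to convert it into geographic data in standard form. For example, the address information “天津新港二号路18号滨海新区李先生” contains non-geographic data (the name 李先生, Mr. Li), so it needs to be converted into the standard form “天津市|滨海新区|塘沽街道|新港二号路18号”.
To convert unprocessed address information into standard-form geographic data, this application first extracts the geographic entities in the address information and the administrative division level of each. Geographic entities are names such as 天津 (Tianjin), 滨海 (Binhai), and 塘沽 (Tanggu); administrative division levels are levels such as country, province, city, and county. As noted for the prior art, regular expressions are used to extract geographic entities and their administrative division levels from strings that conform to certain rules; this requires both building a rule base and having the address strings conform to certain rules, and extraction fails for strings that do not. To solve this problem, this application provides an administrative geographic entity relationship recognition model optimized by a feature selection algorithm. Natural language processing (NLP) techniques are used to select features from the address information and compute a feature vector. With the feature vector as input, the trained recognition model produces the prediction result: an array of pairs, each consisting of a geographic entity and its administrative division level, called political relation, as follows: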
political relation=[(e1,t1),(e2,t2),...(en,tn)]
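Here e1…en denote the recognized geographic entities and t1…tn denote their administrative levels. The level classification is given in Table 1, and the administrative level in each pair can be replaced by the corresponding marker from Table 1; for example, a city can be denoted CI. Information that is neither a geographic entity nor an administrative division level is classed as redundant information, as is repeated geographic information.
Table 1
Figure PCTCN2020096989-appb-000001
As shown in FIG. 2, taking the address information “天津新港二号路18号滨海新区李先生谢谢合作” as an example, the model prediction step yields: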
[(‘天津’,‘CI’),(‘新港二号路18号’,‘RO’),(‘滨海新区’,‘AR’),(‘李先生’,‘OT’),(‘谢谢’,‘OT’),(‘合作’,‘OT’)]
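Clearly, the array obtained above still has several problems:
1. Some geographic entities are missing. For example, the street-level information between 滨海新区 (Binhai New Area) and 新港二号路 (Xingang 2nd Road) is absent.
2. There is a lot of redundant information. Note that if the same geographic information appears several times in the address, only one occurrence is kept and the repeats are also treated as redundant information.
To solve these two problems, we build the national administrative geographic information into a tree dictionary, following the order of administrative division levels and treating each administrative division level and each geographic entity at that level as a node.
The array predicted by the model is sorted and deduplicated: redundancy is removed and the entries are ordered by administrative division level, giving a new array that is a standard address. Specifically, the levels are category-coded according to the administrative level standard CO>PR>CI>AR>ST>RO>BU and the entries are arranged in ascending order of the codes; information that corresponds to no administrative division level, and repeated geographic information, is removed as redundant. After sorting and deduplication, the array above becomes, as shown in FIG. 2: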
[(‘天津’,‘CI’),(‘滨海新区’,‘AR’),(‘新港二号路18号’,‘RO’)]
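The sorting and deduplication step can be sketched as follows. This is a minimal illustration under stated assumptions: the level order is taken from the standard CO>PR>CI>AR>ST>RO>BU given in the text, any marker outside that list (such as OT) is treated as redundant, and a repeat is taken to mean an identical (entity, level) pair. The function name is hypothetical.

```python
# Administrative level order from the text; a lower index is a higher level.
LEVEL_ORDER = ["CO", "PR", "CI", "AR", "ST", "RO", "BU"]
RANK = {code: i for i, code in enumerate(LEVEL_ORDER)}


def sort_and_dedupe(pairs):
    """Drop redundant entries, then order the rest by administrative level."""
    seen = set()
    kept = []
    for entity, level in pairs:
        if level not in RANK:
            # No administrative level (e.g. 'OT'): redundant information.
            continue
        if (entity, level) in seen:
            # Repeated geographic information: keep only the first copy.
            continue
        seen.add((entity, level))
        kept.append((entity, level))
    # sorted() is stable, so entries at the same level keep their order.
    return sorted(kept, key=lambda p: RANK[p[1]])


predicted = [("天津", "CI"), ("新港二号路18号", "RO"), ("滨海新区", "AR"),
             ("李先生", "OT"), ("谢谢", "OT"), ("合作", "OT")]
standard = sort_and_dedupe(predicted)
```

Applied to the model output from the example, `standard` holds the three geographic pairs in the order city, area, road.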
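The sorted and deduplicated array is then matched against the tree dictionary to determine whether any geographic information is missing; a recursive method can be used to find and fill the gaps. In the array above, for example, the entry 塘沽街道 (Tanggu Subdistrict) is missing between 滨海新区 and 新港二号路.
If geographic information is missing, the array is completed according to the tree dictionary, yielding geographic data in standard form, as shown in FIG. 2: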
[(‘天津’,‘CI’),(‘滨海新区’,‘AR’),(‘塘沽街道’,‘ST’),(‘新港二号路18号’,‘RO’)]
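The recursive gap-filling against the tree dictionary can be sketched as below. Everything about the data layout is an assumption made for illustration: the tree is a nested `dict` with hypothetical `name`, `level`, and `children` keys, the toy tree contains only the branch from the example, and an entity is matched to a node when the entity string starts with the node name (so that 新港二号路18号 matches the road node 新港二号路). Levels missing above the first entity are not filled in this sketch.

```python
# Toy tree dictionary holding only the branch used in the example.
TREE = {"name": "天津", "level": "CI", "children": [
    {"name": "滨海新区", "level": "AR", "children": [
        {"name": "塘沽街道", "level": "ST", "children": [
            {"name": "新港二号路", "level": "RO", "children": []}]}]}]}


def find_path(node, target):
    """Recursively return the nodes from `node` down to the one matching `target`."""
    if target.startswith(node["name"]):
        return [node]
    for child in node.get("children", []):
        sub = find_path(child, target)
        if sub is not None:
            return [node] + sub
    return None


def complete(pairs, root):
    """Insert the tree nodes that lie between consecutive entities in `pairs`."""
    if not pairs:
        return []
    start = find_path(root, pairs[0][0])
    if start is None:
        return list(pairs)  # first entity unknown to the tree: leave unchanged
    result = [pairs[0]]
    current = start[-1]
    for entity, level in pairs[1:]:
        path = find_path(current, entity)
        if path is None:
            result.append((entity, level))  # not under the current node: keep as is
            continue
        # Nodes strictly between the previous entity and this one are the gaps.
        result.extend((mid["name"], mid["level"]) for mid in path[1:-1])
        result.append((entity, level))
        current = path[-1]
    return result


completed = complete(
    [("天津", "CI"), ("滨海新区", "AR"), ("新港二号路18号", "RO")], TREE)
```

With the toy tree, `completed` gains the missing street-level entry between the area and the road, matching the completed array in the text.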
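Once the geographic data in standard form is obtained, it can be encoded using the encoding technique described above to obtain the geocoding result.
As mentioned above, this application provides an administrative geographic entity relationship recognition model optimized by a feature selection algorithm. The construction and training of the model are described next:
First, natural language processing (NLP) techniques are used to extract and select features from the sample address information, and sample feature vectors are computed. The specific steps are as follows:
1. Build a sample set of address information corpus. The corpus can be obtained from the original data system in FIG. 1. To further improve accuracy, the raw address corpus obtained from the original data system can be divided into three classes: data for which the coordinate parsing program cannot obtain a coordinate code, data for which the obtained coordinates are incorrect, and data for which the coordinates are obtained correctly. Equal shares of each class are then sampled from the raw corpus to form the base corpus. The sampled corpus is segmented into words, and each word is labeled with its sample geographic entity and the corresponding administrative division (administrative geographic marker). A certain proportion of the labeled data is randomly selected for model training, and a certain proportion is reserved for model validation.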
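2. Feature extraction and selection:
2.1 Extract features from the labeled address data used for model training. Then, for each geographic administrative division level, recompute the feature frequency FC of the extracted features as in equation (1), where N_ik denotes the number of times the feature appears in the address information text and N_i denotes the total number of features appearing in the address information.
Figure PCTCN2020096989-appb-000002
2.2 Compute the relevance between each feature pw and each geographic administrative division level t to obtain the feature weight W, as in equation (2):
Figure PCTCN2020096989-appb-000003
where EX_ik is the number of texts in which feature pw appears in levels other than geographic administrative division level t; UN_ik is the number of texts in level t in which feature pw does not appear; and S is the total number of geographic entity texts across all administrative entity classes.
2.3 Compute the average weight W_avg and the average feature frequency FC_avg as in equations (3) and (4), where FN denotes the total number of feature types. A feature is selected as a target feature when its weight satisfies W > W_avg, or when W < W_avg and FC > FC_avg.
Figure PCTCN2020096989-appb-000004
Figure PCTCN2020096989-appb-000005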
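The selection rule in step 2.3 can be sketched in a few lines. Because equations (1) to (4) are only available as images here, two assumptions are made and should be read as such: the feature frequency is taken to be FC = N_ik / N_i (occurrences of the feature divided by the total number of feature occurrences), and the averages are plain arithmetic means over the feature types. The weights W are passed in as given, since equation (2) is not reproduced. All function names are hypothetical.

```python
from collections import Counter


def feature_frequencies(features):
    """Assumed form of equation (1): FC = N_ik / N_i for each feature."""
    counts = Counter(features)
    total = sum(counts.values())
    return {f: n / total for f, n in counts.items()}


def select_features(fc, weights):
    """Keep a feature if W > W_avg, or if W < W_avg and FC > FC_avg."""
    fc_avg = sum(fc.values()) / len(fc)          # assumed arithmetic mean
    w_avg = sum(weights.values()) / len(weights)  # assumed arithmetic mean
    selected = []
    for f in fc:
        w = weights[f]
        if w > w_avg or (w < w_avg and fc[f] > fc_avg):
            selected.append(f)
    return selected


# Illustrative values only; the weights are made up for the example.
fc = feature_frequencies(["市", "市", "路", "先生"])
targets = select_features(fc, {"市": 0.2, "路": 0.9, "先生": 0.1})
```

In this toy case 路 is kept for its high weight, 市 is kept because its frequency is above average despite a low weight, and 先生 is dropped on both counts.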
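3. Compute the sample feature vectors of the target features:
3.1 If there are x geographic administrative division levels, each selected target feature has x relevance values; the average of these x values is taken as the weight of each word. The weighting matrix $A_{rc}$ is obtained from the feature weights:
$A_{rc} = (W_{ij}\, a_{ij})_{r \times c}$  (5)
3.2 Feature vector computation. Let $Y \in R^{n \times n}$ have n linearly independent eigenvectors, with the dominant eigenvalue $m_1$ satisfying $|m_1| > |m_2| \ge \dots \ge |m_n|$. Then, for any administrative geographic entity feature vector $v_0 = c_0$, the vector sequences $\{c_k\}$, $\{v_k\}$ are constructed as follows:
Figure PCTCN2020096989-appb-000006
Then:
$\lim_{k \to \infty} \mu_k = m_1$  (7)
Figure PCTCN2020096989-appb-000007
The weighted, normalized sample feature vector is constructed from equations (2), (5), (6), (7), and (8), as expressed in equation (9):
Figure PCTCN2020096989-appb-000008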
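The conditions in step 3.2 (a dominant eigenvalue, sequences $\{c_k\}$ and $\{v_k\}$, and a scalar $\mu_k$ converging to $m_1$) match the classical power iteration. Since equation (6) is only available as an image, the sketch below assumes the textbook form: $v_k = Y c_{k-1}$, $\mu_k$ is the component of $v_k$ with the largest absolute value, and $c_k = v_k / \mu_k$. It illustrates that standard method and is not a confirmed reading of the patent's formula.

```python
def power_iteration(Y, v0, iterations=100):
    """Textbook power method: returns (mu, c) approximating the dominant
    eigenvalue and a max-normalized eigenvector of the square matrix Y."""
    c = list(v0)
    mu = 0.0
    for _ in range(iterations):
        # v_k = Y c_{k-1}
        v = [sum(row[j] * c[j] for j in range(len(c))) for row in Y]
        # mu_k: the component of v_k with the largest absolute value
        mu = max(v, key=abs)
        # c_k = v_k / mu_k (assumes v_k is not the zero vector)
        c = [x / mu for x in v]
    return mu, c


# Small symmetric example with eigenvalues 3 and 1.
mu, c = power_iteration([[2.0, 1.0], [1.0, 2.0]], [1.0, 0.0])
```

For this matrix the iteration converges to the dominant eigenvalue 3 with eigenvector direction (1, 1); convergence is geometric with ratio $|m_2 / m_1|$, which is why the strict inequality $|m_1| > |m_2|$ matters.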
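The sample feature vector v is then used as the vectorized input for model training. The vectorized training corpus is trained with a neural network and a conditional random field algorithm, for example a recurrent neural network (RNN) and a conditional random field (CRF), to obtain the administrative geographic entity relationship recognition model. The final output of the model is an array of geographic entity relationship pairs:
political relation=[(e1,t1),(e2,t2),...(en,tn)]
In constructing this model, the selected target features are highly relevant to administrative division levels, and noisy features with low relevance are discarded. This reduces the adverse effect of such features on the result and also reduces the amount of data fed to the model. Because of this feature selection, the model input is not raw, messy address information but a feature vector that has been selected and optimized, which raises the relevance of the input to the geographic entities and their administrative divisions; the model therefore computes faster and recognizes more accurately.
With regular-expression-based address parsing, the full set of standard geographic information and address rules must be read into memory to build a dictionary tree. Taking one server as an example, the full rule dictionary tree needs 4 GB of memory. With the solution of this application, the administrative geographic entity recognition model replaces the full geographic rule dictionary tree and needs only 200 MB of memory. Compared with the prior art, this application uses only 4.88% of the memory, which lowers the cost of use.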
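The 4.88% figure can be checked directly from the two memory sizes, assuming 1 GB is counted as 1024 MB:

```latex
\frac{200\ \text{MB}}{4\ \text{GB}}
  = \frac{200}{4 \times 1024}
  = \frac{200}{4096}
  \approx 0.0488
  = 4.88\%
```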
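In addition, compared with the prior art, the method solves the problem of low-quality geographic data and increases the volume of effectively parsed addresses, providing a more accurate data basis for upper-layer decisions:
The address parsing approach that combines a standard geographic dictionary library with regular-expression extraction has many limitations when processing address data. Where address information contains a lot of dirty data due to human factors, this common approach is largely unable to obtain correct geographic information. Evaluation metrics are defined here for the address parsing scenario: accuracy, parse rate, and effective parse rate.
In the following, R denotes the set of records for which address parsing obtained the correct coordinates; G(wr)_i denotes the result set i of a certain type of parsing error, the main error type being a deviation in the parsed coordinates; T denotes the total number of addresses to be parsed; S denotes the set of records for which parsing succeeded and coordinates were obtained; and E denotes the set of failed records for which no coordinates were obtained. The accuracy of address parsing is given by equation (10), the parse rate by equation (11), and the effective parse rate by equation (12).
Correctly parsed result set: R    Incorrectly parsed result set:
Figure PCTCN2020096989-appb-000009
Total number of samples: T    Successfully parsed result set: S=T-E    Failed result set: E
Figure PCTCN2020096989-appb-000010
Figure PCTCN2020096989-appb-000011
Figure PCTCN2020096989-appb-000012
A comparative evaluation was made on test results for 10,000 address records. The parsing accuracy of the dictionary and regular-expression matching technique is 86.41%; the 13.59% of incorrect results are due to data quality problems such as redundant information and scrambled word order in the address information. The same data quality problems also cause parsing to fail for some data, so that no coordinates are obtained, and the parse rate with this technique is only 81%. On the same samples, the solution of this application reaches a parse rate of 98%, an improvement of 17 percentage points over the prior art, and the effective parse rate rises from 70% to 93%, as shown in Table 2.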
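The three metrics can be illustrated with a short calculation. Equations (10) to (12) are only available as images, so the definitions below are an assumption chosen to be consistent with the reported figures: accuracy = |R| / |S|, parse rate = |S| / |T|, and effective parse rate = |R| / |T|. Under that reading the effective parse rate is the product of the other two, and 86.41% × 81% ≈ 70% matches the prior-art numbers in the text. The record counts used are illustrative, back-computed from those percentages.

```python
def parse_metrics(total, failed, correct):
    """Assumed metric definitions: accuracy = R/S, parse rate = S/T,
    effective parse rate = R/T, with S = T - E."""
    succeeded = total - failed
    return {
        "accuracy": correct / succeeded,
        "parse_rate": succeeded / total,
        "effective_parse_rate": correct / total,
    }


# Illustrative counts for the dictionary + regex baseline on 10,000 records:
# 1,900 failures (81% parse rate) and 6,999 correct results.
baseline = parse_metrics(total=10000, failed=1900, correct=6999)
```

The same assumed definitions imply an accuracy of roughly 93% / 98% ≈ 95% for the solution of this application, a value the text does not state explicitly.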
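Table 2: Improvement in technical metrics
Figure PCTCN2020096989-appb-000013
Moreover, optimizing the administrative geographic entity relationship recognition model with the feature selection algorithm makes the extraction of geographic information more accurate than traditional rule matching.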
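The following is a specific implementation of Embodiment 1 of this application:
A low-level data synchronization task is built to store the originally entered address information from the original data system in the HDFS of the parsing task cluster. The parsing task cluster is based on Spark, with data processing tasks developed in Java to implement task scheduling and allocation. The pre-trained administrative geographic entity relationship recognition model is deployed in the parsing task cluster to recognize the administrative division levels and geographic entity relationships in low-quality address information and extract the useful information. The core recognition model is implemented in Python; it is trained with a recurrent neural network (RNN) and a conditional random field (CRF) algorithm and embeds the administrative geographic entity feature optimization algorithm to reduce the noise introduced by human interference. An administrative-level sorting algorithm is then used to sort and reassemble the administrative geographic entities, and the tree dictionary built earlier is used to check the data and fill gaps, yielding standard geographic data and providing higher-quality address information for subsequent encoding.
The geocoding function can be scheduled concurrently in the Spark task cluster. A RESTful HTTP batch address parsing interface developed in Java encodes the address information that has been extracted by the model and completed, producing standard geocoding information. To improve parsing efficiency, tasks are scheduled concurrently and each request submits a batch, so that data is parsed and encoded in batches and encoding throughput increases without adding pressure on the cluster.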
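The batch-submission idea can be sketched independently of any particular HTTP client. In this illustration `submit_batch` is a hypothetical callable standing in for one call to the batch parsing interface (one request carrying many addresses and returning one result per address); the batch size of 100 is an arbitrary placeholder, not a value from the patent.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def geocode_in_batches(addresses, submit_batch, size=100):
    """Send addresses to the (hypothetical) batch interface chunk by chunk
    and return the results in the original order."""
    results = []
    for chunk in batches(addresses, size):
        # One request per chunk instead of one request per address.
        results.extend(submit_batch(chunk))
    return results
```

Batching reduces the number of round trips by roughly a factor of `size`, which is where the throughput gain described above would come from; concurrent scheduling of the chunks is left out of this sketch.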
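Because an independent batch encoding service is used, it does not compete with the extraction computation for resources, and parsing time is markedly shorter. With the administrative geographic entity relationship model embedded in the Spark computing engine, parsing 10 million records, which originally took 15 days, takes only 10 hours with the patented solution, a 36-fold speedup.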
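The 36-fold figure follows from converting both durations to hours:

```latex
15\ \text{days} \times 24\ \tfrac{\text{h}}{\text{day}} = 360\ \text{h},
\qquad
\frac{360\ \text{h}}{10\ \text{h}} = 36
```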
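Embodiment 2
Based on the above description, Embodiment 2 of this application provides an address information parsing method. As shown in FIG. 3, the method includes:
S31: obtaining the address information to be parsed from original data;
S32: extracting and selecting features from the address information to be parsed using natural language processing techniques, and vectorizing the selected features to obtain a feature vector to be recognized; for the specific approach, refer to the feature extraction, selection, and vectorization steps in model training.
S33: inputting the feature vector to be recognized into a preset model to obtain an initial array comprising geographic entities and the administrative division level corresponding to each geographic entity;
S34: sorting and deduplicating the geographic entities in the initial array by administrative division level to obtain a standard array;
S35: encoding the standard array to obtain a geocoding result. Specifically, the encoding interface of an external server can be called to encode the standard array and obtain the geocoding result.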
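Preferably, before features are extracted from the address information to be parsed using natural language processing techniques, the method further includes:
determining, from pre-stored historical address parsing records, whether the address information to be parsed has been parsed before; the historical address parsing records include historical address information and the corresponding historical geocoding data;
if it has been parsed before, obtaining the corresponding historical geocoding data as the geocoding result;
if it has not been parsed before, extracting features from the address information to be parsed using natural language processing techniques.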
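To avoid incomplete information in the array, before the standard array is encoded to obtain the geocoding result, the method further includes:
matching the standard array against a pre-stored geographic location tree dictionary to determine whether anything is missing from the standard array; the geographic location tree dictionary is formed by dividing administrative regions level by level;
if something is missing, completing the standard array according to the geographic location tree dictionary;
encoding the standard array to obtain the geocoding result includes encoding the completed standard array to obtain the geocoding result.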
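The method of this application further includes a step of constructing the preset model in advance:
annotating the address data in a sample set to obtain sample arrays labeled with sample geographic entities and the administrative division corresponding to each sample geographic entity;
extracting primary features of the address data in the sample set using natural language processing techniques, determining the primary features that satisfy certain conditions as target features, and vectorizing the target features to obtain sample feature vectors;
taking the sample feature vectors as input and the corresponding sample arrays as output, and training with a neural network and a conditional random field algorithm to obtain the preset model.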
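Preferably, extracting the primary features of the address data in the sample set using natural language processing techniques, determining the primary features that satisfy certain conditions as target features, and vectorizing the target features to obtain sample feature vectors includes:
calculating the frequency with which each extracted primary feature appears in the address text;
calculating, from the frequency, the relevance of each primary feature to each administrative division level as the feature weight;
selecting as the target features the primary features whose relevance and/or frequency satisfy a preset condition;
calculating the relevance of each selected target feature to each administrative division level, taking the average relevance of each target feature as its weight, and constructing a weighting matrix from the weights;
vectorizing the target features according to the weighting matrix to obtain the sample feature vectors.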
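For more specific steps of constructing the preset model in advance, see the model training process described above.
The geocoding result can be combined with other data to provide a data basis for subsequent application decisions. To this end, in this application the geocoding result can be stored in association with the original data corresponding to it.
Taking sales data as the original data, once the address information of an original record has been parsed into an accurate geocoding result, the result can be stored in association with the corresponding original data, making it possible to obtain the product sales for a given geographic location. To facilitate later retrieval, the associated information can be stored in the Elasticsearch search engine.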
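Embodiment 3
Building on the association storage above, and taking a request for the relevant data within a certain geographic area as an example, Embodiment 3 of this application provides a data acquisition method, including:
receiving candidate address information;
parsing the candidate address information according to the address parsing method described above to obtain parsed candidate geocoding data;
performing a calculation, based on the candidate geocoding data and a preset geographic range, on a pre-stored association table of geocoding results and original data, to obtain the geocoding results within the preset geographic range and the corresponding original data.
With this method, the geocoding results can be used to obtain the original data within a given geographic range, providing a data basis for subsequent decisions on sales, promotion, and the like.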
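The range query in Embodiment 3 can be sketched as a distance filter over the association table. Several assumptions are made for illustration: geocoding results are (latitude, longitude) pairs, the preset geographic range is a radius in kilometres around the candidate coordinates, the association table is an in-memory list of (coordinates, original record) tuples, and distance is the great-circle (haversine) distance. The patent does not specify any of these; a production system would more likely push the filter down into the search engine.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def records_within(center, radius_km, table):
    """Return the (coordinates, record) rows lying within radius_km of center."""
    lat0, lon0 = center
    return [(coord, rec) for coord, rec in table
            if haversine_km(lat0, lon0, coord[0], coord[1]) <= radius_km]


# Hypothetical association table: geocoding result -> original order record.
table = [((39.00, 117.70), "order-1"),
         ((39.01, 117.71), "order-2"),
         ((31.20, 121.50), "order-3")]
nearby = records_within((39.00, 117.70), 5.0, table)
```

Here the first two orders fall inside the 5 km radius and the third, several hundred kilometres away, is excluded.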
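Embodiment 4
Corresponding to the method of Embodiment 2, Embodiment 4 of this application provides an address information parsing apparatus. As shown in FIG. 4, the apparatus includes:
a to-be-parsed address information obtaining unit 41, configured to obtain the address information to be parsed from original data;
a first feature vectorization unit 42, configured to extract and select features from the address information to be parsed using natural language processing techniques and vectorize them to obtain a feature vector;
a model prediction unit 43, configured to input the feature vector into a preset model to obtain an initial array comprising geographic entities and the administrative division level corresponding to each geographic entity; the preset model is trained by combining a recurrent neural network with a conditional random field algorithm;
a sorting unit 44, configured to sort and deduplicate the geographic entities in the initial array by administrative division level to obtain a standard array;
a geocoding unit 45, configured to encode the standard array to obtain a geocoding result.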
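Preferably, the apparatus further includes:
a parsing record judgment unit 46, connected to the to-be-parsed address information obtaining unit 41, configured to determine, from pre-stored historical address parsing records, whether the address information to be parsed has been parsed before; the historical address parsing records include historical address information and the corresponding historical geocoding data;
a parsing record obtaining unit 47, connected to the parsing record judgment unit 46, configured to obtain the corresponding historical geocoding data as the geocoding result when it is determined that the address information to be parsed has been parsed before.
The first feature vectorization unit 42 is specifically configured to extract features from the address information to be parsed using natural language processing techniques when it is determined that the address information to be parsed has not been parsed before.
To avoid incomplete information in the array, the apparatus further includes:
a completion unit 48, configured to, before the standard array is encoded to obtain the geocoding result, match the standard array produced by the sorting unit 44 against a pre-stored geographic location tree dictionary, determine whether anything is missing from the standard array, and, if so, complete the standard array according to the geographic location tree dictionary; the geographic location tree dictionary is formed by dividing administrative regions level by level;
The geocoding unit 45 is specifically configured to encode the completed standard array to obtain the geocoding result.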
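The apparatus of this application further includes units for constructing the preset model in advance, specifically including:
a second feature vectorization unit, configured to extract features from the address data in the sample set using natural language processing techniques, select among the features, and vectorize the selected features to obtain sample feature vectors; for the specific process of this step, see the related description in Embodiment 1. The second feature vectorization unit and the first feature vectorization unit may be the same or different.
a sample administrative entity relationship unit, configured to annotate the address data in the sample set to obtain sample arrays consisting of sample geographic entities and the sample administrative division level corresponding to each sample geographic entity;
a model training unit, configured to take the sample feature vectors as input and the sample arrays as output, and to train with a recurrent neural network (RNN) and a conditional random field (CRF) algorithm to construct the preset model.
The geocoding result can be combined with other data to provide a data basis for subsequent application decisions. To this end, the apparatus of this application further includes an association storage unit, configured to store the geocoding result in association with the original data corresponding to it.
Taking sales data as the original data, once the address information of an original record has been parsed into an accurate geocoding result, the result can be stored in association with the corresponding original data, making it possible to obtain the product sales for a given geographic location. To facilitate later retrieval, the associated information can be stored in the Elasticsearch search engine.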
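Embodiment 5
Corresponding to the above method and apparatus, Embodiment 5 of this application provides a computer system, including:
one or more processors; and
a memory associated with the one or more processors, the memory being used to store program instructions which, when read and executed by the one or more processors, perform the following operations:
obtaining the address information to be parsed from original data;
extracting and selecting features from the address information to be parsed using natural language processing techniques, and vectorizing the selected features to obtain a feature vector;
inputting the feature vector into a preset model to obtain an initial array comprising geographic entities and the administrative division level corresponding to each geographic entity;
sorting and deduplicating the geographic entities in the initial array by administrative division level to obtain a standard array;
encoding the standard array to obtain a geocoding result.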
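FIG. 5 shows an example architecture of the computer system, which may specifically include a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520. The processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520 may be communicatively connected through a communication bus 1530.
The processor 1510 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute the relevant programs to implement the technical solutions provided in this application.
The memory 1520 may be implemented as ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500 and a basic input/output system (BIOS) for controlling its low-level operation. It may also store a web browser 1523, a data storage management system 1524, an icon font processing system 1525, and so on. The icon font processing system 1525 may be the application program that implements the steps described above in the embodiments of this application. In short, when the technical solution provided in this application is implemented in software or firmware, the relevant program code is stored in the memory 1520 and is called and executed by the processor 1510.
The input/output interface 1513 is used to connect an input/output module to realize information input and output. The input/output module can be configured in the device as a component (not shown in the figure) or connected to the device externally to provide the corresponding functions. Input devices may include a keyboard, mouse, touch screen, microphone, and various sensors; output devices may include a display, speaker, vibrator, and indicator light.
The network interface 1514 is used to connect a communication module (not shown in the figure) to realize communication between this device and other devices. The communication module can communicate by wired means (such as USB or a network cable) or by wireless means (such as a mobile network, WIFI, or Bluetooth).
The bus 1530 includes a path that carries information between the components of the device (for example the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520).
In addition, the computer system 1500 can obtain information on specific claiming conditions from a virtual resource object claiming condition information database 1541 for use in condition judgment, and so on.
It should be noted that although the above device only shows the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, and so on, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the device may also include only the components necessary to implement the solution of this application, and need not include all the components shown in the figure.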
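From the description of the above implementations, those skilled in the art can clearly understand that this application can be implemented by means of software plus the necessary general-purpose hardware platform. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a cloud server, a network device, or the like) to execute the methods described in the embodiments of this application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for parts that are the same or similar across embodiments, the embodiments can be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively briefly because they are basically similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments. The system embodiments described above are only illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
The data processing method, apparatus, and device provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is only intended to help in understanding the method of this application and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting this application.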

Claims (10)

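  1. An address information parsing method, characterized in that the method includes:
    obtaining the address information to be parsed from original data;
    extracting features from the address information to be parsed using natural language processing techniques, selecting among the extracted features, and vectorizing the selected features to obtain a feature vector to be recognized;
    inputting the feature vector to be recognized into a preset model to obtain an initial array comprising geographic entities and the administrative division level corresponding to each geographic entity;
    sorting and deduplicating the geographic entities in the initial array by administrative division level to obtain a standard array;
    encoding the standard array to obtain a geocoding result.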
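  2. The address information parsing method according to claim 1, characterized in that, before features are extracted from the address information to be parsed using natural language processing techniques, the method further includes:
    determining, from pre-stored historical address parsing records, whether the address information to be parsed has been parsed before; the historical address parsing records include historical address information and the corresponding historical geocoding data;
    if it has been parsed before, obtaining the corresponding historical geocoding data as the geocoding result;
    extracting features from the address information to be parsed using natural language processing techniques includes: if it has not been parsed before, extracting features from the address information to be parsed using natural language processing techniques.
  3. The address information parsing method according to claim 1, characterized in that, before the standard array is encoded to obtain the geocoding result, the method further includes:
    matching the standard array against a pre-stored geographic location tree dictionary to determine whether anything is missing from the standard array; the geographic location tree dictionary is formed by dividing administrative regions level by level;
    if something is missing, completing the standard array according to the geographic location tree dictionary;
    encoding the standard array to obtain the geocoding result includes encoding the completed standard array to obtain the geocoding result.
  4. The address information parsing method according to claim 1, characterized in that encoding the standard array to obtain the geocoding result includes:
    calling the encoding interface of an external server to encode the standard array and obtain the geocoding result.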
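  5. The address information parsing method according to any one of claims 1-4, characterized in that the method further includes a step of constructing the preset model in advance:
    annotating the address data in a sample set to obtain sample arrays labeled with sample geographic entities and the administrative division corresponding to each sample geographic entity;
    extracting primary features of the address data in the sample set using natural language processing techniques, determining the primary features that satisfy certain conditions as target features, and vectorizing the target features to obtain sample feature vectors;
    taking the sample feature vectors as input and the corresponding sample arrays as output, and training with a neural network and a conditional random field algorithm to obtain the preset model.
  6. The address information parsing method according to claim 5, characterized in that extracting the primary features of the address data in the sample set using natural language processing techniques, determining the primary features that satisfy certain conditions as target features, and vectorizing the target features to obtain sample feature vectors includes:
    calculating the frequency with which each extracted primary feature appears in the address text;
    calculating, from the frequency, the relevance of each primary feature to each administrative division level as the feature weight;
    selecting as the target features the primary features whose relevance and/or frequency satisfy a preset condition;
    calculating the relevance of each selected target feature to each administrative division level, taking the average relevance of each target feature as its weight, and constructing a weighting matrix from the weights;
    vectorizing the target features according to the weighting matrix to obtain the sample feature vectors.
  7. The address information parsing method according to any one of claims 1-4, characterized in that the method further includes:
    the prediction model is deployed in the Spark computing engine, and the geocoding result is stored in association with the original data in the Elasticsearch search engine.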
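  8. A data acquisition method, characterized in that the method includes:
    receiving candidate address information;
    parsing the candidate address information according to the method of claim 7 to obtain parsed candidate geocoding data;
    performing a calculation, based on the candidate geocoding data and a preset geographic range, on a pre-stored association table of geocoding results and original data, to obtain the geocoding results within the preset geographic range and the corresponding original data.
  9. An address information parsing apparatus, characterized in that the apparatus includes:
    a to-be-parsed address information obtaining unit, configured to obtain the address information to be parsed from original data;
    a feature extraction unit, configured to extract features from the address information to be parsed using natural language processing techniques, select among the extracted features, and vectorize the selected features to obtain a feature vector to be recognized;
    a model prediction unit, configured to input the feature vector to be recognized into a preset model to obtain an initial array comprising geographic entities and the administrative division level corresponding to each geographic entity; the preset model is trained by combining a recurrent neural network with a conditional random field algorithm;
    a sorting unit, configured to sort and deduplicate the geographic entities in the initial array by administrative division level to obtain a standard array;
    a geocoding unit, configured to encode the standard array to obtain a geocoding result.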
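  10. A computer system, characterized in that it includes:
    one or more processors; and
    a memory associated with the one or more processors, the memory being used to store program instructions which, when read and executed by the one or more processors, perform the following operations:
    obtaining the address information to be parsed from original data;
    extracting features from the address information to be parsed using natural language processing techniques, selecting among the extracted features, and vectorizing the selected features to obtain a feature vector to be recognized;
    inputting the feature vector to be recognized into a preset model to obtain an initial array comprising geographic entities and the administrative division level corresponding to each geographic entity;
    sorting and deduplicating the geographic entities in the initial array by administrative division level to obtain a standard array;
    encoding the standard array to obtain a geocoding result.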
PCT/CN2020/096989 2019-07-26 2020-06-19 Address information parsing method, apparatus, system and data acquisition method WO2021017679A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA3145918A CA3145918A1 (en) 2019-07-26 2020-06-19 Address information parsing method and apparatus, system and data acquisition method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910684395.4A CN110569322A (zh) 2019-07-26 2019-07-26 Address information parsing method, apparatus, system and data acquisition method
CN201910684395.4 2019-07-26

Publications (1)

Publication Number Publication Date
WO2021017679A1 true WO2021017679A1 (zh) 2021-02-04

Family

ID=68773824

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/096989 WO2021017679A1 (zh) 2019-07-26 2020-06-19 Address information parsing method, apparatus, system and data acquisition method

Country Status (3)

Country Link
CN (1) CN110569322A (zh)
CA (1) CA3145918A1 (zh)
WO (1) WO2021017679A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113438280A (zh) * 2021-06-03 2021-09-24 多点生活(成都)科技有限公司 车辆启动控制方法和装置
CN113988949A (zh) * 2021-11-15 2022-01-28 北京有竹居网络技术有限公司 一种推广信息处理方法、装置、设备及介质、程序产品
CN114463053A (zh) * 2022-01-21 2022-05-10 浪潮卓数大数据产业发展有限公司 一种企业归属分类的方法及系统
CN114513550A (zh) * 2021-12-30 2022-05-17 天翼云科技有限公司 一种地理位置信息的处理方法、装置及电子设备
CN115174638A (zh) * 2022-09-06 2022-10-11 广东邦盛新能源科技发展有限公司 光伏板数据采集设备的组网方法及系统
CN115248837A (zh) * 2022-09-21 2022-10-28 中科雨辰科技有限公司 一种获取文本的地理实体的数据处理系统
CN116501827A (zh) * 2023-06-26 2023-07-28 北明成功软件(山东)有限公司 一种基于bim的市场主体与楼宇地址匹配定位方法

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569322A (zh) * 2019-07-26 2019-12-13 苏宁云计算有限公司 地址信息解析方法、装置、系统及数据获取方法
CN113076746B (zh) * 2020-01-06 2024-05-31 阿里巴巴集团控股有限公司 数据处理方法和系统、存储介质及计算设备
CN113111230B (zh) * 2020-02-13 2024-04-12 北京明亿科技有限公司 基于正则表达式的接处警文本户籍地地址提取方法和装置
CN113111229B (zh) * 2020-02-13 2024-04-12 北京明亿科技有限公司 基于正则表达式的接处警文本轨迹地地址提取方法和装置
CN111523647B (zh) * 2020-04-26 2023-11-14 南开大学 网络模型训练方法及装置、特征选择模型、方法及装置
CN111901450B (zh) * 2020-07-15 2023-04-18 安徽淘云科技股份有限公司 实体的地址确定方法、装置、设备及存储介质
CN112148819A (zh) * 2020-08-17 2020-12-29 北京来也网络科技有限公司 结合rpa和ai的地址识别方法和装置
CN112269861A (zh) * 2020-10-09 2021-01-26 和美(深圳)信息技术股份有限公司 智能机器人的语料生成方法及系统
CN112257413B (zh) * 2020-10-30 2022-05-17 深圳壹账通智能科技有限公司 地址参数处理方法及相关设备
CN112488200A (zh) * 2020-11-30 2021-03-12 上海寻梦信息技术有限公司 物流地址特征提取方法、系统、设备及存储介质
CN112559661B (zh) * 2020-12-09 2024-03-01 北京百度网讯科技有限公司 检索地址类型的方法、装置和电子设备
CN113610157A (zh) * 2021-01-20 2021-11-05 廖彩红 基于人工智能的业务大数据特征采集方法及服务器
CN112818685A (zh) * 2021-01-29 2021-05-18 上海寻梦信息技术有限公司 地址匹配方法、装置、电子设备及存储介质
CN112989166A (zh) * 2021-03-26 2021-06-18 杭州有数金融信息服务有限公司 一种计算企业实际经营地的方法
CN113138985B (zh) * 2021-04-22 2023-05-02 重庆长安汽车股份有限公司 一种gps数据解析方法及系统
CN113255346B (zh) * 2021-07-01 2021-09-14 湖南工商大学 一种基于图嵌入与crf知识融入的地址要素识别方法
CN113592037B (zh) * 2021-08-26 2023-11-24 吉奥时空信息技术股份有限公司 一种基于自然语言推断的地址匹配方法
CN113642313B (zh) * 2021-09-02 2024-03-29 阿里巴巴达摩院(杭州)科技有限公司 地址文本的处理方法、装置、设备、存储介质及程序产品
CN114301629A (zh) * 2021-11-26 2022-04-08 北京六方云信息技术有限公司 Ip检测方法、装置、终端设备以及存储介质
CN114138923B (zh) * 2021-12-03 2024-06-07 吉林大学 一种构建地质图知识图谱的方法
CN115577065B (zh) * 2022-12-09 2023-06-09 中信证券股份有限公司 一种地址解析的方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955833A (zh) * 2011-08-31 2013-03-06 深圳市华傲数据技术有限公司 一种通讯地址识别、标准化的方法
WO2014163977A1 (en) * 2013-03-13 2014-10-09 Google Inc. Systems, methods and computer-readable media for interpreting geographical search queries
CN109933797A (zh) * 2019-03-21 2019-06-25 东南大学 基于Jieba分词及地址词库的地理编码方法和系统
CN110019617A (zh) * 2017-12-05 2019-07-16 腾讯科技(深圳)有限公司 地址标识的确定方法和装置、存储介质、电子装置
CN110569322A (zh) * 2019-07-26 2019-12-13 苏宁云计算有限公司 地址信息解析方法、装置、系统及数据获取方法

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8732435B1 (en) * 2008-07-30 2014-05-20 Altera Corporation Single buffer multi-channel de-interleaver/interleaver
CN102955832B (zh) * 2011-08-31 2015-11-25 深圳市华傲数据技术有限公司 一种通讯地址识别、标准化的系统
CN109960795B (zh) * 2019-02-18 2024-05-07 平安科技(深圳)有限公司 一种地址信息标准化方法、装置、计算机设备及存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955833A (zh) * 2011-08-31 2013-03-06 深圳市华傲数据技术有限公司 一种通讯地址识别、标准化的方法
WO2014163977A1 (en) * 2013-03-13 2014-10-09 Google Inc. Systems, methods and computer-readable media for interpreting geographical search queries
CN110019617A (zh) * 2017-12-05 2019-07-16 腾讯科技(深圳)有限公司 地址标识的确定方法和装置、存储介质、电子装置
CN109933797A (zh) * 2019-03-21 2019-06-25 东南大学 基于Jieba分词及地址词库的地理编码方法和系统
CN110569322A (zh) * 2019-07-26 2019-12-13 苏宁云计算有限公司 地址信息解析方法、装置、系统及数据获取方法

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113438280A (zh) * 2021-06-03 2021-09-24 多点生活(成都)科技有限公司 车辆启动控制方法和装置
CN113988949A (zh) * 2021-11-15 2022-01-28 北京有竹居网络技术有限公司 一种推广信息处理方法、装置、设备及介质、程序产品
CN114513550A (zh) * 2021-12-30 2022-05-17 天翼云科技有限公司 一种地理位置信息的处理方法、装置及电子设备
CN114513550B (zh) * 2021-12-30 2024-03-08 天翼云科技有限公司 一种地理位置信息的处理方法、装置及电子设备
CN114463053A (zh) * 2022-01-21 2022-05-10 浪潮卓数大数据产业发展有限公司 一种企业归属分类的方法及系统
CN115174638A (zh) * 2022-09-06 2022-10-11 广东邦盛新能源科技发展有限公司 光伏板数据采集设备的组网方法及系统
CN115248837A (zh) * 2022-09-21 2022-10-28 中科雨辰科技有限公司 一种获取文本的地理实体的数据处理系统
CN115248837B (zh) * 2022-09-21 2022-12-23 中科雨辰科技有限公司 一种获取文本的地理实体的数据处理系统
CN116501827A (zh) * 2023-06-26 2023-07-28 北明成功软件(山东)有限公司 一种基于bim的市场主体与楼宇地址匹配定位方法
CN116501827B (zh) * 2023-06-26 2023-09-12 北明成功软件(山东)有限公司 一种基于bim的市场主体与楼宇地址匹配定位方法

Also Published As

Publication number Publication date
CA3145918A1 (en) 2021-02-04
CN110569322A (zh) 2019-12-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20846832

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 3145918

Country of ref document: CA

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20846832

Country of ref document: EP

Kind code of ref document: A1
