TW202020688A - Method for determining address text similarity, address searching method, apparatus, and device - Google Patents

Method for determining address text similarity, address searching method, apparatus, and device

Info

Publication number
TW202020688A
TW202020688A (application TW108129457A)
Authority
TW
Taiwan
Prior art keywords
address
text
similarity
similarity calculation
texts
Prior art date
Application number
TW108129457A
Other languages
Chinese (zh)
Inventor
劉楚
謝朋峻
鄭華飛
李林琳
司羅
Original Assignee
香港商阿里巴巴集團服務有限公司
Priority date
Filing date
Publication date
Application filed by 香港商阿里巴巴集團服務有限公司 filed Critical 香港商阿里巴巴集團服務有限公司
Publication of TW202020688A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 - Commerce
    • G06Q30/02 - Marketing; Price estimation or determination; Fundraising
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Strategic Management (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Evolutionary Biology (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a method for determining address text similarity, an address search method, an apparatus, and a device, wherein the address text comprises a plurality of address elements arranged from the highest level to the lowest. The method comprises: acquiring an address text pair whose similarity is to be determined; and inputting the address text pair into a preset address text similarity calculation model to output the similarity between the two address texts comprised in the pair. The present invention improves the accuracy of calculating the similarity between address texts.

Description

Address text similarity determination method and address search method

The present invention relates to the field of artificial intelligence, and in particular to an address text similarity determination method, an address search method, and a computing device.

In address-sensitive industries and departments, such as public security, express delivery, logistics, and electronic maps, a standard address database is usually maintained internally. However, the address descriptions encountered when this address data is used often do not match the standard address library; for example, an address spoken during a 110 emergency call can differ greatly from the standardized address kept in the public security system. An effective and fast method is therefore needed to map non-standard address text to the corresponding or similar address in the standard address library, and judging the degree of similarity between two pieces of address text is crucial to this task. Commonly used ways of calculating address text similarity are as follows:
1. Use edit distance to calculate the similarity of the two texts. This method ignores the semantics of the text. For example, the edit distance between "阿里巴巴" (Alibaba) and "阿里地區" (Ali Prefecture) is the same as the edit distance between "阿里巴巴" and "阿里媽媽" (Alimama), yet semantically "阿里巴巴" should be more similar to "阿里媽媽" than to "阿里地區". A minimal sketch illustrating this is given after this list.
2. Use semantic similarity, such as word2vec, to calculate the similarity between two pieces of text. This approach is designed for text in general rather than address text in particular, and its accuracy is not high enough when applied to address text.
3. Decompose the address text into multiple address elements, manually assign a weight to each address level, and compute a weighted sum. The disadvantage is that the weights of the address levels cannot be generated automatically for a given data set, so the approach does not automate well.
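The sketch referred to in item 1 above; it is illustrative only and not part of the patent, using the classic dynamic-programming Levenshtein distance:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    dp = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i            # prev holds dp[i-1][j-1]
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,                # deletion
                dp[j - 1] + 1,            # insertion
                prev + (ca != cb),        # substitution (free if characters match)
            )
    return dp[len(b)]

print(edit_distance("阿里巴巴", "阿里地區"))  # 2
print(edit_distance("阿里巴巴", "阿里媽媽"))  # 2 -- equal distance, different semantics
```

Both pairs are two substitutions apart, so edit distance alone cannot reflect that "阿里媽媽" is semantically closer to "阿里巴巴" than "阿里地區" is.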

In view of the above problems, the present invention is proposed to provide an address text similarity determination method and an address search method that overcome the above problems or at least partially solve them. According to one aspect of the present invention, a method for determining address text similarity is provided, where the address text includes a plurality of address elements arranged from the highest level to the lowest.
The method includes: acquiring an address text pair whose similarity is to be determined; and inputting the address text pair into a preset address text similarity calculation model to output the similarity of the two address texts included in the pair. The address text similarity calculation model is trained on a training data set that includes multiple pieces of training data, each piece of which includes at least a first, a second, and a third address text, where the address elements of the first n levels of the first and second address texts are the same, constituting a positive sample pair, and the address elements of the first (n-1) levels of the first and third address texts are the same while the address elements of the nth level differ, constituting a negative sample pair. Optionally, in the address text similarity determination method according to the present invention, the address text similarity calculation model includes a word embedding layer, a text encoding layer, and a similarity calculation layer, and training the model includes: inputting the first, second, and third address texts of each piece of training data into the word embedding layer to obtain the corresponding first, second, and third word vector sets; inputting the first, second, and third word vector sets into the text encoding layer to obtain the corresponding first, second, and third text vectors; using the similarity calculation layer to calculate a first similarity between the first and second text vectors and a second similarity between the first and third text vectors; and adjusting the network parameters of the address text similarity calculation model according to the first and second similarities. Optionally, the network parameters include the parameters of the word embedding layer and/or the parameters of the text encoding layer. Optionally, each of the first, second, and third word vector sets includes multiple word vectors, and each word vector corresponds to one address element in the address text. Optionally, the word embedding layer adopts the GloVe model or the Word2Vec model. Optionally, the first similarity and the second similarity each use at least one of Euclidean distance, cosine similarity, or the Jaccard coefficient. Optionally, adjusting the network parameters of the address text similarity calculation model according to the first and second similarities includes: calculating a loss function value according to the first and second similarities; and adjusting the network parameters with the back-propagation algorithm until the loss function value falls below a preset value or the number of training iterations reaches a predetermined number. Optionally, the loss function value is Loss = Margin - (first similarity - second similarity), where Loss is the loss function value and Margin is a hyperparameter.
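The positive/negative sample criterion above can be written compactly. The following is a minimal sketch under the assumption, made only for illustration, that an address is represented as a list of address elements ordered from the highest level (index 0) downward:

```python
def is_positive_pair(addr_a, addr_b, n):
    """Positive sample pair: the first n levels of address elements match."""
    return addr_a[:n] == addr_b[:n]

def is_negative_pair(addr_a, addr_c, n):
    """Negative sample pair: the first n-1 levels match, the nth level differs."""
    return addr_a[:n - 1] == addr_c[:n - 1] and addr_a[n - 1] != addr_c[n - 1]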
Optionally, in the address text similarity determination method according to the present invention, the text encoding layer includes at least one of an RNN model, a CNN model, or a DBN model. According to another aspect of the present invention, an address search method is provided, including: acquiring one or more candidate address texts corresponding to the address text to be queried; inputting the address text to be queried and each candidate address text into a preset address text similarity calculation model to obtain the similarity between the two, where the address text similarity calculation model is trained on a training data set that includes multiple pieces of training data, each piece of which includes at least a first, a second, and a third address text, where the address elements of the first n levels of the first and second address texts are the same, constituting a positive sample pair, and the address elements of the first (n-1) levels of the first and third address texts are the same while the address elements of the nth level differ, constituting a negative sample pair; and determining the candidate address text with the largest similarity as the target address text corresponding to the address text to be queried. According to another aspect of the present invention, an address search apparatus is provided, including: a query module adapted to acquire one or more candidate address texts corresponding to the address text to be queried; a first similarity calculation module adapted to input the address text to be queried and each candidate address text into a preset address text similarity calculation model, trained on positive and negative sample pairs constructed as described above, to obtain the similarity between the two; and an output module adapted to determine the candidate address text with the largest similarity as the target address text corresponding to the address text to be queried. According to another aspect of the present invention, a training apparatus for an address text similarity calculation model is provided, where the address text includes a plurality of address elements arranged from the highest level to the lowest, and the address text similarity calculation model includes a word embedding layer, a text encoding layer, and a similarity calculation layer. The apparatus includes: an acquisition module adapted to acquire a training data set.
The training data set includes multiple pieces of training data, each piece of which includes at least a first, a second, and a third address text, where the address elements of the first n levels of the first and second address texts are the same, constituting a positive sample pair, and the address elements of the first (n-1) levels of the first and third address texts are the same while the address elements of the nth level differ, constituting a negative sample pair. The apparatus further includes: a word vector acquisition module adapted to input the first, second, and third address texts of each piece of training data into the word embedding layer to obtain the corresponding first, second, and third word vector sets; a text vector acquisition module adapted to input the first, second, and third word vector sets into the text encoding layer to obtain the corresponding first, second, and third text vectors; a second similarity calculation module adapted to use the similarity calculation layer to calculate a first similarity between the first and second text vectors and a second similarity between the first and third text vectors; and a parameter adjustment module adapted to adjust the network parameters of the address text similarity calculation model according to the first and second similarities. According to another aspect of the present invention, a computing device is provided, including: one or more processors; a memory; and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described above. Because address text naturally contains hierarchical relationships, address elements of different levels play different roles in address similarity calculation. The embodiments of the present invention use the hierarchical relationships in address text to automatically learn the weights of address elements at different levels, avoiding the subjectivity of manually specified weights while adapting to the target data source, and can therefore accurately calculate the degree of similarity between two address texts. The above description is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the content of the specification, and in order to make the above and other objects, features, and advantages of the present invention more apparent, specific embodiments of the present invention are set forth below.

Hereinafter, exemplary embodiments of the present disclosure will be described in more detail with reference to the drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided to enable a more thorough understanding of the present disclosure and to fully convey its scope to those skilled in the art.

First, some terms that appear in the description of the embodiments of the present invention are explained as follows:

Address text: text containing address information, such as "杭州文一西路969號阿里巴巴" (Alibaba, No. 969 Wenyi West Road, Hangzhou) or "四川省眉山市彭山區彭溪鎮錦江大道1號四川大學錦江學院" (Jinjiang College of Sichuan University, No. 1 Jinjiang Avenue, Pengxi Town, Pengshan District, Meishan City, Sichuan Province). Address text includes multiple address elements arranged from the highest level to the lowest.

Address element: an element of the address text at some granularity. For example, in "杭州文一西路969號阿里巴巴", "杭州" (Hangzhou) is the city, "文一西路" (Wenyi West Road) is the road, "969號" (No. 969) is the road number, and "阿里巴巴" (Alibaba) is the point of interest (POI).

Address level: the regions corresponding to the address elements in an address have a containment relationship by size, so each address element has a corresponding address level, for example: province > city > district > street/community > road > building.

Address similarity: the degree of similarity between two pieces of address text, taking a value between 0 and 1.
The greater the value, the more likely the two addresses refer to the same location: a value of 1 means the two pieces of text denote the same address, and a value of 0 means the two addresses are unrelated.

Partial order relationship: the regions in an address have a hierarchical containment relationship by size, for example: province > city > district > street/community > road > building.

Since address text naturally contains a hierarchical relationship, namely the partial order relationship above, address elements at different levels play different roles in address similarity calculation. The embodiments of the present invention use the hierarchical relationship in address text to automatically generate the weights of address elements at different levels, and these weights are implicitly reflected in the network parameters of the address text similarity calculation model, so that the similarity of two address texts can be calculated accurately.

FIG. 1 shows a schematic diagram of an address search system 100 according to an embodiment of the present invention. As shown in FIG. 1, the address search system 100 includes a user terminal 110 and a computing device 200. The user terminal 110 is a terminal device used by a user, which may be a personal computer such as a desktop or notebook computer, or a mobile phone, tablet computer, multimedia device, smart wearable device, and the like, but is not limited thereto. The computing device 200 provides services to the user terminal 110 and may be implemented as a server, such as an application server or a Web server, or as a desktop computer, notebook computer, processor chip, mobile phone, tablet computer, and the like, but is not limited thereto.

In the embodiments of the present invention, the computing device 200 may be used to provide an address search service to users; for example, the computing device 200 may serve as the server of an electronic map application. However, those skilled in the art should understand that the computing device 200 may be any device capable of providing an address search service to users, not merely the server of an electronic map application.

In one embodiment, the address search system 100 further includes a data storage device 120. The data storage device 120 may be a relational database such as MySQL or ACCESS, or a non-relational database such as a NoSQL database; it may be a local database residing in the computing device 200, or it may be deployed at multiple geographic locations as a distributed database such as HBase. In short, the data storage device 120 is used to store data, and the present invention places no restriction on its specific deployment or configuration. The computing device 200 may connect to the data storage device 120 and obtain the data stored there. For example, the computing device 200 may read the data in the data storage device 120 directly (when the data storage device 120 is a local database of the computing device 200), or it may access the Internet by wire or wirelessly and obtain the data in the data storage device 120 through a data interface.

In the embodiments of the present invention, the data storage device 120 stores a standard address library, and the address text in the standard address library is standard address text (complete and accurate address text).
In the address search service, a user inputs the address text to be queried (the query) through the user terminal 110; typically, the user's input is incomplete and inaccurate address text. The user terminal 110 sends the query to the computing device 200, and the address search apparatus in the computing device 200 retrieves the standard address library and recalls a batch of candidate address texts, usually ranging from a few to several thousand. The address search apparatus then calculates the degree of relevance between each candidate address text and the query, for which address similarity is an important reference. After separately calculating the address similarity between the query and every candidate address text, the candidate address text with the largest similarity is determined as the target address text corresponding to the query and returned to the user.

Specifically, the address search apparatus may use an address text similarity calculation model to calculate the similarity between the address text to be queried and a candidate address text. Correspondingly, the computing device 200 may also include a training apparatus for the address text similarity calculation model, and the data storage device 120 may also store a training address library, which may be the same as or different from the standard address library above and includes multiple address texts; the training apparatus uses the address texts in the training address library to train the address text similarity calculation model.

FIG. 2 shows a structural diagram of a computing device 200 according to an embodiment of the present invention. As shown in FIG. 2, in a basic configuration 202, the computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between the processors 204 and the system memory 206. Depending on the desired configuration, the processor 204 may be any type of processor, including but not limited to a microprocessor (µP), a microcontroller (µC), a digital signal processor (DSP), or any combination thereof. The processor 204 may include one or more levels of cache, such as a level-1 cache 210 and a level-2 cache 212, a processor core 214, and registers 216. An example processor core 214 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204. Depending on the desired configuration, the system memory 206 may be any type of memory, including but not limited to volatile memory (such as RAM) and non-volatile memory (such as ROM or flash memory), or any combination thereof. The system memory 206 may include an operating system 220, one or more application programs 222, and program data 224. The application program 222 is in effect a set of program instructions that instruct the processor 204 to perform corresponding operations. In some embodiments, the application program 222 may be arranged to cause the processor 204 to operate with the program data 224 on the operating system.
The computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (for example, an output device 242, a peripheral interface 244, and a communication device 246) to the basic configuration 202 via a bus/interface controller 230. Example output devices 242 include a graphics processing unit 248 and an audio processing unit 250, which may be configured to facilitate communication with various external devices such as a display or speakers via one or more A/V ports 252. An example peripheral interface 244 may include a serial interface controller 254 and a parallel interface controller 256, which may be configured to facilitate communication, via one or more I/O ports 258, with external devices such as input devices (for example, a keyboard, mouse, pen, voice input device, or touch input device) or other peripherals (for example, a printer or scanner). An example communication device 246 may include a network controller 260, which may be arranged to facilitate communication with one or more other computing devices 262 over a network communication link via one or more communication ports 264.

A network communication link may be one example of a communication medium. A communication medium may typically be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery medium. A "modulated data signal" is a signal in which one or more of its characteristics are set or changed in such a manner as to encode information in the signal. As non-limiting examples, communication media may include wired media such as a wired or dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable media as used herein may include both storage media and communication media.

In the computing device 200 according to the present invention, the application program 222 includes a training apparatus 600 for the address text similarity calculation model and an address search apparatus 700. The apparatus 600 includes program instructions that may instruct the processor 204 to execute the training method 300 of the address text similarity calculation model; the apparatus 700 includes program instructions that may instruct the processor 204 to execute the address search method 500.

FIG. 3 shows a flowchart of a training method 300 of an address text similarity calculation model according to an embodiment of the present invention. The method 300 is suitable for execution in a computing device (for example, the aforementioned computing device 200). As shown in FIG. 3, the method 300 starts at step S310. In step S310, a training data set is acquired. The training data set includes multiple pieces of training data, and each piece of training data includes three address texts: a first address text, a second address text, and a third address text. Each address text includes multiple address elements arranged from the highest level to the lowest; the address elements of the first n levels of the first address text and the second address text are the same, while the address elements of the first (n-1) levels of the first address text and the third address text are the same and the address elements of the nth level differ.
Here, n ranges over (1, N), where N is the number of address levels included in the address text. For example, if the address text includes five address levels, namely province, city, district, road, and road number, then N is 5. Of course, n may also take other ranges depending on the specific application scenario.

In the embodiments of the present invention, each piece of training data is a triplet of three address texts, {target_addr, pos_addr, neg_addr}, where target_addr corresponds to the first address text above, pos_addr to the second, and neg_addr to the third. {target_addr, pos_addr} constitutes a positive sample pair, and {target_addr, neg_addr} constitutes a negative sample pair.

In one embodiment, the training data set is obtained as follows. First, the original address text is obtained from the training address library (or the standard address library) and parsed: the character string of the address text is segmented and formatted into address elements. For example, the address text "浙江省杭州市余杭區文一西路969號阿里巴巴西溪園區1號樓7層910號" (No. 910, 7th Floor, Building 1, Alibaba Xixi Campus, No. 969 Wenyi West Road, Yuhang District, Hangzhou, Zhejiang Province) can be segmented into "prov (province) = 浙江省, city = 杭州市, district = 余杭區, road = 文一西路, roadno (road number) = 969號, poi = 阿里巴巴西溪園區, houseno (building number) = 1號樓, floorno (floor number) = 7層, roomno (room number) = 910號". Specifically, this parsing may be accomplished by combining a word segmentation model and a named entity model; the embodiments of the present invention do not restrict the specific word segmentation and named entity models, and those skilled in the art can make reasonable choices as needed.

Then, the address text formatted as address elements is aggregated (deduplicated and sorted) by address elements of different levels, forming a table of the following form:
[Table omitted in the source: addresses aggregated by address-element level (Figure 108129457-A0304-0001)]
Finally, the aggregated data in the table is combined into positive and negative sample pairs of training data at different address levels, with output format {target_addr, pos_addr, neg_addr}. As mentioned above, {target_addr, pos_addr} constitutes a positive sample pair, and {target_addr, neg_addr} constitutes a negative sample pair. Note that one positive sample pair may correspond to multiple negative sample pairs; that is, a target_addr corresponds to one pos_addr but may correspond to multiple neg_addr. The specific operations are as follows (see the sketch after this section):

(1) Select an address text, for example: prov=浙江省, city=杭州市, district=余杭區, road=文一西路, roadno=969號, poi=阿里巴巴西溪園區.

(2) Traverse all address levels from high to low, for example, province -> city -> district -> road. At each address level, find address elements that are respectively the same as and different from the current address element, forming a positive sample pair and a negative sample pair with the current address text. For example:

At the province level, for 浙江省 杭州市 余杭區 文一西路 969號 阿里巴巴西溪園區, a positive example is 浙江省 寧波市 鄞州區 宜園路 245號 國驊宜家花園1期, and a negative example is 上海 上海市 長寧區 虹橋路 2550號 上海虹橋國際機場.

At the city level, a positive example is 浙江省 杭州市 余杭區 文一西路 1008號 浙江省社會主義學院, and a negative example is 浙江省 寧波市 鄞州區 宜園路 525號 宜家家居.

At the district level, a positive example is 浙江省 杭州市 余杭區 高教路 248號 賽銀國際廣場, and a negative example is 浙江省 杭州市 上城區 南山路 218號 中國美術學院南山校區.

After the training data set is acquired, the method 300 proceeds to step S320. Before describing the processing of step S320, the structure of the address text similarity calculation model of the embodiments of the present invention is introduced. Referring to FIG. 4, the address text similarity calculation model 400 of an embodiment of the present invention includes a word embedding layer 410, a text encoding layer 420, and a similarity calculation layer 430. The word embedding layer 410 is adapted to convert each address element in the address text into a word vector and combine the word vectors into the word vector set corresponding to the address text; the text encoding layer 420 is adapted to encode the word vector set corresponding to the address text into a text vector; and the similarity calculation layer 430 is adapted to calculate the similarity between two text vectors, using the similarity between text vectors to characterize the similarity between address texts.
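The sketch referred to above: a minimal rendering of the triplet-construction procedure, in which the level names, the dict representation of a parsed address, and the random sampling strategy are assumptions made for illustration rather than prescriptions of the patent:

```python
import random

# Address levels from highest to lowest; a four-level subset is assumed here.
LEVELS = ["prov", "city", "district", "road"]

def build_triplets(target, corpus):
    """Yield {target_addr, pos_addr, neg_addr} triplets, one per address level.

    `target` and each entry of `corpus` are dicts mapping a level name to the
    address element at that level, as produced by the parsing step above.
    """
    prefix = lambda addr, n: tuple(addr[lvl] for lvl in LEVELS[:n])
    for n in range(1, len(LEVELS) + 1):
        # Positive pool: first n levels identical to the target.
        pos_pool = [a for a in corpus
                    if a is not target and prefix(a, n) == prefix(target, n)]
        # Negative pool: first n-1 levels identical, nth level different.
        neg_pool = [a for a in corpus
                    if prefix(a, n - 1) == prefix(target, n - 1)
                    and a[LEVELS[n - 1]] != target[LEVELS[n - 1]]]
        if pos_pool and neg_pool:
            yield target, random.choice(pos_pool), random.choice(neg_pool)
```

At n = 1 this reproduces the province-level example above (same province for the positive, different province for the negative), and so on down the hierarchy.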
In step S320, the first address text, the second address text, and the third address text of each piece of training data are input into the word embedding layer for processing, yielding a first word vector set corresponding to the first address text, a second word vector set corresponding to the second address text, and a third word vector set corresponding to the third address text.

The word embedding layer (embedding layer) converts each word in a sentence into a numeric vector (word vector). The weights of the embedding layer can be pre-computed from the text co-occurrence information of a massive corpus, for example with the GloVe algorithm or the CBOW and skip-gram algorithms of Word2Vec. These algorithms are based on the fact that different textual expressions of the same latent semantics appear repeatedly in the same contexts; the relationship between context and words is used to predict context from a word, or a word from its context, yielding the latent semantics of each word. In the embodiments of the present invention, the parameters of the word embedding layer may be trained separately on a corpus, or the word embedding layer and the text encoding layer may be trained together, obtaining the parameters of both at the same time. The following description takes joint training of the word embedding layer and the text encoding layer as an example.

Specifically, the address text includes multiple formatted address elements. After the address text is input into the word embedding layer, the word embedding layer treats each address element in the address text as a word and converts it into a word vector, obtaining multiple word vectors, which are then combined into a word vector set. In one implementation, the word vector set is represented as a list, a word vector list, in which each list item corresponds to one word vector and the number of items equals the number of address elements in the address text. In another implementation, the word vector set is represented as a matrix, a word vector matrix, in which each column corresponds to one word vector and the number of columns equals the number of address elements in the address text.

After the word vector sets are acquired, the method 300 proceeds to step S330. In step S330, the first, second, and third word vector sets are input into the text encoding layer for processing, so that the first word vector set is encoded into a first text vector, the second word vector set into a second text vector, and the third word vector set into a third text vector.

The text encoding layer is implemented with a deep neural network (DNN) model, for example a recurrent neural network (RNN) model, a convolutional neural network (CNN) model, or a deep belief network (DBN) model. The DNN encodes the embedding output of an address sentence of indefinite length into a fixed-length sentence vector; at this point target_addr, pos_addr, and neg_addr are converted into vector_A, vector_B, and vector_C respectively, where vector_A is the first text vector above, vector_B the second, and vector_C the third. A minimal model sketch is given below.
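A minimal PyTorch sketch of this two-layer encoding pipeline; the LSTM (a variant of the RNN option named above) and all layer sizes are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class AddressEncoder(nn.Module):
    """Word embedding layer + text encoding layer: maps a variable-length
    sequence of address-element ids to a fixed-length text vector."""

    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)             # word embedding layer 410
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # text encoding layer 420

    def forward(self, element_ids: torch.Tensor) -> torch.Tensor:
        # element_ids: (batch, num_address_elements)
        word_vectors = self.embedding(element_ids)  # one word vector per address element
        _, (h_n, _) = self.encoder(word_vectors)    # encode the sequence
        return h_n[-1]                              # fixed-length text vector: (batch, hidden_dim)
```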
Taking an RNN as an example, the word vector sequence corresponding to the address text can be regarded as a time series; the word vectors in the sequence are input into the RNN in order, and the final output vector is the text vector (sentence vector) corresponding to the address text. Taking a CNN as an example, the word vector matrix corresponding to the address text is input into the CNN, processed by multiple convolutional and pooling layers, and finally converted from a two-dimensional feature map into a one-dimensional feature vector by a fully connected layer; this feature vector is the text vector corresponding to the address text.

After the text vectors are acquired, the method 300 proceeds to step S340. In step S340, the similarity calculation layer is used to calculate a first similarity between the first text vector and the second text vector, and a second similarity between the first text vector and the third text vector. In this way, the first similarity characterizes the similarity between the first and second address texts, and the second similarity characterizes the similarity between the first and third address texts. Various similarity or distance measures may be chosen, for example Euclidean distance, cosine similarity, or the Jaccard coefficient. In this embodiment, the similarity between vector_A and vector_B is denoted SIM_AB, and the similarity between vector_A and vector_C is denoted SIM_AC.

Finally, in step S350, the network parameters of the word embedding layer and the text encoding layer are adjusted according to the first similarity and the second similarity. Specifically, this includes: calculating a loss function value according to the first and second similarities, and adjusting the network parameters of the word embedding layer and the text encoding layer with the back-propagation algorithm until the loss function value falls below a preset value or the number of training iterations reaches a predetermined number.

The loss function here is a triplet loss function, which pulls the members of a positive sample pair closer together and pushes the members of a negative sample pair apart. The loss function can be expressed as: loss = Margin - (SIM_AB - SIM_AC). The back-propagation algorithm is used to optimize the network objective min(loss), so that the network actively learns parameters that bring target_addr closer to pos_addr in the semantic space while moving it away from neg_addr. Margin is a hyperparameter expressing that training should keep a certain distance between SIM_AB and SIM_AC so as to increase the discriminative power of the model; the value of Margin can be adjusted repeatedly according to the data and the actual task until the effect is optimal. A training-step sketch is given below.

After the above training process is completed, a similarity calculation model that can be used to calculate the similarity between two pieces of address text is finally obtained. Based on this similarity calculation model, an embodiment of the present invention further provides an address text similarity determination method, including the following steps: 1) acquiring an address text pair whose similarity is to be determined; 2) inputting the address text pair into the trained address text similarity calculation model to output the similarity of the two address texts included in the pair.
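The training-step sketch referred to above, using cosine similarity for the similarity calculation layer. Clamping the loss at zero is a common practice for triplet objectives and is an addition of this sketch; the patent states the unclamped form loss = Margin - (SIM_AB - SIM_AC):

```python
import torch
import torch.nn.functional as F

def triplet_step(encoder, optimizer, target_ids, pos_ids, neg_ids, margin=0.2):
    """One back-propagation step on a batch of {target, pos, neg} triplets."""
    vec_a = encoder(target_ids)                 # vector_A
    vec_b = encoder(pos_ids)                    # vector_B
    vec_c = encoder(neg_ids)                    # vector_C
    sim_ab = F.cosine_similarity(vec_a, vec_b)  # first similarity, SIM_AB
    sim_ac = F.cosine_similarity(vec_a, vec_c)  # second similarity, SIM_AC
    loss = torch.clamp(margin - (sim_ab - sim_ac), min=0).mean()
    optimizer.zero_grad()
    loss.backward()   # adjusts embedding-layer and encoding-layer parameters
    optimizer.step()
    return loss.item()
```

Minimizing this objective increases SIM_AB relative to SIM_AC until their gap exceeds the margin, which is exactly the separation the text above describes.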
In addition, the similarity calculation model can be applied in various scenarios where address text similarity needs to be calculated, for example address standardization in the fields of public security, express delivery, logistics, and electronic maps. In these scenarios, the address text similarity calculation model of the embodiments of the present invention can be used to provide an address search service for users.

FIG. 5 shows a flowchart of an address search method 500 according to an embodiment of the present invention. Referring to FIG. 5, the method 500 includes steps S510 to S530. In step S510, one or more candidate address texts corresponding to the address text to be queried are obtained. In the address search service, the user inputs the address text to be queried (the query) through the user terminal; typically, the user's input is incomplete and inaccurate address text. The user terminal sends the query to the computing device, and the address search apparatus in the computing device retrieves the standard address library and recalls a batch of candidate address texts, usually ranging from a few to several thousand. In step S520, the address text to be queried and each candidate address text are input into the preset address text similarity calculation model to obtain the similarity between the two, where the model is trained according to the method 300 above; in this step, the similarity between the address text to be queried and each candidate address text is calculated separately. After the similarities between the address text to be queried and all candidate address texts are obtained, the method 500 proceeds to step S530. In step S530, the candidate address text with the largest similarity is determined as the target address text corresponding to the address text to be queried and returned to the user; a minimal sketch of this ranking step follows.
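The ranking sketch referred to above. Here `similarity` is assumed to wrap the trained similarity calculation model and return a float score; the recall step that produces `candidates` is outside this sketch:

```python
def search(query_text: str, candidates: list, similarity) -> str:
    """Steps S520-S530: score every recalled candidate against the query and
    return the candidate with the largest similarity as the target address."""
    scored = [(similarity(query_text, cand), cand) for cand in candidates]
    _, best_candidate = max(scored)   # candidate with the largest similarity
    return best_candidate
```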
FIG. 6 shows a schematic diagram of a training apparatus 600 for an address text similarity calculation model according to an embodiment of the present invention. The address text similarity calculation model includes a word embedding layer, a text encoding layer, and a similarity calculation layer, and the training apparatus 600 includes: an acquisition module 610 adapted to acquire a training data set, where the training data set includes multiple pieces of training data, each of which includes a first, a second, and a third address text, the address elements of the first n levels of the first and second address texts being the same, and the address elements of the first (n-1) levels of the first and third address texts being the same while the address elements of the nth level differ; a word vector acquisition module 620 adapted to input the first, second, and third address texts of each piece of training data into the word embedding layer to obtain the corresponding first, second, and third word vector sets; a text vector acquisition module 630 adapted to input the first, second, and third word vector sets into the text encoding layer to obtain the corresponding first, second, and third text vectors; a second similarity calculation module 640 adapted to use the similarity calculation layer to calculate a first similarity between the first and second text vectors and a second similarity between the first and third text vectors; and a parameter adjustment module 650 adapted to adjust the network parameters of the word embedding layer and the text encoding layer according to the first and second similarities. The modules 610 to 650 specifically execute the methods of steps S310 to S350 above, respectively; for their processing logic and functions, see the descriptions of the corresponding steps, which are not repeated here.

FIG. 7 shows a schematic diagram of an address search apparatus 700 according to an embodiment of the present invention. Referring to FIG. 7, the address search apparatus 700 includes: a query module 710 adapted to obtain one or more candidate address texts corresponding to the address text to be queried; a first similarity calculation module 720 adapted to input the address text to be queried and each candidate address text into the preset address text similarity calculation model to obtain the similarity between the two, where the address text similarity calculation model is trained by the training apparatus 600; and an output module 730 adapted to determine the candidate address text with the largest similarity as the target address text corresponding to the address text to be queried.

The various techniques described here can be implemented in hardware or software, or a combination thereof. Thus, the method and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (that is, instructions) embedded in a tangible medium, such as a removable hard disk, a USB flash drive, a floppy disk, a CD-ROM, or any other machine-readable storage medium, where, when the program is loaded into a machine such as a computer and executed by the machine, the machine becomes an apparatus for practicing the invention. Where the program code is executed on a programmable computer, the computing device generally includes a processor, a processor-readable storage medium (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code, and the processor is configured to execute the methods of the present invention according to the instructions in the program code stored in the memory.
The various technologies described herein can be implemented in hardware, in software, or in a combination of the two. Thus, the method and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (that is, instructions) embodied in a tangible medium such as a removable hard disk, a USB flash drive, a floppy disk, a CD-ROM, or any other machine-readable storage medium, wherein, when the program is loaded into a machine such as a computer and executed by that machine, the machine becomes a device for practicing the invention. When the program code is executed on a programmable computer, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code, and the processor is configured to execute the methods of the present invention according to the instructions in the program code stored in the memory.

By way of example, and not limitation, readable media comprise readable storage media and communication media. Readable storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media generally embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or another transport mechanism, and include any information delivery media. Combinations of any of the above also fall within the scope of readable media.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other equipment; various general-purpose systems may also be used with the examples of the present invention, and the structure required to construct such systems is apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It should be understood that various programming languages can be used to implement the contents of the present invention described herein, and the foregoing descriptions of specific languages are provided to disclose the best mode of the invention.

The specification provided here sets forth numerous specific details. It will be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

100: address search system
110: user terminal
120: data storage device
200: computing device
202: basic configuration
204: processor
206: system memory
208: memory bus
210: level-1 cache
212: level-2 cache
214: processor core
216: registers
218: memory controller
220: operating system
222: application
224: program data
226: other applications
230: bus/interface controller
232: storage device
234: storage interface bus
236: removable storage
238: non-removable storage
240: interface bus
242: output device
244: peripheral interface
246: communication device
248: graphics processing unit
250: audio processing unit
252: A/V port
254: serial interface controller
256: parallel interface controller
258: I/O port
260: network controller
262: other computing devices
264: communication port
300, 500: method
S310~S350, S510~S530: steps
400: address text similarity calculation model
410: word embedding layer
420: text encoding layer
430: similarity calculation layer
600: training device
610: acquisition module
620: word vector acquisition module
630: text vector acquisition module
640: second similarity calculation module
650: parameter adjustment module
700: address search device
710: query module
720: first similarity calculation module
730: output module

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 shows a schematic diagram of an address search system 100 according to an embodiment of the present invention;
FIG. 2 shows a schematic diagram of a computing device 200 according to an embodiment of the present invention;
FIG. 3 shows a flowchart of a training method 300 for an address text similarity calculation model according to an embodiment of the present invention;
FIG. 4 shows a schematic diagram of an address text similarity calculation model 400 according to an embodiment of the present invention;
FIG. 5 shows a flowchart of an address search method 500 according to an embodiment of the present invention;
FIG. 6 shows a schematic diagram of a training device 600 for an address text similarity calculation model according to an embodiment of the present invention;
FIG. 7 shows a schematic diagram of an address search apparatus 700 according to an embodiment of the present invention.

300: method

Claims (13)

1. A method for determining address text similarity, the address text comprising a plurality of address elements arranged from high level to low level, the method comprising: obtaining an address text pair whose similarity is to be determined; and inputting the address text pair into a preset address text similarity calculation model to output the similarity of the two address texts included in the address text pair; wherein the address text similarity calculation model is trained on a training data set comprising multiple pieces of training data, each piece of training data including at least first, second, and third address texts, wherein the address elements of the first n levels of the first and second address texts are the same, forming a positive sample pair, and the address elements of the first (n-1) levels of the first and third address texts are the same while the address elements of the n-th level differ, forming a negative sample pair.

2. The method according to claim 1, wherein the address text similarity calculation model comprises a word embedding layer, a text encoding layer, and a similarity calculation layer, and the step of training the address text similarity calculation model comprises: inputting the first, second, and third address texts of each piece of training data into the word embedding layer to obtain corresponding first, second, and third word vector sets; inputting the first, second, and third word vector sets into the text encoding layer to obtain corresponding first, second, and third text vectors; using the similarity calculation layer to calculate a first similarity between the first and second text vectors and a second similarity between the first and third text vectors; and adjusting network parameters of the address text similarity calculation model according to the first and second similarities.

3. The method according to claim 2, wherein the network parameters comprise parameters of the word embedding layer and/or parameters of the text encoding layer.

4. The method according to claim 2, wherein each of the first, second, and third word vector sets comprises a plurality of word vectors, each word vector corresponding to one address element in the address text.

5. The method according to claim 2, wherein the word embedding layer uses a GloVe model or a Word2Vec model.

6. The method according to claim 2, wherein the first similarity and the second similarity comprise at least one of a Euclidean distance, a cosine similarity, or a Jaccard coefficient.
7. The method according to claim 2, wherein adjusting the network parameters of the address text similarity calculation model according to the first and second similarities comprises: calculating a loss function value according to the first and second similarities; and adjusting the network parameters of the address text similarity calculation model by a back-propagation algorithm until the loss function value is lower than a preset value or the number of training iterations reaches a predetermined number.

8. The method according to claim 7, wherein the loss function value is: Loss = Margin - (first similarity - second similarity), where Loss is the loss function value and Margin is a hyperparameter.

9. The method according to claim 2, wherein the text encoding layer comprises at least one of an RNN model, a CNN model, or a DBN model.

10. An address search method, comprising: obtaining one or more candidate address texts corresponding to an address text to be queried; inputting the address text to be queried and each candidate address text into a preset address text similarity calculation model to obtain the similarity between the two, wherein the address text similarity calculation model is trained on a training data set comprising multiple pieces of training data, each piece of training data including at least first, second, and third address texts, wherein the address elements of the first n levels of the first and second address texts are the same, forming a positive sample pair, and the address elements of the first (n-1) levels of the first and third address texts are the same while the address elements of the n-th level differ, forming a negative sample pair; and determining the candidate address text with the largest similarity as the target address text corresponding to the address text to be queried.
11. An address search device, comprising: a query module, adapted to obtain one or more candidate address texts corresponding to an address text to be queried; a first similarity calculation module, adapted to input the address text to be queried and each candidate address text into a preset address text similarity calculation model to obtain the similarity between the two, wherein the address text similarity calculation model is trained on a training data set comprising multiple pieces of training data, each piece of training data including at least first, second, and third address texts, wherein the address elements of the first n levels of the first and second address texts are the same, forming a positive sample pair, and the address elements of the first (n-1) levels of the first and third address texts are the same while the address elements of the n-th level differ, forming a negative sample pair; and an output module, adapted to determine the candidate address text with the largest similarity as the target address text corresponding to the address text to be queried.

12. A training device for an address text similarity calculation model, the address text comprising a plurality of address elements arranged from high level to low level, the address text similarity calculation model comprising a word embedding layer, a text encoding layer, and a similarity calculation layer, the device comprising: an acquisition module, adapted to acquire a training data set, the training data set comprising multiple pieces of training data, each piece of training data including at least first, second, and third address texts, wherein the address elements of the first n levels of the first and second address texts are the same, forming a positive sample pair, and the address elements of the first (n-1) levels of the first and third address texts are the same while the address elements of the n-th level differ, forming a negative sample pair; a word vector acquisition module, adapted to input the first, second, and third address texts of each piece of training data into the word embedding layer to obtain corresponding first, second, and third word vector sets; a text vector acquisition module, adapted to input the first, second, and third word vector sets into the text encoding layer to obtain corresponding first, second, and third text vectors; a second similarity calculation module, adapted to use the similarity calculation layer to calculate a first similarity between the first and second text vectors and a second similarity between the first and third text vectors; and a parameter adjustment module, adapted to adjust network parameters of the address text similarity calculation model according to the first and second similarities.
13. A computing device, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any one of the methods according to claims 1 to 10.
TW108129457A 2018-11-19 2019-08-19 Method for determining address text similarity, address searching method, apparatus, and device TW202020688A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811375413.2 2018-11-19
CN201811375413.2A CN111274811B (en) 2018-11-19 2018-11-19 Address text similarity determining method and address searching method

Publications (1)

Publication Number Publication Date
TW202020688A true TW202020688A (en) 2020-06-01

Family

ID=70773096

Family Applications (1)

Application Number Title Priority Date Filing Date
TW108129457A TW202020688A (en) 2018-11-19 2019-08-19 Method for determining address text similarity, address searching method, apparatus, and device

Country Status (3)

Country Link
CN (1) CN111274811B (en)
TW (1) TW202020688A (en)
WO (1) WO2020103783A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783419B (en) * 2020-06-12 2024-02-27 上海东普信息科技有限公司 Address similarity calculation method, device, equipment and storage medium
CN111753516B (en) * 2020-06-29 2024-04-16 平安国际智慧城市科技股份有限公司 Text check and repeat processing method and device, computer equipment and computer storage medium
CN111881677A (en) * 2020-07-28 2020-11-03 武汉大学 Address matching algorithm based on deep learning model
CN112070429B (en) * 2020-07-31 2024-03-15 深圳市跨越新科技有限公司 Address merging method and system
CN112632406B (en) * 2020-10-10 2024-04-09 咪咕文化科技有限公司 Query method, query device, electronic equipment and storage medium
CN113779370B (en) * 2020-11-03 2023-09-26 北京京东振世信息技术有限公司 Address retrieval method and device
CN112559658B (en) * 2020-12-08 2022-12-30 中国科学技术大学 Address matching method and device
CN112579919B (en) * 2020-12-09 2023-04-21 小红书科技有限公司 Data processing method and device and electronic equipment
CN113544700A (en) * 2020-12-31 2021-10-22 商汤国际私人有限公司 Neural network training method and device, and associated object detection method and device
CN113204612B (en) * 2021-04-24 2024-05-03 上海赛可出行科技服务有限公司 Priori knowledge-based network about vehicle similar address identification method
CN113468881B (en) * 2021-07-23 2024-02-27 浙江大华技术股份有限公司 Address standardization method and device
CN113626730A (en) * 2021-08-02 2021-11-09 同盾科技有限公司 Similar address screening method and device, computing equipment and storage medium
CN114048797A (en) * 2021-10-20 2022-02-15 盐城金堤科技有限公司 Method, device, medium and electronic equipment for determining address similarity
CN114254139A (en) * 2021-12-17 2022-03-29 北京百度网讯科技有限公司 Data processing method, sample acquisition method, model training method and device
CN114970525B (en) * 2022-06-14 2023-06-27 城云科技(中国)有限公司 Text co-event recognition method, device and readable storage medium
CN116306627A (en) * 2023-02-09 2023-06-23 北京海致星图科技有限公司 Multipath fusion address similarity calculation method, device, storage medium and equipment
CN116150625B (en) * 2023-03-08 2024-03-29 华院计算技术(上海)股份有限公司 Training method and device for text search model and computing equipment
CN115952779B (en) * 2023-03-13 2023-09-29 中规院(北京)规划设计有限公司 Position name calibration method and device, computer equipment and storage medium
CN117725909B (en) * 2024-02-18 2024-05-14 四川日报网络传媒发展有限公司 Multi-dimensional comment auditing method and device, electronic equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120323968A1 (en) * 2011-06-14 2012-12-20 Microsoft Corporation Learning Discriminative Projections for Text Similarity Measures
CN105988988A (en) * 2015-02-13 2016-10-05 阿里巴巴集团控股有限公司 Method and device for processing text address
CN105930413A (en) * 2016-04-18 2016-09-07 北京百度网讯科技有限公司 Training method for similarity model parameters, search processing method and corresponding apparatuses
CN106557574B (en) * 2016-11-23 2020-02-04 广东电网有限责任公司佛山供电局 Target address matching method and system based on tree structure
CN108804398A (en) * 2017-05-03 2018-11-13 阿里巴巴集团控股有限公司 The similarity calculating method and device of address text
CN107239442A (en) * 2017-05-09 2017-10-10 北京京东金融科技控股有限公司 A kind of method and apparatus of calculating address similarity
CN107609461A (en) * 2017-07-19 2018-01-19 阿里巴巴集团控股有限公司 The training method of model, the determination method, apparatus of data similarity and equipment
CN108536657B (en) * 2018-04-10 2021-09-21 百融云创科技股份有限公司 Method and system for processing similarity of artificially filled address texts
CN108805583B (en) * 2018-05-18 2020-01-31 连连银通电子支付有限公司 E-commerce fraud detection method, device, equipment and medium based on address mapping

Also Published As

Publication number Publication date
CN111274811A (en) 2020-06-12
CN111274811B (en) 2023-04-18
WO2020103783A1 (en) 2020-05-28

Similar Documents

Publication Publication Date Title
TW202020688A (en) Method for determining address text similarity, address searching method, apparatus, and device
CN109960800B (en) Weak supervision text classification method and device based on active learning
Zheng et al. A survey of location prediction on twitter
CN108388559B (en) Named entity identification method and system under geographic space application and computer program
WO2020062770A1 (en) Method and apparatus for constructing domain dictionary, and device and storage medium
WO2021139262A1 (en) Document mesh term aggregation method and apparatus, computer device, and readable storage medium
CN104834747A (en) Short text classification method based on convolution neutral network
WO2019052403A1 (en) Training method for image-text matching model, bidirectional search method, and related apparatus
US11675975B2 (en) Word classification based on phonetic features
CN110147421B (en) Target entity linking method, device, equipment and storage medium
WO2022174552A1 (en) Method and apparatus for obtaining poi state information
CN108614897B (en) Content diversification searching method for natural language
CN113051368B (en) Double-tower model training method, retrieval device and electronic equipment
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
WO2022141876A1 (en) Word embedding-based search method, apparatus and device, and storage medium
WO2024099037A1 (en) Data processing method and apparatus, entity linking method and apparatus, and computer device
CN109492027B (en) Cross-community potential character relation analysis method based on weak credible data
KR20230142754A (en) Document analysis using model intersections
CN114997288A (en) Design resource association method
Song et al. A novel automatic ontology construction method based on web data
KR20220068462A (en) Method and apparatus for generating knowledge graph
CN116401350A (en) Intelligent retrieval method, system and storage medium based on exploration and development knowledge graph
CN113807102B (en) Method, device, equipment and computer storage medium for establishing semantic representation model
Lin et al. A music retrieval method based on hidden markov model
Gu et al. Automatic recognition of Chinese personal name using conditional random fields and knowledge base