CN111274811B - Address text similarity determining method and address searching method


Info

Publication number
CN111274811B
CN111274811B (application CN201811375413.2A)
Authority
CN
China
Prior art keywords
address
text
similarity
address text
similarity calculation
Prior art date
Legal status
Active
Application number
CN201811375413.2A
Other languages
Chinese (zh)
Other versions
CN111274811A
Inventor
刘楚
谢朋峻
郑华飞
李林琳
司罗
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201811375413.2A
Priority to TW108129457A
Priority to PCT/CN2019/119149
Publication of CN111274811A
Application granted
Publication of CN111274811B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F40/30 Semantic analysis
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses an address text similarity determination method and an address search method, where an address text comprises a plurality of address elements arranged from high to low level. The method comprises the following steps: acquiring an address text pair whose similarity is to be determined; and inputting the address text pair into a preset address text similarity calculation model, which outputs the similarity of the two address texts included in the pair. The invention improves the accuracy of address text similarity calculation.

Description

Address text similarity determining method and address searching method
Technical Field
The invention relates to the field of artificial intelligence, in particular to an address text similarity determination method, an address search method, and a computing device.
Background
In some address-sensitive industries or departments, such as public security, express delivery, logistics, and electronic maps, a standard address library is usually maintained inside the system. In practice, descriptions that do not match the standard address library often occur; for example, the spoken address given during a 110 emergency call is usually far from the standard address recorded in the public security system. An effective and fast method is therefore needed to map non-standard address text to the corresponding or most similar address in the standard address library, and judging the similarity between two pieces of address text is crucial to this.
Common ways of calculating address text similarity include:
1. Calculating the similarity of two text segments by edit distance. This ignores the semantics of the text: for example, "Alibaba" can be at the same edit distance from "Alimama" as from an unrelated string of the same length, yet "Alibaba" is semantically far closer to "Alimama" (a short sketch of edit distance follows this list).
2. Calculating semantic similarity between two text segments, e.g. with word2vec. Such methods target all text domains and are not tailored to address text, so when applied to address text their accuracy is not high enough.
3. Decomposing the address text into a number of address elements, manually specifying a weight for each address level, and computing a weighted sum. The drawback is that the weights of the address levels cannot be generated automatically for a given data set, so the approach does not automate well.
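As a concrete illustration of the limitation noted in item 1, below is a minimal sketch of the classic Levenshtein edit distance; the strings are illustrative stand-ins, not examples from the original text.

```python
# Minimal sketch: Levenshtein edit distance via dynamic programming.
# Two string pairs can share the same distance even when one pair is
# semantically far closer, which is exactly the limitation described above.
def edit_distance(a: str, b: str) -> int:
    dp = list(range(len(b) + 1))  # distances from a[:0] to every prefix of b
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

print(edit_distance("alibaba", "alimama"))  # 2, semantically related pair
print(edit_distance("alibaba", "alibobo"))  # 2, same distance, unrelated meaning
```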
Disclosure of Invention
In view of the above problems, the present invention has been made to provide an address text similarity determination method and an address search method that overcome or at least partially solve the above problems.
According to an aspect of the present invention, there is provided an address text similarity determination method, the address text including a plurality of address elements arranged from high to low in rank, the method including:
acquiring an address text pair with similarity to be determined;
inputting the address text pair into a preset address text similarity calculation model to output the similarity of two address texts included in the address text pair;
the address text similarity calculation model is obtained by training based on a training data set comprising a plurality of pieces of training data, each piece of training data at least comprises a first address text, a second address text and a third address text, wherein the first n levels of address elements of the first address text and the second address text are the same to form a positive sample pair, and the first (n-1) levels of address elements of the first address text and the third address text are the same and the nth level of address elements are different to form a negative sample pair.
Optionally, in the address text similarity determining method according to the present invention, the address text similarity calculation model includes a word embedding layer, a text encoding layer, and a similarity calculation layer, and the step of training the address text similarity calculation model includes: inputting the first, second and third address texts of each piece of training data into a word embedding layer to obtain corresponding first, second and third word vector sets; inputting the first, second and third word vector sets into a text coding layer to obtain corresponding first, second and third text vectors; calculating a first similarity of the first text vector and the second text vector and a second similarity of the first text vector and the third text vector by using a similarity calculation layer; and adjusting the network parameters of the address text similarity calculation model according to the first similarity and the second similarity.
Optionally, in the method for determining address text similarity according to the present invention, the network parameter includes: parameters of a word embedding layer and/or parameters of a text encoding layer.
Optionally, in the address text similarity determining method according to the present invention, each word vector set in the first, second, and third word vector sets includes a plurality of word vectors, and each word vector corresponds to one address element in the address text.
Optionally, in the address text similarity determining method according to the present invention, the Word embedding layer employs a Glove model or a Word2Vec model.
Optionally, in the method for determining similarity between address texts according to the present invention, the first similarity and the second similarity are computed as at least one of Euclidean distance, cosine similarity, or the Jaccard coefficient.
Optionally, in the method for determining similarity of address texts according to the present invention, the adjusting network parameters of the address text similarity calculation model according to the first and second similarities includes: calculating a loss function value according to the first similarity and the second similarity; and adjusting the network parameters of the address text similarity calculation model by using a back propagation algorithm until the loss function value is lower than a preset value or the training times reach a preset number.
Optionally, in the address text similarity determining method according to the present invention, the loss function value is: Loss = Margin - (first similarity - second similarity), where Loss is the loss function value and Margin is a hyperparameter.
Optionally, in the address text similarity determining method according to the present invention, the text encoding layer includes at least one of an RNN model, a CNN model, or a DBN model.
According to another aspect of the present invention, there is provided an address search method including:
acquiring one or more candidate address texts corresponding to the address texts to be inquired;
inputting an address text to be inquired and a candidate address text into a preset address text similarity calculation model to obtain the similarity of the address text to be inquired and the candidate address text, wherein the address text similarity calculation model is obtained by training based on a training data set comprising a plurality of pieces of training data, each piece of training data at least comprises a first address text, a second address text and a third address text, wherein the first n levels of address elements of the first address text and the second address text are the same to form a positive sample pair, and the first (n-1) levels of address elements of the first address text and the third address text are the same and the nth level of address elements are different to form a negative sample pair;
and determining the candidate address text with the maximum similarity as a target address text corresponding to the address text to be inquired.
According to another aspect of the present invention, there is provided an address search apparatus including:
the query module is suitable for acquiring one or more candidate address texts corresponding to the address texts to be queried;
the first similarity calculation module is suitable for inputting the address text to be inquired and the candidate address text into a preset address text similarity calculation model to obtain the similarity of the address text and the candidate address text, wherein the address text similarity calculation model is obtained by training a training data set comprising a plurality of pieces of training data, each piece of training data at least comprises a first address text, a second address text and a third address text, wherein the first n levels of address elements of the first address text and the second address text are the same to form a positive sample pair, and the first (n-1) levels of address elements of the first address text and the third address text are the same and the nth level of address elements are different to form a negative sample pair;
and the output module is suitable for determining the candidate address text with the maximum similarity as the target address text corresponding to the address text to be inquired.
According to another aspect of the present invention, there is provided an apparatus for training an address text similarity calculation model, the address text including a plurality of address elements arranged in a high-to-low order, the address text similarity calculation model including a word embedding layer, a text encoding layer, and a similarity calculation layer, the apparatus comprising:
the acquisition module is adapted to acquire a training data set, wherein the training data set comprises a plurality of pieces of training data, each piece of training data at least comprises a first address text, a second address text and a third address text, wherein the first n levels of address elements of the first address text and the second address text are the same to form a positive sample pair, and the first (n-1) levels of address elements of the first address text and the third address text are the same and the nth level of address elements are different to form a negative sample pair;
the word vector acquisition module is suitable for inputting the first, second and third address texts of each piece of training data into the word embedding layer to obtain corresponding first, second and third word vector sets;
the text vector acquisition module is suitable for inputting the first, second and third word vector sets into a text coding layer to obtain corresponding first, second and third text vectors;
the second similarity calculation module is suitable for calculating, by using the similarity calculation layer, a first similarity between the first and second text vectors and a second similarity between the first and third text vectors;
and the parameter adjusting module is suitable for adjusting the network parameters of the address text similarity calculation model according to the first similarity and the second similarity.
According to another aspect of the invention, there is provided a computing device comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing a method according to any of the methods described above.
Since address text naturally contains hierarchical relationships, address elements of different levels play different roles in address similarity calculation. The embodiment of the invention uses the hierarchical relationships within the address text to automatically learn the weights of address elements of different levels, which avoids the subjectivity of manually assigned weights, adapts to the target data source, and therefore allows the degree of similarity of two address texts to be calculated accurately.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a schematic diagram of an address search system 100 according to one embodiment of the invention;
FIG. 2 shows a schematic diagram of a computing device 200, according to one embodiment of the invention;
FIG. 3 illustrates a flow diagram of a method 300 for training an address text similarity calculation model according to one embodiment of the invention;
FIG. 4 illustrates a schematic diagram of an address text similarity calculation model 400 according to one embodiment of the invention;
FIG. 5 illustrates a flow diagram of an address search method 500 according to one embodiment of the invention;
FIG. 6 is a diagram illustrating a training apparatus 600 for an address text similarity calculation model according to an embodiment of the present invention;
fig. 7 shows a schematic diagram of an address search apparatus 700 according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
First, some terms appearing in the description of the embodiments of the present invention are explained as follows:
address text: such as "Ali Baba No. 969 Hangzhou West Lu", "Pengshan area Peng Xi Shanxi, si Chun Jiang Dadao No. 1 Jingjiang university, etc. The address text includes a plurality of address elements arranged in a high-to-low order.
Address element: elements constituting each granularity of the address text, such as "hangzhou wen west road No. 969 a," hangzhou "represents a city," Wen Yi west road "represents a road," 969 a "represents a road number, and aribaba represents a Point of Interest (POI).
Address level: the regions corresponding to address elements have a size-containment relationship, i.e., each address element has a corresponding address level, for example: province > city > district > street/community > road > building.
Address similarity: the similarity between two pieces of address text, a value between 0 and 1. The larger the value, the more likely the two addresses refer to the same place: a value of 1 means the two texts represent the same address, and a value of 0 means the two addresses are unrelated.
Partial order relationship: the regions in an address have a hierarchical containment relationship, such as: province > city > district > street/community > road > building.
Since the address text naturally contains a hierarchical relationship, i.e., the partial order relationship described above, address elements of different levels play different roles in address similarity calculation. The embodiment of the invention automatically generates the weights of address elements of different levels by using this hierarchical relationship; the weights are implicitly embodied in the network parameters of the address text similarity calculation model, so the degree of similarity of two address texts can be calculated accurately.
FIG. 1 shows a schematic diagram of an address search system 100 according to one embodiment of the invention. As shown in fig. 1, the address search system 100 includes a user terminal 110 and a computing device 200.
The user terminal 110 is a terminal device used by a user; it may be a personal computer such as a desktop or notebook computer, or a mobile phone, tablet computer, multimedia device, intelligent wearable device, and the like, but is not limited thereto. The computing device 200 is used to provide services to the user terminal 110 and may be implemented as a server, such as an application server or a Web server, or as a desktop computer, notebook computer, processor chip, mobile phone, tablet computer, etc., but is not limited thereto.
In an embodiment of the present invention, the computing device 200 may be used to provide address search services to the user, for example, the computing device 200 may serve as a server of an electronic map application, but it should be understood by those skilled in the art that the computing device 200 may be any device capable of providing address search services to the user, and is not limited to only a server of an electronic map application.
In one embodiment, the address search system 100 also includes a data storage 120. The data storage 120 may be a relational database such as MySQL, ACCESS, etc., or a non-relational database such as NoSQL, etc.; the data storage device 120 may be a local database residing in the computing device 200, or may be disposed at a plurality of geographic locations as a distributed database, such as HBase, in short, the data storage device 120 is used for storing data, and the present invention is not limited to the specific deployment and configuration of the data storage device 120. The computing device 200 may connect with the data storage 120 and retrieve data stored in the data storage 120. For example, the computing device 200 may directly read the data in the data storage 120 (when the data storage 120 is a local database of the computing device 200), or may access the internet in a wired or wireless manner and obtain the data in the data storage 120 through a data interface.
In the embodiment of the present invention, the data storage device 120 stores a standard address library, in which the address texts are standard address texts (complete and accurate address texts). In the address search service, a user inputs a query address text (query) through the user terminal 110; generally, the user's input is an incomplete or imprecise address text. The user terminal 110 sends the query to the computing device 200, and the address search device in the computing device 200 recalls a batch of candidate address texts, usually several to several thousand, by searching the standard address library. The address search device then calculates the degree of correlation between each candidate address text and the query, for which the address similarity is important reference information. After the address similarity between the query and all the candidate address texts has been calculated, the candidate address text with the maximum similarity is determined as the target address text corresponding to the address text to be queried and returned to the user.
Specifically, the address search means may calculate the similarity between the address text to be queried and the candidate address text using the address text similarity calculation model. Correspondingly, the computing device 200 may further include a training device for the address text similarity calculation model, and the data storage device 120 further stores a training address library, which may be the same as or different from the standard address library, where the training address library includes a plurality of address texts, and the training device trains the address text similarity calculation model by using the address texts in the training address library.
FIG. 2 shows a block diagram of a computing device 200, according to one embodiment of the invention. As shown in FIG. 2, in a basic configuration 202, a computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communicating between the processor 204 and the system memory 206.
Depending on the desired configuration, the processor 204 may be any type of processing, including but not limited to: a microprocessor (μ P), a microcontroller (μ C), a Digital Signal Processor (DSP), or any combination thereof. The processor 204 may include one or more levels of cache, such as a level one cache 210 and a level two cache 212, a processor core 214, and registers 216. Example processor cores 214 may include Arithmetic Logic Units (ALUs), floating Point Units (FPUs), digital signal processing cores (DSP cores), or any combination thereof. The example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
Depending on the desired configuration, system memory 206 may be any type of memory including, but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 206 may include an operating system 220, one or more applications 222, and program data 224. The application 222 is actually a plurality of program instructions that direct the processor 204 to perform corresponding operations. In some embodiments, application 222 may be arranged to cause processor 204 to operate with program data 224 on an operating system.
Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., output devices 242, peripheral interfaces 244, and communication devices 246) to the basic configuration 202 via the bus/interface controller 230. The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more a/V ports 252. Example peripheral interfaces 244 may include a serial interface controller 254 and a parallel interface controller 256, which may be configured to facilitate communications with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 258. An example communication device 246 may include a network controller 260, which may be arranged to facilitate communications with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, program modules, in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its data set or its changes made in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or private-wired network, and various wireless media such as acoustic, radio Frequency (RF), microwave, infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
In the computing device 200 according to the invention, the application 222 comprises a training apparatus 600 for the address text similarity calculation model and an address search apparatus 700. The apparatus 600 includes a plurality of program instructions that may direct the processor 204 to perform the method 300 of training an address text similarity calculation model. The apparatus 700 includes a plurality of program instructions that may direct the processor 204 to perform the address search method 500.
FIG. 3 shows a flow diagram of a method 300 for training an address text similarity calculation model according to one embodiment of the invention. The method 300 is suitable for execution in a computing device, such as the computing device 200 described above. As shown in fig. 3, the method 300 begins at step S310. In step S310, a training data set is obtained, where the training data set includes a plurality of pieces of training data, and each piece of training data includes 3 address texts: a first address text, a second address text, and a third address text. Each address text comprises a plurality of address elements arranged from high to low level; the first n levels of address elements of the first and second address texts are the same, while the first (n-1) levels of address elements of the first and third address texts are the same and the nth level address elements are different. Here n ranges over (1, N), where N is the number of address levels contained in the address text; for example, if the address text contains 5 address levels (province, city, district, road, and road number), then N = 5. Of course, n may also take other value ranges according to the specific application scenario.
In the embodiment of the present invention, each piece of training data is a triplet {target_addr, pos_addr, neg_addr} formed by 3 address texts, where target_addr corresponds to the first address text, pos_addr corresponds to the second address text, and neg_addr corresponds to the third address text. {target_addr, pos_addr} constitutes a positive sample pair, and {target_addr, neg_addr} constitutes a negative sample pair.
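To make the triplet format concrete, here is a minimal Python sketch; the field names follow the {target_addr, pos_addr, neg_addr} notation above, while the sample addresses and the pipe-separated level encoding are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    target_addr: str  # anchor address text
    pos_addr: str     # same first n levels as target -> positive pair
    neg_addr: str     # same first (n-1) levels, different nth level -> negative pair

sample = Triplet(
    target_addr="Zhejiang|Hangzhou|Yuhang|Wenyi West Road",
    pos_addr="Zhejiang|Hangzhou|Yuhang|Gaojiao Road",  # matches through district (n=3)
    neg_addr="Zhejiang|Hangzhou|Xihu|Nanshan Road",    # differs at the district level
)
```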
In one embodiment, the training data set is obtained as follows:
firstly, an original address text is obtained from a training address library (or a standard address library), the original address text is analyzed, and character strings of the address text are segmented and formatted into address elements. For example, the address text "none, no. 1, no. 7, layers 910 of the aribab west park No. 969 of the yubby west of hangzhou city in zhejiang province" prov (province) = city in zhejiang province) = districts of hangzhou city (district) = row ad (road) = Wen Yi west way roadno (road number) =969 poi = aribby west park house No. (floor number) = 7-layer roomno (room number) =910 "may be cut into the address text. Specifically, the above analysis may be completed by combining a word segmentation model and a named entity model, and the embodiment of the present invention does not limit the specific word segmentation model and named entity model, and those skilled in the art may reasonably select the word segmentation model and named entity model according to needs.
Then, the address texts formatted as address elements are aggregated (deduplicated and sorted) according to the address elements of the different levels, to form a table of the following kind:
[Table omitted: the aggregated address data appear as images in the original publication.]
Finally, the aggregated data in the table are combined into positive and negative sample pairs of training data according to the different address levels, with output format {target_addr, pos_addr, neg_addr}. As previously described, {target_addr, pos_addr} constitutes a positive sample pair and {target_addr, neg_addr} constitutes a negative sample pair. Note that one positive sample pair may correspond to multiple negative sample pairs, that is, one target_addr corresponds to one pos_addr but may correspond to multiple neg_addr.
The specific operation is as follows:
(1) Select an address text, for example: prov=Zhejiang Province, city=Hangzhou, district=Yuhang District, road=Wenyi West Road, roadno=969, poi=Alibaba Xixi Campus;
(2) Traverse the address levels, e.g. province -> city -> district -> road; at each address level, find addresses whose element at that level is the same as the current address element (with all higher levels matching) and addresses whose element at that level differs, forming positive and negative sample pairs with the current address respectively (a sketch of this pair construction follows the examples below). For example:
At the province level, for "Alibaba Xixi Campus, No. 969 Wenyi West Road, Yuhang District, Hangzhou, Zhejiang Province", a positive example is: "Yuanye Yijia Garden, No. 245, Yinzhou District, Ningbo, Zhejiang Province"; a negative example is: "Shanghai Hongqiao International Airport, No. 2550 Hongqiao Road, Changning District, Shanghai".
At the city level, for "Alibaba Xixi Campus, No. 969 Wenyi West Road, Yuhang District, Hangzhou, Zhejiang Province", a positive example is: "Zhejiang Institute of Socialism, No. 1008 Wenyi West Road, Yuhang District, Hangzhou, Zhejiang Province"; a negative example is: "No. 525 Huayuan Road, Yinzhou District, Ningbo, Zhejiang Province".
At the district level, for "Alibaba Xixi Campus, No. 969 Wenyi West Road, Yuhang District, Hangzhou, Zhejiang Province", a positive example is: "Saiyin International Plaza, No. 248 Gaojiao Road, Yuhang District, Hangzhou, Zhejiang Province"; a negative example is: "Nanshan Campus of the China Academy of Art, No. 218 Nanshan Road, Shangcheng District, Hangzhou, Zhejiang Province".
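The pair construction referenced in step (2) might look like the following sketch, assuming each address is already formatted as a list of elements ordered from high to low level (this list representation is our assumption).

```python
from typing import List, Tuple

def build_pairs(target: List[str], corpus: List[List[str]], k: int
                ) -> Tuple[List[List[str]], List[List[str]]]:
    """Collect positives (match through level k) and negatives
    (match through level k-1, differ exactly at level k)."""
    positives, negatives = [], []
    for addr in corpus:
        if len(addr) < k or addr is target:  # skip short addresses and the target itself
            continue
        if addr[:k] == target[:k]:
            positives.append(addr)           # first k levels identical
        elif addr[:k - 1] == target[:k - 1]:
            negatives.append(addr)           # higher levels match, level k differs
    return positives, negatives
```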
After the training data set is acquired, the method 300 proceeds to step S320. Before describing the processing procedure of step S320, the structure of the address text similarity calculation model according to the embodiment of the present invention will be described.
Referring to fig. 4, an address text similarity calculation model 400 according to an embodiment of the present invention includes: a word embedding layer 410, a text encoding layer 420, and a similarity calculation layer 430. The word embedding layer 410 is adapted to convert each address element in the address text into a word vector, and combine each word vector into a set of word vectors corresponding to the address text; the text encoding layer 420 is adapted to encode a set of word vectors corresponding to the address text as text vectors; the similarity calculation layer 430 is adapted to calculate a similarity between two text vectors, and characterize the similarity between address texts by using the similarity between the text vectors.
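For concreteness, a minimal PyTorch-style sketch of these three layers follows. The LSTM encoder, cosine similarity, and all dimensions are assumptions; the patent equally allows CNN or DBN encoders and other similarity measures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AddressSimilarityModel(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hid_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)           # word embedding layer
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)   # text encoding layer

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) of address-element ids
        vecs = self.embedding(token_ids)   # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.encoder(vecs)   # final hidden state as the text vector
        return h_n[-1]                     # (batch, hid_dim)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # similarity calculation layer: cosine similarity of the two text vectors
        return F.cosine_similarity(self.encode(a), self.encode(b), dim=-1)
```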
In step S320, the first address text, the second address text, and the third address text in each piece of training data are respectively input to the word embedding layer for processing, so as to obtain a first word vector set corresponding to the first address text, a second word vector set corresponding to the second address text, and a third word vector set corresponding to the third address text.
The word embedding layer can convert each word in a sentence into a numeric vector (word vector). The weights of the embedding layer can be pre-computed from the text co-occurrence information of a massive corpus, for example with the GloVe algorithm, or with the CBOW and skip-gram algorithms in Word2Vec. These algorithms rest on the fact that different textual expressions of the same latent semantics tend to appear repeatedly in similar contexts; by predicting a word from its context (or the context from the word), the latent semantics of each word are obtained. In the embodiment of the invention, the parameters of the word embedding layer can be obtained by separate training on a corpus, or the word embedding layer and the text encoding layer can be trained together so that the parameters of both are obtained simultaneously. The following description takes joint training of the word embedding layer and the text encoding layer as an example.
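Where the embedding weights are pre-trained separately, as the paragraph above permits, the training might be sketched as follows with gensim's Word2Vec (assuming gensim 4.x, where the dimension argument is vector_size; the corpus shown is an illustrative stand-in, one token list per formatted address text).

```python
from gensim.models import Word2Vec

corpus = [
    ["Zhejiang", "Hangzhou", "Yuhang", "Wenyi West Road", "969"],
    ["Zhejiang", "Hangzhou", "Xihu", "Nanshan Road", "218"],
    # ... one token list per address text
]

w2v = Word2Vec(corpus, vector_size=128, window=5, min_count=1, sg=1)  # sg=1 -> skip-gram
vec = w2v.wv["Hangzhou"]  # word vector for one address element
```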
Specifically, the address text comprises a plurality of formatted address elements. After the address text is input into the word embedding layer, the layer treats each address element as one word and converts it into a word vector; the resulting word vectors are then combined into a word vector set.
In one implementation, the word vector set is represented as a list, i.e., a word vector list, each list item in the word vector list corresponds to a word vector, and the number of items in the list is the number of address elements in the address text. In another implementation, the word vector set is represented as a matrix, that is, a word vector matrix, each column of the matrix corresponds to a word vector, and the number of columns of the matrix is the number of address elements in the address text.
After obtaining the set of word vectors, the method 300 proceeds to step S330. In step S330, the first word vector set, the second word vector set, and the third word vector set are respectively input to the text encoding layer for processing, so that the first word vector set is encoded as a first text vector, the second word vector set is encoded as a second text vector, and the third word vector set is encoded as a third text vector.
The text encoding layer is implemented with a deep neural network (DNN) model, for example a recurrent neural network (RNN) model, a convolutional neural network (CNN) model, or a deep belief network (DBN) model. The embedding output of the variable-length address text is encoded by the DNN into a sentence vector of fixed length; at this point, target_addr, pos_addr and neg_addr are converted into vector_A, vector_B and vector_C respectively. vector_A is the first text vector, vector_B is the second text vector, and vector_C is the third text vector.
Taking an RNN as an example, the word vector sequence corresponding to the address text can be regarded as a time series; the word vectors are input into the RNN in order, and the vector finally output is the text vector (sentence vector) corresponding to the address text.
Taking a CNN as an example, the word vector matrix corresponding to the address text is input into the CNN and processed by several convolution and pooling layers; the two-dimensional feature map is finally converted into a one-dimensional feature vector through a fully connected layer, and this feature vector is the text vector corresponding to the address text.
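A sketch of this CNN variant follows; the kernel size, channel count, and output dimension are assumptions.

```python
import torch
import torch.nn as nn

class CnnAddressEncoder(nn.Module):
    def __init__(self, emb_dim: int = 128, channels: int = 64, out_dim: int = 128):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, channels, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)      # pool over the element axis
        self.fc = nn.Linear(channels, out_dim)   # fully connected layer

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (batch, seq_len, emb_dim); Conv1d expects channels first
        x = self.conv(word_vectors.transpose(1, 2))  # (batch, channels, seq_len)
        x = self.pool(x).squeeze(-1)                 # (batch, channels)
        return self.fc(x)                            # fixed-length text vector
```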
After the text vector is obtained, the method 300 proceeds to step S340. In step S340, a first similarity between the first text vector and the second text vector and a second similarity between the first text vector and the third text vector are calculated by using the similarity calculation layer. In this way, the first similarity may represent a similarity between the first address text and the second address text, and the second similarity may represent a similarity between the first address text and the third address text.
Various similarity measures can be selected, for example: Euclidean distance, cosine similarity, the Jaccard coefficient, etc. In this embodiment, the similarity between vector_A and vector_B is denoted SIM_AB, and the similarity between vector_A and vector_C is denoted SIM_AC.
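The three candidate measures might be sketched as follows; the dense-vector generalization of the Jaccard coefficient shown here is one common choice and is our assumption, since the patent does not fix a formula for it.

```python
import torch
import torch.nn.functional as F

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return F.cosine_similarity(a, b, dim=-1)

def euclidean_sim(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return 1.0 / (1.0 + torch.norm(a - b, dim=-1))   # map distance into (0, 1]

def jaccard_sim(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    num = torch.sum(torch.minimum(a, b), dim=-1)
    den = torch.sum(torch.maximum(a, b), dim=-1)
    return num / den                                  # assumes non-negative vectors
```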
Finally, in step S350, network parameters of the word embedding layer and the text encoding layer are adjusted according to the first similarity and the second similarity. The method specifically comprises the following steps: calculating a loss function value according to the first similarity and the second similarity; and adjusting network parameters of the word embedding layer and the text coding layer by using a back propagation algorithm until the loss function value is lower than a preset value or the training times reach a preset number.
The loss function is a triplet loss; using it pulls the positive sample pair closer together and pushes the negative sample pair further apart. The loss function may be expressed as: Loss = Margin - (SIM_AB - SIM_AC). The network objective min(Loss) is optimized with a back propagation algorithm, so that the network actively learns parameters that move target_addr closer to pos_addr in the semantic space and further from neg_addr.
Margin is a hyperparameter indicating that training must keep a certain gap between SIM_AB and SIM_AC so as to increase the discrimination of the model; its value can be adjusted repeatedly according to the data and the actual task until the effect is optimal.
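Putting the loss and back propagation together, one training step might look like the sketch below, reusing the AddressSimilarityModel sketched earlier. Clamping the loss at zero, as in the standard triplet loss, is our addition; the patent states only the unclamped form Loss = Margin - (SIM_AB - SIM_AC).

```python
import torch

model = AddressSimilarityModel(vocab_size=10000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
margin = 0.5  # the hyperparameter Margin, tuned on the data per the patent

def train_step(target_ids, pos_ids, neg_ids):
    sim_ab = model(target_ids, pos_ids)   # SIM_AB: anchor vs. positive
    sim_ac = model(target_ids, neg_ids)   # SIM_AC: anchor vs. negative
    loss = torch.clamp(margin - (sim_ab - sim_ac), min=0).mean()
    optimizer.zero_grad()
    loss.backward()                        # back propagation
    optimizer.step()
    return loss.item()
```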
After the training process is completed, a similarity calculation model for calculating the similarity between two pieces of address text is finally obtained. Based on this similarity calculation model, the embodiment of the invention also provides an address text similarity determination method, which comprises the following steps:
1) Acquiring an address text pair with similarity to be determined;
2) Inputting the address text pair into the trained address text similarity calculation model to output the similarity of the two address texts included in the pair (a sketch follows).
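A sketch of these two steps, assuming a trained model and a hypothetical to_ids helper that maps an address text to a tensor of element ids:

```python
import torch

def address_similarity(model, addr_a: str, addr_b: str, to_ids) -> float:
    model.eval()
    with torch.no_grad():
        ids_a = to_ids(addr_a).unsqueeze(0)  # (1, seq_len)
        ids_b = to_ids(addr_b).unsqueeze(0)
        return model(ids_a, ids_b).item()    # similarity of the pair
```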
In addition, the similarity calculation model can be applied in any scenario where address text similarity needs to be calculated, for example address standardization in the public security, express delivery, logistics, and electronic map fields. In these scenarios, an address search service can be provided for the user by using the address text similarity calculation model of the embodiment of the present invention.
FIG. 5 shows a flow diagram of an address search method 500 according to one embodiment of the invention. Referring to FIG. 5, the method 500 includes steps S510 to S530.
In step S510, one or more candidate address texts corresponding to the address text to be queried are obtained. In the address search service, a user inputs a query address text (query) through a user terminal; generally, the user's input is an incomplete or imprecise address text. The user terminal sends the query to the computing device, and the address search device in the computing device recalls a batch of candidate address texts, usually several to several thousand, by searching the standard address library.
In step S520, the address text to be queried and each candidate address text are input to a preset address text similarity calculation model to obtain the similarity between the two, where the address text similarity calculation model is obtained by training according to the method 300. In this step, the similarity between the address text to be queried and each candidate address text is calculated in turn.
After the similarity between the address text to be queried and all candidate address texts is obtained, the method 500 proceeds to step S530. In step S530, the candidate address text with the maximum similarity is determined as the target address text corresponding to the address text to be queried, and the target address text is returned to the user.
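The whole of method 500 can be sketched as follows, where recall_candidates is a hypothetical stand-in for the standard-address-library retrieval of step S510 and address_similarity is the helper sketched above.

```python
def search_address(model, query: str, to_ids, recall_candidates) -> str:
    candidates = recall_candidates(query)  # step S510: a few to thousands of texts
    scores = [(address_similarity(model, query, c, to_ids), c) for c in candidates]
    _, best = max(scores)                  # step S530: highest similarity wins
    return best
```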
Fig. 6 is a schematic diagram of a training apparatus 600 for an address text similarity calculation model according to an embodiment of the present invention. The address text similarity calculation model includes a word embedding layer, a text encoding layer, and a similarity calculation layer, and the training apparatus 600 includes:
The obtaining module 610 is adapted to obtain a training data set, where the training data set includes a plurality of pieces of training data, and each piece of training data includes first, second, and third address texts, where the address elements of the first n levels of the first and second address texts are the same, and the address elements of the first (n-1) levels of the first and third address texts are the same while the nth level address elements are different. The obtaining module 610 is specifically configured to execute the method of step S310, and for the processing logic and functions of the obtaining module 610, reference may be made to the related description of step S310, which is not repeated herein.
The word vector obtaining module 620 is adapted to input the first, second, and third address texts of each piece of training data into the word embedding layer to obtain corresponding first, second, and third word vector sets. The word vector obtaining module 620 is specifically configured to execute the method in step S320, and for processing logic and functions of the word vector obtaining module 620, reference may be made to the related description in step S320, which is not described herein again.
The text vector obtaining module 630 is adapted to input the first, second, and third word vector sets into the text encoding layer to obtain corresponding first, second, and third text vectors. The text vector obtaining module 630 is specifically configured to execute the method in step S330, and for processing logic and functions of the word vector obtaining module 630, reference may be made to the related description in step S330, which is not described herein again.
The second similarity calculation module 640 is adapted to calculate, by using the similarity calculation layer, the first similarity between the first and second text vectors and the second similarity between the first and third text vectors. The second similarity calculation module 640 is specifically configured to execute the method in step S340, and for the processing logic and functions of the second similarity calculation module 640, reference may be made to the related description in step S340, which is not repeated herein.
And the parameter adjusting module 650 is adapted to adjust the network parameters of the word embedding layer and the text encoding layer according to the first similarity and the second similarity. The parameter adjusting module 650 is specifically configured to execute the method of step S350, and for the processing logic and functions of the parameter adjusting module 650, reference may be made to the related description of step S350, which is not repeated herein.
Fig. 7 shows a schematic diagram of an address search apparatus 700 according to an embodiment of the present invention. Referring to fig. 7, the address search apparatus 700 includes:
the query module 710 is adapted to obtain one or more candidate address texts corresponding to the address texts to be queried;
the first similarity calculation module 720 is adapted to input the address text to be queried and the candidate address text into a preset address text similarity calculation model to obtain the similarity between the address text and the candidate address text, wherein the address text similarity calculation model is obtained by training through the training device 600;
the output module 730 is adapted to determine the candidate address text with the maximum similarity as the target address text corresponding to the address text to be queried.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store program code; the processor is configured to execute the methods of the present invention according to instructions in said program code stored in the memory.
By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with examples of this invention. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Claims (12)

1. An address text similarity determination method, the address text including a plurality of address elements arranged in order of high to low levels, the method comprising:
acquiring an address text pair with similarity to be determined;
inputting the address text pair into a preset address text similarity calculation model to output the similarity of two address texts included in the address text pair;
the address text similarity calculation model is obtained by training based on a training data set comprising a plurality of pieces of training data, each piece of training data at least comprises a first address text, a second address text and a third address text, wherein the first n levels of address elements of the first address text and the second address text are the same to form a positive sample pair, and the first (n-1) levels of address elements of the first address text and the third address text are the same and the nth level of address elements are different to form a negative sample pair, the address text similarity calculation model comprises a word embedding layer, a text coding layer and a similarity calculation layer, and the step of training the address text similarity calculation model comprises the following steps:
inputting the first, second and third address texts of each piece of training data into a word embedding layer to obtain corresponding first, second and third word vector sets;
inputting the first, second and third word vector sets into a text coding layer to obtain corresponding first, second and third text vectors;
calculating a first similarity of the first text vector and the second text vector and a second similarity of the first text vector and the third text vector by using a similarity calculation layer;
and adjusting the network parameters of the address text similarity calculation model according to the first similarity and the second similarity.
2. The method of claim 1, wherein the network parameters comprise: parameters of a word embedding layer and/or parameters of a text encoding layer.
3. The method of claim 1, wherein each word vector set in the first, second, and third word vector sets comprises a plurality of word vectors, each word vector corresponding to an address element in the address text.
4. The method of claim 1, wherein the Word embedding layer employs a Glove model or a Word2Vec model.
5. The method of claim 1, wherein the first and second similarities comprise at least one of Euclidean distance, cosine similarity, or the Jaccard coefficient.
6. The method of claim 1, wherein said adjusting network parameters of said address text similarity calculation model according to said first and second similarity comprises:
calculating a loss function value according to the first similarity and the second similarity;
and adjusting the network parameters of the address text similarity calculation model by using a back propagation algorithm until the loss function value is lower than a preset value or the training times reach a preset number.
7. The method of claim 6, wherein the loss function value is:
Loss = Margin - (first similarity - second similarity)
wherein Loss is the loss function value, and Margin is a hyperparameter.
8. The method of claim 1, wherein the text encoding layer comprises at least one of an RNN model, a CNN model, or a DBN model.
9. An address search method, comprising:
acquiring one or more candidate address texts corresponding to the address text to be inquired;
inputting an address text to be queried and a candidate address text into a preset address text similarity calculation model to obtain the similarity of the address text to be queried and the candidate address text, wherein the address text similarity calculation model is obtained by training based on a training data set comprising a plurality of pieces of training data, each piece of training data at least comprises a first address text, a second address text and a third address text, the first n levels of the first address text and the second address text are identical in address elements to form a positive sample pair, the first (n-1) levels of the first address text and the third address text are identical in address elements and different in the nth level of the first address text and the third address text to form a negative sample pair, the address text similarity calculation model comprises a word embedding layer, a text coding layer and a similarity calculation layer, and the step of training the address text similarity calculation model comprises the following steps: inputting the first, second and third address texts of each piece of training data into a word embedding layer to obtain corresponding first, second and third word vector sets; inputting the first, second and third word vector sets into a text coding layer to obtain corresponding first, second and third text vectors; calculating a first similarity of the first text vector and the second text vector and a second similarity of the first text vector and the third text vector by using a similarity calculation layer; adjusting the network parameters of the address text similarity calculation model according to the first similarity and the second similarity;
and determining the candidate address text with the maximum similarity as a target address text corresponding to the address text to be inquired.
10. An address search apparatus comprising:
the query module, adapted to acquire one or more candidate address texts corresponding to an address text to be queried;
a first similarity calculation module, adapted to input an address text to be queried and a candidate address text into a preset address text similarity calculation model to obtain the similarity between the two, wherein the address text similarity calculation model is obtained by training based on a training data set comprising a plurality of pieces of training data, each piece of training data at least comprises a first address text, a second address text and a third address text, the first n levels of address elements of the first and second address texts are identical, forming a positive sample pair, and the first (n-1) levels of address elements of the first and third address texts are identical while their nth-level address elements differ, forming a negative sample pair; the address text similarity calculation model comprises a word embedding layer, a text encoding layer and a similarity calculation layer, and the step of training the address text similarity calculation model comprises: inputting the first, second and third address texts of each piece of training data into the word embedding layer to obtain corresponding first, second and third word vector sets; inputting the first, second and third word vector sets into the text encoding layer to obtain corresponding first, second and third text vectors; calculating a first similarity of the first text vector and the second text vector and a second similarity of the first text vector and the third text vector by using the similarity calculation layer; and adjusting the network parameters of the address text similarity calculation model according to the first similarity and the second similarity;
and the output module, adapted to determine the candidate address text with the maximum similarity as the target address text corresponding to the address text to be queried.
11. An apparatus for training an address text similarity calculation model, the address text comprising a plurality of address elements arranged in order from high level to low level, the address text similarity calculation model comprising a word embedding layer, a text encoding layer, and a similarity calculation layer, the apparatus comprising:
a training data acquisition module, adapted to acquire a training data set, wherein the training data set comprises a plurality of pieces of training data, each piece of training data at least comprises a first address text, a second address text and a third address text, the first n levels of address elements of the first and second address texts are identical, forming a positive sample pair, and the first (n-1) levels of address elements of the first and third address texts are identical while their nth-level address elements differ, forming a negative sample pair;
the word vector acquisition module, adapted to input the first, second and third address texts of each piece of training data into the word embedding layer to obtain corresponding first, second and third word vector sets;
the text vector acquisition module, adapted to input the first, second and third word vector sets into the text encoding layer to obtain corresponding first, second and third text vectors;
the second similarity calculation module, adapted to calculate a first similarity of the first text vector and the second text vector and a second similarity of the first text vector and the third text vector by using the similarity calculation layer;
and the parameter adjusting module, adapted to adjust the network parameters of the address text similarity calculation model according to the first similarity and the second similarity.
12. A computing device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-9.
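To make the training procedure recited in claims 1 to 8 concrete, a minimal sketch in Python (PyTorch) follows. It is an illustration under stated assumptions, not the patented implementation: the GRU encoder is just one of the RNN options of claim 8, cosine similarity is one of the options of claim 5, the vocabulary size, layer sizes, and margin value are placeholders, and the claim-7 loss is clamped at zero, a common non-negative variant of the margin loss.

```python
# Illustrative sketch of the triplet training scheme of claims 1-8.
# The GRU encoder, cosine similarity, layer sizes, and margin are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AddressSimilarityModel(nn.Module):
    """Word embedding layer + text encoding layer + similarity layer."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=128):
        super().__init__()
        # Word embedding layer (claim 4 allows GloVe/Word2Vec embeddings).
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Text encoding layer (claim 8 allows RNN/CNN/DBN; a GRU is used here).
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def encode(self, token_ids):
        # token_ids: (batch, seq_len) indices of address elements.
        vectors = self.embedding(token_ids)   # word vector set
        _, h = self.encoder(vectors)          # final hidden state = text vector
        return h[-1]                          # (batch, hidden_dim)

    def forward(self, anchor, positive, negative):
        v1 = self.encode(anchor)      # first address text
        v2 = self.encode(positive)    # second address text (positive pair)
        v3 = self.encode(negative)    # third address text (negative pair)
        # Similarity calculation layer: cosine similarity (one claim-5 option).
        sim_pos = F.cosine_similarity(v1, v2)   # first similarity
        sim_neg = F.cosine_similarity(v1, v3)   # second similarity
        return sim_pos, sim_neg

def train_step(model, optimizer, batch, margin=0.3):
    """One update per claim 6: compute the loss from the two similarities,
    then adjust the network parameters by back propagation."""
    anchor, positive, negative = batch
    sim_pos, sim_neg = model(anchor, positive, negative)
    # Claim-7 loss, clamped at zero so satisfied triplets stop contributing.
    loss = torch.clamp(margin - (sim_pos - sim_neg), min=0).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A training loop would construct triplet batches from the positive and negative sample pairs of claim 1 and call train_step repeatedly, stopping once the loss falls below a preset value or a preset number of iterations is reached, as claim 6 describes.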
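The search method of claim 9 then reduces to scoring every candidate against the query with the trained model and returning the candidate with the maximum similarity. The sketch below continues the one above; how the candidates are retrieved, and the helper name search_address, are illustrative and not specified by the claims.

```python
import torch
import torch.nn.functional as F

def search_address(model, query_ids, candidates):
    """Claim 9 as code: pick the candidate address text whose similarity
    to the address text to be queried is maximal.

    query_ids: (1, seq_len) token ids of the address text to be queried.
    candidates: list of (candidate_text, candidate_ids) pairs; retrieval
    (e.g. from an inverted index) is outside the scope of the claim.
    """
    model.eval()
    best_text, best_sim = None, float("-inf")
    with torch.no_grad():
        query_vec = model.encode(query_ids)
        for text, ids in candidates:
            sim = F.cosine_similarity(query_vec, model.encode(ids)).item()
            if sim > best_sim:
                best_text, best_sim = text, sim
    return best_text, best_sim
```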
CN201811375413.2A 2018-11-19 2018-11-19 Address text similarity determining method and address searching method Active CN111274811B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201811375413.2A CN111274811B (en) 2018-11-19 2018-11-19 Address text similarity determining method and address searching method
TW108129457A TW202020688A (en) 2018-11-19 2019-08-19 Method for determining address text similarity, address searching method, apparatus, and device
PCT/CN2019/119149 WO2020103783A1 (en) 2018-11-19 2019-11-18 Method for determining address text similarity, address searching method, apparatus, and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811375413.2A CN111274811B (en) 2018-11-19 2018-11-19 Address text similarity determining method and address searching method

Publications (2)

Publication Number Publication Date
CN111274811A (en) 2020-06-12
CN111274811B (en) 2023-04-18

Family

ID=70773096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811375413.2A Active CN111274811B (en) 2018-11-19 2018-11-19 Address text similarity determining method and address searching method

Country Status (3)

Country Link
CN (1) CN111274811B (en)
TW (1) TW202020688A (en)
WO (1) WO2020103783A1 (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783419B (en) * 2020-06-12 2024-02-27 上海东普信息科技有限公司 Address similarity calculation method, device, equipment and storage medium
CN111753516B (en) * 2020-06-29 2024-04-16 平安国际智慧城市科技股份有限公司 Text check and repeat processing method and device, computer equipment and computer storage medium
CN111881677A (en) * 2020-07-28 2020-11-03 武汉大学 Address matching algorithm based on deep learning model
CN112070429B (en) * 2020-07-31 2024-03-15 深圳市跨越新科技有限公司 Address merging method and system
CN114254645A (en) * 2020-09-22 2022-03-29 北京百灵互联科技有限公司 Artificial intelligence auxiliary writing system
CN112632406B (en) * 2020-10-10 2024-04-09 咪咕文化科技有限公司 Query method, query device, electronic equipment and storage medium
CN113779370B (en) * 2020-11-03 2023-09-26 北京京东振世信息技术有限公司 Address retrieval method and device
CN112559658B (en) * 2020-12-08 2022-12-30 中国科学技术大学 Address matching method and device
CN112579919B (en) * 2020-12-09 2023-04-21 小红书科技有限公司 Data processing method and device and electronic equipment
JP2023511241A * 2020-12-31 2023-03-17 商湯国際私人有限公司 Neural network training method and apparatus and associated object detection method and apparatus
CN113204612B (en) * 2021-04-24 2024-05-03 上海赛可出行科技服务有限公司 Priori knowledge-based network about vehicle similar address identification method
CN113468881B (en) * 2021-07-23 2024-02-27 浙江大华技术股份有限公司 Address standardization method and device
CN113626730A (en) * 2021-08-02 2021-11-09 同盾科技有限公司 Similar address screening method and device, computing equipment and storage medium
CN114048797A (en) * 2021-10-20 2022-02-15 盐城金堤科技有限公司 Method, device, medium and electronic equipment for determining address similarity
CN114254139A (en) * 2021-12-17 2022-03-29 北京百度网讯科技有限公司 Data processing method, sample acquisition method, model training method and device
CN114970525B (en) * 2022-06-14 2023-06-27 城云科技(中国)有限公司 Text co-event recognition method, device and readable storage medium
CN116306627A (en) * 2023-02-09 2023-06-23 北京海致星图科技有限公司 Multipath fusion address similarity calculation method, device, storage medium and equipment
CN116150625B (en) * 2023-03-08 2024-03-29 华院计算技术(上海)股份有限公司 Training method and device for text search model and computing equipment
CN115952779B (en) * 2023-03-13 2023-09-29 中规院(北京)规划设计有限公司 Position name calibration method and device, computer equipment and storage medium
CN117725909B (en) * 2024-02-18 2024-05-14 四川日报网络传媒发展有限公司 Multi-dimensional comment auditing method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016127904A1 (en) * 2015-02-13 2016-08-18 阿里巴巴集团控股有限公司 Text address processing method and apparatus
CN105930413A (en) * 2016-04-18 2016-09-07 北京百度网讯科技有限公司 Training method for similarity model parameters, search processing method and corresponding apparatuses
CN107239442A (en) * 2017-05-09 2017-10-10 北京京东金融科技控股有限公司 A kind of method and apparatus of calculating address similarity
CN107609461A (en) * 2017-07-19 2018-01-19 阿里巴巴集团控股有限公司 The training method of model, the determination method, apparatus of data similarity and equipment
CN108536657A (en) * 2018-04-10 2018-09-14 百融金融信息服务股份有限公司 The address text similarity processing method and system artificially filled in
CN108805583A (en) * 2018-05-18 2018-11-13 连连银通电子支付有限公司 Electric business fraud detection method, device, equipment and medium based on address of cache

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120323968A1 (en) * 2011-06-14 2012-12-20 Microsoft Corporation Learning Discriminative Projections for Text Similarity Measures
CN106557574B (en) * 2016-11-23 2020-02-04 广东电网有限责任公司佛山供电局 Target address matching method and system based on tree structure
CN108804398A (en) * 2017-05-03 2018-11-13 阿里巴巴集团控股有限公司 The similarity calculating method and device of address text

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Segla Kpodjedo et al. Using local similarity measures to efficiently address approximate graph matching. Discrete Applied Mathematics. 2014, 161-177. *
Song Zihui. Chinese address matching algorithm based on natural language understanding. Journal of Remote Sensing. 2013, (No. 04), 96-109. *
Luo Ming; Huang Hailiang. A Chinese address standardization method based on finite state machines. Application Research of Computers. 2016, (No. 12), 177-181. *
Zheng Aiwu. Research on a self-correcting model for electricity-use addresses based on address semantics and tree analysis. Automation & Instrumentation. 2017, (No. 08), 95-97. *

Also Published As

Publication number Publication date
TW202020688A (en) 2020-06-01
CN111274811A (en) 2020-06-12
WO2020103783A1 (en) 2020-05-28

Similar Documents

Publication Publication Date Title
CN111274811B (en) Address text similarity determining method and address searching method
CN106295796B (en) entity link method based on deep learning
WO2020062770A1 (en) Method and apparatus for constructing domain dictionary, and device and storage medium
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN108288067A (en) Training method, bidirectional research method and the relevant apparatus of image text Matching Model
CN110795527B (en) Candidate entity ordering method, training method and related device
CN110147421B (en) Target entity linking method, device, equipment and storage medium
WO2020232898A1 (en) Text classification method and apparatus, electronic device and computer non-volatile readable storage medium
CN104572631B (en) The training method and system of a kind of language model
JP7417679B2 (en) Information extraction methods, devices, electronic devices and storage media
KR20170004154A (en) Method and system for automatically summarizing documents to images and providing the image-based contents
CN113051368B (en) Double-tower model training method, retrieval device and electronic equipment
WO2024099037A1 (en) Data processing method and apparatus, entity linking method and apparatus, and computer device
CN109145083B (en) Candidate answer selecting method based on deep learning
CN116151263B (en) Multi-mode named entity recognition method, device, equipment and storage medium
CN116703531B (en) Article data processing method, apparatus, computer device and storage medium
CN116797195A (en) Work order processing method, apparatus, computer device, and computer readable storage medium
CN117891939A (en) Text classification method combining particle swarm algorithm with CNN convolutional neural network
CN110598123B (en) Information retrieval recommendation method, device and storage medium based on image similarity
CN112925912B (en) Text processing method, synonymous text recall method and apparatus
CN117743549A (en) Information query method, device and computer equipment
CN111191011B (en) Text label searching and matching method, device, equipment and storage medium
CN117494815A (en) File-oriented credible large language model training and reasoning method and device
CN117009599A (en) Data retrieval method and device, processor and electronic equipment
CN117435685A (en) Document retrieval method, document retrieval device, computer equipment, storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant