CN111488334B - Data processing method and electronic equipment


Info

Publication number: CN111488334B
Application number: CN201910087847.0A
Authority: CN (China)
Prior art keywords: address, vector, word, words, redundancy
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN111488334A
Inventors: 郑华飞, 刘楚, 谢朋峻, 李林琳, 司罗
Current assignee: Alibaba Group Holding Ltd
Original assignee: Alibaba Group Holding Ltd
Events: application filed by Alibaba Group Holding Ltd; priority to CN201910087847.0A; publication of CN111488334A; application granted; publication of CN111488334B

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

An embodiment of the present application provides a data processing method and an electronic device. The data processing method comprises the following steps: acquiring a first address and a second address; vectorizing the first address and the second address respectively to obtain a first vector and a second vector; and performing a redundancy operation on the first address if it is determined, based on the first vector and the second vector, that address redundancy exists between the first address and the second address. In the technical solution provided by the embodiments of the present application, representing an address as a vector captures the semantic similarity between words or between characters in the address, so that whether address redundancy exists between two addresses can be determined reliably and with high accuracy.

Description

Data processing method and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method and an electronic device.
Background
Due to factors such as new residential developments entering the market, newly opened shops, and newly built markets, new addresses are continuously generated, and collecting them manually is enormously expensive. Other address sources must therefore be relied on to expand and update the standard address library; for example, addresses can be fetched automatically from the Internet (delivery addresses entered by users of an e-commerce platform, addresses on an electronic map, etc.). Because some externally sourced addresses are identical to addresses already in the address library, a redundant-address elimination technique is needed to check the external addresses and filter out the redundant ones, so that only addresses without a redundancy problem are added to the address library.
The address library is huge (on the order of tens of millions of entries), and how to eliminate redundant addresses accurately is a technical problem that currently needs to be solved.
Disclosure of Invention
Embodiments of the present application provide a data processing method and an electronic device, which can partially improve or solve the above problems.
In one embodiment of the present application, a data processing method is provided. The data processing method comprises the following steps:
acquiring a first address and a second address;
vectorizing the first address and the second address respectively to obtain a first vector and a second vector;
performing a redundancy operation on the first address if it is determined, based on the first vector and the second vector, that there is address redundancy for the first address and the second address.
In another embodiment of the present application, a data processing method is also provided. The data processing method comprises the following steps:
acquiring at least one candidate comparison address from an address library for a first address to be put into a library;
vectorizing the first address and the at least one candidate comparison address respectively to obtain a first vector of the first address and a second vector of each candidate comparison address;
and adding the first address to the address library under the condition that the first address and all candidate comparison addresses have no address redundancy on the basis of the first vector and second vectors of the candidate comparison addresses.
In yet another embodiment of the present application, an electronic device is also provided. The electronic device comprises a memory and a processor, wherein:
the memory is used for storing programs;
the processor, coupled with the memory, to execute the program stored in the memory to:
acquiring a first address and a second address;
vectorizing the first address and the second address respectively to obtain a first vector and a second vector;
performing a redundancy operation on the first address if it is determined, based on the first vector and the second vector, that address redundancy exists for the first address and the second address.
In the technical solution provided by the embodiments of the present application, representing an address as a vector captures the semantic similarity between words or between characters in the address, so that whether address redundancy exists between two addresses can be determined reliably and with high accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic flowchart of a model training method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of the network structure of a redundancy discrimination model according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of a data processing method according to an embodiment of the present application;
Fig. 4 is a schematic flowchart of a data processing method according to another embodiment of the present application;
Fig. 5 is a block diagram of a data processing system according to an embodiment of the present application;
Fig. 6 is a schematic flowchart of a data processing method according to yet another embodiment of the present application;
Fig. 7 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a data processing apparatus according to another embodiment of the present application;
Fig. 10 is a schematic structural diagram of a data processing apparatus according to yet another embodiment of the present application;
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
One prior-art technique is similar-address querying based on string matching. Taking the character string as the object, a string-matching score is calculated using, for example, the longest common substring, the edit distance, or the KMP string-matching algorithm. This method can realize full-text or approximate matching, but it cannot weight different address elements: a character difference at the province/city/district level is treated the same as a character difference at the building-number level. Nor can it achieve semantic similarity matching. For example, consider "kk Institute, No. 555 dd Road, ee Street, c District, b City, AA Province" and "k' Institute, dd Road, ee Street, c District, b City, AA Province", and suppose "kk Institute" is the abbreviation of "k' Institute". With this technique, the edit distance between the two strings is large, so they are judged to be two different addresses; in fact they refer to the same place, which is precisely a case of address redundancy.
Another prior-art technique performs retrieval and recall based on structured information. The address is structurally parsed into address fields such as province, city, district (county), and street, and string matching is performed at the field level, where different fields can be given different weights. However, this method depends heavily on the accuracy of the upstream structuring, i.e., on NER (Named Entity Recognition): coarse-grained elements such as province, city, and district are relatively easy to recognize, but NER is much harder for fine-grained address elements such as POIs (Points of Interest), buildings, and unit numbers, and NER errors severely affect the downstream structure-based retrieval and recall task. In addition, the method requires the weight of each field-level address element to be set manually, which requires expert experience and introduces manual bias. This method likewise does not consider semantic similarity matching.
The present application provides the following embodiments to solve or ameliorate the above problems of the prior art.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
In some of the flows described in the specification, claims, and figures of the present application, a number of operations appear in a particular order; these operations may be executed out of the order in which they appear herein, or in parallel. The sequence numbers of the operations, e.g., 101, 102, etc., merely distinguish the various operations and do not by themselves imply any order of execution. Additionally, the flows may include more or fewer operations, and the operations may be executed sequentially or in parallel. Note that the terms "first", "second", etc. herein distinguish different messages, devices, modules, and the like; they neither imply a sequential order nor require that the things called "first" and "second" be of different types. The embodiments described below are only a part, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present application.
The embodiments provided herein involve the use of a trained model. The training process of this model is therefore described first, to support the description of the subsequent embodiments and to ease understanding of the solution.
Fig. 1 shows a schematic flow chart of a model training method according to an embodiment of the present application. As shown in fig. 1, the method includes:
101. a first sample is obtained, wherein the first sample comprises an address pair and a sample tag.
102. And respectively carrying out vectorization representation on the two addresses in the address pair to obtain a fourth vector and a fifth vector.
103. And training a learning model based on the fourth vector and the fifth vector to obtain an output result.
104. And when the output result and the sample label determine that the training convergence condition is not met, updating parameters in the learning model, and acquiring a second sample from the training sample set to continue training the learning model.
In 101 above, the first sample may be one sample from a training sample set. In a specific implementation, the training sample set may be built from addresses obtained from the network side, such as shipping addresses collected from an e-commerce platform or addresses from electronic maps. After these addresses are collected, they may first be cleaned. The cleaning process may include, but is not limited to, deleting addresses with incomplete information (that is, an address must contain both coarse-grained and fine-grained address elements). Province, city, district, and the like are coarse-grained address elements; building numbers, room numbers, and the like are fine-grained address elements. In practical applications, the rule defining what counts as incomplete address information may be customized; this embodiment does not specifically limit it.
Then, the large number of collected addresses is mined to construct samples, thereby forming the training sample set. The training sample set includes, but is not limited to, positive and negative samples; the ratio of positive to negative samples may be, for example, 1:1. A positive sample can be simply understood as a sample composed of the addresses of two different places; a negative sample can be simply understood as a sample composed of two differently expressed addresses of the same place. The sample label characterizes whether a sample is positive or negative; for example, label 1 marks a negative sample and label 0 marks a positive sample. The training sample set may take the form of Table 1 below.
TABLE 1 training sample set
(Table 1 is reproduced as an image in the original publication.)
The more samples in the training sample set, the better the model training effect will be.
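Since Table 1 survives only as an image, the following hypothetical listing (the addresses and values are invented placeholders, not data from this application) illustrates the sample format described above: an address pair plus a label, with 1 marking a negative (redundant) pair and 0 a positive pair.

    # Hypothetical training samples; the addresses are invented placeholders.
    # Label 1 = negative sample (two expressions of the same place, i.e. redundant);
    # label 0 = positive sample (addresses of two different places).
    training_samples = [
        ("kk Institute, 555 dd Road, c District, b City, AA Province",
         "k' Institute, dd Road, c District, b City, AA Province", 1),
        ("kk Institute, 555 dd Road, c District, b City, AA Province",
         "No. 7 Warehouse, ee Road, c District, b City, AA Province", 0),
    ]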
In 102, the vectorized representation of the two addresses in the address pair may be implemented based on a dictionary table. The dictionary table may include, but is not limited to, at least one of: a word-embedding vocabulary and a char-embedding vocabulary. In a specific implementation, the dictionary table can be generated using an embedding model, i.e., a method for learning distributed embedded vector representations such as Word2Vec. Word2Vec learns a continuous embedded vector representation of each word by optimizing the likelihood of a word occurring in a given context. In this embodiment, embedding vectors may be trained separately at the word level and at the single-character (char) level; both the word embedding vectors and the char embedding vectors may be trained unsupervised using the skip-gram model in the Word2Vec tool, with a dimension of, for example, 300. Word embedding vectors capture the semantic information of words accurately, while char embedding vectors provide more generalization and better describe words that never appear in the address sample set. That is, the method provided by this embodiment may further include a step of generating the dictionary table, which may specifically include:
acquiring an address sample set;
and training, based on the address sample set, with the skip-gram model to obtain the dictionary table.
The address sample set includes a plurality of addresses as samples; these addresses may be obtained from the network side or taken from the address library. Each sample in the address sample set is fed to the skip-gram model to obtain word embedding vectors for a plurality of words and/or char embedding vectors for a plurality of characters. A word-embedding vocabulary is built from the word embedding vectors, and a char-embedding vocabulary from the char embedding vectors.
For example, two vocabularies are generated through the above steps: the word-embedding vocabulary and the char-embedding vocabulary. The word-embedding vocabulary stores a 300-dimensional embedding vector for each word; the char-embedding vocabulary stores a 300-dimensional embedding vector for each character.
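As a minimal sketch of this vocabulary-generation step (the gensim library, the toy corpus, and the whitespace pre-segmentation are all assumptions, not choices prescribed by this application):

    from gensim.models import Word2Vec

    # Assumed toy corpus: pre-segmented addresses, whitespace marking word
    # boundaries; a real corpus would come from the cleaned address samples.
    addresses = [
        "AA-province b-city c-district dd-road 555 kk-institute",
        "AA-province b-city c-district dd-road k'-institute",
    ]
    word_corpus = [addr.split() for addr in addresses]                 # word level
    char_corpus = [list(addr.replace(" ", "")) for addr in addresses]  # char level

    # sg=1 selects the skip-gram model; vector_size=300 matches the
    # 300-dimensional embeddings described above.
    word_model = Word2Vec(word_corpus, vector_size=300, sg=1, min_count=1)
    char_model = Word2Vec(char_corpus, vector_size=300, sg=1, min_count=1)

    word_vocab = {w: word_model.wv[w] for w in word_model.wv.index_to_key}
    char_vocab = {c: char_model.wv[c] for c in char_model.wv.index_to_key}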
Based on the dictionary table, the two addresses in the address pair can be given vectorized representations, i.e., encoded. The encoding process is described below taking one of the two addresses as an example; the other address is handled identically. Assume the address pair contains the fourth address; accordingly, vectorizing the fourth address may include:
1021. acquiring a dictionary table;
1022. representing the fourth address as the fourth vector based on the dictionary table.
In 1021 above, the dictionary table may include one or both of a word-embedding vocabulary and a char-embedding vocabulary. The word-embedding vocabulary contains word embedding vectors corresponding to a plurality of words; the char-embedding vocabulary contains char embedding vectors corresponding to a plurality of characters.
Scheme I
The dictionary table comprises a word embedding word table; accordingly, "representing the fourth address as the fourth vector based on the dictionary table" includes:
s11, performing word segmentation on the fourth address to obtain at least one segmented word;
s12, obtaining word embedding vectors of all split words based on the word embedding word list;
and S13, embedding a vector according to the word of each segmented word to obtain the fourth vector.
Specifically, the implementation of S13 can be characterized by the following formula:

$$\mathrm{sentence\_emb}_i = \sum_{j=1}^{M} \mathrm{word\_emb}_j$$

where $\mathrm{sentence\_emb}_i$ is the embedding-vector (i.e., fourth-vector) representation of the i-th address, $M$ is the number of segmented words contained in the i-th address, and $\mathrm{word\_emb}_j$ is the word embedding vector of the j-th segmented word, $j = 1, 2, \ldots, M$.
This formula obtains the fourth vector by summing the word embedding vectors of all segmented words contained in the fourth address; in practical applications, the fourth vector may instead be obtained by averaging those word embedding vectors. This embodiment does not specifically limit the choice.
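A minimal sketch of Scheme I, assuming the hypothetical word_vocab mapping from the earlier sketch and an address already segmented into words:

    import numpy as np

    def address_vector_scheme1(segmented_words, word_vocab, average=False):
        """Scheme I: sum (or, alternatively, average) the word embedding
        vectors of all segmented words to get the address's sentence vector."""
        vecs = [word_vocab[w] for w in segmented_words if w in word_vocab]
        if not vecs:
            raise ValueError("no segmented word found in the vocabulary")
        sentence_emb = np.sum(vecs, axis=0)       # sum over j = 1..M
        return sentence_emb / len(vecs) if average else sentence_emb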
Scheme II
The dictionary table comprises a word embedding word table; accordingly, the representing the fourth address as the fourth vector based on the dictionary table may include:
and S21, obtaining word embedding vectors of all words in the fourth address based on the word embedding word list.
And S22, embedding vectors according to words of all words in the fourth address to obtain the fourth vector.
Specifically, the implementation of S22 can be characterized by the following formula:

$$\mathrm{sentence\_emb}_i = \sum_{k=1}^{K} \mathrm{char\_emb}_k$$

where $\mathrm{sentence\_emb}_i$ is the embedding-vector (i.e., fourth-vector) representation of the i-th address, $K$ is the number of characters contained in the i-th address, and $\mathrm{char\_emb}_k$ is the char embedding vector of the k-th character, $k = 1, 2, \ldots, K$.
This formula obtains the fourth vector by summing the char embedding vectors of all characters contained in the fourth address; in practical applications, the fourth vector may instead be obtained by averaging those char embedding vectors. This embodiment does not specifically limit the choice.
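Scheme II differs only in operating at the character level; a sketch under the same assumptions:

    import numpy as np

    def address_vector_scheme2(address, char_vocab, average=False):
        """Scheme II: sum (or average) the char embedding vectors of all
        characters contained in the address (k = 1..K)."""
        vecs = [char_vocab[c] for c in address if c in char_vocab]
        if not vecs:
            raise ValueError("no character found in the vocabulary")
        sentence_emb = np.sum(vecs, axis=0)
        return sentence_emb / len(vecs) if average else sentence_emb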
Scheme III
The dictionary table includes: a word embedding vocabulary and a word embedding vocabulary. Accordingly, the representing the fourth address as the fourth vector based on the dictionary table may include:
and S31, performing word segmentation on the fourth address to obtain at least one segmented word.
And S32, obtaining a word embedding vector of each segmented word based on the word embedding word list.
And S33, respectively obtaining the word embedding vectors of all the words in each split word based on the word embedding word list.
And S34, regenerating word embedded vectors for each split word according to the word embedded vectors of the split words and the word embedded vectors of all the characters in the split words.
And S35, embedding the vector according to the regenerated words for each split word to obtain the fourth vector.
Specifically, the implementation of S34 and S35 can be characterized by the following formula:

$$\mathrm{sentence\_emb}_i = \sum_{j=1}^{M}\left[\,\mathrm{word\_emb}_j \oplus \sum_{k=1}^{K_j} \mathrm{char\_emb}_k\right]$$

where $\mathrm{sentence\_emb}_i$ is the embedding-vector (i.e., fourth-vector) representation of the i-th address; the fourth address contains $M$ segmented words, and the word embedding vector of the j-th segmented word is $\mathrm{word\_emb}_j$; the j-th segmented word contains $K_j$ characters, and the char embedding vector of its k-th character is $\mathrm{char\_emb}_k$; $\oplus$ denotes vector concatenation. In short, the fourth address is segmented into one or more segmented words. Each segmented word consists of several characters; the char embedding vectors of those characters are summed and then concatenated with the word embedding vector of the segmented word, producing a new word embedding vector for that word. Finally, the new word embedding vectors of all segmented words in the fourth address are summed to obtain the fourth vector.
Similarly, the summation operation may be replaced by an averaging operation.
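A sketch of Scheme III under the same assumptions; note that concatenating a 300-dimensional word vector with the 300-dimensional sum of char vectors yields the 600-dimensional sentence vectors assumed later in this description:

    import numpy as np

    def address_vector_scheme3(segmented_words, word_vocab, char_vocab):
        """Scheme III: per segmented word, concatenate its word embedding
        with the sum of the char embeddings of its characters, then sum
        the resulting per-word vectors over the whole address."""
        per_word = []
        for w in segmented_words:
            if w not in word_vocab:
                continue
            chars = [char_vocab[c] for c in w if c in char_vocab]
            if not chars:
                continue
            new_word_emb = np.concatenate([word_vocab[w], np.sum(chars, axis=0)])
            per_word.append(new_word_emb)        # 300 + 300 = 600 dimensions
        return np.sum(per_word, axis=0)          # 600-dimensional sentence vector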
In 104 above, before the learning model is trained, its parameters are usually simple initialization values. When sample content is fed into the learning model, the model's output differs from the sample label; this difference (or distance) between the sample label and the model output measures whether the current parameters of the learning model are appropriate, and it can be represented by a loss function. In other words, the loss function represents the gap between the learning model under its current parameters and the ideal model, and guides the adjustment of the model's parameters.
If the loss function shows that the gap between the learning model under the current parameters and the ideal model is large, so that the training convergence condition is not met, the parameters of the learning model are updated and the next iteration is carried out. The process of updating the parameters can be simply understood as follows: from the output result and the sample label, the gradient of each layer of the learning model is obtained by layer-by-layer recursive (back-propagation) computation, and the parameters are then updated according to these per-layer gradients. For details of the parameter-update process, reference may be made to the related prior art, which is not repeated here.
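A minimal PyTorch-style sketch of one such iteration (the framework, the binary cross-entropy loss, and the optimizer are illustrative assumptions, not prescribed by this application):

    import torch
    import torch.nn as nn

    def train_step(model, optimizer, sixth_vectors, labels):
        """One training iteration: forward pass, loss between output and
        sample labels, layer-by-layer gradients via backpropagation,
        then a parameter update."""
        criterion = nn.BCELoss()
        optimizer.zero_grad()
        outputs = model(sixth_vectors).squeeze(-1)  # probabilities in (0, 1)
        loss = criterion(outputs, labels.float())   # gap from the ideal model
        loss.backward()                             # gradients for every layer
        optimizer.step()                            # update model parameters
        return loss.item()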
In the technical solution provided by this embodiment, the two addresses of the address pair in a sample are vectorized, and the learning model is trained on the two corresponding vectors. In each training iteration, the model output and the sample label drive the parameter update. When the model output and the sample label indicate that the training convergence condition has been reached, training is complete and a redundancy discrimination model for calculating the similarity of two addresses is obtained. Address-redundancy judgment with the redundancy discrimination model provided by this embodiment is both highly accurate and highly efficient.
Further, the model training method provided in this embodiment further includes the following steps:
105. and when the output result and the sample label determine that a training convergence condition is reached, the learning model completes training to obtain a redundancy discrimination model for calculating the similarity of the two addresses.
How to apply the redundancy judging model to the address redundancy judging scenario will be described in detail in the following embodiments.
Further, step 103 in the above embodiment, training the learning model based on the fourth vector and the fifth vector to obtain the output result may specifically be implemented by the following steps:
1031. performing a join operation on the fourth vector and the fifth vector to obtain a sixth vector;
1032. and taking the sixth vector as the input of the learning model, and executing the learning model to obtain the output result.
In 1031 above, joining two vectors can be simply understood as vector splicing, for example concatenating the two vectors head to tail. Assuming the fourth vector and the fifth vector are each 600-dimensional, joining them yields a 1200-dimensional sixth vector.
The learning model in this embodiment may be chosen from: a single-layer fully-connected neural network, a CNN (convolutional neural network), an LSTM (long short-term memory) network, or a deeper neural network; this embodiment does not specifically limit the choice.
Assume the learning model is implemented as a single-layer fully-connected neural network used as the redundancy discrimination model. The network input is the concatenation of the sentence-vector representations of the two addresses, i.e., a 1200-dimensional input; the input passes through an intermediate layer containing N hidden neurons (N may be 128, 512, 1024, etc.), and the final output layer is a sigmoid function. The network structure is shown schematically in Fig. 2.
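A sketch of this discriminator in PyTorch (N = 512 is one of the listed choices; the ReLU activation of the hidden layer is an assumption, since the description only fixes the sigmoid output):

    import torch
    import torch.nn as nn

    class RedundancyDiscriminator(nn.Module):
        """Single-layer fully-connected network: 1200-dim concatenated
        input, an intermediate layer of N hidden neurons, and a sigmoid
        output giving the probability that two addresses are redundant."""
        def __init__(self, input_dim=1200, hidden=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(input_dim, hidden),
                nn.ReLU(),                       # assumed hidden activation
                nn.Linear(hidden, 1),
                nn.Sigmoid(),
            )

        def forward(self, x):
            return self.net(x)

    # Usage: splice the two 600-dim sentence vectors head to tail.
    model = RedundancyDiscriminator()
    v4, v5 = torch.randn(600), torch.randn(600)
    prob = model(torch.cat([v4, v5]).unsqueeze(0))  # redundancy probability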
Fig. 3 is a schematic flowchart illustrating a data processing method according to an embodiment of the present application. As shown in fig. 3, the method includes:
201. a first address and a second address are obtained.
202. Vectorizing the first address and the second address respectively to obtain a first vector and a second vector.
203. And performing redundancy operation on the first address under the condition that the first address and the second address have address redundancy based on the first vector and the second vector.
In 201 above, the first address may be an address to be warehoused or an address already existing in the address library, and likewise for the second address. When the first and second addresses are both addresses to be warehoused, the above process can be understood as a redundancy check performed among the to-be-warehoused addresses before warehousing, which reduces the amount of redundancy checking against the address library. When the first address is to be warehoused and the second address already exists in the library, the process is a redundancy check that decides whether the first address is allowed into the library. When both addresses already exist in the library, the process is a redundancy check within the address library itself.
If the first address is an address to be put in storage, the first address may be an address manually input by a user through a client, an address acquired from a network side, or the like.
Assume the second address is an existing address in the address library; the second address may then be one of at least one candidate comparison address selected from the address library for the first address. That is, the method provided in the embodiment of the present application may further include the following steps:
performing structured resolution on the first address;
and recalling the at least one candidate comparison address from an address library according to the structured analysis result.
In a specific implementation, the first address can be structurally parsed using NER technology; the relatively coarse-grained address elements, such as province, city, district, street, and road, are mainly used in the subsequent recall of candidate comparison addresses, which avoids redundancy computation over the whole library and improves overall processing efficiency.
Since the address library is structured and contains address-element fields such as province, city, district, and road, addresses can be recalled at a coarse-grained level (down to the road) based on the structured parsing result of the previous step; that is, every recalled library address is guaranteed to share the same province, city, district, and road information as the first address.
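As an illustration only (the record layout and field names below are hypothetical), this coarse-grained recall amounts to a simple filter over the structured address library:

    def recall_candidates(first_addr, address_library):
        """Recall library addresses sharing the same province, city,
        district, and road as the first address; the dictionary field
        names here are hypothetical placeholders."""
        keys = ("province", "city", "district", "road")
        return [
            record for record in address_library
            if all(record.get(k) == first_addr.get(k) for k in keys)
        ]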
In 202, the process of vectorizing the first address and the second address is the same. The following describes a process of vectorizing the first address. That is, "vectorizing the first address to obtain a first vector" includes:
2021. acquiring a dictionary table;
2022. representing the first address as the first vector based on the dictionary table.
In 2021, the dictionary table may include one or more of a word embedding vocabulary and a word embedding vocabulary. The word embedding word list comprises word embedding vectors corresponding to a plurality of words; the word embedding word list contains word embedding vectors corresponding to a plurality of words.
The first address is represented as the first vector based on the dictionary table, which can be implemented by the following three schemes.
Scheme I
The dictionary table comprises a word embedding word table; accordingly, "representing the first address as the first vector based on the dictionary table" includes:
s11, segmenting words of the first address to obtain at least one segmented word;
s12, obtaining word embedding vectors of all the split words based on the word embedding word list;
and S13, embedding a vector according to the word of each segmented word to obtain the first vector.
For specific implementation contents of the above S11 to S13, reference may be made to corresponding contents in the above embodiments, and details are not described here.
Scheme II
The dictionary table comprises a word embedding word table; accordingly, the representing the first address as the first vector based on the dictionary table may include:
and S21, obtaining word embedding vectors of all words in the first address based on the word embedding word list.
And S22, embedding vectors according to words of all words in the first address to obtain the first vector.
For specific implementation of the foregoing S21 to S22, reference may be made to corresponding contents in the foregoing embodiments, and details are not described here.
Scheme III
The dictionary table includes: a word-embedded vocabulary and a word-embedded vocabulary. Accordingly, the representing the first address as the first vector based on the dictionary table may include:
s31, segmenting the first address to obtain at least one segmented word.
And S32, obtaining a word embedding vector of each segmented word based on the word embedding word list.
And S33, respectively obtaining the word embedding vectors of all the words in each split word based on the word embedding word list.
And S34, regenerating word embedded vectors for each split word according to the word embedded vectors of the split words and the word embedded vectors of all the characters in the split words.
And S35, embedding vectors according to the words regenerated for the split words to obtain the first vectors.
For specific implementation contents of the above S31 to S35, reference may be made to corresponding contents in the above embodiments, and details are not described here.
In 203, "performing a redundancy operation on the first address when it is determined that the first address and the second address have address redundancy based on the first vector and the second vector" may specifically be implemented by:
2031. performing a join operation on the first vector and the second vector to obtain a third vector;
2032. taking the third vector as an input of a redundant discriminant model, executing the redundant discriminant model to output a probability that the first address and the second address are redundant;
2033. when the probability is greater than a threshold value and the first address and the second address are both data in an address base, deleting the first address from the address base;
2034. and when the probability is greater than the threshold, the first address is data to be warehoused, and the second address is data in the address library, rejecting the warehousing request of the first address.
The redundancy discrimination model in this embodiment can be obtained with the model training method provided in the above embodiments. In 2034, the first address whose warehousing is rejected may be discarded directly, or stored as a sample address in the sample repository.
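Putting 2031-2034 together, a hedged sketch (the threshold value and the commented-out downstream handlers are illustrative placeholders):

    import torch

    def redundancy_probability(model, first_vec, second_vec):
        """2031-2032: join the two vectors and execute the redundancy
        discrimination model to obtain the probability of redundancy."""
        third_vec = torch.cat([first_vec, second_vec]).unsqueeze(0)
        with torch.no_grad():
            return model(third_vec).item()

    # 2033/2034 (illustrative): if the probability exceeds the threshold,
    # either delete the first address from the library (when both addresses
    # are already in the library) or reject its warehousing request (when
    # the first address is data to be warehoused).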
According to the technical solution provided by this embodiment, representing an address as a vector captures the semantic similarity between words or between characters in the address, so that whether address redundancy exists between two addresses can be determined reliably and with high accuracy.
Further, the method provided by this embodiment may further include:
204. when the probability is smaller than the threshold value, acquiring a third address from the address library so as to continuously calculate the probability of the redundancy of the first address and the third address;
205. and when the address base has no address with the probability of redundancy with the first address being larger than the threshold value and the first address is data to be put into a database, adding the first address into the address base.
Further, the present embodiment may further include the following steps:
206. acquiring an address sample set;
207. and training by using a Skigram model to obtain the dictionary table based on the address sample set.
For the specific implementation contents of the foregoing 206 to 207, reference may be made to the foregoing embodiments, and details are not described herein again.
Further, the present embodiment may further include the following steps:
208. obtaining a first sample, wherein the first sample comprises an address pair and a sample label;
209. vectorizing two addresses in the address pair respectively to obtain a fourth vector and a fifth vector;
210. training a learning model according to the fourth vector and the fifth vector to obtain an output result;
211. and when the output result and the sample label determine that the training convergence condition is reached, the learning model completes training to obtain the redundancy discrimination model.
Further, the present embodiment may further include the following steps:
212. updating parameters in the learning model when the output result and the sample label determine that the training convergence condition is not reached; and obtaining a second sample from the training sample set to continue training the learning model.
For the specific implementation of the above-mentioned components 208 to 212, reference may be made to the above-mentioned embodiments, and details are not described herein.
Fig. 4 shows a flowchart of a data processing method according to an embodiment of the present application. As shown in fig. 4, the method includes:
301. and acquiring at least one candidate comparison address from the address library as the first address to be put in the library.
302. And respectively performing vectorization representation on the first address and the at least one candidate comparison address to obtain a first vector of the first address and a second vector of each candidate comparison address.
303. Adding the first address to the address bank if it is determined that there is no address redundancy between the first address and all candidate comparison addresses based on the first vector and a second vector of each candidate comparison address.
In 301 above, the first address may be input by the user through the client or acquired automatically from the network side. In a specific implementation, the first address can be structurally parsed using NER technology; the coarser-grained address elements, such as province, city, district, street, and road, are mainly used to recall candidate comparison addresses from the address library, which avoids redundancy computation over the whole library and improves overall processing efficiency. Since the address library is structured and contains address-element fields such as province, city, district, and road, addresses sharing the same province, city, district, and road information as the first address can be recalled from the library as candidate comparison addresses, based on the structured parsing result of the first address.
In an implementation solution, in the foregoing 303, in a case that it is determined that there is no address redundancy between the first address and all candidate comparison addresses based on the first vector and a second vector of each candidate comparison address, adding the first address to the address library includes:
3031. the first vector is respectively connected with the second vector of each candidate comparison address to obtain at least one third vector;
3032. respectively taking the at least one third vector as the input of a redundancy judgment model, and executing the redundancy judgment model to output the redundancy probability of the first address and each candidate comparison address;
3033. and when the probability of redundancy between the first address and every candidate comparison address is smaller than a threshold, adding the first address to the address library.
The above process is described below with reference to a specific example. Assume there are three candidate comparison addresses: the first, the second, and the third candidate comparison address. Their second vectors are $(d_{11}, d_{12}, \ldots, d_{1n})$, $(d_{21}, d_{22}, \ldots, d_{2n})$, and $(d_{31}, d_{32}, \ldots, d_{3n})$, respectively, and the first vector is $(d_{01}, d_{02}, \ldots, d_{0n})$.
Step 3031 above can be simply understood as the following process: the first vector $(d_{01}, d_{02}, \ldots, d_{0n})$ is joined with the second vector $(d_{11}, d_{12}, \ldots, d_{1n})$ of the first candidate comparison address to obtain the first third vector $(d_{01}, \ldots, d_{0n}, d_{11}, \ldots, d_{1n})$;
it is joined with the second vector $(d_{21}, d_{22}, \ldots, d_{2n})$ of the second candidate comparison address to obtain the second third vector $(d_{01}, \ldots, d_{0n}, d_{21}, \ldots, d_{2n})$;
and it is joined with the second vector $(d_{31}, d_{32}, \ldots, d_{3n})$ of the third candidate comparison address to obtain the third third vector $(d_{01}, \ldots, d_{0n}, d_{31}, \ldots, d_{3n})$.
In step 3032 above, the first third vector $(d_{01}, \ldots, d_{0n}, d_{11}, \ldots, d_{1n})$ is used as the input of the redundancy discrimination model, which is executed to output the probability f1 that the first address and the first candidate comparison address are redundant;
the second third vector $(d_{01}, \ldots, d_{0n}, d_{21}, \ldots, d_{2n})$ is used as the input of the model to output the probability f2 that the first address and the second candidate comparison address are redundant;
and the third third vector $(d_{01}, \ldots, d_{0n}, d_{31}, \ldots, d_{3n})$ is used as the input of the model to output the probability f3 that the first address and the third candidate comparison address are redundant.
If f1 < D, f2 < D, and f3 < D (where D denotes the threshold), there is no address-redundancy problem between the first address and any candidate comparison address, and the first address may be added to the address library.
If f1 > D, f2 > D, or f3 > D, at least one candidate comparison address in the address library refers to the same place as the first address, so an address-redundancy problem exists and the first address cannot be added to the address library. The first address may be discarded, or stored as a training sample in the sample repository.
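The worked example above, expressed as a sketch that reuses the hypothetical redundancy_probability helper from the earlier sketch (D is the threshold):

    def may_warehouse(model, first_vec, candidate_vecs, D=0.5):
        """Steps 3031-3033: the first address may be added to the library
        only if its redundancy probability with every candidate comparison
        address stays below the threshold D."""
        probs = [redundancy_probability(model, first_vec, c) for c in candidate_vecs]
        return all(f < D for f in probs)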
According to the technical solution provided by this embodiment, representing an address as a vector captures the semantic similarity between words or between characters in the address, so that whether address redundancy exists between two addresses can be determined reliably and with high accuracy.
The technical solution provided in this embodiment can also be implemented by using the system architecture shown in fig. 5. Specifically, as shown in fig. 5, the data processing system includes:
the client 401 is configured to send a redundant address check request to the server in response to a redundant address check event triggered by a user;
the server 402 is configured to obtain a first address and a check target set when receiving a redundant address check request sent by a client, where the check target set includes at least one candidate comparison address; vectorize the first address and the at least one candidate comparison address respectively to obtain a first vector of the first address and a second vector of each candidate comparison address; and, according to the first vector and the second vector of each candidate comparison address, perform a warehousing operation or a redundancy operation on the first address and feed the operation result back to the client.
The client may be a desktop computer, a notebook computer, a smartphone, a tablet computer, a smart wearable device, or the like, which is not specifically limited in this embodiment. The server side may be a single server, a cloud service, or the like. To facilitate understanding of the solution, the technical solution of the present application is described below with the server of the data processing system as the executing subject; that is, the server can also implement the methods in the corresponding embodiments described below.
Fig. 6 shows a schematic flowchart of a data processing method according to an embodiment of the present application. As shown in fig. 6, the method includes:
501. under the condition of receiving a redundant address check request sent by a client, acquiring a first address and a check target set, wherein the check target set comprises at least one candidate comparison address.
502. And respectively performing vectorization representation on the first address and the at least one candidate comparison address to obtain a first vector of the first address and a second vector of each candidate comparison address.
503. And executing warehousing operation or redundancy operation on the first address according to the first vector and a second vector of each candidate comparison address and feeding back an operation result to the client.
In 501, the redundant address check request may be generated after a user inputs a first address on an interactive page provided by a client and triggers a storage event for the first address; or the user triggers the data maintenance on the address library through the client, and the like, which is not specifically limited in this embodiment.
The check target set comprises candidate comparison addresses, which may be all addresses in the address library or only part of them; for example, the first address can be structurally parsed, and candidate comparison addresses recalled from the address library according to the parsing result.
For the content of the above 502, reference may be made to the above embodiments, and details are not repeated herein.
In an implementation solution, the step 503 "performing a binning operation or a redundancy operation on the first address according to the first vector and the second vector of each candidate comparison address, and feeding back an operation result to the client" may include:
5031. and the first vector is respectively connected with the second vector of each candidate comparison address to obtain at least one third vector.
5032. And taking the at least one third vector as the input of a redundancy judgment model respectively, and executing the redundancy judgment model to output the redundancy probability of the first address and each candidate comparison address respectively.
5033. And when the first address and the probability of each candidate comparison address redundancy are smaller than a threshold value, adding the first address to the address library and feeding back a response of the stored address to the client.
5034. And when the probability that the first address is redundant with any one candidate comparison address is greater than the threshold, deleting the first address from the data to be warehoused or rejecting its warehousing, and feeding back to the client a response indicating the deletion or rejection.
For the specific implementation processes of the 5031-5034, reference may be made to the corresponding contents in the above embodiments, and details are not described here.
According to the technical solution provided by this embodiment, representing an address as a vector captures the semantic similarity between words or between characters in the address, so that whether address redundancy exists between two addresses can be determined reliably and with high accuracy.
Fig. 7 shows a schematic structural diagram of a model training apparatus according to an embodiment of the present application. As shown in fig. 7, the model training device includes: an acquisition module 11, a vectorization module 12 and a training module 13. The obtaining module 11 is configured to obtain a first sample, where the first sample includes an address pair and a sample tag; the vectorization module 12 is configured to separately vectorize two addresses in the address pair to obtain a fourth vector and a fifth vector; the training module 13 is configured to train a learning model based on the fourth vector and the fifth vector to obtain an output result; and when the condition that the training convergence condition is not reached is determined according to the output result and the sample label, updating parameters in the learning model, and obtaining a second sample from the training sample set so as to continue training the learning model.
In the technical solution provided by this embodiment, the two addresses of the address pair in a sample are vectorized, and the learning model is trained on the two corresponding vectors. In each training iteration, the model output and the sample label drive the parameter update. When the model output and the sample label indicate that the training convergence condition has been reached, training is complete and a redundancy discrimination model for calculating the similarity of two addresses is obtained. Address-redundancy judgment with the redundancy discrimination model provided by this embodiment is both highly accurate and highly efficient.
Further, the training module 13 is further configured to:
and when the output result and the sample label determine that a training convergence condition is reached, the learning model completes training to obtain a redundancy discrimination model for calculating the probability of two address redundancies.
Further, the training module 13 is further configured to:
performing a join operation on the fourth vector and the fifth vector to obtain a sixth vector;
and taking the sixth vector as the input of the learning model, and executing the learning model to obtain the output result.
Here, it should be noted that: the model training device provided in the above embodiment may implement the technical solutions described in the above embodiment of the model training method, and the specific implementation principles of the modules or units may refer to the corresponding contents in the above embodiment of the method, which are not described herein again.
Fig. 8 shows a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 8, the data processing apparatus includes: a first obtaining module 21, a vectorization module 22 and an execution module 23. The first obtaining module 21 is configured to obtain a first address and a second address; the vectorization module 22 is configured to perform vectorization representation on the first address and the second address respectively to obtain a first vector and a second vector; the execution module 23 is configured to perform a redundancy operation on the first address if it is determined that there is address redundancy between the first address and the second address based on the first vector and the second vector.
According to the technical solution provided by this embodiment, representing an address as a vector captures the semantic similarity between words or between characters in the address, so that whether address redundancy exists between two addresses can be determined reliably and with high accuracy.
Further, the vectorization module 22 is further configured to: acquiring a dictionary table; representing the first address as the first vector based on the dictionary table.
Further, the dictionary table comprises a word embedding word table, and the word embedding word table contains word embedding vectors corresponding to a plurality of words; and the vectorization module 22 is further configured to:
performing word segmentation on the first address to obtain at least one segmented word;
obtaining a word embedding vector of each split word based on the word embedding word list;
and embedding a vector according to the words of each split word to obtain the first vector.
Further, the dictionary table includes a word embedding word table, and the word embedding word table includes word embedding vectors corresponding to a plurality of words respectively; and the vectorization module 22 is further configured to:
obtaining word embedding vectors of all words in the first address based on the word embedding word list;
and embedding vectors according to words of all words in the first address to obtain the first vector.
Further, the dictionary table includes: a word embedding vocabulary and a word embedding vocabulary; the word embedding word list comprises word embedding vectors corresponding to a plurality of words, and the word embedding word list comprises word embedding vectors corresponding to a plurality of words; and the vectorization module 22 is further configured to:
performing word segmentation on the first address to obtain at least one segmented word;
obtaining word embedding vectors of all the split words based on the word embedding word list;
respectively obtaining word embedding vectors of all words in each segmented word based on the word embedding word list;
regenerating word embedded vectors for the split words according to the word embedded vectors of the split words and the word embedded vectors of all the characters in the split words;
and embedding the vector according to the regenerated word for each split word to obtain the first vector.
Further, the data processing apparatus provided in this embodiment further includes: a second acquisition module and a first training module. The second acquisition module is used to acquire an address sample set; the first training module is used to train, based on the address sample set, with the skip-gram model to obtain the dictionary table.
Further, the execution module 23 is further configured to:
performing a join operation on the first vector and the second vector to obtain a third vector;
taking the third vector as an input of a redundant discriminant model, and executing the redundant discriminant model to output the probability that the first address and the second address are redundant;
when the probability is larger than a threshold value and the first address and the second address are both data in an address base, deleting the first address from the address base;
and when the probability is greater than the threshold value, the first address is data to be stored in a warehouse, and the second address is data in an address library, rejecting the warehousing request of the first address.
Further, the executing module 23 is further configured to:
when the probability is smaller than the threshold value, acquiring a third address from the address library so as to continuously calculate the probability of the redundancy of the first address and the third address;
and when the address base has no address with the probability of redundancy with the first address being larger than the threshold value and the first address is data to be put into a database, adding the first address into the address base.
Further, the data processing apparatus provided in this embodiment may further include: a third acquisition module and a second training module. The third obtaining module is used for obtaining a first sample, wherein the first sample comprises an address pair and a sample label; the vectorization module is used for respectively vectorizing and representing two addresses in the address pair to obtain a fourth vector and a fifth vector; the second training module is used for training a learning model according to the fourth vector and the fifth vector to obtain an output result; and when the output result and the sample label determine that a training convergence condition is reached, the learning model completes training to obtain the redundancy discrimination model.
Still further, the second training module is further configured to:
updating parameters in the learning model when the output result and the sample label determine that the training convergence condition is not reached; and obtaining a second sample from the training sample set to continue training the learning model.
Further, the second address is one of at least one candidate comparison address; correspondingly, the data processing apparatus provided in this embodiment further includes: an analysis module and a recall module. The analysis module is used to perform structured parsing on the first address; the recall module is used to recall the at least one candidate comparison address from the address library according to the structured parsing result.
Here, it should be noted that: the data processing apparatus provided in the foregoing embodiment may implement the technical solutions described in the foregoing data processing method embodiments, and the specific implementation principle of each module or unit may refer to the corresponding content in the foregoing method embodiments, which is not described herein again.
Fig. 9 shows a schematic structural diagram of a data processing apparatus according to another embodiment of the present application. As shown in fig. 9, the data processing apparatus includes: an acquisition module 31, a vectorization module 32 and an adding module 33. The acquisition module 31 is configured to acquire, for a first address to be warehoused, at least one candidate comparison address from an address library; the vectorization module 32 is configured to vectorize the first address and the at least one candidate comparison address respectively to obtain a first vector of the first address and a second vector of each candidate comparison address; the adding module 33 is configured to add the first address to the address library if it is determined, based on the first vector and the second vector of each candidate comparison address, that no address redundancy exists between the first address and any candidate comparison address.
In the technical solution provided by this embodiment, representing an address as a vector captures the semantic similarity between words, or between characters, within the address, so whether address redundancy exists between two addresses can be determined reliably and with high accuracy.
Further, the adding module 33 is further configured to:
perform a join operation on the first vector and the second vector of each candidate comparison address respectively to obtain at least one third vector;
take the at least one third vector respectively as the input of the redundancy discrimination model, and execute the redundancy discrimination model to output the probability that the first address is redundant with each candidate comparison address;
and when the probability of redundancy between the first address and every candidate comparison address is less than a threshold, add the first address to the address library.
Here, it should be noted that: the data processing apparatus provided in the foregoing embodiment may implement the technical solutions described in the foregoing data processing method embodiments, and the specific implementation principles of the modules or units may refer to the corresponding contents in the foregoing method embodiments, which are not described herein again.
Fig. 10 is a schematic structural diagram of a data processing apparatus according to yet another embodiment of the present application. As shown in fig. 10, the data processing apparatus includes: an acquisition module 41, a vectorization module 42, and an execution module 43. The acquisition module 41 is configured to acquire a first address and a check target set when a redundant address check request sent by a client is received, where the check target set includes at least one candidate comparison address; the vectorization module 42 is configured to vectorize the first address and the at least one candidate comparison address respectively to obtain a first vector of the first address and a second vector of each candidate comparison address; the execution module 43 is configured to perform a warehousing operation or a redundancy operation on the first address according to the first vector and the second vector of each candidate comparison address, and feed back the operation result to the client.
In the technical solution provided by this embodiment, representing an address as a vector captures the semantic similarity between words, or between characters, within the address, so whether address redundancy exists between two addresses can be determined reliably and with high accuracy.
Further, the execution module 43 is further configured to:
perform a join operation on the first vector and the second vector of each candidate comparison address respectively to obtain at least one third vector;
take the at least one third vector respectively as the input of the redundancy discrimination model, and execute the redundancy discrimination model to output the probability that the first address is redundant with each candidate comparison address;
when the probability of redundancy between the first address and every candidate comparison address is less than a threshold, add the first address to the address library and feed back a warehousing response to the client;
and when the probability of redundancy between the first address and any candidate comparison address is greater than the threshold, delete the first address or reject its warehousing request, and feed back a corresponding response to the client (a server-side sketch follows this list).
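One possible server-side wiring of this flow is sketched below with Flask; the endpoint path, payload shape, and the helpers vectorize and try_warehouse (the latter from the earlier sketch) are assumptions layered on top of the patent, not part of it:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/check_redundancy", methods=["POST"])
def check_redundancy():
    """Handle a redundant address check request: vectorize the first
    address and the check target set, run the redundancy check, and
    feed the operation result back to the client."""
    payload = request.get_json()
    first_vec = vectorize(payload["address"])              # hypothetical helper
    library_vecs = {a: vectorize(a) for a in payload["check_targets"]}
    action, match = try_warehouse(first_vec, library_vecs, model)
    return jsonify({"result": action, "redundant_with": match})
```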
Here, it should be noted that: the data processing apparatus provided in the foregoing embodiment may implement the technical solutions described in the foregoing data processing method embodiments, and the specific implementation principles of the modules or units may refer to the corresponding contents in the foregoing method embodiments, which are not described herein again.
Fig. 11 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device includes a memory 61 and a processor 62. The memory 61 may be configured to store various data to support operations on the electronic device, for example instructions for any application or method operating on the electronic device. The memory 61 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or magnetic or optical disks.
The processor 62, coupled to the memory 61, is configured to execute the program stored in the memory 61, so as to:
acquire a first sample, where the first sample includes an address pair and a sample label;
vectorize the two addresses in the address pair respectively to obtain a fourth vector and a fifth vector;
train a learning model based on the fourth vector and the fifth vector to obtain an output result;
and when it is determined, according to the output result and the sample label, that the training convergence condition is not reached, update the parameters of the learning model and acquire a second sample from the training sample set to continue training the learning model.
In the technical solution provided by this embodiment, the two addresses in a sample address pair are vectorized, and the learning model is trained on the two corresponding vectors. In each training iteration, the result output by the learning model and the sample label drive the parameter update, until it is determined from the output result and the sample label that the training convergence condition is reached; at that point training is complete and the learning model becomes the redundancy discrimination model used to calculate the similarity of two addresses. Address redundancy discrimination with this redundancy discrimination model is both accurate and efficient.
When the processor 62 executes the program in the memory 61, in addition to the above functions, other functions may be implemented, and reference may be specifically made to the description of the foregoing embodiments.
Further, as shown in fig. 11, the electronic device further includes: a display 64, a communication component 63, a power component 65, an audio component 66, and the like. Fig. 11 schematically shows only some of the components, which does not mean that the electronic device includes only the components shown in fig. 11.
An embodiment of the present application further provides an electronic device. The structure of the electronic device provided in this embodiment is similar to that of the electronic device in the above embodiment, as shown in fig. 11. The electronic device includes a memory and a processor. The memory may be configured to store various data to support operations on the electronic device. The processor, coupled to the memory, is configured to execute the program stored in the memory to:
acquire a first address and a second address;
vectorize the first address and the second address respectively to obtain a first vector and a second vector;
and perform a redundancy operation on the first address if it is determined, based on the first vector and the second vector, that address redundancy exists between the first address and the second address.
In the technical solution provided by this embodiment, representing an address as a vector captures the semantic similarity between words, or between characters, within the address, so whether address redundancy exists between two addresses can be determined reliably and with high accuracy.
When the processor executes the program in the memory, the processor may implement other functions in addition to those above; for details, reference may be made to the description of the foregoing embodiments.
An embodiment of the present application further provides another electronic device. The structure of the electronic device provided in this embodiment is similar to that of the electronic device in the above embodiment, as shown in fig. 11. The electronic device includes a memory and a processor. The memory may be configured to store various data to support operations on the electronic device. The processor, coupled to the memory, is configured to execute the program stored in the memory to:
acquire, for a first address to be warehoused, at least one candidate comparison address from an address library;
vectorize the first address and the at least one candidate comparison address respectively to obtain a first vector of the first address and a second vector of each candidate comparison address;
and add the first address to the address library if it is determined, based on the first vector and the second vector of each candidate comparison address, that no address redundancy exists between the first address and any candidate comparison address.
In the technical solution provided by this embodiment, representing an address as a vector captures the semantic similarity between words, or between characters, within the address, so whether address redundancy exists between two addresses can be determined reliably and with high accuracy.
When the processor executes the program in the memory, the processor may implement other functions in addition to those above; for details, reference may be made to the description of the foregoing embodiments.
An embodiment of the present application further provides a server-side device. The structure of the server-side device provided in this embodiment is similar to that of the electronic device embodiment, as shown in fig. 11. The server-side device includes a memory, a processor and a communication component. The memory may be configured to store various data to support operations on the server-side device, for example instructions for any application or method operating on the server-side device. The memory may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or magnetic or optical disks.
The communication component is coupled to the processor; and the processor, coupled to the memory, is configured to execute the program stored in the memory to:
acquire a first address and a check target set when the communication component receives a redundant address check request sent by a client, where the check target set includes at least one candidate comparison address;
vectorize the first address and the at least one candidate comparison address respectively to obtain a first vector of the first address and a second vector of each candidate comparison address;
and perform a warehousing operation or a redundancy operation on the first address according to the first vector and the second vector of each candidate comparison address, and feed back the operation result to the client through the communication component.
In the technical solution provided by this embodiment, representing an address as a vector captures the semantic similarity between words, or between characters, within the address, so whether address redundancy exists between two addresses can be determined reliably and with high accuracy.
When the processor executes the program in the memory, the processor may implement other functions in addition to those above; for details, reference may be made to the description of the foregoing embodiments.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program. When executed by a computer, the computer program can implement the steps or functions of the data processing method provided in each of the foregoing embodiments.
The above-described apparatus embodiments are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (11)

1. A method of data processing, comprising:
acquiring a first address and a second address;
vectorizing the first address and the second address respectively to obtain a first vector and a second vector;
performing a join operation on the first vector and the second vector to obtain a third vector;
taking the third vector as an input of a redundancy discrimination model, and executing the redundancy discrimination model to output the probability that the first address and the second address are redundant;
when the probability is greater than a threshold value and the first address and the second address are both data in an address library, deleting the first address from the address library;
when the probability is greater than the threshold value, the first address is data to be warehoused, and the second address is data in the address library, rejecting the warehousing request of the first address;
when the probability is less than the threshold value, acquiring a third address from the address library so as to continue calculating the probability that the first address and the third address are redundant;
and when the address library contains no address whose probability of redundancy with the first address is greater than the threshold value and the first address is data to be warehoused, adding the first address to the address library.
2. The method of claim 1, wherein vectorizing the first address to obtain a first vector comprises:
acquiring a dictionary table;
representing the first address as the first vector based on the dictionary table.
3. The method of claim 2, wherein the dictionary table comprises a word-embedding vocabulary having word embedding vectors corresponding to respective words of a plurality of words; and
representing the first address as the first vector based on the dictionary table, including:
performing word segmentation on the first address to obtain at least one segmented word;
obtaining the word embedding vector of each segmented word based on the word-embedding vocabulary;
and obtaining the first vector according to the word embedding vector of each segmented word.
4. The method of claim 2, wherein the dictionary table comprises a character-embedding vocabulary having character embedding vectors corresponding to respective characters of a plurality of characters; and
representing the first address as the first vector based on the dictionary table, including:
obtaining character embedding vectors of all characters in the first address based on the character-embedding vocabulary;
and obtaining the first vector according to the character embedding vectors of the characters in the first address.
5. The method of claim 2, wherein the dictionary table comprises: a word-embedding vocabulary and a character-embedding vocabulary; the word-embedding vocabulary comprises word embedding vectors corresponding to a plurality of words, and the character-embedding vocabulary comprises character embedding vectors corresponding to a plurality of characters; and
representing the first address as the first vector based on the dictionary table, including:
performing word segmentation on the first address to obtain at least one segmented word;
obtaining the word embedding vector of each segmented word based on the word-embedding vocabulary;
respectively obtaining character embedding vectors of all characters in each segmented word based on the character-embedding vocabulary;
regenerating a word embedding vector for each segmented word according to the word embedding vector of that segmented word and the character embedding vectors of its characters;
and obtaining the first vector according to the word embedding vector regenerated for each segmented word.
6. The method of any of claims 2 to 5, further comprising:
acquiring an address sample set;
and training a Skip-gram model on the address sample set to obtain the dictionary table.
7. The method of claim 1, further comprising:
obtaining a first sample, wherein the first sample comprises an address pair and a sample label;
vectorizing two addresses in the address pair respectively to obtain a fourth vector and a fifth vector;
training a learning model according to the fourth vector and the fifth vector to obtain an output result;
and when it is determined from the output result and the sample label that a training convergence condition is reached, the learning model completes training, yielding the redundancy discrimination model.
8. The method of claim 7, further comprising:
updating parameters in the learning model when it is determined from the output result and the sample label that the training convergence condition is not reached; and obtaining a second sample from the training sample set to continue training the learning model.
9. The method of any one of claims 1 to 5, wherein the second address is one of at least one candidate comparison address;
and, the method further comprises:
performing structured parsing on the first address;
and recalling the at least one candidate comparison address from an address library according to the structured parsing result.
10. A method of data processing, comprising:
acquiring, for a first address to be warehoused, at least one candidate comparison address from an address library;
vectorizing the first address and the at least one candidate comparison address respectively to obtain a first vector of the first address and a second vector of each candidate comparison address;
performing a join operation on the first vector and the second vector of each candidate comparison address respectively to obtain at least one third vector;
taking the at least one third vector respectively as the input of a redundancy discrimination model, and executing the redundancy discrimination model to output the probability that the first address is redundant with each candidate comparison address;
and when the probability of redundancy between the first address and every candidate comparison address is less than a threshold value, adding the first address to the address library.
11. An electronic device comprising a memory and a processor; wherein:
the memory is used for storing programs;
the processor, coupled to the memory, is configured to execute the program stored in the memory to:
acquiring a first address and a second address;
vectorizing the first address and the second address respectively to obtain a first vector and a second vector;
performing a join operation on the first vector and the second vector to obtain a third vector;
taking the third vector as an input of a redundancy discrimination model, and executing the redundancy discrimination model to output the probability that the first address and the second address are redundant;
when the probability is greater than a threshold value and the first address and the second address are both data in an address library, deleting the first address from the address library;
when the probability is greater than the threshold value, the first address is data to be warehoused, and the second address is data in the address library, rejecting the warehousing request of the first address;
when the probability is less than the threshold value, acquiring a third address from the address library so as to continue calculating the probability that the first address and the third address are redundant;
and when the address library contains no address whose probability of redundancy with the first address is greater than the threshold value and the first address is data to be warehoused, adding the first address to the address library.
CN201910087847.0A 2019-01-29 2019-01-29 Data processing method and electronic equipment Active CN111488334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910087847.0A CN111488334B (en) 2019-01-29 2019-01-29 Data processing method and electronic equipment

Publications (2)

Publication Number Publication Date
CN111488334A CN111488334A (en) 2020-08-04
CN111488334B true CN111488334B (en) 2023-04-14

Family

ID=71812224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910087847.0A Active CN111488334B (en) 2019-01-29 2019-01-29 Data processing method and electronic equipment

Country Status (1)

Country Link
CN (1) CN111488334B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150095017A1 (en) * 2013-09-27 2015-04-02 Google Inc. System and method for learning word embeddings using neural language models

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102279843A (en) * 2010-06-13 2011-12-14 北京四维图新科技股份有限公司 Method and device for processing phrase data
CN104050196A (en) * 2013-03-15 2014-09-17 阿里巴巴集团控股有限公司 Point of interest (POI) data redundancy detection method and device
CN105988988A (en) * 2015-02-13 2016-10-05 阿里巴巴集团控股有限公司 Method and device for processing text address
CN108804398A (en) * 2017-05-03 2018-11-13 阿里巴巴集团控股有限公司 The similarity calculating method and device of address text
CN107526967A (en) * 2017-07-05 2017-12-29 阿里巴巴集团控股有限公司 A kind of risk Address Recognition method, apparatus and electronic equipment
CN108876545A (en) * 2018-06-22 2018-11-23 北京小米移动软件有限公司 Order recognition methods, device and readable storage medium storing program for executing
CN108960645A (en) * 2018-07-10 2018-12-07 阿里巴巴集团控股有限公司 A kind of risk prevention system method, system and terminal device
CN109033086A (en) * 2018-08-03 2018-12-18 银联数据服务有限公司 A kind of address resolution, matched method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhou Hongcui; Zhuang Xinyan. Text information representation based on the vector space model. Journal of Hulunbeier College. 2011, (01), pp. 38, 119-124. *

Also Published As

Publication number Publication date
CN111488334A (en) 2020-08-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant