CN113642310A

CN113642310A - Terminal data similarity measurement method

Info

Publication number: CN113642310A
Application number: CN202110798955.6A
Authority: CN
Inventors: 林木兴; 丁明
Original assignee: Guangzhou Xuanwu Wireless Technology Co Ltd
Current assignee: Guangzhou Xuanwu Wireless Technology Co Ltd
Priority date: 2021-07-14
Filing date: 2021-07-14
Publication date: 2021-11-12
Anticipated expiration: 2041-07-14
Also published as: CN113642310B

Abstract

The invention discloses a method for measuring the similarity of terminal data, which comprises the following steps: determining a calculation characteristic according to the terminal data; respectively calculating the distance similarity of the terminal data and the text similarity of the terminal data according to the calculation characteristics; and inputting the distance similarity of the terminal data and the text similarity of the terminal data into a terminal similarity function for weighted summation to obtain a terminal similarity measure, and if the terminal similarity measure is greater than a preset hyper-parameter threshold, judging that the two terminal data are similar, otherwise, judging that the two terminal data are not similar. The method and the device identify the repeated terminal data through the weighted sum of the distance similarity and the text similarity, and improve the quality of the acquired terminal data.

Description

Terminal data similarity measurement method

Technical Field

The invention relates to the technical field of data processing, in particular to a terminal data similarity measurement method.

Background

In the era of new retail, marketing operations in the fast-moving consumer goods (FMCG) industry are increasingly digitized. The industry's demand for terminal management centers on how to use artificial intelligence to manage the business objects involved in the sales process: personnel, terminals, products and channels. The retail terminal is the main carrier of the "people, goods and place" of an FMCG retail enterprise; it is the enterprise's touchpoint closest to the consumer and its main entry point for acquiring consumer-side data. Management of retail terminals is therefore an especially important link for FMCG retail enterprises.

Generally, an FMCG retail enterprise acquires terminal information mainly through salesmen who visit terminal stores and enter them into a system, so the recorded terminal store information depends heavily on the salesmen. Because salesman turnover is high and salesmen have performance targets for developing terminal stores, different salesmen may submit the same terminal, and a salesman may also submit duplicate terminals to falsify business results. As a result, the terminal data in the enterprise system contains a large amount of duplicated, redundant and false data.

Disclosure of Invention

The invention aims to provide a method for measuring the similarity of terminal data so as to solve the problem of low efficiency of acquiring the terminal data.

In order to achieve the above object, the present invention provides a method for measuring terminal data similarity, including:

determining a calculation characteristic according to the terminal data;

respectively calculating the distance similarity of the terminal data and the text similarity of the terminal data according to the calculation characteristics;

and inputting the distance similarity of the terminal data and the text similarity of the terminal data into a terminal similarity function for weighted summation to obtain a terminal similarity measure, and if the terminal similarity measure is greater than a preset hyper-parameter threshold, judging that the two terminal data are similar, otherwise, judging that the two terminal data are not similar.

Preferably, the calculation characteristics include:

features used to calculate the distance similarity of the terminal data, including longitude and latitude;

and features used to calculate the text similarity of the terminal data, including terminal name, address, type and contact.

Preferably, the calculating the distance similarity of the terminal data includes:

coding the longitude and latitude characteristics by adopting a Geohash algorithm to obtain a Geohash code;

searching the Geohash codes through an index to determine the terminal data set S = (S₀, S₁, ..., S_n) formed by the current terminal S_i and its adjacent terminals, where n is the number of terminals;

calculating the distance between each terminal in the terminal data set S = (S₀, S₁, ..., S_n) and the current terminal S_i to obtain a distance set;

and inputting the distance set into a preset distance similarity function to obtain the distance similarity of the terminal data.

Preferably, calculating the text similarity of the terminal data includes:

using Jieba word segmentation to obtain the word segmentation result of each text feature of the current terminal S_i and the word segmentation result of each text feature of each terminal in the terminal data set S = (S₀, S₁, ..., S_n);

and calculating the similarity of the word segmentation results by adopting the Levenshtein Distance algorithm to obtain the text similarity of the terminal data.

Preferably, the terminal similarity function similarity is as follows: similarity = α₁f(d(l₁,l₂)) + α₂fuzzy(n₁,n₂) + α₃fuzzy(a₁,a₂) + α₄fuzzy(t₁,t₂) + α₅fuzzy(p₁,p₂);

where α₁ + α₂ + α₃ + α₄ + α₅ = 1 and α₁ to α₅ represent the weights of the different features, the function fuzzy() represents the text similarity function calculated by the edit distance algorithm, f represents the preset distance similarity function, (l₁,l₂) represents the longitude and latitude of the two terminals to be compared, (n₁,n₂) represents the terminal names of the two terminals to be compared, (a₁,a₂) represents the addresses of the two terminals to be compared, (t₁,t₂) represents the types of the two terminals to be compared, and (p₁,p₂) represents the contacts of the two terminals to be compared.

Preferably, the preset hyper-parameter threshold is 0.7.

The invention also provides a computer terminal device comprising one or more processors and a memory. A memory coupled to the processor for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the method for measuring similarity of terminal data according to any of the embodiments described above.

The present invention further provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for measuring similarity of terminal data according to any of the embodiments described above.

The invention determines the calculation characteristics according to the terminal data; respectively calculating the distance similarity of the terminal data and the text similarity of the terminal data according to the calculation characteristics; and inputting the distance similarity of the terminal data and the text similarity of the terminal data into a terminal similarity function for weighted summation to obtain a terminal similarity measure, and if the terminal similarity measure is greater than a preset hyper-parameter threshold, judging that the two terminal data are similar, otherwise, judging that the two terminal data are not similar. The method and the device identify the repeated terminal data through the weighted sum of the distance similarity and the text similarity, and improve the quality of the acquired terminal data.

Drawings

In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic flowchart of a method for measuring similarity of terminal data according to an embodiment of the present invention;

FIG. 2 is a flowchart of an edit distance calculation according to another embodiment of the present invention;

FIG. 3 is a schematic diagram of an edit distance algorithm provided by yet another embodiment of the present invention;

FIG. 4 is an overall flow chart provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the present invention will be described clearly and completely with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be understood that the step numbers used herein are for convenience of description only and are not used as limitations on the order in which the steps are performed.

It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.

Referring to FIG. 1, the present invention provides a method for measuring terminal data similarity, including:

S101, determining calculation characteristics according to the terminal data.

Specifically, terminal data from different sectors of the FMCG industry are collected and the terminal data features are extracted. The features used for calculating distance similarity (longitude and latitude) and the features used for calculating text similarity (terminal name, address, type and contact) are extracted separately, and data with missing features are cleaned.
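The feature extraction and cleaning in step S101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the field names (`latitude`, `longitude`, `name`, `address`, `type`, `contact`) and the function `extract_features` are hypothetical, and "cleaning" is assumed here to mean discarding any record in which a required feature is missing or blank.

```python
from typing import Optional

# Hypothetical field names; the source only lists the feature kinds.
DISTANCE_FIELDS = ("latitude", "longitude")
TEXT_FIELDS = ("name", "address", "type", "contact")


def extract_features(record: dict) -> Optional[dict]:
    """Split a raw terminal record into distance and text features.

    Returns None when any required feature is missing or blank
    (assumed interpretation of the cleaning step).
    """
    values = {k: record.get(k) for k in DISTANCE_FIELDS + TEXT_FIELDS}
    for value in values.values():
        if value is None:
            return None
        if isinstance(value, str) and not value.strip():
            return None
    return {
        # Features for distance similarity.
        "location": (float(values["latitude"]), float(values["longitude"])),
        # Features for text similarity, with surrounding whitespace removed.
        "text": {k: str(values[k]).strip() for k in TEXT_FIELDS},
    }
```

Records that return `None` would be left out of the later similarity steps.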

S102, respectively calculating the distance similarity of the terminal data and the text similarity of the terminal data according to the calculation characteristics.

The longitude and latitude features are encoded with the Geohash algorithm to obtain Geohash codes; the Geohash codes are searched through an index to determine the terminal data set S = (S₀, S₁, ..., S_n) formed by the current terminal S_i and its adjacent terminals, where n is the number of terminals; the distance between each terminal in S and the current terminal S_i is calculated to obtain a distance set; and the distance set is input into a preset distance similarity function to obtain the distance similarity of the terminal data.

Specifically, the two-dimensional longitude and latitude of each terminal are encoded by the Geohash algorithm into a character string of length 12, and an index is built on the codes. Geohash grid coding assigns the terminals to grid cells covering different distance ranges. The index is searched for terminals whose Geohash codes fall in the same Geohash cell as the current terminal S_i or in adjacent cells, giving the terminal data set S = (S₀, S₁, ..., S_n), where n is the number of terminals that fall in the same or adjacent Geohash cells as S_i. The distance between each terminal in S and S_i is then calculated, giving the distance set D = (D₀, D₁, ..., D_n) of the terminals in S relative to S_i. A distance similarity function f is defined: f is a decreasing function of the terminal distance D_i with domain (0, k), where k is the maximum search distance, and range (0, 1). The function f converts the distance set D into distance similarities, so that the smaller the distance between two terminals, the more similar they are.
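The Geohash encoding and the same-or-adjacent-cell lookup described above can be sketched as follows. The encoder follows the standard Geohash scheme (base-32 alphabet, interleaved longitude and latitude bits) with the length-12 default from the text. Everything else is an assumption made for illustration: the helper names, the in-memory dictionary used as the index, and the choice of indexing on a 6-character cell rather than the full 12-character code.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard Geohash alphabet


def geohash_encode(lat: float, lon: float, precision: int = 12) -> str:
    """Encode latitude/longitude as a Geohash string of the given length."""
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    chars, bits, bit_count, use_lon = [], 0, 0, True  # bits start with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        bits = (bits << 1) | int(bit)
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def cell_and_neighbors(lat: float, lon: float, precision: int = 6) -> set:
    """Geohash of the cell containing the point plus its 8 adjacent cells."""
    total_bits = 5 * precision
    width = 360.0 / (1 << ((total_bits + 1) // 2))   # cell width in degrees
    height = 180.0 / (1 << (total_bits // 2))        # cell height in degrees
    cells = set()
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            nlat = min(max(lat + dy * height, -90.0), 90.0)
            nlon = (lon + dx * width + 180.0) % 360.0 - 180.0
            cells.add(geohash_encode(nlat, nlon, precision))
    return cells


def build_index(terminals: dict, precision: int = 6) -> dict:
    """Map each Geohash cell to the ids of the terminals inside it."""
    index = defaultdict(list)
    for tid, (lat, lon) in terminals.items():
        index[geohash_encode(lat, lon, precision)].append(tid)
    return index


def candidates(index: dict, lat: float, lon: float, precision: int = 6) -> list:
    """Ids of terminals in the same or adjacent cells as the given point."""
    found = []
    for cell in cell_and_neighbors(lat, lon, precision):
        found.extend(index.get(cell, []))
    return sorted(found)
```

Because a Geohash prefix identifies the enclosing coarser cell, the 6-character cell used here equals the first six characters of the 12-character code. The cell size should be chosen with the maximum search distance k in mind.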

Referring to FIG. 2 and FIG. 3, the edit distance calculation first compares the lengths of the two character strings, taking the longer string (length n) as the row and the shorter string (length m) as the column, which ensures that n is greater than or equal to m. The LD matrix LD[m+1, n+1] is initialized and line is set to 1. The LD matrix elements of row line are then calculated. If LD[line, n-m-line] > (1-Sim)·n, the two strings are judged not similar. Otherwise, if LD[line, n-m-line] + m - line ≤ (1-Sim)·n, the two strings are judged similar. Otherwise, if line < m, line is incremented and the calculation is repeated for the next row.
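The idea of stopping the edit distance calculation early against the bound (1-Sim)·n can be sketched as follows. This is a simplified variant, not the exact procedure of FIG. 2: instead of inspecting one specific matrix element per row, it abandons the calculation when the smallest value in the current row already exceeds the bound, which is safe because row minima of the Levenshtein matrix never decrease. The function name and the default value of `sim` are assumptions.

```python
def similar_by_edit_distance(a: str, b: str, sim: float = 0.7) -> bool:
    """Return True if the edit distance of a and b is at most (1 - sim) * n,
    where n is the length of the longer string."""
    # Make a the longer string so that n >= m, as in the text.
    if len(a) < len(b):
        a, b = b, a
    n, m = len(a), len(b)
    if n == 0:
        return True  # two empty strings are identical
    max_dist = (1 - sim) * n
    # The length difference is a lower bound on the edit distance.
    if n - m > max_dist:
        return False
    prev = list(range(n + 1))  # row 0 of the LD matrix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if b[i - 1] == a[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution or match
        # Early exit: the final distance cannot be below the row minimum.
        if min(cur) > max_dist:
            return False
        prev = cur
    return prev[n] <= max_dist
```

For example, "kitten" and "sitting" have edit distance 3, so they pass with `sim=0.5` (bound 3.5) but not with `sim=0.7` (bound 2.1).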

The distance similarity function is defined as:

f(x) = 8×10⁻⁷·x² − 1.8×10⁻³·x + 1;

where f is a decreasing function of the terminal distance x (in meters) with domain (0, 1000) and range (0, 1), and the maximum search distance k is 1000 m.
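As a quick check of the function above, f(0) = 1, f(500) = 0.2 − 0.9 + 1 = 0.3 and f(1000) = 0.8 − 1.8 + 1 = 0. A minimal sketch of the distance step follows. The source does not say how the distance between two coordinates is computed, so the haversine formula and the Earth radius constant are assumptions, as is returning 0 for distances at or beyond k.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed constant)
K_MAX_M = 1000.0              # maximum search distance k from the text


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def distance_similarity(x: float) -> float:
    """f(x) = 8e-7 * x^2 - 1.8e-3 * x + 1 for x in meters, below k."""
    if x >= K_MAX_M:
        return 0.0  # outside the search range (assumed behavior)
    return 8e-7 * x ** 2 - 1.8e-3 * x + 1.0
```

A call such as `distance_similarity(haversine_m(lat1, lon1, lat2, lon2))` then plays the role of f(d(l₁,l₂)).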

The word segmentation result of each text feature of the current terminal S_i and of each terminal in the terminal data set S = (S₀, S₁, ..., S_n) is obtained by Jieba word segmentation, and the similarity of the word segmentation results is calculated with the Levenshtein Distance algorithm to obtain the text similarity of the terminal data.

Specifically, the extracted text feature data, including terminal name, address, type and contact, are segmented with Jieba word segmentation and keywords are filtered out. The keyword library in this embodiment is: {convenience store, convenience, supermarket, department store, shop, market, business, shop, grocery store}. The similarity of each feature after word segmentation is then calculated by the edit distance algorithm.

The extracted text features are segmented with Jieba word segmentation, and the keywords in the segmentation results are filtered out using a custom-built keyword library, giving the word segmentation result of each text feature of the terminal S_i and of each terminal in the terminal set S = (S₀, S₁, ..., S_n). The Levenshtein Distance (edit distance) algorithm is then used to calculate the text similarity between the word segmentation result of each text feature of S_i and that of the corresponding text feature of each terminal in S.
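The keyword filtering and edit-distance similarity can be sketched as follows. To stay self-contained, this sketch replaces Jieba with plain whitespace splitting on English text, so it is only a stand-in for the segmentation step. The keyword set is a hypothetical single-word reduction of the library listed above, and normalizing the edit distance by the longer string length is an assumption chosen to match the (1-Sim)·n bound.

```python
# Hypothetical single-word keyword library (stand-in for the custom library).
GENERIC_KEYWORDS = frozenset(
    {"convenience", "store", "supermarket", "department", "shop",
     "market", "business", "grocery"}
)


def levenshtein(a: str, b: str) -> int:
    """Classic row-by-row Levenshtein (edit) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def strip_keywords(text: str, keywords=GENERIC_KEYWORDS) -> str:
    """Tokenize (whitespace stand-in for Jieba) and drop generic keywords."""
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in keywords)


def fuzzy(a: str, b: str) -> float:
    """Text similarity in [0, 1]: 1 - edit distance / longer length."""
    a, b = strip_keywords(a), strip_keywords(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # nothing left to compare; treated as no evidence (assumption)
    return 1.0 - levenshtein(a, b) / longest
```

Filtering generic words first keeps names such as "Lucky Supermarket" and "Lucky Shop" from being scored as different merely because of the store-type suffix.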

S103, inputting the distance similarity of the terminal data and the text similarity of the terminal data into a terminal similarity function for weighted summation to obtain a terminal similarity measure, if the terminal similarity measure is larger than a preset hyper-parameter threshold, judging that the two terminal data are similar, otherwise, judging that the two terminal data are not similar.

Referring to FIG. 4, the distance similarity and the text similarity are calculated separately from the features determined from the database, and the terminal similarity and the information of similar terminals are finally obtained.

The terminal similarity function similarity is as follows: similarity = α₁f(d(l₁,l₂)) + α₂fuzzy(n₁,n₂) + α₃fuzzy(a₁,a₂) + α₄fuzzy(t₁,t₂) + α₅fuzzy(p₁,p₂);

where α₁ + α₂ + α₃ + α₄ + α₅ = 1 and α₁ to α₅ represent the weights of the different features, the function fuzzy() represents the text similarity function calculated by the edit distance algorithm, f represents the preset distance similarity function, (l₁,l₂) represents the longitude and latitude of the two terminals to be compared, (n₁,n₂) represents the terminal names of the two terminals to be compared, (a₁,a₂) represents the addresses of the two terminals to be compared, (t₁,t₂) represents the types of the two terminals to be compared, and (p₁,p₂) represents the contacts of the two terminals to be compared.

The calculated distance similarity and text similarities are weighted and summed, and the result is the terminal similarity measure. A hyper-parameter threshold q is set, and two terminals are judged similar when the terminal similarity is greater than q. The preset hyper-parameter threshold is 0.7.

Terminal characteristics differ greatly between data sets, so the properties of each data feature need attention during feature extraction. Setting different hyper-parameters (k, α) for different terminal data sets and feature properties gives a better model effect.
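The weighted summation and threshold decision can be sketched as follows. The five per-feature scores are assumed to have been computed already by f and fuzzy(). The weight values are hypothetical, since the source only requires that they sum to 1; the threshold 0.7 is the preset value from the text. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureScores:
    """Per-feature similarities for one pair of terminals, each in [0, 1]."""
    distance: float  # f(d(l1, l2))
    name: float      # fuzzy(n1, n2)
    address: float   # fuzzy(a1, a2)
    type: float      # fuzzy(t1, t2)
    contact: float   # fuzzy(p1, p2)


WEIGHTS = (0.3, 0.3, 0.2, 0.1, 0.1)  # hypothetical alpha_1..alpha_5, summing to 1
THRESHOLD = 0.7                      # preset hyper-parameter threshold q


def terminal_similarity(scores: FeatureScores, weights=WEIGHTS) -> float:
    """Weighted sum of the distance similarity and the text similarities."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    parts = (scores.distance, scores.name, scores.address,
             scores.type, scores.contact)
    return sum(w * p for w, p in zip(weights, parts))


def is_similar(scores: FeatureScores, weights=WEIGHTS,
               threshold: float = THRESHOLD) -> bool:
    """Two terminals are judged similar when the measure exceeds the threshold."""
    return terminal_similarity(scores, weights) > threshold
```

With these example weights, a pair that matches on every feature scores about 1.0 and is flagged, while a pair scoring (0.2, 0.5, 0.5, 1.0, 0.0) reaches only about 0.41 and is not.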

In FMCG terminal management, terminal data suffers from redundant duplication and similar problems. Manual checking is costly and yields little benefit, and enterprises have difficulty finding a suitable measurement method to deduplicate redundant terminal data in the database quickly, effectively and automatically. The algorithm model is fully automatic and needs no extra data processing or model training time. Deduplication of a single record is fast, with an average calculation time within 200 ms per record, and the model can be deployed flexibly, in real time or asynchronously. Compared with existing terminal deduplication methods, the terminal deduplication calculation based on this terminal similarity measurement method is efficient: million-level data can be processed within one day, and the precision can reach 90%. The terminal similarity measurement method effectively and accurately identifies duplicate terminals in the FMCG retail terminal database and guarantees the data quality of the terminals.

The invention provides a computer terminal device comprising one or more processors and a memory. The memory is coupled to the processor and configured to store one or more programs, which when executed by the one or more processors, cause the one or more processors to implement the terminal data similarity metric method as in any of the above embodiments.

The processor is used for controlling the overall operation of the computer terminal equipment so as to complete all or part of the steps of the terminal data similarity measurement method. The memory is used to store various types of data to support the operation at the computer terminal device, which data may include, for example, instructions for any application or method operating on the computer terminal device, as well as application-related data. The Memory may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk, or optical disk.

In an exemplary embodiment, the computer terminal device may be implemented by one or more Application Specific Integrated Circuits (ASICs), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor or other electronic components, and is configured to perform the above-mentioned terminal data similarity measurement method and achieve technical effects consistent with the above-mentioned method.

In another exemplary embodiment, a computer readable storage medium including program instructions, which when executed by a processor, implement the steps of the terminal data similarity measure method in any one of the above embodiments, is also provided. For example, the computer readable storage medium may be the above-mentioned memory including program instructions, which are executable by a processor of a computer terminal device to perform the above-mentioned terminal data similarity measure method, and achieve the technical effects consistent with the above-mentioned method.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. A method for measuring terminal data similarity is characterized by comprising the following steps:

determining a calculation characteristic according to the terminal data;

2. The method according to claim 1, wherein the calculating the characteristic comprises:

3. The method according to claim 2, wherein the calculating the distance similarity of the terminal data comprises:

4. The method for measuring the similarity of terminal data according to claim 3, wherein the text similarity of the terminal data comprises:

5. The method according to claim 4, wherein the terminal similarity function similarity is as follows:

similarity = α₁f(d(l₁,l₂)) + α₂fuzzy(n₁,n₂) + α₃fuzzy(a₁,a₂) + α₄fuzzy(t₁,t₂) + α₅fuzzy(p₁,p₂);

where α₁ + α₂ + α₃ + α₄ + α₅ = 1 and α₁ to α₅ represent the weights of the different features, the function fuzzy() represents the text similarity function calculated by the edit distance algorithm, f represents the preset distance similarity function, (l₁,l₂) represents said longitude and latitude of the two terminals to be compared, (n₁,n₂) represents said terminal names of the two terminals to be compared, (a₁,a₂) represents said addresses of the two terminals to be compared, (t₁,t₂) represents said types of the two terminals to be compared, and (p₁,p₂) represents said contacts of the two terminals to be compared.

6. The method according to claim 5, further comprising setting the pre-set hyper-parameter threshold to 0.7.

7. A computer terminal device, comprising:

one or more processors;

a memory coupled to the processor for storing one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the terminal data similarity metric method of any of claims 1-6.

8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the terminal data similarity measure method according to any one of claims 1 to 6.