CN113642310A - Terminal data similarity measurement method - Google Patents

Info

Publication number
CN113642310A
Authority
CN
China
Prior art keywords
similarity
terminal
terminal data
distance
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110798955.6A
Other languages
Chinese (zh)
Other versions
CN113642310B (en)
Inventor
林木兴
丁明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xuanwu Wireless Technology Co Ltd
Original Assignee
Guangzhou Xuanwu Wireless Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Xuanwu Wireless Technology Co Ltd filed Critical Guangzhou Xuanwu Wireless Technology Co Ltd
Priority to CN202110798955.6A priority Critical patent/CN113642310B/en
Publication of CN113642310A publication Critical patent/CN113642310A/en
Application granted granted Critical
Publication of CN113642310B publication Critical patent/CN113642310B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/387 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/018 Certifying business or products
    • G06Q30/0185 Product, service or business identity fraud

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Evolutionary Biology (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for measuring the similarity of terminal data, which comprises the following steps: determining a calculation characteristic according to the terminal data; respectively calculating the distance similarity of the terminal data and the text similarity of the terminal data according to the calculation characteristics; and inputting the distance similarity of the terminal data and the text similarity of the terminal data into a terminal similarity function for weighted summation to obtain a terminal similarity measure, and if the terminal similarity measure is greater than a preset hyper-parameter threshold, judging that the two terminal data are similar, otherwise, judging that the two terminal data are not similar. The method and the device identify the repeated terminal data through the weighted sum of the distance similarity and the text similarity, and improve the quality of the acquired terminal data.

Description

Terminal data similarity measurement method
Technical Field
The invention relates to the technical field of data processing, in particular to a terminal data similarity measurement method.
Background
In the new retail era, marketing operations in the fast-moving consumer goods (FMCG) industry are increasingly digitized. The industry's terminal-management needs centre on using artificial intelligence to manage the business objects involved in the sales process: personnel, terminals, products, and channels. The retail terminal is where the people, goods, and venue of an FMCG retail enterprise come together. It is the enterprise's closest touchpoint to consumers and its main entry point for acquiring consumer-side data, so managing retail terminals is a particularly important task for FMCG retail enterprises.
Typically, an FMCG retail enterprise acquires terminal information mainly through salespeople, who visit stores and enter the data into the system. The terminal store information entered this way depends heavily on the salesperson. Because salesperson turnover is high and salespeople have performance targets for developing terminal stores, different salespeople may submit the same terminal, and a salesperson may also submit duplicate terminals to falsify business results. As a result, the terminal data in the enterprise system contains a large amount of duplicated, redundant, and false data.
Disclosure of Invention
The invention aims to provide a method for measuring the similarity of terminal data so as to solve the problem of low efficiency of acquiring the terminal data.
In order to achieve the above object, the present invention provides a method for measuring terminal data similarity, including:
determining a calculation characteristic according to the terminal data;
respectively calculating the distance similarity of the terminal data and the text similarity of the terminal data according to the calculation characteristics;
and inputting the distance similarity of the terminal data and the text similarity of the terminal data into a terminal similarity function for weighted summation to obtain a terminal similarity measure, and if the terminal similarity measure is greater than a preset hyper-parameter threshold, judging that the two terminal data are similar, otherwise, judging that the two terminal data are not similar.
Preferably, the computing features include:
features for calculating the distance similarity of the terminal data, including longitude and latitude;
and features for calculating the text similarity of the terminal data, including terminal name, address, type and contact.
Preferably, the calculating the distance similarity of the terminal data includes:
coding the longitude and latitude characteristics by adopting a Geohash algorithm to obtain a Geohash code;
determining, by searching the Geohash codes through an index, the terminal data set $S = (S_0, S_1, \ldots, S_n)$ formed by the terminals adjacent to the current terminal $S_i$, wherein n is the number of terminals;
calculating the distance between each terminal in the terminal data set $S = (S_0, S_1, \ldots, S_n)$ and the current terminal $S_i$ to obtain a distance set;
and inputting the distance set into a preset distance similarity function to obtain the distance similarity of the terminal data.
Preferably, calculating the text similarity of the terminal data includes:
using Jieba word segmentation to obtain the word segmentation result of each text feature of the current terminal $S_i$ and the word segmentation result of each text feature of each terminal in the terminal data set $S = (S_0, S_1, \ldots, S_n)$;
and calculating the similarity of the word segmentation results by adopting a Levenshtein Distance algorithm to obtain the text similarity of the terminal data.
Preferably, the terminal similarity function is as follows:
$$\mathrm{similarity} = \alpha_1 f(d(l_1, l_2)) + \alpha_2\,\mathrm{fuzzy}(n_1, n_2) + \alpha_3\,\mathrm{fuzzy}(a_1, a_2) + \alpha_4\,\mathrm{fuzzy}(t_1, t_2) + \alpha_5\,\mathrm{fuzzy}(p_1, p_2)$$
where $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + \alpha_5 = 1$ and the $\alpha$ values are the weights of the different features; the function fuzzy() is the text similarity function calculated by the edit distance algorithm; f is the preset distance similarity function; $(l_1, l_2)$ are the longitudes and latitudes of the two terminals to be compared; $(n_1, n_2)$ are their terminal names; $(a_1, a_2)$ are their addresses; $(t_1, t_2)$ are their types; and $(p_1, p_2)$ are their contacts.
Preferably, the preset hyper-parameter threshold is 0.7.
The invention also provides a computer terminal device comprising one or more processors and a memory. The memory is coupled to the processor and stores one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the method for measuring similarity of terminal data according to any of the embodiments described above.
The present invention further provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for measuring similarity of terminal data according to any of the embodiments described above.
The invention determines the calculation characteristics according to the terminal data; respectively calculating the distance similarity of the terminal data and the text similarity of the terminal data according to the calculation characteristics; and inputting the distance similarity of the terminal data and the text similarity of the terminal data into a terminal similarity function for weighted summation to obtain a terminal similarity measure, and if the terminal similarity measure is greater than a preset hyper-parameter threshold, judging that the two terminal data are similar, otherwise, judging that the two terminal data are not similar. The method and the device identify the repeated terminal data through the weighted sum of the distance similarity and the text similarity, and improve the quality of the acquired terminal data.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a method for measuring similarity of terminal data according to an embodiment of the present invention;
Fig. 2 is a flowchart of an edit distance calculation according to another embodiment of the present invention;
Fig. 3 is a schematic diagram of an edit distance algorithm provided by yet another embodiment of the present invention;
Fig. 4 is an overall flow chart provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the present invention will be described clearly and completely with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the step numbers used herein are for convenience of description only and are not used as limitations on the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, the present invention provides a method for measuring terminal data similarity, including:
S101: Determine the calculation features from the terminal data.
Specifically, terminal data from different sectors of the FMCG industry is collected and its features are extracted: the features used to calculate distance similarity (longitude and latitude) and the features used to calculate text similarity (terminal name, address, type, and contact). Records with missing feature values are cleaned.
S102: Calculate the distance similarity of the terminal data and the text similarity of the terminal data from the calculation features.
The longitude and latitude features are encoded with the Geohash algorithm to obtain Geohash codes. The Geohash codes are searched through an index to determine the terminal data set $S = (S_0, S_1, \ldots, S_n)$ formed by the terminals adjacent to the current terminal $S_i$, where n is the number of terminals. The distance between each terminal in $S$ and the current terminal $S_i$ is calculated to obtain a distance set, and the distance set is input into a preset distance similarity function to obtain the distance similarity of the terminal data.
Specifically, the two-dimensional longitude and latitude of each terminal are encoded by the Geohash algorithm into a character string of length 12, and an index is built on the codes. Geohash grid coding assigns the terminals to grid cells of different distance ranges. The index is searched for the terminals whose Geohash codes fall in the same Geohash cell as the current terminal $S_i$ or in an adjacent Geohash cell, which gives the terminal data set $S = (S_0, S_1, \ldots, S_n)$, where n is the number of terminals in the same or adjacent Geohash cells as $S_i$. The distance from each terminal in $S$ to $S_i$ is then calculated, which gives the distance set $D = (D_0, D_1, \ldots, D_n)$. A distance similarity function f is defined as a function of the terminal distance $D_i$ with domain $(0, k)$, where k is the maximum search distance, and range $(0, 1)$. It converts the distance set D into distance similarities: the smaller the distance, the more similar the two terminals.
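The encoding, indexing, and distance steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`geohash_encode`, `haversine_m`, `build_index`), the record fields `lat` and `lon`, the 6-character index prefix, and the use of the haversine great-circle formula for the terminal distance are all assumptions, because the text does not say how the distance is computed or how the index is keyed. Looking up the adjacent Geohash cells is omitted for brevity.

```python
import math
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard Geohash alphabet


def geohash_encode(lat: float, lon: float, precision: int = 12) -> str:
    """Encode a latitude/longitude pair as a Geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres on a sphere of radius 6371 km (assumed metric)."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def build_index(terminals, prefix_len: int = 6):
    """Group terminal records by Geohash prefix (hypothetical in-memory index)."""
    index = defaultdict(list)
    for t in terminals:
        index[geohash_encode(t["lat"], t["lon"])[:prefix_len]].append(t)
    return index
```

A shorter prefix corresponds to a larger cell, so the prefix length plays the role of the search radius; a production system would also query the eight neighbouring cells so that terminals just across a cell boundary are not missed.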
Referring to fig. 2 and fig. 3, the edit distance is calculated as follows. The lengths of the two strings are compared; the longer string, of length n, is taken as the row and the shorter string, of length m, as the column, which ensures that n is greater than or equal to m. The LD matrix LD[m+1, n+1] is initialized and line is set to 1. The LD matrix elements of row line are calculated. If LD[line, n-m-line] > (1-Sim)·n, where Sim denotes the similarity threshold, the two strings are judged not similar. Otherwise, if LD[line, n-m-line] + m - line ≤ (1-Sim)·n, the two strings are judged similar. Otherwise, while line < m, line is incremented and the calculation is repeated for the next row.
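For reference, a plain Levenshtein distance and a thresholded similarity check can be sketched as below. This sketch computes the full distance and does not reproduce the row-by-row early termination of fig. 2 and fig. 3. The names `levenshtein`, `fuzzy`, and `is_similar`, the normalisation `1 - LD / n`, and the default threshold of 0.8 are assumptions chosen to agree with the bound (1-Sim)·n used in the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the longer string as the outer loop, so n >= m
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus the edit distance over the longer length (assumed)."""
    n = max(len(a), len(b))
    if n == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / n


def is_similar(a: str, b: str, sim: float = 0.8) -> bool:
    """True when the edit distance does not exceed (1 - sim) * n."""
    return levenshtein(a, b) <= (1 - sim) * max(len(a), len(b))
```

Only two rows of the matrix are kept in memory at a time, which is enough when the final distance is all that is needed.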
The distance similarity function is defined as:
$$f(x) = 8 \times 10^{-7}\,x^{2} - 1.8 \times 10^{-3}\,x + 1$$
where f is a decreasing function of the terminal distance x with domain (0, 1000), the maximum search distance k is 1000 m, and the range is (0, 1).
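The quadratic above can be checked numerically with a short sketch. The function name `distance_similarity` and the clamping of inputs outside (0, 1000) are assumptions; the text only defines f inside that interval.

```python
K_MAX_M = 1000.0  # maximum search distance k from the text, in metres


def distance_similarity(x: float) -> float:
    """f(x) = 8e-7 * x^2 - 1.8e-3 * x + 1 on (0, 1000), clamped outside (assumed)."""
    if x >= K_MAX_M:
        return 0.0  # beyond the search radius: treated as no spatial similarity
    x = max(x, 0.0)  # negative distances are treated as zero
    return 8e-7 * x * x - 1.8e-3 * x + 1.0


if __name__ == "__main__":
    for metres in (0, 100, 500, 900):
        print(metres, round(distance_similarity(metres), 3))
```

The endpoints work out as f(0) = 1 and f(1000) = 0.8 - 1.8 + 1 = 0, and the derivative 1.6e-6·x - 1.8e-3 is negative on the whole interval, which is consistent with the stated decreasing behaviour and range.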
Jieba word segmentation is used to obtain the word segmentation result of each text feature of the current terminal $S_i$ and of each terminal in the terminal data set $S = (S_0, S_1, \ldots, S_n)$, and the similarity of the word segmentation results is calculated with the Levenshtein Distance algorithm to obtain the text similarity of the terminal data.
Specifically, the extracted text features (terminal name, address, type, and contact) are segmented with Jieba word segmentation and keywords are filtered out. The keyword library in this embodiment is: {convenience store, convenience, supermarket, department store, shop, market, business, shop, grocery store}. The similarity of each segmented feature is then calculated with the edit distance algorithm.
That is, the extracted text features are segmented, and the keywords in the segmentation results are filtered out using the custom keyword library, which gives the word segmentation result of each text feature of the terminal $S_i$ and of each terminal in the terminal set $S = (S_0, S_1, \ldots, S_n)$. The Levenshtein Distance (edit distance) algorithm is then used to calculate the text similarity between the segmentation result of each text feature of $S_i$ and that of the corresponding text feature of each terminal in $S$.
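The keyword-filtering step can be illustrated with a small sketch. It assumes the tokens have already been produced by a word segmenter; the token lists, the English keyword spellings, and the names `GENERIC_KEYWORDS`, `strip_generic`, and `normalized_text` are hypothetical and only stand in for the custom keyword library described above.

```python
# English renderings of the keyword library listed in this embodiment (assumed spellings).
GENERIC_KEYWORDS = {
    "convenience store", "convenience", "supermarket", "department store",
    "shop", "market", "business", "grocery store",
}


def strip_generic(tokens):
    """Drop tokens that appear in the keyword library, keeping the distinctive ones."""
    return [t for t in tokens if t.lower() not in GENERIC_KEYWORDS]


def normalized_text(tokens) -> str:
    """Join the remaining tokens into the string that is later compared by edit distance."""
    return " ".join(strip_generic(tokens))


if __name__ == "__main__":
    # Hand-written token lists standing in for segmenter output.
    print(normalized_text(["Sunshine", "supermarket"]))  # Sunshine
    print(normalized_text(["Sunshine", "shop"]))         # Sunshine
```

Removing generic words such as "supermarket" before the comparison keeps two records for the same store from looking different only because one salesperson wrote "shop" and another wrote "supermarket".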
S103: The distance similarity of the terminal data and the text similarity of the terminal data are input into a terminal similarity function for weighted summation to obtain a terminal similarity measure. If the terminal similarity measure is greater than a preset hyper-parameter threshold, the two terminal data records are judged similar; otherwise, they are judged not similar.
Referring to fig. 4, the distance similarity and the text similarity are respectively calculated according to the features determined by the database, and the terminal similarity and the similar terminal information are finally obtained.
The terminal similarity function is as follows:
$$\mathrm{similarity} = \alpha_1 f(d(l_1, l_2)) + \alpha_2\,\mathrm{fuzzy}(n_1, n_2) + \alpha_3\,\mathrm{fuzzy}(a_1, a_2) + \alpha_4\,\mathrm{fuzzy}(t_1, t_2) + \alpha_5\,\mathrm{fuzzy}(p_1, p_2)$$
where $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + \alpha_5 = 1$ and the $\alpha$ values are the weights of the different features; the function fuzzy() is the text similarity function calculated by the edit distance algorithm; f is the preset distance similarity function; $(l_1, l_2)$ are the longitudes and latitudes of the two terminals to be compared; $(n_1, n_2)$ are their terminal names; $(a_1, a_2)$ are their addresses; $(t_1, t_2)$ are their types; and $(p_1, p_2)$ are their contacts.
The weighted sum of the calculated distance similarity and the calculated text similarities is the terminal similarity measure. A hyper-parameter threshold q is set; if the terminal similarity is greater than q, the two terminals are judged similar. The preset hyper-parameter threshold is 0.7.
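The weighted summation and threshold test can be sketched as follows. The weight values in `WEIGHTS` are hypothetical placeholders, since the text only requires that the five weights sum to 1; the threshold 0.7 is the one stated above. The function names are likewise assumptions.

```python
# Hypothetical weights (alpha_1 .. alpha_5); the text only requires that they sum to 1.
WEIGHTS = (0.4, 0.3, 0.15, 0.1, 0.05)
THRESHOLD_Q = 0.7  # preset hyper-parameter threshold from the text


def terminal_similarity(dist_sim: float, name_sim: float, addr_sim: float,
                        type_sim: float, contact_sim: float,
                        weights=WEIGHTS) -> float:
    """Weighted sum of one distance similarity and four text similarities."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    parts = (dist_sim, name_sim, addr_sim, type_sim, contact_sim)
    return sum(w * s for w, s in zip(weights, parts))


def is_duplicate(score: float, q: float = THRESHOLD_Q) -> bool:
    """Two terminals are judged similar when the score exceeds the threshold q."""
    return score > q


if __name__ == "__main__":
    near = terminal_similarity(0.9, 1.0, 0.8, 1.0, 0.0)
    far = terminal_similarity(0.3, 1.0, 0.8, 1.0, 0.0)
    print(round(near, 2), is_duplicate(near))
    print(round(far, 2), is_duplicate(far))
```

With these placeholder weights, two records with identical names but a low distance similarity fall below the threshold, which shows how the weight on distance controls how strongly location influences the decision.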
Terminal characteristics differ considerably between data sets, so the properties of each data feature must be considered during feature extraction. Setting different hyper-parameters (k, α) for different terminal data sets and feature properties gives better model performance.
In FMCG terminal management, terminal data contains redundant duplicates. Manual checking is costly and yields little benefit, and enterprises struggle to find a suitable measurement method for quickly, effectively, and automatically deduplicating the redundant terminal data in their databases. The algorithm model is fully automatic and requires no additional data processing or model training time. Deduplicating a single record is fast, with an average calculation time per record within 200 ms. The model can be deployed flexibly, either in real time or asynchronously. Compared with existing terminal deduplication methods, deduplication based on this terminal similarity measurement method is efficient: million-record data sets can be processed in one day, and the precision can reach 90%. The terminal similarity measurement method effectively and accurately identifies duplicate terminals in an FMCG retail terminal database and safeguards terminal data quality.
The invention provides a computer terminal device comprising one or more processors and a memory. The memory is coupled to the processor and configured to store one or more programs, which when executed by the one or more processors, cause the one or more processors to implement the terminal data similarity metric method as in any of the above embodiments.
The processor is used for controlling the overall operation of the computer terminal equipment so as to complete all or part of the steps of the terminal data similarity measurement method. The memory is used to store various types of data to support the operation at the computer terminal device, which data may include, for example, instructions for any application or method operating on the computer terminal device, as well as application-related data. The Memory may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk, or optical disk.
In an exemplary embodiment, the computer terminal device may be implemented by one or more Application Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Digital Signal Processing Devices (DSPD), Programmable Logic Devices (PLD), Field Programmable Gate Arrays (FPGA), controllers, microcontrollers, microprocessors or other electronic components, and is configured to perform the above-mentioned terminal data similarity measurement method and achieve technical effects consistent with the above-mentioned method.
In another exemplary embodiment, a computer readable storage medium including program instructions, which when executed by a processor, implement the steps of the terminal data similarity measure method in any one of the above embodiments, is also provided. For example, the computer readable storage medium may be the above-mentioned memory including program instructions, which are executable by a processor of a computer terminal device to perform the above-mentioned terminal data similarity measure method, and achieve the technical effects consistent with the above-mentioned method.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (8)

1. A method for measuring terminal data similarity is characterized by comprising the following steps:
determining a calculation characteristic according to the terminal data;
respectively calculating the distance similarity of the terminal data and the text similarity of the terminal data according to the calculation characteristics;
and inputting the distance similarity of the terminal data and the text similarity of the terminal data into a terminal similarity function for weighted summation to obtain a terminal similarity measure, and if the terminal similarity measure is greater than a preset hyper-parameter threshold, judging that the two terminal data are similar, otherwise, judging that the two terminal data are not similar.
2. The method according to claim 1, wherein the calculation features comprise:
features for calculating the distance similarity of the terminal data, including longitude and latitude;
and features for calculating the text similarity of the terminal data, including terminal name, address, type and contact.
3. The method according to claim 2, wherein the calculating the distance similarity of the terminal data comprises:
coding the longitude and latitude characteristics by adopting a Geohash algorithm to obtain a Geohash code;
determining, by searching the Geohash codes through an index, the terminal data set $S = (S_0, S_1, \ldots, S_n)$ formed by the terminals adjacent to the current terminal $S_i$, wherein n is the number of terminals;
calculating the distance between each terminal in the terminal data set $S = (S_0, S_1, \ldots, S_n)$ and the current terminal $S_i$ to obtain a distance set;
and inputting the distance set into a preset distance similarity function to obtain the distance similarity of the terminal data.
4. The method for measuring the similarity of terminal data according to claim 3, wherein calculating the text similarity of the terminal data comprises:
using Jieba word segmentation to obtain the word segmentation result of each text feature of the current terminal $S_i$ and the word segmentation result of each text feature of each terminal in the terminal data set $S = (S_0, S_1, \ldots, S_n)$;
and calculating the similarity of the word segmentation results by adopting a Levenshtein Distance algorithm to obtain the text similarity of the terminal data.
5. The method according to claim 4, wherein the terminal similarity function similarity is as follows:
$$\mathrm{similarity} = \alpha_1 f(d(l_1, l_2)) + \alpha_2\,\mathrm{fuzzy}(n_1, n_2) + \alpha_3\,\mathrm{fuzzy}(a_1, a_2) + \alpha_4\,\mathrm{fuzzy}(t_1, t_2) + \alpha_5\,\mathrm{fuzzy}(p_1, p_2)$$
wherein $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + \alpha_5 = 1$ and the $\alpha$ values are the weights of the different features; the function fuzzy() is the text similarity function calculated by the edit distance algorithm; f is the preset distance similarity function; $(l_1, l_2)$ are said longitudes and latitudes of the two terminals to be compared; $(n_1, n_2)$ are said terminal names of the two terminals to be compared; $(a_1, a_2)$ are said addresses of the two terminals to be compared; $(t_1, t_2)$ are said types of the two terminals to be compared; and $(p_1, p_2)$ are said contacts of the two terminals to be compared.
6. The method according to claim 5, further comprising setting the pre-set hyper-parameter threshold to 0.7.
7. A computer terminal device, comprising:
one or more processors;
a memory coupled to the processor for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the terminal data similarity metric method of any of claims 1-6.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the terminal data similarity measure method according to any one of claims 1 to 6.
CN202110798955.6A 2021-07-14 2021-07-14 Terminal data similarity measurement method Active CN113642310B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110798955.6A CN113642310B (en) 2021-07-14 2021-07-14 Terminal data similarity measurement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110798955.6A CN113642310B (en) 2021-07-14 2021-07-14 Terminal data similarity measurement method

Publications (2)

Publication Number Publication Date
CN113642310A (en) 2021-11-12
CN113642310B (en) 2022-04-19

Family

ID=78417365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110798955.6A Active CN113642310B (en) 2021-07-14 2021-07-14 Terminal data similarity measurement method

Country Status (1)

Country Link
CN (1) CN113642310B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115408379A (en) * 2022-10-25 2022-11-29 广州市玄武无线科技股份有限公司 Terminal repeating data determination method, device, equipment and computer storage medium
CN116128438A (en) * 2022-12-27 2023-05-16 江苏巨楷科技发展有限公司 Intelligent community management system based on big data record information
WO2024031943A1 (en) * 2022-08-10 2024-02-15 中国银联股份有限公司 Store deduplication processing method and apparatus, device, and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2523300A (en) * 1999-04-07 2000-10-12 Reclaim Technologies And Sservices, Ltd. A system for identification of selectively related database records
EP1939797A1 (en) * 2006-12-23 2008-07-02 NTT DoCoMo, Inc. Method and apparatus for automatically determining a semantic classification of context data
CN101299217A (en) * 2008-06-06 2008-11-05 北京搜狗科技发展有限公司 Method, apparatus and system for processing map information
CN101388023A (en) * 2008-09-12 2009-03-18 北京搜狗科技发展有限公司 Electronic map interest point data redundant detecting method and system
CN108710613A (en) * 2018-05-22 2018-10-26 平安科技(深圳)有限公司 Acquisition methods, terminal device and the medium of text similarity
CN112487204A (en) * 2020-12-01 2021-03-12 北京理工大学 Data ontology mapping method and system
CN112749542A (en) * 2021-01-19 2021-05-04 北京明略昭辉科技有限公司 Trade name matching method, system, equipment and storage medium
CN113076734A (en) * 2021-04-15 2021-07-06 云南电网有限责任公司电力科学研究院 Similarity detection method and device for project texts

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
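Songtao Shang et al.: "An Improved Focused Web Crawler based on Hybrid Similarity", International Journal of Performability Engineering *
Pan Xiao et al.: "A Survey of Research and Applications of Time-Series-Based Similarity Measurement Methods for Trajectory Data", Journal of Yanshan University *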

Also Published As

Publication number Publication date
CN113642310B (en) 2022-04-19

Similar Documents

Publication Publication Date Title
CN113642310B (en) Terminal data similarity measurement method
US6662189B2 (en) Method of performing data mining tasks for generating decision tree and apparatus therefor
US6834266B2 (en) Methods for estimating the seasonality of groups of similar items of commerce data sets based on historical sales data values and associated error information
CN112801720B (en) Method and device for generating shop category identification model and identifying shop category
CN111507240B (en) Face clustering method, face clustering device, electronic equipment and computer-readable storage medium
CN106649832B (en) Estimation method and device based on missing data
CN107153656A (en) Information search method and device
CN110766428A (en) Data value evaluation system and method
KR102104316B1 (en) Apparatus for predicting stock price of company by analyzing news and operating method thereof
CN109636482B (en) Data processing method and system based on similarity model
CN111652653A (en) Price determination and prediction model construction method, device, equipment and storage medium
CN113468034A (en) Data quality evaluation method and device, storage medium and electronic equipment
CN111148045B (en) User behavior cycle extraction method and device
CN110852076B (en) Method and device for automatic disease code conversion
CN110705297A (en) Enterprise name-identifying method, system, medium and equipment
CN116610821B (en) Knowledge graph-based enterprise risk analysis method, system and storage medium
CN111091416A (en) Method and device for predicting probability of hotel purchase robot
CN112560433B (en) Information processing method and device
RU2480828C1 (en) Method of predicting target value of events based on unlimited number of characteristics
CN108614811B (en) Data analysis method and device
JP5640796B2 (en) Name identification support processing apparatus, method and program
CN108921431A (en) Government and enterprise customers clustering method and device
CN113392289A (en) Search recommendation method and device and electronic equipment
CN116934418B (en) Abnormal order detection and early warning method, system, equipment and storage medium
CN108564422A (en) A system based on matrimony vine data analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant