CN115438719A - Data processing method, device, server and storage medium

Info

Publication number: CN115438719A
Application number: CN202210950879.0A
Authority: CN
Inventors: 韩逸青
Original assignee: Beijing Dajia Internet Information Technology Co Ltd
Current assignee: Beijing Dajia Internet Information Technology Co Ltd
Priority date: 2022-08-09
Filing date: 2022-08-09
Publication date: 2022-12-06

Abstract

The disclosure relates to a data processing method, a data processing device, a server and a storage medium, and relates to the technical field of computers. The present disclosure provides for determining duplicate POI data. The method comprises the following steps: determining a plurality of feature pairs; each feature pair comprises two features with the same dimension, wherein the first feature of the two features is the feature of the first POI data, and the second feature of the two features is the feature of the second POI data; for the first feature pair, determining feature similarity of the first feature pair to obtain a plurality of feature similarities, and determining data similarity of the first POI data and the second POI data according to the plurality of feature similarities; the first feature pair is any one of a plurality of feature pairs; and determining that the first POI data and the second POI data are repeated when the data similarity is larger than or equal to a preset threshold value.

Description

Data processing method, device, server and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a data processing method, an apparatus, a server, and a storage medium.

Background

With the development of computer technology, electronic maps are increasingly used. In an electronic map, there is a large amount of address information, which is generally present in the form of point of interest (POI) data. POI data typically includes information such as the name, address, contact phone, location coordinates, etc. of the location entity. One POI data may represent one house, one shop, one mailbox, one bus station, and the like.

In practical applications, POI data in the electronic map are continuously updated, for example by adding, deleting, or replacing entries. As a result, much of the POI data that the server acquires from the electronic map at different times is similar, and the actual content it represents is the same (i.e., the POI data are repeated), which affects the use of the POI data by the user. How to determine repeated POI data is therefore a technical problem that urgently needs to be solved.

Disclosure of Invention

The disclosure provides a data processing method, a data processing device, a server and a storage medium, which are used for determining repeated POI data. The technical scheme of the disclosure is as follows:

according to a first aspect of the embodiments of the present disclosure, there is provided a data processing method, including: determining a plurality of feature pairs; each feature pair comprises two features with the same dimension, wherein the first feature of the two features is the feature of the first POI data, and the second feature of the two features is the feature of the second POI data; for the first feature pair, determining feature similarity of the first feature pair to obtain a plurality of feature similarities, and determining data similarity of the first POI data and the second POI data according to the plurality of feature similarities; the first feature pair is any one of a plurality of feature pairs; and determining that the first POI data and the second POI data are repeated when the data similarity is larger than or equal to a preset threshold value.

Optionally, the method further includes: acquiring the first POI data and the second POI data; performing word segmentation on the first POI data to obtain a plurality of first features of the first POI data, and performing word segmentation on the second POI data to obtain a plurality of second features of the second POI data.

Optionally, the obtaining of the second POI data includes: acquiring second POI data according to the position information of the first POI data; the distance between the position information of the second POI data and the position information of the first POI data is less than or equal to a preset distance.

Optionally, for the first feature pair, determining feature similarity of the first feature pair to obtain multiple feature similarities, including: determining a dimension type of the first feature pair; the dimension types comprise a text type and a numerical type; under the condition that the dimension type is a text type, calculating the feature similarity of the first feature pair according to a first feature similarity algorithm; and under the condition that the dimension type is a numerical type, calculating the feature similarity of the first feature pair according to a second feature similarity algorithm.

Optionally, the first feature similarity algorithm is an edit distance algorithm, and the second feature similarity algorithm is a cosine similarity algorithm.

Optionally, determining the data similarity between the first POI data and the second POI data according to the multiple feature similarities includes: and weighting the feature similarities to obtain the data similarity of the first POI data and the second POI data.

Optionally, the method further includes: and storing the first POI data under the condition that the data similarity is smaller than a preset threshold value.

Optionally, after determining that the first POI data and the second POI data are repeated, the method further includes: the second POI data is deleted and the first POI data is stored.

According to a second aspect of the embodiments of the present disclosure, there is provided a data processing apparatus including a determination unit and a processing unit; a determination unit configured to determine a plurality of feature pairs; each feature pair comprises two features with the same dimension, wherein a first feature of the two features is a feature of the first POI data, and a second feature of the two features is a feature of the second POI data; the processing unit is used for determining the feature similarity of the first feature pair to obtain a plurality of feature similarities, and determining the data similarity of the first POI data and the second POI data according to the plurality of feature similarities; the first feature pair is any one of a plurality of feature pairs; the determining unit is further used for determining that the first POI data and the second POI data are repeated when the data similarity is larger than or equal to a preset threshold value.

Optionally, the data processing apparatus further includes an obtaining unit; the obtaining unit is configured to acquire the first POI data and the second POI data; the processing unit is further configured to perform word segmentation on the first POI data to obtain a plurality of first features of the first POI data, and perform word segmentation on the second POI data to obtain a plurality of second features of the second POI data.

Optionally, the obtaining unit is specifically configured to: acquiring second POI data according to the position information of the first POI data; the distance between the position information of the second POI data and the position information of the first POI data is less than or equal to a preset distance.

Optionally, the processing unit is specifically configured to: determining a dimension type of the first feature pair; the dimension type comprises a text type and a numerical type; under the condition that the dimension type is a text type, calculating the feature similarity of the first feature pair according to a first feature similarity algorithm; and under the condition that the dimension type is a numerical type, calculating the feature similarity of the first feature pair according to a second feature similarity algorithm.

Optionally, the processing unit is specifically configured to: and weighting the feature similarities to obtain the data similarity of the first POI data and the second POI data.

Optionally, the processing unit is further configured to: and storing the first POI data under the condition that the data similarity is smaller than a preset threshold value.

Optionally, after it is determined that the first POI data and the second POI data are repeated, the processing unit is further configured to: delete the second POI data and store the first POI data.

According to a third aspect of the embodiments of the present disclosure, there is provided a server, including: a processor, a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the data processing method of the first aspect.

According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon instructions which, when executed by a processor of a server, enable the server to perform the data processing method of the first aspect as described above.

According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement the data processing method as described in the first aspect above.

The technical scheme provided by the disclosure brings at least the following beneficial effects: the data processing device determines a plurality of feature pairs. Each feature pair includes two features of the same dimension, the first being a feature of the first point of interest (POI) data and the second being a feature of the second POI data. Each feature pair therefore contains a feature of the first POI data and a feature of the second POI data under the same dimension, so that the features of the two POI data are matched up and the feature similarity between them can be determined conveniently. For any first feature pair among the plurality of feature pairs, the data processing device determines the feature similarity of that pair, thereby obtaining a plurality of feature similarities. The data processing device then determines the data similarity between the first POI data and the second POI data according to the feature similarities; a data similarity greater than or equal to a preset threshold indicates that the first POI data and the second POI data are repeated. In this way, repeated POI data can be identified and then processed in a targeted manner.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.

FIG. 1 is a block diagram of a data processing system in accordance with an exemplary embodiment;

FIG. 2 is a first flowchart of a data processing method according to an exemplary embodiment;

FIG. 3 is a second flowchart of a data processing method according to an exemplary embodiment;

FIG. 4 is a schematic diagram of a word segmentation flow according to an exemplary embodiment;

FIG. 5 is a third flowchart of a data processing method according to an exemplary embodiment;

FIG. 6 is a schematic diagram of a data acquisition flow according to an exemplary embodiment;

FIG. 7 is a fourth flowchart of a data processing method according to an exemplary embodiment;

FIG. 8 is a schematic flowchart of determining feature pair similarity according to an exemplary embodiment;

FIG. 9 is a fifth flowchart of a data processing method according to an exemplary embodiment;

FIG. 10 is a logical architecture diagram of a data processing method according to an exemplary embodiment;

FIG. 11 is a block diagram of a data processing apparatus according to an exemplary embodiment;

FIG. 12 is a schematic structural diagram of an electronic device according to an exemplary embodiment.

Detailed Description

In order to make the technical solutions of the present disclosure better understood, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosure, as detailed in the appended claims.

In addition, in the description of the embodiments of the present disclosure, "/" means "or" unless otherwise specified; for example, A/B may indicate A or B. "And/or" herein merely describes an association between associated objects and covers three relationships; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, in the description of the embodiments of the present disclosure, "a plurality" means two or more.

It should be noted that the user information (including but not limited to user device information, user personal information, user behavior information, etc.) and data (including but not limited to program code, etc.) referred to in the present disclosure are information and data authorized by the user or sufficiently authorized by each party.

Before the embodiments of the present disclosure are explained in detail, some related art involved in the embodiments is described.

With the development of computer technology, electronic maps are increasingly used. In an electronic map, there is a large amount of POI data. POI data is used to reflect specific geographic location points that are of interest to the user or of practical use to the user. In the electronic map, one POI data may represent one house, one shop, one mailbox, one bus station, and the like.

In practical applications, POI data in electronic maps are continuously adjusted, for example by adding, deleting, or replacing entries. The server may acquire a large amount of similar POI data at different times, and if the server stores all of it, a large share of the stored POI data may be repeated, which affects the server's use of the data (for example, when the same store is represented by several POI data, the efficiency of the server's distribution to users is affected). Therefore, how to determine and filter out repeated POI data has become an urgent technical problem to be solved in the related art.

The data processing method provided by the embodiment of the disclosure is used for solving the technical problems in the related art. The data processing method provided by the embodiment of the disclosure can be applied to a data processing system, and fig. 1 shows a schematic structural diagram of the data processing system. As shown in fig. 1, the data processing system 10 includes a data processing apparatus 11 and a server 12. The data processing device 11 is connected to a server 12. The data processing device 11 and the server 12 may be connected by a wired method or a wireless method, which is not limited in the embodiment of the present invention.

The data processing device 11 is configured to acquire first point of interest (POI) data and second POI data, perform word segmentation on the first POI data to obtain a plurality of first features of the first POI data, and perform word segmentation on the second POI data to obtain a plurality of second features of the second POI data. The data processing device 11 is further configured to determine a plurality of feature pairs, and for a first feature pair, determine a feature similarity of the first feature pair, so as to obtain a plurality of feature similarities. The data processing device 11 is further configured to determine a data similarity between the first POI data and the second POI data according to the feature similarities, and determine that the first POI data and the second POI data are repeated when the data similarity is greater than or equal to a preset threshold.

The data processing means 11 may be implemented in a server 12 for various multimedia resource applications. The server 12 may be a server of a multimedia resource sharing platform application, such as a server of a short video sharing platform application. The server 12 is deployed with a POI database, such as an ElasticSearch (ES) database, in which a large amount of POI data is stored.

In different application scenarios, the data processing apparatus 11 and the server 12 may be independent devices or may be integrated in the same device, which is not specifically limited in this embodiment of the present invention.

When the data processing device 11 and the server 12 are integrated into the same device, the data transmission method between the data processing device 11 and the server 12 is data transmission between internal modules of the device. In this case, the data transfer flow between the two is the same as the "data transfer flow between the data processing device 11 and the server 12" in the case where they are independent of each other.

In the following embodiments provided in the embodiments of the present disclosure, an example is described in which the data processing apparatus 11 and the server 12 are independently provided.

Fig. 2 is a flow diagram illustrating a data processing method, according to some example embodiments. In some embodiments, the data processing method can be applied to the data processing device and the server shown in fig. 1, and can also be applied to other similar devices.

As shown in fig. 2, a data processing method provided by the embodiment of the present disclosure includes the following steps S201 to S204.

S201, the data processing device determines a plurality of feature pairs.

Each feature pair comprises two features with the same dimension, wherein a first feature of the two features is a feature of the first POI data, and a second feature of the two features is a feature of the second POI data.

As a possible implementation manner, for any feature of the first POI data (hereinafter referred to as a first feature), the data processing apparatus acquires the dimension identifier of the first feature. The data processing apparatus then traverses the dimension identifiers of the features of the second POI data (hereinafter referred to as second features), determines, from the plurality of second features, the second feature whose dimension identifier is the same as that of the first feature, and takes the first feature and the second feature of the same dimension as one feature pair. Repeating this for each first feature yields the plurality of feature pairs.

Illustratively, the plurality of first features include a name feature vector and an address feature vector of the first POI data, whose dimension identifiers are identifier 1 and identifier 2, respectively; the plurality of second features include a name feature vector and an address feature vector of the second POI data, whose dimension identifiers are likewise identifier 1 and identifier 2; the data processing device takes the name feature vector of the first POI data and the name feature vector of the second POI data as one feature pair, and takes the address feature vector of the first POI data and the address feature vector of the second POI data as another feature pair.
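The pairing by dimension identifier described above can be sketched as follows. This is a minimal illustration only: the dictionary layout (dimension identifier mapped to feature vector) and the name `build_feature_pairs` are assumptions made for the example and do not appear in the disclosure.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: each POI's features are keyed by a dimension identifier
# (e.g., 1 for the name dimension, 2 for the address dimension).
Feature = List[float]


def build_feature_pairs(
    first_features: Dict[int, Feature],
    second_features: Dict[int, Feature],
) -> List[Tuple[int, Feature, Feature]]:
    """Pair each first feature with the second feature of the same dimension."""
    pairs = []
    for dim_id, first_feature in first_features.items():
        # Look up the second feature carrying the same dimension identifier.
        second_feature = second_features.get(dim_id)
        if second_feature is not None:
            pairs.append((dim_id, first_feature, second_feature))
    return pairs


first = {1: [0.2, 0.8], 2: [0.5, 0.5]}
second = {2: [0.4, 0.6], 1: [0.2, 0.7], 3: [1.0, 0.0]}
pairs = build_feature_pairs(first, second)
```

Dimensions that exist in only one of the two POI data (identifier 3 in the example) produce no pair in this sketch; how such dimensions should be handled is not stated in the disclosure.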

S202, for the first feature pair, the data processing device determines feature similarity of the first feature pair to obtain a plurality of feature similarities.

Wherein the first pair of features is any one of a plurality of pairs of features.

As a possible implementation manner, the data processing apparatus calculates feature similarity between two feature vectors in each feature pair according to a preset similarity algorithm, so as to obtain a plurality of feature similarities.

Note that the similarity calculation method is set in advance in the data processing device by the operation and maintenance staff. For example, the similarity algorithm may be a cosine similarity algorithm or an edit distance algorithm, and the specific similarity algorithm is not limited in the embodiments of the present disclosure.
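The two similarity algorithms named above can be sketched as follows. The edit distance and cosine similarity computations are the standard ones; converting an edit distance into a similarity in the range 0 to 1 by dividing by the longer string length is an assumption made for this example, since the disclosure does not specify the normalization.

```python
import math
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)
```

For instance, `edit_distance("kitten", "sitting")` is 3, so the assumed normalization gives a similarity of 1 - 3/7.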

In some embodiments, the data processing apparatus may also use different similarity algorithms for different types of feature vectors when calculating the feature similarity between two feature vectors. For the specific implementation of this step, refer to the subsequent description of the embodiments of the present disclosure; it is not repeated here.

S203, the data processing device determines the data similarity of the first POI data and the second POI data according to the feature similarities.

As a possible implementation manner, the data processing apparatus determines the sum of the plurality of feature similarities as the data similarity of the first POI data and the second POI data.

As another possible implementation manner, the data processing apparatus weights the feature similarities to obtain the data similarity between the first POI data and the second POI data.

It should be noted that, the weight of each feature similarity is set in advance in the data processing device by an operation and maintenance worker. For example, the weight of the feature similarity for the name feature may be set to 0.5, and the weight of the feature similarity for the address feature may be set to 0.9.

It can be understood that the data processing device weights the feature similarities to obtain the data similarity between the first POI data and the second POI data, and the feature similarities are reasonably utilized, so that the finally obtained data similarity is more accurate.
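The weighting step can be sketched as a weighted sum. The function name, the dictionary layout, and the optional `normalize` flag are assumptions made for this example; the disclosure gives only the example weights (0.5 for the name feature, 0.9 for the address feature) and does not say whether the result is normalized.

```python
from typing import Dict


def weighted_data_similarity(
    feature_similarities: Dict[str, float],
    weights: Dict[str, float],
    normalize: bool = False,
) -> float:
    """Weighted sum of per-dimension feature similarities."""
    total = sum(weights[dim] * sim for dim, sim in feature_similarities.items())
    if normalize:
        # Assumed option: divide by the weight sum so the score stays in [0, 1]
        # when every feature similarity is in [0, 1].
        weight_sum = sum(weights[dim] for dim in feature_similarities)
        return total / weight_sum if weight_sum else 0.0
    return total


similarities = {"name": 0.8, "address": 1.0}
weights = {"name": 0.5, "address": 0.9}  # example weights from the text
raw_score = weighted_data_similarity(similarities, weights)
scaled_score = weighted_data_similarity(similarities, weights, normalize=True)
```

With the example weights the unnormalized score here is 0.5 × 0.8 + 0.9 × 1.0 = 1.3, which exceeds 1; dividing by the weight sum (1.4) is one possible way to keep the score comparable with a threshold such as 0.9.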

S204, in a case where the data similarity is greater than or equal to a preset threshold, the data processing device determines that the first POI data and the second POI data are repeated.

As a possible implementation manner, after determining the data similarity between the first POI data and the second POI data, the data processing apparatus compares the data similarity with a preset threshold to determine whether the data similarity is greater than or equal to the preset threshold. In a case where the data similarity is greater than or equal to the preset threshold, the data processing apparatus determines that the first POI data and the second POI data are repeated.

It should be noted that the preset threshold is set in the data processing device by the operation and maintenance staff in advance.

For example, if the data similarity between the first POI data and the second POI data is 0.91 and the preset threshold is 0.9, the data processing apparatus determines that the first POI data and the second POI data are repeated.
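The threshold comparison in S204 amounts to a single inclusive comparison, sketched below. The function name is hypothetical, and the default of 0.9 is simply the example threshold given above.

```python
def is_repeated(data_similarity: float, preset_threshold: float = 0.9) -> bool:
    """Return True when the data similarity reaches the preset threshold."""
    # The comparison is inclusive: a similarity equal to the threshold counts
    # as repeated, matching "greater than or equal to" in S204.
    return data_similarity >= preset_threshold


example_result = is_repeated(0.91, 0.9)
```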

The technical scheme provided by the embodiment of the disclosure brings at least the following beneficial effects: the data processing device determines a plurality of feature pairs. Each feature pair includes two features of the same dimension, the first being a feature of the first point of interest (POI) data and the second being a feature of the second POI data. Each feature pair therefore contains a feature of the first POI data and a feature of the second POI data under the same dimension, so that the features of the two POI data are matched up and the feature similarity between them can be determined conveniently. For any first feature pair among the plurality of feature pairs, the data processing device determines the feature similarity of that pair, thereby obtaining a plurality of feature similarities. The data processing device then determines the data similarity between the first POI data and the second POI data according to the feature similarities; a data similarity greater than or equal to a preset threshold indicates that the first POI data and the second POI data are repeated. In this way, repeated POI data can be identified and then processed in a targeted manner.

In one design, in order to obtain the characteristics of the first POI data and the characteristics of the second POI data, as shown in fig. 3, the data processing method provided in the embodiment of the present disclosure further includes the following steps S301 to S302.

S301, the data processing device acquires first POI data and second POI data.

As one possible implementation, the data processing apparatus acquires the first POI data and the second POI data from an ES database of the server.

As another possible implementation manner, the data processing apparatus acquires first POI data to be written, and acquires second POI data from an ES database of the server.

In this case, the first POI data is data input from outside, and the second POI data is data stored in the server. For example, the first POI data is data that operation and maintenance personnel input into the data processing device in real time.

It is understood that a POI refers to any geographically meaningful location point on an electronic map, such as a store, school, hospital, gas station, etc.; the POI data is data related to the POI. For example, if the POI is a school, the POI data may include the school's name, telephone number, longitude and latitude, address information, teaching staff information, school profile, and the like. The embodiment of the present disclosure does not limit the specific content of the first POI data and the specific content of the second POI data.

S302, the data processing device performs word segmentation on the first POI data to obtain a plurality of first characteristics of the first POI data, and performs word segmentation on the second POI data to obtain a plurality of second characteristics of the second POI data.

As a possible implementation manner, the data processing device inputs the first POI data into a preset word segmentation package, which outputs a plurality of first features of the first POI data; similarly, the data processing device inputs the second POI data into the preset word segmentation package, which outputs a plurality of second features of the second POI data.

It should be noted that the word segmentation package is set in the data processing device in advance by the operation and maintenance staff. It performs word segmentation on the content of the input POI data to obtain a plurality of segmented words, and each segmented word is taken as the feature of one dimension of the POI data, so that the multi-dimensional features of the POI data are obtained. For example, the content of the POI data includes the name, longitude and latitude, address information, and other information of a school, but this information is mixed together and difficult to distinguish; the data processing apparatus can extract the name segment and the address segment through the word segmentation package and take each extracted segment as the feature of one dimension of the POI data, thereby obtaining the multi-dimensional features of the POI data.

The specific segmented words obtained depend on the segmentation capability of the word segmentation package. The preset word segmentation packages include the Jieba package and a self-maintained word segmentation package. Jieba is an existing Chinese word segmentation library that is widely used in natural language processing; it segments an input character string mainly on the basis of the words already in its dictionary. For the self-maintained word segmentation package, the operation and maintenance staff build a segmentation lexicon from daily operational feedback and train it offline; the resulting package supplements vocabulary that the Jieba package cannot recognize, such as a newly launched, internet-famous brand.

FIG. 4 shows a schematic diagram of the word segmentation flow. The data processing device may input the first POI data into the Jieba package and the self-maintained word segmentation package respectively, and take the combined set of the segmentation results (such as the name segment and the address segment) of the two packages as the final segmentation result of the first POI data. Similarly, the data processing device inputs the second POI data into the Jieba package and the self-maintained word segmentation package respectively, and takes the combined set of the segmentation results of the two packages as the final segmentation result of the second POI data.
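Combining the results of two segmenters can be sketched as below. The tokenizers here are simple stand-ins written for the example (whitespace splitting and a vocabulary lookup), not the Jieba package or the self-maintained package themselves; in practice a function such as `jieba.lcut` could presumably be passed in as one of the tokenizers. All function names are hypothetical.

```python
from typing import Callable, Iterable, List

Tokenizer = Callable[[str], Iterable[str]]


def merge_segmentations(text: str, tokenizers: List[Tokenizer]) -> List[str]:
    """Run every tokenizer on the text and keep each distinct segment once."""
    merged: List[str] = []
    seen = set()
    for tokenize in tokenizers:
        for token in tokenize(text):
            if token and token not in seen:
                seen.add(token)
                merged.append(token)
    return merged


def general_tokenizer(text: str) -> List[str]:
    """Stand-in for a general-purpose segmenter: split on whitespace."""
    return text.split()


def make_custom_tokenizer(vocabulary: List[str]) -> Tokenizer:
    """Stand-in for a self-maintained segmenter backed by a custom vocabulary."""
    def tokenize(text: str) -> List[str]:
        return [word for word in vocabulary if word in text]
    return tokenize


segments = merge_segmentations(
    "Sunrise Bubble Tea Main Street",
    [general_tokenizer, make_custom_tokenizer(["Bubble Tea"])],
)
```

In this example the custom vocabulary contributes the multi-word segment "Bubble Tea", which the general stand-in tokenizer splits into two separate words.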

The plurality of first features may be embodied as a multi-dimensional feature vector, which is converted from a multi-dimensional feature array. For example, for an address segment (e.g., longitude and latitude, Global Positioning System (GPS) coordinates, etc.), the data processing apparatus performs feature conversion on the address segment using a longitude-latitude hash (geohash) algorithm to obtain a feature array of one dimension. After the data processing device converts this feature array into a feature vector, the address feature vector is obtained.
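A geohash conversion of a latitude and longitude can be sketched as follows. The encoder is the standard geohash bit-interleaving scheme with the usual base-32 alphabet; mapping the resulting characters to their base-32 digit values to form a numeric array is an assumption made for this example, since the disclosure does not say how the feature array is represented.

```python
from typing import List

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat: float, lon: float, precision: int = 8) -> str:
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars: List[str] = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def geohash_to_array(geohash: str) -> List[int]:
    """Assumed representation: the base-32 digit value of each character."""
    return [_BASE32.index(ch) for ch in geohash]


code = geohash_encode(57.64911, 10.40744, precision=5)  # "u4pru"
array = geohash_to_array(code)  # [26, 4, 21, 23, 26]
```

Nearby locations usually share a geohash prefix, which is one reason this encoding suits address comparison, although two points on opposite sides of a cell boundary can be close while having different prefixes.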

In one design, as shown in fig. 5, in order to acquire the second POI data, the foregoing S301 provided in the embodiment of the present disclosure specifically includes the following S3011 to S3012:

S3011, the data processing apparatus acquires position information of the first POI data.

As one possible implementation, after acquiring the first POI data, the data processing apparatus extracts the position information from the first POI data.

S3012, the data processing device obtains second POI data according to the position information of the first POI data.

The distance between the position information of the second POI data and the position information of the first POI data is less than or equal to a preset distance.

As a possible implementation manner, after acquiring the position information of the first POI data, the data processing apparatus queries, from the ES database, POI data having a distance from the position information of the first POI data smaller than or equal to a preset distance according to the position information of the first POI data, and acquires the queried POI data.

In practical applications, the data processing apparatus may query POI data satisfying a search condition through a GEO-ES search engine in the ES database. Specifically, the data processing apparatus inputs the search condition to the GEO-ES search engine to obtain the POI data satisfying the condition. For example, the data processing apparatus uses the position information of the first POI data, together with a distance of not more than 3 km from that position, as the search condition, and acquires the POI data in the ES database that satisfy the condition. Accordingly, there may be a plurality of second POI data, and the number of second POI data is not limited in the embodiments of the present disclosure.
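As an illustration of such a search condition, the sketch below builds a request body in the form of the Elasticsearch `geo_distance` query. It assumes that the ES database is Elasticsearch and that each POI document has a `geo_point` field; the field name `location` and the helper `build_geo_query` are hypothetical and do not come from the disclosure.

```python
from typing import Any, Dict


def build_geo_query(lat: float, lon: float, distance_km: float = 3.0,
                    field: str = "location") -> Dict[str, Any]:
    """Build a geo_distance filter (Elasticsearch query DSL) as a plain dict."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": f"{distance_km:g}km",  # e.g. "3km"
                        field: {"lat": lat, "lon": lon},   # assumed geo_point field
                    }
                }
            }
        }
    }


# Search condition for POI data within 3 km of the first POI data.
query = build_geo_query(39.9042, 116.4074)
```

The resulting dictionary would be sent as the body of a search request; how the request is issued depends on the client in use and is left out here.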

Illustratively, as shown in fig. 6, upon receiving POI data to be written into the ES database (i.e., the first POI data), the data processing apparatus acquires, through the GEO-ES search engine, the POI data within 3 km of the first POI data (i.e., the second POI data). Further, the data processing apparatus determines the data similarity between the first POI data and each second POI data, and sorts the plurality of second POI data in descending order of data similarity to obtain a sorting result.
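The sorting step can be sketched as follows. The similarity function is passed in as a parameter; `toy_similarity` (the Jaccard overlap of the characters in the names) is a placeholder chosen for the example and is not the data similarity defined in the disclosure.

```python
from typing import Callable, Dict, List, Tuple

Poi = Dict[str, str]


def rank_candidates(first: Poi, candidates: List[Poi],
                    similarity: Callable[[Poi, Poi], float]) -> List[Tuple[float, Poi]]:
    """Score every candidate against `first` and sort by score, highest first."""
    scored = [(similarity(first, cand), cand) for cand in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_similarity(a: Poi, b: Poi) -> float:
    """Placeholder similarity: Jaccard overlap of the characters in the names."""
    chars_a, chars_b = set(a["name"]), set(b["name"])
    return len(chars_a & chars_b) / len(chars_a | chars_b)


first_poi = {"name": "abc"}
second_pois = [{"name": "xyz"}, {"name": "abc"}, {"name": "abd"}]
ranking = rank_candidates(first_poi, second_pois, toy_similarity)
```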

As can be understood, the data processing apparatus acquires, according to the position information of the first POI data, the second POI data whose distance from the position of the first POI data is less than or equal to the preset distance. In this way, the POI data that are most likely to duplicate the first POI data are screened out, so that the deduplication work of the data processing apparatus is more targeted and the computing resources of the data processing apparatus are saved.

In one design, as shown in fig. 7, in order to obtain a plurality of feature similarities, the step S202 provided in the embodiment of the present disclosure specifically includes the following steps S2021 to S2023:

S2021, the data processing apparatus determines a dimension type of the first feature pair.

Wherein the dimension types include a text type and a numerical type.

As a possible implementation, the data processing apparatus obtains the word segment corresponding to the first feature or the second feature in the first feature pair. If the content of the word segment is text, the data processing apparatus determines the dimension type of the first feature pair as the text type; if the content of the word segment is a numerical value, the data processing apparatus determines the dimension type of the first feature pair as the numerical type.

For example, a name segment reflects the name of a school, and its content is usually text. Therefore, if the word segment corresponding to the features in the first feature pair is a name segment, the data processing apparatus determines the dimension type of the first feature pair as the text type. An address segment reflects the address information of a school, and its content may be text (such as a street or a house number) or numerical values (such as longitude and latitude or coordinates). When the content of the address segment is text, the data processing apparatus determines the dimension type of the first feature pair as the text type; when the content of the address segment is a numerical value, the data processing apparatus determines the dimension type of the first feature pair as the numerical type.
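One simple way to make this determination is sketched below. The rule used here (a dimension is numerical only if every token parses as a number) is an assumption for illustration; the disclosure does not state how text and numerical content are told apart.

```python
from enum import Enum
from typing import Iterable


class DimensionType(Enum):
    TEXT = "text"
    NUMERIC = "numeric"


def is_number(token: str) -> bool:
    """Return True if the token can be parsed as a floating-point number."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def dimension_type(tokens: Iterable[str]) -> DimensionType:
    """NUMERIC only if there is at least one token and all tokens are numbers."""
    token_list = list(tokens)
    if token_list and all(is_number(tok) for tok in token_list):
        return DimensionType.NUMERIC
    return DimensionType.TEXT


coordinate_type = dimension_type(["39.9042", "116.4074"])
street_type = dimension_type(["Zhongguancun", "Street", "No.", "5"])
```

Under this rule a street address that mixes words and a house number is treated as text, which matches the example in which house numbers are listed as text content.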

S2022, in the case that the dimension type is the text type, the data processing apparatus calculates the feature similarity of the first feature pair according to the first feature similarity algorithm.

As one possible implementation, in the case where the dimension type is a text type, the data processing apparatus calculates a feature similarity between two feature vectors in the first feature pair according to a first feature similarity algorithm.

S2023, if the dimension type is the numerical type, the data processing apparatus calculates the feature similarity of the first feature pair according to the second feature similarity algorithm.

As one possible implementation, in a case where the dimension type is a numerical type, the data processing apparatus calculates the feature similarity between two feature vectors in the first feature pair according to the second feature similarity algorithm.

It should be noted that the first feature similarity algorithm and the second feature similarity algorithm are both set in advance in the data processing apparatus by operation and maintenance staff. The first feature similarity algorithm may be the same as or different from the second feature similarity algorithm. For example, each of the first feature similarity algorithm and the second feature similarity algorithm may be any one of a cosine similarity algorithm, an edit distance algorithm, and a Euclidean distance algorithm.

Preferably, the first feature similarity algorithm is an edit distance algorithm, and the second feature similarity algorithm is a cosine similarity algorithm. That is, when the dimension type is the text type, the data processing apparatus calculates the edit distance between the two feature vectors in the first feature pair according to the first feature similarity algorithm, and takes the calculated edit distance as the feature similarity of the first feature pair. When the dimension type is the numerical type, the data processing apparatus calculates the cosine of the angle between the two feature vectors in the first feature pair according to the second feature similarity algorithm, and determines the feature similarity of the first feature pair according to the calculated cosine value.
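The two preferred algorithms can be sketched as follows. For readability the edit distance is computed here on plain strings rather than on feature vectors. The function `edit_similarity`, which rescales the distance to the range 0 to 1 so that a larger value means more similar, is an assumption added for the example; the disclosure itself only states that the edit distance is taken as the feature similarity.

```python
import math
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a single rolling row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 0.0 if norm == 0 else dot / norm


text_similarity = edit_similarity("kitten", "sitting")
numeric_similarity = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```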

Illustratively, as shown in fig. 8, for a feature pair whose dimension type is the text type (e.g., a name feature or a text address feature), the data processing apparatus calculates an edit distance; for a feature pair whose dimension type is the numerical type (e.g., coordinates or a phone number), the data processing apparatus calculates a cosine similarity. Further, the data processing apparatus determines the data similarity between the first POI data and the second POI data according to the plurality of feature similarities, and finally outputs determination information indicating whether or not the first POI data and the second POI data are duplicates.
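The final determination can be sketched as a weighted combination followed by a threshold comparison. The weights, the threshold of 0.8, and the division by the total weight are assumptions chosen for illustration; the disclosure states only that the feature similarities are weighted and that the result is compared with a preset threshold.

```python
from typing import Dict


def data_similarity(feature_sims: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Weighted average of the per-dimension feature similarities."""
    total_weight = sum(weights[dim] for dim in feature_sims)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted = sum(feature_sims[dim] * weights[dim] for dim in feature_sims)
    return weighted / total_weight


def is_duplicate(feature_sims: Dict[str, float], weights: Dict[str, float],
                 threshold: float = 0.8) -> bool:
    """Duplicate when the data similarity is greater than or equal to the threshold."""
    return data_similarity(feature_sims, weights) >= threshold


# Hypothetical per-dimension similarities and weights.
sims = {"name": 0.9, "address": 0.8, "coordinates": 1.0}
weights = {"name": 0.5, "address": 0.3, "coordinates": 0.2}
score = data_similarity(sims, weights)
```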

It can be understood that the edit distance algorithm is more suitable for measuring the similarity between texts, and the cosine similarity is more suitable for measuring the similarity between numerical values. Therefore, the feature similarity determined by the preferred scheme is more accurate.

In one design, as shown in fig. 9, after step S204, the data processing method provided in the embodiment of the present disclosure further includes the following step S205.

S205, the data processing apparatus deletes the second POI data and stores the first POI data.

As a possible implementation, after determining that the first POI data and the second POI data are duplicates, the data processing apparatus deletes the second POI data and stores the first POI data in the ES database.

As can be appreciated, deleting the second POI data and storing the first POI data after determining that the first POI data is duplicated with the second POI data reduces the duplication rate of the POI data in the ES database.

In some embodiments, the data processing apparatus may further store the determined duplicate POI data in the same folder, resulting in a merged record log.

In one design, the data processing apparatus directly stores the first POI data into the ES database in a case where the data similarity is smaller than a preset threshold.

It will be appreciated that for non-repeating POI data, the data processing apparatus stores it directly for convenient use by the user.
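The two storage cases (deleting duplicates before storing, and storing directly when no duplicate is found) can be sketched together. A plain dictionary stands in for the ES database, and the field name `id`, the helper `write_poi`, and the threshold value are hypothetical.

```python
from typing import Callable, Dict, List

Poi = Dict[str, str]


def write_poi(store: Dict[str, Poi], first: Poi, candidates: List[Poi],
              similarity: Callable[[Poi, Poi], float],
              threshold: float = 0.8) -> List[str]:
    """Delete every stored duplicate of `first`, then store `first`.

    Returns the ids of the deleted POIs; the list is empty when `first`
    has no duplicate and is therefore stored directly.
    """
    removed: List[str] = []
    for second in candidates:
        if similarity(first, second) >= threshold:
            store.pop(second["id"], None)
            removed.append(second["id"])
    store[first["id"]] = first
    return removed


def same_name(a: Poi, b: Poi) -> float:
    """Placeholder similarity: 1.0 for identical names, otherwise 0.0."""
    return 1.0 if a["name"] == b["name"] else 0.0


store = {"a": {"id": "a", "name": "Cafe"}, "b": {"id": "b", "name": "Bank"}}
new_poi = {"id": "c", "name": "Cafe"}
removed_ids = write_poi(store, new_poi, list(store.values()), same_name)
```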

Fig. 10 shows a logic architecture of the data processing method provided by the embodiments of the present disclosure. The logic architecture includes a business layer, a service layer, and a data layer. The data layer is used for providing data support and mainly comprises the ES database, the GEO-ES search engine, the Jieba library and the like. The service layer is used for providing a spatial index service, a word segmentation service, a fusion service (such as addition or deletion) and the like. The business layer is used for processing business and for realizing the data processing flow according to the services provided by the service layer.

The foregoing embodiments mainly introduce the solutions provided by the embodiments of the present disclosure from the perspective of apparatuses (devices). It is understood that, in order to implement the method, the apparatus or device includes hardware structures and/or software modules for executing the respective method flows, and these hardware structures and/or software modules may form an electronic device. Those of skill in the art will readily appreciate that, in combination with the exemplary algorithm steps described in the embodiments disclosed herein, the present disclosure can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

The present disclosure may perform functional module division on the apparatus or device according to the above method examples, for example, the apparatus or device may divide each functional module corresponding to each function, or may integrate two or more functions into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, the division of the modules in the embodiments of the present disclosure is illustrative, and is only one division of logic functions, and there may be another division in actual implementation.

Fig. 11 is a schematic configuration diagram of a data processing apparatus according to an exemplary embodiment. Referring to fig. 11, a data processing apparatus 40 provided in the embodiment of the present disclosure includes a determining unit 401 and a processing unit 402.

A determining unit 401 for determining a plurality of feature pairs; each feature pair comprises two features with the same dimension, wherein the first feature of the two features is the feature of the first POI data, and the second feature of the two features is the feature of the second POI data; a processing unit 402, configured to determine, for a first feature pair, feature similarity of the first feature pair to obtain multiple feature similarities, and determine, according to the multiple feature similarities, data similarity between first POI data and second POI data; the first feature pair is any one of a plurality of feature pairs; the determining unit 401 is further configured to determine that the first POI data and the second POI data are repeated when the data similarity is greater than or equal to a preset threshold.

Optionally, the data processing apparatus further includes an obtaining unit 403. The obtaining unit 403 is configured to acquire the first POI data and acquire the second POI data; the processing unit 402 is further configured to perform word segmentation on the first POI data to obtain a plurality of first features of the first POI data, and perform word segmentation on the second POI data to obtain a plurality of second features of the second POI data.

Optionally, the obtaining unit 403 is specifically configured to: acquire the second POI data according to the position information of the first POI data, wherein the distance between the position information of the second POI data and the position information of the first POI data is less than or equal to a preset distance.

Optionally, the processing unit 402 is specifically configured to: determine a dimension type of the first feature pair, the dimension types comprising a text type and a numerical type; when the dimension type is the text type, calculate the feature similarity of the first feature pair according to a first feature similarity algorithm; and when the dimension type is the numerical type, calculate the feature similarity of the first feature pair according to a second feature similarity algorithm.

Optionally, the processing unit 402 is specifically configured to: weight the plurality of feature similarities to obtain the data similarity of the first POI data and the second POI data.

Optionally, the processing unit 402 is further configured to: store the first POI data when the data similarity is smaller than the preset threshold.

Optionally, after it is determined that the first POI data and the second POI data are duplicates, the processing unit 402 is further configured to: delete the second POI data and store the first POI data.

Fig. 12 is a schematic structural diagram of a server provided by the present disclosure. As shown in fig. 12, the server 50 may include at least one processor 501 and a memory 502 for storing processor-executable instructions, wherein the processor 501 is configured to execute the instructions in the memory 502 to implement the data processing method in the above embodiment.

Additionally, the server 50 may also include a communication bus 503 and at least one communication interface 504.

The processor 501 may be a Central Processing Unit (CPU), a micro-processing unit, an ASIC, or one or more integrated circuits for controlling the execution of programs according to the present disclosure.

The communication bus 503 may include a path that conveys information between the aforementioned components.

The communication interface 504 may be any device, such as a transceiver, for communicating with other devices or communication networks, such as an ethernet, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), etc.

The memory 502 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that may store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disk read-only memory (CD-ROM) or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be self-contained and coupled to the processor via a bus. The memory may also be integral to the processor.

The memory 502 is used for storing instructions for executing the disclosed solution, and is controlled by the processor 501. The processor 501 is configured to execute instructions stored in the memory 502, thereby implementing functions in the data processing method of the present disclosure.

As an example, in connection with fig. 12, the determining unit 401 and the processing unit 402 in the data processing apparatus 40 implement the same functions as the processor 501 in fig. 12.

In particular implementations, processor 501 may include one or more CPUs, such as CPU0 and CPU1 in fig. 12, as one embodiment.

In particular implementations, as one embodiment, server 50 may include multiple processors, such as processor 501 and processor 507 in fig. 12. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores that process data (e.g., computer program instructions).

In particular implementations, as one embodiment, server 50 may also include an output device 505 and an input device 506. The output device 505 is in communication with the processor 501 and may display information in a variety of ways. For example, the output device 505 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, or the like. The input device 506 is in communication with the processor 501 and may accept input from a user in a variety of ways. For example, the input device 506 may be a mouse, a keyboard, a touch screen device, a sensing device, or the like.

Those skilled in the art will appreciate that the configuration shown in fig. 12 is not intended to be limiting with respect to server 50 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.

In addition, the present disclosure also provides a computer-readable storage medium storing instructions that, when executed by a processor of a server, enable the server to perform the data processing method provided in the above embodiments.

In addition, the present disclosure also provides a computer program product comprising computer instructions, which when run on a server, cause the server to execute the data processing method as provided in the above embodiments.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims

1. A method of data processing, the method comprising:

determining a plurality of feature pairs; each feature pair comprises two features with the same dimension, wherein a first feature of the two features is a feature of first POI data, and a second feature of the two features is a feature of second POI data;

for a first feature pair, determining feature similarity of the first feature pair to obtain a plurality of feature similarities, and determining data similarity of the first POI data and the second POI data according to the plurality of feature similarities; the first feature pair is any one of the plurality of feature pairs;

and determining that the first POI data and the second POI data are repeated when the data similarity is larger than or equal to a preset threshold value.

2. The data processing method of claim 1, wherein the method further comprises:

acquiring the first POI data and acquiring the second POI data;

performing word segmentation on the first POI data to obtain a plurality of first features of the first POI data, and performing word segmentation on the second POI data to obtain a plurality of second features of the second POI data.

3. The data processing method of claim 2, wherein the obtaining second POI data comprises:

acquiring the second POI data according to the position information of the first POI data; and the distance between the position information of the second POI data and the position information of the first POI data is smaller than or equal to a preset distance.

4. The data processing method of claim 1, wherein for a first feature pair, determining feature similarities of the first feature pair to obtain a plurality of feature similarities comprises:

determining a dimension type of the first feature pair; the dimension types comprise a text type and a numerical type;

under the condition that the dimension type is the text type, calculating the feature similarity of the first feature pair according to a first feature similarity algorithm;

and under the condition that the dimension type is the numerical value type, calculating the feature similarity of the first feature pair according to a second feature similarity algorithm.

5. The data processing method of claim 4, wherein the first feature similarity algorithm is an edit distance algorithm and the second feature similarity algorithm is a cosine similarity algorithm.

6. The data processing method according to any one of claims 1 to 5, wherein the determining the data similarity between the first POI data and the second POI data according to the plurality of feature similarities comprises:

and weighting the feature similarities to obtain the data similarity of the first POI data and the second POI data.

7. The data processing method according to any one of claims 1 to 5, wherein the method further comprises:

and storing the first POI data under the condition that the data similarity is smaller than a preset threshold value.

8. The data processing method according to any one of claims 1 to 5, wherein after determining that the first POI data and the second POI data are repeated, the method further comprises:

and deleting the second POI data and storing the first POI data.

9. A data processing apparatus characterized by comprising a determination unit and a processing unit;

the determining unit is used for determining a plurality of feature pairs; each feature pair comprises two features with the same dimension, wherein a first feature of the two features is a feature of first POI data, and a second feature of the two features is a feature of second POI data;

the processing unit is configured to determine, for a first feature pair, feature similarity of the first feature pair to obtain a plurality of feature similarities, and determine, according to the plurality of feature similarities, data similarity between the first POI data and the second POI data; the first feature pair is any one of the plurality of feature pairs;

the determining unit is further configured to determine that the first POI data and the second POI data are repeated when the data similarity is greater than or equal to a preset threshold.

10. A server, comprising: a processor, a memory for storing instructions executable by the processor; wherein the processor is configured to execute instructions to implement the data processing method of any one of claims 1-8.

11. A computer-readable storage medium having instructions stored thereon, wherein the instructions in the computer-readable storage medium, when executed by a processor of a server, enable the server to perform the data processing method of any one of claims 1-8.

12. A computer program product, characterized in that the computer program product comprises computer instructions which, when executed by a processor, implement the data processing method according to any one of claims 1-8.