CN113420595B - Data processing method, device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN113420595B
Authority
CN
China
Prior art keywords
interest
point
interest point
points
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110556281.9A
Other languages
Chinese (zh)
Other versions
CN113420595A (en
Inventor
孙凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110556281.9A
Publication of CN113420595A
Application granted
Publication of CN113420595B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/23 Clustering techniques
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a data processing method, a device, an electronic device and a storage medium. Interest point information is first obtained from a plurality of data sources. The similarity of two interest points in a preset dimension is then calculated according to the interest point information. The similarity of the two interest points in one or more preset dimensions is input into a pre-trained classification model to obtain the similarity probability of the two interest points, wherein the classification model is trained on the similarity of two sample interest points in one or more preset dimensions together with a label indicating whether the two sample interest points represent the same entity, and the similarity probability represents the probability that the two interest points represent the same entity. A plurality of interest points are clustered according to the similarity probabilities to obtain an interest point set, and a target interest point representing the entity is selected from the interest point set. In this way, a highly accurate target interest point corresponding to each entity can be obtained, and the robustness and accuracy of fusing multiple data sources are improved.

Description

Data processing method, device, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, and in particular, to a data processing method, a data processing device, electronic equipment and a storage medium.
Background
A point of interest (POI), referred to herein as an interest point and also called an information point, is a landmark or scenic spot on an electronic map. It is used to mark the government departments, commercial institutions (gas stations, department stores, supermarkets, restaurants, hotels, convenience stores, hospitals, etc.), tourist attractions (parks, public toilets, etc.), historic sites, transportation facilities (various stations, parking lots) and other places that it represents.
In the related art, when a user searches for interest points or analyzes information based on interest points, a single data source does not provide sufficient coverage, so the target interest point required by the user may be missing from the database; when a plurality of data sources are used, the same interest point may be described repeatedly.
Disclosure of Invention
The disclosure provides a data processing method, a device, an electronic device and a storage medium, so as to solve at least the following problems in the related art: when only a single data source is used, the coverage of interest points is insufficient, so the target interest point required by a user may be missing from the database; and when a plurality of data sources are used, the same interest point is described repeatedly. The technical scheme of the present disclosure is as follows:
According to a first aspect of the present disclosure, there is provided a data processing method, the method comprising:
acquiring interest point information from a plurality of data sources;
calculating, according to the interest point information, the similarity of two interest points in preset dimensions, wherein the preset dimensions comprise at least one of the following: a geographic dimension, a name dimension, an address dimension, and a feature dimension;
inputting the similarity of the two interest points in one or more preset dimensions into a pre-trained classification model to obtain the similarity probability of the two interest points, wherein the classification model is trained on the similarity of two sample interest points in one or more preset dimensions together with a label indicating whether the two sample interest points represent the same entity, and the similarity probability represents the probability that the two interest points represent the same entity;
clustering a plurality of interest points according to the similarity probability of the two interest points to obtain an interest point set;
and selecting a target interest point representing the entity from the interest point set.
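As an illustration of the classification step only, the following minimal sketch trains a logistic-regression classifier on per-dimension similarities and uses it to output a similarity probability. The disclosure does not name a model type at this point, so logistic regression, the function names `train_classifier` and `similarity_probability`, the feature order, and all training values are assumptions made for the example.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_classifier(samples, labels, epochs=2000, lr=0.5):
    """Fit logistic-regression weights by batch gradient descent.

    samples: per-dimension similarity vectors of sample interest point pairs
    labels:  1 if the pair represents the same entity, otherwise 0
    """
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def similarity_probability(model, features):
    """Probability that two interest points represent the same entity."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, features)))


# Hypothetical labelled pairs: [geographic, name, address, feature] similarities.
train_x = [
    [0.95, 0.90, 0.85, 1.0],
    [0.90, 0.80, 0.90, 1.0],
    [0.85, 0.95, 0.70, 1.0],
    [0.20, 0.10, 0.30, 0.0],
    [0.10, 0.30, 0.20, 0.0],
    [0.30, 0.20, 0.10, 0.0],
]
train_y = [1, 1, 1, 0, 0, 0]

model = train_classifier(train_x, train_y)
p_same = similarity_probability(model, [0.92, 0.88, 0.80, 1.0])
p_diff = similarity_probability(model, [0.15, 0.20, 0.25, 0.0])
```

With this toy data, `p_same` lies above 0.5 and `p_diff` below it; in practice any classifier that outputs a probability could take the place of this sketch.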
In an optional implementation manner, the point of interest information includes name information of points of interest, the preset dimension includes a name dimension, the two points of interest include a first point of interest and a second point of interest, and the step of calculating a similarity of the two points of interest in the preset dimension according to the point of interest information includes:
identifying an entity represented by the name information of the first interest point to obtain first entity information of the first interest point;
identifying an entity represented by the name information of the second interest point to obtain second entity information of the second interest point;
and if the first entity information is the same as the second entity information, calculating the similarity between the name information of the first interest point and the name information of the second interest point to obtain the similarity of the first interest point and the second interest point in the name dimension.
In an alternative implementation, the step of calculating the similarity between the name information of the first point of interest and the name information of the second point of interest includes:
acquiring a first vector corresponding to the name information of the first interest point;
Acquiring a second vector corresponding to the name information of the second interest point;
And calculating the distance between the first vector and the second vector to obtain the similarity between the name information of the first interest point and the name information of the second interest point.
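The name-dimension steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the entity recognizer is a toy substring lookup, the vector is a bag of character bigrams standing in for whatever embedding is actually used, cosine similarity (one minus the cosine distance) is chosen as the vector measure, and returning 0.0 when the entities differ is an assumption because the text gives no value for that case. All names and sample strings are hypothetical.

```python
import math
from collections import Counter


def extract_entity(name, known_entities):
    """Toy entity recognizer: return the first known entity found in the name."""
    for entity in known_entities:
        if entity in name:
            return entity
    return None


def char_bigram_vector(text):
    """Sparse bag-of-character-bigrams vector, a stand-in for a real embedding."""
    text = text.lower().replace(" ", "")
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def name_similarity(name_a, name_b, known_entities):
    """Compare names only when both resolve to the same entity."""
    entity_a = extract_entity(name_a, known_entities)
    entity_b = extract_entity(name_b, known_entities)
    if entity_a is None or entity_a != entity_b:
        return 0.0  # assumption: the text does not specify this case
    return cosine_similarity(char_bigram_vector(name_a),
                             char_bigram_vector(name_b))


known = ["Acme Cafe", "Orbit Books"]  # hypothetical entity dictionary
same_entity = name_similarity("Acme Cafe Riverside",
                              "Acme Cafe Riverside Branch", known)
other_entity = name_similarity("Acme Cafe Riverside",
                               "Orbit Books Riverside", known)
```

Gating on the entity first avoids scoring two differently branded places as similar merely because they share a street or district name.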
In an optional implementation manner, the point of interest information includes geographic location information of points of interest, the preset dimension includes a geographic dimension, the two points of interest include a first point of interest and a second point of interest, and the step of calculating a similarity of the two points of interest in the preset dimension according to the point of interest information includes:
calculating the geographic distance between the first interest point and the second interest point according to the geographic position information of the first interest point and the geographic position information of the second interest point;
determining a first geographic position relation between the first interest point and a third interest point according to whether the area range indicated by the geographic position information of the first interest point is positioned in the area range indicated by the geographic position information of the third interest point, wherein the first geographic position relation comprises that the first interest point is positioned inside or outside the third interest point;
Determining a second geographic position relation between the second interest point and the third interest point according to whether the area range indicated by the geographic position information of the second interest point is positioned in the area range indicated by the geographic position information of the third interest point, wherein the second geographic position relation comprises that the second interest point is positioned inside or outside the third interest point;
And obtaining the similarity of the first interest point and the second interest point in the geographic dimension according to the geographic distance, the first geographic position relation and the second geographic position relation.
In an optional implementation manner, the step of obtaining the similarity of the first interest point and the second interest point in the geographic dimension according to the geographic distance, the first geographic position relationship and the second geographic position relationship includes:
if the first interest point and the second interest point are both located inside the third interest point and the geographic distance is greater than a first preset threshold, or the first interest point and the second interest point are both located outside the third interest point and the geographic distance is greater than the first preset threshold and less than a second preset threshold, calculating the similarity of the first interest point and the second interest point in the geographic dimension according to a formula (not reproduced in this text), wherein Sim(a, b) represents the similarity of the first interest point and the second interest point in the geographic dimension, a represents the geographic location indicated by the geographic location information of the first interest point, b represents the geographic location indicated by the geographic location information of the second interest point, and dist(a, b) represents the geographic distance between the first interest point and the second interest point;
and if the first interest point and the second interest point are both positioned in the third interest point and the geographic distance is smaller than or equal to the first preset threshold, or the first interest point and the second interest point are both positioned outside the third interest point and the geographic distance is smaller than or equal to the first preset threshold, determining that the similarity of the first interest point and the second interest point in the geographic dimension is a first value.
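The branching described above can be sketched as follows. The haversine great-circle distance and the bounding-box containment test are standard techniques chosen for the example. The formula for Sim(a, b) is not available in this text, so the decay `t1 / dist` below is a placeholder and not the formula of the disclosure. The threshold values, the value 0.0 for cases the text does not cover, and all names are likewise assumptions.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


@dataclass
class BBox:
    """Area range of an interest point as a latitude/longitude box in degrees."""
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float

    def contains(self, other):
        return (self.min_lat <= other.min_lat and self.min_lon <= other.min_lon
                and self.max_lat >= other.max_lat and self.max_lon >= other.max_lon)

    def center(self):
        return ((self.min_lat + self.max_lat) / 2,
                (self.min_lon + self.max_lon) / 2)


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def geo_similarity(first, second, third, t1=50.0, t2=500.0, first_value=1.0):
    """t1 and t2 stand for the first and second preset thresholds (metres)."""
    dist = haversine_m(first.center(), second.center())
    first_inside = third.contains(first)
    second_inside = third.contains(second)
    if first_inside != second_inside:
        return 0.0  # assumption: one inside, one outside is not covered
    if dist <= t1:
        return first_value  # close together, both inside or both outside
    if first_inside or dist < t2:
        # Placeholder decay only; the actual formula is missing from the text.
        return t1 / dist
    return 0.0  # assumption: both outside and at least t2 apart


# Hypothetical area ranges.
area = BBox(39.90, 116.30, 40.00, 116.50)
shop_a = BBox(39.9500, 116.4000, 39.9501, 116.4001)
shop_b = BBox(39.9600, 116.4500, 39.9601, 116.4501)
far_a = BBox(41.0, 117.0, 41.0001, 117.0001)
far_b = BBox(42.0, 118.0, 42.0001, 118.0001)
```

The placeholder was picked only because it equals `first_value` at `dist == t1` and decreases with distance; any monotonically decreasing function could be substituted.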
In an optional implementation manner, the interest point information includes address information of an interest point, the preset dimension includes an address dimension, the two interest points include a first interest point and a second interest point, and the step of calculating a similarity of the two interest points in the preset dimension according to the interest point information includes:
If the address information of the first interest point and the address information of the second interest point belong to the same geocode block, a third vector corresponding to the address information of the first interest point and a fourth vector corresponding to the address information of the second interest point are obtained;
and calculating the distance between the third vector and the fourth vector to obtain the similarity of the first interest point and the second interest point in the address dimension.
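A minimal sketch of the address-dimension steps follows. The text does not define how a geocode block is computed, so a fixed-size latitude/longitude grid cell is assumed here. The token-count vector, the Euclidean distance, the mapping `1 / (1 + distance)` from distance to similarity, and the value 0.0 for addresses in different blocks are also assumptions, as are all names and sample values.

```python
import math
from collections import Counter


def geocode_block(lat, lon, cell_deg=0.01):
    """Hypothetical geocode block: index of the grid cell holding the point."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def token_vector(address):
    """Bag-of-words vector of an address, a stand-in for a real embedding."""
    return Counter(address.lower().replace(",", " ").split())


def address_similarity(addr_a, loc_a, addr_b, loc_b):
    """Compare address vectors only for points in the same geocode block."""
    if geocode_block(*loc_a) != geocode_block(*loc_b):
        return 0.0  # assumption: the text does not specify this case
    vec_a, vec_b = token_vector(addr_a), token_vector(addr_b)
    keys = vec_a.keys() | vec_b.keys()
    distance = math.sqrt(sum((vec_a[k] - vec_b[k]) ** 2 for k in keys))
    return 1.0 / (1.0 + distance)  # distance 0 maps to similarity 1


# Hypothetical addresses and coordinates.
same_block = address_similarity(
    "12 Example Road, North District", (39.9512, 116.4021),
    "12 Example Road, North District, Floor 2", (39.9587, 116.4077))
other_block = address_similarity(
    "12 Example Road, North District", (39.9512, 116.4021),
    "12 Example Road, North District", (39.9712, 116.4021))
```

Restricting the comparison to one block keeps identical street addresses in different areas from being treated as the same place.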
In an optional implementation manner, the interest point information includes feature information of interest points, the preset dimension includes a feature dimension, the two interest points include a first interest point and a second interest point, and the step of calculating a similarity of the two interest points in the preset dimension according to the interest point information includes:
if the feature information of the first interest point is the same as the feature information of the second interest point, determining that the similarity of the first interest point and the second interest point in the feature dimension is a second value;
And if the characteristic information of the first interest point is different from the characteristic information of the second interest point, determining that the similarity of the first interest point and the second interest point in the characteristic dimension is a third value.
In an optional implementation manner, the step of clustering the plurality of interest points according to the similarity probability of the two interest points to obtain the interest point set includes:
if the similarity probability of the two interest points is larger than a third preset threshold value, establishing an association relationship between the two interest points;
And aggregating the interest points with the association relation in the interest points into the interest point set.
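The two clustering steps above amount to finding connected components of a graph whose edges are pairs with a similarity probability above the third preset threshold. A minimal sketch using a union-find structure is shown below; the function name, the threshold value and the sample probabilities are hypothetical.

```python
def cluster_points(point_ids, pair_probabilities, threshold=0.8):
    """Group interest points linked by a similarity probability above threshold.

    pair_probabilities maps (id_a, id_b) to the similarity probability.
    Returns a sorted list of clusters, each a sorted list of ids.
    """
    parent = {p: p for p in point_ids}

    def find(p):
        # Follow parent links to the root, shortening the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), prob in pair_probabilities.items():
        if prob > threshold:  # association relationship between a and b
            parent[find(a)] = find(b)

    clusters = {}
    for p in point_ids:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical similarity probabilities between interest points.
probabilities = {
    ("a", "b"): 0.95,
    ("b", "c"): 0.90,
    ("c", "d"): 0.30,
    ("d", "e"): 0.85,
}
poi_sets = cluster_points(["a", "b", "c", "d", "e"], probabilities)
# poi_sets == [["a", "b", "c"], ["d", "e"]]
```

Note that association is transitive in this sketch: "a" and "c" land in the same set through "b" even if they were never compared directly.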
In an alternative implementation, the step of selecting a target point of interest representing the entity from the set of points of interest includes:
calculating, for a first interest point, the sum of its similarity probabilities with each second interest point, wherein the first interest point is any interest point in the interest point set, and the second interest points are the interest points in the interest point set other than the first interest point;
and if the summation result is greater than or equal to a fourth preset threshold value, determining the first interest point as the target interest point.
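The selection rule above can be sketched as follows: for each interest point in a set, its similarity probabilities with all other members are summed, and the point is kept if the sum reaches the fourth preset threshold. The function name, the treatment of missing pairs as probability 0.0, and the sample values are assumptions.

```python
def select_targets(poi_set, pair_probabilities, fourth_threshold):
    """Return the members whose summed similarity probability meets the threshold."""

    def prob(a, b):
        # Probabilities are stored once per unordered pair; missing pairs count as 0.
        return pair_probabilities.get((a, b), pair_probabilities.get((b, a), 0.0))

    targets = []
    for first in poi_set:
        total = sum(prob(first, second) for second in poi_set if second != first)
        if total >= fourth_threshold:
            targets.append(first)
    return targets


# Hypothetical interest point set and pairwise probabilities.
members = ["a", "b", "c"]
probabilities = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.4}
targets = select_targets(members, probabilities, fourth_threshold=1.5)
# targets == ["a"]: the sums are about 1.7 for a, 1.3 for b and 1.2 for c
```

A point that is strongly similar to every other member of its set is the most central description of the entity, which is why it is chosen as the representative.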
In an optional implementation manner, the interest point information includes name information of interest points, a plurality of target interest points form a tree structure, the tree structure includes a first target interest point and a second target interest point, and after the step of selecting the target interest point representing the entity from the interest point set, the method further includes:
Identifying an entity represented by name information of the first target interest point to obtain entity information of the first target interest point;
identifying an entity represented by the name information of the second target interest point to obtain entity information of the second target interest point;
If the entity information of the first target interest point is the same as the entity information of the second target interest point, calculating the similarity between the name information of the first target interest point and the name information of the second target interest point;
If the similarity between the name information of the first target interest point and the name information of the second target interest point is greater than or equal to a fifth preset threshold, determining that the first target interest point and the second target interest point are similar target interest points;
A plurality of similar target interest points form a target interest point set;
selecting, from the target interest point set, a similar target interest point that meets a preset condition as a first similar target interest point, and determining the first similar target interest point as the upper node, in the tree structure, of the other similar target interest points in the target interest point set, wherein the preset condition is that, among the similar target interest points, the entity information accounts for the largest proportion of the name information.
In an optional implementation manner, after the step of selecting a similar target interest point satisfying a preset condition from the target interest point set as a first similar target interest point, and determining the first similar target interest point as a superior node of other similar target interest points in the target interest point set in the tree structure, the method further includes:
Determining similar target interest points meeting the preset conditions in the other similar target interest points as second similar target interest points;
Determining the first similar target interest point as a superior node of the second similar target interest point;
And determining the second similar target interest point as a superior node of a third similar target interest point, wherein the third similar target interest point is any similar target interest point except the first similar target interest point and the second similar target interest point in the target interest point set.
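The name-based mounting described above can be sketched as follows. Measuring the proportion of the entity information in the name by character length is an assumption, since the text does not say how the proportion is computed; the function names and sample names are hypothetical.

```python
def entity_proportion(name, entity):
    """Share of the name taken up by the entity, measured in characters."""
    return len(entity) / len(name) if name else 0.0


def mount_by_name(similar_targets, entity):
    """Build parent links among similar target interest points.

    similar_targets: names of target interest points sharing `entity`.
    Returns {name: parent_name}; the top node has parent None.
    """
    ranked = sorted(similar_targets,
                    key=lambda n: entity_proportion(n, entity),
                    reverse=True)
    if not ranked:
        return {}
    parents = {ranked[0]: None}           # first similar target: top node
    if len(ranked) > 1:
        parents[ranked[1]] = ranked[0]    # second similar target under the first
    for name in ranked[2:]:
        parents[name] = ranked[1]         # remaining targets under the second
    return parents


# Hypothetical similar target interest points for the entity "Acme Mall".
tree = mount_by_name(
    ["Acme Mall East Gate Parking", "Acme Mall", "Acme Mall East Gate"],
    entity="Acme Mall",
)
# tree == {"Acme Mall": None,
#          "Acme Mall East Gate": "Acme Mall",
#          "Acme Mall East Gate Parking": "Acme Mall East Gate"}
```

A name that consists almost entirely of the entity is the most general description, so it sits highest in the tree.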
In an optional implementation manner, the interest point information includes geographic location information of interest points, a plurality of target interest points form a tree structure, the tree structure includes a first target interest point and a second target interest point, and after the step of selecting the target interest point representing the entity from the interest point set, the method further includes:
and if the area range indicated by the geographic position information of the first target interest point is within the area range indicated by the geographic position information of the second target interest point, taking the second target interest point as an upper node of the first target interest point in the tree structure.
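The geographic mounting rule above can be sketched with bounding boxes standing in for area ranges. Choosing the smallest enclosing area as the upper node when several areas contain a point is an assumption added for the example, as are the function names and coordinates.

```python
def contains(outer, inner):
    """Boxes are (min_lat, min_lon, max_lat, max_lon) tuples."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def box_area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def mount_by_geography(targets):
    """targets: {name: box}. Returns {name: upper node name or None}."""
    parents = {}
    for name, box in targets.items():
        enclosing = [other for other, other_box in targets.items()
                     if other != name and other_box != box
                     and contains(other_box, box)]
        # Assumption: the tightest enclosing area becomes the upper node.
        parents[name] = min(enclosing,
                            key=lambda o: box_area(targets[o]),
                            default=None)
    return parents


# Hypothetical target interest points with nested area ranges.
geo_tree = mount_by_geography({
    "campus": (0.0, 0.0, 10.0, 10.0),
    "building": (2.0, 2.0, 4.0, 4.0),
    "room": (2.5, 2.5, 3.0, 3.0),
    "elsewhere": (20.0, 20.0, 21.0, 21.0),
})
# geo_tree == {"campus": None, "building": "campus",
#              "room": "building", "elsewhere": None}
```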
According to a second aspect of the present disclosure there is provided a data processing apparatus, the apparatus comprising:
the information acquisition module is configured to acquire interest point information of a plurality of data sources;
the similarity calculation module is configured to calculate the similarity of two interest points in preset dimensions according to the interest point information, wherein the preset dimensions comprise at least one of the following: a geographic dimension, a name dimension, an address dimension, and a feature dimension;
the probability calculation module is configured to input the similarity of the two interest points in one or more preset dimensions into a pre-trained classification model to obtain the similarity probability of the two interest points, wherein the classification model is trained on the similarity of two sample interest points in one or more preset dimensions together with a label indicating whether the two sample interest points represent the same entity, and the similarity probability represents the probability that the two interest points represent the same entity;
the clustering module is configured to cluster the plurality of interest points according to the similarity probability of the two interest points to obtain an interest point set;
and the target selection module is configured to select a target interest point representing the entity from the interest point set.
In an optional implementation manner, the interest point information includes name information of an interest point, the preset dimension includes a name dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
identifying an entity represented by the name information of the first interest point to obtain first entity information of the first interest point;
Identifying an entity represented by the name information of the second interest point to obtain second entity information of the second interest point;
And if the first entity information is the same as the second entity information, calculating the similarity between the name information of the first interest point and the name information of the second interest point, and obtaining the similarity of the first interest point and the second interest point in the name dimension.
In an alternative implementation, the similarity calculation module is specifically configured to:
acquiring a first vector corresponding to the name information of the first interest point;
Acquiring a second vector corresponding to the name information of the second interest point;
And calculating the distance between the first vector and the second vector to obtain the similarity between the name information of the first interest point and the name information of the second interest point.
In an optional implementation manner, the interest point information includes geographic location information of an interest point, the preset dimension includes a geographic dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
calculating the geographic distance between the first interest point and the second interest point according to the geographic position information of the first interest point and the geographic position information of the second interest point;
determining a first geographic position relation between the first interest point and a third interest point according to whether the area range indicated by the geographic position information of the first interest point is positioned in the area range indicated by the geographic position information of the third interest point, wherein the first geographic position relation comprises that the first interest point is positioned inside or outside the third interest point;
Determining a second geographic position relation between the second interest point and the third interest point according to whether the area range indicated by the geographic position information of the second interest point is positioned in the area range indicated by the geographic position information of the third interest point, wherein the second geographic position relation comprises that the second interest point is positioned inside or outside the third interest point;
And obtaining the similarity of the first interest point and the second interest point in the geographic dimension according to the geographic distance, the first geographic position relation and the second geographic position relation.
In an alternative implementation, the similarity calculation module is specifically configured to:
if the first interest point and the second interest point are both located inside the third interest point and the geographic distance is greater than a first preset threshold, or the first interest point and the second interest point are both located outside the third interest point and the geographic distance is greater than the first preset threshold and less than a second preset threshold, calculating the similarity of the first interest point and the second interest point in the geographic dimension according to a formula (not reproduced in this text), wherein Sim(a, b) represents the similarity of the first interest point and the second interest point in the geographic dimension, a represents the geographic location indicated by the geographic location information of the first interest point, b represents the geographic location indicated by the geographic location information of the second interest point, and dist(a, b) represents the geographic distance between the first interest point and the second interest point;
and if the first interest point and the second interest point are both positioned in the third interest point and the geographic distance is smaller than or equal to the first preset threshold, or the first interest point and the second interest point are both positioned outside the third interest point and the geographic distance is smaller than or equal to the first preset threshold, determining that the similarity of the first interest point and the second interest point in the geographic dimension is a first value.
In an optional implementation manner, the interest point information includes address information of interest points, the preset dimension includes an address dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
If the address information of the first interest point and the address information of the second interest point belong to the same geocode block, a third vector corresponding to the address information of the first interest point and a fourth vector corresponding to the address information of the second interest point are obtained;
and calculating the distance between the third vector and the fourth vector to obtain the similarity of the first interest point and the second interest point in the address dimension.
In an optional implementation manner, the interest point information includes feature information of interest points, the preset dimension includes a feature dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
if the feature information of the first interest point is the same as the feature information of the second interest point, determining that the similarity of the first interest point and the second interest point in the feature dimension is a second value;
And if the characteristic information of the first interest point is different from the characteristic information of the second interest point, determining that the similarity of the first interest point and the second interest point in the characteristic dimension is a third value.
In an alternative implementation, the clustering module is specifically configured to:
if the similarity probability of the two interest points is larger than a third preset threshold value, establishing an association relationship between the two interest points;
And aggregating the interest points with the association relation in the interest points into the interest point set.
In an alternative implementation, the target selection module is specifically configured to:
calculating, for a first interest point, the sum of its similarity probabilities with each second interest point, wherein the first interest point is any interest point in the interest point set, and the second interest points are the interest points in the interest point set other than the first interest point;
and if the summation result is greater than or equal to a fourth preset threshold value, determining the first interest point as the target interest point.
In an optional implementation manner, the interest point information includes name information of interest points, a plurality of target interest points form a tree structure, the tree structure includes a first target interest point and a second target interest point, and the apparatus further includes a first mounting module configured to:
Identifying an entity represented by name information of the first target interest point to obtain entity information of the first target interest point;
identifying an entity represented by the name information of the second target interest point to obtain entity information of the second target interest point;
If the entity information of the first target interest point is the same as the entity information of the second target interest point, calculating the similarity between the name information of the first target interest point and the name information of the second target interest point;
If the similarity between the name information of the first target interest point and the name information of the second target interest point is greater than or equal to a fifth preset threshold, determining that the first target interest point and the second target interest point are similar target interest points;
A plurality of similar target interest points form a target interest point set;
selecting, from the target interest point set, a similar target interest point that meets a preset condition as a first similar target interest point, and determining the first similar target interest point as the upper node, in the tree structure, of the other similar target interest points in the target interest point set, wherein the preset condition is that, among the similar target interest points, the entity information accounts for the largest proportion of the name information.
In an alternative implementation, the first mounting module is further configured to:
Determining similar target interest points meeting the preset conditions in the other similar target interest points as second similar target interest points;
Determining the first similar target interest point as a superior node of the second similar target interest point;
And determining the second similar target interest point as a superior node of a third similar target interest point, wherein the third similar target interest point is any similar target interest point except the first similar target interest point and the second similar target interest point in the target interest point set.
In an optional implementation manner, the interest point information includes geographic location information of interest points, a plurality of target interest points form a tree structure, the tree structure includes a first target interest point and a second target interest point, and the apparatus further includes a second mounting module configured to:
and if the area range indicated by the geographic position information of the first target interest point is within the area range indicated by the geographic position information of the second target interest point, taking the second target interest point as an upper node of the first target interest point in the tree structure.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
A processor;
A memory for storing the processor-executable instructions;
Wherein the processor is configured to execute the instructions to implement the data processing method according to the first aspect.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the data processing method according to the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the data processing method according to the first aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
The technical scheme of the disclosure provides a data processing method, a device, electronic equipment and a storage medium, wherein interest point information of a plurality of data sources is firstly obtained; and then calculating the similarity of the two interest points in preset dimensions according to the interest point information, wherein the preset dimensions comprise at least one of the following: a geographic dimension, a name dimension, an address dimension, and a feature dimension; inputting the similarity of the two interest points in one or more preset dimensions into a classification model obtained by training in advance to obtain the similarity probability of the two interest points, wherein the classification model is obtained by training based on the similarity of the two sample interest points in one or more preset dimensions and the label of whether the two sample interest points represent the same entity, and the similarity probability is used for representing the probability that the two interest points represent the same entity; clustering a plurality of interest points according to the similarity probability of the two interest points to obtain an interest point set; and selecting a target interest point representing the entity from the interest point set. According to the technical scheme, firstly, the similarity of two interest points in preset dimensions is calculated, then the similarity of at least one preset dimension is input into a classification model, the similarity probability of the two interest points is obtained, the interest points are clustered based on the similarity probability, and the target interest points representing the entity are determined according to the clustering result. 
Because the similarity probability integrates the similarity of two interest points in at least one preset dimension, clustering with the similarity probability can improve the accuracy of the clustering result. It avoids the inaccurate clustering that occurs when interest points representing the same entity are clustered on similarity alone and the two interest points have low similarity in a high confidence dimension but good similarity in other dimensions.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a flow chart illustrating a method of data processing according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating preprocessing of data for multiple data sources according to an exemplary embodiment.
FIG. 3 is a flowchart illustrating obtaining similarity in name dimensions according to an example embodiment.
FIG. 4 is a flowchart illustrating obtaining similarity in geographic dimensions, according to an example embodiment.
FIG. 5 is a flowchart illustrating obtaining similarity in address dimensions according to an exemplary embodiment.
FIG. 6 is a flowchart illustrating obtaining similarity in feature dimensions according to an example embodiment.
Fig. 7 is a flowchart illustrating obtaining a tree structure according to an exemplary embodiment.
Fig. 8 is a flow diagram illustrating a tree structure application in an entity retrieval process according to an exemplary embodiment.
Fig. 9 is a block diagram of a data processing apparatus according to an exemplary embodiment.
Fig. 10 is a block diagram of an electronic device, according to an example embodiment.
Fig. 11 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The inventors found that, in scenarios where a user searches for interest points or analyzes information according to interest points, the coverage of interest points is insufficient when a single data source is used, so that the target interest points required by the user may be missing from the database; when a plurality of data sources are used, the same interest point may be described repeatedly, or the association between an interest point and an entity may be inaccurate.
In order to solve the above problem, the present disclosure provides a data processing method. Fig. 1 is a flowchart of the data processing method according to an exemplary embodiment, and the execution subject of this embodiment may be an electronic device such as a server.
As shown in fig. 1, the method may include the following steps.
In step S11, point of interest information of a plurality of data sources is acquired.
In a specific implementation, referring to Fig. 2, the interest point information of a plurality of data sources, such as map data and collected data, may first be acquired, and then a series of standardization procedures, such as a consistency check, a missing value check, an abnormal value check, and a repeated value check, may be performed on the interest point information of the plurality of data sources. The interest point information acquired through the plurality of channels is then unified in aspects such as coordinate system, name, classification, and boundary. The unified data may satisfy the following conditions: longitudes and latitudes use the same coordinate system; names do not contain punctuation marks or special characters; classification rules are consistent; and the boundaries of planar regions are closed, with the forward direction of a region boundary defined as counterclockwise. By unifying the interest point information of the plurality of data sources, the subsequent data processing can be more efficient and accurate.
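As an illustrative sketch of two of the unification conditions above (names free of punctuation and special characters; region boundaries closed and counterclockwise), the following may be used. The function names are hypothetical, and the orientation test assumes planar coordinates with x to the right and y upward, which the disclosure does not specify.

```python
import re
from typing import List, Tuple

Point = Tuple[float, float]


def normalize_name(name: str) -> str:
    """Remove punctuation and special characters; keep letters, digits, CJK text and single spaces."""
    cleaned = re.sub(r"[^\w\s]", "", name).replace("_", "")
    return re.sub(r"\s+", " ", cleaned).strip()


def signed_area(ring: List[Point]) -> float:
    """Shoelace formula over an open ring; positive means counterclockwise."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def close_counterclockwise(ring: List[Point]) -> List[Point]:
    """Return the ring closed (first vertex repeated last) and oriented counterclockwise."""
    open_ring = list(ring[:-1]) if len(ring) > 1 and ring[0] == ring[-1] else list(ring)
    if signed_area(open_ring) < 0:
        open_ring.reverse()
    return open_ring + [open_ring[0]]


clean_name = normalize_name("Beijing XXX Roast Duck (YYY Store)!")
boundary = close_counterclockwise([(0, 0), (0, 1), (1, 1), (1, 0)])  # clockwise input
```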
The interest point information may include at least one of name information, geographic location information, address information, feature information, and the like of the interest point. An interest point may cover a certain region, also called an area of interest (AOI), and a plurality of interest points may be contained in that region.
In step S12, according to the interest point information, a similarity of the two interest points in a preset dimension is calculated, where the preset dimension includes at least one of the following: a geographic dimension, a name dimension, an address dimension, and a feature dimension.
In a specific implementation, the similarity of two interest points in the geographic dimension can be calculated according to the geographic position information of the interest points; the similarity of the two interest points in the name dimension can be calculated according to the name information of the interest points; the similarity of the two interest points in the address dimension can be calculated according to the address information of the interest points; the similarity of the two interest points in the feature dimension can be calculated according to the feature information of the interest points, and the like. The following embodiments will describe in detail the similarity of two points of interest in the geographic dimension, the name dimension, the address dimension and the feature dimension, respectively.
The feature information includes, but is not limited to, category information, telephone information, picture information, selling price information, and the like.
In step S13, the similarity of the two interest points in one or more preset dimensions is input into a classification model trained in advance, so as to obtain the similarity probability of the two interest points, wherein the classification model is obtained by training based on the similarity of two sample interest points in one or more preset dimensions and a label indicating whether the two sample interest points represent the same entity, and the similarity probability is used for representing the probability that the two interest points represent the same entity.
In a specific implementation, the similarity of the two interest points in at least one preset dimension calculated in step S12 may be input into the classification model, and the classification model outputs the similarity probability of the two interest points. The classification model may be obtained by training a model such as XGBoost based on the similarity of two sample interest points in one or more preset dimensions and a label indicating whether the two sample interest points represent the same entity. XGBoost is a machine learning algorithm implemented under the Gradient Boosting framework, with regularization terms added to prevent overfitting; a classification model obtained by training an XGBoost model can therefore yield more accurate results.
Specifically, the classification model may be obtained as follows: the similarity of the two sample interest points in one or more preset dimensions is input into a model to be trained, such as an XGBoost model; a loss function value is calculated according to the output result and the label indicating whether the two sample interest points represent the same entity; and, with minimizing the loss function value as the target, the parameters in the model to be trained are optimized through a back propagation algorithm, so that the classification model is finally obtained.
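As an illustrative sketch of the training flow above (per-dimension similarities in, same-entity probability out), the following uses a small logistic regression fitted by gradient descent on the log loss. This is a dependency-free stand-in chosen for the sketch and is not the XGBoost model named above; the sample values, feature order (geographic, name, address, feature) and function names are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(samples: Sequence[Sequence[float]], labels: Sequence[int],
          epochs: int = 2000, lr: float = 0.5) -> Model:
    """Fit weights and bias by minimizing the mean log loss with gradient descent."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


def similarity_probability(model: Model, features: Sequence[float]) -> float:
    """Probability that the two interest points behind `features` represent the same entity."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, features)) + bias)


# Hypothetical labelled pairs: 1 = same entity, 0 = different entities.
SAMPLES = [
    [1.0, 0.90, 0.8, 1.0], [0.9, 0.95, 0.7, 1.0], [0.8, 0.85, 0.9, 0.0],
    [0.2, 0.10, 0.3, 0.0], [0.1, 0.30, 0.2, 0.0], [0.3, 0.20, 0.1, 1.0],
]
LABELS = [1, 1, 1, 0, 0, 0]
model = train(SAMPLES, LABELS)
```

The output is a continuous value between 0 and 1 rather than a hard class, which is what allows the later clustering step to be tuned by a radius or threshold.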
By calculating the similarity probability of two interest points with the classification model, the original classification problem is converted into a regression problem, so that parameters such as the clustering radius can be conveniently adjusted during clustering to obtain a clustering result with a high recall rate or with high accuracy.
In this embodiment, the similarity probability of two interest points calculated by the classification model integrates the similarity of the two interest points in at least one preset dimension. Clustering with the similarity probability can improve the accuracy of the clustering result, and avoids the inaccurate clustering that occurs when interest points representing the same entity are clustered on similarity alone and the two interest points have low similarity in a high confidence dimension but good similarity in other dimensions. By clustering according to the similarity probability, this embodiment can recall interest points accurately and obtain a more accurate clustering result, thereby improving the accuracy of the target interest points. The high confidence dimension may be a dimension whose data has higher accuracy and stability, such as the geographic dimension.
In step S14, a plurality of points of interest are clustered according to the similarity probability of the two points of interest, so as to obtain a set of points of interest.
In a specific implementation, a plurality of interest points may be clustered by using the DBSCAN clustering algorithm according to the similarity probability of two interest points, so as to obtain an interest point set. The aggregation radius in the DBSCAN clustering algorithm may be set according to actual requirements for recall rate and accuracy, and its specific value is not limited in this embodiment. When the aggregation radius is set larger, an aggregation result with a higher recall rate can be obtained; when the aggregation radius is set smaller, an aggregation result with higher accuracy can be obtained. The aggregation radius is the reference distance used in aggregation, and the reference distance is inversely proportional to the reference similarity probability, that is, the larger the reference similarity probability, the smaller the reference distance.
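As an illustrative sketch, DBSCAN over pairwise similarity probabilities may be written as follows. The conversion `distance = 1 - probability` and the value `min_samples=2` are assumptions made for the sketch; the disclosure only states that the distance decreases as the probability increases.

```python
from typing import List, Optional


def to_distance(prob: List[List[float]]) -> List[List[float]]:
    """Assumed conversion: distance = 1 - similarity probability."""
    return [[1.0 - p for p in row] for row in prob]


def dbscan(distance: List[List[float]], eps: float, min_samples: int = 2) -> List[int]:
    """Label each index with a cluster id, or -1 for noise, from a full distance matrix."""
    n = len(distance)
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if distance[i][j] <= eps]
        if len(neighbors) < min_samples:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(n) if distance[j][k] <= eps]
            if len(j_neighbors) >= min_samples:  # j is a core point, so keep expanding
                queue.extend(k for k in j_neighbors if labels[k] is None or labels[k] == -1)
    return [label if label is not None else -1 for label in labels]


# Hypothetical similarity probabilities for four interest points.
PROB = [
    [1.0, 0.9, 0.6, 0.1],
    [0.9, 1.0, 0.8, 0.1],
    [0.6, 0.8, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
DIST = to_distance(PROB)
wide = dbscan(DIST, eps=0.25)    # larger radius, more points recalled
narrow = dbscan(DIST, eps=0.15)  # smaller radius, stricter grouping
```

With the larger radius the third point is pulled into the set through its link to the second point; with the smaller radius it is left out.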
In an optional implementation manner, if the similarity probability of the two interest points is greater than a third preset threshold, establishing an association relationship between the two interest points; and then aggregating the interest points with the association relations in the interest points into an interest point set. The third preset threshold may be set according to actual requirements for recall and accuracy, and the specific value of the third preset threshold is not limited in this embodiment. When the third preset threshold value is set to be larger, an aggregation result with higher accuracy can be obtained; when the third preset threshold is set smaller, an aggregation result with higher recall rate can be obtained.
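As an illustrative sketch of this threshold-based alternative, the association relationships may be treated as edges of a graph and each connected group taken as an interest point set. Treating the association as transitive is an assumption of the sketch, and the function name and sample values are hypothetical.

```python
from typing import Dict, List, Tuple


def cluster_by_threshold(n: int, pair_probability: Dict[Tuple[int, int], float],
                         threshold: float) -> List[List[int]]:
    """Group indices 0..n-1 whose pairwise probability exceeds the threshold (union-find)."""
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (a, b), p in pair_probability.items():
        if p > threshold:  # association relationship established
            parent[find(a)] = find(b)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


PAIRS = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.6, (2, 3): 0.1}
loose = cluster_by_threshold(4, PAIRS, threshold=0.7)    # higher recall
strict = cluster_by_threshold(4, PAIRS, threshold=0.85)  # higher accuracy
```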
In step S15, a target point of interest representing an entity is selected from the set of points of interest.
In an alternative implementation manner, the sum of similarity probabilities of a first interest point and second interest points may be calculated first, where the first interest point is any interest point in the interest point set, and the second interest point is any interest point in the interest point set except the first interest point; and if the summation result is greater than or equal to a fourth preset threshold value, determining the first interest point as a target interest point. For example, the fourth preset threshold may be the maximum value of the summation result, i.e., the first point of interest with the maximum summation result is determined as the target point of interest. The fourth preset threshold may be set according to actual requirements, and the specific numerical value of the fourth preset threshold is not limited in this embodiment.
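As an illustrative sketch of the example in which the fourth preset threshold equals the maximum summation result, the target interest point may be picked as follows. The probability matrix is hypothetical, and ties are resolved in favor of the lowest index, which the disclosure does not specify.

```python
from typing import List


def probability_sums(prob: List[List[float]]) -> List[float]:
    """For each interest point, sum its similarity probabilities to all other points in the set."""
    return [sum(p for j, p in enumerate(row) if j != i) for i, row in enumerate(prob)]


def select_target(prob: List[List[float]]) -> int:
    """Index of the interest point with the largest summation result."""
    sums = probability_sums(prob)
    return max(range(len(sums)), key=sums.__getitem__)


SET_PROB = [
    [1.0, 0.9, 0.6],
    [0.9, 1.0, 0.8],
    [0.6, 0.8, 1.0],
]
target_index = select_target(SET_PROB)
```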
After determining the target point of interest, name information, geographical location information, and address information of the target point of interest may be determined as name information, geographical location information, and address information of the entity. The feature information of the entity, such as category, telephone, etc., may be determined by voting on the feature information of the points of interest in the set of points of interest, for example, the category information with the highest vote number may be determined as the category information of the entity.
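As an illustrative sketch of the voting step, the most frequent non-empty value of a feature across the interest point set may be taken as the entity's value. The function name and sample values are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def vote(values: Iterable[Optional[str]]) -> Optional[str]:
    """Return the most common non-empty value, or None when no value is available."""
    counts = Counter(v for v in values if v)
    return counts.most_common(1)[0][0] if counts else None


entity_category = vote(["restaurant", "restaurant", "hotel", None])
```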
By selecting the target interest points representing the entity from the interest point set, the unique description on the same entity in the result of fusing a plurality of data sources can be realized, and the problem that the target interest points are inaccurate or repeated is avoided on the premise that the coverage rate of the interest points is met.
According to the data processing method provided by the embodiment of the disclosure, the similarity of two interest points in preset dimensions is first calculated, the similarity in at least one preset dimension is then input into a classification model to obtain the similarity probability of the two interest points, the interest points are clustered based on the similarity probability, and the target interest point representing the entity is determined according to the clustering result. Because the similarity probability integrates the similarity of two interest points in at least one preset dimension, clustering with the similarity probability can improve the accuracy of the clustering result, and avoids the inaccurate clustering that occurs when interest points representing the same entity are clustered on similarity alone and the two interest points have low similarity in a high confidence dimension but good similarity in other dimensions. While the coverage of interest points is satisfied, the problems of low accuracy of target interest points and of repeated or insufficiently representative target interest points are solved, which ensures that users can use the interest point information efficiently and accurately.
In order to obtain the similarity of the two points of interest in the name dimension, in an alternative implementation, the point of interest information includes name information of the points of interest, the preset dimension includes the name dimension, and the two points of interest include the first point of interest and the second point of interest. Referring to Fig. 3, step S12 may specifically include:
In step S31, an entity characterized by the name information of the first interest point is identified, and first entity information of the first interest point is obtained.
In a specific implementation, named entity recognition (Named Entity Recognition, NER) may be performed on the name information of the first point of interest to obtain the first entity information.
NER can identify named entities in the text to be processed. A named entity may be an entity such as a place name or an organization name in the name information of the interest point. For example, the named entity in "headquarter parking lot" is "headquarter". The entities in "Beijing XXX roast duck store YYY store" include "Beijing", "XXX", "roast duck store" and "YYY store", which correspond to a place name, an institution name, a type noun and a place noun respectively, and the institution name "XXX" may be used as the named entity.
In a specific implementation, a pre-trained long short-term memory (LSTM) neural network model may be used to predict the probability that each word in the name information is an entity word, and the word with the highest probability may be selected as the entity information of the interest point.
In step S32, an entity characterized by the name information of the second interest point is identified, and second entity information of the second interest point is obtained.
In a specific implementation, NER identification may be performed on name information of the second interest point, to obtain second entity information.
In step S33, if the first entity information is the same as the second entity information, the similarity between the name information of the first interest point and the name information of the second interest point is calculated, and the similarity between the first interest point and the second interest point in the name dimension is obtained.
In this embodiment, under the condition that the first entity information is the same as the second entity information, the similarity of the two interest points in the name dimension is calculated, so that the calculation amount can be reduced, and a more accurate similarity result can be obtained.
In order to calculate the similarity between the name information of the first interest point and the name information of the second interest point, in an alternative implementation manner, a first vector corresponding to the name information of the first interest point may be obtained first; acquiring a second vector corresponding to the name information of the second interest point; and then calculating the distance between the first vector and the second vector to obtain the similarity between the name information of the first interest point and the name information of the second interest point.
In a specific implementation, a word2vec model may be used to process the name information of the first interest point and the name information of the second interest point respectively, so as to obtain the first vector and the second vector. The distance between the first vector and the second vector may be a cosine distance, and the cosine similarity of the first interest point and the second interest point in the name dimension may be calculated according to the following formula:
$$\mathrm{sim} = \cos(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
where sim represents the similarity of the first interest point and the second interest point in the name dimension, A represents the first vector, B represents the second vector, n represents the dimension of the vector space in which the first vector and the second vector are located, i represents any one dimension in the n-dimensional vector space, A_i represents the component of the first vector in the i-th dimension, B_i represents the component of the second vector in the i-th dimension, and n is a positive integer.
It should be noted that the calculation of the similarity in the name dimension is not limited to the above cosine similarity scheme, and may be replaced by any other scheme capable of calculating the similarity between phrases or between short texts. For example, a word2vec model may first be used to obtain the first vector and the second vector, and then the Word Mover's Distance (WMD) may be calculated to obtain the similarity of the two points of interest in the name dimension.
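As an illustrative sketch of steps S31 to S33, the name-dimension similarity may be computed only when the two entity strings match, using the cosine similarity of the two name vectors. The vectors below are hypothetical stand-ins for word2vec output, and returning None when the entities differ is an assumption, since the disclosure does not state what is produced in that case.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_a * norm_b)


def name_similarity(entity_a: str, entity_b: str,
                    vector_a: Sequence[float], vector_b: Sequence[float]) -> Optional[float]:
    """Compare name vectors only when both names carry the same entity."""
    if entity_a != entity_b:
        return None
    return cosine_similarity(vector_a, vector_b)


same_entity = name_similarity("XXX", "XXX", [1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
other_entity = name_similarity("XXX", "ZZZ", [1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```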
In order to obtain the similarity of the two points of interest in the geographic dimension, in an alternative implementation, the point of interest information includes geographic location information of the points of interest, the preset dimension includes the geographic dimension, and the two points of interest include the first point of interest and the second point of interest. Referring to Fig. 4, step S12 may specifically include:
In step S41, a geographic distance between the first interest point and the second interest point is calculated according to the geographic position information of the first interest point and the geographic position information of the second interest point.
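As an illustrative sketch of step S41, the geographic distance between two interest points given as latitude and longitude may be computed with the haversine formula. The choice of haversine and the mean Earth radius constant are assumptions; the disclosure does not specify how the distance is computed.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed constant)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


one_degree_north = haversine_m(0.0, 0.0, 1.0, 0.0)
```

Both points must already be in the same coordinate system, as required by the unification step described earlier.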
In step S42, a first geographic location relationship between the first interest point and the third interest point is determined according to whether the region indicated by the geographic location information of the first interest point is within the region indicated by the geographic location information of the third interest point, where the first geographic location relationship includes that the first interest point is located inside or outside the third interest point.
When the area range indicated by the geographic position information of the first interest point is within the area range indicated by the geographic position information of the third interest point, determining that the first geographic position relationship between the first interest point and the third interest point is that the first interest point is located inside the third interest point.
And when the area range indicated by the geographic position information of the first interest point is not in the area range indicated by the geographic position information of the third interest point, determining that the first geographic position relationship between the first interest point and the third interest point is that the first interest point is positioned outside the third interest point.
In step S43, a second geographical location relationship between the second interest point and the third interest point is determined according to whether the area indicated by the geographical location information of the second interest point is within the area indicated by the geographical location information of the third interest point, where the second geographical location relationship includes that the second interest point is located inside or outside the third interest point.
When the area range indicated by the geographic position information of the second interest point is within the area range indicated by the geographic position information of the third interest point, it is determined that the second geographic position relationship between the second interest point and the third interest point is that the second interest point is located inside the third interest point.
When the area range indicated by the geographic position information of the second interest point is not within the area range indicated by the geographic position information of the third interest point, it is determined that the second geographic position relationship between the second interest point and the third interest point is that the second interest point is located outside the third interest point.
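As an illustrative sketch of steps S42 and S43, the inside/outside relationship may be decided with a ray-casting test. The sketch simplifies the first and second interest points to single coordinates and the third interest point to a polygon boundary; this simplification and the function names are assumptions, since the disclosure speaks of area ranges in general.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count boundary crossings of a horizontal ray starting at the point."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):  # the edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def location_relationship(point: Point, third_boundary: List[Point]) -> str:
    """Return 'inside' or 'outside' relative to the third interest point."""
    return "inside" if point_in_polygon(point, third_boundary) else "outside"


THIRD = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]  # open ring, no repeated vertex
first_relationship = location_relationship((2.0, 2.0), THIRD)
second_relationship = location_relationship((5.0, 2.0), THIRD)
```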
In step S44, the similarity of the first interest point and the second interest point in the geographic dimension is obtained according to the geographic distance, the first geographic position relationship and the second geographic position relationship.
In a specific implementation, if the first interest point and the second interest point are both located inside the third interest point and the geographic distance is greater than the first preset threshold, or the first interest point and the second interest point are both located outside the third interest point and the geographic distance is greater than the first preset threshold and less than the second preset threshold, the similarity of the first interest point and the second interest point in the geographic dimension is calculated according to the following formula, where Sim(a, b) represents the similarity of the first interest point and the second interest point in the geographic dimension, a represents the geographic position indicated by the geographic position information of the first interest point, b represents the geographic position indicated by the geographic position information of the second interest point, and dist(a, b) represents the geographic distance between the first interest point and the second interest point. By adopting the formula, the similarity of the first interest point and the second interest point in the geographic dimension can be obtained efficiently, and the accuracy of the similarity calculation is ensured.
The first preset threshold may be, for example, 1 meter, and the specific value may be set according to the actual requirement, which is not limited in this embodiment. The second preset threshold may be, for example, 50 meters, and the specific value may be set according to the actual requirement, which is not limited in this embodiment.
If the first interest point and the second interest point are both located inside the third interest point and the geographic distance is smaller than or equal to a first preset threshold value, or the first interest point and the second interest point are both located outside the third interest point and the geographic distance is smaller than or equal to the first preset threshold value, the similarity of the first interest point and the second interest point in the geographic dimension is determined to be a first value.
The first value may be determined according to actual requirements, and may be equal to 1, for example.
In this way, by determining the similarity of two interest points with the geographic distance smaller than or equal to the first preset threshold (for example, 1 meter) in the geographic dimension as a fixed value, namely, a first value, the influence of calculation errors can be reduced, and the accuracy of similarity calculation can be improved.
Specifically, if the first interest point and the second interest point are located in the same interest point, i.e., the third interest point, and the geographic distance is greater than the first preset threshold, the similarity of the first interest point and the second interest point in the geographic dimension may be calculated according to the following formula,
If the first interest point and the second interest point are located outside any interest point, namely the third interest point, and the geographic distance is greater than the first preset threshold value and less than the second preset threshold value, the similarity of the first interest point and the second interest point in the geographic dimension can be calculated according to the following formula,
If the first interest point and the second interest point are located in the same interest point, namely the third interest point, and the geographic distance is smaller than or equal to a first preset threshold value, the similarity of the first interest point and the second interest point in the geographic dimension can be determined to be a first value.
If the first interest point and the second interest point are located outside any interest point, namely the third interest point, and the geographic distance is smaller than or equal to a first preset threshold value, the similarity of the first interest point and the second interest point in the geographic dimension can be determined to be a first value.
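As an illustrative sketch, the four cases above may be combined into one function. The formula mapping dist(a, b) to Sim(a, b) is not reproduced in this text, so the sketch takes it as a caller-supplied `decay` function; the `1 / d` used in the example is a placeholder only and is not the formula of the disclosure. The default thresholds follow the example values of 1 meter and 50 meters, and returning None for cases the text does not cover is an assumption.

```python
from typing import Callable, Optional


def geographic_similarity(dist_m: float, a_inside: bool, b_inside: bool,
                          decay: Callable[[float], float],
                          first_threshold: float = 1.0,
                          second_threshold: float = 50.0,
                          first_value: float = 1.0) -> Optional[float]:
    """Similarity in the geographic dimension, or None where no rule is stated."""
    both_inside = a_inside and b_inside
    both_outside = (not a_inside) and (not b_inside)
    if not (both_inside or both_outside):
        return None                 # one inside, one outside: not covered by the text
    if dist_m <= first_threshold:
        return first_value          # fixed value absorbs small positioning errors
    if both_inside or dist_m < second_threshold:
        return decay(dist_m)        # distance-based formula supplied by the caller
    return None                     # both outside and at or beyond the second threshold


def placeholder_decay(distance_m: float) -> float:
    """Hypothetical stand-in for the undisclosed formula."""
    return 1.0 / distance_m


near = geographic_similarity(0.5, True, True, placeholder_decay)
mid_outside = geographic_similarity(10.0, False, False, placeholder_decay)
far_outside = geographic_similarity(80.0, False, False, placeholder_decay)
far_inside = geographic_similarity(80.0, True, True, placeholder_decay)
```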
In practical application, whether the first interest point is located inside another interest point may be determined according to the geographic location information; if the first interest point is located inside the third interest point, the similarity in the geographic dimension between the first interest point and a second interest point (any interest point inside the third interest point other than the first interest point) is calculated. Searching for interest point pairs within a containing interest point in this way reduces the amount of calculation while ensuring accuracy.
If the first interest point is not located inside any interest point, the similarity in the geographic dimension between the first interest point and a second interest point (which is also not located inside any interest point) within a second preset threshold radius range is calculated. Therefore, the calculated amount can be reduced on the basis of ensuring the accuracy by calculating the similarity between the two interest points with the distance smaller than the second preset threshold.
In order to obtain the similarity of the two points of interest in the address dimension, in an alternative implementation, the point of interest information includes address information of the points of interest, the preset dimension includes the address dimension, and the two points of interest include the first point of interest and the second point of interest. Referring to Fig. 5, step S12 may specifically include:
In step S51, if the address information of the first interest point and the address information of the second interest point belong to the same geocode block, a third vector corresponding to the address information of the first interest point and a fourth vector corresponding to the address information of the second interest point are obtained.
In a specific implementation, a word2vec model may be used to process the address information of the first interest point and the address information of the second interest point respectively, so as to obtain a third vector and a fourth vector.
In step S52, a distance between the third vector and the fourth vector is calculated, so as to obtain a similarity of the first interest point and the second interest point in the address dimension.
The distance between the third vector and the fourth vector may be a cosine distance, which is not limited in this embodiment.
In a specific implementation, the similarity of two interest points in the address dimension within the same geocode block range can be calculated by taking the geocode block in which the interest point is located as the range.
As with the similarity in the name dimension, the similarity calculation in the address dimension is not limited to the above scheme; any other scheme capable of calculating the similarity between phrases or between short texts may be substituted.
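Steps S51 and S52 can be sketched as follows, assuming the address vectors have already been produced (for example by a word2vec model) and that each interest point carries an identifier of its geocode block. The dictionary keys `block` and `addr_vec` and the function names are hypothetical. Cosine similarity is used so that a larger value means more similar addresses.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def address_similarity(poi_a, poi_b):
    """Return the address-dimension similarity, or None if the two
    interest points do not belong to the same geocode block."""
    if poi_a["block"] != poi_b["block"]:
        return None  # pairs in different blocks are not compared
    return cosine_similarity(poi_a["addr_vec"], poi_b["addr_vec"])
```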
In order to obtain the similarity of the two points of interest in the feature dimension, in an alternative implementation, the point of interest information includes feature information of the points of interest, the preset dimension includes the feature dimension, and the two points of interest include the first point of interest and the second point of interest. Referring to fig. 6, step S12 may specifically include:
In step S61, if the feature information of the first interest point and the feature information of the second interest point are the same, the similarity between the first interest point and the second interest point in the feature dimension is determined to be the second value.
The second value may be determined according to actual requirements, for example, may be equal to 1, which is not limited in this embodiment.
In step S62, if the feature information of the first interest point and the feature information of the second interest point are different, it is determined that the similarity between the first interest point and the second interest point in the feature dimension is a third value.
The third value may be smaller than the second value, and the third value may be determined according to actual requirements, for example, the third value may be equal to 0, which is not limited in this embodiment.
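Steps S61 and S62 amount to an equality test. A minimal sketch, using the example values 1 and 0 mentioned above as defaults and assuming the feature information is any comparable value such as a category label (the function name is hypothetical):

```python
def feature_similarity(feature_a, feature_b, second_value=1.0, third_value=0.0):
    """Return second_value when the feature information matches, else third_value."""
    return second_value if feature_a == feature_b else third_value
```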
The inventors have found that, when points of interest are used, the point of interest obtained for a user often does not coincide with the point of interest the user actually cares about. For example, when the user is located at the south gate of the home, the point of interest of the home south gate is often obtained, but the user may prefer the point of interest of the home, which covers a wider geographic range than the home south gate. Because the home includes the home south gate, the home may be referred to as an upper-level point of interest of the home south gate.
In order to solve the above problem, in an alternative implementation manner, the point of interest information includes name information of the point of interest, the plurality of target points of interest form a tree structure, and the tree structure includes a first target point of interest and a second target point of interest, and after step S15, referring to fig. 7, the method may further include:
In step S71, an entity characterized by the name information of the first target point of interest is identified, and the entity information of the first target point of interest is obtained.
For example, named entity recognition (NER) may be performed on the name information of the first target point of interest to obtain the entity information of the first target point of interest.
In step S72, the entity characterized by the name information of the second target point of interest is identified, and the entity information of the second target point of interest is obtained.
For example, named entity recognition (NER) may be performed on the name information of the second target point of interest to obtain the entity information of the second target point of interest.
In step S73, if the entity information of the first target point of interest is the same as the entity information of the second target point of interest, the similarity between the name information of the first target point of interest and the name information of the second target point of interest is calculated.
In a specific implementation, the similarity between the name information of the first target point of interest and the name information of the second target point of interest may be calculated by referring to the method shown in fig. 3, which is not described herein again.
In step S74, if the similarity between the name information of the first target point of interest and the name information of the second target point of interest is greater than or equal to the fifth preset threshold, the first target point of interest and the second target point of interest are determined to be similar target points of interest.
The fifth preset threshold may be, for example, 0.6. The fifth preset threshold may be set according to actual requirements, and the specific numerical value of the fifth preset threshold is not limited in this embodiment.
In step S75, a plurality of similar target points of interest constitute a set of target points of interest.
In step S76, selecting a similar target interest point satisfying a preset condition from the target interest point set, as a first similar target interest point, and determining the first similar target interest point as an upper node of other similar target interest points in the target interest point set in the tree structure, where the preset condition is that the entity information of the similar target interest point has the largest ratio in the name information.
The upper node may be the node one level above or a node N levels above, where N is a positive integer.
In a specific implementation, the ratio of entity information of each similar target interest point in the target interest point set in the name information can be calculated, the similar target interest point with the largest ratio is used as a first similar target interest point, the first similar target interest point is used as an upper node of other similar target interest points in the target interest point set in the tree structure, and other similar target interest points in the target interest point set are used as lower nodes of the first similar target interest point.
In actual use, membership in the name dimension may be provided to the user through the tree structure of the name dimension. For example, AA University Changping Campus is mounted to AA University; that is, in the tree structure, AA University is an upper node of AA University Changping Campus, and AA University Changping Campus is a lower node of AA University.
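Steps S71 to S75 can be sketched as follows. This is only an illustration under stated assumptions: `extract_entity` is a toy stand-in for an NER model that looks names up in a hypothetical list `KNOWN_ENTITIES`, and `difflib.SequenceMatcher` from the Python standard library is swapped in for the vector-based name similarity of fig. 3. The default threshold 0.6 is the example value of the fifth preset threshold.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical entity list standing in for a trained NER model.
KNOWN_ENTITIES = ["AA University", "BB Mall"]


def extract_entity(name):
    """Toy NER: the longest known entity contained in the name, or None."""
    hits = [e for e in KNOWN_ENTITIES if e in name]
    return max(hits, key=len) if hits else None


def name_similarity(a, b):
    """Stand-in string similarity in [0, 1]; not the method of fig. 3."""
    return SequenceMatcher(None, a, b).ratio()


def similar_target_sets(names, threshold=0.6):
    """Group similar target interest points into one set per entity (S71 to S75)."""
    sets = {}
    for a, b in combinations(names, 2):
        ea, eb = extract_entity(a), extract_entity(b)
        # S73: compare names only when the entity information is the same.
        if ea is not None and ea == eb and name_similarity(a, b) >= threshold:
            # S74/S75: both points are similar and join the entity's set.
            sets.setdefault(ea, set()).update((a, b))
    return sets
```

Gating the name comparison on identical entity information keeps unrelated names from being compared at all.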
In this implementation manner, the method may further include:
In step S77, a similar target point of interest satisfying the above-mentioned preset condition among other similar target points of interest is determined as a second similar target point of interest.
Wherein the other similar target points of interest may include similar target points of interest in the set of target points of interest other than the first similar target point of interest.
In step S78, the first similar target point of interest is determined as the previous level node of the second similar target point of interest.
In step S79, the second similar target interest point is determined as the upper node of the third similar target interest point, where the third similar target interest point is any similar target interest point in the target interest point set except the first similar target interest point and the second similar target interest point.
The upper node may be the node one level above or a node N levels above, where N is a positive integer.
In a specific implementation, the ratio of the entity information of each similar target interest point in the other similar target interest points in the name information can be calculated, the similar target interest point with the largest ratio is taken as a second similar target interest point, the second similar target interest point is taken as a next-stage node of the first similar target interest point, and the second similar target interest point is taken as an upper-stage node of a third similar target interest point.
For example, the target interest point set includes the following similar target interest points: AA University, AA University Changping Campus, AA University Changping Campus East Area and AA University Changping Campus West Area, and the entity information of these similar target interest points is AA University.
First, the ratio of the entity information of each similar target interest point in the target interest point set in its name information can be calculated. The similar target interest point with the largest ratio, namely AA University, is taken as the first similar target interest point, and AA University is taken as the upper node of the other similar target interest points, namely AA University Changping Campus, AA University Changping Campus East Area and AA University Changping Campus West Area.
Then, the ratio of the entity information in the name information can be calculated for each of the other similar target interest points, namely AA University Changping Campus, AA University Changping Campus East Area and AA University Changping Campus West Area. The similar target interest point with the largest ratio, namely AA University Changping Campus, is taken as the second similar target interest point. AA University Changping Campus is taken as the next-level node of the first similar target interest point, AA University, and as the upper node of the third similar target interest points, namely AA University Changping Campus East Area and AA University Changping Campus West Area.
Therefore, AA University Changping Campus East Area and AA University Changping Campus West Area are mounted to AA University Changping Campus, and AA University Changping Campus is mounted to AA University. That is, in the tree structure, AA University is the parent node of AA University Changping Campus, and AA University Changping Campus is a child node of AA University; AA University Changping Campus is the parent node of AA University Changping Campus East Area and AA University Changping Campus West Area, which are child nodes of AA University Changping Campus.
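The mounting in steps S76 to S79 can be illustrated with a short sketch. It assumes that the ratio of the entity information in the name information is measured as a character-count ratio, which is one possible reading and not stated in this disclosure; the function names are hypothetical.

```python
def entity_ratio(name, entity):
    """Share of the name's characters taken up by the entity string (assumed measure)."""
    return len(entity) / len(name)


def mount_by_name(names, entity):
    """Return a dict mapping each name to its parent name (None for the root).

    The name with the largest ratio becomes the first similar target interest
    point (root), the next largest becomes the second (child of the root), and
    every remaining name is mounted under the second, as in steps S76 to S79.
    """
    ranked = sorted(names, key=lambda n: entity_ratio(n, entity), reverse=True)
    parent = {}
    if not ranked:
        return parent
    first = ranked[0]
    parent[first] = None
    if len(ranked) > 1:
        second = ranked[1]
        parent[second] = first
        for third in ranked[2:]:
            parent[third] = second
    return parent
```

Applied to the four names of the example with the entity "AA University", the sketch yields the same two-level hierarchy as described in the text.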
In this implementation, a tree structure of interest points may be formed in the name dimension. During content operations, the nodes that need attention can be selected in a targeted manner according to the hierarchical relation in the tree structure, which improves the accuracy and effectiveness of the content operations.
In another optional implementation manner, the interest point information includes geographic location information of the interest points, the plurality of target interest points form a tree structure, and the tree structure includes the first target interest point and the second target interest point, and after step S15, the method may further include:
If the area range indicated by the geographic position information of the first target interest point is within the area range indicated by the geographic position information of the second target interest point, the second target interest point is taken as an upper node of the first target interest point in the tree structure.
The upper node may be the node one level above or a node N levels above, where N is a positive integer.
This forms a tree structure of points of interest in the geographic dimension. During content operations, the nodes that need attention can be selected in a targeted manner according to the hierarchical relation in the tree structure, which improves the accuracy and effectiveness of the content operations. Through the tree structure of the geographic dimension, membership of a point of interest in the geographic dimension may be provided to the user; for example, Weiming Lake is mounted to AA University.
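The geographic mounting can be sketched under two assumptions that are not part of this disclosure: each area range is approximated by an axis-aligned bounding box (min_lat, min_lon, max_lat, max_lon), and the direct parent is chosen as the smallest containing box. The function names and region names are hypothetical.

```python
def contains(outer, inner):
    """True if box `inner` lies within box `outer` and the boxes differ."""
    return (outer != inner
            and outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def box_area(box):
    """Area of a (min_lat, min_lon, max_lat, max_lon) box in squared degrees."""
    return (box[2] - box[0]) * (box[3] - box[1])


def mount_by_geography(regions):
    """Map each region name to its direct parent name (None if uncontained)."""
    parent = {}
    for name, box in regions.items():
        containers = [other for other, other_box in regions.items()
                      if other != name and contains(other_box, box)]
        # Assumption: the smallest containing region is the direct parent;
        # larger containing regions are upper nodes N levels above.
        parent[name] = (min(containers, key=lambda o: box_area(regions[o]))
                        if containers else None)
    return parent
```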
Referring to fig. 8, forming the tree structure helps relevant interest points to be obtained more accurately during entity retrieval. This not only improves the working efficiency of content operations, but also integrates the relevant information of interest points and reduces the proportion of noise data. During user operations, content can also be organized around larger interest points, which improves the accuracy and effectiveness of the operations.
Fig. 9 is a block diagram of a data processing apparatus according to an example embodiment. Referring to fig. 9, the apparatus may include:
an information acquisition module 91 configured to acquire point of interest information of a plurality of data sources;
a similarity calculation module 92 configured to calculate, according to the interest point information, a similarity of two interest points in a preset dimension, where the preset dimension includes at least one of: a geographic dimension, a name dimension, an address dimension, and a feature dimension;
a probability calculation module 93 configured to input the similarity of the two interest points in one or more preset dimensions into a classification model trained in advance to obtain the similarity probability of the two interest points, where the classification model is trained on the similarity of two sample interest points in one or more preset dimensions and on a label indicating whether the two sample interest points represent the same entity, and the similarity probability is used to represent the probability that the two interest points represent the same entity;
a clustering module 94 configured to cluster a plurality of interest points according to the similarity probability of the two interest points, so as to obtain an interest point set;
a target selection module 95 configured to select a target point of interest representing the entity from the set of points of interest.
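How the probability calculation module 93 might turn per-dimension similarities into a similarity probability can be illustrated with a minimal sketch. A logistic model is used here purely as a stand-in for the pre-trained classification model, and the weights and bias are hypothetical placeholders rather than learned values.

```python
import math

# Hypothetical parameters; a real model would learn them from labelled
# pairs of sample interest points.
WEIGHTS = {"geo": 2.0, "name": 3.0, "address": 1.5, "feature": 1.0}
BIAS = -4.0


def similarity_probability(sims):
    """Map per-dimension similarities to the probability that two
    interest points represent the same entity (logistic stand-in)."""
    z = BIAS + sum(w * sims.get(dim, 0.0) for dim, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))
```

Dimensions missing from `sims` contribute nothing, which matches the case where only some of the preset dimensions are available.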
In an optional implementation manner, the interest point information includes name information of an interest point, the preset dimension includes a name dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
identifying an entity represented by the name information of the first interest point to obtain first entity information of the first interest point;
Identifying an entity represented by the name information of the second interest point to obtain second entity information of the second interest point;
And if the first entity information is the same as the second entity information, calculating the similarity between the name information of the first interest point and the name information of the second interest point, and obtaining the similarity of the first interest point and the second interest point in the name dimension.
In an alternative implementation, the similarity calculation module is specifically configured to:
acquiring a first vector corresponding to the name information of the first interest point;
Acquiring a second vector corresponding to the name information of the second interest point;
And calculating the distance between the first vector and the second vector to obtain the similarity between the name information of the first interest point and the name information of the second interest point.
In an optional implementation manner, the interest point information includes geographic location information of an interest point, the preset dimension includes a geographic dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
calculating the geographic distance between the first interest point and the second interest point according to the geographic position information of the first interest point and the geographic position information of the second interest point;
determining a first geographic position relation between the first interest point and a third interest point according to whether the area range indicated by the geographic position information of the first interest point is positioned in the area range indicated by the geographic position information of the third interest point, wherein the first geographic position relation comprises that the first interest point is positioned inside or outside the third interest point;
Determining a second geographic position relation between the second interest point and the third interest point according to whether the area range indicated by the geographic position information of the second interest point is positioned in the area range indicated by the geographic position information of the third interest point, wherein the second geographic position relation comprises that the second interest point is positioned inside or outside the third interest point;
And obtaining the similarity of the first interest point and the second interest point in the geographic dimension according to the geographic distance, the first geographic position relation and the second geographic position relation.
In an alternative implementation, the similarity calculation module is specifically configured to:
If the first and second points of interest are both located inside the third point of interest and the geographic distance is greater than a first preset threshold, or the first and second points of interest are both located outside the third point of interest and the geographic distance is greater than the first preset threshold and less than a second preset threshold, then calculating the similarity of the first and second points of interest in the geographic dimension according to the following formula, wherein Sim(a, b) represents a similarity of the first point of interest and the second point of interest in a geographic dimension, a represents a geographic location indicated by geographic location information of the first point of interest, b represents a geographic location indicated by geographic location information of the second point of interest, and dist(a, b) represents a geographic distance between the first point of interest and the second point of interest;
and if the first interest point and the second interest point are both positioned in the third interest point and the geographic distance is smaller than or equal to the first preset threshold, or the first interest point and the second interest point are both positioned outside the third interest point and the geographic distance is smaller than or equal to the first preset threshold, determining that the similarity of the first interest point and the second interest point in the geographic dimension is a first value.
In an optional implementation manner, the interest point information includes address information of interest points, the preset dimension includes an address dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
If the address information of the first interest point and the address information of the second interest point belong to the same geocode block, a third vector corresponding to the address information of the first interest point and a fourth vector corresponding to the address information of the second interest point are obtained;
and calculating the distance between the third vector and the fourth vector to obtain the similarity of the first interest point and the second interest point in the address dimension.
In an optional implementation manner, the interest point information includes feature information of interest points, the preset dimension includes a feature dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
if the feature information of the first interest point is the same as the feature information of the second interest point, determining that the similarity of the first interest point and the second interest point in the feature dimension is a second value;
And if the characteristic information of the first interest point is different from the characteristic information of the second interest point, determining that the similarity of the first interest point and the second interest point in the characteristic dimension is a third value.
In an alternative implementation, the clustering module is specifically configured to:
if the similarity probability of the two interest points is larger than a third preset threshold value, establishing an association relationship between the two interest points;
And aggregating the interest points with the association relation in the interest points into the interest point set.
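The clustering performed by the clustering module can be sketched with a union-find structure. Treating the interest point set as a connected component of the association graph is an interpretation made for this sketch; the function and parameter names are hypothetical.

```python
def cluster_points(points, pair_probs, third_threshold):
    """Group interest points linked by association relationships.

    points: iterable of interest point ids.
    pair_probs: dict mapping (id_a, id_b) to the similarity probability.
    third_threshold: the third preset threshold.
    """
    points = list(points)
    parent = {p: p for p in points}

    def find(x):
        # Walk to the representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), prob in pair_probs.items():
        if prob > third_threshold:
            # Establish an association relationship between a and b.
            parent[find(a)] = find(b)

    groups = {}
    for p in points:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())
```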
In an alternative implementation, the target selection module is specifically configured to:
calculating the sum of the similarity probabilities between a first interest point and each second interest point, wherein the first interest point is any interest point in the interest point set, and the second interest points are the interest points in the interest point set other than the first interest point;
and if the summation result is greater than or equal to a fourth preset threshold value, determining the first interest point as the target interest point.
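The target selection rule can be sketched as follows, assuming pairwise similarity probabilities are stored in a dictionary keyed by unordered pairs and that a missing pair counts as probability 0 (both are assumptions of this sketch; the names are hypothetical).

```python
def select_targets(point_set, pair_probs, fourth_threshold):
    """Return the interest points whose summed similarity probability with
    all other points in the set reaches the fourth preset threshold."""

    def prob(a, b):
        # Pairs may be stored in either order; unknown pairs count as 0.
        return pair_probs.get((a, b), pair_probs.get((b, a), 0.0))

    targets = []
    for first in sorted(point_set):
        total = sum(prob(first, second) for second in point_set if second != first)
        if total >= fourth_threshold:
            targets.append(first)
    return targets
```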
In an optional implementation manner, the interest point information includes name information of interest points, a plurality of target interest points form a tree structure, the tree structure includes a first target interest point and a second target interest point, and the apparatus further includes a first mounting module configured to:
Identifying an entity represented by name information of the first target interest point to obtain entity information of the first target interest point;
identifying an entity represented by the name information of the second target interest point to obtain entity information of the second target interest point;
If the entity information of the first target interest point is the same as the entity information of the second target interest point, calculating the similarity between the name information of the first target interest point and the name information of the second target interest point;
If the similarity between the name information of the first target interest point and the name information of the second target interest point is greater than or equal to a fifth preset threshold, determining that the first target interest point and the second target interest point are similar target interest points;
A plurality of similar target interest points form a target interest point set;
Selecting similar target interest points meeting a preset condition from the target interest point set as first similar target interest points, determining the first similar target interest points as upper nodes of other similar target interest points in the target interest point set in the tree structure, wherein the preset condition is that the entity information of the similar target interest points has the largest proportion in name information.
In an alternative implementation, the first mounting module is further configured to:
Determining similar target interest points meeting the preset conditions in the other similar target interest points as second similar target interest points;
Determining the first similar target interest point as a superior node of the second similar target interest point;
And determining the second similar target interest point as a superior node of a third similar target interest point, wherein the third similar target interest point is any similar target interest point except the first similar target interest point and the second similar target interest point in the target interest point set.
In an optional implementation manner, the interest point information includes geographic location information of interest points, a plurality of target interest points form a tree structure, the tree structure includes a first target interest point and a second target interest point, and the apparatus further includes a second mounting module configured to:
and if the area range indicated by the geographic position information of the first target interest point is within the area range indicated by the geographic position information of the second target interest point, taking the second target interest point as an upper node of the first target interest point in the tree structure.
The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Fig. 10 is a block diagram of an electronic device 800 shown in the present disclosure. For example, electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 10, an electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the data processing method described in any of the embodiments. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an on/off state of the device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of a user's contact with the electronic device 800, an orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, an operator network (e.g., 2G, 3G, 4G, or 5G), or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the data processing methods described in any embodiment.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as the memory 804 including instructions executable by the processor 820 of the electronic device 800 to perform the data processing method of any of the embodiments. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
In an exemplary embodiment, a computer program product is also provided, comprising readable program code executable by the processor 820 of the electronic device 800 to perform the data processing method of any of the embodiments. Alternatively, the program code may be stored in a storage medium of the electronic device 800, which may be a non-transitory computer readable storage medium, such as a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
Fig. 11 is a block diagram of an electronic device 1900 shown in the present disclosure. For example, electronic device 1900 may be provided as a server.
Referring to FIG. 11, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions, such as application programs, that can be executed by the processing component 1922. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the data processing method of any of the embodiments.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (25)

1. A method of data processing, the method comprising:
Acquiring interest point information of a plurality of data sources;
according to the interest point information, calculating the similarity of the two interest points in a preset dimension;
Inputting the similarity of the two interest points in one or more preset dimensions into a classification model obtained by training in advance to obtain the similarity probability of the two interest points, wherein the classification model is obtained by training based on the similarity of the two sample interest points in one or more preset dimensions and the label of whether the two sample interest points represent the same entity, and the similarity probability is used for representing the probability that the two interest points represent the same entity;
Clustering a plurality of interest points according to the similarity probability of the two interest points to obtain an interest point set;
Selecting a target interest point representing the entity from the interest point set;
The interest point information comprises name information of interest points, a plurality of target interest points form a tree structure, the tree structure comprises a first target interest point and a second target interest point, and after the step of selecting the target interest points, the method further comprises the steps of:
Identifying an entity represented by name information of the first target interest point to obtain entity information of the first target interest point;
identifying an entity represented by the name information of the second target interest point to obtain entity information of the second target interest point;
If the entity information of the first target interest point is the same as the entity information of the second target interest point, calculating the similarity between the name information of the first target interest point and the name information of the second target interest point;
If the similarity between the name information of the first target interest point and the name information of the second target interest point is greater than or equal to a fifth preset threshold, determining that the first target interest point and the second target interest point are similar target interest points;
A plurality of similar target interest points form a target interest point set;
Selecting similar target interest points meeting a preset condition from the target interest point set as first similar target interest points, determining the first similar target interest points as upper nodes of other similar target interest points in the target interest point set in the tree structure, wherein the preset condition is that the entity information of the similar target interest points has the largest proportion in name information.
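As a non-limiting illustration of the probability step of claim 1, the per-dimension similarities of a pair of interest points may be fed to a binary classifier trained on labelled sample pairs. The claims do not name a model type; the sketch below assumes a plain logistic regression trained by stochastic gradient descent, and the function names `train_logistic` and `similarity_probability`, the learning rate, and the epoch count are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, lr=0.1, epochs=500):
    """Fit weights on similarity vectors of sample interest point pairs.

    samples: list of per-dimension similarity vectors (one per sample pair).
    labels:  1 if the sample pair represents the same entity, else 0.
    The model type and hyperparameters are assumptions, not taken from the claims.
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the score
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def similarity_probability(weights, bias, sims):
    """Probability that two interest points represent the same entity."""
    return sigmoid(sum(w * s for w, s in zip(weights, sims)) + bias)


# Toy training data: (name, geographic, address) similarities per sample pair.
samples = [[0.9, 0.95, 0.8], [0.85, 0.9, 1.0], [0.1, 0.2, 0.0], [0.3, 0.1, 0.0]]
labels = [1, 1, 0, 0]
weights, bias = train_logistic(samples, labels)
```

Any classifier that outputs a probability could take the place of the logistic regression here; the only property the later clustering step relies on is a score between 0 and 1 per pair.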
2. The data processing method according to claim 1, wherein the point-of-interest information includes name information of points of interest, the preset dimension includes a name dimension, the two points of interest include a first point of interest and a second point of interest, and the step of calculating a similarity of the two points of interest in the preset dimension based on the point-of-interest information includes:
identifying an entity represented by the name information of the first interest point to obtain first entity information of the first interest point;
Identifying an entity represented by the name information of the second interest point to obtain second entity information of the second interest point;
And if the first entity information is the same as the second entity information, calculating the similarity between the name information of the first interest point and the name information of the second interest point, and obtaining the similarity of the first interest point and the second interest point in the name dimension.
3. The data processing method according to claim 2, wherein the step of calculating a similarity between the name information of the first point of interest and the name information of the second point of interest includes:
acquiring a first vector corresponding to the name information of the first interest point;
Acquiring a second vector corresponding to the name information of the second interest point;
And calculating the distance between the first vector and the second vector to obtain the similarity between the name information of the first interest point and the name information of the second interest point.
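The name-dimension steps of claims 2 and 3 can be sketched as follows, under stated assumptions. The entity recognizer is outside the sketch (entity information is passed in directly), the character-bigram count vector is a hypothetical stand-in for whatever name embedding an implementation uses, and cosine similarity is one possible choice of vector measure. The claims do not state a value for pairs whose entity information differs, so the sketch returns `None` in that case.

```python
import math
from collections import Counter


def char_bigram_vector(text):
    """Hypothetical stand-in for a name embedding: character-bigram counts."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def name_similarity(name_a, entity_a, name_b, entity_b):
    """Name-dimension similarity, computed only when the entities match."""
    if entity_a != entity_b:
        return None  # case not specified in the claims
    return cosine_similarity(char_bigram_vector(name_a), char_bigram_vector(name_b))
```

For instance, `name_similarity("Starbucks Coffee", "starbucks", "Starbucks Cafe", "starbucks")` yields a value strictly between 0 and 1, because the two names share most but not all bigrams.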
4. The data processing method according to claim 1, wherein the point of interest information includes geographic position information of points of interest, the preset dimension includes a geographic dimension, the two points of interest include a first point of interest and a second point of interest, and the step of calculating a similarity of the two points of interest in the preset dimension according to the point of interest information includes:
calculating the geographic distance between the first interest point and the second interest point according to the geographic position information of the first interest point and the geographic position information of the second interest point;
determining a first geographic position relation between the first interest point and a third interest point according to whether the area range indicated by the geographic position information of the first interest point is positioned in the area range indicated by the geographic position information of the third interest point, wherein the first geographic position relation comprises that the first interest point is positioned inside or outside the third interest point;
Determining a second geographic position relation between the second interest point and the third interest point according to whether the area range indicated by the geographic position information of the second interest point is positioned in the area range indicated by the geographic position information of the third interest point, wherein the second geographic position relation comprises that the second interest point is positioned inside or outside the third interest point;
And obtaining the similarity of the first interest point and the second interest point in the geographic dimension according to the geographic distance, the first geographic position relation and the second geographic position relation.
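The two inputs used by claim 4, namely the geographic distance and the inside/outside relation to a third interest point, may be computed as in the following sketch. It assumes that positions are latitude/longitude pairs in degrees, that distance is the great-circle (haversine) distance, and that an area range is an axis-aligned latitude/longitude box; the `Region` class and both function names are hypothetical.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres


@dataclass
class Region:
    """Hypothetical area range: an axis-aligned latitude/longitude box."""
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float

    def contains(self, other):
        """True if the other region lies entirely within this region."""
        return (self.min_lat <= other.min_lat and self.min_lon <= other.min_lon
                and self.max_lat >= other.max_lat and self.max_lon >= other.max_lon)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def position_relation(poi_region, third_region):
    """Geographic position relation of an interest point to the third interest point."""
    return "inside" if third_region.contains(poi_region) else "outside"
```

A polygon-based containment test could replace the bounding box where area ranges are irregular; the box is used here only to keep the sketch short.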
5. The method of claim 4, wherein the step of obtaining the similarity of the first point of interest and the second point of interest in a geographic dimension based on the geographic distance, the first geographic location relationship, and the second geographic location relationship comprises:
If the first and second points of interest are both located inside the third point of interest and the geographic distance is greater than a first preset threshold, or the first and second points of interest are both located outside the third point of interest and the geographic distance is greater than the first preset threshold and less than a second preset threshold, calculating the similarity of the first and second points of interest in the geographic dimension according to the following formula: [formula not reproduced], wherein [symbol not reproduced] represents the similarity of the first interest point and the second interest point in the geographic dimension, a represents the geographic position indicated by the geographic position information of the first interest point, b represents the geographic position indicated by the geographic position information of the second interest point, and [symbol not reproduced] represents the geographic distance between the first interest point and the second interest point;
and if the first interest point and the second interest point are both positioned in the third interest point and the geographic distance is smaller than or equal to the first preset threshold, or the first interest point and the second interest point are both positioned outside the third interest point and the geographic distance is smaller than or equal to the first preset threshold, determining that the similarity of the first interest point and the second interest point in the geographic dimension is a first value.
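The case analysis of claim 5 can be illustrated as below. The formula referred to in the claim is not available in this text, so the `decay` argument is a hypothetical placeholder (by default the first threshold divided by the distance) and must not be read as the claimed formula. Cases that the claim does not cover, such as one point inside and one outside, or both outside at or beyond the second threshold, return `None`.

```python
def geographic_similarity(distance_m, rel_a, rel_b, first_threshold,
                          second_threshold, first_value=1.0, decay=None):
    """Piecewise geographic-dimension similarity following the cases of claim 5.

    rel_a / rel_b are "inside" or "outside" relative to the third interest point.
    decay is a hypothetical placeholder for the formula not reproduced in the text.
    """
    if decay is None:
        # Assumed placeholder only: similarity falls off as distance grows.
        decay = lambda d: first_threshold / d
    both_inside = rel_a == "inside" and rel_b == "inside"
    both_outside = rel_a == "outside" and rel_b == "outside"
    if not (both_inside or both_outside):
        return None  # mixed relation: not specified in the claim
    if distance_m <= first_threshold:
        return first_value  # close enough: fixed first value
    if both_inside or distance_m < second_threshold:
        return decay(distance_m)  # formula branch
    return None  # both outside and at or beyond the second threshold
```

With thresholds of 100 m and 1000 m, two points 200 m apart inside the same third interest point would score 0.5 under the placeholder decay, while two points 2000 m apart outside it fall outside the claimed cases.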
6. The data processing method according to claim 1, wherein the point-of-interest information includes address information of points of interest, the preset dimension includes an address dimension, the two points of interest include a first point of interest and a second point of interest, and the step of calculating a similarity of the two points of interest in the preset dimension based on the point-of-interest information includes:
If the address information of the first interest point and the address information of the second interest point belong to the same geocode block, a third vector corresponding to the address information of the first interest point and a fourth vector corresponding to the address information of the second interest point are obtained;
and calculating the distance between the third vector and the fourth vector to obtain the similarity of the first interest point and the second interest point in the address dimension.
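A minimal sketch of the address-dimension steps of claim 6 follows. It assumes that each address has already been assigned a geocode block identifier (for example a geohash prefix) and an address vector by components not shown here, that the distance is Euclidean, and that the distance d is converted to a similarity as 1 / (1 + d). All three choices are assumptions, as are the function names.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def address_similarity(block_a, vec_a, block_b, vec_b):
    """Address-dimension similarity, computed only within one geocode block."""
    if block_a != block_b:
        return None  # case not specified in the claim
    # Assumed mapping from distance to similarity: identical vectors give 1.0.
    return 1.0 / (1.0 + euclidean_distance(vec_a, vec_b))
```

Gating on the geocode block first means that vectors only need to be obtained and compared for addresses that are already known to be geographically close.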
7. The data processing method according to claim 1, wherein the point-of-interest information includes feature information of points of interest, the preset dimension includes a feature dimension, the two points of interest include a first point of interest and a second point of interest, and the step of calculating a similarity of the two points of interest in the preset dimension according to the point-of-interest information includes:
if the feature information of the first interest point is the same as the feature information of the second interest point, determining that the similarity of the first interest point and the second interest point in the feature dimension is a second value;
And if the characteristic information of the first interest point is different from the characteristic information of the second interest point, determining that the similarity of the first interest point and the second interest point in the characteristic dimension is a third value.
8. The method of claim 1, wherein the step of clustering the plurality of points of interest according to the similarity probability of the two points of interest to obtain the set of points of interest comprises:
if the similarity probability of the two interest points is larger than a third preset threshold value, establishing an association relationship between the two interest points;
And aggregating the interest points with the association relation in the interest points into the interest point set.
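One way to realise the clustering of claim 8 is to treat each association relationship as an edge and collect connected components with a union-find structure. Reading the aggregation as transitive (A associated with B and B with C places all three in one set) is an interpretation, and the function name `cluster_points` and the dictionary layout of the pair probabilities are hypothetical.

```python
def cluster_points(point_ids, pair_probabilities, threshold):
    """Group interest points linked by similarity probabilities above a threshold.

    pair_probabilities maps (id_a, id_b) tuples to similarity probabilities.
    Returns a list of sets, one per interest point set.
    """
    parent = {p: p for p in point_ids}

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), prob in pair_probabilities.items():
        if prob > threshold:  # association relationship established
            parent[find(a)] = find(b)

    clusters = {}
    for p in point_ids:
        clusters.setdefault(find(p), set()).add(p)
    return list(clusters.values())
```

For example, with probabilities 0.9 for (a, b), 0.8 for (b, c), and 0.2 for (c, d) and a threshold of 0.5, the points a, b, and c form one set and d forms a set by itself.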
9. The data processing method of claim 1, wherein the step of selecting a target point of interest from the set of points of interest that represents the entity comprises:
calculating the sum of similarity probabilities of a first interest point and second interest points, wherein the first interest point is any interest point in the interest point set, and the second interest points are any interest point except the first interest point in the interest point set;
and if the summation result is greater than or equal to a fourth preset threshold value, determining the first interest point as the target interest point.
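The selection rule of claim 9 can be sketched as follows, assuming that pair probabilities are stored in a dictionary keyed by unordered pairs and that a missing pair contributes 0. The function name `select_targets` and these storage conventions are hypothetical.

```python
def select_targets(cluster, pair_probabilities, threshold):
    """Return interest points whose summed similarity probability meets the threshold."""

    def prob(a, b):
        # Pairs are unordered; a missing pair is assumed to contribute 0.
        return pair_probabilities.get((a, b), pair_probabilities.get((b, a), 0.0))

    targets = []
    for p in cluster:
        total = sum(prob(p, q) for q in cluster if q != p)
        if total >= threshold:
            targets.append(p)
    return targets
```

Intuitively, the interest point with the largest summed probability is the one most strongly connected to the rest of its set, which makes it a natural representative of the entity.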
10. The data processing method according to claim 1, further comprising, after the step of selecting, from the target interest point set, similar target interest points satisfying a preset condition as first similar target interest points, and determining the first similar target interest points as upper nodes of other similar target interest points in the target interest point set in the tree structure:
Determining similar target interest points meeting the preset conditions in the other similar target interest points as second similar target interest points;
Determining the first similar target interest point as an upper node of the second similar target interest point;
And determining the second similar target interest point as an upper node of a third similar target interest point, wherein the third similar target interest point is any similar target interest point except the first similar target interest point and the second similar target interest point in the target interest point set.
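The mounting order described in claims 1 and 10 can be illustrated with the sketch below. It assumes that "the proportion of entity information in name information" is measured as the ratio of character lengths, which is only one possible reading, and it represents the tree as a child-to-parent dictionary. The function names are hypothetical.

```python
def entity_proportion(name, entity):
    """Assumed measure: share of the name's characters taken up by the entity."""
    return len(entity) / len(name) if name else 0.0


def mount_similar_targets(similar_targets):
    """Build child -> upper node links for a set of similar target interest points.

    similar_targets maps each name to its recognised entity information.
    The highest-proportion point becomes the first similar target interest point,
    the next highest the second, and the second is the upper node of all others.
    """
    ranked = sorted(similar_targets,
                    key=lambda n: entity_proportion(n, similar_targets[n]),
                    reverse=True)
    parent = {}
    if len(ranked) < 2:
        return parent
    first, second, rest = ranked[0], ranked[1], ranked[2:]
    parent[second] = first
    for third in rest:
        parent[third] = second
    return parent
```

With the names "KFC", "KFC Mall Branch", and "KFC Mall Branch Delivery Window" all sharing the entity "KFC", the shortest name has the largest entity proportion and becomes the root, and the longest name is mounted under "KFC Mall Branch".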
11. The data processing method according to any one of claims 1 to 9, wherein the point of interest information includes geographical location information of points of interest, a plurality of the target points of interest form a tree structure, the tree structure includes a first target point of interest and a second target point of interest, and after the step of selecting the target point of interest representing the entity from the set of points of interest, further includes:
and if the area range indicated by the geographic position information of the first target interest point is within the area range indicated by the geographic position information of the second target interest point, taking the second target interest point as an upper node of the first target interest point in the tree structure.
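The containment-based mounting of claim 11 may be sketched as follows. Area ranges are assumed to be latitude/longitude boxes encoded as `(min_lat, min_lon, max_lat, max_lon)` tuples, and when several target interest points contain the same point the sketch picks the smallest enclosing one as the upper node. That tie-breaking rule is an assumption not stated in the claim, and the function names are hypothetical.

```python
def contains(outer, inner):
    """True if box `inner` lies within box `outer` (hypothetical tuple encoding)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def area(box):
    """Box area in squared degrees, used only to rank enclosing regions."""
    return (box[2] - box[0]) * (box[3] - box[1])


def mount_by_containment(regions):
    """Build child -> upper node links from area-range containment.

    regions maps each target interest point name to its box.
    """
    parent = {}
    for child, child_box in regions.items():
        candidates = [name for name, box in regions.items()
                      if name != child and box != child_box and contains(box, child_box)]
        if candidates:
            # Assumed rule: the tightest enclosing region is the direct upper node.
            parent[child] = min(candidates, key=lambda name: area(regions[name]))
    return parent
```

Identical boxes are excluded from the candidates so that two interest points with the same area range cannot become each other's upper node.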
12. A data processing apparatus, the apparatus comprising:
the information acquisition module is configured to acquire interest point information of a plurality of data sources;
The similarity calculation module is configured to calculate the similarity of the two interest points in a preset dimension according to the interest point information;
the probability calculation module is configured to input the similarity of the two interest points in one or more preset dimensions into a classification model trained in advance to obtain the similarity probability of the two interest points, wherein the classification model is obtained based on the similarity of the two sample interest points in one or more preset dimensions and the label training of whether the two sample interest points represent the same entity, and the similarity probability is used for representing the probability that the two interest points represent the same entity;
the clustering module is configured to cluster the plurality of interest points according to the similarity probability of the two interest points to obtain an interest point set;
A target selection module configured to select a target point of interest representing the entity from the set of points of interest;
the interest point information comprises name information of interest points, a plurality of target interest points form a tree structure, the tree structure comprises a first target interest point and a second target interest point, and the device further comprises a first mounting module which is configured to:
Identifying an entity represented by name information of the first target interest point to obtain entity information of the first target interest point;
identifying an entity represented by the name information of the second target interest point to obtain entity information of the second target interest point;
If the entity information of the first target interest point is the same as the entity information of the second target interest point, calculating the similarity between the name information of the first target interest point and the name information of the second target interest point;
If the similarity between the name information of the first target interest point and the name information of the second target interest point is greater than or equal to a fifth preset threshold, determining that the first target interest point and the second target interest point are similar target interest points;
A plurality of similar target interest points form a target interest point set;
Selecting similar target interest points meeting a preset condition from the target interest point set as first similar target interest points, determining the first similar target interest points as upper nodes of other similar target interest points in the target interest point set in the tree structure, wherein the preset condition is that the entity information of the similar target interest points has the largest proportion in name information.
13. The data processing apparatus according to claim 12, wherein the point of interest information includes name information of points of interest, the preset dimension includes a name dimension, the two points of interest include a first point of interest and a second point of interest, and the similarity calculation module is specifically configured to:
identifying an entity represented by the name information of the first interest point to obtain first entity information of the first interest point;
Identifying an entity represented by the name information of the second interest point to obtain second entity information of the second interest point;
And if the first entity information is the same as the second entity information, calculating the similarity between the name information of the first interest point and the name information of the second interest point, and obtaining the similarity of the first interest point and the second interest point in the name dimension.
14. The data processing apparatus according to claim 13, wherein the similarity calculation module is specifically configured to:
acquiring a first vector corresponding to the name information of the first interest point;
Acquiring a second vector corresponding to the name information of the second interest point;
And calculating the distance between the first vector and the second vector to obtain the similarity between the name information of the first interest point and the name information of the second interest point.
15. The data processing apparatus according to claim 12, wherein the point of interest information includes geographic location information of points of interest, the preset dimension includes a geographic dimension, the two points of interest include a first point of interest and a second point of interest, and the similarity calculation module is specifically configured to:
calculating the geographic distance between the first interest point and the second interest point according to the geographic position information of the first interest point and the geographic position information of the second interest point;
determining a first geographic position relation between the first interest point and a third interest point according to whether the area range indicated by the geographic position information of the first interest point is positioned in the area range indicated by the geographic position information of the third interest point, wherein the first geographic position relation comprises that the first interest point is positioned inside or outside the third interest point;
Determining a second geographic position relation between the second interest point and the third interest point according to whether the area range indicated by the geographic position information of the second interest point is positioned in the area range indicated by the geographic position information of the third interest point, wherein the second geographic position relation comprises that the second interest point is positioned inside or outside the third interest point;
And obtaining the similarity of the first interest point and the second interest point in the geographic dimension according to the geographic distance, the first geographic position relation and the second geographic position relation.
16. The data processing apparatus according to claim 15, wherein the similarity calculation module is specifically configured to:
If the first and second points of interest are both located inside the third point of interest and the geographic distance is greater than a first preset threshold, or the first and second points of interest are both located outside the third point of interest and the geographic distance is greater than the first preset threshold and less than a second preset threshold, calculating the similarity of the first and second points of interest in the geographic dimension according to the following formula: [formula not reproduced], wherein [symbol not reproduced] represents the similarity of the first interest point and the second interest point in the geographic dimension, a represents the geographic position indicated by the geographic position information of the first interest point, b represents the geographic position indicated by the geographic position information of the second interest point, and [symbol not reproduced] represents the geographic distance between the first interest point and the second interest point;
and if the first interest point and the second interest point are both positioned in the third interest point and the geographic distance is smaller than or equal to the first preset threshold, or the first interest point and the second interest point are both positioned outside the third interest point and the geographic distance is smaller than or equal to the first preset threshold, determining that the similarity of the first interest point and the second interest point in the geographic dimension is a first value.
17. The data processing apparatus according to claim 12, wherein the point of interest information includes address information of points of interest, the preset dimension includes an address dimension, the two points of interest include a first point of interest and a second point of interest, and the similarity calculation module is specifically configured to:
If the address information of the first interest point and the address information of the second interest point belong to the same geocode block, a third vector corresponding to the address information of the first interest point and a fourth vector corresponding to the address information of the second interest point are obtained;
and calculating the distance between the third vector and the fourth vector to obtain the similarity of the first interest point and the second interest point in the address dimension.
18. The data processing apparatus according to claim 12, wherein the point of interest information includes feature information of points of interest, the preset dimension includes a feature dimension, the two points of interest include a first point of interest and a second point of interest, and the similarity calculation module is specifically configured to:
if the feature information of the first interest point is the same as the feature information of the second interest point, determining that the similarity of the first interest point and the second interest point in the feature dimension is a second value;
And if the characteristic information of the first interest point is different from the characteristic information of the second interest point, determining that the similarity of the first interest point and the second interest point in the characteristic dimension is a third value.
19. The data processing apparatus according to claim 12, wherein the clustering module is specifically configured to:
if the similarity probability of the two interest points is larger than a third preset threshold value, establishing an association relationship between the two interest points;
And aggregating the interest points with the association relation in the interest points into the interest point set.
20. The data processing apparatus according to claim 12, wherein the target selection module is specifically configured to:
calculating the sum of similarity probabilities of a first interest point and second interest points, wherein the first interest point is any interest point in the interest point set, and the second interest points are any interest point except the first interest point in the interest point set;
and if the summation result is greater than or equal to a fourth preset threshold value, determining the first interest point as the target interest point.
21. The data processing apparatus of claim 12, wherein the first mounting module is further configured to:
Determining similar target interest points meeting the preset conditions in the other similar target interest points as second similar target interest points;
Determining the first similar target interest point as an upper node of the second similar target interest point;
And determining the second similar target interest point as an upper node of a third similar target interest point, wherein the third similar target interest point is any similar target interest point except the first similar target interest point and the second similar target interest point in the target interest point set.
22. The data processing apparatus according to any one of claims 12 to 20, wherein the point of interest information includes geographical location information of points of interest, a plurality of the target points of interest forming a tree structure, the tree structure including a first target point of interest and a second target point of interest, the apparatus further comprising a second mounting module configured to:
and if the area range indicated by the geographic position information of the first target interest point is within the area range indicated by the geographic position information of the second target interest point, taking the second target interest point as an upper node of the first target interest point in the tree structure.
23. An electronic device, the electronic device comprising:
A processor;
A memory for storing the processor-executable instructions;
Wherein the processor is configured to execute the instructions to implement the data processing method of any of claims 1 to 11.
24. A computer readable storage medium, wherein instructions stored in the computer readable storage medium, when executed by a processor of an electronic device, cause the electronic device to perform the data processing method of any of claims 1 to 11.
25. A computer program product comprising a computer program which, when executed by a processor, implements the data processing method according to any one of claims 1 to 11.
CN202110556281.9A 2021-05-21 2021-05-21 Data processing method, device, electronic equipment and storage medium Active CN113420595B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110556281.9A CN113420595B (en) 2021-05-21 2021-05-21 Data processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110556281.9A CN113420595B (en) 2021-05-21 2021-05-21 Data processing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113420595A CN113420595A (en) 2021-09-21
CN113420595B true CN113420595B (en) 2024-07-12

Family

ID=77712691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110556281.9A Active CN113420595B (en) 2021-05-21 2021-05-21 Data processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113420595B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115905456B (en) * 2023-01-06 2023-06-02 浪潮电子信息产业股份有限公司 Data identification method, system, equipment and computer readable storage medium
CN116257515A (en) * 2023-05-16 2023-06-13 之江实验室 Geographic interest point deduplication method, device and medium based on machine learning
CN117591904B (en) * 2024-01-18 2024-04-16 中睿信数字技术有限公司 Freight car clustering method based on density clustering

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376205A (en) * 2018-09-07 2019-02-22 顺丰科技有限公司 Excavate method, apparatus, equipment and the storage medium of address point of interest relationship
CN111209354A (en) * 2018-11-22 2020-05-29 北京搜狗科技发展有限公司 Method and device for judging repetition of map interest points and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489507B (en) * 2019-08-16 2023-03-31 腾讯科技(深圳)有限公司 Method, device, computer equipment and storage medium for determining similarity of interest points
CN111954175B (en) * 2020-08-25 2022-08-02 腾讯科技(深圳)有限公司 Method for judging visiting of interest point and related device

Also Published As

Publication number Publication date
CN113420595A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
CN113420595B (en) Data processing method, device, electronic equipment and storage medium
US11048983B2 (en) Method, terminal, and computer storage medium for image classification
US20210117726A1 (en) Method for training image classifying model, server and storage medium
CN107102746B (en) Candidate word generation method and device and candidate word generation device
JP6300295B2 (en) Friend recommendation method, server therefor, and terminal
CN109800325A (en) Video recommendation method, device and computer readable storage medium
CN109274732B (en) Geographic position obtaining method and device, electronic equipment and storage medium
CN110019645B (en) Index library construction method, search method and device
CN104850238B (en) The method and apparatus being ranked up to candidate item caused by input method
CN105701254A (en) Information processing method and device and device for processing information
CN112417318B (en) Method and device for determining states of interest points, electronic equipment and medium
CN110874145A (en) Input method and device and electronic equipment
CN109961094B (en) Sample acquisition method and device, electronic equipment and readable storage medium
CN110929176A (en) Information recommendation method and device and electronic equipment
CN111209354A (en) Method and device for judging repetition of map interest points and electronic equipment
CN113128437A (en) Identity recognition method and device, electronic equipment and storage medium
CN114880480A (en) Question-answering method and device based on knowledge graph
CN112328911A (en) Site recommendation method, device, equipment and storage medium
US20130144904A1 (en) Method and system for providing query using an image
CN113609380B (en) Label system updating method, searching device and electronic equipment
US11651280B2 (en) Recording medium, information processing system, and information processing method
CN108241678B (en) Method and device for mining point of interest data
EP3812951A1 (en) Augmenting biligual training corpora by replacing named entities
CN112328809A (en) Entity classification method, device and computer readable storage medium
CN111797746A (en) Face recognition method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant