CN113420595A - Data processing method and device, electronic equipment and storage medium

Publication number: CN113420595A
Authority: CN (China)
Prior art keywords: interest, interest point, point, information, similarity
Legal status: Pending (as listed; an assumption, not a legal conclusion)
Application number: CN202110556281.9A
Other languages: Chinese (zh)
Inventor: 孙凯
Current Assignee: Beijing Dajia Internet Information Technology Co Ltd
Original Assignee: Beijing Dajia Internet Information Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/23 Clustering techniques
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Abstract

The disclosure provides a data processing method, an apparatus, an electronic device and a storage medium. The method first obtains interest point information from a plurality of data sources; then calculates, according to the interest point information, the similarity of two interest points in a preset dimension; then inputs the similarity of the two interest points in one or more preset dimensions into a pre-trained classification model to obtain the similarity probability of the two interest points, wherein the classification model is trained on the similarity of two sample interest points in one or more preset dimensions and on whether the two sample interest points represent the same entity, and the similarity probability represents the probability that the two interest points represent the same entity; then clusters a plurality of interest points according to the similarity probability of the two interest points to obtain an interest point set; and finally selects a target interest point representing the entity from the interest point set. The scheme yields target interest points that correspond to entities with high accuracy, and improves the robustness and accuracy of fusing multiple data sources.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.
Background
A point of interest (POI), also called an information point, is a landmark or scenic spot on an electronic map. It marks places such as government departments, commercial institutions of various industries (gas stations, department stores, supermarkets, restaurants, hotels, convenience stores, hospitals, etc.), tourist attractions (parks, public toilets, etc.), historic sites, and transportation facilities (various stations, parking lots).
In the related art, when a user searches for interest points or analyzes information based on interest points, a single data source does not cover enough interest points, so the target interest point required by the user may be missing from the database; when a plurality of data sources are used, the same interest point is described repeatedly.
Disclosure of Invention
The present disclosure provides a data processing method, an apparatus, an electronic device, and a storage medium, to at least solve the following problems in the related art: when only a single data source is used, the coverage of interest points is insufficient, so the target interest point required by a user may be missing from the database; and when a plurality of data sources are used, the same interest point is described repeatedly. The technical scheme of the disclosure is as follows:
According to a first aspect of the present disclosure, there is provided a data processing method, the method comprising:
obtaining interest point information of a plurality of data sources;
according to the interest point information, calculating the similarity of the two interest points in a preset dimension, wherein the preset dimension comprises at least one of the following: a geographic dimension, a name dimension, an address dimension, and a feature dimension;
inputting the similarity of the two interest points in one or more preset dimensions into a classification model obtained through pre-training to obtain the similarity probability of the two interest points, wherein the classification model is obtained through training based on the similarity of the two sample interest points in one or more preset dimensions and the label of whether the two sample interest points represent the same entity, and the similarity probability is used for representing the probability that the two interest points represent the same entity;
clustering a plurality of interest points according to the similarity probability of the two interest points to obtain an interest point set;
and selecting a target interest point representing the entity from the interest point set.
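By way of illustration only, the classification step above can be sketched as follows. The disclosure does not state which type of classification model is used; this sketch assumes a logistic regression fitted by gradient descent, and the names `train_logistic`, `similarity_probability`, `SAMPLES` and `LABELS`, as well as all numeric values, are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(samples, labels, epochs=2000, lr=0.5):
    """Fit weights and bias by batch gradient descent on the log loss.

    samples: per-pair similarity vectors, one value per preset dimension.
    labels:  1 if the two sample interest points represent the same entity, else 0.
    """
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias

def similarity_probability(features, weights, bias):
    """Probability that the two interest points represent the same entity."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, features)) + bias)

# Hypothetical labelled pairs: [geographic, name, address, feature] similarities.
SAMPLES = [
    [0.95, 0.90, 0.85, 1.0],
    [0.90, 0.80, 0.90, 1.0],
    [0.20, 0.30, 0.10, 0.0],
    [0.10, 0.85, 0.20, 0.0],  # similar names, but different entities
]
LABELS = [1, 1, 0, 0]

weights, bias = train_logistic(SAMPLES, LABELS)
p_same = similarity_probability([0.92, 0.88, 0.80, 1.0], weights, bias)
p_diff = similarity_probability([0.15, 0.40, 0.10, 0.0], weights, bias)
```

In this sketch the model combines the per-dimension similarities into a single probability, so a pair that is similar in only one dimension (the fourth sample) can still be scored as a different entity.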
In an optional implementation manner, the interest point information includes name information of interest points, the preset dimension includes a name dimension, the two interest points include a first interest point and a second interest point, and the step of calculating a similarity of the two interest points in the preset dimension according to the interest point information includes:
identifying an entity represented by the name information of the first interest point to obtain first entity information of the first interest point;
identifying an entity represented by the name information of the second interest point to obtain second entity information of the second interest point;
if the first entity information is the same as the second entity information, calculating the similarity between the name information of the first interest point and the name information of the second interest point, and obtaining the similarity of the first interest point and the second interest point in the name dimension.
In an optional implementation manner, the step of calculating a similarity between the name information of the first point of interest and the name information of the second point of interest includes:
acquiring a first vector corresponding to the name information of the first interest point;
acquiring a second vector corresponding to the name information of the second interest point;
and calculating the distance between the first vector and the second vector to obtain the similarity between the name information of the first interest point and the name information of the second interest point.
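By way of illustration only, the vector comparison above can be sketched as follows. The disclosure does not specify how the vectors are obtained or which distance is used; this sketch assumes character bigram count vectors and cosine similarity, and the function names are hypothetical. The entity recognition step is assumed to have already confirmed that both names refer to the same entity.

```python
import math
from collections import Counter

def char_ngram_vector(text: str, n: int = 2) -> Counter:
    """Represent a name as a sparse vector of character n-gram counts (assumed encoding)."""
    text = text.strip().lower()
    if len(text) < n:
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def name_similarity(name_a: str, name_b: str) -> float:
    """Similarity of two interest point names in the name dimension."""
    return cosine_similarity(char_ngram_vector(name_a), char_ngram_vector(name_b))

identical = name_similarity("Star Cafe", "Star Cafe")
partial = name_similarity("Star Cafe", "Star Cafe Downtown")
unrelated = name_similarity("abc", "xyz")
```

A learned text embedding could replace `char_ngram_vector` without changing the rest of the sketch.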
In an optional implementation manner, the interest point information includes geographical location information of interest points, the preset dimension includes a geographical dimension, the two interest points include a first interest point and a second interest point, and the step of calculating a similarity of the two interest points in the preset dimension according to the interest point information includes:
calculating the geographic distance between the first interest point and the second interest point according to the geographic position information of the first interest point and the geographic position information of the second interest point;
determining a first geographical position relationship between the first interest point and a third interest point according to whether an area range indicated by the geographical position information of the first interest point is located in an area range indicated by the geographical position information of the third interest point, wherein the first geographical position relationship comprises that the first interest point is located inside or outside the third interest point;
determining a second geographical position relationship between the second interest point and the third interest point according to whether the area range indicated by the geographical position information of the second interest point is located in the area range indicated by the geographical position information of the third interest point, wherein the second geographical position relationship comprises that the second interest point is located inside or outside the third interest point;
and according to the geographic distance, the first geographic position relation and the second geographic position relation, obtaining the similarity of the first interest point and the second interest point on the geographic dimension.
In an optional implementation manner, the step of obtaining the similarity of the first interest point and the second interest point in the geographic dimension according to the geographic distance, the first geographic position relationship, and the second geographic position relationship includes:
if the first interest point and the second interest point are both located inside the third interest point and the geographic distance is greater than a first preset threshold value, or the first interest point and the second interest point are both located outside the third interest point and the geographic distance is greater than the first preset threshold value and less than a second preset threshold value, calculating the similarity of the first interest point and the second interest point in the geographic dimension according to the following formula,
[Formula image not reproduced in this text: Figure BDA0003077266990000031]
wherein Sim (a, b) represents the similarity of the first interest point and the second interest point in the geographic dimension, a represents the geographic position indicated by the geographic position information of the first interest point, b represents the geographic position indicated by the geographic position information of the second interest point, and dist (a, b) represents the geographic distance between the first interest point and the second interest point;
if the first interest point and the second interest point are both located inside the third interest point and the geographic distance is smaller than or equal to the first preset threshold, or the first interest point and the second interest point are both located outside the third interest point and the geographic distance is smaller than or equal to the first preset threshold, determining that the similarity of the first interest point and the second interest point in the geographic dimension is a first value.
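By way of illustration only, the branching above can be sketched as follows. The formula for Sim (a, b) appears only as a figure reference in this text, so the sketch takes it as a caller-supplied function rather than guessing it; `placeholder_formula` is a hypothetical stand-in and is not the formula of the disclosure. Using the haversine great-circle distance for dist (a, b) and returning `None` for cases the passage does not cover are also assumptions.

```python
import math
from typing import Callable, Optional

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))

def geo_similarity(
    dist: float,
    a_inside: bool,
    b_inside: bool,
    first_threshold: float,
    second_threshold: float,
    first_value: float,
    formula: Callable[[float], float],
) -> Optional[float]:
    """Apply the piecewise rule; a_inside/b_inside are relative to the third interest point."""
    both_inside = a_inside and b_inside
    both_outside = (not a_inside) and (not b_inside)
    if (both_inside or both_outside) and dist <= first_threshold:
        return first_value
    if both_inside and dist > first_threshold:
        return formula(dist)
    if both_outside and first_threshold < dist < second_threshold:
        return formula(dist)
    return None  # case not described in the passage above

# Hypothetical stand-in for the formula that is not reproduced in this text.
def placeholder_formula(dist: float) -> float:
    return 10.0 / dist

near = geo_similarity(5.0, True, True, 10.0, 100.0, 1.0, placeholder_formula)
mid = geo_similarity(50.0, False, False, 10.0, 100.0, 1.0, placeholder_formula)
far = geo_similarity(500.0, False, False, 10.0, 100.0, 1.0, placeholder_formula)
one_degree = haversine_m(0.0, 0.0, 0.0, 1.0)
```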
In an optional implementation manner, the interest point information includes address information of interest points, the preset dimension includes an address dimension, the two interest points include a first interest point and a second interest point, and the step of calculating a similarity of the two interest points in the preset dimension according to the interest point information includes:
if the address information of the first interest point and the address information of the second interest point belong to the same geocoding block, acquiring a third vector corresponding to the address information of the first interest point and a fourth vector corresponding to the address information of the second interest point;
and calculating the distance between the third vector and the fourth vector to obtain the similarity of the first interest point and the second interest point in the address dimension.
In an optional implementation manner, the interest point information includes feature information of interest points, the preset dimension includes a feature dimension, the two interest points include a first interest point and a second interest point, and the step of calculating a similarity of the two interest points in the preset dimension according to the interest point information includes:
if the feature information of the first interest point is the same as the feature information of the second interest point, determining that the similarity of the first interest point and the second interest point on the feature dimension is a second value;
and if the feature information of the first interest point is different from the feature information of the second interest point, determining that the similarity of the first interest point and the second interest point on the feature dimension is a third value.
In an optional implementation manner, the clustering a plurality of interest points according to the similarity probability of the two interest points to obtain an interest point set includes:
if the similarity probability of the two interest points is greater than a third preset threshold value, establishing an association relationship between the two interest points;
aggregating the interest points having the association relationship among the plurality of interest points into the interest point set.
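By way of illustration only, the clustering above can be sketched as follows. The sketch assumes that associations are treated as transitive, so each interest point set is a connected component of the association graph, and it uses a union-find structure to build the components. The function name `cluster_points` and the sample probabilities are hypothetical.

```python
def cluster_points(point_ids, pair_probs, threshold):
    """Group interest points linked by similarity probabilities above the threshold.

    point_ids:  iterable of interest point identifiers.
    pair_probs: dict mapping (id_a, id_b) to the similarity probability of the pair.
    threshold:  the third preset threshold; only strictly greater values create a link.
    """
    point_ids = list(point_ids)
    parent = {p: p for p in point_ids}

    def find(p):
        # Follow parent links to the root, halving the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), prob in pair_probs.items():
        if prob > threshold:
            parent[find(a)] = find(b)  # establish the association

    clusters = {}
    for p in point_ids:
        clusters.setdefault(find(p), set()).add(p)
    return list(clusters.values())

probs = {("A", "B"): 0.9, ("B", "C"): 0.8, ("C", "D"): 0.3, ("A", "D"): 0.1}
sets = cluster_points(["A", "B", "C", "D"], probs, threshold=0.5)
```

Here A and C fall into the same set through B even though no probability is given for the pair (A, C) directly.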
In an optional implementation manner, the step of selecting a target interest point representing the entity from the interest point set includes:
calculating the sum of the similarity probability of a first interest point and each second interest point, wherein the first interest point is any interest point in the interest point set, and the second interest point is any interest point except the first interest point in the interest point set;
and if the summation result is greater than or equal to a fourth preset threshold value, determining the first interest point as the target interest point.
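By way of illustration only, the selection above can be sketched as follows. The function name `select_targets`, the treatment of a missing pair as probability 0.0, and the sample values are assumptions.

```python
def select_targets(cluster, pair_probs, threshold):
    """Return the interest points whose summed similarity probability meets the threshold.

    cluster:    list of interest point identifiers in one interest point set.
    pair_probs: dict mapping (id_a, id_b) to a similarity probability (either order).
    threshold:  the fourth preset threshold.
    """
    def prob(a, b):
        # Pairs are unordered; a pair with no stored probability counts as 0.0 (assumption).
        return pair_probs.get((a, b), pair_probs.get((b, a), 0.0))

    targets = []
    for first in cluster:
        total = sum(prob(first, second) for second in cluster if second != first)
        if total >= threshold:
            targets.append(first)
    return targets

probs = {("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "C"): 0.4}
targets = select_targets(["A", "B", "C"], probs, threshold=1.5)
```

In this example only B is selected, because it is the point most similar to all the other points in the set.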
In an optional implementation manner, the interest point information includes name information of interest points, a plurality of target interest points form a tree structure, the tree structure includes a first target interest point and a second target interest point, and after the step of selecting a target interest point representing the entity from the interest point set, the method further includes:
identifying an entity represented by the name information of the first target interest point to obtain entity information of the first target interest point;
identifying an entity represented by the name information of the second target interest point to obtain entity information of the second target interest point;
if the entity information of the first target interest point is the same as the entity information of the second target interest point, calculating the similarity between the name information of the first target interest point and the name information of the second target interest point;
if the similarity between the name information of the first target interest point and the name information of the second target interest point is greater than or equal to a fifth preset threshold value, determining that the first target interest point and the second target interest point are similar target interest points;
the similar target interest points form a target interest point set;
selecting, from the target interest point set, the similar target interest point meeting a preset condition as a first similar target interest point, and determining the first similar target interest point as a superior node of the other similar target interest points in the target interest point set in the tree structure, wherein the preset condition is that the proportion of the name information occupied by the entity information of the similar target interest point is the largest.
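By way of illustration only, the selection of the superior node above can be sketched as follows. The sketch assumes the proportion is measured in characters (length of the entity information divided by length of the name information); the function names and the sample names are hypothetical.

```python
def entity_ratio(name: str, entity: str) -> float:
    """Share of the name occupied by the recognized entity, measured in characters."""
    if not name or entity not in name:
        return 0.0
    return len(entity) / len(name)

def pick_superior(similar_points):
    """Split similar target interest points into (superior node, other nodes).

    similar_points: list of (name, entity) tuples that share the same entity.
    """
    superior = max(similar_points, key=lambda p: entity_ratio(p[0], p[1]))
    others = [p for p in similar_points if p is not superior]
    return superior, others

points = [
    ("Central Park North Gate", "Central Park"),
    ("Central Park", "Central Park"),
    ("Central Park Zoo Ticket Office", "Central Park"),
]
superior, others = pick_superior(points)
```

The name that consists of nothing but the entity has a ratio of 1.0 and becomes the superior node; longer names that add qualifiers such as a gate or an office are mounted beneath it.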
In an optional implementation manner, after the step of selecting a similar target interest point satisfying a preset condition from the target interest point set as a first similar target interest point, and determining the first similar target interest point as a superior node of other similar target interest points in the target interest point set in the tree structure, the method further includes:
determining the similar target interest points meeting the preset condition from the other similar target interest points as second similar target interest points;
determining the first similar target interest point as a previous level node of the second similar target interest point;
and determining the second similar target interest point as a superior node of a third similar target interest point, wherein the third similar target interest point is any similar target interest point in the target interest point set except the first similar target interest point and the second similar target interest point.
In an optional implementation manner, the interest point information includes geographical location information of interest points, a plurality of target interest points form a tree structure, the tree structure includes a first target interest point and a second target interest point, and after the step of selecting a target interest point representing the entity from the interest point set, the method further includes:
and if the area range indicated by the geographical position information of the first target interest point is located in the area range indicated by the geographical position information of the second target interest point, taking the second target interest point as a superior node of the first target interest point in the tree structure.
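By way of illustration only, the containment rule above can be sketched as follows. The sketch assumes each area range is an axis-aligned latitude/longitude bounding box, and that when several target interest points contain the same point, the smallest containing one is taken as its direct superior node; neither assumption is stated in the disclosure. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Region:
    """Area range of a target interest point as a bounding box (assumed representation)."""
    name: str
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float

    def contains(self, other: "Region") -> bool:
        """True if the other region lies entirely within this region."""
        return (self.min_lat <= other.min_lat and self.min_lon <= other.min_lon
                and self.max_lat >= other.max_lat and self.max_lon >= other.max_lon)

    def area(self) -> float:
        return (self.max_lat - self.min_lat) * (self.max_lon - self.min_lon)

def assign_superiors(regions: List[Region]) -> Dict[str, Optional[str]]:
    """Map each region name to the name of its superior node, or None for a root."""
    parents: Dict[str, Optional[str]] = {}
    for child in regions:
        # Require a strictly larger area so identical boxes do not mount each other.
        candidates = [r for r in regions
                      if r is not child and r.contains(child) and r.area() > child.area()]
        parent = min(candidates, key=Region.area, default=None)
        parents[child.name] = parent.name if parent else None
    return parents

regions = [
    Region("campus", 0.0, 0.0, 10.0, 10.0),
    Region("mall", 2.0, 2.0, 6.0, 6.0),
    Region("shop", 3.0, 3.0, 4.0, 4.0),
]
tree = assign_superiors(regions)
```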
According to a second aspect of the present disclosure, there is provided a data processing apparatus, the apparatus comprising:
an information acquisition module configured to acquire interest point information of a plurality of data sources;
a similarity calculation module configured to calculate, according to the interest point information, a similarity of the two interest points in a preset dimension, where the preset dimension includes at least one of: a geographic dimension, a name dimension, an address dimension, and a feature dimension;
a probability calculation module configured to input the similarity of the two interest points in one or more preset dimensions into a classification model obtained through pre-training to obtain a similarity probability of the two interest points, where the classification model is obtained through training based on the similarity of the two sample interest points in one or more preset dimensions and whether the two sample interest points represent the same entity, and the similarity probability is used for representing the probability that the two interest points represent the same entity;
a clustering module configured to cluster a plurality of interest points according to the similarity probability of the two interest points to obtain an interest point set;
a target selection module configured to select a target point of interest representing the entity from the set of points of interest.
In an optional implementation manner, the interest point information includes name information of an interest point, the preset dimension includes a name dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
identifying an entity represented by the name information of the first interest point to obtain first entity information of the first interest point;
identifying an entity represented by the name information of the second interest point to obtain second entity information of the second interest point;
if the first entity information is the same as the second entity information, calculating the similarity between the name information of the first interest point and the name information of the second interest point, and obtaining the similarity of the first interest point and the second interest point in the name dimension.
In an optional implementation manner, the similarity calculation module is specifically configured to:
acquiring a first vector corresponding to the name information of the first interest point;
acquiring a second vector corresponding to the name information of the second interest point;
and calculating the distance between the first vector and the second vector to obtain the similarity between the name information of the first interest point and the name information of the second interest point.
In an optional implementation manner, the interest point information includes geographical location information of an interest point, the preset dimension includes a geographical dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
calculating the geographic distance between the first interest point and the second interest point according to the geographic position information of the first interest point and the geographic position information of the second interest point;
determining a first geographical position relationship between the first interest point and a third interest point according to whether an area range indicated by the geographical position information of the first interest point is located in an area range indicated by the geographical position information of the third interest point, wherein the first geographical position relationship comprises that the first interest point is located inside or outside the third interest point;
determining a second geographical position relationship between the second interest point and the third interest point according to whether the area range indicated by the geographical position information of the second interest point is located in the area range indicated by the geographical position information of the third interest point, wherein the second geographical position relationship comprises that the second interest point is located inside or outside the third interest point;
and according to the geographic distance, the first geographic position relation and the second geographic position relation, obtaining the similarity of the first interest point and the second interest point on the geographic dimension.
In an optional implementation manner, the similarity calculation module is specifically configured to:
if the first interest point and the second interest point are both located inside the third interest point and the geographic distance is greater than a first preset threshold value, or the first interest point and the second interest point are both located outside the third interest point and the geographic distance is greater than the first preset threshold value and less than a second preset threshold value, calculating the similarity of the first interest point and the second interest point in the geographic dimension according to the following formula,
[Formula image not reproduced in this text: Figure BDA0003077266990000061]
wherein Sim (a, b) represents the similarity of the first interest point and the second interest point in the geographic dimension, a represents the geographic position indicated by the geographic position information of the first interest point, b represents the geographic position indicated by the geographic position information of the second interest point, and dist (a, b) represents the geographic distance between the first interest point and the second interest point;
if the first interest point and the second interest point are both located inside the third interest point and the geographic distance is smaller than or equal to the first preset threshold, or the first interest point and the second interest point are both located outside the third interest point and the geographic distance is smaller than or equal to the first preset threshold, determining that the similarity of the first interest point and the second interest point in the geographic dimension is a first value.
In an optional implementation manner, the interest point information includes address information of an interest point, the preset dimension includes an address dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
if the address information of the first interest point and the address information of the second interest point belong to the same geocoding block, acquiring a third vector corresponding to the address information of the first interest point and a fourth vector corresponding to the address information of the second interest point;
and calculating the distance between the third vector and the fourth vector to obtain the similarity of the first interest point and the second interest point in the address dimension.
In an optional implementation manner, the interest point information includes feature information of an interest point, the preset dimension includes a feature dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
if the feature information of the first interest point is the same as the feature information of the second interest point, determining that the similarity of the first interest point and the second interest point on the feature dimension is a second value;
and if the feature information of the first interest point is different from the feature information of the second interest point, determining that the similarity of the first interest point and the second interest point on the feature dimension is a third value.
In an optional implementation, the clustering module is specifically configured to:
if the similarity probability of the two interest points is greater than a third preset threshold value, establishing an association relationship between the two interest points;
aggregating the interest points having the association relationship among the plurality of interest points into the interest point set.
In an optional implementation manner, the target selecting module is specifically configured to:
calculating the sum of the similarity probability of a first interest point and each second interest point, wherein the first interest point is any interest point in the interest point set, and the second interest point is any interest point except the first interest point in the interest point set;
and if the summation result is greater than or equal to a fourth preset threshold value, determining the first interest point as the target interest point.
In an optional implementation manner, the interest point information includes name information of interest points, a plurality of target interest points form a tree structure, and the tree structure includes a first target interest point and a second target interest point, the apparatus further includes a first mounting module configured to:
identifying an entity represented by the name information of the first target interest point to obtain entity information of the first target interest point;
identifying an entity represented by the name information of the second target interest point to obtain entity information of the second target interest point;
if the entity information of the first target interest point is the same as the entity information of the second target interest point, calculating the similarity between the name information of the first target interest point and the name information of the second target interest point;
if the similarity between the name information of the first target interest point and the name information of the second target interest point is greater than or equal to a fifth preset threshold value, determining that the first target interest point and the second target interest point are similar target interest points;
the similar target interest points form a target interest point set;
selecting, from the target interest point set, the similar target interest point meeting a preset condition as a first similar target interest point, and determining the first similar target interest point as a superior node of the other similar target interest points in the target interest point set in the tree structure, wherein the preset condition is that the proportion of the name information occupied by the entity information of the similar target interest point is the largest.
In an alternative implementation, the first mounting module is further configured to:
determining the similar target interest points meeting the preset condition from the other similar target interest points as second similar target interest points;
determining the first similar target interest point as a previous level node of the second similar target interest point;
and determining the second similar target interest point as a superior node of a third similar target interest point, wherein the third similar target interest point is any similar target interest point in the target interest point set except the first similar target interest point and the second similar target interest point.
In an optional implementation manner, the interest point information includes geographical location information of interest points, a plurality of target interest points form a tree structure, the tree structure includes a first target interest point and a second target interest point, the apparatus further includes a second mounting module configured to:
and if the area range indicated by the geographical position information of the first target interest point is located in the area range indicated by the geographical position information of the second target interest point, taking the second target interest point as a superior node of the first target interest point in the tree structure.
According to a third aspect of the present disclosure, there is provided an electronic apparatus comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the data processing method of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the data processing method of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product, wherein the instructions of the computer program product, when executed by a processor of an electronic device, enable the electronic device to perform the data processing method according to the first aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the technical scheme of the disclosure provides a data processing method, a data processing device, electronic equipment and a storage medium, wherein the method comprises the steps of firstly acquiring interest point information of a plurality of data sources; then, according to the interest point information, calculating the similarity of the two interest points in a preset dimension, wherein the preset dimension comprises at least one of the following: a geographic dimension, a name dimension, an address dimension, and a feature dimension; then, the similarity of the two interest points on one or more preset dimensions is input into a classification model obtained through pre-training, so that the similarity probability of the two interest points is obtained, the classification model is obtained through training based on the similarity of the two sample interest points on one or more preset dimensions and whether the two sample interest points represent the same entity, and the similarity probability is used for representing the probability that the two interest points represent the same entity; clustering the interest points according to the similarity probability of the two interest points to obtain an interest point set; and selecting a target interest point representing the entity from the interest point set. According to the technical scheme, the similarity of two interest points on a preset dimension is calculated firstly, then the similarity of at least one preset dimension is input into a classification model to obtain the similarity probability of the two interest points, the interest points are clustered based on the similarity probability, and a target interest point representing an entity is determined according to a clustering result. 
Because the similarity probability integrates the similarities of the two interest points in at least one preset dimension, clustering by similarity probability improves the accuracy of the clustering result. It avoids the inaccurate clustering that arises when interest points representing the same entity are clustered directly by similarity and two interest points have low similarity in a high-confidence dimension but high similarity in other dimensions. The scheme can therefore obtain highly accurate target interest points in one-to-one correspondence with entities, and greatly improves the robustness and accuracy of fusing a plurality of interest point data sources.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow chart illustrating a method of data processing according to an exemplary embodiment.
FIG. 2 is a schematic flow diagram illustrating pre-processing of data from multiple data sources according to an example embodiment.
FIG. 3 is a flow diagram illustrating obtaining similarity in name dimensions, according to an example embodiment.
FIG. 4 is a flow diagram illustrating obtaining similarity in geographic dimensions, according to an example embodiment.
FIG. 5 is a flow diagram illustrating obtaining similarity in address dimensions, according to an example embodiment.
FIG. 6 is a flow diagram illustrating obtaining similarity in feature dimensions according to an example embodiment.
FIG. 7 is a flowchart illustrating obtaining a tree structure according to an example embodiment.
FIG. 8 is a flowchart illustrating application of a tree structure in an entity retrieval process according to an example embodiment.
FIG. 9 is a block diagram illustrating a structure of a data processing apparatus according to an exemplary embodiment.
FIG. 10 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 11 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The inventor finds that: under the scene that a user searches for interest points or analyzes information according to the interest points, the coverage rate of the interest points is not high enough when a single data source is used, and target interest points required by the user are not in a database; when a plurality of data sources are adopted, the problems that the same interest point has repeated description or the association between the interest point and an entity is inaccurate occur.
To solve the above problems, the present disclosure provides a data processing method. FIG. 1 is a flowchart illustrating the data processing method according to an exemplary embodiment; the execution subject of this embodiment may be an electronic device such as a server.
As shown in fig. 1, the method may include the following steps.
In step S11, point of interest information for a plurality of data sources is acquired.
In a specific implementation, referring to fig. 2, a series of standardized processing procedures such as consistency check, missing value check, abnormal value check, repeated value check, and the like may be performed on the interest point information of a plurality of data sources, such as map data, collected data, and the like, first. Based on the interest point information acquired by a plurality of ways, the interest point information is unified in a plurality of aspects such as a coordinate system, a name, a classification and a boundary. The unified data can satisfy the following conditions: the longitude and latitude are in the same coordinate system; the name does not contain punctuation marks and special characters; the classification rules are consistent; the boundary of the planar region is closed and the forward boundary of the region is defined counterclockwise. By unifying the interest point information of the plurality of data sources, the subsequent data processing process can be more efficient and accurate.
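Two of the unification conditions listed above (names free of punctuation marks and special characters; closed, counterclockwise region boundaries) can be checked mechanically. The following is a minimal sketch only; the function names `normalize_name` and `is_closed_ccw`, and the representation of a boundary as a list of (x, y) pairs, are assumptions made for illustration and are not taken from the disclosure.

```python
import unicodedata
from typing import List, Tuple


def normalize_name(name: str) -> str:
    """Remove punctuation marks and symbol characters from a name."""
    kept = []
    for ch in name:
        # Unicode categories starting with "P" are punctuation,
        # those starting with "S" are symbols.
        if unicodedata.category(ch)[0] in ("P", "S"):
            continue
        kept.append(ch)
    return "".join(kept).strip()


def is_closed_ccw(ring: List[Tuple[float, float]]) -> bool:
    """Return True if the boundary is closed and runs counterclockwise.

    Uses the shoelace formula: a positive signed area means the
    vertices are listed counterclockwise (x to the right, y upward).
    """
    if len(ring) < 4 or ring[0] != ring[-1]:
        return False
    twice_area = sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(ring, ring[1:])
    )
    return twice_area > 0
```

Conversion between coordinate systems and alignment of classification rules depend on the specific data sources and are not sketched here.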
The point of interest information may include at least one of name information, geographical location information, address information, and feature information of the point of interest. The interest point may have a certain area range, and the interest point with a certain area range is also called an area of interest (AOI), which may include a plurality of interest points.
In step S12, according to the interest point information, calculating a similarity of two interest points in a preset dimension, where the preset dimension includes at least one of: a geographic dimension, a name dimension, an address dimension, and a feature dimension.
In specific implementation, the similarity of two interest points in the geographic dimension can be calculated according to the geographic position information of the interest points; the similarity of the two interest points on the name dimension can be calculated according to the name information of the interest points; the similarity of the two interest points on the address dimension can be calculated according to the address information of the interest points; and calculating the similarity of the two interest points on the characteristic dimension according to the characteristic information of the interest points, and the like. The following embodiments will detail the similarity of two points of interest in the geographic dimension, name dimension, address dimension and feature dimension, respectively.
The feature information includes, but is not limited to, category information, telephone information, picture information, selling price information, and the like.
In step S13, the similarity of the two interest points in one or more preset dimensions is input into a classification model obtained by pre-training, so as to obtain the similarity probability of the two interest points, the classification model is obtained by training based on the similarity of the two sample interest points in one or more preset dimensions and whether the two sample interest points represent the same entity, and the similarity probability is used to represent the probability that the two interest points represent the same entity.
In a specific implementation, the similarity of the two interest points in at least one preset dimension, calculated in step S12, may be input into the classification model, and the classification model outputs the similarity probability of the two interest points. The classification model can be obtained by training a model such as XGBoost based on the similarity of two sample interest points in one or more preset dimensions and a label indicating whether the two sample interest points represent the same entity. XGBoost is a machine learning algorithm implemented under the Gradient Boosting framework, with a regularization term added to prevent overfitting; a classification model obtained by training an XGBoost model can yield a more accurate result.
Specifically, the classification model may be obtained as follows: the similarity of two sample interest points in one or more preset dimensions is input into a model to be trained, such as an XGBoost model; a loss function value is calculated from the output result and the label indicating whether the two sample interest points represent the same entity; with minimization of the loss function value as the objective, the parameters of the model to be trained are optimized through a back propagation algorithm, finally yielding the classification model.
Through the classification model, the similarity probability of two interest points is calculated, the original classification problem is converted into a regression problem, and the adjustment is conveniently carried out through parameters such as the clustering radius and the like in the clustering process, so that the clustering result with high recall rate or high accuracy is obtained.
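The mapping from per-dimension similarities to a similarity probability can be illustrated with a small stand-in model. The sketch below deliberately swaps in a plain logistic regression trained by gradient descent instead of the XGBoost model named above, only so that the example needs no third-party library; the function names, the learning rate, the epoch count, and the feature order (geographic, name, address, feature) are all assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_pair_classifier(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.5,
    epochs: int = 2000,
) -> Tuple[List[float], float]:
    """Fit weights on similarity vectors; label 1 means 'same entity'."""
    dim = len(samples[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(samples, labels):
            p = _sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def similarity_probability(
    weights: Sequence[float], bias: float, similarities: Sequence[float]
) -> float:
    """Probability that two interest points represent the same entity."""
    return _sigmoid(sum(w * v for w, v in zip(weights, similarities)) + bias)
```

Whatever model is used, its output is a value between 0 and 1 for each pair of interest points, which is what the clustering step consumes.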
In this embodiment, the similarity probability of the two interest points calculated by the classification model integrates the similarities of the two interest points in at least one preset dimension. Clustering by similarity probability improves the accuracy of the clustering result and avoids the inaccurate clustering that arises when interest points representing the same entity are clustered directly by similarity and two interest points have low similarity in a high-confidence dimension but high similarity in other dimensions. By clustering according to the similarity probability, the present disclosure can accurately recall interest points and obtain a more accurate clustering result, thereby improving the accuracy of the target interest points. The high-confidence dimension may be a dimension whose data has high accuracy and stability, such as the geographic dimension.
In step S14, clustering the interest points according to the similarity probability of the two interest points to obtain an interest point set.
In a specific implementation, a DBSCAN clustering algorithm may be used to cluster a plurality of interest points according to the similarity probability of two interest points, so as to obtain an interest point set. The aggregation radius in the DBSCAN clustering algorithm may be set according to actual requirements for recall rate and accuracy, and the specific value of the aggregation radius is not limited in this embodiment. When the aggregation radius is set larger, an aggregation result with a higher recall rate can be obtained; when the aggregation radius is set smaller, a more accurate aggregation result can be obtained. The aggregation radius is a reference distance adopted during aggregation, and the reference distance is inversely proportional to the reference similarity probability, i.e., the larger the reference similarity probability is, the smaller the reference distance is.
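The effect of the aggregation radius can be seen in a compact pure-Python DBSCAN sketch. It assumes that the distance between two interest points is taken as 1 minus their similarity probability, which is only one possible way of making distance decrease as probability grows; the function name, the `min_pts` default, and the dictionary layout of `prob` (keys (i, j) with i < j) are likewise assumptions.

```python
from typing import Dict, List, Optional, Tuple


def dbscan_from_probabilities(
    n: int,
    prob: Dict[Tuple[int, int], float],
    eps: float,
    min_pts: int = 2,
) -> List[int]:
    """Cluster points 0..n-1; returns a label per point, -1 for noise."""

    def distance(i: int, j: int) -> float:
        if i == j:
            return 0.0
        key = (i, j) if i < j else (j, i)
        # Assumed conversion: missing pairs count as probability 0.
        return 1.0 - prob.get(key, 0.0)

    def neighbours(i: int) -> List[int]:
        # The point itself is included, as in the usual DBSCAN definition.
        return [j for j in range(n) if distance(i, j) <= eps]

    labels: List[Optional[int]] = [None] * n  # None means unvisited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joined, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return [label if label is not None else -1 for label in labels]
```

With `prob = {(0, 1): 0.9, (1, 2): 0.85, (0, 2): 0.2, (2, 3): 0.1}`, a radius of 0.2 places points 0, 1 and 2 in one set, while a radius of 0.12 keeps only points 0 and 1 together, illustrating the recall versus accuracy trade-off described above.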
In an optional implementation manner, if the similarity probability of the two interest points is greater than a third preset threshold, an association relationship is established between the two interest points; the interest points having an association relationship among the plurality of interest points are then aggregated into an interest point set. The third preset threshold may be set according to actual requirements for recall rate and accuracy, and the specific value of the third preset threshold is not limited in this embodiment. When the third preset threshold is set larger, an aggregation result with higher accuracy can be obtained; when the third preset threshold is set smaller, an aggregation result with a higher recall rate can be obtained.
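The threshold-based alternative amounts to linking every pair whose similarity probability exceeds the third preset threshold and then collecting linked interest points into sets. A minimal sketch using a union-find structure follows; the function name and the dictionary layout of `prob` are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def cluster_by_threshold(
    n: int, prob: Dict[Tuple[int, int], float], threshold: float
) -> List[List[int]]:
    """Group points 0..n-1 whose pairwise probability exceeds threshold."""
    parent = list(range(n))

    def find(x: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), p in prob.items():
        if p > threshold:  # establish the association relationship
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Association is transitive here: if A is linked to B and B to C, all three fall into one set even when the probability for A and C is below the threshold.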
In step S15, a target point of interest representing an entity is selected from the set of points of interest.
In an optional implementation manner, a sum of similarity probabilities of a first interest point and each second interest point may be first calculated, where the first interest point is any interest point in the interest point set, and the second interest point is any interest point except the first interest point in the interest point set; and if the summation result is greater than or equal to a fourth preset threshold value, determining the first interest point as the target interest point. For example, the fourth preset threshold may be a maximum value of the summation result, that is, the first interest point with the maximum summation result is determined as the target interest point. The fourth preset threshold may also be set according to actual requirements, and the specific value of the fourth preset threshold is not limited in this embodiment.
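The case in which the fourth preset threshold equals the maximum summation result can be sketched as follows: for every interest point in the set, sum its similarity probabilities with all other members and keep the one with the largest sum. The function name and the `prob` layout (keys (i, j) with i < j, missing pairs treated as 0) are assumptions.

```python
from typing import Dict, Sequence, Tuple


def select_target(
    members: Sequence[int], prob: Dict[Tuple[int, int], float]
) -> Tuple[int, Dict[int, float]]:
    """Return the member with the largest summed probability, plus all sums."""

    def pair_probability(i: int, j: int) -> float:
        key = (i, j) if i < j else (j, i)
        return prob.get(key, 0.0)

    totals = {
        i: sum(pair_probability(i, j) for j in members if j != i)
        for i in members
    }
    # max() keeps the first member encountered when sums are tied.
    target = max(members, key=lambda i: totals[i])
    return target, totals
```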
After determining the target point of interest, the name information, the geographic location information, and the address information of the target point of interest may be determined as the name information, the geographic location information, and the address information of the entity. The characteristic information of the entity, such as the category, the telephone number, etc., may be determined by voting on the characteristic information of the interest points in the interest point set, for example, the category information with the highest number of votes may be determined as the category information of the entity.
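Voting on a piece of feature information can be sketched with a simple counter. The function name is an assumption, and so is the handling of missing values and ties, neither of which is specified above: missing values are ignored, and on a tie the value seen first wins.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


def vote_feature(values: Iterable[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the most frequent non-missing value, or None if there is none."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

For example, if the interest points in a set carry the categories "restaurant", "restaurant" and "hotel", the entity's category is taken to be "restaurant".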
By selecting the target interest points representing the entities from the interest point set, the unique description of the same entity in the result of fusing a plurality of data sources can be realized, and the problem of inaccuracy or repetition of the target interest points is avoided on the premise of meeting the coverage rate of the interest points.
The data processing method provided by the embodiment of the disclosure first calculates the similarity of two interest points in a preset dimension, then inputs the similarity in at least one preset dimension into a classification model to obtain the similarity probability of the two interest points, then clusters the interest points based on the similarity probability, and determines a target interest point representing an entity according to the clustering result. Because the similarity probability integrates the similarities of the two interest points in at least one preset dimension, clustering by similarity probability improves the accuracy of the clustering result. It avoids the inaccurate clustering that arises when interest points representing the same entity are clustered directly by similarity and two interest points have low similarity in a high-confidence dimension but high similarity in other dimensions. The scheme can therefore obtain highly accurate target interest points in one-to-one correspondence with entities, and greatly improves the robustness and accuracy of fusing a plurality of interest point data sources. While the interest point coverage rate is satisfied, problems such as low accuracy of the target interest points and repeated or insufficiently representative target interest points are solved, ensuring that users can use the interest point information efficiently and accurately.
In order to obtain the similarity between the two interest points in the name dimension, in an alternative implementation manner, the interest point information includes name information of the interest points, the preset dimension includes the name dimension, and the two interest points include a first interest point and a second interest point, with reference to fig. 3, in step S12, the method specifically includes:
in step S31, an entity characterized by the name information of the first point of interest is identified, and first entity information of the first point of interest is obtained.
In a specific implementation, Named Entity Recognition (NER) may be performed on the name information of the first point of interest to obtain the first Entity information.
The NER can identify named entities in the text to be processed. A named entity may be, for example, a place name or an organization name in the name information of the interest point. For example, the named entity in "express headquarters parking lot" is "express headquarters". The entities in "Beijing XXX roast duck shop YYY store" include "Beijing", "XXX", "roast duck shop", and "YYY store", which correspond to a place name, an organization name, a type noun, and a location noun, respectively; the organization name "XXX" may be used as the named entity.
In a specific implementation, a pre-trained long short-term memory (LSTM) neural network model can be used to predict the probability that each word in the name information is an entity word, and the entity word with the highest probability can be selected as the entity information of the interest point.
In step S32, the entity characterized by the name information of the second point of interest is identified, and the second entity information of the second point of interest is obtained.
In a specific implementation, the NER identification may be performed on the name information of the second point of interest to obtain the second entity information.
In step S33, if the first entity information is the same as the second entity information, the similarity between the name information of the first interest point and the name information of the second interest point is calculated, and the similarity between the first interest point and the second interest point in the name dimension is obtained.
In this embodiment, under the condition that the first entity information is the same as the second entity information, the similarity of the two interest points in the name dimension is calculated, so that the calculation amount can be reduced, and a more accurate similarity result can be obtained.
In order to calculate the similarity between the name information of the first interest point and the name information of the second interest point, in an optional implementation manner, a first vector corresponding to the name information of the first interest point may be first obtained; acquiring a second vector corresponding to the name information of the second interest point; and then calculating the distance between the first vector and the second vector to obtain the similarity between the name information of the first interest point and the name information of the second interest point.
In a specific implementation, the word2vec model may be adopted to respectively process the name information of the first interest point and the name information of the second interest point, so as to obtain a first vector and a second vector. The distance between the first vector and the second vector may be a cosine distance, and the cosine similarity of the first interest point and the second interest point in the name dimension may be calculated according to the following formula:
$$\mathrm{sim} = \cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
where sim represents the similarity of the first interest point and the second interest point in the name dimension, A represents the first vector, B represents the second vector, n represents the number of dimensions of the vector space in which the first vector and the second vector are located, i represents any dimension in the n-dimensional vector space, $A_i$ represents the component of the first vector in the i-th dimension, $B_i$ represents the component of the second vector in the i-th dimension, and n is a positive integer.
It should be noted that the calculation of the similarity in the name dimension is not limited to the above cosine similarity scheme, and any other scheme capable of calculating the similarity between phrases or between short texts may be substituted. For example, a word2vec model may first be adopted to obtain the first vector and the second vector, and the similarity of the two interest points in the name dimension may then be obtained by calculating the Word Mover's Distance (WMD).
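The cosine similarity scheme for the name dimension, gated by the entity comparison of step S33, can be sketched as follows. The vectors are assumed to have been produced already (for example by a word2vec model); the function names are assumptions, and returning `None` when the entity information differs is also an assumption, since the text above only describes the case in which the entity information is the same.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_a * norm_b)


def name_similarity(
    first_entity: str,
    first_vector: Sequence[float],
    second_entity: str,
    second_vector: Sequence[float],
) -> Optional[float]:
    """Compute the name similarity only when the entity information matches."""
    if first_entity != second_entity:
        return None  # pair skipped, which saves the vector computation
    return cosine_similarity(first_vector, second_vector)
```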
In order to obtain the similarity between the two interest points in the geographic dimension, in an alternative implementation manner, the interest point information includes geographic location information of the interest point, the preset dimension includes the geographic dimension, and the two interest points include a first interest point and a second interest point, with reference to fig. 4, in step S12, the method specifically includes:
in step S41, a geographic distance between the first point of interest and the second point of interest is calculated according to the geographic location information of the first point of interest and the geographic location information of the second point of interest.
In step S42, a first geographical location relationship between the first point of interest and the third point of interest is determined according to whether the area range indicated by the geographical location information of the first point of interest is within the area range indicated by the geographical location information of the third point of interest, where the first geographical location relationship includes that the first point of interest is inside or outside the third point of interest.
When the area range indicated by the geographical location information of the first interest point is located within the area range indicated by the geographical location information of the third interest point, determining that the first geographical location relationship between the first interest point and the third interest point is that the first interest point is located inside the third interest point.
When the area range indicated by the geographical location information of the first interest point is not within the area range indicated by the geographical location information of the third interest point, determining that the first geographical location relationship between the first interest point and the third interest point is that the first interest point is located outside the third interest point.
In step S43, a second geographic location relationship between the second point of interest and the third point of interest is determined according to whether the area range indicated by the geographic location information of the second point of interest is within the area range indicated by the geographic location information of the third point of interest, where the second geographic location relationship includes that the second point of interest is located inside or outside the third point of interest.
When the area range indicated by the geographic position information of the second interest point is located within the area range indicated by the geographic position information of the third interest point, determining that the second geographic position relationship between the second interest point and the third interest point is that the second interest point is located inside the third interest point.
When the area range indicated by the geographic position information of the second interest point is not located within the area range indicated by the geographic position information of the third interest point, determining that the second geographic position relationship between the second interest point and the third interest point is that the second interest point is located outside the third interest point.
In step S44, the similarity of the first interest point and the second interest point in the geographic dimension is obtained according to the geographic distance, the first geographic position relationship, and the second geographic position relationship.
In a specific implementation, if the first interest point and the second interest point are both located inside the third interest point and the geographic distance is greater than a first preset threshold, or the first interest point and the second interest point are both located outside the third interest point and the geographic distance is greater than the first preset threshold and less than a second preset threshold, the similarity of the first interest point and the second interest point in the geographic dimension is calculated according to the following formula,
Figure BDA0003077266990000151
where Sim (a, b) represents the similarity of the first interest point and the second interest point in the geographic dimension, a represents the geographic location indicated by the geographic location information of the first interest point, b represents the geographic location indicated by the geographic location information of the second interest point, and dist (a, b) represents the geographic distance between the first interest point and the second interest point. By adopting the formula, the similarity of the first interest point and the second interest point on the geographic dimension can be efficiently obtained, and the accuracy of similarity calculation is ensured.
The first preset threshold may be, for example, 1 meter, and the specific value may be set according to an actual requirement, which is not limited in this embodiment. The second preset threshold may be, for example, 50 meters, and the specific value may be set according to an actual requirement, which is not limited in this embodiment.
And if the first interest point and the second interest point are both positioned inside the third interest point and the geographic distance is less than or equal to a first preset threshold value, or the first interest point and the second interest point are both positioned outside the third interest point and the geographic distance is less than or equal to the first preset threshold value, determining the similarity of the first interest point and the second interest point in the geographic dimension as a first value.
The first value may be determined according to actual requirements, and may be equal to 1, for example.
Therefore, the similarity of two interest points with the geographic distance smaller than or equal to the first preset threshold (such as 1 meter) on the geographic dimension is determined as a fixed value, namely the first value, so that the influence of calculation errors can be reduced, and the accuracy of similarity calculation is improved.
Specifically, if the first interest point and the second interest point are located inside the same interest point, i.e. a third interest point, and the geographic distance is greater than a first preset threshold, the similarity of the first interest point and the second interest point in the geographic dimension can be calculated according to the following formula,
Figure BDA0003077266990000161
if the first interest point and the second interest point are located outside any interest point, i.e. the third interest point, and the geographic distance is greater than the first preset threshold and less than the second preset threshold, the similarity of the first interest point and the second interest point in the geographic dimension can be calculated according to the following formula,
Figure BDA0003077266990000162
if the first interest point and the second interest point are located inside the same interest point, that is, a third interest point, and the geographic distance is less than or equal to a first preset threshold, it may be determined that the similarity of the first interest point and the second interest point in the geographic dimension is a first value.
If the first interest point and the second interest point are located outside any interest point, that is, the third interest point, and the geographic distance is less than or equal to the first preset threshold, it may be determined that the similarity of the first interest point and the second interest point in the geographic dimension is a first value.
In practical application, it may be determined whether the first interest point is located inside other interest points according to the geographic location information, and if the first interest point is located inside a third interest point, the similarity in the geographic dimension between the first interest point and the second interest point (any interest point inside the third interest point except the first interest point) is calculated. The method for searching the interest point pair in the interest point can reduce the calculation amount on the basis of ensuring the accuracy.
If the first interest point is not located inside any interest point, the similarity in the geographic dimension is calculated between the first interest point and a second interest point that lies within a radius of the second preset threshold (provided the second interest point is likewise not located inside any interest point). Because similarity is calculated only between two interest points whose distance is smaller than the second preset threshold, the calculation amount can be reduced while accuracy is ensured.
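The case analysis of steps S41 to S44 can be summarized in a short sketch. The distance-to-similarity formula itself appears only as a figure in the source and is not reproduced here, so the sketch takes it as a caller-supplied function `decay`; the `placeholder_decay` shown is purely hypothetical and is NOT the formula of the disclosure. The haversine great-circle distance, the function names, and the use of `None` for cases the text does not describe are likewise assumptions. The thresholds default to the example values of 1 meter and 50 meters.

```python
import math
from typing import Callable, Optional

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters (assumed distance measure)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def placeholder_decay(dist_m: float) -> float:
    """Hypothetical stand-in only; not the formula of the disclosure."""
    return 1.0 / dist_m


def geo_similarity(
    dist_m: float,
    first_inside: bool,
    second_inside: bool,
    decay: Callable[[float], float],
    first_threshold_m: float = 1.0,
    second_threshold_m: float = 50.0,
    first_value: float = 1.0,
) -> Optional[float]:
    """Similarity in the geographic dimension, following the cases above.

    first_inside / second_inside state whether each interest point lies
    inside the same third interest point (True) or outside it (False).
    """
    if first_inside != second_inside:
        return None  # mixed case is not described in the text
    if dist_m <= first_threshold_m:
        return first_value  # very close points get the fixed first value
    if first_inside:
        return decay(dist_m)  # both inside, distance above first threshold
    if dist_m < second_threshold_m:
        return decay(dist_m)  # both outside, between the two thresholds
    return None  # both outside and too far apart: pair is not compared
```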
In order to obtain the similarity between the two interest points in the address dimension, in an alternative implementation manner, the interest point information includes address information of the interest points, the preset dimension includes the address dimension, and the two interest points include a first interest point and a second interest point, with reference to fig. 5, in step S12, the method specifically includes:
in step S51, if the address information of the first interest point and the address information of the second interest point belong to the same geocoding block, a third vector corresponding to the address information of the first interest point and a fourth vector corresponding to the address information of the second interest point are obtained.
In a specific implementation, the word2vec model may be adopted to respectively process the address information of the first interest point and the address information of the second interest point, so as to obtain a third vector and a fourth vector.
In step S52, the distance between the third vector and the fourth vector is calculated, and the similarity of the first interest point and the second interest point in the address dimension is obtained.
The distance between the third vector and the fourth vector may be a cosine distance, which is not limited in this embodiment.
In a specific implementation, the geocoding block where the interest point is located is taken as a range, and the similarity of two interest points located in the same geocoding block range in the address dimension is calculated.
Similar to similarity in the name dimension, the calculation of the similarity in the address dimension is not limited to the above scheme, and any scheme capable of calculating the similarity between phrases or the similarity between short texts may be substituted.
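As a non-limiting illustration, steps S51 and S52 might be sketched as follows. The sketch assumes that the address vectors have already been produced by some embedding model (for example word2vec) and that each address carries a geocoding block identifier; the function names and the choice to return `None` for addresses in different blocks are assumptions of this sketch. Cosine similarity is used here as the similarity derived from the cosine distance.

```python
from math import sqrt
from typing import Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def address_similarity(
    block_a: str,
    vector_a: Sequence[float],
    block_b: str,
    vector_b: Sequence[float],
) -> Optional[float]:
    """Compare two addresses only when they fall in the same geocoding block."""
    if block_a != block_b:
        return None  # different blocks: no address similarity is calculated
    return cosine_similarity(vector_a, vector_b)
```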
In order to obtain the similarity of the two interest points in the feature dimension, in an alternative implementation manner, the interest point information includes feature information of the interest points, the preset dimension includes the feature dimension, and the two interest points include a first interest point and a second interest point. Referring to Fig. 6, step S12 may specifically include:
in step S61, if the feature information of the first interest point is the same as the feature information of the second interest point, it is determined that the similarity between the first interest point and the second interest point in the feature dimension is a second value.
The second value may be determined according to an actual requirement, and may be equal to 1, for example, which is not limited in this embodiment.
In step S62, if the feature information of the first interest point and the feature information of the second interest point are different, it is determined that the similarity between the first interest point and the second interest point in the feature dimension is a third value.
The third value may be smaller than the second value, and the third value may be determined according to an actual requirement, for example, the third value may be equal to 0, which is not limited in this embodiment.
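As a non-limiting illustration, steps S61 and S62 reduce to a single comparison. The function name and the default values of 1 and 0 (taken from the examples given above) are assumptions of this sketch.

```python
def feature_similarity(
    feature_a: str,
    feature_b: str,
    second_value: float = 1.0,
    third_value: float = 0.0,
) -> float:
    """Return the second value for identical features, otherwise the third value."""
    return second_value if feature_a == feature_b else third_value
```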
The inventor found that, in use, the interest point a user obtains is often not the interest point the user actually cares about. For example, when the user is located at the south gate of the Palace Museum, the interest point for the south gate is usually obtained, whereas the interest point for the Palace Museum itself is likely the one the user wants. The geographic range of the Palace Museum is larger than that of its south gate and contains it, so the Palace Museum can be referred to as an upper-level (superior) point of interest of the south gate of the Palace Museum.
In order to solve the above problem, in an alternative implementation manner, the point of interest information includes name information of the point of interest, the multiple target points of interest form a tree structure, and the tree structure includes a first target point of interest and a second target point of interest, and after step S15, referring to fig. 7, the method may further include:
in step S71, an entity characterized by the name information of the first target point of interest is identified, and entity information of the first target point of interest is obtained.
For example, named entity recognition (NER) may be performed on the name information of the first target point of interest to obtain the entity information of the first target point of interest.
In step S72, an entity characterized by the name information of the second target point of interest is identified, and entity information of the second target point of interest is obtained.
For example, named entity recognition (NER) may be performed on the name information of the second target point of interest to obtain the entity information of the second target point of interest.
In step S73, if the entity information of the first target point of interest is the same as the entity information of the second target point of interest, a similarity between the name information of the first target point of interest and the name information of the second target point of interest is calculated.
In a specific implementation, the similarity between the name information of the first target interest point and the name information of the second target interest point may be calculated by referring to the method shown in fig. 3, and is not described herein again.
In step S74, if the similarity between the name information of the first target interest point and the name information of the second target interest point is greater than or equal to a fifth preset threshold, it is determined that the first target interest point and the second target interest point are similar target interest points.
The fifth preset threshold may be, for example, 0.6. The fifth preset threshold may be set according to actual requirements, and the specific value of the fifth preset threshold is not limited in this embodiment.
In step S75, the plurality of similar target points of interest constitute a target point of interest set.
In step S76, a similar target interest point satisfying a preset condition is selected from the target interest point set as a first similar target interest point, and the first similar target interest point is determined as a higher-level node of other similar target interest points in the target interest point set in the tree structure, where the preset condition is that the occupation ratio of entity information of the similar target interest point in the name information is the largest.
The superior node may be the immediately preceding (parent) node or a node N levels higher, where N is a positive integer.
In a specific implementation, the occupation ratio of the entity information of each similar target interest point in the target interest point set in the name information may be calculated, the similar target interest point with the largest occupation ratio is used as a first similar target interest point, the first similar target interest point is used as a superior node of other similar target interest points in the target interest point set in the tree structure, and the other similar target interest points in the target interest point set are used as inferior nodes of the first similar target interest point.
In actual use, the tree structure in the name dimension can provide the user with membership relationships in the name dimension. For example, the AA University Changping campus is mounted to AA University; that is, in the tree structure, AA University is an upper node of the AA University Changping campus, and the AA University Changping campus is a lower node of AA University.
In this implementation, the method may further include:
in step S77, a similar target interest point satisfying the preset condition among other similar target interest points is determined as a second similar target interest point.
Wherein the other similar target interest points may include similar target interest points in the target interest point set other than the first similar target interest point.
In step S78, the first similar target point of interest is determined as the previous-level node (that is, the immediate parent node) of the second similar target point of interest.
In step S79, the second similar target interest point is determined as a superior node of a third similar target interest point, where the third similar target interest point is any one of the similar target interest points in the target interest point set except the first similar target interest point and the second similar target interest point.
The superior node may be the immediately preceding (parent) node or a node N levels higher, where N is a positive integer.
In a specific implementation, the occupation ratio of the entity information of each similar target interest point in other similar target interest points in the name information may be calculated, the similar target interest point with the largest occupation ratio is used as a second similar target interest point, the second similar target interest point is used as a next-level node of the first similar target interest point, and the second similar target interest point is used as a superior node of a third similar target interest point.
For example, the target interest point set includes the following similar target interest points: AA University, the AA University Changping campus, the east zone of the AA University Changping campus, and the west zone of the AA University Changping campus. The entity information of each of these similar target interest points is AA University.
First, the occupation ratio of the entity information in the name information may be calculated for each similar target interest point in the target interest point set. The similar target interest point with the largest occupation ratio, namely AA University, is taken as the first similar target interest point and as the superior node of the other similar target interest points, namely the AA University Changping campus, the east zone of the AA University Changping campus, and the west zone of the AA University Changping campus.
Then, the occupation ratio of the entity information in the name information may be calculated for each of the other similar target interest points, namely the AA University Changping campus, its east zone, and its west zone. The one with the largest occupation ratio, namely the AA University Changping campus, is taken as the second similar target interest point. The AA University Changping campus is taken as the next-level node of the first similar target interest point (AA University) and as the superior node of the third similar target interest points, namely the east zone and the west zone of the AA University Changping campus.
Therefore, the east zone and the west zone of the AA University Changping campus are mounted to the AA University Changping campus, and the AA University Changping campus is mounted to AA University. That is, in the tree structure, AA University is the parent node of the AA University Changping campus, and the AA University Changping campus is in turn the parent node of its east zone and west zone, which are its child nodes.
In this implementation, a tree structure of points of interest may be formed in the name dimension. When the content operation is carried out, the nodes needing attention can be selected in a targeted manner according to the hierarchical relation in the tree structure, and the accuracy and the effectiveness of the content operation are improved.
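As a non-limiting illustration, steps S76 to S79 might be sketched as follows. Measuring the occupation ratio as a ratio of character counts, and the function names `occupation_ratio` and `build_name_tree`, are assumptions of this sketch; the disclosure does not fix how the ratio is measured.

```python
from typing import Dict, List, Optional


def occupation_ratio(entity: str, name: str) -> float:
    """Share of the name occupied by the entity text (character-count assumption)."""
    return len(entity) / len(name) if name else 0.0


def build_name_tree(names: List[str], entity: str) -> Dict[str, Optional[str]]:
    """Map each similar interest point name to the name of its parent node."""
    # Rank by occupation ratio, largest first; ties keep their input order.
    ranked = sorted(names, key=lambda n: occupation_ratio(entity, n), reverse=True)
    parents: Dict[str, Optional[str]] = {}
    if not ranked:
        return parents
    first = ranked[0]
    parents[first] = None  # the first similar interest point is the root
    if len(ranked) > 1:
        second = ranked[1]
        parents[second] = first  # next-level node of the first
        for third in ranked[2:]:
            parents[third] = second  # remaining points hang under the second
    return parents
```

Applied to the AA University example above, the sketch places AA University at the root, the Changping campus directly beneath it, and the east and west zones beneath the Changping campus.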
In another alternative implementation, the point of interest information includes geographical location information of the point of interest, the plurality of target points of interest form a tree structure, and the tree structure includes the first target point of interest and the second target point of interest, after step S15, the method may further include:
and if the area range indicated by the geographical position information of the first target interest point is located in the area range indicated by the geographical position information of the second target interest point, taking the second target interest point as a superior node of the first target interest point in the tree structure.
The superior node may be the immediately preceding (parent) node or a node N levels higher, where N is a positive integer.
In this way, a tree structure of points of interest may be formed in the geographic dimension. When content operation is carried out, the nodes needing attention can be selected in a targeted manner according to the hierarchical relation in the tree structure, which improves the accuracy and effectiveness of the content operation. The tree structure in the geographic dimension can also provide the user with the membership of the interest points in the geographic dimension; for example, the unpermitted lake is mounted to AA University.
Referring to Fig. 8, forming the tree structure helps users acquire related interest points more accurately during entity retrieval. It improves the working efficiency of content operation, integrates the related information of the interest points, reduces the proportion of noise data, and allows content to be organized around larger interest points during operation, thereby improving the accuracy and effectiveness of operation.
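As a non-limiting illustration, the geographic mounting rule might be sketched as follows. Representing each area range as an axis-aligned bounding box, choosing the smallest enclosing area as the parent, and the names `Box`, `contains`, and `build_geo_tree` are all assumptions of this sketch; the disclosure only requires that an enclosing interest point be a superior node.

```python
from typing import Dict, Optional, Tuple

# (min_x, min_y, max_x, max_y); a bounding box stands in for the area range.
Box = Tuple[float, float, float, float]


def contains(outer: Box, inner: Box) -> bool:
    """True when the inner box lies entirely within the outer box."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


def area(box: Box) -> float:
    return (box[2] - box[0]) * (box[3] - box[1])


def build_geo_tree(regions: Dict[str, Box]) -> Dict[str, Optional[str]]:
    """Map each interest point to its smallest enclosing interest point."""
    parents: Dict[str, Optional[str]] = {}
    for name, box in regions.items():
        # Identical boxes are skipped so that two points never mount to each other.
        enclosing = [
            other
            for other, other_box in regions.items()
            if other != name and other_box != box and contains(other_box, box)
        ]
        parents[name] = min(enclosing, key=lambda o: area(regions[o])) if enclosing else None
    return parents
```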
FIG. 9 is a block diagram illustrating a data processing apparatus according to an example embodiment. Referring to Fig. 9, the apparatus may include:
an information obtaining module 91 configured to obtain point of interest information of a plurality of data sources;
a similarity calculation module 92 configured to calculate, according to the interest point information, a similarity of the two interest points in a preset dimension, where the preset dimension includes at least one of: a geographic dimension, a name dimension, an address dimension, and a feature dimension;
a probability calculation module 93, configured to input the similarity of the two interest points in one or more preset dimensions into a classification model obtained through pre-training, so as to obtain a similarity probability of the two interest points, where the classification model is obtained through training based on the similarity of two sample interest points in one or more preset dimensions and whether the two sample interest points represent the same entity, and the similarity probability is used for representing the probability that the two interest points represent the same entity;
a clustering module 94 configured to cluster the plurality of interest points according to the similarity probability of the two interest points, so as to obtain an interest point set;
a target selection module 95 configured to select a target point of interest from the set of points of interest that represents the entity.
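As a non-limiting illustration of the input and output of the probability calculation module 93, the sketch below maps per-dimension similarities to a similarity probability. The logistic form, the weights, the bias, and the function name are assumptions of this sketch; the disclosure does not specify the type of classification model, and the numeric values are not trained parameters.

```python
from math import exp
from typing import Dict


def similarity_probability(
    similarities: Dict[str, float],
    weights: Dict[str, float],
    bias: float,
) -> float:
    """Logistic stand-in for the pre-trained classification model (assumption)."""
    score = bias + sum(
        weights.get(dimension, 0.0) * value
        for dimension, value in similarities.items()
    )
    return 1.0 / (1.0 + exp(-score))


# Hypothetical weights, for illustration only.
example_weights = {"geographic": 2.0, "name": 3.0, "address": 1.5, "feature": 1.0}
probability = similarity_probability(
    {"geographic": 0.9, "name": 0.8, "address": 0.7, "feature": 1.0},
    example_weights,
    bias=-4.0,
)
```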
In an optional implementation manner, the interest point information includes name information of an interest point, the preset dimension includes a name dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
identifying an entity represented by the name information of the first interest point to obtain first entity information of the first interest point;
identifying an entity represented by the name information of the second interest point to obtain second entity information of the second interest point;
if the first entity information is the same as the second entity information, calculating the similarity between the name information of the first interest point and the name information of the second interest point, and obtaining the similarity of the first interest point and the second interest point in the name dimension.
In an optional implementation manner, the similarity calculation module is specifically configured to:
acquiring a first vector corresponding to the name information of the first interest point;
acquiring a second vector corresponding to the name information of the second interest point;
and calculating the distance between the first vector and the second vector to obtain the similarity between the name information of the first interest point and the name information of the second interest point.
In an optional implementation manner, the interest point information includes geographical location information of an interest point, the preset dimension includes a geographical dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
calculating the geographic distance between the first interest point and the second interest point according to the geographic position information of the first interest point and the geographic position information of the second interest point;
determining a first geographical position relationship between the first interest point and a third interest point according to whether an area range indicated by the geographical position information of the first interest point is located in an area range indicated by the geographical position information of the third interest point, wherein the first geographical position relationship comprises that the first interest point is located inside or outside the third interest point;
determining a second geographical position relationship between the second interest point and the third interest point according to whether the area range indicated by the geographical position information of the second interest point is located in the area range indicated by the geographical position information of the third interest point, wherein the second geographical position relationship comprises that the second interest point is located inside or outside the third interest point;
and according to the geographic distance, the first geographic position relation and the second geographic position relation, obtaining the similarity of the first interest point and the second interest point on the geographic dimension.
In an optional implementation manner, the similarity calculation module is specifically configured to:
if the first interest point and the second interest point are both located inside the third interest point and the geographic distance is greater than a first preset threshold value, or the first interest point and the second interest point are both located outside the third interest point and the geographic distance is greater than the first preset threshold value and less than a second preset threshold value, calculating the similarity of the first interest point and the second interest point in the geographic dimension according to the following formula,
Figure BDA0003077266990000221
wherein Sim(a, b) represents the similarity of the first interest point and the second interest point in the geographic dimension, a represents the geographical location indicated by the geographical location information of the first interest point, b represents the geographical location indicated by the geographical location information of the second interest point, and dist(a, b) represents the geographic distance between the first interest point and the second interest point;
if the first interest point and the second interest point are both located inside the third interest point and the geographic distance is smaller than or equal to the first preset threshold, or the first interest point and the second interest point are both located outside the third interest point and the geographic distance is smaller than or equal to the first preset threshold, determining that the similarity of the first interest point and the second interest point in the geographic dimension is a first value.
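As a non-limiting illustration, the case analysis above might be sketched as follows. The formula itself is passed in as a callable because it is given only as a figure; `placeholder_decay` is a hypothetical stand-in and is not the formula of the disclosure. The default first value of 1 and the choice to return `None` for cases the text does not cover are likewise assumptions of this sketch.

```python
from typing import Callable, Optional


def placeholder_decay(distance: float, scale: float = 100.0) -> float:
    """Hypothetical decreasing function of distance; NOT the disclosed formula."""
    return scale / (scale + distance)


def geographic_similarity(
    distance: float,
    both_inside: bool,
    both_outside: bool,
    first_threshold: float,
    second_threshold: float,
    formula: Callable[[float], float],
    first_value: float = 1.0,
) -> Optional[float]:
    """Apply the threshold cases; inside/outside is relative to the third interest point."""
    if not (both_inside or both_outside):
        return None  # one inside and one outside: not covered by the text
    if distance <= first_threshold:
        return first_value
    if both_inside or distance < second_threshold:
        return formula(distance)
    return None  # both outside, at or beyond the second threshold: not covered
```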
In an optional implementation manner, the interest point information includes address information of an interest point, the preset dimension includes an address dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
if the address information of the first interest point and the address information of the second interest point belong to the same geocoding block, acquiring a third vector corresponding to the address information of the first interest point and a fourth vector corresponding to the address information of the second interest point;
and calculating the distance between the third vector and the fourth vector to obtain the similarity of the first interest point and the second interest point in the address dimension.
In an optional implementation manner, the interest point information includes feature information of an interest point, the preset dimension includes a feature dimension, the two interest points include a first interest point and a second interest point, and the similarity calculation module is specifically configured to:
if the feature information of the first interest point is the same as the feature information of the second interest point, determining that the similarity of the first interest point and the second interest point on the feature dimension is a second value;
and if the feature information of the first interest point is different from the feature information of the second interest point, determining that the similarity of the first interest point and the second interest point on the feature dimension is a third value.
In an optional implementation, the clustering module is specifically configured to:
if the similarity probability of the two interest points is greater than a third preset threshold value, establishing an association relationship between the two interest points;
aggregating the interest points that have the association relationship among the plurality of interest points into the interest point set.
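As a non-limiting illustration, the clustering module might be sketched with a union-find structure. Treating the association relationship as transitive (so that each interest point set is a connected component), and the function name `cluster_pois`, are assumptions of this sketch.

```python
from typing import Dict, Iterable, List, Set, Tuple


def cluster_pois(
    poi_ids: Iterable[str],
    pair_probabilities: Dict[Tuple[str, str], float],
    third_threshold: float,
) -> List[Set[str]]:
    """Group interest points linked by a similarity probability above the threshold."""
    parent = {poi_id: poi_id for poi_id in poi_ids}

    def find(poi_id: str) -> str:
        # Follow parent links to the representative, halving the path on the way.
        while parent[poi_id] != poi_id:
            parent[poi_id] = parent[parent[poi_id]]
            poi_id = parent[poi_id]
        return poi_id

    for (a, b), probability in pair_probabilities.items():
        if probability > third_threshold:
            parent[find(a)] = find(b)  # establish the association relationship

    groups: Dict[str, Set[str]] = {}
    for poi_id in parent:
        groups.setdefault(find(poi_id), set()).add(poi_id)
    return list(groups.values())
```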
In an optional implementation manner, the target selecting module is specifically configured to:
calculating the sum of the similarity probability of a first interest point and each second interest point, wherein the first interest point is any interest point in the interest point set, and the second interest point is any interest point except the first interest point in the interest point set;
and if the summation result is greater than or equal to a fourth preset threshold value, determining the first interest point as the target interest point.
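As a non-limiting illustration, the target selection module might be sketched as follows. Storing each pair probability under a single key order, treating missing pairs as probability 0, and the function names are assumptions of this sketch.

```python
from typing import Dict, List, Tuple


def pair_probability(
    probabilities: Dict[Tuple[str, str], float], a: str, b: str
) -> float:
    """Look up a symmetric pair probability; missing pairs count as 0.0."""
    return probabilities.get((a, b), probabilities.get((b, a), 0.0))


def select_targets(
    poi_set: List[str],
    probabilities: Dict[Tuple[str, str], float],
    fourth_threshold: float,
) -> List[str]:
    """Return the interest points whose summed similarity probability meets the threshold."""
    targets = []
    for first in poi_set:
        total = sum(
            pair_probability(probabilities, first, second)
            for second in poi_set
            if second != first
        )
        if total >= fourth_threshold:
            targets.append(first)
    return targets
```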
In an optional implementation manner, the interest point information includes name information of interest points, a plurality of target interest points form a tree structure, and the tree structure includes a first target interest point and a second target interest point, the apparatus further includes a first mounting module configured to:
identifying an entity represented by the name information of the first target interest point to obtain entity information of the first target interest point;
identifying an entity represented by the name information of the second target interest point to obtain entity information of the second target interest point;
if the entity information of the first target interest point is the same as the entity information of the second target interest point, calculating the similarity between the name information of the first target interest point and the name information of the second target interest point;
if the similarity between the name information of the first target interest point and the name information of the second target interest point is greater than or equal to a fifth preset threshold value, determining that the first target interest point and the second target interest point are similar target interest points;
the similar target interest points form a target interest point set;
selecting similar target interest points meeting preset conditions from the target interest point set as first similar target interest points, and determining the first similar target interest points as superior nodes of other similar target interest points in the target interest point set in the tree structure, wherein the preset conditions are that the occupation ratio of entity information of the similar target interest points in name information is maximum.
In an alternative implementation, the first mounting module is further configured to:
determining the similar target interest points meeting the preset condition from the other similar target interest points as second similar target interest points;
determining the first similar target interest point as a previous level node of the second similar target interest point;
and determining the second similar target interest point as a superior node of a third similar target interest point, wherein the third similar target interest point is any similar target interest point in the target interest point set except the first similar target interest point and the second similar target interest point.
In an optional implementation manner, the interest point information includes geographical location information of interest points, a plurality of target interest points form a tree structure, the tree structure includes a first target interest point and a second target interest point, the apparatus further includes a second mounting module configured to:
and if the area range indicated by the geographical position information of the first target interest point is located in the area range indicated by the geographical position information of the second target interest point, taking the second target interest point as a superior node of the first target interest point in the tree structure.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 10 is a block diagram of one type of electronic device 800 shown in the present disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 10, electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of a data processing method as described in any embodiment. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the device 800 and the relative positioning of components, such as a display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the data processing methods described in any of the embodiments.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 804 including instructions, the instructions being executable by the processor 820 of the electronic device 800 to perform the data processing method of any of the embodiments. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which comprises readable program code executable by the processor 820 of the electronic device 800 to perform the data processing method of any of the embodiments. Alternatively, the program code may be stored in a storage medium of the electronic device 800, which may be a non-transitory computer-readable storage medium, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 11 is a block diagram of an electronic device 1900 shown in the present disclosure. For example, the electronic device 1900 may be provided as a server.
Referring to Fig. 11, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources, represented by memory 1932, for storing instructions, such as application programs, executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the data processing method of any of the embodiments.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of data processing, the method comprising:
obtaining interest point information of a plurality of data sources;
calculating, according to the interest point information, the similarity of two interest points in a preset dimension, wherein the preset dimension comprises at least one of the following: a geographic dimension, a name dimension, an address dimension, and a feature dimension;
inputting the similarity of the two interest points in one or more preset dimensions into a classification model obtained through pre-training to obtain the similarity probability of the two interest points, wherein the classification model is obtained through training based on the similarity of the two sample interest points in one or more preset dimensions and the label of whether the two sample interest points represent the same entity, and the similarity probability is used for representing the probability that the two interest points represent the same entity;
clustering a plurality of interest points according to the similarity probability of the two interest points to obtain an interest point set;
and selecting a target interest point representing the entity from the interest point set.
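As a rough illustration of how the steps of claim 1 could fit together, the following Python sketch maps per-dimension similarities to a probability, clusters the interest points, and selects one representative per cluster. Everything specific here is an assumption made for illustration and is not stated in the claims: the names `WEIGHTS`, `BIAS`, `similarity_probability`, `cluster`, `pick_target` and `deduplicate`, the logistic form of the classifier and its weight values, the 0.5 threshold, the connected-components clustering, and the "most non-empty fields" selection rule.

```python
import math

# Hypothetical weights of a pre-trained classifier over four per-dimension
# similarities (geographic, name, address, feature). Illustrative values only.
WEIGHTS = [2.0, 3.0, 1.5, 1.0]
BIAS = -4.0


def similarity_probability(sims):
    """Map per-dimension similarities to a probability with a logistic model."""
    z = BIAS + sum(w * s for w, s in zip(WEIGHTS, sims))
    return 1.0 / (1.0 + math.exp(-z))


def cluster(n_points, pair_probs, threshold=0.5):
    """Group points linked by a pairwise probability at or above the threshold
    (connected components via union-find; an assumed clustering strategy)."""
    parent = list(range(n_points))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (i, j), p in pair_probs.items():
        if p >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n_points):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def pick_target(points, group):
    """Pick the record with the most non-empty fields (assumed selection rule)."""
    return max(group, key=lambda i: sum(1 for v in points[i].values() if v))


def deduplicate(points, pair_sims):
    """pair_sims maps (i, j) to the list of per-dimension similarities."""
    pair_probs = {pair: similarity_probability(s) for pair, s in pair_sims.items()}
    return [(group, pick_target(points, group))
            for group in cluster(len(points), pair_probs)]
```

`deduplicate` returns a list of `(indices, target_index)` pairs, one per interest point set. A trained model from any library could replace `similarity_probability` without changing the rest of the sketch.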
2. The data processing method of claim 1, wherein the interest point information includes name information of interest points, the preset dimension includes a name dimension, the two interest points include a first interest point and a second interest point, and the step of calculating the similarity of the two interest points in the preset dimension according to the interest point information includes:
identifying an entity represented by the name information of the first interest point to obtain first entity information of the first interest point;
identifying an entity represented by the name information of the second interest point to obtain second entity information of the second interest point;
if the first entity information is the same as the second entity information, calculating the similarity between the name information of the first interest point and the name information of the second interest point, and obtaining the similarity of the first interest point and the second interest point in the name dimension.
3. The data processing method of claim 2, wherein the step of calculating the similarity between the name information of the first point of interest and the name information of the second point of interest comprises:
acquiring a first vector corresponding to the name information of the first interest point;
acquiring a second vector corresponding to the name information of the second interest point;
and calculating the distance between the first vector and the second vector to obtain the similarity between the name information of the first interest point and the name information of the second interest point.
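The name-dimension steps of claims 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions: character-bigram counts stand in for whatever vector representation is actually used, cosine similarity (one minus the cosine distance) stands in for the unspecified distance measure, and `None` is returned when the two entities differ because the claims do not say what happens in that case. The function names are hypothetical.

```python
import math
from collections import Counter


def bigram_vector(text):
    """Hypothetical stand-in for a learned embedding: character-bigram counts."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def name_similarity(name_a, entity_a, name_b, entity_b):
    """Compare names only when both records resolve to the same entity."""
    if entity_a != entity_b:
        return None  # assumed: the claims leave the differing-entity case open
    return cosine_similarity(bigram_vector(name_a), bigram_vector(name_b))
```

Gating on the recognized entity first avoids comparing, say, a restaurant name against a parking lot name that happens to share most of its characters.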
4. The data processing method of claim 1, wherein the interest point information includes geographical location information of interest points, the preset dimension includes a geographical dimension, the two interest points include a first interest point and a second interest point, and the step of calculating similarity of the two interest points in the preset dimension according to the interest point information includes:
calculating the geographic distance between the first interest point and the second interest point according to the geographic position information of the first interest point and the geographic position information of the second interest point;
determining a first geographical position relationship between the first interest point and a third interest point according to whether an area range indicated by the geographical position information of the first interest point is located in an area range indicated by the geographical position information of the third interest point, wherein the first geographical position relationship comprises that the first interest point is located inside or outside the third interest point;
determining a second geographical position relationship between the second interest point and the third interest point according to whether the area range indicated by the geographical position information of the second interest point is located in the area range indicated by the geographical position information of the third interest point, wherein the second geographical position relationship comprises that the second interest point is located inside or outside the third interest point;
and according to the geographic distance, the first geographic position relation and the second geographic position relation, obtaining the similarity of the first interest point and the second interest point on the geographic dimension.
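The two geometric inputs of claim 4, the distance between two interest points and whether an interest point lies inside a third one, might be computed as in the sketch below. It assumes that positions are latitude/longitude pairs in degrees, that the haversine great-circle formula is an acceptable distance, and that area ranges are axis-aligned bounding boxes; none of these choices is stated in the claims, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_inside(inner, outer):
    """True when bounding box `inner` lies entirely within `outer`.

    Boxes are (min_lat, min_lon, max_lat, max_lon) tuples (assumed format).
    """
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])
```

For example, a shop and a ticket office could each be tested with `is_inside` against the bounding box of a shopping mall that acts as the third interest point.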
5. The data processing method according to claim 4, wherein the step of obtaining the similarity of the first interest point and the second interest point in the geographic dimension according to the geographic distance, the first geographic position relationship and the second geographic position relationship comprises:
if the first interest point and the second interest point are both located inside the third interest point and the geographic distance is greater than a first preset threshold value, or the first interest point and the second interest point are both located outside the third interest point and the geographic distance is greater than the first preset threshold value and less than a second preset threshold value, calculating the similarity of the first interest point and the second interest point in the geographic dimension according to the following formula,
Figure FDA0003077266980000021
wherein Sim (a, b) represents the similarity of the first interest point and the second interest point in the geographic dimension, a represents the geographic position indicated by the geographic position information of the first interest point, b represents the geographic position indicated by the geographic position information of the second interest point, and dist (a, b) represents the geographic distance between the first interest point and the second interest point;
if the first interest point and the second interest point are both located inside the third interest point and the geographic distance is smaller than or equal to the first preset threshold, or the first interest point and the second interest point are both located outside the third interest point and the geographic distance is smaller than or equal to the first preset threshold, determining that the similarity of the first interest point and the second interest point in the geographic dimension is a first value.
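The case analysis of claim 5 can be written as a small piecewise function. Because the formula itself appears only as an image in the source, the sketch takes it as a caller-supplied function `formula` rather than guessing its form. The default of 1.0 for the "first value" is likewise an assumption, and `None` marks the combinations that claim 5 does not address.

```python
def geographic_similarity(dist, a_inside, b_inside, t1, t2, formula,
                          first_value=1.0):
    """Piecewise rule of claim 5.

    `formula` stands in for the formula not reproduced in this text;
    `first_value` is the unspecified "first value" (1.0 is an assumption).
    Returns None for cases the claim does not cover.
    """
    both_inside = a_inside and b_inside
    both_outside = not a_inside and not b_inside
    if not (both_inside or both_outside):
        return None  # one inside, one outside: not covered by claim 5
    if dist <= t1:
        return first_value  # close enough in either configuration
    if both_inside or dist < t2:
        return formula(dist)  # inside: dist > t1; outside: t1 < dist < t2
    return None  # both outside and dist >= t2: not covered by claim 5
```

Here `t1` and `t2` are the first and second preset thresholds, and `a_inside` and `b_inside` are the first and second geographical position relationships from claim 4.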
6. The data processing method according to claim 1, wherein the point of interest information includes address information of a point of interest, the preset dimension includes an address dimension, the two points of interest include a first point of interest and a second point of interest, and the step of calculating the similarity of the two points of interest in the preset dimension according to the point of interest information includes:
if the address information of the first interest point and the address information of the second interest point belong to the same geocoding block, acquiring a third vector corresponding to the address information of the first interest point and a fourth vector corresponding to the address information of the second interest point;
and calculating the distance between the third vector and the fourth vector to obtain the similarity of the first interest point and the second interest point in the address dimension.
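A minimal sketch of the address-dimension steps of claim 6 follows. It assumes that a geocoding block can be identified by a shared prefix of a geohash-style code, that addresses are vectorized as token counts, and that a Euclidean distance d is converted to a similarity as 1 / (1 + d). These are illustrative choices with hypothetical names, not details taken from the claims.

```python
import math
from collections import Counter


def same_block(code_a, code_b, prefix_len=6):
    """Treat two geocodes as one block when they share a prefix
    (hypothetical convention, in the style of geohash cells)."""
    return code_a[:prefix_len] == code_b[:prefix_len]


def token_vector(address):
    """Hypothetical vectorization: counts of whitespace-separated tokens."""
    return Counter(address.lower().split())


def address_similarity(addr_a, code_a, addr_b, code_b):
    """Return a similarity in (0, 1], or None when the blocks differ."""
    if not same_block(code_a, code_b):
        return None  # the claim only defines the same-block case
    u, v = token_vector(addr_a), token_vector(addr_b)
    dist = math.sqrt(sum((u[k] - v[k]) ** 2 for k in set(u) | set(v)))
    return 1.0 / (1.0 + dist)
```

Checking the geocoding block first keeps the vector comparison limited to addresses that are already known to be geographically close.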
7. A data processing apparatus, characterized in that the apparatus comprises:
the information acquisition module is configured to acquire interest point information of a plurality of data sources;
a similarity calculation module configured to calculate, according to the interest point information, a similarity of the two interest points in a preset dimension, where the preset dimension includes at least one of: a geographic dimension, a name dimension, an address dimension, and a feature dimension;
a probability calculation module configured to input the similarity of the two interest points in one or more preset dimensions into a classification model obtained through pre-training to obtain a similarity probability of the two interest points, where the classification model is obtained through training based on the similarity of the two sample interest points in one or more preset dimensions and whether the two sample interest points represent the same entity, and the similarity probability is used for representing the probability that the two interest points represent the same entity;
the clustering module is configured to cluster the interest points according to the similarity probability of the two interest points to obtain an interest point set;
a target selection module configured to select a target point of interest representing the entity from the set of points of interest.
8. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the data processing method of any one of claims 1 to 6.
9. A computer-readable storage medium whose instructions, when executed by a processor of an electronic device, enable the electronic device to perform the data processing method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the data processing method according to any one of claims 1 to 6 when executed by a processor.
CN202110556281.9A 2021-05-21 2021-05-21 Data processing method and device, electronic equipment and storage medium Pending CN113420595A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110556281.9A CN113420595A (en) 2021-05-21 2021-05-21 Data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113420595A true CN113420595A (en) 2021-09-21

Family

ID=77712691

Country Status (1)

Country Link
CN (1) CN113420595A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376205A (en) * 2018-09-07 2019-02-22 顺丰科技有限公司 Excavate method, apparatus, equipment and the storage medium of address point of interest relationship
CN110489507A (en) * 2019-08-16 2019-11-22 腾讯科技(深圳)有限公司 Determine the method, apparatus, computer equipment and storage medium of point of interest similarity
CN111209354A (en) * 2018-11-22 2020-05-29 北京搜狗科技发展有限公司 Method and device for judging repetition of map interest points and electronic equipment
CN111954175A (en) * 2020-08-25 2020-11-17 腾讯科技(深圳)有限公司 Method for judging visiting of interest point and related device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115905456A (en) * 2023-01-06 2023-04-04 浪潮电子信息产业股份有限公司 Data identification method, system, equipment and computer readable storage medium
CN116257515A (en) * 2023-05-16 2023-06-13 之江实验室 Geographic interest point deduplication method, device and medium based on machine learning
CN117591904A (en) * 2024-01-18 2024-02-23 中睿信数字技术有限公司 Freight car clustering method based on density clustering
CN117591904B (en) * 2024-01-18 2024-04-16 中睿信数字技术有限公司 Freight car clustering method based on density clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination