CN112381166A - Information point identification method and device and electronic equipment

Info

Publication number: CN112381166A
Application number: CN202011313164.1A
Authority: CN
Inventors: 谢红伟; 宿玲玲
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-11-20
Filing date: 2020-11-20
Publication date: 2021-02-19
Anticipated expiration: 2040-11-20
Also published as: CN112381166B

Abstract

The application discloses an information point identification method and device and electronic equipment, and relates to the technical field of deep learning. The specific implementation scheme is as follows: acquiring a first similarity feature and a second similarity feature, where the first similarity feature represents the text semantic similarity between a first information point and a second information point, the second similarity feature represents N similarities of the first information point and the second information point in N dimensions, and N is a positive integer greater than 1; fusing the first similarity feature and the second similarity feature to obtain a target feature; and determining, based on the target feature, whether the first information point and the second information point are the same information point. The technology of the application addresses the low identification accuracy of existing information point identification techniques and improves the accuracy of information point identification.

Description

Information point identification method and device and electronic equipment

Technical Field

The application relates to the technical field of intelligent search, in particular to the technical field of deep learning, and specifically relates to an information point identification method and device and electronic equipment.

Background

Information point identification technology judges, from the multi-dimensional features of two information points, whether the two information points belong to the same spatial entity. It is widely applied in scenarios such as bringing information point data online, deduplicating information point data, supplementing high-quality basic attributes of information points, information point reservation services, and supplementing high-quality content attributes, and it is one of the core foundational technologies of the map content ecosystem.

At present, information point identification is mainly performed in two stages: the first stage calculates the text semantic similarity of two information points, and the second stage judges whether the two information points belong to the same spatial entity based on that text semantic similarity and on the similarities in other dimensions.

Disclosure of Invention

The disclosure provides an information point identification method and device and electronic equipment.

According to a first aspect of the present disclosure, there is provided an information point identification method, including:

acquiring a first similarity feature and a second similarity feature, where the first similarity feature represents the text semantic similarity between a first information point and a second information point, the second similarity feature represents N similarities of the first information point and the second information point in N dimensions, and N is a positive integer greater than 1;

fusing the first similarity feature and the second similarity feature to obtain a target feature;

and determining, based on the target feature, whether the first information point and the second information point are the same information point.

According to a second aspect of the present disclosure, there is provided an information point identifying apparatus including:

an acquisition module for acquiring a first similarity feature and a second similarity feature, where the first similarity feature represents the text semantic similarity between a first information point and a second information point, the second similarity feature represents N similarities of the first information point and the second information point in N dimensions, and N is a positive integer greater than 1;

a fusion module for fusing the first similarity feature and the second similarity feature to obtain a target feature;

and a determining module for determining, based on the target feature, whether the first information point and the second information point are the same information point.

According to a third aspect of the present disclosure, there is provided an electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the methods of the first aspect.

According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform any one of the methods of the first aspect.

According to the technology of the application, the problem that the identification accuracy rate of the information point identification technology is low is solved, and the accuracy rate of information point identification is improved.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

FIG. 1 is a schematic flow chart of an information point identification method according to a first embodiment of the present application;

FIG. 2 is a frame diagram of an implementation of the information point identification method;

FIG. 3 is a block diagram of a computation framework for address similarity;

FIG. 4 is a block diagram of a text semantic matching network;

FIG. 5 is a schematic structural diagram of an information point identification apparatus according to a second embodiment of the present application;

FIG. 6 is a block diagram of an electronic device for implementing the information point identification method according to the embodiment of the present application.

Detailed Description

The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

First embodiment

As shown in FIG. 1, the present application provides an information point identification method, including the following steps:

Step S101: acquiring a first similarity feature and a second similarity feature, where the first similarity feature represents the text semantic similarity between a first information point and a second information point, the second similarity feature represents N similarities of the first information point and the second information point in N dimensions, and N is a positive integer greater than 1.

In this embodiment, the information point identification method relates to the technical field of intelligent search, and in particular, to the technical field of deep learning, and may be applied to an electronic device, where the electronic device may be a server or a terminal, and is not specifically limited herein.

In some application scenarios, the information point identification technology may be referred to as an information point chain finger technology, and may be widely applied to scenarios such as bringing information point data online, information point data deduplication, supplementing high-quality basic attributes of information points, information point reservation services, and supplementing high-quality content attributes. Specifically, a given information point is linked to an information point in the map system that is either the same as or different from it, so that corresponding processing can be performed.

For example, in an application scenario where the data of the information point is online, before the target information point is online, it may be determined whether the same information point exists in the map system, and specifically, the target information point may be matched with each information point in the map system to determine whether the same information point exists in the map system.

When the target information point is linked to the same information point in the map system, the target information point does not need to be newly recorded; when it is linked only to information points different from it, the target information point is recorded. In addition, when the same information point as the target information point is found in the map system, the attribute content of the target information point can be used to supplement the attribute content of that information point in the map system, for example by adding multimedia content, so as to enrich the attribute content of the information points in the map system.

For another example, duplication checking may be performed on online data, and information point identification may be performed on information points included in a map system to perform information point duplication checking.

For example, the same or similar information points can be recalled from the map system for the target information points to realize the user search function.

In step S101, the first information point and the second information point are two information points, which may also be referred to as points of interest (POIs). In a geographic information system, an information point may be a house, a shop, a mailbox, a bus station, or a scenic spot.

The first information point and the second information point can be information points in a map system. In some application scenarios, one information point may be an information point in a map system, and the other information point is not an information point in the map system, and the specific application of the method is to match a target information point (which may not be an information point in the map system) with an information point (which may be referred to as an information point to be matched) in the map system, and determine whether the target information point and the information point to be matched are the same information point, so as to perform corresponding processing.

In an application scenario where one of the first information point and the second information point is an information point in the map system and the other is not, the first information point may be the target information point and the second information point may be the information point to be matched in the map system; alternatively, the first information point may be the information point to be matched in the map system and the second information point may be the target information point. In the following embodiments, the first information point is taken as the target information point, and the second information point is taken as the information point to be matched in the map system.

The first similarity feature represents the text semantic similarity of the first information point and the second information point, and the text semantic similarity is one kind of text similarity. The text similarity between the first information point and the second information point may include similarities in one or more dimensions, each of which may be obtained from the text information of the two information points. The text information refers to information of an information point that is represented in text form, and may specifically include name information, tag information, address information, location information, contact information, and the like.

The tag information of the first information point may indicate the classification category of the first information point. For example, if the tag information of the first information point is "leisure entertainment", the classification category of the first information point is leisure entertainment. In addition, the first information point may carry multi-level tags; for example, if the first information point is a business hotel, the first-level tag may be leisure and recreation, and the second-level tag may be hotel.

The address information of the first information point may include a city, a county, a road, a house number, and the like where the first information point is located, and the location information of the first information point may refer to geographical location information determined by navigation and positioning, such as latitude and longitude information.

The contact information of the first information point may include a contact phone, a website, an account, and the like corresponding to the first information point.

The text semantic similarity is the similarity of the first information point and the second information point in the name dimension. In essence, the name information of the first information point is compared with the name information of the second information point to determine whether the two names are similar.

The text similarity may further include other similarities besides name dimension, where the N similarities are similarities in the text similarity except the text semantic similarity, for example, the N similarities include address similarity, spatial similarity, tag similarity, and phone similarity.

The label similarity may be the similarity between the first information point and the second information point in the label dimension. In essence, the label information of the first information point is compared with the label information of the second information point to determine whether the two labels are similar.

The spatial similarity may be the similarity between the first information point and the second information point in the position dimension. In essence, the spatial distance between the first information point and the second information point is determined in order to judge whether the two positions are close.

The address similarity may be the similarity between the first information point and the second information point in the address dimension. In essence, the address information of the first information point is compared with the address information of the second information point to determine whether the two addresses are similar.

The telephone similarity may be the similarity between the first information point and the second information point in the contact information dimension. In essence, the contact phone of the first information point is compared with the contact phone of the second information point to determine whether the two contact phones match.

The second similarity feature may be used to represent other similarities, except for the text semantic similarity, in the text similarity between the first information point and the second information point, and specifically may be used to represent at least one of the address similarity, the spatial similarity, the tag similarity, the phone similarity, and the like.

The first similarity feature may be determined based on the text semantic similarity and is a feature expression of the text semantic similarity. Various feature expressions are possible; for example, the feature may be expressed as a binary value or as a decimal value.

The second similarity feature may be determined based on at least one similarity among address similarity, spatial similarity, tag similarity, telephone similarity, and the like, and in the case that the second similarity feature represents the similarities of multiple dimensions, feature information representing the similarities of the multiple dimensions needs to be fused to obtain the second similarity feature. The feature information for representing the similarity of each dimension is determined based on the similarity of the dimension and is a feature expression of the similarity of the dimension, and in order to unify the feature expressions, the similarity of each dimension is expressed by adopting a binary numerical value.

Referring to FIG. 2, which is a schematic diagram of an implementation framework of the information point identification method, the method may be implemented by an end-to-end target model, which may be referred to as an end-to-end chain finger model. The model aims to determine whether the target information point is linked to an information point in the map system that is the same as the target information point or to one that is different from it, so that information points can be recalled accordingly.

The end-to-end chain finger model takes the text information of two information points as input and directly outputs the identification result for the two information points. There is no need to first determine the text semantic similarity of the two information points with a separate deep semantic matching model and then input that text semantic similarity, together with the other similarities, into a separate chain finger model.

In such a two-stage scheme, the deep semantic matching model and the chain finger model each produce their own output, so their optimization targets are not unified, which impairs the optimization of the models. Moreover, the output of the deep semantic matching model carries a large feature contribution in the chain finger model, so the identification effect cannot be guaranteed and the accuracy of information point identification suffers.

In this embodiment, the optimization target is unified by the end-to-end chain finger model, which avoids the loss of model optimization effect caused by non-unified optimization targets.

As shown in FIG. 2, the input of the end-to-end chain finger model may be the text information of two information points, which may include address information, location information, tag information, contact phone information, and name information. The main structure of the end-to-end chain finger model may comprise a shallow "wide" part and a deep "deep" part. The wide part obtains feature information representing the address similarity, the spatial similarity, the tag similarity, and the telephone similarity, respectively, and derives the second similarity feature from these four pieces of feature information. The deep part obtains the first similarity feature based on the name information of the two information points.

The address similarity may be determined by address resolution and address comparison. Referring to FIG. 3, which is a schematic diagram of a computation framework for the address similarity, the address information of the first information point and the address information of the second information point may each be input to an address resolver. The address resolver may perform address resolution using the named entity recognition technology of Lexical Analysis of Chinese (LAC) to obtain an address resolution result of the first information point and an address resolution result of the second information point. The LAC may have a stacked bidirectional Gated Recurrent Unit (GRU) structure.

The address resolution result may be an address with a sequence structure, in which each address label carries a specific meaning: CIT represents a city, DIS represents a district or county, ROAD represents a road, SITE represents a house number or a floor number, and POI represents an information point; SEG_ROAD indicates that the entity attribute is a road, SEG_ROAD_NUM represents the house number on the road, SEG_POI indicates that the entity attribute is an information point, and SEG_FLOOR_NUM represents the floor number of the information point.

The address resolution results of the two information points are then input into an address comparator, which outputs the address similarity of the two information points.

When both information points have accurate addresses, if at least one of the road name, the house number, and the floor number differs, the addresses of the two information points are different, and the address similarity may be represented by the value 1; if the road name, the house number, and the floor number are all the same, the addresses are the same, and the address similarity may be represented by the value 0. When at least one information point lacks an accurate address, the address similarity of the two information points is unknown and may be represented by another value, such as the value 2, which is not specifically limited herein.
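By way of non-limiting illustration, the address comparison rule above can be sketched as follows. The dictionary keys (`road`, `house_number`, `floor_number`) and the test for an "accurate" address are assumptions made for this example; the text does not specify the resolver's output schema or what qualifies an address as accurate.

```python
from typing import Dict, Optional

# Values taken from the description: 0 = same, 1 = different, 2 = unknown.
ADDRESS_SAME, ADDRESS_DIFFERENT, ADDRESS_UNKNOWN = 0, 1, 2

# Hypothetical parsed-address shape: component name -> text (or None if absent).
ParsedAddress = Dict[str, Optional[str]]


def is_accurate(address: ParsedAddress) -> bool:
    """Assumption: an address is 'accurate' if it has a road and a house number."""
    return bool(address.get("road")) and bool(address.get("house_number"))


def address_similarity(first: ParsedAddress, second: ParsedAddress) -> int:
    """Compare road name, house number, and floor number of two parsed addresses."""
    if not (is_accurate(first) and is_accurate(second)):
        return ADDRESS_UNKNOWN
    for component in ("road", "house_number", "floor_number"):
        if first.get(component) != second.get(component):
            return ADDRESS_DIFFERENT
    return ADDRESS_SAME


a = {"road": "Shangdi 10th Street", "house_number": "10", "floor_number": "3"}
b = {"road": "Shangdi 10th Street", "house_number": "12", "floor_number": "3"}
c = {"road": None, "house_number": None, "floor_number": None}
results = (address_similarity(a, dict(a)), address_similarity(a, b), address_similarity(a, c))
```

In this sketch a missing floor number on only one side counts as a difference; a real comparator might instead skip components that are absent from either address.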

The spatial similarity may be determined by calculating the distance between the first information point and the second information point. Specifically, based on the position information of the first information point and the position information of the second information point, the Euclidean distance between the two information points may be calculated. This Euclidean distance is the absolute distance between the first information point and the second information point, and the spatial similarity may be obtained by normalizing the absolute distance.

In an alternative embodiment, the absolute distance may be directly normalized to obtain a spatial similarity, where the spatial similarity represents the absolute distance between two information points.

In another alternative embodiment, the distance criterion for judging whether two information points belong to the same spatial entity may differ by category. For example, two same-name parks 300 meters apart may be the same entity, two same-name chain brand stores 300 meters apart may be one entity or two, and two toilets 300 meters apart are probably not the same entity.

When information points are linked, for example when searching for a same-name park within 300 meters, using a uniform recall distance may result in missed recalls or false recalls. Therefore, different recall distances can be set for information points of different classification categories, and the absolute distance is normalized based on the recall distance to obtain the spatial similarity, in which case the spatial similarity represents the relative distance between the two information points.

Examples of recall distances for information points of different classification categories may be as shown in table 1 below.

Table 1. Recall distances for information points of some classification categories

Classification category (first level; second level)	Recall distance
Food; snack and fast-food restaurant	200 m
Hotel; star-rated hotel	500 m
Leisure and entertainment; leisure square	1000 m
Tourist attraction; zoo	5000 m

In addition, the influence of the distance on the chain finger result (which may also be called the recall result or search result) should be non-linear: when the distance is below a certain range, the chain finger result is "same"; when the distance is above that range, the chain finger result is "different"; and within the intermediate range the chain finger result changes gradually.

Therefore, when calculating the spatial similarity, the recall distances preset for the two information points may be looked up based on the tag information of the two information points, and the absolute distance may then be normalized based on the recall distance corresponding to the two information points to obtain the relative distance between them. This relative distance is the spatial similarity between the two information points.

The absolute distance may be normalized based on the recall distance of two information points using a dynamic sigmoid function, which is shown in equation (1):

In formula (1), y is the spatial similarity, which represents the relative distance between the two information points and takes values in [0, 1]; d represents the absolute distance between the first information point and the second information point; and n represents the recall distance corresponding to the two information points.

When the classification categories, i.e., the label information, of the first information point and the second information point are the same, the recall distance is the recall distance corresponding to the classification category of the first information point or the second information point, and when the classification categories of the first information point and the second information point are different, the recall distance may be the average of the two recall distances of the two information points. For example, if the classification category of the first information point is hotel and the recall distance is 500 meters, and the classification category of the second information point is leisure and entertainment and the recall distance is 1000 meters, the recall distance may be 750 meters.

The spatial similarity of two information points is a continuous value between 0 and 1; the smaller the spatial similarity, the closer the two information points are in space, and the larger the spatial similarity, the farther apart they are.
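As a non-limiting sketch, the category-dependent normalization can be illustrated as follows. Formula (1) itself is not reproduced in this text, so the logistic form used here, including the hypothetical `steepness` parameter, is an assumption chosen only to match the stated properties: the output lies in [0, 1], grows with the absolute distance d, and changes gradually around the recall distance n. The recall distances come from Table 1, keyed by first-level category for brevity.

```python
import math

# Recall distances in meters from Table 1 (first-level category only).
RECALL_DISTANCE_M = {
    "food": 200.0,
    "hotel": 500.0,
    "leisure and entertainment": 1000.0,
    "tourist attraction": 5000.0,
}


def recall_distance(category_a: str, category_b: str) -> float:
    """Same category: its own recall distance; different categories: the average."""
    n_a = RECALL_DISTANCE_M[category_a]
    n_b = RECALL_DISTANCE_M[category_b]
    return n_a if category_a == category_b else (n_a + n_b) / 2.0


def spatial_similarity(d: float, n: float, steepness: float = 4.0) -> float:
    """Assumed 'dynamic sigmoid': a logistic curve centered on the recall distance n.

    Smaller values mean the two information points are closer together.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (d - n) / n))


n = recall_distance("hotel", "leisure and entertainment")  # average of 500 m and 1000 m
near = spatial_similarity(100.0, n)
at_recall = spatial_similarity(750.0, n)
far = spatial_similarity(3000.0, n)
```

Because the curve is scaled by n, the same absolute distance yields a smaller value for categories with a large recall distance (such as tourist attractions) than for categories with a small one (such as food).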

The label similarity is mainly obtained from statistics over the online chain finger relationships, and may have three levels: different, similar, and identical.

"Different" indicates that information points of the two classification categories rarely appear in the same chain finger aggregation group; "similar" indicates that information points of the two classification categories appear in the same chain finger aggregation group with a certain probability; and "identical" indicates that information points of the two classification categories mostly co-occur in the same chain finger aggregation group. A chain finger aggregation group is a group in which the same information points are aggregated together for easy recall.

The value 2 may be used to represent that the label information of the two information points is different, the value 1 may be used to represent that the label information of the two information points is similar, and the value 0 may be used to represent that the label information of the two information points is identical. In addition, when the tag information of at least one information point is empty and cannot be compared, the label similarity can be represented by the value -1.

The phone similarity may be obtained through two basic procedures: phone parsing and phone comparison. Phone parsing splits the contact phone field of an information point into several structured phone numbers according to punctuation marks or spaces. Phone comparison assembles the structured phone numbers of the two information points into phone pairs and compares each pair; if any pair matches, the contact phones of the two information points are the same, and otherwise they are different.

The phone similarity may be characterized by a value of 0 in case the contact phones of the two information points are the same, and by a value of 1 in case the contact phones of the two information points are different.
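The two procedures can be sketched as follows, as an illustration only. The set of separator characters and the choice to strip hyphens inside a number are assumptions; the text says only that the field is split on punctuation marks or spaces.

```python
import re
from itertools import product
from typing import Set

# Values taken from the description: 0 = same contact phone, 1 = different.
PHONE_SAME, PHONE_DIFFERENT = 0, 1


def parse_phones(raw: str) -> Set[str]:
    """Phone parsing: split the raw field into structured, digits-only numbers."""
    # Assumed separators: commas, semicolons, slashes (ASCII and full-width), whitespace.
    parts = re.split(r"[,;/，；、\s]+", raw)
    phones = {re.sub(r"\D", "", part) for part in parts}
    phones.discard("")  # drop fragments that contained no digits
    return phones


def phone_similarity(raw_first: str, raw_second: str) -> int:
    """Phone comparison: any matching pair means the contact phones are the same."""
    pairs = product(parse_phones(raw_first), parse_phones(raw_second))
    return PHONE_SAME if any(x == y for x, y in pairs) else PHONE_DIFFERENT


parsed = parse_phones("010-12345678; 13800000000")
same = phone_similarity("010-12345678; 13800000000", "13800000000")
different = phone_similarity("010-12345678", "010-87654321")
```

When either field is empty, this sketch reports "different" because no pair can match; the text does not state how that case is handled.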

After the spatial similarity, the address similarity, the label similarity and the telephone similarity are obtained respectively, feature expression can be performed on the similarity of each dimension by adopting binary numerical values respectively, and feature information representing the similarity of each dimension is obtained.

If the similarity of a certain dimension is a discrete value, such as the address similarity, it can be represented by a binary code whose number of digits equals the number of levels of that similarity. For example, since the address similarity has 3 levels, it can be represented by a 3-digit binary code: an address similarity of 2 may be represented as "001", an address similarity of 0 as "010", and an address similarity of 1 as "100".

Of course, the above-mentioned manner of characterizing features is only an example, and other manners of characterizing features may be provided, which are not described herein. For other discrete similarities, the characteristic representation manner may be similar to the address similarity, and details thereof are not repeated here.
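The 3-digit codes given for the address similarity amount to a one-hot encoding, in which each level is assigned its own bit position. A minimal sketch follows; the helper name `one_hot_table` is hypothetical.

```python
from typing import Dict, List, Sequence


def one_hot_table(levels: Sequence[int]) -> Dict[int, List[int]]:
    """Assign each discrete similarity value its own bit position, left to right."""
    table: Dict[int, List[int]] = {}
    for position, level in enumerate(levels):
        bits = [0] * len(levels)
        bits[position] = 1
        table[level] = bits
    return table


# Level order chosen to reproduce the codes in the text:
# 1 -> "100", 0 -> "010", 2 -> "001".
ADDRESS_ENCODING = one_hot_table([1, 0, 2])
as_text = {level: "".join(map(str, bits)) for level, bits in ADDRESS_ENCODING.items()}
```

The same helper can be reused for other discrete similarities by passing their own list of levels.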

If the similarity of a certain dimension is continuous, such as the spatial similarity, it first needs to be discretized into different levels. The level into which the spatial similarity of the two information points falls is then represented by a binary code with the corresponding number of digits, which yields the feature information representing the spatial similarity.

Then, the pieces of feature information may be directly spliced, or they may be fused and crossed by using a fully connected layer, as shown in FIG. 2, to finally obtain the second similarity feature.
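For a continuous similarity, discretization followed by binary encoding might look like the sketch below. The use of equal-width levels and the default of five levels are assumptions; the text does not state how many levels are used or where their boundaries lie.

```python
from typing import List


def discretize_one_hot(value: float, num_levels: int = 5) -> List[int]:
    """Map a similarity in [0, 1] to one of num_levels equal-width levels (one-hot)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    # The top edge (value == 1.0) is folded into the last level.
    index = min(int(value * num_levels), num_levels - 1)
    bits = [0] * num_levels
    bits[index] = 1
    return bits


closest = discretize_one_hot(0.0)
middle = discretize_one_hot(0.5)
farthest = discretize_one_hot(1.0)
```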
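The direct-splicing option can be illustrated with plain concatenation, as in the following sketch. The example bit patterns and the number of levels per dimension are made up for illustration.

```python
from typing import List, Sequence


def splice(feature_infos: Sequence[Sequence[int]]) -> List[int]:
    """Concatenate per-dimension binary feature information in a fixed order."""
    return [bit for info in feature_infos for bit in info]


# Illustrative inputs only.
address_info = [0, 1, 0]        # address similarity 0, encoded as "010"
spatial_info = [0, 1, 0, 0, 0]  # spatial similarity in the second of five assumed levels
label_info = [0, 0, 1, 0]       # one of four assumed label levels
phone_info = [1, 0]             # one of two phone levels

second_similarity_feature = splice([address_info, spatial_info, label_info, phone_info])
```

Keeping the order of dimensions fixed matters, since any downstream layer learns a weight per position.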

The deep portion may employ a deep semantic matching network based on an attention mechanism to determine a text semantic similarity of the first information point and the second information point. The attention mechanism-based deep semantic matching network can adopt a twin network structure, and the left sub-network and the right sub-network are symmetrical, wherein the left sub-network is used for extracting the text semantic features of a first information point based on the name information of the first information point, and the right sub-network is used for extracting the text semantic features of a second information point based on the name information of the second information point.

The main computing unit of the sub-network on each side comprises a self-attention network, a feed-forward network layer, a multi-head attention network, and a summation and normalization module. The self-attention network independently computes the feature vector of its own information point; the feed-forward network layer is a simple fully connected layer; the multi-head attention network computes the interaction feature vector between the two information points; and the summation and normalization module performs residual connection and feature-vector normalization, finally yielding the text semantic features of the information point.
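One such computing unit can be sketched in plain Python as below. This is a simplified illustration under several assumptions that the text does not fix: the order of the sub-layers, the absence of learned query/key/value projections, heads formed by slicing the feature dimension (which must be divisible by `num_heads`), a bias-free dense layer, and random placeholder embeddings. All function names are hypothetical.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention."""
    scale = math.sqrt(len(queries[0]))
    scores = matmul(queries, [list(col) for col in zip(*keys)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values)


def multi_head_cross_attention(own: Matrix, other: Matrix, num_heads: int) -> Matrix:
    """Each head attends from `own` tokens to `other` tokens on one feature slice."""
    head_dim = len(own[0]) // num_heads
    output: Matrix = [[] for _ in own]
    for head in range(num_heads):
        cols = slice(head * head_dim, (head + 1) * head_dim)
        queries = [row[cols] for row in own]
        keys_values = [row[cols] for row in other]
        for out_row, head_row in zip(output, attention(queries, keys_values, keys_values)):
            out_row.extend(head_row)
    return output


def add_and_norm(x: Matrix, sublayer: Matrix, eps: float = 1e-6) -> Matrix:
    """Residual connection followed by per-token normalization."""
    result = []
    for x_row, s_row in zip(x, sublayer):
        summed = [a + b for a, b in zip(x_row, s_row)]
        mean = sum(summed) / len(summed)
        var = sum((v - mean) ** 2 for v in summed) / len(summed)
        result.append([(v - mean) / math.sqrt(var + eps) for v in summed])
    return result


def compute_unit(own: Matrix, other: Matrix, ff_weights: Matrix, num_heads: int = 2) -> Matrix:
    h = add_and_norm(own, attention(own, own, own))                       # self-attention
    h = add_and_norm(h, multi_head_cross_attention(h, other, num_heads))  # interaction
    return add_and_norm(h, matmul(h, ff_weights))                         # dense layer


random.seed(0)
dim = 4
first_name = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]   # 3 tokens
second_name = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]  # 5 tokens
shared_ff = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

# The two sides are symmetric: same unit, same weights, swapped roles.
first_features = compute_unit(first_name, second_name, shared_ff)
second_features = compute_unit(second_name, first_name, shared_ff)
```

Sharing `shared_ff` between the two calls reflects the symmetry of the left and right sub-networks; only the roles of the two names are exchanged.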

Then, a splicing module concatenates the text semantic features of the first information point with the text semantic features of the second information point, the concatenated features are input into a logistic regression model for classification, and the text semantic similarity is finally obtained.

The text semantic similarity may be a floating-point value normalized to the range 0 to 1: the smaller the score, the less similar the name of the first information point and the name of the second information point; the larger the score, the more similar the two names.

After the text semantic similarity is obtained, feature expression can be performed on the text semantic similarity by adopting binary numerical values, and finally a first similarity feature is obtained.

Step S102: fusing the first similarity feature and the second similarity feature to obtain a target feature.

In this step, as shown in fig. 2, the first similarity feature and the second similarity feature may be fused and crossed by a fully connected layer, so as to finally obtain the target feature.

Step S103: determining whether the first information point and the second information point are the same information point based on the target feature.

In this step, it may be determined whether the first information point and the second information point are the same information point based on the target feature obtained by the fusion.

For example, in the feature expression, each bit of a binary value carries a corresponding meaning. Taking address similarity as an example, if the second bit of the binary value is 1 and the other bits are 0, the addresses of the two information points are the same. Therefore, whether the first information point and the second information point are the same information point can be determined by checking whether the bit at the corresponding position in the target feature is 1: if the bit is 1, the first information point and the second information point are the same information point; otherwise, they are different information points.

In practical application, the target features can be input into the logistic regression model, and finally, the recognition result is output.
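The logistic regression step can be sketched as follows. The weights, the bias, the example target feature, and the 0.5 decision threshold are hypothetical values chosen for illustration; in practice the parameters would come from training the end-to-end model.

```python
import math
from typing import Sequence, Tuple


def logistic_regression(target_feature: Sequence[float],
                        weights: Sequence[float],
                        bias: float) -> float:
    """Map the fused target feature to a probability in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, target_feature)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def is_same_information_point(target_feature: Sequence[float],
                              weights: Sequence[float],
                              bias: float,
                              threshold: float = 0.5) -> Tuple[float, bool]:
    """Return the probability and the recognition result (threshold is an assumption)."""
    probability = logistic_regression(target_feature, weights, bias)
    return probability, probability >= threshold


probability, same = is_same_information_point([2.0, 0.5], weights=[1.0, -2.0], bias=0.0)
```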

In this embodiment, the first similarity feature and the second similarity feature are fused to obtain a target feature, and whether the first information point and the second information point are the same information point is determined based on the target feature. Feature fusion thus yields an end-to-end chain finger model whose objective can be optimized uniformly, which avoids the loss in model performance caused by inconsistent optimization objectives and in turn improves the identification accuracy of information points. In addition, in the application scenario of information point chain finger, the recall rate of the information point chain finger can be improved.

In addition, the input of the end-to-end chain finger model is the text information of the information points and its output is the recognition result for the two information points, for which large-scale labeled samples are available. Compared with the training samples of a deep semantic matching model, for which the output text semantic similarity needs to be constructed, the sample construction cost of the model can therefore be reduced.

Optionally, N is greater than 2, and obtaining the second similarity feature includes:

acquiring N similarities of the first information point and the second information point in N dimensions;

acquiring feature information representing the similarity in each dimension;

and fusing the plurality of pieces of feature information representing the N similarities to obtain the second similarity feature.

In this embodiment, as shown in fig. 2, the second similarity characteristic is mainly obtained from the wide portion of the end-to-end chain finger model.

Specifically, first, the spatial similarity, the address similarity, the tag similarity, and the phone similarity between the first information point and the second information point are respectively obtained, and the specific obtaining process is described in detail above, and is not described herein again.

Then, the similarity in each dimension may be characterized using binary or another numeral base, to obtain feature information characterizing the similarity in each dimension. To unify the feature expression, the similarity in every dimension is represented in the same way, for example by binary values, as shown in fig. 2.

In addition, the N similarities include discrete similarities, such as address similarity, tag similarity and telephone similarity, as well as continuous similarities, such as spatial similarity. For a discrete similarity, in an optional embodiment, the similarity may be characterized by a binary code whose number of bits corresponds to the number of levels of the similarity; for example, if the address similarity is divided into three levels, it may be characterized by a 3-bit binary code.

For a continuous similarity, the similarity can be discretized into a plurality of levels before feature characterization, and then characterized by a binary code with the corresponding number of bits.

Finally, a fully connected layer can be adopted to fully fuse and cross the plurality of pieces of feature information representing the N similarities to obtain the second similarity feature.

In the embodiment, the wide part of the end-to-end chain finger model performs sufficient feature fusion and intersection on feature information representing N similarities to obtain a second similarity feature, and meanwhile, the second similarity feature of the wide part sufficiently fuses the first similarity feature representing the semantic similarity of text of the deep part to finally obtain a target feature for information point identification. Therefore, the similarity features of the first information point and the second information point are fully subjected to feature fusion and intersection, and the accuracy of information point identification can be further improved.

Optionally, the N similarity degrees include spatial similarity degrees of the first information point and the second information point in a distance dimension, and obtaining the spatial similarity degrees of the first information point and the second information point in the distance dimension includes:

acquiring a first recall distance corresponding to the first information point, a second recall distance corresponding to the second information point and a target distance between the first information point and the second information point;

and normalizing the target distance based on the first recall distance and the second recall distance to obtain the spatial similarity.

The distance at which two information points can be judged to belong to the same spatial entity differs between categories. For example, two same-name parks 300 m apart are likely to be the same park, two same-name chain-brand stores 300 m apart may be one store or two, and two toilets 300 m apart are likely not the same. Thus, when information point chain finger is performed, for example by searching for same-name parks within 300 meters, using a uniform recall distance may result in missed or false recalls.

In this embodiment, different recall distances may be set for information points of different classification categories, and based on the recall distance, the absolute distance is normalized to obtain a spatial similarity, which represents a relative distance between two information points.

Specifically, for different classification categories of the information points, recall distances of the information points may be different, a first recall distance corresponding to a first information point may be obtained based on tag information of the first information point, and a second recall distance corresponding to a second information point may be obtained based on tag information of the second information point.

If the tag information of the first information point is the same as that of the second information point, that is, the first information point and the second information point belong to the same classification category, the first recall distance and the second recall distance are equal; otherwise, the first recall distance and the second recall distance may not be equal.

The target distance, namely the absolute distance, between the first information point and the second information point can be normalized by adopting a dynamic sigmoid function based on the first recall distance and the second recall distance, and finally the spatial similarity between the first information point and the second information point is obtained.

Wherein, in the above formula (1), n may be either the first recall distance or the second recall distance when the first recall distance and the second recall distance are equal, and n may be an average value of the first recall distance and the second recall distance when the first recall distance and the second recall distance are not equal.
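The choice of n described above, and one possible way of using it, can be sketched as follows. The rule for n follows the text; the specific sigmoid form in `spatial_similarity` is only an assumption made for illustration and should not be read as formula (1). It was chosen so that the result is 0 at zero distance, grows with distance, and stays within [0, 1). The function names are hypothetical.

```python
import math


def scale_n(first_recall_distance: float, second_recall_distance: float) -> float:
    """n equals either recall distance when they match, otherwise their average.

    The mean covers both cases, since the mean of two equal values is that value.
    """
    return (first_recall_distance + second_recall_distance) / 2.0


def spatial_similarity(target_distance: float,
                       first_recall_distance: float,
                       second_recall_distance: float) -> float:
    """Normalize an absolute distance into a relative distance in [0, 1).

    ASSUMED form: 2 * sigmoid(d / n) - 1. This is not taken from the source.
    """
    n = scale_n(first_recall_distance, second_recall_distance)
    return 2.0 / (1.0 + math.exp(-target_distance / n)) - 1.0


# The same 300 m gap is relatively small for a category with a 300 m recall
# distance and relatively large for a category with a hypothetical 50 m one.
wide_category = spatial_similarity(300.0, 300.0, 300.0)
narrow_category = spatial_similarity(300.0, 50.0, 50.0)
```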

In this embodiment, the absolute distance between the two information points is normalized by the recall distances corresponding to the information points to obtain the relative distance between the two information points, so that the missed or false recalls caused by applying a uniform recall distance to information points of different classification categories can be avoided.

Optionally, obtaining feature information representing spatial similarity in the distance dimension includes:

determining discretization parameters corresponding to the spatial similarity based on a preset discretization step length;

and determining characteristic information representing the spatial similarity based on the discretization parameters.

In this embodiment, the spatial similarity is a continuous similarity, and when performing feature characterization on the continuous similarity, it is necessary to determine a discretization parameter corresponding to the spatial similarity based on a preset discretization step length.

The discretization parameter represents a hierarchy where a relative distance between the first information point and the second information point is located, and the smaller the hierarchy is, the closer the relative distance between the first information point and the second information point can be represented, and the larger the hierarchy is, the farther the relative distance between the first information point and the second information point can be represented.

Before the discretization parameter corresponding to the spatial similarity between the first information point and the second information point is determined, the range of the spatial similarity is layered based on the preset discretization step length. The range may be layered linearly, using a single discretization step length, or nonlinearly, using different discretization step lengths, which is not specifically limited herein.

For example, the size interval of the spatial similarity is [0,1], and the spatial similarity can be divided into 10 levels according to a discretization step size of 0.1.

For another example, the size interval of the spatial similarity is [0,1], and the spatial similarity can be divided into 19 levels based on a preset discretization step size, wherein 0 to 0.1 can be divided into the first 9 levels, the discretization step size is 0.01, and 0.1 to 1 are divided into the last 10 levels, and the discretization step size is 0.1. As the hierarchy grows, the relative distance between two information points becomes greater.

After layering in this nonlinear manner, suppose the obtained spatial similarity between the first information point and the second information point is 0.055. Since 0.055 is less than 0.1, it falls within the first 9 levels, where the discretization step length is 0.01. Dividing 0.055 by 0.01 gives a quotient of 5 with a remainder, so the level in which the spatial similarity falls is 6; that is, the discretization parameter corresponding to the spatial similarity between the first information point and the second information point is 6.

The discretization parameter is then represented by binary values, which determines the feature information representing the spatial similarity.
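The nonlinear 19-level example can be sketched as follows. The sketch assumes that the levels are delimited by the upper bounds 0.01, 0.02, ..., 0.09 (first 9 levels) and 0.1, 0.2, ..., 1.0 (last 10 levels), that a value lying exactly on a bound belongs to the lower level, and that the level is encoded with one bit per level in the manner of the 3-bit address example. These details, the bit order, and the function names are assumptions.

```python
from bisect import bisect_left
from typing import List

# Upper bounds of the 19 levels: step 0.01 up to 0.09, then step 0.1 up to 1.0.
UPPER_BOUNDS: List[float] = [i / 100 for i in range(1, 10)] + [i / 10 for i in range(1, 11)]


def discretization_level(similarity: float) -> int:
    """Return the 1-based level whose upper bound is the first one >= similarity."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("spatial similarity must lie in [0, 1]")
    return bisect_left(UPPER_BOUNDS, similarity) + 1


def level_to_bits(level: int, num_levels: int = len(UPPER_BOUNDS)) -> List[int]:
    """Encode a level as a binary vector with a single bit set (bit order assumed)."""
    bits = [0] * num_levels
    bits[level - 1] = 1
    return bits


level = discretization_level(0.055)   # the worked example from the text
spatial_feature = level_to_bits(level)
```

For the input 0.055 the five bounds 0.01 to 0.05 lie below the value and 0.06 is the first bound at or above it, which reproduces level 6 from the worked example.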

In this embodiment, a continuous similarity is first discretized into different levels; the discretization parameter of the continuous similarity between the first information point and the second information point is then determined according to the preset discretization step length, and feature information representing the spatial similarity is determined based on the discretization parameter. The feature characterization of continuous similarities is thereby realized, which lays the groundwork for feature fusion and crossing.

Optionally, the obtaining the first similarity feature includes:

acquiring first name information of the first information point and second name information of the second information point;

inputting the first name information and the second name information into a text semantic matching network; the text semantic matching network comprises a first sub-network and a second sub-network which are symmetrical to each other, the first sub-network is used for extracting the features of the first information points by adopting an attention mechanism to obtain the text semantic features of the first information points, and the second sub-network is used for extracting the features of the second information points by adopting the attention mechanism to obtain the text semantic features of the second information points;

and outputting the first similarity characteristic based on the text semantic characteristic of the first information point and the text semantic characteristic of the second information point.

In this embodiment, the text semantic matching network may adopt a deep semantic matching network with a double-tower structure, and the feature extraction unit of the deep semantic matching network is a self-attention mechanism network and a multi-head attention mechanism network.

Referring to fig. 4, fig. 4 is a schematic diagram of a text semantic matching network, and as shown in fig. 4, the text semantic matching network may adopt a twin network structure, and left and right subnetworks are symmetric, and may be a first subnetwork and a second subnetwork respectively, and each subnetwork is a deep semantic matching network. The left sub-network is used for extracting the text semantic features of the first information points based on the name information of the first information points, and the right sub-network is used for extracting the text semantic features of the second information points based on the name information of the second information points.

The main computing unit of the sub-network on each side comprises a self-attention network, a feed-forward network layer, a multi-head attention network, and a summation and normalization module. The self-attention network independently computes the feature vector of its own information point; the feed-forward network layer is a simple fully connected layer; the multi-head attention network computes the interaction feature vector between the two information points; and the summation and normalization module performs residual connection and feature-vector normalization.

Specifically, first name information of a first information point and second name information of a second information point are obtained, the first name information is input to a first sub-network through an input embedding module, and the second name information is input to a second sub-network through the input embedding module.

The first sub-network performs feature extraction on the first name information through the self-attention network and the multi-head attention network, in combination with the feed-forward network layer and the summation and normalization module, to obtain the text semantic features of the first information point. Likewise, the second sub-network performs feature extraction on the second name information through the self-attention network and the multi-head attention network, in combination with the feed-forward network layer and the summation and normalization module, to obtain the text semantic features of the second information point.

Then, a splicing module splices the text semantic features of the first information point and the text semantic features of the second information point, the spliced features are input into a logistic regression model, and a first similarity feature representing the text semantic similarity between the first information point and the second information point is finally output.

In the embodiment, the deep part in the end-to-end chain finger model extracts the text semantic features of the first information point and the text semantic features of the second information point through a deep semantic matching network based on a double-tower structure, and outputs a first similarity feature representing the text semantic similarity between the first information point and the second information point based on the text semantic features of the first information point and the text semantic features of the second information point. And fusing and crossing the first similarity characteristic of the deep part and the second similarity characteristic of the wide part, so that a text semantic matching network can be fused into the chain finger model, an end-to-end chain finger model is finally formed, and the identification accuracy of the information points is improved.

In addition, the end-to-end chain finger model needs to be trained in advance. The number of training samples can be on the order of tens of millions, for example 12 million, with a ratio of positive examples to negative examples of 1:2; that is, the number of positive examples can be 4 million and the number of negative examples can be 8 million.

The positive samples may come from manually labeled data (i.e., sample data manually labeled as positive examples) and from chain finger relationship data in the map system, that is, data in the chain finger aggregation groups of the map system. The negative samples may come from manually labeled data (i.e., sample data manually labeled as negative examples) and from parent-child relationship data and sibling relationship data in the map system, where parent-child relationship data refers to different information points having a containment relationship, such as a parking lot in a building, and sibling relationship data refers to different information points having a parallel relationship, such as two different buildings. After training, information point identification can be performed based on the end-to-end chain finger model.

Second embodiment

As shown in fig. 5, the present application provides an information point identifying apparatus 500 including:

an obtaining module 501, configured to obtain a first similarity feature and a second similarity feature; the first similarity feature is used for representing the text semantic similarity between a first information point and a second information point, the second similarity feature is used for representing N similarities of the first information point and the second information point in N dimensions, and N is a positive integer greater than 1;

a fusion module 502, configured to fuse the first similarity feature and the second similarity feature to obtain a target feature;

a determining module 503, configured to determine whether the first information point and the second information point are the same information point based on the target feature.
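The division of the apparatus into three modules can be pictured with the following structural sketch. The class name, the field names, and the stand-in callables are hypothetical; the sketch shows only how the outputs of modules 501, 502 and 503 feed into one another, not how each module is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Feature = List[float]
InformationPoint = Dict[str, str]


@dataclass
class InformationPointIdentifier:
    # Module 501: returns (first similarity feature, second similarity feature).
    obtain: Callable[[InformationPoint, InformationPoint], Tuple[Feature, Feature]]
    # Module 502: fuses the two similarity features into the target feature.
    fuse: Callable[[Feature, Feature], Feature]
    # Module 503: decides whether the two points are the same information point.
    determine: Callable[[Feature], bool]

    def identify(self, first: InformationPoint, second: InformationPoint) -> bool:
        first_feature, second_feature = self.obtain(first, second)
        target_feature = self.fuse(first_feature, second_feature)
        return self.determine(target_feature)


# Stand-in callables, used only to show the data flow between the modules.
identifier = InformationPointIdentifier(
    obtain=lambda first, second: ([1.0], [0.0, 1.0, 0.0]),
    fuse=lambda first_feature, second_feature: first_feature + second_feature,
    determine=lambda target_feature: sum(target_feature) >= 2.0,
)
result = identifier.identify({"name": "A"}, {"name": "B"})
```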

Optionally, where N is greater than 2, the obtaining module 501 includes:

a first obtaining unit, configured to obtain N similarities of the first information point and the second information point in N dimensions;

the second acquisition unit is used for acquiring characteristic information representing the similarity of each dimension;

and the feature fusion unit is used for fusing the plurality of feature information representing the N similarities to obtain the second similarity feature.

Optionally, the N similarities include spatial similarities of the first information point and the second information point in a distance dimension, and the first obtaining unit is specifically configured to obtain a first recall distance corresponding to the first information point, a second recall distance corresponding to the second information point, and a target distance between the first information point and the second information point; and normalizing the target distance based on the first recall distance and the second recall distance to obtain the spatial similarity.

Optionally, the second obtaining unit is specifically configured to determine a discretization parameter corresponding to the spatial similarity based on a preset discretization step length; and determining characteristic information representing the spatial similarity based on the discretization parameters.

Optionally, the obtaining module 501 further includes:

a third obtaining unit, configured to obtain first name information of the first information point and second name information of the second information point;

the input unit is used for inputting the first name information and the second name information into a text semantic matching network; the text semantic matching network comprises a first sub-network and a second sub-network which are symmetrical to each other, the first sub-network is used for extracting the features of the first information points by adopting an attention mechanism to obtain the text semantic features of the first information points, and the second sub-network is used for extracting the features of the second information points by adopting the attention mechanism to obtain the text semantic features of the second information points;

and the output unit is used for outputting the first similarity characteristic based on the text semantic characteristics of the first information points and the text semantic characteristics of the second information points.

The information point identifying device 500 provided by the present application can implement each process implemented by the above information point identifying method embodiment, and can achieve the same beneficial effects, and for avoiding repetition, the details are not repeated here.

According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.

Fig. 6 is a block diagram of an electronic device for the information point identification method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.

As shown in fig. 6, the electronic apparatus includes: one or more processors 601, memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories and multiple types of memory, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.

The memory 602 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the information point identification method provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the information point identifying method provided by the present application.

The memory 602, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the information point identification method in the embodiment of the present application (for example, the obtaining module 501, the fusion module 502, and the determining module 503 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing, i.e., implements the information point identification method in the above-described method embodiments, by running the non-transitory software programs, instructions, and modules stored in the memory 602.

The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by use of the electronic device according to the information point identification method of the embodiment of the present application, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 may optionally include a memory remotely located from the processor 601, and these remote memories may be connected to the electronic device of the information point identification method through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device of the information point identification method according to the embodiment of the present application may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.

The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device of the information point recognition method of the embodiment of the present application, such as an input device of a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and addresses the high management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services.

In this embodiment, the first similarity feature and the second similarity feature are fused to obtain a target feature, and whether the first information point and the second information point are the same information point is determined based on the target feature. Feature fusion thus yields an end-to-end chain finger (linking) model whose optimization objective is unified, which avoids the loss of model optimization effect caused by inconsistent optimization objectives and can in turn improve the accuracy of information point identification. The technical solution of the embodiments of the present application therefore addresses the low identification accuracy of information point identification techniques.
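The fusion step described above can be illustrated with a minimal sketch. It assumes fusion by simple concatenation and a single logistic output layer; neither choice is stated in this section, and the names `fuse_features`, `same_poi_probability`, `is_same_poi`, the weights, and the 0.5 threshold are hypothetical placeholders rather than the model of the embodiment.

```python
import math
from typing import List, Sequence


def fuse_features(first: Sequence[float], second: Sequence[float]) -> List[float]:
    """Fuse the two similarity features into one target feature.

    Assumption: fusion is plain concatenation. The text only says the
    features are "fused"; other schemes (sum, gating) are equally possible.
    """
    return list(first) + list(second)


def same_poi_probability(target: Sequence[float],
                         weights: Sequence[float],
                         bias: float) -> float:
    """Score the target feature with one logistic unit (hypothetical head)."""
    if len(target) != len(weights):
        raise ValueError("target feature and weights must have equal length")
    logit = sum(w * x for w, x in zip(weights, target)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def is_same_poi(first: Sequence[float],
                second: Sequence[float],
                weights: Sequence[float],
                bias: float,
                threshold: float = 0.5) -> bool:
    """Decide whether two information points are the same one.

    Because the decision is computed from the fused feature alone, a single
    loss on this output can train both feature branches jointly.
    """
    target = fuse_features(first, second)
    return same_poi_probability(target, weights, bias) >= threshold
```

In this sketch, `first` would hold the text semantic similarity feature and `second` the N per-dimension similarities; in a trained model the weights would be learned together with the upstream feature extractors, which is what makes the pipeline end-to-end.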

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. An information point identification method, comprising:

2. The method of claim 1, wherein N is greater than 2, obtaining a second similarity feature comprises:

3. The method according to claim 2, wherein the N similarities include a spatial similarity of the first information point and the second information point in a distance dimension, and the obtaining the spatial similarity of the first information point and the second information point in the distance dimension includes:

4. The method of claim 3, wherein obtaining feature information characterizing spatial similarity in the distance dimension comprises:

5. The method of claim 1, wherein obtaining a first similarity feature comprises:

6. An information point identifying apparatus comprising:

7. The apparatus of claim 6, wherein N is greater than 2, the obtaining means comprising:

8. The apparatus according to claim 7, wherein the N similarities include a spatial similarity of the first information point and the second information point in a distance dimension, and the first obtaining unit is specifically configured to: obtain a first recall distance corresponding to the first information point, a second recall distance corresponding to the second information point, and a target distance between the first information point and the second information point; and normalize the target distance based on the first recall distance and the second recall distance to obtain the spatial similarity.

9. The apparatus according to claim 8, wherein the second obtaining unit is specifically configured to: determine a discretization parameter corresponding to the spatial similarity based on a preset discretization step length; and determine feature information characterizing the spatial similarity based on the discretization parameter.

10. The apparatus of claim 6, wherein the means for obtaining further comprises:

11. An electronic device, comprising:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.

12. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-5.
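For illustration only, the distance normalization and discretization recited in claims 8 and 9 can be sketched as follows. The claims do not give the normalization formula or the form of the feature information, so this sketch assumes the target distance is scaled by the larger of the two recall distances, and the discretization parameter is a bucket index that selects a one-hot vector. The function names and the default step are hypothetical.

```python
import math
from typing import List


def spatial_similarity(target_distance: float,
                       first_recall: float,
                       second_recall: float) -> float:
    """Normalize the distance between two points into a [0, 1] similarity.

    Assumption: the scale is max(first_recall, second_recall); the claims
    only say normalization is "based on" the two recall distances.
    """
    scale = max(first_recall, second_recall)
    if scale <= 0 or target_distance < 0:
        raise ValueError("recall distance must be positive, distance non-negative")
    return max(0.0, 1.0 - target_distance / scale)


def discretize(similarity: float, step: float = 0.1) -> int:
    """Map a similarity to a bucket index using a preset step length."""
    if not 0.0 < step <= 1.0:
        raise ValueError("step must be in (0, 1]")
    num_buckets = math.ceil(1.0 / step)
    index = int(similarity / step)
    # Clamp so that similarity == 1.0 falls into the last bucket.
    return min(max(index, 0), num_buckets - 1)


def one_hot(index: int, size: int) -> List[int]:
    """Hypothetical feature information: a one-hot vector for the bucket."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1 if i == index else 0 for i in range(size)]
```

With a step of 0.25, two points 50 m apart whose recall distances are 100 m and 200 m would get a similarity of 0.75 under this assumed formula and fall into the last of four buckets. A learned embedding table indexed by the bucket could replace the one-hot vector.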