CN111291155A

CN111291155A - Method and system for identifying homonymous cells based on text similarity

Info

Publication number: CN111291155A
Application number: CN202010054556.4A
Authority: CN
Inventors: 朱晨晓; 李昭; 陈浩; 高靖; 崔岩; 卢述奇; 陈呈; 张宵
Original assignee: Qingwutong Co ltd
Current assignee: Qingwutong Co ltd
Priority date: 2020-01-17
Filing date: 2020-01-17
Publication date: 2020-06-16

Abstract

The invention discloses a method and a system for identifying a cell with the same name based on text similarity, wherein the method comprises the following steps: acquiring a first name of a first cell to be identified and a second name of a second cell to be identified; acquiring first attribute information of a first cell to be identified and second attribute information of a second cell to be identified; when the first basic information and the second basic information are the same, determining the distance between the first cell to be distinguished and the second cell to be distinguished according to the first longitude and latitude information and the second longitude and latitude information; when the distance is smaller than or equal to a preset threshold value, calculating the text similarity of the first name and the second name; and determining a discrimination result according to the text similarity and the distance between the two cells. Because the distance between the cells to be distinguished and the text similarity between the cell names are comprehensively considered, the method can avoid misjudgment caused by the fact that the cells to be distinguished have aliases or the cell names to be distinguished are the same, and effectively improves the distinguishing accuracy.

Description

Method and system for identifying homonymous cells based on text similarity

Technical Field

The invention relates to the technical field of information processing, in particular to a method and a system for identifying homonymous cells based on text similarity.

Background

With the rapid popularization and development of the internet, house renting and selling platforms have become widespread. House brokers publish house source information on each renting and selling platform so that users can conveniently search for the house source information they need by setting screening conditions.

However, in some application scenarios, if cell A also goes by the alias cell B, different house brokers may use different cell names when publishing house source information, so that a user searching the house source information cannot tell whether the two names refer to the same house source; in another application scenario, if two cells have the same or similar names, the user may mistakenly conclude that the two cells are the same house source.

In order to solve the above problem, the prior art determines whether two cell names refer to the same cell as follows: firstly, judging whether the cities and the urban areas where the two cells are located are the same; if they are the same, further calculating the text similarity of the two cell names, and if the text similarity is greater than or equal to 90%, judging that the two cells are the same cell.

However, in the above method for identifying the cells with the same name, when an alias with text similarity smaller than 90% exists in a certain cell, or text similarity of names of two different cells exceeds 90%, a high misjudgment frequency occurs, and the identification accuracy is greatly reduced.

Disclosure of Invention

The invention provides a method and a system for identifying homonymous cells based on text similarity, which can effectively improve the identification accuracy of homonymous cells and reduce the misjudgment risk.

In a first aspect, the present application provides a method for identifying a cell with the same name based on text similarity, where the method includes:

acquiring a first name of a first cell to be identified and a second name of a second cell to be identified;

acquiring first attribute information of the first cell to be identified and second attribute information of the second cell to be identified; the first attribute information comprises first basic information and first longitude and latitude information of a first cell to be distinguished; the second attribute information comprises second basic information and second longitude and latitude information of a second cell to be identified;

when the first basic information and the second basic information are the same, determining the distance between a first cell to be distinguished and a second cell to be distinguished according to the first longitude and latitude information and the second longitude and latitude information;

when the distance between a first cell to be distinguished and a second cell to be distinguished is smaller than or equal to a preset threshold value, calculating the text similarity of the first name and the second name;

and determining a discrimination result according to the text similarity and the distance between the first cell to be discriminated and the second cell to be discriminated.

Optionally, the first basic information includes a first city and a first urban area where the first cell to be identified is located; the second basic information comprises a second city and a second urban area where the second cell to be distinguished is located;

the method for judging whether the first basic information and the second basic information are the same comprises the following steps:

comparing whether a first city in which the first cell to be distinguished is located is the same as a second city in which the second cell to be distinguished is located; if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell;

if so, comparing whether a first urban area in which the first cell to be distinguished is located is the same as a second urban area in which the second cell to be distinguished is located: if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell; and if so, executing the step of determining the distance between the first cell to be distinguished and the second cell to be distinguished according to the first longitude and latitude information and the second longitude and latitude information.

Optionally, the first longitude and latitude information includes a first longitude and a first latitude of a first cell to be distinguished, and the second longitude and latitude information includes a second longitude and a second latitude of a second cell to be distinguished;

the distance between the first cell to be distinguished and the second cell to be distinguished is obtained by adopting the following formula:

$$d = R \cdot \arccos\left(\sin\varphi_1 \sin\varphi_2 + \cos\varphi_1 \cos\varphi_2 \cos\Delta\lambda\right)$$

wherein $d$ is the distance between the first cell to be discriminated and the second cell to be discriminated, $R$ is the radius of the earth, $\varphi_1$ and $\varphi_2$ are the first and second latitudes, respectively, and $\Delta\lambda$ represents the difference between the first and second longitudes.

Optionally, after the step of determining the distance between the first cell to be distinguished and the second cell to be distinguished, the method further includes:

and when the distance between the first cell to be distinguished and the second cell to be distinguished is larger than a preset threshold value, judging that the first cell to be distinguished and the second cell to be distinguished are not the same cell.

Optionally, the step of calculating the text similarity between the first name and the second name includes:

preprocessing a text corresponding to the first name and a text corresponding to the second name to respectively obtain a first text and a second text;

after word segmentation processing is carried out on the first text and the second text, vectorizing is carried out on the first text and the second text respectively by utilizing a word vector model trained in advance, and a first word vector and a second word vector are obtained;

and calculating the cosine similarity of the first word vector and the second word vector as the text similarity of the first name and the second name.

Optionally, the step of preprocessing the text corresponding to the first name and the text corresponding to the second name to obtain the first text and the second text respectively includes:

and removing invalid suffix and symbol characters in the text corresponding to the first name and the text corresponding to the second name, converting capital English characters into lowercase English characters, and converting numbers into Chinese characters.

Optionally, the step of determining a recognition result according to the text similarity and the distance between the first cell to be recognized and the second cell to be recognized includes:

when the text similarity is greater than or equal to 0.9, the first cell to be distinguished and the second cell to be distinguished are the same cell according to the distinguishing result;

when the text similarity is larger than or equal to 0.7, judging whether the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to 300 meters; if so, determining that the first cell to be distinguished and the second cell to be distinguished are the same cell; if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell;

when the text similarity is larger than or equal to 0.6, judging whether the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to 50 meters; if so, determining that the first cell to be distinguished and the second cell to be distinguished are the same cell; if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell;

and when the distance between the first cell to be distinguished and the second cell to be distinguished is less than or equal to 10 meters, the distinguishing result is that the first cell to be distinguished and the second cell to be distinguished are the same cell.

In a second aspect, the present application provides a system for identifying a cell of the same name based on text similarity, the system comprising:

the name acquisition module is used for acquiring a first name of a first cell to be distinguished and a second name of a second cell to be distinguished;

an information obtaining module, configured to obtain first attribute information of the first cell to be identified and second attribute information of the second cell to be identified; the first attribute information comprises first basic information and first longitude and latitude information of a first cell to be distinguished; the second attribute information comprises second basic information and second longitude and latitude information of a second cell to be identified;

a distance determining module, configured to determine, when the first basic information and the second basic information are the same, a distance between a first cell to be identified and a second cell to be identified according to the first longitude and latitude information and the second longitude and latitude information;

the similarity calculation module is used for calculating the text similarity of the first name and the second name when the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to a preset threshold value;

and the result determining module is used for determining a discrimination result according to the text similarity and the distance between the first cell to be discriminated and the second cell to be discriminated.

Compared with the prior art, the homonymous cell identification method and system based on text similarity provided by the invention at least realize the following beneficial effects:

according to the method and the system for identifying the homonymous cell based on the text similarity, first attribute information of a first cell to be identified and second attribute information of a second cell to be identified are obtained by obtaining the first name of the first cell to be identified and the second name of the second cell to be identified; the first attribute information comprises first basic information and first longitude and latitude information of a first cell to be distinguished; the second attribute information comprises second basic information and second longitude and latitude information of a second cell to be distinguished;

when the first basic information and the second basic information are the same, determining the distance between the first cell to be distinguished and the second cell to be distinguished according to the first longitude and latitude information and the second longitude and latitude information; when the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to a preset threshold value, calculating the text similarity of the first name and the second name; and determining a discrimination result according to the text similarity and the distance between the first cell to be discriminated and the second cell to be discriminated. The distance between the cells to be distinguished and the text similarity between the cell names of the cells to be distinguished are comprehensively considered, so that misjudgment caused by the fact that the cells to be distinguished have aliases or the cell names to be distinguished are the same can be avoided, and the distinguishing accuracy is effectively improved.

Of course, it is not necessary for any product in which the present invention is practiced to achieve all of the above-described technical effects simultaneously.

Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.

Fig. 1 is a flowchart illustrating a method for identifying a cell of the same name based on text similarity according to an embodiment of the present disclosure;

fig. 2 is a schematic structural diagram of a system for identifying a cell of the same name based on text similarity according to an embodiment of the present disclosure.

Detailed Description

Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.

In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.

In the method for distinguishing the cells with the same name provided by the prior art, for cells to be distinguished that are in the same city and urban area, whether they are the same cell is judged only by whether the text similarity is greater than or equal to a preset threshold value. It can be understood that if a certain cell has a name A and an alias B, and the text similarity between A and B is less than 90%, the name A and the alias B are wrongly determined to be different cells; conversely, if two different cells have names A and B that are the same or similar, so that the text similarity is greater than or equal to 90%, the two cells are wrongly determined to be the same cell. Therefore, because the prior-art method takes only the text similarity as the basis for distinguishing, misjudgments occur frequently and the distinguishing accuracy is greatly reduced.

In view of the above, the invention provides a method for identifying a cell with the same name based on text similarity, which can effectively improve the identification accuracy of the cell with the same name and reduce the risk of misjudgment.

The invention is described in detail below with reference to the drawings and specific embodiments.

Fig. 1 is a flowchart illustrating a method for identifying a cell of the same name based on text similarity according to an embodiment of the present disclosure. Referring to fig. 1, the method for identifying a cell of the same name includes:

Step 101: acquiring a first name of a first cell to be identified and a second name of a second cell to be identified;

Step 102: acquiring first attribute information of a first cell to be identified and second attribute information of a second cell to be identified; the first attribute information comprises first basic information and first longitude and latitude information of a first cell to be distinguished; the second attribute information comprises second basic information and second longitude and latitude information of a second cell to be distinguished;

Step 103: when the first basic information and the second basic information are the same, determining the distance between the first cell to be identified and the second cell to be identified according to the first longitude and latitude information and the second longitude and latitude information;

Step 104: when the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to a preset threshold value, calculating the text similarity of the first name and the second name;

Step 105: determining a discrimination result according to the text similarity and the distance between the first cell to be discriminated and the second cell to be discriminated.

Specifically, after acquiring a first name of a first cell to be identified and a second name of a second cell to be identified, the first basic information and the second basic information may be compared first, and preliminary identification may be performed through space limitation. When the first basic information is the same as the second basic information, calculating the distance between the first cell to be identified and the second cell to be identified by using the acquired first longitude and latitude information and the acquired second longitude and latitude information; it will be appreciated that if the first cell to be distinguished and the second cell to be distinguished are the same cell, the distance between the two should be small. Therefore, only when the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to the preset threshold, the text similarity of the first name and the second name is calculated, and the distinguishing result is determined by combining the text similarity and the distance between the two cells.
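The control flow of steps 101 to 105 can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the `Cell` record, the function name `is_same_cell`, and the three injected callables (`distance_m`, `similarity`, `decide`) are hypothetical names introduced here, and the default threshold of 1500 meters is the example value given later in this description.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Cell:
    """Hypothetical record holding the name and attribute information of a cell."""
    name: str
    city: str
    district: str
    lon: float
    lat: float


def is_same_cell(
    a: Cell,
    b: Cell,
    distance_m: Callable[[Cell, Cell], float],
    similarity: Callable[[str, str], float],
    decide: Callable[[float, float], bool],
    max_distance_m: float = 1500.0,
) -> bool:
    # Step 103 precondition: the basic information (city and urban area) must match.
    if (a.city, a.district) != (b.city, b.district):
        return False
    # Step 103: distance from the longitude and latitude information.
    d = distance_m(a, b)
    # Cells farther apart than the preset threshold are rejected without
    # computing the text similarity at all.
    if d > max_distance_m:
        return False
    # Step 104: text similarity of the two names.
    s = similarity(a.name, b.name)
    # Step 105: combine similarity and distance into the final result.
    return decide(s, d)
```

Passing the distance, similarity, and decision functions as arguments keeps the sketch independent of any particular formula or word vector model; the ordering of the checks is what matters, since the cheaper comparisons run first and the text similarity is computed only for nearby cells.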

According to the method for distinguishing the cells with the same name, the distance between the cells to be distinguished and the text similarity between the cell names of the cells to be distinguished are comprehensively considered, whether the cells are the same or not can be accurately distinguished, misjudgment caused by the fact that the cells to be distinguished have the alias or the cell names to be distinguished are the same can be avoided, and distinguishing accuracy is effectively improved.

Meanwhile, determining whether homonymous cells are the same cell makes it possible to check, when a house broker publishes house source information, whether the same house source already exists on the house renting and selling platform. This facilitates subsequent merging or de-duplication of house sources, lets a user quickly and accurately find the required information in massive house source data, and improves the user experience.

Optionally, the first basic information includes a first city and a first urban area where the first cell to be identified is located; the second basic information comprises a second city and a second urban area where a second cell to be identified is located;

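before the step 103, whether the first basic information and the second basic information are the same may be determined as follows:

comparing whether the first city in which the first cell to be distinguished is located is the same as the second city in which the second cell to be distinguished is located; if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell;

if so, comparing whether the first urban area in which the first cell to be distinguished is located is the same as the second urban area in which the second cell to be distinguished is located; if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell; and if so, executing step 103.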

It will be appreciated that if the two cells to be distinguished are the same cell, they must be in the same city and the same urban area; in other words, two cells to be distinguished that lie in different cities or urban areas cannot be the same cell. Therefore, when the cells to be distinguished are in different cities or urban areas, the distinguishing result can be determined quickly by comparing the first basic information with the second basic information, without any subsequent calculation, which saves computing resources and improves the real-time performance of the algorithm.

Optionally, the first longitude and latitude information includes a first longitude and a first latitude of the first cell to be distinguished, and the second longitude and latitude information includes a second longitude and a second latitude of the second cell to be distinguished;

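the distance between the first cell to be distinguished and the second cell to be distinguished can be calculated by adopting the following formula:

$$d = R \cdot \arccos\left(\sin\varphi_1 \sin\varphi_2 + \cos\varphi_1 \cos\varphi_2 \cos\Delta\lambda\right)$$

wherein $d$ is the distance between the first cell to be distinguished and the second cell to be distinguished, $R$ is the radius of the earth, $\varphi_1$ and $\varphi_2$ are the first and second latitudes, respectively, and $\Delta\lambda$ represents the difference between the first and second longitudes.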

In this embodiment, the first longitude and latitude information and the second longitude and latitude information may be obtained from a map application, and the radius of the earth may be taken as 6371 km. Treating the earth as a sphere, the above formula converts the latitudes and the longitude difference into a distance along the earth's surface, so that the distance between the cells to be distinguished can be calculated accurately.
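As an illustration of the distance calculation, the sketch below assumes the spherical law of cosines, which uses exactly the quantities named in the text (the earth radius R, the two latitudes, and the longitude difference). The function name `great_circle_distance_m` and the argument order are choices made here for the example, not part of the embodiment.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # 6371 km, the value given in this embodiment


def great_circle_distance_m(lon1: float, lat1: float,
                            lon2: float, lat2: float) -> float:
    """Distance in meters between two points given in decimal degrees.

    Assumes a spherical earth and the spherical law of cosines.
    """
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_lambda = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lambda))
    # Rounding can push the value slightly outside [-1, 1]; clamp before acos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return EARTH_RADIUS_M * math.acos(cos_angle)
```

For points only a few meters apart, the arccosine of a value very close to 1 loses some precision in double-precision arithmetic. The haversine form of the same great-circle distance is a common alternative when sub-meter accuracy matters; for thresholds on the order of 10 to 1500 meters the difference should be negligible.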

Optionally, after the step of determining the distance between the first cell to be distinguished and the second cell to be distinguished in step 103, the method further includes:

when the distance between the first cell to be distinguished and the second cell to be distinguished is larger than the preset threshold value, judging that the first cell to be distinguished and the second cell to be distinguished are not the same cell.

The preset threshold value can be set according to the area of the cells to be distinguished. In this embodiment, the preset threshold may be 1500 meters. When the distance between the first cell to be distinguished and the second cell to be distinguished is larger than 1500 meters, the two are judged not to be the same cell without calculating the text similarity, which speeds up the algorithm.

Optionally, in the step 104, the step of calculating the text similarity between the first name and the second name includes:

S1: preprocessing the text corresponding to the first name and the text corresponding to the second name to respectively obtain a first text and a second text;

S2: after word segmentation processing is carried out on the first text and the second text, vectorizing the first text and the second text respectively by using a word vector model trained in advance, to obtain a first word vector and a second word vector;

S3: calculating the cosine similarity of the first word vector and the second word vector as the text similarity of the first name and the second name.

The text similarity represents the degree of matching between two or more texts: the larger the text similarity, the more similar the texts are, and the smaller the text similarity, the less similar they are. When vectorizing text, different vectorization granularities may be selected; for example, vectorization may be performed in units of characters or in units of words.
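Step S3 can be illustrated with a small sketch of cosine similarity. The description does not say how the vectors of the individual tokens are combined into one vector per name, so the averaging in `average_vectors` is an assumption made here, and the `embeddings` lookup table stands in for the pre-trained word vector model. Both function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; report no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


def average_vectors(tokens: Sequence[str],
                    embeddings: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Average the vectors of the known tokens (assumed pooling strategy)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
```

With this convention, two names none of whose tokens are shared or related in the embedding space yield a similarity near 0, while names that differ only in a closely related token yield a value near 1.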

Optionally, in the step S1, the step of preprocessing the text corresponding to the first name and the text corresponding to the second name to obtain the first text and the second text, respectively, includes:

removing invalid suffixes and symbol characters in the text corresponding to the first name and the text corresponding to the second name, converting capital English characters into lowercase English characters, and converting numbers into Chinese characters.

Specifically, an invalid suffix may be a word in a name that carries no practical meaning, such as "cell", "house", or "building". Removing the invalid suffixes from the text corresponding to the first name and the text corresponding to the second name prevents these meaningless words from adversely affecting the calculation result and improves the accuracy of the calculated text similarity.

In addition, after the invalid suffixes are removed, stop words, which appear in the text but contribute almost nothing to characterizing it, can be further removed from the text corresponding to the first name and the text corresponding to the second name. Examples are "a", "the", "of", "and", and "or" in English, and words such as "I" in Chinese.

Obviously, before the text similarity is calculated, the stop words are removed, so that the density of the keywords can be improved, the dimensionality of the text can be reduced, the calculation accuracy of the text similarity is further improved, the algorithm efficiency is effectively improved, and the real-time performance is better.
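The preprocessing of step S1 might look like the following sketch. The suffix list `INVALID_SUFFIXES` is a hypothetical example (the description names the categories but not the exact Chinese words), and converting numbers digit by digit is an assumption; an implementation could equally read multi-digit numbers as whole numerals.

```python
import re

# Digit-by-digit mapping from Arabic numerals to Chinese characters.
DIGIT_TO_CHINESE = str.maketrans("0123456789", "零一二三四五六七八九")

# Hypothetical examples of suffixes that carry no distinguishing meaning.
INVALID_SUFFIXES = ("小区", "大厦")


def preprocess_name(name: str) -> str:
    """Normalize a cell name before word segmentation and vectorization."""
    # Convert capital English characters into lowercase.
    text = name.strip().lower()
    # Remove symbol characters: everything that is not a letter, digit,
    # or CJK character (Unicode-aware \W, plus the underscore).
    text = re.sub(r"[\W_]+", "", text)
    # Remove one invalid suffix, but never reduce the name to an empty string.
    for suffix in INVALID_SUFFIXES:
        if text.endswith(suffix) and len(text) > len(suffix):
            text = text[: -len(suffix)]
            break
    # Convert numbers into Chinese characters.
    return text.translate(DIGIT_TO_CHINESE)
```

Normalizing both names in the same way before vectorization means that differences in case, punctuation, or numeral style do not lower the text similarity of names that refer to the same cell.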

Optionally, in the step 105, the step of determining the recognition result according to the text similarity and the distance between the first cell to be recognized and the second cell to be recognized includes:

when the text similarity is more than or equal to 0.9, the distinguishing result is that the first cell to be distinguished and the second cell to be distinguished are the same cell;

when the text similarity is more than or equal to 0.7, judging whether the distance between the first cell to be distinguished and the second cell to be distinguished is less than or equal to 300 meters; if so, the discrimination result is that the first cell to be discriminated and the second cell to be discriminated are the same cell; if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell;

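when the text similarity is greater than or equal to 0.6, judging whether the distance between the first cell to be distinguished and the second cell to be distinguished is less than or equal to 50 meters; if so, the discrimination result is that the first cell to be discriminated and the second cell to be discriminated are the same cell; if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell;

and when the distance between the first cell to be distinguished and the second cell to be distinguished is less than or equal to 10 meters, the distinguishing result is that the first cell to be distinguished and the second cell to be distinguished are the same cell.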
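The tiered rule of step 105 can be written compactly as below. The sketch assumes that the tiers are checked from the highest similarity downward and that the first matching tier decides, and it includes the rule stated in the summary of the method that cells no more than 10 meters apart are the same cell. The function name `decide_same_cell` is hypothetical.

```python
def decide_same_cell(similarity: float, distance_m: float) -> bool:
    """Combine text similarity and distance (in meters) into a yes/no result."""
    if similarity >= 0.9:
        # Very similar names: same cell (distance already within the preset threshold).
        return True
    if similarity >= 0.7:
        # Fairly similar names: require the cells to be within 300 meters.
        return distance_m <= 300.0
    if similarity >= 0.6:
        # Moderately similar names: require the cells to be within 50 meters.
        return distance_m <= 50.0
    # Dissimilar names: same cell only if the locations practically coincide.
    return distance_m <= 10.0
```

Because each distance bound is at least 10 meters, checking the 10-meter rule last gives the same result as applying it unconditionally. For instance, `decide_same_cell(0.75, 22.66)` and `decide_same_cell(0.0, 9.8)` both return `True`, matching the two application scenarios discussed in this description.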

For convenience of understanding, the method for identifying a cell of the same name provided in the embodiments of the present application is described below with reference to specific application scenarios. Table 1 shows two sets of cells to be identified and corresponding attribute information in the embodiment of the present application.

TABLE 1

Referring to Table 1, in the first group of cells to be identified, the first name of the first cell to be identified is "Cuiyi Cell", and the corresponding first basic information is Haizhu District, Guangzhou; the second name of the second cell to be identified is "Cuiyi Community", and the corresponding second basic information is likewise Haizhu District, Guangzhou. Therefore, the first basic information is the same as the second basic information.

Then, the distance between "Cuiyi Cell" and "Cuiyi Community" is calculated. As shown in Table 1, the first longitude and latitude information of the first cell to be distinguished is (113.2874, 23.0892), the second longitude and latitude information of the second cell to be distinguished is (113.2866, 23.0901), and the distance between the two is calculated to be 22.66 meters. With a preset threshold of 1500 meters, since 22.66 < 1500, the text similarity between "Cuiyi Cell" and "Cuiyi Community" is further calculated to be 0.75.

Since the text similarity of 0.75 is greater than or equal to 0.7 and the distance of 22.66 meters is less than or equal to 300 meters, "Cuiyi Cell" and "Cuiyi Community" are the same cell.

Referring to Table 1 again, in the second group of cells to be identified, the first name of the first cell to be identified is "Chocolate City", and the corresponding first basic information is Fengtai District, Beijing; the second name of the second cell to be identified is "Runxing Home", and the corresponding second basic information is likewise Fengtai District, Beijing. Since the first basic information is the same as the second basic information, the distance between the two cells is then calculated. As shown in Table 1, the first longitude and latitude information of the first cell to be identified, namely "Chocolate City", is (116.4511, 39.8225), the second longitude and latitude information of the second cell to be identified, namely "Runxing Home", is (116.4509, 38.8225), and the distance between the first cell to be identified and the second cell to be identified is calculated to be 9.8 meters. Similarly, since the calculated distance is less than the preset threshold of 1500 meters, the text similarity of "Chocolate City" and "Runxing Home" is further calculated, and is 0.

Although the text similarity is 0, the distance of 9.8 meters is less than or equal to 10 meters; combining the distance and the text similarity between the first cell to be distinguished and the second cell to be distinguished, "Chocolate City" and "Runxing Home" are therefore the same cell.

The homonymous cell distinguishing method based on the text similarity at least has the following beneficial effects that:

according to the method for distinguishing the homonymous cells based on the text similarity, a first name of a first cell to be distinguished and a second name of a second cell to be distinguished are obtained; acquiring first attribute information of a first cell to be identified and second attribute information of a second cell to be identified; the first attribute information comprises first basic information and first longitude and latitude information of a first cell to be distinguished; the second attribute information comprises second basic information and second longitude and latitude information of a second cell to be distinguished; when the first basic information and the second basic information are the same, determining the distance between the first cell to be distinguished and the second cell to be distinguished according to the first longitude and latitude information and the second longitude and latitude information; when the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to a preset threshold value, calculating the text similarity of the first name and the second name; and determining a discrimination result according to the text similarity and the distance between the first cell to be discriminated and the second cell to be discriminated. The distance between the cells to be distinguished and the text similarity between the cell names of the cells to be distinguished are comprehensively considered, so that misjudgment caused by the fact that the cells to be distinguished have aliases or the cell names to be distinguished are the same can be avoided, and the distinguishing accuracy is effectively improved.

Based on the same inventive concept, the present application further provides a system for identifying a cell with the same name based on text similarity, and fig. 2 is a schematic structural diagram of the system for identifying a cell with the same name based on text similarity according to an embodiment of the present application. Referring to fig. 2, the system includes:

a name obtaining module 210, configured to obtain a first name of a first cell to be identified and a second name of a second cell to be identified;

an information obtaining module 220, configured to obtain first attribute information of the first cell to be identified and second attribute information of the second cell to be identified; the first attribute information comprises first basic information and first longitude and latitude information of the first cell to be identified; the second attribute information comprises second basic information and second longitude and latitude information of the second cell to be identified;

a distance determining module 230, configured to determine, when the first basic information and the second basic information are the same, a distance between the first cell to be identified and the second cell to be identified according to the first longitude and latitude information and the second longitude and latitude information;

a similarity calculation module 240, configured to calculate the text similarity between the first name and the second name when the distance between the first cell to be identified and the second cell to be identified is less than or equal to a preset threshold;

and a result determining module 250, configured to determine an identification result according to the text similarity and the distance between the first cell to be identified and the second cell to be identified.
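One way to picture how the five modules cooperate is a small container that holds each module as a replaceable function. This is only a sketch under assumptions: the class name `HomonymousCellSystem`, the `identify` method, the `Attributes` tuple layout and the callable signatures are all hypothetical, and returning `False` when a gate is not passed is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# (basic information, latitude, longitude); this tuple layout is an assumption.
Attributes = Tuple[str, float, float]


@dataclass
class HomonymousCellSystem:
    get_name: Callable[[str], str]                       # name obtaining module 210
    get_attributes: Callable[[str], Attributes]          # information obtaining module 220
    distance: Callable[[Attributes, Attributes], float]  # distance determining module 230
    similarity: Callable[[str, str], float]              # similarity calculation module 240
    decide: Callable[[float, float], bool]               # result determining module 250
    threshold: float                                     # preset distance threshold

    def identify(self, first_id: str, second_id: str) -> bool:
        name1, name2 = self.get_name(first_id), self.get_name(second_id)
        attr1, attr2 = self.get_attributes(first_id), self.get_attributes(second_id)
        # Module 230 only runs when the basic information is the same.
        if attr1[0] != attr2[0]:
            return False  # simplification: no match is reported
        d = self.distance(attr1, attr2)
        # Module 240 only runs when the distance is within the threshold.
        if d > self.threshold:
            return False  # simplification: no match is reported
        # Module 250 decides from both the similarity and the distance.
        return self.decide(self.similarity(name1, name2), d)
```

Keeping each module as a separate callable means the distance formula or the similarity measure can be swapped without touching the gating order.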

According to the system for identifying homonymous cells based on text similarity, both the distance between the cells to be identified and the text similarity of their names are considered, so that misjudgment caused by a cell having aliases, or by different cells having the same name, can be avoided, and the identification accuracy is effectively improved.

In summary, the method and system for identifying the cell with the same name based on the text similarity provided by the invention at least achieve the following beneficial effects:

Although some specific embodiments of the present invention have been described in detail by way of examples, it should be understood by those skilled in the art that the above examples are for illustrative purposes only and are not intended to limit the scope of the present invention. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims

1. A method for identifying homonymous cells based on text similarity, characterized by comprising the following steps:

2. The method according to claim 1, wherein the first basic information includes a first city and a first urban area where the first cell to be identified is located; the second basic information includes a second city and a second urban area where the second cell to be identified is located;

3. The method for identifying homonymous cells based on text similarity according to claim 2, wherein the first longitude and latitude information includes a first longitude and a first latitude of the first cell to be identified, and the second longitude and latitude information includes a second longitude and a second latitude of the second cell to be identified;

and

4. The method for identifying homonymous cells based on text similarity according to claim 3, wherein, after the step of determining the distance between the first cell to be identified and the second cell to be identified, the method further comprises:

5. The method of claim 1, wherein the step of calculating the text similarity between the first name and the second name comprises:

6. The method for identifying homonymous cells based on text similarity according to claim 5, wherein the step of preprocessing the text corresponding to the first name and the text corresponding to the second name to obtain the first text and the second text respectively comprises:

7. The method for identifying homonymous cells based on text similarity according to claim 1, wherein the step of determining the identification result according to the text similarity and the distance between the first cell to be identified and the second cell to be identified comprises:

8. A system for identifying homonymous cells based on text similarity, the system comprising: