CN111291155A - Method and system for identifying homonymous cells based on text similarity - Google Patents

Method and system for identifying homonymous cells based on text similarity

Publication number
CN111291155A
Authority
CN
China
Prior art keywords
cell
distinguished
name
text
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010054556.4A
Other languages
Chinese (zh)
Inventor
朱晨晓
李昭
陈浩
高靖
崔岩
卢述奇
陈呈
张宵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingwutong Co ltd
Original Assignee
Qingwutong Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingwutong Co ltd
Priority to CN202010054556.4A
Publication of CN111291155A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for identifying homonymous cells based on text similarity. The method comprises: acquiring a first name of a first cell to be identified and a second name of a second cell to be identified; acquiring first attribute information of the first cell to be identified and second attribute information of the second cell to be identified, each comprising basic information and longitude and latitude information; when the first basic information and the second basic information are the same, determining the distance between the first cell to be identified and the second cell to be identified according to the first longitude and latitude information and the second longitude and latitude information; when the distance is smaller than or equal to a preset threshold value, calculating the text similarity of the first name and the second name; and determining a discrimination result according to the text similarity and the distance between the two cells. Because both the distance between the cells and the text similarity between their names are considered, the method avoids misjudgment caused by a cell having an alias or by different cells having the same or similar names, and effectively improves the identification accuracy.

Description

Method and system for identifying homonymous cells based on text similarity
Technical Field
The invention relates to the technical field of information processing, in particular to a method and a system for identifying homonymous cells based on text similarity.
Background
With the rapid development and spread of the internet, house renting and selling platforms have become widely used. House brokers publish house source information on these platforms so that users can search for the house sources they need by setting screening conditions.
However, in some application scenarios, a cell (residential community) A may also be known by an alias B. Different house brokers may then use different cell names when publishing house source information, so that a user searching the listings cannot tell whether two listings refer to the same house source. In another application scenario, two different cells have the same or similar names, and the user may mistakenly conclude that they are the same house source.
To address this problem, the prior art determines whether two cell names refer to the same cell as follows: first, judging whether the cities and urban areas where the two cells are located are the same; if they are the same, further calculating the text similarity of the two cell names; and if the text similarity is greater than or equal to 90%, judging that the two cells are the same cell.
However, with this method, misjudgments occur frequently when a cell has an alias whose text similarity to its name is below 90%, or when the names of two different cells have a text similarity of 90% or more, and the identification accuracy is greatly reduced.
Disclosure of Invention
The invention provides a method and a system for identifying homonymous cells based on text similarity, which can effectively improve the identification accuracy of homonymous cells and reduce the misjudgment risk.
In a first aspect, the present application provides a method for identifying a cell with the same name based on text similarity, where the method includes:
acquiring a first name of a first cell to be identified and a second name of a second cell to be identified;
acquiring first attribute information of the first cell to be identified and second attribute information of the second cell to be identified; the first attribute information comprises first basic information and first longitude and latitude information of a first cell to be distinguished; the second attribute information comprises second basic information and second longitude and latitude information of a second cell to be identified;
when the first basic information and the second basic information are the same, determining the distance between a first cell to be distinguished and a second cell to be distinguished according to the first longitude and latitude information and the second longitude and latitude information;
when the distance between a first cell to be distinguished and a second cell to be distinguished is smaller than or equal to a preset threshold value, calculating the text similarity of the first name and the second name;
and determining a discrimination result according to the text similarity and the distance between the first cell to be discriminated and the second cell to be discriminated.
Optionally, the first basic information includes a first city and a first urban area where the first cell to be identified is located; the second basic information comprises a second city and a second urban area where the second cell to be distinguished is located;
the method for judging whether the first basic information and the second basic information are the same comprises the following steps:
comparing whether a first city in which the first cell to be distinguished is located is the same as a second city in which the second cell to be distinguished is located; if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell;
if so, comparing whether a first urban area in which the first cell to be distinguished is located is the same as a second urban area in which the second cell to be distinguished is located: if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell; and if so, executing the step of determining the distance between the first cell to be distinguished and the second cell to be distinguished according to the first longitude and latitude information and the second longitude and latitude information.
Optionally, the first longitude and latitude information includes a first longitude and a first latitude of a first cell to be distinguished, and the second longitude and latitude information includes a second longitude and a second latitude of a second cell to be distinguished;
the distance between the first cell to be distinguished and the second cell to be distinguished is obtained by adopting the following formula:
[Formula image BDA0002372366650000031 is not reproduced in this text]
wherein d is the distance between the first cell to be discriminated and the second cell to be discriminated, R is the radius of the earth, φ1 and φ2 are the first and second latitudes, respectively, and Δλ represents the difference between the first and second longitudes.
Optionally, after the step of determining the distance between the first cell to be distinguished and the second cell to be distinguished, the method further includes:
and when the distance between the first cell to be distinguished and the second cell to be distinguished is larger than a preset threshold value, judging that the first cell to be distinguished and the second cell to be distinguished are not the same cell.
Optionally, the step of calculating the text similarity between the first name and the second name includes:
preprocessing a text corresponding to the first name and a text corresponding to the second name to respectively obtain a first text and a second text;
after word segmentation processing is carried out on the first text and the second text, vectorizing is carried out on the first text and the second text respectively by utilizing a word vector model trained in advance, and a first word vector and a second word vector are obtained;
and calculating the cosine similarity of the first word vector and the second word vector as the text similarity of the first name and the second name.
Optionally, the step of preprocessing the text corresponding to the first name and the text corresponding to the second name to obtain the first text and the second text respectively includes:
removing invalid suffixes and symbol characters from the text corresponding to the first name and the text corresponding to the second name, converting uppercase English characters into lowercase English characters, and converting numbers into Chinese characters.
Optionally, the step of determining a recognition result according to the text similarity and the distance between the first cell to be recognized and the second cell to be recognized includes:
when the text similarity is greater than or equal to 0.9, the discrimination result is that the first cell to be distinguished and the second cell to be distinguished are the same cell;
when the text similarity is larger than or equal to 0.7, judging whether the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to 300 meters; if so, determining that the first cell to be distinguished and the second cell to be distinguished are the same cell; if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell;
when the text similarity is larger than or equal to 0.6, judging whether the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to 50 meters; if so, determining that the first cell to be distinguished and the second cell to be distinguished are the same cell; if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell;
and when the distance between the first cell to be distinguished and the second cell to be distinguished is less than or equal to 10 meters, the distinguishing result is that the first cell to be distinguished and the second cell to be distinguished are the same cell.
In a second aspect, the present application provides a system for identifying a cell of the same name based on text similarity, the system comprising:
the name acquisition module is used for acquiring a first name of a first cell to be distinguished and a second name of a second cell to be distinguished;
an information obtaining module, configured to obtain first attribute information of the first cell to be identified and second attribute information of the second cell to be identified; the first attribute information comprises first basic information and first longitude and latitude information of a first cell to be distinguished; the second attribute information comprises second basic information and second longitude and latitude information of a second cell to be identified;
a distance determining module, configured to determine, when the first basic information and the second basic information are the same, a distance between a first cell to be identified and a second cell to be identified according to the first longitude and latitude information and the second longitude and latitude information;
the similarity calculation module is used for calculating the text similarity of the first name and the second name when the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to a preset threshold value;
and the result determining module is used for determining a discrimination result according to the text similarity and the distance between the first cell to be discriminated and the second cell to be discriminated.
Compared with the prior art, the homonymous cell identification method and system based on text similarity provided by the invention at least realize the following beneficial effects:
According to the method and the system for identifying homonymous cells based on text similarity, the first name of the first cell to be identified and the second name of the second cell to be identified are acquired, and first attribute information of the first cell to be identified and second attribute information of the second cell to be identified are acquired; the first attribute information comprises first basic information and first longitude and latitude information of the first cell to be distinguished; the second attribute information comprises second basic information and second longitude and latitude information of the second cell to be distinguished;
when the first basic information and the second basic information are the same, determining the distance between the first cell to be distinguished and the second cell to be distinguished according to the first longitude and latitude information and the second longitude and latitude information; when the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to a preset threshold value, calculating the text similarity of the first name and the second name; and determining a discrimination result according to the text similarity and the distance between the first cell to be discriminated and the second cell to be discriminated. The distance between the cells to be distinguished and the text similarity between the cell names of the cells to be distinguished are comprehensively considered, so that misjudgment caused by the fact that the cells to be distinguished have aliases or the cell names to be distinguished are the same can be avoided, and the distinguishing accuracy is effectively improved.
Of course, it is not necessary for any product in which the present invention is practiced to achieve all of the above-described technical effects simultaneously.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a flowchart illustrating a method for identifying a cell of the same name based on text similarity according to an embodiment of the present disclosure;
Fig. 2 is a schematic structural diagram of a system for identifying a cell of the same name based on text similarity according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
In the homonymous cell identification method provided by the prior art, for cells to be identified that are in the same city and urban area, whether they are the same cell is judged only by whether the text similarity is greater than or equal to a preset threshold value. It can be understood that if a certain cell has a name A and an alias B, and the text similarity between A and B is less than 90%, name A and alias B are judged to be different cells. Conversely, if two different cells are named A and B and the names are the same or similar, so that the text similarity is greater than or equal to 90%, the two are judged to be the same cell. Because the prior-art method uses text similarity as its only basis, misjudgments occur frequently and the identification accuracy is greatly reduced.
In view of the above, the invention provides a method for identifying a cell with the same name based on text similarity, which can effectively improve the identification accuracy of the cell with the same name and reduce the risk of misjudgment.
The invention is described in detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flowchart illustrating a method for identifying a cell of the same name based on text similarity according to an embodiment of the present disclosure. Referring to fig. 1, the method for identifying a cell of the same name includes:
step 101, acquiring a first name of a first cell to be identified and a second name of a second cell to be identified;
step 102, acquiring first attribute information of a first cell to be identified and second attribute information of a second cell to be identified; the first attribute information comprises first basic information and first longitude and latitude information of a first cell to be distinguished; the second attribute information comprises second basic information and second longitude and latitude information of a second cell to be distinguished;
step 103, when the first basic information and the second basic information are the same, determining the distance between the first cell to be identified and the second cell to be identified according to the first longitude and latitude information and the second longitude and latitude information;
step 104, when the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to a preset threshold value, calculating the text similarity of the first name and the second name;
step 105, determining a discrimination result according to the text similarity and the distance between the first cell to be discriminated and the second cell to be discriminated.
Specifically, after acquiring a first name of a first cell to be identified and a second name of a second cell to be identified, the first basic information and the second basic information may be compared first, and preliminary identification may be performed through space limitation. When the first basic information is the same as the second basic information, calculating the distance between the first cell to be identified and the second cell to be identified by using the acquired first longitude and latitude information and the acquired second longitude and latitude information; it will be appreciated that if the first cell to be distinguished and the second cell to be distinguished are the same cell, the distance between the two should be small. Therefore, only when the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to the preset threshold, the text similarity of the first name and the second name is calculated, and the distinguishing result is determined by combining the text similarity and the distance between the two cells.
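The order of checks described above can be sketched as a short-circuiting function. This is only an illustrative sketch: the `Cell` record, the `identify` function, and the three injected callables (`distance_m`, `similarity`, `decide`) are hypothetical names that do not appear in the specification, and the default of 1500 meters is the example threshold given in this embodiment.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Cell:
    """Hypothetical record holding the name and attribute information of one cell."""
    name: str
    city: str
    urban_area: str
    lon: float  # longitude in degrees
    lat: float  # latitude in degrees


def identify(
    a: Cell,
    b: Cell,
    distance_m: Callable[[Cell, Cell], float],
    similarity: Callable[[str, str], float],
    decide: Callable[[float, float], bool],
    preset_threshold_m: float = 1500.0,
) -> bool:
    """Return True if the two cells are judged to be the same cell."""
    # Basic information (city and urban area) must match before any calculation.
    if (a.city, a.urban_area) != (b.city, b.urban_area):
        return False
    # Step 103: distance from longitude and latitude information.
    d = distance_m(a, b)
    # Cells farther apart than the preset threshold are not the same cell;
    # the text similarity is never computed for them.
    if d > preset_threshold_m:
        return False
    # Step 104: text similarity of the two names.
    s = similarity(a.name, b.name)
    # Step 105: combine text similarity and distance.
    return decide(s, d)
```

Passing the distance, similarity, and decision functions as parameters keeps the sketch independent of any particular formula or model; the cheap comparisons run first so that the more expensive similarity calculation is reached only for nearby cells.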
According to the method for distinguishing the cells with the same name, the distance between the cells to be distinguished and the text similarity between the cell names of the cells to be distinguished are comprehensively considered, whether the cells are the same or not can be accurately distinguished, misjudgment caused by the fact that the cells to be distinguished have the alias or the cell names to be distinguished are the same can be avoided, and distinguishing accuracy is effectively improved.
Meanwhile, once it is determined whether homonymous cells are the same cell, the platform can further check whether the same house source already exists when a house broker publishes house source information. This facilitates subsequent merging or de-duplication of house sources, so that users can quickly and accurately find the information they need in massive house source data, which improves the user experience.
Optionally, the first basic information includes a first city and a first urban area where the first cell to be identified is located; the second basic information comprises a second city and a second urban area where a second cell to be identified is located;
before the step 103, a method of determining whether the first basic information and the second basic information are the same may be:
comparing whether a first city in which the first cell to be distinguished is located is the same as a second city in which the second cell to be distinguished is located; if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell;
if so, comparing whether a first urban area in which the first cell to be distinguished is located is the same as a second urban area in which the second cell to be distinguished is located: if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell; and if so, executing the step of determining the distance between the first cell to be distinguished and the second cell to be distinguished according to the first longitude and latitude information and the second longitude and latitude information.
It will be appreciated that if the two cells to be distinguished are the same cell, they must be in the same city and the same urban area; that is, two cells to be distinguished that are in different cities or urban areas cannot be the same cell. Therefore, when the cells to be distinguished are in different cities or urban areas, the discrimination result can be determined quickly by comparing the first basic information with the second basic information, without any subsequent calculation, which saves computing resources and improves the real-time performance of the algorithm.
Optionally, the first longitude and latitude information includes a first longitude and a first latitude of the first cell to be distinguished, and the second longitude and latitude information includes a second longitude and a second latitude of the second cell to be distinguished;
the distance between the first cell to be distinguished and the second cell to be distinguished can be calculated by adopting the following formula:
[Formula image BDA0002372366650000081 is not reproduced in this text]
wherein d is the distance between the first cell to be discriminated and the second cell to be discriminated, R is the radius of the earth, φ1 and φ2 are the first and second latitudes, respectively, and Δλ represents the difference between the first and second longitudes.
In this embodiment, the first longitude and latitude information and the second longitude and latitude information may be obtained from a map application, and the radius of the earth may be taken as 6371 km. Treating the earth as a sphere, the formula converts the longitude and latitude of the two cells into a distance along the earth's surface, so that the distance between the cells to be distinguished can be calculated accurately.
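As an illustration of such a distance calculation, the following sketch assumes the spherical law of cosines, d = R · arccos(sin φ1 · sin φ2 + cos φ1 · cos φ2 · cos Δλ), which uses exactly the symbols d, R, φ1, φ2 and Δλ defined above. Whether this is the formula of the embodiment is an assumption, since the formula itself is not available in the text; the function name `great_circle_distance_m` is likewise hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # 6371 km, the radius given in this embodiment


def great_circle_distance_m(lon1: float, lat1: float,
                            lon2: float, lat2: float,
                            radius_m: float = EARTH_RADIUS_M) -> float:
    """Distance in meters between two points given in decimal degrees.

    Assumed formula (spherical law of cosines), not confirmed by the source:
        d = R * arccos(sin(phi1) sin(phi2) + cos(phi1) cos(phi2) cos(delta_lambda))
    """
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_lambda = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lambda))
    # Floating-point rounding can push the value slightly outside [-1, 1],
    # which would make acos raise ValueError, so clamp it first.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return radius_m * math.acos(cos_angle)
```

One design caveat: for points that are only a few meters apart, rounding inside the arccosine can become noticeable with this form, so the haversine formula is a common alternative when thresholds of a few meters matter.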
Optionally, after the step of determining the distance between the first cell to be distinguished and the second cell to be distinguished in step 103, the method further includes:
and when the distance between the first cell to be distinguished and the second cell to be distinguished is larger than a preset threshold value, judging that the first cell to be distinguished and the second cell to be distinguished are not the same cell.
The preset threshold value can be set according to the area of the cells to be distinguished. In this embodiment, the preset threshold may be 1500 meters. When the distance between the first cell to be distinguished and the second cell to be distinguished is larger than 1500 meters, the two are judged not to be the same cell without calculating the text similarity, which greatly increases the operation speed of the algorithm.
Optionally, in the step 104, the step of calculating the text similarity between the first name and the second name includes:
S1, preprocessing the text corresponding to the first name and the text corresponding to the second name to respectively obtain a first text and a second text;
S2, after word segmentation processing is carried out on the first text and the second text, vectorization is carried out on the first text and the second text respectively by using a word vector model trained in advance, and a first word vector and a second word vector are obtained;
S3, calculating the cosine similarity of the first word vector and the second word vector as the text similarity of the first name and the second name.
The text similarity represents the degree of matching between two or more texts: the larger the text similarity, the more similar the texts are, and the smaller the text similarity, the less similar they are. When vectorizing text, different vectorization granularities may be selected; for example, vectorization may be performed in units of characters or in units of words.
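The cosine similarity of step S3 can be sketched as follows. The specification does not describe the pre-trained word vector model, so this sketch substitutes simple character-count vectors, with single characters as the vectorization unit; that substitution is an assumption made purely for illustration, as are the names `char_count_vector` and `cosine_similarity`.

```python
import math
from collections import Counter


def char_count_vector(text: str) -> Counter:
    """Stand-in vectorizer (assumption): count each character of the text.

    A real system would use the pre-trained word vector model mentioned in S2.
    """
    return Counter(text)


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # An empty text has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)
```

With count vectors the result always lies between 0 and 1, where 1 means the two texts contain the same characters in the same proportions and 0 means they share no characters.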
Optionally, in the step S1, the step of preprocessing the text corresponding to the first name and the text corresponding to the second name to obtain the first text and the second text, respectively, includes:
removing invalid suffixes and symbol characters from the text corresponding to the first name and the text corresponding to the second name, converting uppercase English characters into lowercase English characters, and converting numbers into Chinese characters.
Specifically, an invalid suffix may be a word that carries no practical meaning in the name, such as "cell", "house", or "building". Removing the invalid suffixes from the text corresponding to the first name and the text corresponding to the second name prevents these meaningless words from distorting the calculation result and improves the accuracy of the calculated text similarity.
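A minimal sketch of the preprocessing in step S1 is given below. Several details are assumptions rather than part of the specification: the function name `preprocess_name`, the Chinese suffix list (chosen only as plausible examples of suffixes without practical meaning), the order of the operations, and the digit-by-digit conversion of numbers into Chinese characters.

```python
import re

# Hypothetical suffix list for illustration; the specification only gives
# translated examples of invalid suffixes.
INVALID_SUFFIXES = ("小区", "花园", "大厦")

# Assumption: each digit is converted on its own (e.g. "12" becomes "一二").
DIGIT_TO_CHINESE = str.maketrans("0123456789", "零一二三四五六七八九")


def preprocess_name(name: str, invalid_suffixes=INVALID_SUFFIXES) -> str:
    """Normalize a cell name before similarity calculation."""
    # Remove symbol characters: everything that is not a letter or digit
    # (underscore included, since \w would otherwise keep it).
    text = re.sub(r"[\W_]+", "", name)
    # Convert uppercase English characters into lowercase.
    text = text.lower()
    # Strip at most one invalid suffix, and never reduce the name to nothing.
    for suffix in invalid_suffixes:
        if text.endswith(suffix) and len(text) > len(suffix):
            text = text[: -len(suffix)]
            break
    # Convert numbers into Chinese characters.
    return text.translate(DIGIT_TO_CHINESE)
```

For example, under these assumptions `preprocess_name("ABC-花园(2期)小区")` yields `"abc花园二期"`: the symbols and the trailing suffix are removed, the letters are lowercased, and the digit is converted.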
In addition, after the invalid suffixes are removed, stop words, which appear in the text but contribute almost nothing to characterizing it, can be further removed from the text corresponding to the first name and the text corresponding to the second name. Examples are "a", "the", "of", "and", and "or" in English, and words such as "I" in Chinese.
Removing the stop words before the text similarity is calculated increases the density of keywords and reduces the dimensionality of the text, which further improves the accuracy of the text similarity calculation and the efficiency and real-time performance of the algorithm.
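Stop-word removal on an already segmented text might look like the sketch below. The stop-word set contains only the English examples mentioned above; a practical list, including Chinese stop words, would be larger and is not given in the specification. The name `remove_stop_words` is hypothetical.

```python
# English examples from the text; a real stop-word list would be longer.
STOP_WORDS = frozenset({"a", "the", "of", "and", "or"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that are stop words, comparing case-insensitively."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["The", "garden", "of", "roses"]
kept = remove_stop_words(tokens)
# Fewer distinct tokens means fewer dimensions in the resulting text vector.
dims_before, dims_after = len(set(tokens)), len(set(kept))
```

Here four distinct tokens are reduced to two, which illustrates the reduction in dimensionality described above.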
Optionally, in step 105, determining the identification result according to the text similarity and the distance between the first cell to be identified and the second cell to be identified includes:
when the text similarity is greater than or equal to 0.9, the identification result is that the first cell to be identified and the second cell to be identified are the same cell;
when the text similarity is greater than or equal to 0.7, judging whether the distance between the first cell to be identified and the second cell to be identified is less than or equal to 300 meters; if so, the identification result is that the first cell to be identified and the second cell to be identified are the same cell; if not, the first cell to be identified and the second cell to be identified are not the same cell;
when the text similarity is greater than or equal to 0.6, judging whether the distance between the first cell to be identified and the second cell to be identified is less than or equal to 50 meters; if so, the identification result is that the first cell to be identified and the second cell to be identified are the same cell; if not, the first cell to be identified and the second cell to be identified are not the same cell;
and when the distance between the first cell to be identified and the second cell to be identified is less than or equal to 10 meters, the identification result is that the first cell to be identified and the second cell to be identified are the same cell.
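The four conditions above can be read as a cascade, which the following sketch illustrates. Two points are assumptions rather than statements from the text: the conditions are evaluated in the order listed, and a pair that satisfies none of them is treated as not the same cell. The function name is hypothetical.

```python
def is_same_cell(similarity: float, distance_m: float) -> bool:
    """Cascaded reading of the similarity and distance thresholds."""
    if similarity >= 0.9:
        return True  # names nearly identical: distance is not consulted
    if similarity >= 0.7:
        return distance_m <= 300
    if similarity >= 0.6:
        return distance_m <= 50
    # Below 0.6 only near-coincident coordinates are accepted; treating
    # every other case as "not the same cell" is an assumption.
    return distance_m <= 10
```

Under this reading the 10 meter rule only changes the outcome when the similarity is below 0.6, since any pair within 10 meters already satisfies the 300 meter and 50 meter checks.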
For convenience of understanding, the method for identifying a cell of the same name provided in the embodiments of the present application is described below with reference to specific application scenarios. Table 1 shows two sets of cells to be identified and corresponding attribute information in the embodiment of the present application.
TABLE 1
[Table 1 is provided as an image in the original publication.]
Referring to Table 1, in the first group of cells to be identified, the first name of the first cell to be identified is "Cuiyi Cell", and the corresponding first basic information is Haizhu District, Guangzhou; the second name of the second cell to be identified is "Cuiyi Community", and the corresponding second basic information is also Haizhu District, Guangzhou. Therefore, the first basic information is the same as the second basic information.
Then, the distance between "Cuiyi Cell" and "Cuiyi Community" is calculated. As shown in Table 1, the first longitude and latitude information of the first cell to be identified is (113.2874, 23.0892), the second longitude and latitude information of the second cell to be identified is (113.2866, 23.0901), and the distance between the two is calculated to be 22.66 meters. If the preset threshold is 1500 meters, then since 22.66 < 1500, the text similarity between "Cuiyi Cell" and "Cuiyi Community" is further calculated, giving 0.75.
In summary, because the text similarity of 0.75 is greater than or equal to 0.7 and the distance of 22.66 meters is less than or equal to 300 meters, "Cuiyi Cell" and "Cuiyi Community" are the same cell.
Referring to Table 1 again, in the second group of cells to be identified, the first name of the first cell to be identified is "Chocolate City", and the corresponding first basic information is Fengtai District, Beijing; the second name of the second cell to be identified is "Runxing Home", and the corresponding second basic information is also Fengtai District, Beijing. Since the first basic information is the same as the second basic information, the distance between the two cells is calculated next. As shown in Table 1, the first longitude and latitude information of the first cell to be identified, "Chocolate City", is (116.4511, 39.8225), the second longitude and latitude information of the second cell to be identified, "Runxing Home", is (116.4509, 38.8225), and the distance between the first cell to be identified and the second cell to be identified is calculated to be 9.8 meters. As before, since the calculated distance is less than the preset threshold of 1500 meters, the text similarity of "Chocolate City" and "Runxing Home" is further calculated, giving 0.
Combining the distance and the text similarity: although the text similarity is 0, the distance of 9.8 meters is less than or equal to 10 meters, so "Chocolate City" and "Runxing Home" are the same cell.
The method for identifying homonymous cells based on text similarity provided by the embodiments of the present application has at least the following beneficial effects:
According to the method, a first name of a first cell to be identified and a second name of a second cell to be identified are obtained, together with first attribute information of the first cell to be identified and second attribute information of the second cell to be identified. The first attribute information comprises first basic information and first longitude and latitude information of the first cell to be identified; the second attribute information comprises second basic information and second longitude and latitude information of the second cell to be identified. When the first basic information and the second basic information are the same, the distance between the first cell to be identified and the second cell to be identified is determined according to the first longitude and latitude information and the second longitude and latitude information. When that distance is less than or equal to a preset threshold, the text similarity of the first name and the second name is calculated, and an identification result is determined according to the text similarity and the distance. Because both the distance between the cells to be identified and the text similarity between their names are considered, misjudgment caused by a cell having aliases or by different cells sharing the same name can be avoided, and the identification accuracy is effectively improved.
Based on the same inventive concept, the present application further provides a system for identifying a cell with the same name based on text similarity, and fig. 2 is a schematic structural diagram of the system for identifying a cell with the same name based on text similarity according to an embodiment of the present application. Referring to fig. 2, the system includes:
a name obtaining module 210, configured to obtain a first name of a first cell to be identified and a second name of a second cell to be identified;
an information obtaining module 220, configured to obtain first attribute information of a first cell to be identified and second attribute information of a second cell to be identified; the first attribute information comprises first basic information and first longitude and latitude information of a first cell to be distinguished; the second attribute information comprises second basic information and second longitude and latitude information of a second cell to be distinguished;
a distance determining module 230, configured to determine, when the first basic information and the second basic information are the same, a distance between the first cell to be identified and the second cell to be identified according to the first longitude and latitude information and the second longitude and latitude information;
the similarity calculation module 240 is configured to calculate text similarity between the first name and the second name when a distance between the first cell to be distinguished and the second cell to be distinguished is less than or equal to a preset threshold;
and a result determining module 250, configured to determine a recognition result according to the text similarity and a distance between the first cell to be recognized and the second cell to be recognized.
According to the homonymous cell distinguishing system based on the text similarity, the distance between the cells to be distinguished and the text similarity between the cell names of the cells to be distinguished are comprehensively considered, so that misjudgment caused by the fact that the cells to be distinguished have aliases or the cell names to be distinguished are the same can be avoided, and the distinguishing accuracy is effectively improved.
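As a rough sketch of how modules 210 to 250 might be chained, the following outline passes the distance, similarity, and decision steps in as functions so that any concrete implementation can be plugged in. The `Cell` type, the `identify` function, and all parameter names are hypothetical; the default of 1500 meters is the preset threshold used in the worked example, not a required value.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Cell:
    """Name plus attribute information (basic information and coordinates)."""
    name: str
    city: str
    district: str
    lon: float
    lat: float


def identify(
    a: Cell,
    b: Cell,
    distance_fn: Callable[[Cell, Cell], float],
    similarity_fn: Callable[[str, str], float],
    decide_fn: Callable[[float, float], bool],
    max_distance_m: float = 1500.0,
) -> bool:
    """Return True when the two cells are judged to be the same cell."""
    # The distance is only determined when the basic information matches.
    if (a.city, a.district) != (b.city, b.district):
        return False
    distance_m = distance_fn(a, b)
    # Pairs farther apart than the preset threshold are rejected outright,
    # so the text similarity is never computed for them.
    if distance_m > max_distance_m:
        return False
    similarity = similarity_fn(a.name, b.name)
    return decide_fn(similarity, distance_m)
```

Keeping the three steps as parameters mirrors the module split in fig. 2 and lets the cheap checks (basic information, distance) filter pairs before the more expensive similarity calculation runs.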
In summary, the method and system for identifying the cell with the same name based on the text similarity provided by the invention at least achieve the following beneficial effects:
according to the method and the system for identifying the homonymous cell based on the text similarity, first attribute information of a first cell to be identified and second attribute information of a second cell to be identified are obtained by obtaining the first name of the first cell to be identified and the second name of the second cell to be identified; the first attribute information comprises first basic information and first longitude and latitude information of a first cell to be distinguished; the second attribute information comprises second basic information and second longitude and latitude information of a second cell to be distinguished;
when the first basic information and the second basic information are the same, determining the distance between the first cell to be distinguished and the second cell to be distinguished according to the first longitude and latitude information and the second longitude and latitude information; when the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to a preset threshold value, calculating the text similarity of the first name and the second name; and determining a discrimination result according to the text similarity and the distance between the first cell to be discriminated and the second cell to be discriminated. The distance between the cells to be distinguished and the text similarity between the cell names of the cells to be distinguished are comprehensively considered, so that misjudgment caused by the fact that the cells to be distinguished have aliases or the cell names to be distinguished are the same can be avoided, and the distinguishing accuracy is effectively improved.
Although some specific embodiments of the present invention have been described in detail by way of examples, it should be understood by those skilled in the art that the above examples are for illustrative purposes only and are not intended to limit the scope of the present invention. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (8)

1. A method for identifying a cell with the same name based on text similarity is characterized by comprising the following steps:
acquiring a first name of a first cell to be identified and a second name of a second cell to be identified;
acquiring first attribute information of the first cell to be identified and second attribute information of the second cell to be identified; the first attribute information comprises first basic information and first longitude and latitude information of a first cell to be distinguished; the second attribute information comprises second basic information and second longitude and latitude information of a second cell to be identified;
when the first basic information and the second basic information are the same, determining the distance between a first cell to be distinguished and a second cell to be distinguished according to the first longitude and latitude information and the second longitude and latitude information;
when the distance between a first cell to be distinguished and a second cell to be distinguished is smaller than or equal to a preset threshold value, calculating the text similarity of the first name and the second name;
and determining a discrimination result according to the text similarity and the distance between the first cell to be discriminated and the second cell to be discriminated.
2. The method according to claim 1, wherein the first basic information includes a first city and a first city area where the first cell to be identified is located; the second basic information comprises a second city and a second urban area where the second cell to be distinguished is located;
the method for judging whether the first basic information and the second basic information are the same comprises the following steps:
comparing whether a first city in which the first cell to be distinguished is located is the same as a second city in which the second cell to be distinguished is located; if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell;
if so, comparing whether a first urban area in which the first cell to be distinguished is located is the same as a second urban area in which the second cell to be distinguished is located: if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell; and if so, executing the step of determining the distance between the first cell to be distinguished and the second cell to be distinguished according to the first longitude and latitude information and the second longitude and latitude information.
3. The method for identifying homonymous cells based on text similarity according to claim 2, wherein the first longitude and latitude information includes a first longitude and a first latitude of a first cell to be distinguished, and the second longitude and latitude information includes a second longitude and a second latitude of a second cell to be distinguished;
the distance between the first cell to be distinguished and the second cell to be distinguished is obtained by adopting the following formula:
[Formula FDA0002372366640000021, provided as an image in the original publication]
wherein d is the distance between the first cell to be discriminated and the second cell to be discriminated, R is the radius of the earth, φ1 and φ2 are the first latitude and the second latitude, respectively, and Δλ represents the difference between the first longitude and the second longitude.
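The formula of claim 3 is available only as an image, so its exact form cannot be confirmed here. Purely as an illustration, the sketch below uses the haversine form, one common way to compute a great-circle distance from the earth radius, two latitudes, and a longitude difference. The choice of haversine, the mean radius value, and the function name are all assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # assumed mean earth radius, in meters


def great_circle_distance_m(lon1, lat1, lon2, lat2, radius_m=EARTH_RADIUS_M):
    """Haversine distance in meters between two (longitude, latitude) points
    given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    h = (
        math.sin(delta_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * radius_m * math.asin(min(1.0, math.sqrt(h)))
```

The function takes longitude first to match the (longitude, latitude) order used for the coordinates in the description.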
4. The method for identifying homonymous cells based on text similarity according to claim 3, wherein the step of determining the distance between the first cell to be identified and the second cell to be identified is followed by further comprising:
and when the distance between the first cell to be distinguished and the second cell to be distinguished is larger than a preset threshold value, judging that the first cell to be distinguished and the second cell to be distinguished are not the same cell.
5. The method of claim 1, wherein the step of calculating the text similarity between the first name and the second name comprises:
preprocessing a text corresponding to the first name and a text corresponding to the second name to respectively obtain a first text and a second text;
after word segmentation processing is carried out on the first text and the second text, vectorizing is carried out on the first text and the second text respectively by utilizing a word vector model trained in advance, and a first word vector and a second word vector are obtained;
and calculating the cosine similarity of the first word vector and the second word vector as the text similarity of the first name and the second name.
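As an illustrative sketch of the vectorization and cosine similarity steps of claim 5, the example below replaces the pre-trained word vector model with a small hypothetical lookup table and forms one vector per text by averaging its token vectors. The table values, the averaging step, and the convention of returning 0 for an empty vector are assumptions, not details taken from the claims.

```python
import math

# Toy stand-in for a pre-trained word vector model (hypothetical values).
TOY_VECTORS = {
    "cuiyi": [1.0, 0.0, 0.0],
    "garden": [0.0, 1.0, 0.0],
    "lake": [0.0, 0.0, 1.0],
}


def text_vector(tokens, vectors=TOY_VECTORS, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty or unknown text
    return dot / (norm_u * norm_v)
```

With this table, two names that share no known token have orthogonal vectors, so their similarity is 0 and only the distance can decide the result.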
6. The method for identifying homonymous cells based on text similarity according to claim 5, wherein the step of preprocessing the text corresponding to the first name and the text corresponding to the second name to obtain the first text and the second text respectively comprises:
and removing invalid suffix and symbol characters in the text corresponding to the first name and the text corresponding to the second name, converting capital English characters into lowercase English characters, and converting numbers into Chinese characters.
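For illustration, the character-level part of the preprocessing in claim 6 (symbol removal, lowercasing, and number conversion) might look like the sketch below; invalid suffix removal is left out. The claim does not say how multi-digit numbers are converted, so the digit-by-digit mapping is an assumption, as are the definition of a symbol character and the function name.

```python
import re

# Digit-by-digit mapping to Chinese numerals (an assumed conversion rule).
DIGIT_TO_CHINESE = str.maketrans("0123456789", "零一二三四五六七八九")

# Here a "symbol character" is anything that is neither a letter, digit,
# CJK character, nor whitespace; underscores are removed as well.
SYMBOLS = re.compile(r"[^\w\s]|_")


def normalize_name(text: str) -> str:
    """Remove symbols, lowercase English letters, convert digits."""
    text = SYMBOLS.sub("", text)
    text = text.lower()
    return text.translate(DIGIT_TO_CHINESE)
```

Normalizing both names the same way means that variants differing only in case, punctuation, or numeral style map to the same text before vectorization.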
7. The method for identifying homonymous cells based on text similarity according to claim 1, wherein the step of determining the identification result according to the text similarity and the distance between the first cell to be identified and the second cell to be identified comprises:
when the text similarity is greater than or equal to 0.9, the distinguishing result is that the first cell to be distinguished and the second cell to be distinguished are the same cell;
when the text similarity is larger than or equal to 0.7, judging whether the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to 300 meters; if so, determining that the first cell to be distinguished and the second cell to be distinguished are the same cell; if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell;
when the text similarity is larger than or equal to 0.6, judging whether the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to 50 meters; if so, determining that the first cell to be distinguished and the second cell to be distinguished are the same cell; if not, the first cell to be distinguished and the second cell to be distinguished are not the same cell;
and when the distance between the first cell to be distinguished and the second cell to be distinguished is less than or equal to 10 meters, the distinguishing result is that the first cell to be distinguished and the second cell to be distinguished are the same cell.
8. A system for identifying a cell of a same name based on text similarity, the system comprising:
the name acquisition module is used for acquiring a first name of a first cell to be distinguished and a second name of a second cell to be distinguished;
an information obtaining module, configured to obtain first attribute information of the first cell to be identified and second attribute information of the second cell to be identified; the first attribute information comprises first basic information and first longitude and latitude information of a first cell to be distinguished; the second attribute information comprises second basic information and second longitude and latitude information of a second cell to be identified;
a distance determining module, configured to determine, when the first basic information and the second basic information are the same, a distance between a first cell to be identified and a second cell to be identified according to the first longitude and latitude information and the second longitude and latitude information;
the similarity calculation module is used for calculating the text similarity of the first name and the second name when the distance between the first cell to be distinguished and the second cell to be distinguished is smaller than or equal to a preset threshold value;
and the result determining module is used for determining a discrimination result according to the text similarity and the distance between the first cell to be discriminated and the second cell to be discriminated.
CN202010054556.4A 2020-01-17 2020-01-17 Method and system for identifying homonymous cells based on text similarity Withdrawn CN111291155A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010054556.4A CN111291155A (en) 2020-01-17 2020-01-17 Method and system for identifying homonymous cells based on text similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010054556.4A CN111291155A (en) 2020-01-17 2020-01-17 Method and system for identifying homonymous cells based on text similarity

Publications (1)

Publication Number Publication Date
CN111291155A true CN111291155A (en) 2020-06-16

Family

ID=71030808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010054556.4A Withdrawn CN111291155A (en) 2020-01-17 2020-01-17 Method and system for identifying homonymous cells based on text similarity

Country Status (1)

Country Link
CN (1) CN111291155A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200616