CN115858878A - Multi-dimensional matching method, device and equipment for hierarchical institution names, and storage medium


Publication number
CN115858878A
Authority
CN
China
Prior art keywords
name
hierarchical
matched
standard
names
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211305393.8A
Other languages
Chinese (zh)
Inventor
马明
李博
李静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
"chinese Medical Journal" Co ltd
Original Assignee
"chinese Medical Journal" Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by "chinese Medical Journal" Co ltd
Priority to CN202211305393.8A
Publication of CN115858878A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-dimensional matching method, device, equipment and storage medium for hierarchical institution names, and relates to the technical field of natural language processing. The method first obtains the search relevance, string similarity and region similarity between a hierarchical institution name to be matched and each standard hierarchical institution name. It then uses a linear weighting model to fuse the search relevance, string similarity, region similarity and other such dimensions into a comprehensive matching degree between the name to be matched and each standard hierarchical institution name. Finally, it takes the standard hierarchical institution name with the maximum comprehensive matching degree as the matching result for the name to be matched and outputs that result.

Description

Multi-dimensional matching method, device and equipment for hierarchical institution names, and storage medium
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a method, a device, equipment and a storage medium for multi-dimensional matching of hierarchical institution names.
Background
In text data analysis for natural language processing, institution names (for example, the names of medical institutions and medical research institutions) frequently need to be aligned. Academic papers published in medical journals carry the names of the institutions to which their authors belong, and such a name may be the institution's canonical name or a variant typed in freehand by an author. If all institution names must be normalized, or if third-party data about the institutions must be linked through these names (for further data aggregation and analysis), the problem to be solved is how to associate two or more names that refer to the same institution but differ slightly in wording.
Currently, there are three general approaches to this problem in the industry: text-search-based schemes, schemes based on machine learning or edit distance, and schemes based on manual alignment and calibration.
The text-search-based scheme builds a data set of institution names, searches that data set for the institution name to be matched by means of text search, and takes the top-ranked search result as the matching result. A significant disadvantage of this approach is its low accuracy. Mainstream search systems currently rely on TF-IDF (Term Frequency-Inverse Document Frequency) models such as the BM25 algorithm. The main principle is as follows: for term frequency, the more often a search word appears in a searched document, the higher that document's score for the search; for inverse document frequency, the fewer documents in the whole collection contain a successfully matched search word, the higher the score of the matched document. The TF-IDF model is clearly inadequate for institution name matching: (11) when an institution name to be matched is searched, the text data in the standard data set consists only of institution names, so the term frequency feature is ineffective; (12) inverse document frequency works well on longer text content, but its effect degrades on short phrase text such as institution names; (13) institution names generally have a hierarchical organizational structure (for example, the three levels of provincial, municipal and county people's hospitals), so how rarely a word appears in the standard data set is not a reliable indicator of its importance; for example, words with regional characteristics contained in institution names are likely to appear many times in the standard data set.
Schemes based on machine learning or edit distance are often used to assist text analysis. Applied to institution name matching, such a scheme builds a classification model from word vector features of the institution names combined with string alignment features such as edit distance. However, this scheme has the following drawbacks: (21) building the model is costly: a supervised learning model needs a large amount of labeled data, which generally has to be labeled manually, and hiring data scientists specialized in natural language processing to build the model and run inference is expensive; (22) the model's performance is unstable, repeated iteration is required, and overfitting occurs easily; (23) the model is hard to extend: once it is applied to another, similar problem, the whole model has to be rebuilt.
The drawback of the scheme based on manual alignment and calibration is plainly that it is time-consuming and labor-intensive and cannot be reused to solve similar problems.
Disclosure of Invention
The invention aims to provide a multi-dimensional matching method, device, computer equipment and computer-readable storage medium for hierarchical institution names, so as to solve the problems of existing schemes for matching hierarchical institution names: low accuracy, high model building cost, unstable model performance, difficulty in extending the model, and the time and labor consumed by manual work.
In order to achieve this purpose, the invention adopts the following technical scheme:
in a first aspect, a multi-dimensional matching method for hierarchical institution names is provided, which includes:
acquiring the search relevance between the hierarchical institution name to be matched and each standard hierarchical institution name in a standard hierarchical institution name set, wherein the search relevance takes values in the interval [0,1];
acquiring the string similarity between the hierarchical institution name to be matched and each standard hierarchical institution name, wherein the string similarity takes values in the interval [0,1];
performing word segmentation and then region entity recognition on the hierarchical institution name to be matched to obtain a region entity set, wherein the region entity set includes at least one normalized region entity noun;
and calculating the region similarity between the hierarchical institution name to be matched and each standard hierarchical institution name according to the following formula:
$$RS_n = \frac{\mathrm{Count}(ED \cap SD_n)}{\max\left(\mathrm{Count}(ED),\ \mathrm{Count}(SD_n)\right)}$$
where n is a positive integer, $RS_n$ represents the region similarity between the hierarchical institution name to be matched and the n-th standard hierarchical institution name in the standard hierarchical institution name set, Count() represents a function that counts the total number of elements in a set, ED represents the region entity set of the hierarchical institution name to be matched, $SD_n$ represents the region feature set of the n-th standard hierarchical institution name, which includes at least one normalized region entity noun, max() represents the maximum function, and $\cap$ represents set intersection;
and calculating the comprehensive matching degree between the hierarchical institution name to be matched and each standard hierarchical institution name according to the following formula:
$$P_n = h_{SS} \cdot SS_n + h_{ZS} \cdot ZS_n + h_{RS} \cdot RS_n$$
where $P_n$ represents the comprehensive matching degree between the hierarchical institution name to be matched and the n-th standard hierarchical institution name, $SS_n$ represents the search relevance between the name to be matched and the n-th standard hierarchical institution name, $ZS_n$ represents the string similarity between the name to be matched and the n-th standard hierarchical institution name, and $h_{SS}$, $h_{ZS}$ and $h_{RS}$ are first-class weight coefficients, each taking a value in the interval [0,1], with $h_{SS} + h_{ZS} + h_{RS} = 1$;
and taking the standard hierarchical institution name in the standard hierarchical institution name set that corresponds to the maximum comprehensive matching degree as the matching result for the hierarchical institution name to be matched, and outputting the matching result.
Based on the above, a multi-dimensional matching scheme for accurately matching hierarchical institution names is provided: the search relevance, string similarity and region similarity between the hierarchical institution name to be matched and each standard hierarchical institution name are obtained first; a linear weighting model then fuses the search relevance, string similarity, region similarity and other such dimensions into the comprehensive matching degree between the name to be matched and each standard hierarchical institution name; finally, the standard hierarchical institution name corresponding to the maximum comprehensive matching degree is taken as the matching result for the name to be matched and is output.
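The region similarity and linear fusion steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `region_similarity` and `composite_match`, the dictionary-based inputs and the sample weights are hypothetical, and the region similarity is read as the size of the intersection divided by the size of the larger set, following the definitions of Count(), ED, SD_n and max() in the text.

```python
from typing import Dict, Set, Tuple


def region_similarity(ed: Set[str], sd_n: Set[str]) -> float:
    """Overlap of region entities divided by the size of the larger set.

    Assumed reading: RS_n = Count(ED & SD_n) / max(Count(ED), Count(SD_n)).
    Returning 0.0 when both sets are empty is an assumption of this sketch.
    """
    denom = max(len(ed), len(sd_n))
    return len(ed & sd_n) / denom if denom else 0.0


def composite_match(
    ss: Dict[str, float],
    zs: Dict[str, float],
    rs: Dict[str, float],
    h_ss: float,
    h_zs: float,
    h_rs: float,
) -> Tuple[str, float]:
    """Return the standard name with the largest P_n, together with P_n.

    P_n = h_SS * SS_n + h_ZS * ZS_n + h_RS * RS_n, with weights in [0, 1]
    that sum to 1. Each dictionary maps a standard name to one dimension.
    """
    weights = (h_ss, h_zs, h_rs)
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("each weight must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    scores = {
        name: h_ss * ss[name] + h_zs * zs[name] + h_rs * rs[name]
        for name in ss
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

A caller would compute the three per-candidate dictionaries first and then pass them, with the chosen weights, to `composite_match`; ties are resolved by dictionary order, which the text does not specify.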
In one possible design, obtaining the search relevance between the hierarchical institution name to be matched and each standard hierarchical institution name in the standard hierarchical institution name set includes:
importing the standard hierarchical institution name set into an Elasticsearch search engine;
using the hierarchical institution name to be matched as input information, applying the Elasticsearch search engine to obtain the relevance score between the name to be matched and each standard hierarchical institution name in the standard hierarchical institution name set, wherein the relevance scores are obtained with the BM25 algorithm;
and normalizing the relevance scores between the name to be matched and the standard hierarchical institution names to obtain the search relevance between the name to be matched and each standard hierarchical institution name.
In one possible design, the BM25 algorithm uses the following formula:
Figure BDA0003905717780000041
where n is a positive integer, x represents the hierarchical institution name to be matched, $D_n$ represents the n-th standard hierarchical institution name in the standard hierarchical institution name set, $\mathrm{Score}_{BM25}(x, D_n)$ represents the relevance score between the name to be matched and the n-th standard hierarchical institution name, m is a positive integer, M represents the total number of words in the name to be matched, D represents the standard hierarchical institution name set, $T_m$ represents the m-th word in the name to be matched, the symbol shown in Figure BDA0003905717780000042 represents the number of occurrences of the m-th word in the standard hierarchical institution name set, and the symbol shown in Figure BDA0003905717780000043 represents the number of occurrences of the m-th word in the n-th standard hierarchical institution name.
In one possible design, normalizing the relevance scores between the hierarchical institution name to be matched and the standard hierarchical institution names to obtain the search relevance includes:
extracting from the standard hierarchical institution name set the K standard hierarchical institution names that rank in the top K by relevance score, to obtain a candidate set of standard hierarchical institution names that replaces the standard hierarchical institution name set, wherein K is a positive integer not less than 8;
and calculating the search relevance between the hierarchical institution name to be matched and each standard hierarchical institution name in the candidate set according to the following formula:
$$SS_k = \frac{Score_k - Score_{min}}{Score_{max} - Score_{min}}$$
where k is a positive integer, $SS_k$ represents the search relevance between the hierarchical institution name to be matched and the k-th standard hierarchical institution name in the candidate set, $Score_k$ represents the relevance score between the name to be matched and the k-th standard hierarchical institution name, $Score_{min}$ represents the minimum relevance score between the name to be matched and the names in the candidate set, and $Score_{max}$ represents the maximum relevance score between the name to be matched and the names in the candidate set.
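The top-K selection and normalization described above can be sketched as follows. This is a minimal Python sketch that assumes plain min-max scaling over the candidate set; the function name `top_k_search_relevance`, the dictionary input and the handling of the case where all scores are equal are illustrative assumptions that the text does not specify.

```python
from typing import Dict


def top_k_search_relevance(scores: Dict[str, float], k: int = 8) -> Dict[str, float]:
    """Keep the K highest relevance scores and min-max scale them to [0, 1].

    `scores` maps each standard name to its raw relevance score. The text
    requires K to be at least 8; if fewer than K names exist, all are kept.
    """
    if k < 8:
        raise ValueError("K must be a positive integer not less than 8")
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    if not top:
        return {}
    values = [value for _, value in top]
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Degenerate case not covered by the text; assumption: treat every
        # candidate as equally and fully relevant.
        return {name: 1.0 for name, _ in top}
    return {name: (value - lowest) / (highest - lowest) for name, value in top}
```

Restricting the scaling to the candidate set means the lowest-ranked candidate always receives 0 and the highest 1, so the search relevance is relative to the shortlist rather than to the whole standard set.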
In one possible design, obtaining the string similarity between the hierarchical institution name to be matched and each standard hierarchical institution name includes:
acquiring the edit distance similarity between the hierarchical institution name to be matched and each standard hierarchical institution name, wherein the edit distance similarity takes values in the interval [0,1];
acquiring the Jaro-Winkler (J-W) distance similarity between the hierarchical institution name to be matched and each standard hierarchical institution name, wherein the J-W distance similarity takes values in the interval [0,1];
calculating the Jaccard similarity between the hierarchical institution name to be matched and each standard hierarchical institution name according to the following formula:
$$ZS_{n,jc} = \frac{\mathrm{Count}(Tx \cap TD_n)}{\mathrm{Count}(Tx \cup TD_n)}$$
where n is a positive integer, $ZS_{n,jc}$ represents the Jaccard similarity between the hierarchical institution name to be matched and the n-th standard hierarchical institution name in the standard hierarchical institution name set, $Tx$ represents the word set of the name to be matched, and $TD_n$ represents the word set of the n-th standard hierarchical institution name;
and calculating the longest-common-substring similarity between the hierarchical institution name to be matched and each standard hierarchical institution name according to the following formula:
Figure BDA0003905717780000052
where $ZS_{n,lcs}$ represents the longest-common-substring similarity between the hierarchical institution name to be matched and the n-th standard hierarchical institution name in the standard hierarchical institution name set, x represents the name to be matched, $D_n$ represents the n-th standard hierarchical institution name, and $\mathrm{LCS}(x, D_n)$ represents the length of the longest common substring of the name to be matched and the n-th standard hierarchical institution name;
and calculating the string similarity between the hierarchical institution name to be matched and each standard hierarchical institution name according to the following formula:
$$ZS_n = k_{bd} \cdot ZS_{n,bd} + k_{jw} \cdot ZS_{n,jw} + k_{jc} \cdot ZS_{n,jc} + k_{lcs} \cdot ZS_{n,lcs}$$
where $ZS_n$ represents the string similarity between the hierarchical institution name to be matched and the n-th standard hierarchical institution name, $ZS_{n,bd}$ represents the edit distance similarity between the name to be matched and the n-th standard hierarchical institution name, $ZS_{n,jw}$ represents the J-W distance similarity between the name to be matched and the n-th standard hierarchical institution name, and $k_{bd}$, $k_{jw}$, $k_{jc}$ and $k_{lcs}$ are second-class weight coefficients, each taking a value in the interval [0,1], with $k_{bd} + k_{jw} + k_{jc} + k_{lcs} = 1$.
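The weighted combination of string similarities can be illustrated with the Python sketch below. Several choices here are assumptions rather than statements of the patent: the edit distance is turned into a similarity as 1 minus the distance divided by the longer length, the longest-common-substring length is divided by the longer length, the J-W similarity is supplied by the caller (`zs_jw`) instead of being computed, and equal weights of 0.25 are used purely as an example. All function names are hypothetical.

```python
from typing import Set, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


def jaccard_similarity(tx: Set[str], td_n: Set[str]) -> float:
    """Size of the intersection of the word sets over the size of their union."""
    union = tx | td_n
    return len(tx & td_n) / len(union) if union else 1.0


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, curr[-1])
        prev = curr
    return best


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalisation: LCS length / length of the longer string."""
    longest = max(len(a), len(b))
    return longest_common_substring(a, b) / longest if longest else 1.0


def string_similarity(
    x: str,
    d_n: str,
    tx: Set[str],
    td_n: Set[str],
    zs_jw: float,
    weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """ZS_n = k_bd*ZS_bd + k_jw*ZS_jw + k_jc*ZS_jc + k_lcs*ZS_lcs."""
    k_bd, k_jw, k_jc, k_lcs = weights
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("second-class weights must sum to 1")
    return (
        k_bd * edit_similarity(x, d_n)
        + k_jw * zs_jw
        + k_jc * jaccard_similarity(tx, td_n)
        + k_lcs * lcs_similarity(x, d_n)
    )
```

The word sets `tx` and `td_n` are expected to come from whatever word segmentation step precedes matching; the sketch does not perform segmentation itself.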
In one possible design, after calculating the region similarity between the hierarchical institution name to be matched and each standard hierarchical institution name, the method further includes:
if the region similarity between the name to be matched and the n-th standard hierarchical institution name is zero, checking, according to a region relationship database, whether each region entity noun in the region entity set ED has a regional affiliation relationship with each region entity noun in the region feature set $SD_n$;
if a first region entity noun in the region entity set ED and a second region entity noun in the region feature set $SD_n$ have a regional affiliation relationship, further determining whether the first region entity noun is the administrative center of the second region entity noun or the second region entity noun is the administrative center of the first region entity noun;
if so, updating the region similarity between the name to be matched and the n-th standard hierarchical institution name to a preset first value; otherwise, updating it to a preset second value, wherein the first value lies in the interval [0,1], and the second value also lies in the interval [0,1] but is smaller than the first value.
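The fallback rule above can be sketched in Python as follows. Everything concrete in this sketch is an assumption made for illustration: the region relationship database is modelled as two small dictionaries (`PARENT` and `ADMIN_CENTER`), the preset values 0.8 and 0.5 are arbitrary examples that merely satisfy "second value smaller than first value", and when several pairs are affiliated the sketch prefers the first value if any pair involves an administrative center, which the text does not spell out.

```python
from typing import Dict, Set

# Hypothetical stand-in for the region relationship database.
PARENT: Dict[str, str] = {"Hefei": "Anhui", "Ma'anshan": "Anhui"}  # region -> parent region
ADMIN_CENTER: Dict[str, str] = {"Anhui": "Hefei"}  # region -> its administrative center


def affiliated(a: str, b: str, parent: Dict[str, str]) -> bool:
    """True if one region is recorded as directly belonging to the other."""
    return parent.get(a) == b or parent.get(b) == a


def adjust_region_similarity(
    rs_n: float,
    ed: Set[str],
    sd_n: Set[str],
    parent: Dict[str, str],
    admin_center: Dict[str, str],
    first_value: float = 0.8,
    second_value: float = 0.5,
) -> float:
    """Apply the fallback only when the computed region similarity is zero."""
    if rs_n != 0.0:
        return rs_n
    found_affiliation = False
    for a in ed:
        for b in sd_n:
            if not affiliated(a, b, parent):
                continue
            found_affiliation = True
            # One noun is the administrative center of the other.
            if admin_center.get(b) == a or admin_center.get(a) == b:
                return first_value
    return second_value if found_affiliation else rs_n
```

With these sample tables, a name mentioning "Hefei" compared against a standard name tagged "Anhui" receives the first value, while "Ma'anshan" against "Anhui" receives the second value.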
In one possible design, the first class of weighting factors are determined in advance using the following formula:
Figure BDA0003905717780000061
where $RP_{SS}$ represents the subset of a set of already-matched hierarchical institution names whose members can be correctly matched to a standard hierarchical institution name based solely on the maximum search relevance, $RP_{ZS}$ represents the subset of that set whose members can be correctly matched to a standard hierarchical institution name based solely on the maximum string similarity, and $RP_{RS}$ represents the subset of that set whose members can be correctly matched to a standard hierarchical institution name based solely on the maximum region similarity.
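One way to derive weights from these three subsets is sketched below. This is only an assumed reading, not confirmed by the text, which defines $RP_{SS}$, $RP_{ZS}$ and $RP_{RS}$ but not how they enter the formula: each weight is taken as that dimension's share of the correct single-dimension matches, which automatically satisfies the constraint that the three weights lie in [0,1] and sum to 1. The function name `first_class_weights` is hypothetical.

```python
from typing import Tuple


def first_class_weights(n_ss: int, n_zs: int, n_rs: int) -> Tuple[float, float, float]:
    """Return (h_SS, h_ZS, h_RS) proportional to the subset sizes.

    n_ss, n_zs and n_rs are Count(RP_SS), Count(RP_ZS) and Count(RP_RS):
    how many labelled names each dimension matches correctly on its own.
    Proportional weighting is an assumption of this sketch.
    """
    if min(n_ss, n_zs, n_rs) < 0:
        raise ValueError("counts must be non-negative")
    total = n_ss + n_zs + n_rs
    if total == 0:
        raise ValueError("at least one dimension must match something")
    return n_ss / total, n_zs / total, n_rs / total
```

For example, if search relevance alone resolves 50 labelled names, string similarity 30 and region similarity 20, this reading gives weights of 0.5, 0.3 and 0.2.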
In a second aspect, a multi-dimensional matching device for hierarchical institution names is provided, which comprises a search relevance obtaining module, a string similarity obtaining module, a region entity obtaining module, a region similarity calculating module, a comprehensive matching degree calculating module and a matching result determining module;
the search relevance obtaining module is used for obtaining the search relevance between the hierarchical institution name to be matched and each standard hierarchical institution name in the standard hierarchical institution name set, wherein the search relevance takes values in the interval [0,1];
the string similarity obtaining module is used for obtaining the string similarity between the hierarchical institution name to be matched and each standard hierarchical institution name, wherein the string similarity takes values in the interval [0,1];
the region entity obtaining module is used for performing word segmentation and then region entity recognition on the hierarchical institution name to be matched to obtain a region entity set, wherein the region entity set includes at least one normalized region entity noun;
the region similarity calculating module is in communication connection with the region entity obtaining module, and is used for calculating the region similarity between the hierarchical institution name to be matched and each standard hierarchical institution name according to the following formula:
$$RS_n = \frac{\mathrm{Count}(ED \cap SD_n)}{\max\left(\mathrm{Count}(ED),\ \mathrm{Count}(SD_n)\right)}$$
where n is a positive integer, $RS_n$ represents the region similarity between the hierarchical institution name to be matched and the n-th standard hierarchical institution name in the standard hierarchical institution name set, Count() represents a function that counts the total number of elements in a set, ED represents the region entity set of the hierarchical institution name to be matched, $SD_n$ represents the region feature set of the n-th standard hierarchical institution name, which includes at least one normalized region entity noun, max() represents the maximum function, and $\cap$ represents set intersection;
the comprehensive matching degree calculating module is in communication connection with the search relevance obtaining module, the string similarity obtaining module and the region similarity calculating module respectively, and is used for calculating the comprehensive matching degree between the hierarchical institution name to be matched and each standard hierarchical institution name according to the following formula:
$$P_n = h_{SS} \cdot SS_n + h_{ZS} \cdot ZS_n + h_{RS} \cdot RS_n$$
where $P_n$ represents the comprehensive matching degree between the hierarchical institution name to be matched and the n-th standard hierarchical institution name, $SS_n$ represents the search relevance between the name to be matched and the n-th standard hierarchical institution name, $ZS_n$ represents the string similarity between the name to be matched and the n-th standard hierarchical institution name, and $h_{SS}$, $h_{ZS}$ and $h_{RS}$ are first-class weight coefficients, each taking a value in the interval [0,1], with $h_{SS} + h_{ZS} + h_{RS} = 1$;
and the matching result determining module is in communication connection with the comprehensive matching degree calculating module, and is used for taking the standard hierarchical institution name in the standard hierarchical institution name set that corresponds to the maximum comprehensive matching degree as the matching result for the hierarchical institution name to be matched and outputting the matching result.
In a third aspect, the present invention provides a computer device, including a memory, a processor and a transceiver that are communicatively connected in sequence, wherein the memory is used for storing a computer program, the transceiver is used for sending and receiving messages, and the processor is used for reading the computer program and executing the multi-dimensional matching method for hierarchical institution names described in the first aspect or any possible design of the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium storing instructions which, when run on a computer, perform the multi-dimensional matching method for hierarchical institution names described in the first aspect or any possible design of the first aspect.
In a fifth aspect, the present invention provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the multi-dimensional matching method for hierarchical institution names described in the first aspect or any possible design of the first aspect.
The beneficial effects of the above scheme are as follows:
(1) The invention provides a multi-dimensional matching scheme for accurately matching hierarchical institution names: the search relevance, string similarity and region similarity between the hierarchical institution name to be matched and each standard hierarchical institution name are obtained first; a linear weighting model then fuses the search relevance, string similarity, region similarity and other such dimensions into the comprehensive matching degree between the name to be matched and each standard hierarchical institution name; finally, the standard hierarchical institution name corresponding to the maximum comprehensive matching degree is taken as the matching result for the name to be matched and is output;
(2) The weight coefficients can be determined at a fine granularity, which enables accurate multi-dimensional fusion and further improves matching accuracy.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a multi-dimensional matching method for hierarchical institution names according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a multi-dimensional matching device for hierarchical institution names according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the present invention is briefly described below with reference to the accompanying drawings and the description of the embodiments or the prior art. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort. It should be noted that the description of the embodiments is provided to aid understanding of the present invention and does not limit it.
It will be understood that, although the terms first, second, etc. may be used herein to describe various objects, these objects should not be limited by these terms. These terms are only used to distinguish one object from another. For example, a first object may be referred to as a second object, and a second object may similarly be referred to as a first object, without departing from the scope of example embodiments of the invention.
It should be understood that the term "and/or", where it appears herein, merely describes an association between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, B exists alone, or A and B exist at the same time. As another example, A, B and/or C may indicate the presence of any one of A, B and C or any combination of them. The term "/and", where it appears herein, describes another association between objects and means that two relationships may exist; for example, A /and B may mean: A exists alone, or A and B exist at the same time. In addition, the character "/", where it appears herein, generally means that the objects before and after it are in an "or" relationship.
Embodiment:
As shown in Fig. 1, the multi-dimensional matching method for hierarchical institution names provided in the first aspect of this embodiment may be, but is not limited to being, executed by a computer device with certain computing resources, for example an electronic device such as a platform server, a personal computer (PC, meaning a multipurpose computer whose size, price and performance suit personal use; desktop computers, notebook computers, small notebook computers, tablet computers and ultrabooks are all personal computers), a smartphone, a personal digital assistant (PDA) or a wearable device. As shown in Fig. 1, the method may include, but is not limited to, the following steps S1 to S6.
S1, obtaining the search relevance between the hierarchical institution name to be matched and each standard hierarchical institution name in a standard hierarchical institution name set, wherein the search relevance takes values in the interval [0,1].
In step S1, the hierarchical institution name to be matched is institution name data that needs to be standardized, has the characteristics of a hierarchical organizational structure and is presented as a character string; it may be, but is not limited to, the name of a specific medical institution or medical research institution. The standard hierarchical institution name set is a set of multiple standard hierarchical institution names, each of which is standardized institution name data that has the characteristics of a hierarchical organizational structure and is presented as a character string. The set may also include alternative names of an institution and information about the region where the institution is located (that is, province, city and county information), for example "Ma'anshan Infectious Disease Hospital" and "Ma'anshan Fourth People's Hospital". The search relevance is the measure by which results are obtained and ranked when the name to be matched is searched for in the standard hierarchical institution name set by text search. Preferably, obtaining the search relevance between the name to be matched and each standard hierarchical institution name in the set includes, but is not limited to, the following steps S11 to S13.
S11, importing the standard hierarchical institution name set into an Elasticsearch search engine.
In step S11, the standard hierarchical institution name set serves as the base data for search. Elasticsearch is an existing Lucene-based search engine with distributed multi-user capability, so the standard hierarchical institution name set can be imported into it in a conventional manner.
S12, using the hierarchical institution name to be matched as input information, applying the Elasticsearch search engine to return the relevance score, obtained with the BM25 algorithm, between the name to be matched and each standard hierarchical institution name in the standard hierarchical institution name set.
In step S12, the Elasticsearch search engine searches the standard hierarchical institution name set for the name to be matched by means of text search. If a standard hierarchical institution name has an alias, the alias must also be searched and matched, yielding a BM25-based relevance score between the name to be matched and the alias. The BM25 algorithm is an existing algorithm for evaluating the relevance between search terms and documents. Because an institution name is a noun phrase or proper name, search and matching operate on the bag of words obtained after word segmentation, which differs from the usual case of matching search terms against long documents; the original BM25 formula is therefore adapted to the phrase-matching characteristics of hierarchical institution names. The original formula of the BM25 algorithm is as follows:
Figure BDA0003905717780000101
The notation in the formula is common knowledge; for example, docCount represents the number of documents in the index, which in this embodiment equals the total number of elements in the standard hierarchical organization name set. In the original formula, the factor $k_1$ can be set to 0 to disable the term-frequency term. $IDF(T_i)$ is the inverse document frequency term: the less frequently a word appears in the document set, the more the score increases when that word is matched. Accordingly, the IDF formula is adjusted as follows:
Figure BDA0003905717780000111
so that, for the same docCount and the same quantity shown in Figure BDA0003905717780000112, the adjusted IDF is larger than the original $IDF(T_i)$ and increases faster with the quantity shown in Figure BDA0003905717780000113. That is, the BM25 algorithm preferably adopts the following formula:
Figure BDA0003905717780000114
In the formula, $n$ represents a positive integer, $x$ represents the to-be-matched hierarchical organization name, $D_n$ represents the $n$th standard hierarchical organization name in the standard hierarchical organization name set, $Score_{BM25}(x, D_n)$ represents the relevance score between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name, $m$ represents a positive integer, $M$ represents the total number of words in the to-be-matched hierarchical organization name, $D$ represents the standard hierarchical organization name set, $T_m$ represents the $m$th word in the to-be-matched hierarchical organization name, the symbol shown in Figure BDA0003905717780000115 represents the number of occurrences of the $m$th word in the standard hierarchical organization name set, and the symbol shown in Figure BDA0003905717780000116 represents the number of occurrences of the $m$th word in the $n$th standard hierarchical organization name.
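The effect of setting $k_1$ to 0 can be illustrated with a minimal sketch. It uses the original Lucene-style IDF, not the adjusted IDF of this embodiment (whose formula is given only as a figure), and the function names and the toy name set are hypothetical. With $k_1 = 0$ the term-frequency factor equals 1 for every matched word, so the score reduces to the sum of the IDF values of the matched words.

```python
import math


def idf(doc_count: int, doc_freq: int) -> float:
    """Original Lucene-style BM25 IDF (assumption: NOT the adjusted IDF of the embodiment)."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score_k1_zero(query_words, doc_words, corpus):
    """BM25 score with k_1 = 0: each matched query word contributes only its IDF."""
    doc_count = len(corpus)
    doc_set = set(doc_words)
    score = 0.0
    # Assumption: the query is treated as a bag of unique words.
    for word in dict.fromkeys(query_words):
        if word in doc_set:
            # Number of names in the corpus that contain this word.
            doc_freq = sum(1 for words in corpus if word in words)
            score += idf(doc_count, doc_freq)
    return score


# Hypothetical, already segmented standard names.
corpus = [
    ["maanshan", "infectious", "disease", "hospital"],
    ["maanshan", "people", "hospital"],
    ["hefei", "people", "hospital"],
]
query = ["maanshan", "infectious"]
scores = [bm25_score_k1_zero(query, doc, corpus) for doc in corpus]
```

In this toy set the rarer word "infectious" contributes more than the more common word "maanshan", which is the behaviour the IDF term is meant to provide.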
S13. Normalizing the relevance scores between the to-be-matched hierarchical organization name and each standard hierarchical organization name to obtain the search relevance between the to-be-matched hierarchical organization name and each standard hierarchical organization name.
In step S13, the relevance score obtained by the BM25 algorithm is not normalized, that is, it has no definite upper bound, so it must be normalized for the search relevance to take values in the interval [0, 1]. Meanwhile, to save processing time and reduce computational complexity, the normalization preferably includes, but is not limited to, the following steps S131 to S132.
S131. Extracting, from the standard hierarchical organization name set, the K standard hierarchical organization names with the K highest relevance scores to obtain a standard hierarchical organization name candidate set that replaces the standard hierarchical organization name set, wherein K represents a positive integer not less than 8.
In step S131, suppose the standard hierarchical organization name set contains 20000 standard hierarchical organization names and K is 10. If the subsequent calculation were performed on the whole set, 20000 calculations would be needed; performed on the candidate set, only 10 are needed, which greatly reduces the computational complexity and helps obtain the matching result quickly.
S132, calculating the search correlation degree between the name of the to-be-matched hierarchical mechanism and each standard hierarchical mechanism name in the standard hierarchical mechanism name candidate set according to the following formula:
$$SS_k = \frac{Score_k - Score_{min}}{Score_{max} - Score_{min}}$$
wherein $k$ represents a positive integer, $SS_k$ represents the search relevance between the to-be-matched hierarchical organization name and the $k$th standard hierarchical organization name in the standard hierarchical organization name candidate set, $Score_k$ represents the relevance score between the to-be-matched hierarchical organization name and the $k$th standard hierarchical organization name, $Score_{min}$ represents the minimum relevance score between the to-be-matched hierarchical organization name and the names in the candidate set, and $Score_{max}$ represents the maximum relevance score between the to-be-matched hierarchical organization name and the names in the candidate set.
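Steps S131 and S132 can be sketched as follows. This is a minimal illustration that assumes min-max normalization over the candidate set, consistent with the $Score_{min}$ and $Score_{max}$ symbols defined above; the function name is hypothetical, and returning 1.0 when all candidate scores are equal is an assumption, since the text does not address that case.

```python
def top_k_search_relevance(scores: dict, k: int = 10) -> dict:
    """Keep the K best-scoring names, then min-max normalize their scores to [0, 1]."""
    # S131: candidate set = the K names with the highest relevance scores.
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    if not top:
        return {}
    low = min(score for _, score in top)
    high = max(score for _, score in top)
    if high == low:
        # Assumption: identical scores all map to 1.0 (not specified in the text).
        return {name: 1.0 for name, _ in top}
    # S132: min-max normalization within the candidate set.
    return {name: (score - low) / (high - low) for name, score in top}


relevance = top_k_search_relevance({"a": 12.0, "b": 6.0, "c": 9.0, "d": 1.0}, k=3)
```

Because the normalization runs over the candidate set only, the lowest-ranked candidate always receives a search relevance of 0, whatever its raw score.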
S2. Obtaining the string similarity between the to-be-matched hierarchical organization name and each standard hierarchical organization name, wherein the string similarity takes values in the interval [0, 1].
In step S2, the string similarity reflects how similar the to-be-matched hierarchical organization name and each standard hierarchical organization name are as whole character strings. Specifically, step S2 includes, but is not limited to, the following steps S21 to S25.
S21. Obtaining the edit distance similarity between the to-be-matched hierarchical organization name and each standard hierarchical organization name, wherein the edit distance similarity takes values in the interval [0, 1].
In step S21, the edit distance may be, but is not limited to, the Levenshtein distance, so the edit distance similarity between the to-be-matched hierarchical organization name and each standard hierarchical organization name can be obtained in a conventional manner.
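One conventional way to turn the Levenshtein distance into a similarity in [0, 1] is sketched below. The normalization by the longer string length is an assumption, since the text only refers to a conventional manner, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance using two rows."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + (char_a != char_b),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def edit_distance_similarity(a: str, b: str) -> float:
    """Assumption: similarity = 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```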
S22. Obtaining the J-W distance similarity between the to-be-matched hierarchical organization name and each standard hierarchical organization name, wherein the J-W distance similarity takes values in the interval [0, 1].
In step S22, the J-W distance is the Jaro-Winkler distance (a string metric for measuring the edit distance between two character sequences, and a variant of the Jaro distance metric proposed by William E. Winkler in 1990), so the J-W distance similarity between the to-be-matched hierarchical organization name and each standard hierarchical organization name can be obtained in a conventional manner.
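For reference, a plain-Python sketch of the standard Jaro-Winkler similarity is given below. The prefix scaling factor 0.1 and the maximum prefix length of 4 are the commonly used defaults and are assumptions here, since the text does not state them; the function names are hypothetical.

```python
def jaro(a: str, b: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    len_a, len_b = len(a), len(b)
    if len_a == 0 or len_b == 0:
        return 0.0
    window = max(max(len_a, len_b) // 2 - 1, 0)
    a_hit = [False] * len_a
    b_hit = [False] * len_b
    matches = 0
    for i, char in enumerate(a):
        # Look for an unused equal character within the matching window.
        for j in range(max(0, i - window), min(len_b, i + window + 1)):
            if not b_hit[j] and b[j] == char:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [a[i] for i in range(len_a) if a_hit[i]]
    b_seq = [b[j] for j in range(len_b) if b_hit[j]]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len_a + matches / len_b
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (at most 4)."""
    similarity = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return similarity + prefix * scaling * (1.0 - similarity)
```

The prefix boost suits organization names, which often share a leading region or parent-institution segment.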
S23. Calculating the Jaccard similarity between the to-be-matched hierarchical organization name and each standard hierarchical organization name according to the following formula:
$$ZS_{n,jc} = \frac{|Tx \cap TD_n|}{|Tx \cup TD_n|}$$
in which $n$ represents a positive integer, $ZS_{n,jc}$ represents the Jaccard similarity between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name in the standard hierarchical organization name set, $Tx$ represents the word set of the to-be-matched hierarchical organization name, and $TD_n$ represents the word set of the $n$th standard hierarchical organization name.
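The Jaccard similarity over word sets can be sketched in a few lines. The function name is hypothetical, the inputs are assumed to be already segmented into words, and returning 0.0 for two empty word sets is an assumption.

```python
def jaccard_similarity(words_x, words_d) -> float:
    """Size of the intersection of the two word sets divided by the size of their union."""
    set_x, set_d = set(words_x), set(words_d)
    union = set_x | set_d
    # Assumption: two empty word sets are given similarity 0.0.
    return len(set_x & set_d) / len(union) if union else 0.0


example = jaccard_similarity(
    ["chongqing", "medical", "university"],
    ["chongqing", "medical", "hospital"],
)
```

Here two of the four distinct words are shared, so the similarity is 0.5.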
S24. Calculating the longest common substring similarity between the to-be-matched hierarchical organization name and each standard hierarchical organization name according to the following formula:
Figure BDA0003905717780000132
in the formula, $ZS_{n,lcs}$ represents the longest common substring similarity between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name in the standard hierarchical organization name set, $x$ represents the to-be-matched hierarchical organization name, $D_n$ represents the $n$th standard hierarchical organization name, and $LCS(x, D_n)$ represents the length of the longest common substring of the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name.
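The quantity $LCS(x, D_n)$ can be computed with the usual dynamic-programming approach, sketched below. How the length is normalized into a similarity is given only in the figure; dividing by the length of the longer string is purely an assumption made so that the result lies in [0, 1]. The function names are hypothetical.

```python
def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for char_a in a:
        curr = [0] * (len(b) + 1)
        for j, char_b in enumerate(b, 1):
            if char_a == char_b:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def lcs_similarity(a: str, b: str) -> float:
    """ASSUMPTION: normalize by the longer length; the source formula is not recoverable."""
    longest = max(len(a), len(b))
    return longest_common_substring_length(a, b) / longest if longest else 0.0
```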
S25. Calculating the string similarity between the to-be-matched hierarchical organization name and each standard hierarchical organization name according to the following formula:
$$ZS_n = k_{bd} \cdot ZS_{n,bd} + k_{jw} \cdot ZS_{n,jw} + k_{jc} \cdot ZS_{n,jc} + k_{lcs} \cdot ZS_{n,lcs}$$
in the formula, $ZS_n$ represents the string similarity between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name, $ZS_{n,bd}$ represents the edit distance similarity between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name, $ZS_{n,jw}$ represents the J-W distance similarity between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name, and $k_{bd}$, $k_{jw}$, $k_{jc}$ and $k_{lcs}$ are second-class weight coefficients, each taking a value in the interval [0, 1], with $k_{bd} + k_{jw} + k_{jc} + k_{lcs} = 1$.
In step S25, the second-class weight coefficients $k_{bd}$, $k_{jw}$, $k_{jc}$ and $k_{lcs}$ may each be, for example, 0.25.
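The weighted combination of step S25 can be sketched as follows, using the example weights of 0.25 each. The function name and the check that the weights sum to 1 are illustrative additions.

```python
def string_similarity(zs_bd: float, zs_jw: float, zs_jc: float, zs_lcs: float,
                      weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Linear combination of the four string similarities with second-class weights."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("second-class weight coefficients must sum to 1")
    k_bd, k_jw, k_jc, k_lcs = weights
    return k_bd * zs_bd + k_jw * zs_jw + k_jc * zs_jc + k_lcs * zs_lcs


combined = string_similarity(1.0, 0.8, 0.5, 0.5)
```

Since each component lies in [0, 1] and the weights sum to 1, the combined string similarity also lies in [0, 1].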
S3. Performing word segmentation and then regional entity recognition on the to-be-matched hierarchical organization name to obtain a regional entity set, wherein the regional entity set includes at least one normalized regional entity noun.
In step S3, both the word segmentation and the regional entity recognition may be implemented in an existing manner. The word segmentation may specifically use the jieba word segmentation plug-in, optimized with a custom dictionary and synonyms; for example, the word "hospital" appears in most medical institution names and may be treated as a stop word to avoid interference from such uninformative words. The regional entity recognition may specifically use the existing Paddleadd-python entity recognition tool. In addition, the recognized regional entity nouns need to be normalized and can be divided into the three levels province, city and county; for example, "Guangdong" needs to be normalized to "Guangdong Province".
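The normalization and stop-word handling described in step S3 can be illustrated with a dictionary-based sketch. It assumes the name has already been segmented (for example by jieba) and replaces the entity recognition tool with a simple lookup; the alias table, the stop-word list and the function names are all hypothetical.

```python
# Hypothetical alias table mapping surface forms to normalized regional entity nouns.
REGION_ALIASES = {
    "Guangdong": "Guangdong Province",
    "Guangdong Province": "Guangdong Province",
    "Guangzhou": "Guangzhou City",
    "Guangzhou City": "Guangzhou City",
}
# Hypothetical stop-word list; "hospital" carries little information in this domain.
STOP_WORDS = {"hospital"}


def regional_entities(tokens) -> set:
    """Return the normalized regional entity nouns found among the tokens."""
    return {REGION_ALIASES[token] for token in tokens if token in REGION_ALIASES}


def content_words(tokens) -> list:
    """Drop stop words so they do not dominate the word-level similarities."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = ["Guangdong", "second", "people", "hospital"]
entities = regional_entities(tokens)
words = content_words(tokens)
```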
S4. Calculating the region similarity between the to-be-matched hierarchical organization name and each standard hierarchical organization name according to the following formula:
$$RS_n = \frac{count(ED \cap SD_n)}{max(count(ED), count(SD_n))}$$
wherein $n$ represents a positive integer, $RS_n$ represents the region similarity between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name in the standard hierarchical organization name set, $count()$ represents a function that counts the total number of elements in a set, $ED$ represents the regional entity set of the to-be-matched hierarchical organization name, $SD_n$ represents the regional feature set of the $n$th standard hierarchical organization name, which includes at least one normalized regional entity noun, $max()$ represents a function that returns the maximum value, and $\cap$ represents the intersection symbol.
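A minimal sketch of the region similarity follows. The formula itself appears only as a figure in the source; the sketch assumes it is the number of shared regional entity nouns divided by the larger of the two set sizes, which is the form suggested by the count(), max() and intersection symbols defined above. The function name is hypothetical.

```python
def region_similarity(ed: set, sd_n: set) -> float:
    """ASSUMED form: count(ED & SD_n) / max(count(ED), count(SD_n))."""
    larger = max(len(ed), len(sd_n))
    # Assumption: two empty sets give similarity 0.0.
    return len(ed & sd_n) / larger if larger else 0.0


rs = region_similarity(
    {"Guangdong Province", "Guangzhou City"},
    {"Guangdong Province"},
)
```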
In step S4, the regional entity nouns in the regional feature set may be obtained in advance by referring to step S3. If $RS_n$ equals zero, the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name have no region in common on the surface, but a latent regional affiliation may still exist. Preferably, after the region similarity between the to-be-matched hierarchical organization name and each standard hierarchical organization name is calculated, the method further includes, but is not limited to, the following steps S411 to S413.
S411. If the region similarity between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name is zero, checking, according to a regional relation database, whether any regional entity noun in the regional entity set $ED$ and any regional entity noun in the regional feature set $SD_n$ have a regional affiliation.
In step S411, the regional relation database records predetermined regional affiliations between province and city, city and county, and so on, for example that Guangzhou City belongs to Guangdong Province and is its provincial capital. The regional relation database may specifically be maintained with the ltree extension of PostgreSQL, which is then used to check whether a regional entity noun in the regional entity set $ED$ and a regional entity noun in the regional feature set $SD_n$ have a regional affiliation.
S412. If a first regional entity noun in the regional entity set $ED$ and a second regional entity noun in the regional feature set $SD_n$ have a regional affiliation, further determining whether the first regional entity noun is the administrative center of the second regional entity noun, or the second regional entity noun is the administrative center of the first regional entity noun.
S413. If so, updating the region similarity between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name to a preset first value; otherwise, updating it to a preset second value, wherein the first value takes a value in the interval [0, 1], and the second value also takes a value in the interval [0, 1] but is smaller than the first value.
In step S413, the first value may be, for example, 0.3, and the second value may be, for example, 0.15.
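Steps S411 to S413 can be sketched with in-memory lookup tables standing in for the regional relation database. The tables, the function names, the restriction to direct parent-child affiliation, and the choice of the highest value when several pairs are affiliated are all assumptions made for illustration.

```python
# Hypothetical stand-ins for the regional relation database.
PARENT = {
    "Guangzhou City": "Guangdong Province",
    "Shenzhen City": "Guangdong Province",
}
ADMIN_CENTER = {"Guangdong Province": "Guangzhou City"}


def is_affiliated(a: str, b: str) -> bool:
    """True if one region directly belongs to the other (assumption: direct parent only)."""
    return PARENT.get(a) == b or PARENT.get(b) == a


def is_admin_center(a: str, b: str) -> bool:
    """True if one region is the administrative center of the other."""
    return ADMIN_CENTER.get(a) == b or ADMIN_CENTER.get(b) == a


def adjust_region_similarity(rs: float, ed: set, sd_n: set,
                             first_value: float = 0.3,
                             second_value: float = 0.15) -> float:
    """Apply the S411 to S413 fallback only when the region similarity is zero."""
    if rs != 0:
        return rs
    best = 0.0
    for first in ed:
        for second in sd_n:
            if is_affiliated(first, second):
                value = first_value if is_admin_center(first, second) else second_value
                best = max(best, value)  # assumption: keep the highest applicable value
    return best
```

With these example values, a name mentioning only "Guangzhou City" still receives a region similarity of 0.3 against a standard name registered under "Guangdong Province", while "Shenzhen City" receives 0.15.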
S5. Calculating the comprehensive matching degree between the to-be-matched hierarchical organization name and each standard hierarchical organization name according to the following formula:
$$P_n = h_{SS} \cdot SS_n + h_{ZS} \cdot ZS_n + h_{RS} \cdot RS_n$$
in the formula, $P_n$ represents the comprehensive matching degree between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name, $SS_n$ represents the search relevance between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name, $ZS_n$ represents the string similarity between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name, and $h_{SS}$, $h_{ZS}$ and $h_{RS}$ are first-class weight coefficients, each taking a value in the interval [0, 1], with $h_{SS} + h_{RS} + h_{ZS} = 1$.
In step S5, a hierarchical organization name can be described as consisting of three parts: region information, organization information and feature information. For example, in "Chongqing Medical University Affiliated Oral Hospital", the region information is "Chongqing", the organization information is "Medical University", and the feature information is "Oral". This embodiment therefore targets these three kinds of information, together with the overall information of the whole character string, and uses a linear weighting model to fuse the search relevance, string similarity and region similarity dimensions into the comprehensive matching degree between the to-be-matched hierarchical organization name and each standard hierarchical organization name, so that the matching result can subsequently be obtained accurately.
S6. Taking the standard hierarchical organization name in the standard hierarchical organization name set that corresponds to the maximum comprehensive matching degree as the matching result of the to-be-matched hierarchical organization name, and outputting the matching result.
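Steps S5 and S6 amount to a weighted sum followed by taking the maximum, as the following sketch shows. The default weights are the example values 0.45, 0.4 and 0.15 reported in this embodiment; the function name and the candidate data are hypothetical.

```python
def best_match(candidates: dict, h_ss: float = 0.45,
               h_zs: float = 0.4, h_rs: float = 0.15):
    """candidates maps a standard name to its (SS_n, ZS_n, RS_n) triple."""
    if not candidates:
        raise ValueError("no candidate names to match against")

    def composite(triple) -> float:
        ss, zs, rs = triple
        # S5: linear weighted fusion of the three dimensions.
        return h_ss * ss + h_zs * zs + h_rs * rs

    # S6: the standard name with the maximum comprehensive matching degree wins.
    name = max(candidates, key=lambda key: composite(candidates[key]))
    return name, composite(candidates[name])


name, degree = best_match({
    "Standard name A": (1.0, 0.6, 0.0),
    "Standard name B": (0.5, 0.9, 1.0),
})
```

In this example candidate A has the highest search relevance, yet candidate B wins with a comprehensive matching degree of 0.735 against 0.69, because its string and region similarities are higher.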
Therefore, the multidimensional matching method for hierarchical organization names described in steps S1 to S6 provides a multidimensional scheme for accurately matching hierarchical organization names: the search relevance, string similarity and region similarity between the to-be-matched hierarchical organization name and each standard hierarchical organization name are obtained first; a linear weighting model then fuses these dimensions into the comprehensive matching degree between the to-be-matched hierarchical organization name and each standard hierarchical organization name; finally, the standard hierarchical organization name corresponding to the maximum comprehensive matching degree is taken as the matching result of the to-be-matched hierarchical organization name and output.
On the basis of the technical solution of the first aspect, the present embodiment further provides a first possible design for how to determine the first-class weight coefficients in advance, that is, the first-class weight coefficients are determined in advance by using the following formula:
Figure BDA0003905717780000161
in the formula, $RP_{SS}$ represents the subset of the matched hierarchical organization name set whose names can be correctly matched to a standard hierarchical organization name based only on the maximum search relevance (that is, by taking the standard hierarchical organization name with the maximum search relevance in the standard hierarchical organization name set as the matching result), $RP_{ZS}$ represents the subset whose names can be correctly matched based only on the maximum string similarity (that is, by taking the standard hierarchical organization name with the maximum string similarity as the matching result), and $RP_{RS}$ represents the subset whose names can be correctly matched based only on the maximum region similarity (that is, by taking the standard hierarchical organization name with the maximum region similarity as the matching result).
This determination can be gradually optimized in an iterative manner while matching multiple to-be-matched hierarchical organization names. Finally, an expert proficient in the principles of the various string distance algorithms repeatedly examines the string characteristics of the mismatched items and fine-tunes the coefficients, giving the first-class weight coefficients $h_{SS} = 0.45$, $h_{ZS} = 0.4$ and $h_{RS} = 0.15$.
Therefore, based on the first possible design, the first-class weight coefficients can be determined precisely, so that accurate multidimensional fusion can be performed and the matching accuracy is further improved. In addition, the second-class weight coefficients may be determined precisely by similar means, which is not repeated here.
As shown in fig. 2, a second aspect of this embodiment provides a virtual device for implementing the multidimensional matching method for hierarchical organization names according to the first aspect or the possible design, including a search relevance obtaining module, a string similarity obtaining module, a regional entity obtaining module, a region similarity calculation module, a comprehensive matching degree calculation module, and a matching result determination module;
the search relevance obtaining module is used for obtaining the search relevance between the to-be-matched hierarchical organization name and each standard hierarchical organization name in the standard hierarchical organization name set, wherein the search relevance takes values in the interval [0, 1];
the string similarity obtaining module is used for obtaining the string similarity between the to-be-matched hierarchical organization name and each standard hierarchical organization name, wherein the string similarity takes values in the interval [0, 1];
the regional entity obtaining module is used for performing word segmentation and then regional entity recognition on the to-be-matched hierarchical organization name to obtain a regional entity set, wherein the regional entity set includes at least one normalized regional entity noun;
the region similarity calculation module is communicatively connected to the regional entity obtaining module and is used for calculating the region similarity between the to-be-matched hierarchical organization name and each standard hierarchical organization name according to the following formula:
$$RS_n = \frac{count(ED \cap SD_n)}{max(count(ED), count(SD_n))}$$
wherein $n$ represents a positive integer, $RS_n$ represents the region similarity between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name in the standard hierarchical organization name set, $count()$ represents a function that counts the total number of elements in a set, $ED$ represents the regional entity set of the to-be-matched hierarchical organization name, $SD_n$ represents the regional feature set of the $n$th standard hierarchical organization name, which includes at least one normalized regional entity noun, $max()$ represents a function that returns the maximum value, and $\cap$ represents the intersection symbol;
the comprehensive matching degree calculation module is communicatively connected to the search relevance obtaining module, the string similarity obtaining module and the region similarity calculation module, respectively, and is used for calculating the comprehensive matching degree between the to-be-matched hierarchical organization name and each standard hierarchical organization name according to the following formula:
$$P_n = h_{SS} \cdot SS_n + h_{ZS} \cdot ZS_n + h_{RS} \cdot RS_n$$
in the formula, $P_n$ represents the comprehensive matching degree between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name, $SS_n$ represents the search relevance between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name, $ZS_n$ represents the string similarity between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name, and $h_{SS}$, $h_{ZS}$ and $h_{RS}$ are first-class weight coefficients, each taking a value in the interval [0, 1], with $h_{SS} + h_{RS} + h_{ZS} = 1$;
the matching result determination module is communicatively connected to the comprehensive matching degree calculation module and is used for taking the standard hierarchical organization name in the standard hierarchical organization name set that corresponds to the maximum comprehensive matching degree as the matching result of the to-be-matched hierarchical organization name, and outputting the matching result.
For the working process, working details and technical effects of the device provided in the second aspect of this embodiment, reference may be made to the multidimensional matching method for hierarchical organization names according to the first aspect or the possible design, which is not repeated here.
As shown in fig. 3, a third aspect of this embodiment provides a computer device for executing the multidimensional matching method for hierarchical names according to the first aspect or any design thereof, and the computer device includes a memory, a processor, and a transceiver, which are sequentially and communicatively connected, where the memory is used for storing a computer program, the transceiver is used for transceiving messages, and the processor is used for reading the computer program to execute the multidimensional matching method for hierarchical names according to the first aspect or any design thereof. For example, the Memory may include, but is not limited to, a Random-Access Memory (RAM), a Read-Only Memory (ROM), a Flash Memory (Flash Memory), a First-in First-out (FIFO), and/or a First-in Last-out (FILO), and the like; the processor may be, but is not limited to, a microprocessor of the model number STM32F105 family. In addition, the computer device may also include, but is not limited to, a power module, a display screen, and other necessary components.
For the working process, working details and technical effects of the computer device provided in the third aspect of this embodiment, reference may be made to the multidimensional matching method for hierarchical organization names according to the first aspect or the possible design, which is not repeated here.
A fourth aspect of this embodiment provides a computer-readable storage medium storing instructions for the multidimensional matching method for hierarchical organization names according to the first aspect or the possible design; that is, instructions are stored on the computer-readable storage medium and, when run on a computer, execute the multidimensional matching method for hierarchical organization names according to the first aspect or the possible design. The computer-readable storage medium is a carrier for storing data and may include, but is not limited to, a floppy disk, an optical disc, a hard disk, a flash memory, a USB flash drive and/or a memory stick; the computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.
For the working process, working details and technical effects of the computer-readable storage medium provided in the fourth aspect of this embodiment, reference may be made to the multidimensional matching method for hierarchical organization names according to the first aspect or the possible design, which is not repeated here.
A fifth aspect of this embodiment provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the multidimensional matching method for hierarchical organization names according to the first aspect or the possible design. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.
Finally, it should be noted that: the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A multidimensional matching method for hierarchical organization names, characterized by comprising:
obtaining the search relevance between the to-be-matched hierarchical organization name and each standard hierarchical organization name in a standard hierarchical organization name set, wherein the search relevance takes values in the interval [0, 1];
obtaining the string similarity between the to-be-matched hierarchical organization name and each standard hierarchical organization name, wherein the string similarity takes values in the interval [0, 1];
performing word segmentation and then regional entity recognition on the to-be-matched hierarchical organization name to obtain a regional entity set, wherein the regional entity set includes at least one normalized regional entity noun;
calculating the region similarity between the to-be-matched hierarchical organization name and each standard hierarchical organization name according to the following formula:
$$RS_n = \frac{count(ED \cap SD_n)}{max(count(ED), count(SD_n))}$$
wherein $n$ represents a positive integer, $RS_n$ represents the region similarity between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name in the standard hierarchical organization name set, $count()$ represents a function that counts the total number of elements in a set, $ED$ represents the regional entity set of the to-be-matched hierarchical organization name, $SD_n$ represents the regional feature set of the $n$th standard hierarchical organization name, which includes at least one normalized regional entity noun, $max()$ represents a function that returns the maximum value, and $\cap$ represents the intersection symbol;
and calculating the comprehensive matching degree of the names of the to-be-matched hierarchical mechanisms and the names of the standard hierarchical mechanisms according to the following formula:
$$P_n = h_{SS} \cdot SS_n + h_{ZS} \cdot ZS_n + h_{RS} \cdot RS_n$$
in the formula, $P_n$ represents the comprehensive matching degree between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name, $SS_n$ represents the search relevance between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name, $ZS_n$ represents the string similarity between the to-be-matched hierarchical organization name and the $n$th standard hierarchical organization name, and $h_{SS}$, $h_{ZS}$ and $h_{RS}$ are first-class weight coefficients, each taking a value in the interval [0, 1], with $h_{SS} + h_{RS} + h_{ZS} = 1$;
and taking the standard hierarchical organization name in the standard hierarchical organization name set that corresponds to the maximum comprehensive matching degree as the matching result of the to-be-matched hierarchical organization name, and outputting the matching result.
2. The multidimensional matching method for names of hierarchies according to claim 1, wherein the obtaining of the search correlation between the name of the hierarchy to be matched and each standard hierarchy name in the standard hierarchy name set comprises:
importing the standard hierarchical mechanism name set into an Elasticsearch search engine;
using the name of the to-be-matched hierarchical mechanism as input information, and obtaining from the Elasticsearch search engine a correlation score, computed with the BM25 algorithm, between the name of the to-be-matched hierarchical mechanism and each standard hierarchical mechanism name in the standard hierarchical mechanism name set;
and normalizing the correlation scores between the name of the to-be-matched hierarchical mechanism and the standard hierarchical mechanism names to obtain the search correlation between the name of the to-be-matched hierarchical mechanism and each standard hierarchical mechanism name.
3. The hierarchical name multidimensional matching method according to claim 2, wherein the BM25 algorithm employs the following formula:
Figure FDA0003905717770000021
in the formula, n represents a positive integer, x represents the name of the to-be-matched hierarchical mechanism, $D_n$ represents the nth standard hierarchical mechanism name in the standard hierarchical mechanism name set, $Score_{BM25}(x, D_n)$ represents the correlation score between the name of the to-be-matched hierarchical mechanism and the nth standard hierarchical mechanism name, m represents a positive integer, M represents the total number of words in the name of the to-be-matched hierarchical mechanism, D represents the standard hierarchical mechanism name set, $T_m$ represents the mth word in the name of the to-be-matched hierarchical mechanism, the symbol shown in Figure FDA0003905717770000022 represents the number of occurrences of the mth word in the standard hierarchical mechanism name set, and the symbol shown in Figure FDA0003905717770000023 represents the number of occurrences of the mth word in the nth standard hierarchical mechanism name.
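Because the BM25 formula of claim 3 survives here only as an image reference, the following Python sketch uses the widely known Okapi BM25 form with a Lucene-style inverse document frequency. The parameters `k1` and `b`, the length normalization, and the function name `bm25_scores` are assumptions and may differ from the formula in the patent figure.

```python
import math
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized standard name against the tokenized query name."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    scores = []
    for doc in docs:
        total = 0.0
        for term in query:
            tf = doc.count(term)  # occurrences of the word in this name
            if tf == 0:
                continue
            df = sum(1 for d in docs if term in d)  # names containing the word
            idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            total += idf * tf * (k1 + 1.0) / norm
        scores.append(total)
    return scores
```

The names are assumed to be segmented into words beforehand; in the claimed method this scoring is delegated to the search engine rather than computed by hand.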
4. The multidimensional matching method for names of hierarchies according to claim 2, wherein the normalization processing is performed on the correlation scores of the names of hierarchies to be matched and the names of the respective standard hierarchies to obtain the search correlation degrees of the names of the hierarchies to be matched and the names of the respective standard hierarchies, and the method comprises the following steps:
extracting, from the standard hierarchical mechanism name set, the K standard hierarchical mechanism names with the highest correlation scores to obtain a standard hierarchical mechanism name candidate set that replaces the standard hierarchical mechanism name set, wherein K represents a positive integer not less than 8;
and calculating the search correlation between the name of the to-be-matched hierarchical mechanism and each standard hierarchical mechanism name in the standard hierarchical mechanism name candidate set according to the following formula:
Figure FDA0003905717770000024
wherein k represents a positive integer, $SS_k$ represents the search correlation between the name of the to-be-matched hierarchical mechanism and the kth standard hierarchical mechanism name in the standard hierarchical mechanism name candidate set, $Score_k$ represents the correlation score between the name of the to-be-matched hierarchical mechanism and the kth standard hierarchical mechanism name, $Score_{min}$ represents the minimum correlation score between the name of the to-be-matched hierarchical mechanism and the names in the standard hierarchical mechanism name candidate set, and $Score_{max}$ represents the maximum correlation score between the name of the to-be-matched hierarchical mechanism and the names in the standard hierarchical mechanism name candidate set.
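The candidate selection and normalization of claim 4 might look like the sketch below. The normalization formula is only referenced as an image, so ordinary min-max scaling, (Score_k - Score_min) / (Score_max - Score_min), is assumed from the definitions of Score_min and Score_max; the function name and the handling of equal scores are hypothetical.

```python
from typing import Dict


def top_k_min_max(scores: Dict[str, float], k: int = 8) -> Dict[str, float]:
    """Keep the K best-scoring names and rescale their scores into [0, 1]."""
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    if not top:
        return {}
    values = [s for _, s in top]
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: when all candidates score equally, treat each as 1.0.
        return {name: 1.0 for name, _ in top}
    return {name: (s - lo) / (hi - lo) for name, s in top}
```

Restricting the later similarity computations to the candidate set keeps the cost independent of the size of the full standard name set.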
5. The multidimensional matching method for names of hierarchical organizations according to claim 1, wherein obtaining the similarity of the character strings between the names of the hierarchical organizations to be matched and the names of the standard hierarchical organizations comprises:
acquiring the edit distance similarity between the name of the to-be-matched hierarchical mechanism and the name of each standard hierarchical mechanism, wherein the edit distance similarity takes values in the interval [0,1];
acquiring the J-W distance similarity between the name of the to-be-matched hierarchical mechanism and the name of each standard hierarchical mechanism, wherein the J-W distance similarity takes values in the interval [0,1];
calculating the Jaccard similarity between the name of the to-be-matched hierarchical mechanism and the name of each standard hierarchical mechanism according to the following formula:
Figure FDA0003905717770000031
in the formula, n represents a positive integer, $ZS_{n,jc}$ represents the Jaccard similarity between the name of the to-be-matched hierarchical mechanism and the nth standard hierarchical mechanism name in the standard hierarchical mechanism name set, Tx represents the word set of the name of the to-be-matched hierarchical mechanism, and $TD_n$ represents the word set of the nth standard hierarchical mechanism name;
and calculating the similarity of the longest common character string of the name of the to-be-matched hierarchical mechanism and the name of each standard hierarchical mechanism according to the following formula:
Figure FDA0003905717770000032
in the formula, $ZS_{n,lcs}$ represents the longest common character string similarity between the name of the to-be-matched hierarchical mechanism and the nth standard hierarchical mechanism name in the standard hierarchical mechanism name set, x represents the name of the to-be-matched hierarchical mechanism, $D_n$ represents the nth standard hierarchical mechanism name, and $LCS(x, D_n)$ represents the length of the longest common substring of the name of the to-be-matched hierarchical mechanism and the nth standard hierarchical mechanism name;
and calculating the character string similarity of the name of the to-be-matched hierarchical mechanism and the name of each standard hierarchical mechanism according to the following formula:
$$ZS_n = k_{bd} \cdot ZS_{n,bd} + k_{jw} \cdot ZS_{n,jw} + k_{jc} \cdot ZS_{n,jc} + k_{lcs} \cdot ZS_{n,lcs}$$
in the formula, $ZS_n$ represents the character string similarity between the name of the to-be-matched hierarchical mechanism and the name of the nth standard hierarchical mechanism, $ZS_{n,bd}$ represents the edit distance similarity between the name of the to-be-matched hierarchical mechanism and the name of the nth standard hierarchical mechanism, $ZS_{n,jw}$ represents the J-W distance similarity between the name of the to-be-matched hierarchical mechanism and the name of the nth standard hierarchical mechanism, and $k_{bd}$, $k_{jw}$, $k_{jc}$ and $k_{lcs}$ are second-class weight coefficients each taking a value in the interval [0,1], with $k_{bd} + k_{jw} + k_{jc} + k_{lcs} = 1$.
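The weighted string similarity of claim 5 can be sketched in Python as follows. The Jaccard and longest-common-substring formulas are images in this text, so the usual Jaccard ratio and a normalization of the substring length by the longer name are assumptions, as is the conversion of edit distance into a similarity. The J-W similarity is taken as a precomputed input (`jw`) rather than implemented, and the equal default weights are illustrative only.

```python
from typing import Set, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed conversion: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


def jaccard(tx: Set[str], td_n: Set[str]) -> float:
    union = tx | td_n
    return len(tx & td_n) / len(union) if union else 1.0


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common contiguous substring."""
    best, prev = 0, [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, cur[-1])
        prev = cur
    return best


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization: LCS length / length of the longer string."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def string_similarity(x: str, d_n: str, tx: Set[str], td_n: Set[str], jw: float,
                      k: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25)) -> float:
    """ZS_n = k_bd*ZS_bd + k_jw*ZS_jw + k_jc*ZS_jc + k_lcs*ZS_lcs."""
    k_bd, k_jw, k_jc, k_lcs = k
    return (k_bd * edit_similarity(x, d_n) + k_jw * jw
            + k_jc * jaccard(tx, td_n) + k_lcs * lcs_similarity(x, d_n))
```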
6. The method for multidimensional matching of names of hierarchies according to claim 1, wherein after the regional similarity between the name of a hierarchy to be matched and each of the names of standard hierarchies is calculated, the method further comprises:
if the regional similarity between the name of the to-be-matched hierarchical mechanism and the name of the nth standard hierarchical mechanism is zero, checking, according to a regional relation database, whether any regional entity noun in the regional entity set ED and any regional entity noun in the regional feature set $SD_n$ have a regional subordination relationship;
if a first regional entity noun in the regional entity set ED and a second regional entity noun in the regional feature set $SD_n$ have a regional subordination relationship, further determining whether the first regional entity noun is the administrative center of the second regional entity noun or whether the second regional entity noun is the administrative center of the first regional entity noun;
if yes, updating the regional similarity between the name of the to-be-matched hierarchical mechanism and the name of the nth standard hierarchical mechanism to a preset first numerical value, otherwise, updating the regional similarity between the name of the to-be-matched hierarchical mechanism and the name of the nth standard hierarchical mechanism to a preset second numerical value, wherein the first numerical value takes a value in the interval [0,1], and the second numerical value also takes a value in the interval [0,1] but is smaller than the first numerical value.
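The fallback rule of claim 6 can be sketched as below. The regional relation database is modeled here as two plain dictionaries (`parent_of` and `admin_center_of`), which is a hypothetical stand-in; the preset values 0.8 and 0.5 are illustrative, and taking the best value over all subordinate pairs is an assumption, since the claim does not say how several pairs are combined.

```python
from typing import Dict, Set


def adjust_region_similarity(rs: float, ed: Set[str], sd_n: Set[str],
                             parent_of: Dict[str, str],
                             admin_center_of: Dict[str, str],
                             first_value: float = 0.8,
                             second_value: float = 0.5) -> float:
    """Apply the subordination fallback only when the similarity is zero."""
    if rs != 0.0:
        return rs
    result = rs
    for a in ed:
        for b in sd_n:
            # Subordination: one region is the direct parent of the other.
            if parent_of.get(a) != b and parent_of.get(b) != a:
                continue
            # Administrative center: e.g. a provincial capital and its province.
            is_center = admin_center_of.get(a) == b or admin_center_of.get(b) == a
            result = max(result, first_value if is_center else second_value)
    return result
```

With `parent_of = {"Wuhan": "Hubei"}` and `admin_center_of = {"Hubei": "Wuhan"}`, a name mentioning Wuhan and a standard name mentioning Hubei would receive the first value instead of zero.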
7. The hierarchical authority name multidimensional matching method according to claim 1, wherein the first class weight coefficients are determined in advance using the following formula:
Figure FDA0003905717770000041
in the formula, $RP_{SS}$ represents the subset of the matched hierarchical mechanism name set whose names can be correctly matched to a standard hierarchical mechanism name based on the maximum search correlation, $RP_{ZS}$ represents the subset of the matched hierarchical mechanism name set whose names can be correctly matched to a standard hierarchical mechanism name based on the maximum character string similarity, and $RP_{RS}$ represents the subset of the matched hierarchical mechanism name set whose names can be correctly matched to a standard hierarchical mechanism name based only on the maximum regional similarity.
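The weight formula of claim 7 is not reproduced in this text, so the following Python sketch rests on a guess: each first-class weight is taken to be the size of its correctly matched subset divided by the sum of the three subset sizes, which at least satisfies the stated constraint that the weights lie in [0,1] and sum to 1. The function name and the equal-weight fallback are hypothetical.

```python
from typing import Set, Tuple


def first_class_weights(rp_ss: Set[str], rp_zs: Set[str],
                        rp_rs: Set[str]) -> Tuple[float, float, float]:
    """Assumed form: h_X = count(RP_X) / sum of the three subset sizes."""
    counts = (len(rp_ss), len(rp_zs), len(rp_rs))
    total = sum(counts)
    if total == 0:
        # Assumption: fall back to equal weights when no subset has matches.
        return (1 / 3, 1 / 3, 1 / 3)
    return tuple(c / total for c in counts)
```

Under this reading, a dimension that resolves more labeled names correctly on its own receives proportionally more weight in the comprehensive matching degree.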
8. A multi-dimensional matching device for hierarchical mechanism names is characterized by comprising a search correlation degree acquisition module, a character string similarity degree acquisition module, a region entity acquisition module, a region similarity degree calculation module, a comprehensive matching degree calculation module and a matching result determination module;
the search correlation degree acquisition module is used for obtaining the search correlation between the name of the to-be-matched hierarchical mechanism and each standard hierarchical mechanism name in the standard hierarchical mechanism name set, wherein the search correlation takes values in the interval [0,1];
the character string similarity acquisition module is used for obtaining the character string similarity between the name of the to-be-matched hierarchical mechanism and the name of each standard hierarchical mechanism, wherein the character string similarity takes values in the interval [0,1];
the region entity acquisition module is used for sequentially carrying out word segmentation processing and region entity identification processing on the names of the to-be-matched hierarchical mechanisms to obtain a region entity set, wherein the region entity set comprises at least one normalized region entity noun;
the region similarity calculation module is in communication connection with the region entity acquisition module, and is configured to calculate the region similarity between the name of the to-be-matched hierarchical mechanism and the name of each standard hierarchical mechanism according to the following formula:
Figure FDA0003905717770000051
wherein n represents a positive integer, $RS_n$ represents the regional similarity between the name of the to-be-matched hierarchical mechanism and the name of the nth standard hierarchical mechanism in the standard hierarchical mechanism name set, count() represents a function that counts the total number of elements in a set, ED represents the regional entity set of the name of the to-be-matched hierarchical mechanism, $SD_n$ represents the regional feature set of the name of the nth standard hierarchical mechanism and includes at least one normalized regional entity noun, max() represents a function that returns the maximum value, and ∩ represents the intersection symbol;
the comprehensive matching degree calculation module is respectively in communication connection with the search correlation degree acquisition module, the character string similarity acquisition module and the region similarity calculation module, and is used for calculating and obtaining the comprehensive matching degree of the name of the to-be-matched hierarchical mechanism and the name of each standard hierarchical mechanism according to the following formula:
$$P_n = h_{SS} \cdot SS_n + h_{ZS} \cdot ZS_n + h_{RS} \cdot RS_n$$
in the formula, $P_n$ represents the comprehensive matching degree of the name of the to-be-matched hierarchical mechanism and the name of the nth standard hierarchical mechanism, $SS_n$ represents the search correlation between the name of the to-be-matched hierarchical mechanism and the name of the nth standard hierarchical mechanism, $ZS_n$ represents the character string similarity between the name of the to-be-matched hierarchical mechanism and the name of the nth standard hierarchical mechanism, and $h_{SS}$, $h_{ZS}$ and $h_{RS}$ are first-class weight coefficients each taking a value in the interval [0,1], with $h_{SS} + h_{RS} + h_{ZS} = 1$;
and the matching result determining module is in communication connection with the comprehensive matching degree calculation module and is used for taking the standard hierarchical mechanism name that corresponds to the maximum comprehensive matching degree in the standard hierarchical mechanism name set as the matching result for the name of the to-be-matched hierarchical mechanism and outputting the matching result.
9. A computer device, comprising a memory, a processor and a transceiver that are in communication connection in sequence, wherein the memory is used for storing a computer program, the transceiver is used for sending and receiving messages, and the processor is used for reading the computer program and executing the method for multidimensional matching of hierarchical names according to any of claims 1 to 7.
10. A computer-readable storage medium having stored thereon instructions for performing the method for multidimensional matching of hierarchical names according to any one of claims 1 to 7 when the instructions are run on a computer.
CN202211305393.8A 2022-10-24 2022-10-24 Multi-dimensional matching method, device and equipment for names of layered mechanisms and storage medium Pending CN115858878A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211305393.8A CN115858878A (en) 2022-10-24 2022-10-24 Multi-dimensional matching method, device and equipment for names of layered mechanisms and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211305393.8A CN115858878A (en) 2022-10-24 2022-10-24 Multi-dimensional matching method, device and equipment for names of layered mechanisms and storage medium

Publications (1)

Publication Number Publication Date
CN115858878A true CN115858878A (en) 2023-03-28

Family

ID=85661751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211305393.8A Pending CN115858878A (en) 2022-10-24 2022-10-24 Multi-dimensional matching method, device and equipment for names of layered mechanisms and storage medium

Country Status (1)

Country Link
CN (1) CN115858878A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117312707A (en) * 2023-09-05 2023-12-29 东南大学 Website fingerprint generation method based on dynamic and static feature combination

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
US9613024B1 (en) System and methods for creating datasets representing words and objects
US11714831B2 (en) Data processing and classification
CN105468605B (en) Entity information map generation method and device
US8341095B2 (en) Supervised semantic indexing and its extensions
WO2021139262A1 (en) Document mesh term aggregation method and apparatus, computer device, and readable storage medium
EP3958145A1 (en) Method and apparatus for semantic retrieval, device and storage medium
CN106156272A (en) A kind of information retrieval method based on multi-source semantic analysis
US20130036076A1 (en) Method for keyword extraction
CN106372117B (en) A kind of file classification method and its device based on Term co-occurrence
US20200073890A1 (en) Intelligent search platforms
CN111753167A (en) Search processing method, search processing device, computer equipment and medium
Yilahun et al. Entity extraction based on the combination of information entropy and TF-IDF
CN115858878A (en) Multi-dimensional matching method, device and equipment for names of layered mechanisms and storage medium
CN115828854B (en) Efficient table entity linking method based on context disambiguation
Zhang et al. An approach for named entity disambiguation with knowledge graph
US20230282018A1 (en) Generating weighted contextual themes to guide unsupervised keyphrase relevance models
Yang et al. Exploring word similarity to improve chinese personal name disambiguation
CN115203379A (en) Retrieval method, retrieval apparatus, computer device, storage medium, and program product
CN112215006B (en) Organization named entity normalization method and system
CN113868387A (en) Word2vec medical similar problem retrieval method based on improved tf-idf weighting
JP4567025B2 (en) Text classification device, text classification method, text classification program, and recording medium recording the program
Sun et al. Joint self-attention based neural networks for semantic relation extraction
Ali et al. Word embedding based new corpus for low-resourced language: Sindhi
Bolikowski et al. Towards a flexible author name disambiguation framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination