CN113868312A - Multi-method fused mechanism matching method, device, equipment and storage medium - Google Patents

Multi-method fused mechanism matching method, device, equipment and storage medium

Info

Publication number
CN113868312A
CN113868312A
Authority
CN
China
Prior art keywords
data
entity
matching
matched
preprocessed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111192516.7A
Other languages
Chinese (zh)
Inventor
王杨
王茜
张奥琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai R&d Public Service Platform Management Center
Original Assignee
Shanghai R&d Public Service Platform Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai R&d Public Service Platform Management Center filed Critical Shanghai R&d Public Service Platform Management Center
Priority to CN202111192516.7A priority Critical patent/CN113868312A/en
Publication of CN113868312A publication Critical patent/CN113868312A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2468Fuzzy queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Automation & Control Theory (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a multi-method fused mechanism (institution) matching method, device, equipment and storage medium. A data preprocessing configuration file is constructed to clean and standardize the mechanism data to be matched and the target mechanism table; entity labeling is performed on the preprocessed mechanism data to be matched with a machine learning model, and mechanism entities and region entities are extracted from the labeling result in combination with user-defined rules; a weight is configured for each mechanism record in the target mechanism table; and the preprocessed mechanism data to be matched is directly matched or fuzzily matched against the target mechanism table based on the entity labeling result and the weights of the target mechanism table, yielding the matching result. The method can be used to solve problems such as mechanism entity alignment and disambiguation of homonymous scholars; it greatly reduces the manual cost of preprocessing and entity labeling, adapts to the processing requirements of different data sets, improves the labeling effect of the model, and achieves higher matching accuracy.

Description

Multi-method fused mechanism matching method, device, equipment and storage medium
Technical Field
The present application relates to the field of data processing and text matching technologies, and in particular, to a multi-method fused mechanism matching method, apparatus, device, and storage medium.
Background
With the development of big data in the scientific field, the standardized alignment of irregularly written mechanism (institution) names has become an urgent problem in the information analysis of scientific talents, research institutions and scientific literature. Because the data are large in scale and difficult to process manually, traditional rule-based data cleaning cannot cope with complex and inconsistent data formats, and machine learning models have become a new solution path. Previous approaches label mechanism entities with a machine learning model and then match them directly, but such methods struggle when mechanism names carry the same meaning with inconsistent spellings. In addition, conventional methods neither clean the target mechanism library nor configure weights on it, so a secondary mechanism may be matched in preference to a primary mechanism when a similarity model is applied, which places high demands on the standardization quality of the target mechanism library.
With the development of big data technology, various methods such as data preprocessing, standardization, machine learning model labeling, weight setting and the like are comprehensively applied, so that the accuracy of mechanism name matching can be improved, and the method is further applied to the work of establishing a knowledge graph, data analysis, disambiguation of homonymy scholars and the like.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, it is an object of the present application to provide a multi-method converged organization matching method, apparatus, device and storage medium to solve at least one problem existing in the prior art.
To achieve the above and other related objects, the present application provides a multi-method fused mechanism matching method, including: constructing a data preprocessing configuration file, and cleaning and standardizing mechanism data to be matched and a target mechanism table; entity labeling is carried out on the preprocessed mechanism data to be matched by utilizing a machine learning model, and a mechanism entity and a region entity are extracted from an entity labeling result by combining a user-defined rule; configuring weights for each mechanism data in the target mechanism table; and directly matching or fuzzy matching the preprocessed mechanism data to be matched with the target mechanism table based on the entity labeling result and the weight of the target mechanism table to obtain a matching result.
In an embodiment of the present application, the constructing a data preprocessing configuration file, and cleaning and standardizing the mechanism data to be matched and the target mechanism table includes: constructing a matching mode and a processing rule of invalid characters, special characters, irrelevant information and symbol specifications to be used as a data preprocessing configuration file for input; cleaning and standardizing the mechanism data to be matched and the target mechanism table by utilizing a plurality of tools and combining regular expressions based on the data preprocessing configuration file; and storing the preprocessed mechanism data to be matched and the target mechanism table into a MongoDB database for entity marking and mechanism matching.
In one embodiment of the present application, the cleaning and normalizing comprises: any one or more of unresolved HTML content conversion, unification of symbols, cleansing of invalid characters located in the middle, removal of interfering characters at the beginning and end, cleansing of extraneous information, unification of spelling format into title specification, compaction of multiple spaces into one, specification of control space format, unification of full-angle characters into half-angle, and standardization of organization name for organization.
In an embodiment of the present application, the entity labeling of the preprocessed mechanism data to be matched by using the machine learning model, and extracting the mechanism entity and the area entity from the entity labeling result by combining the customized rule include: segmenting a plurality of mechanism names in preprocessed mechanism data to be matched; utilizing a machine learning model to label the entity of each organization name, and storing the labeling results corresponding to all the organization entities into a MongoDB database; and processing the labeling result of the machine learning model according to a custom rule so as to add the unidentified organization name to the organization entity and store the organization name in the MongoDB database.
In an embodiment of the present application, the method further includes: extracting region entities through a machine learning model; expanding a regional entity containing an organization name into the organization entity; and respectively storing the optimized organization entity and the optimized region entity into a MongoDB database.
In an embodiment of the present application, the directly matching the preprocessed mechanism data to be matched with the target mechanism table includes: constructing a dictionary data type from the target mechanism table according to the weights; directly matching the preprocessed mechanism data to be matched and the extracted mechanism entities through dictionary lookups in descending order of the configured weights; and adding the corresponding mechanism identifier to each successfully matched mechanism entity.
In an embodiment of the present application, the fuzzy matching of the preprocessed mechanism data to be matched against the target mechanism table includes: importing the target mechanism table and the configured weights into Elasticsearch for fuzzy matching; sequentially fusing multi-step, multi-source mechanism name similarity calculations on the preprocessed mechanism data to be matched using Elasticsearch; obtaining a final matching-degree score between mechanism names in combination with the user-defined weights; and performing fuzzy search on the target mechanism table using, in turn, the preprocessed mechanism data to be matched, the entity labeling result, and the partitioned result of the preprocessed mechanism data to be matched, selecting the mechanism name whose matching-degree score meets the corresponding threshold as the final matching result.
To achieve the above and other related objects, the present application provides a multi-method fused mechanism matching apparatus, comprising: the preprocessing module is used for constructing a data preprocessing configuration file, and cleaning and standardizing the mechanism data to be matched and the target mechanism table; the processing module is used for carrying out entity labeling on the preprocessed mechanism data to be matched by utilizing the machine learning model and extracting mechanism entities and region entities from the entity labeling result by combining with a custom rule; configuring weights for each mechanism data in the target mechanism table; and directly matching or fuzzy matching the preprocessed mechanism data to be matched with the target mechanism table based on the entity labeling result and the weight of the target mechanism table to obtain a matching result.
To achieve the above and other related objects, the present application provides a computer apparatus, comprising: a memory, and a processor; the memory is to store computer instructions; the processor executes computer instructions to implement the method as described above.
To achieve the above and other related objects, the present application provides a computer readable storage medium storing computer instructions which, when executed, perform the method as described above.
In summary, the multi-method integrated mechanism matching method, device, equipment and storage medium provided by the application clean and standardize the mechanism data to be matched and the target mechanism table by constructing a data preprocessing configuration file; entity labeling is carried out on the preprocessed mechanism data to be matched by utilizing a machine learning model, and a mechanism entity and a region entity are extracted from an entity labeling result by combining a user-defined rule; configuring weights for each mechanism data in the target mechanism table; and directly matching or fuzzy matching the preprocessed mechanism data to be matched with the target mechanism table based on the entity labeling result and the weight of the target mechanism table to obtain a matching result.
Has the following beneficial effects:
1) semi-automation of mechanism name data preprocessing and entity labeling is realized, greatly saving labor cost in both stages; because the preprocessing configuration file is a user-defined input, the method adapts to the processing requirements of different data sets;
2) strong adaptability to non-standardized target mechanism tables and mechanism data to be matched: through data cleaning, mechanism name standardization and weight configuration, both sides of the match share the same data format specification, which improves the model labeling effect and the matching accuracy;
3) an Elasticsearch-based fuzzy search module performs fuzzy matching on mechanism texts that cannot be matched directly, and combines the configured weights to obtain a final matching score;
4) the preprocessed text and the entity labeling result are jointly used as input to the matching stage, avoiding problems introduced by machine learning model labeling and achieving higher matching accuracy;
5) matching between the mechanism data to be matched and the target mechanism table is realized, which can be used to solve problems such as mechanism entity alignment and disambiguation of homonymous scholars.
Drawings
Fig. 1 is a flow chart illustrating a multi-method fused mechanism matching method according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating step S1 according to an embodiment of the present application.
Fig. 3 is a flowchart illustrating step S2 according to an embodiment of the present application.
Fig. 4 is a flowchart illustrating step S23 according to an embodiment of the present application.
Fig. 5 is a flowchart illustrating the direct matching in step S4 according to an embodiment of the present application.
Fig. 6 is a flowchart illustrating the fuzzy matching in step S4 according to an embodiment of the present application.
FIG. 7 is a block diagram of a multi-method fused mechanism matching system according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application is provided by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. The present application is capable of other and different embodiments and its several details are capable of modifications and/or changes in various respects, all without departing from the spirit of the present application. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only schematic illustrations of the basic idea of the present application. The drawings show only the components related to the present application rather than the actual number, shape and size of components in implementation; the type, quantity and proportion of components in an actual implementation may vary freely, and the component layout may be more complex.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used in this specification, specify the presence of stated features, steps, operations, elements, components, items, species, and/or groups, but do not preclude the presence, or addition of one or more other features, steps, operations, elements, components, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C." An exception to this definition will occur only when a combination of elements, functions, steps or operations are inherently mutually exclusive in some way.
In view of the practical problems currently faced in processing mechanism names, such as non-standardized and inconsistent mechanism texts, the high cost of manual cleaning and labeling, and poor data consistency, the multi-method fused mechanism matching method, device, equipment and medium provided by the application can be used to solve them. By combining preprocessing, machine learning labeling, weight configuration and direct/fuzzy matching, the application greatly reduces manual processing cost, ensures consistent treatment of repeated mechanism texts, and improves matching accuracy.
Fig. 1 is a schematic flow chart of a multi-method fusion mechanism matching method according to an embodiment of the present application. As shown, the method comprises:
step S1: and constructing a data preprocessing configuration file, and cleaning and standardizing the mechanism data to be matched and the target mechanism table.
In brief, a mechanism data preprocessing framework driven by user-defined matching patterns and processing-rule configuration files cleans and standardizes both the mechanism data to be matched and the target mechanism table, so that the two sides of the match share the same data format specification.
In the application, a data preprocessing configuration file is constructed that specifies matching patterns and processing rules for special characters, invalid characters and irrelevant text; characters and character strings are processed with regular expressions supplied as input parameters. Standardization is achieved by normalizing the spelling and space format of mechanism names and removing mechanism-type terms. Preprocessing is applied separately to the mechanism data to be matched and to the target mechanism table, standardizing the target mechanism table and cleaning the mechanism data to be matched.
In this embodiment, as shown in fig. 2, step S1 specifically includes:
step S11: and constructing a matching mode and a processing rule of invalid characters, special characters, irrelevant information and symbol specifications to be used as a data preprocessing configuration file for inputting.
The method constructs matching patterns and processing rules, suited to the characteristics of one's own data, for invalid characters, special characters, irrelevant information and symbol specifications, and inputs them as a configuration file. The configuration file is stored in json format and divided into three parts: character processing, character-string processing and specification processing.
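As an illustrative sketch only (the patent does not disclose its actual file), such a three-part json configuration might be built and saved as follows; every key name and pattern below is an assumption:

```python
import json

# Hypothetical three-part preprocessing configuration: character processing,
# character-string processing, and specification (normalization) processing.
config = {
    "char_rules": [
        {"pattern": r"[\u2013\u2014]", "replace": "-"},   # unify dash symbols
        {"pattern": r"\s{2,}", "replace": " "},           # compress repeated spaces
    ],
    "string_rules": [
        # drop e-mail addresses (irrelevant information with a regular structure)
        {"pattern": r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+",
         "replace": ""},
    ],
    "norm_rules": {"title_case": True, "full_to_half_width": True},
}

with open("preprocess_config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
```

The cleaning code can then load this file and apply each rule with `re.sub`, keeping the rules pluggable as the description requires.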
Step S12: and cleaning and standardizing the mechanism data to be matched and the target mechanism table by utilizing a plurality of tools and combining regular expressions based on the data preprocessing configuration file.
A regular expression is a logical formula for operating on character strings, consisting of ordinary characters (e.g., the letters a to z) and special characters (called metacharacters): a pattern string built from predefined characters and their combinations that expresses filtering logic over character strings. In other words, a regular expression is a text pattern describing one or more strings to be matched when searching text.
Specifically, the method cleans and standardizes the mechanism data to be matched and the target mechanism table using tools such as regular expressions, unicodedata, titlecase and cleanco. The matching patterns and processing rules of the regular expressions are taken as input from the user-defined configuration file; invalid characters are extracted and filtered, with different rules applied according to their positions. Standardization covers mechanism name standardization, mechanism name spelling standardization, space-format standardization and character standardization.
For example, the pre-processed content includes, but is not limited to: any one or more of unresolved HTML content conversion, unification of symbols, cleansing of invalid characters located in the middle, removal of interfering characters at the beginning and end, cleansing of extraneous information, unification of spelling format into title specification, compaction of multiple spaces into one, specification of control space format, unification of full-angle characters into half-angle, and standardization of organization name for organization.
Unresolved HTML content conversion, for example: entities such as &amp;amp;, &amp;lt; and &amp;quot; can be resolved with the unescape function of an HTML parser.
Unification of symbols, for example: the various separator symbols (vertical bars, slashes and dash variants) can be unified via a regular-expression character class; likewise, an "&" meaning "and" can be replaced according to a rule.
Cleaning invalid characters in the middle, for example: garbled characters caused by inconsistent encodings, blank characters and symbols carrying no semantic information can be matched with a regular expression over character ranges (e.g., a class built on \x00-\x7F and \xC0-\xFF); a valid character set can also be constructed by counting the frequency of each character in the data set.
Removing interfering characters at the beginning and end, for example: stripping a set of punctuation characters such as ; - # & ( ) [ ] _ can be realized with the strip function.
Cleaning irrelevant information, for example: information with a regular structure, such as e-mail addresses, can be matched with a regular expression like [a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+; P.O. Box strings can be handled with a similar pattern.
Unifying the spelling format into title case, for example: capitalizing the first letter of each word while keeping prepositions in conventional case can be realized with the titlecase tool.
Compressing multiple spaces into one, for example: realized with the regular expression \s{2,}.
Controlling the space format by specification, for example: no space before a punctuation symbol and exactly one space after it, realized with a regular expression such as \s([,.;:]).
Unifying full-width characters into half-width, for example: realized with the normalize method of the unicodedata module, with 'NFKC' as the parameter.
Standardizing mechanism names, for example: removing mechanism-type terms such as Ltd, Corp and Co can be realized with the basename function of the cleanco tool available on GitHub.
It should be noted that, in the data preprocessing stage, content with a regular structure, such as e-mail addresses and P.O. Box strings, is cleaned as invalid information. This prevents the machine learning model in step S2 from labeling "P.O.Box" as a region entity and further improves the accuracy of entity labeling.
In the present application, the purpose of cleaning and standardizing both the mechanism data to be matched and the target mechanism table (both applying the preprocessing flow) is to achieve that both the data are in the same standard, and at the same time, the preprocessing flow has a wider application range for the data quality of the target mechanism table. In addition, the pretreatment steps are all pluggable, and can be opened or closed during specific implementation.
For example, through step S1 the mechanism data to be matched "[CENTER FOR DISEASE PREVENTION & CONTROL OF GUANGDONG PROVINCE], P.O.Box 100, Guangzhou 511430, China" can be cleaned into "Center for Disease Prevention and Control of Guangdong Province", and the mechanism-type terms in a name such as "Toppan Printing Co." can be removed by mechanism name standardization.
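The cleaning steps above can be sketched in Python using only standard-library stand-ins for the cited tools; the exact rule set, the P.O. Box pattern, and the use of `str.title()` in place of the titlecase tool are all assumptions:

```python
import html
import re
import unicodedata

def clean_name(raw: str) -> str:
    """Simplified sketch of the step-S12 cleaning pipeline (rules assumed)."""
    s = html.unescape(raw)                        # resolve &amp; &lt; &quot; etc.
    s = unicodedata.normalize("NFKC", s)          # full-width -> half-width
    # drop irrelevant information with a regular structure
    s = re.sub(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", "", s)
    s = re.sub(r"\bP\.?O\.?\s*Box\s*\d+\b", "", s, flags=re.I)  # assumed P.O. Box rule
    s = s.replace("&", " and ")                   # unify the '&' separator
    s = re.sub(r"\s{2,}", " ", s)                 # compress multiple spaces
    s = s.strip(" ;,-#&()[]_")                    # strip head/tail interference chars
    return s.title()                              # rough stand-in for titlecase
```

Each step is independent, so individual rules can be switched on or off, matching the pluggable design the description calls for.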
Step S13: and storing the preprocessed mechanism data to be matched and the target mechanism table into a MongoDB database for subsequent entity marking and mechanism matching.
MongoDB is a database based on distributed file storage, written in the C++ language and intended to provide a scalable, high-performance data storage solution for web applications. It is a product between relational and non-relational databases and, among non-relational databases, is the most feature-rich and the most similar to relational ones. The data structure it supports is very loose, a json-like bson format, so it can store relatively complex data types. Its greatest strength is a very powerful query language: its syntax resembles object-oriented query languages, it can realize almost all of the single-table query functions of a relational database, and it also supports building indexes on data.
Step S2: and carrying out entity labeling on the preprocessed mechanism data to be matched by utilizing a machine learning model, and extracting mechanism entities and area entities from an entity labeling result by combining a user-defined rule.
Briefly, a machine learning model is comprehensively used for carrying out entity marking on the preprocessed mechanism data to be matched, and the mechanism entities and the region entities are extracted by combining a user-defined rule method so as to realize matching with the target mechanism table.
In an embodiment of the present application, as shown in fig. 3, the step S2 specifically includes:
step S21: and segmenting a plurality of mechanism names in the preprocessed mechanism data to be matched.
Specifically, the multiple mechanism names that may exist in one piece of preprocessed mechanism data to be matched are split on punctuation marks, so that the extraction results can be merged after entity labeling.
It should be noted that, because one piece of mechanism data to be matched may contain multiple mechanism names, splitting it and labeling the parts in sequence improves the accuracy of model labeling.
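For instance, the splitting in step S21 might look like this in Python; the delimiter set is an assumption for illustration:

```python
import re

def split_org_names(record: str) -> list[str]:
    """Split one preprocessed record that may hold several mechanism names,
    so each part can be entity-labeled separately and the extraction
    results merged afterwards (delimiters assumed to be ';' and '/')."""
    parts = re.split(r"[;/]", record)
    return [p.strip() for p in parts if p.strip()]
```

After labeling each part independently, the per-part entities are merged back under the original record's identifier.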
Step S22: and (4) utilizing a machine learning model to label the entity of each organization name, and storing the labeling results corresponding to all the organization entities into a MongoDB database.
The machine learning models available for entity labeling include, but are not limited to, Stanza, spaCy and NLTK.
Preferably, the Stanza model, which is better suited to mechanism entities, is selected with its CoNLL-03-style extraction model for English corpora to perform entity labeling; the processing results, such as the labels for mechanisms, regions, persons and numbers and the position of each entity within the mechanism data string, are stored in the MongoDB database.
It should be noted that the Stanza labeling results for part of the mechanism-name corpus deviate slightly from the desired mechanism entity. For example, for a hospital whose name contains a number, the number is not labeled as part of the mechanism, and for a college carrying campus information, the campus address is not labeled as part of the mechanism; the labeling result is therefore further optimized with custom rules. In addition, for mechanism data containing multiple languages, the language can be identified automatically with models such as spaCy's language detectors, and a different entity labeling model can then be selected per language, improving labeling accuracy across languages.
Step S23: and processing the labeling result of the machine learning model according to a custom rule so as to add the unidentified organization name to the organization entity and store the organization name in the MongoDB database.
As shown in fig. 4, step S23 further includes:
s231: extracting region entities through a machine learning model;
s232: expanding a regional entity containing an organization name into the organization entity;
s233: and respectively storing the optimized organization entity and the optimized region entity into a MongoDB database.
Briefly, the entity results labeled by the machine learning model (such as the Stanza model) are further processed with custom rules. Whether a number entity or region entity is merged into the mechanism entity is decided by the nature of the content separating them in the labeling result: if the separator is, for example, a space, bracket, conjunction or preposition, the region and number information is added to the mechanism entity. The mechanism entity and region entity in the processed entity information are then stored; a custom mechanism-name rule is constructed, and the part of a region entity that contains a mechanism name is expanded into the mechanism entity. Finally, the processed mechanism entities and region entities are saved in the MongoDB database for convenient matching.
For example, for the cleaned organization data to be matched "Center for Disease Prevention and Control of XX Province", step S22 lets the Stanza model mark the organization entity "Center for Disease Prevention and Control" and the regional entity "XX Province"; step S23 then optimizes the organization entity to the full "Center for Disease Prevention and Control of XX Province" while keeping the regional entity "XX Province".
Step S3: configuring a weight for each item of organization data in the target organization table.
In an embodiment of the present application, since the target organization table contains secondary-organization name data, step S3 constructs a weight configuration file for the organization names in the target organization table so as to increase the weight of primary organizations, making matching prefer the primary organization. The specific weight configuration method is as follows:
A. setting the initial weight to a constant N;
B. reducing the weight of data containing a secondary organization (such as "School of", "College of", "Department of", "Faculty of", etc.) to N − w1;
C. increasing the weight of data containing a primary organization (such as "University", "Academy", "Center", "Institute", "Hospital", etc.) to N + w2, where w1 < w2;
D. for data containing both a secondary and a primary organization, preferentially reducing the weight; alternatively, for data containing several primary and secondary organizations, increasing or decreasing the weight only once.
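The weight rules A–D above can be sketched as follows. The values of N, w1, w2 and the keyword lists are assumptions chosen for illustration:

```python
N, W1, W2 = 100, 5, 10  # initial weight and adjustments, with w1 < w2

SECONDARY = ("school of", "college of", "department of", "faculty of")
PRIMARY = ("university", "academy", "center", "institute", "hospital")


def configure_weight(name: str) -> int:
    lower = name.lower()
    if any(k in lower for k in SECONDARY):
        return N - W1  # rules B and D: secondary takes priority, applied once
    if any(k in lower for k in PRIMARY):
        return N + W2  # rule C, applied once
    return N           # rule A: default weight
```

A name containing both a school and a university keyword is down-weighted (rule D), so the bare primary-organization entry in the table outranks its sub-units during matching.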
Step S4: directly matching or fuzzy matching the preprocessed organization data to be matched with the target organization table based on the entity labeling result and the weights of the target organization table to obtain a matching result.
In an embodiment of the present application, as shown in fig. 5, the directly matching the preprocessed mechanism data to be matched with the target mechanism table includes:
step S411: constructing a dictionary data type from the target organization table according to the weights;
step S412: directly matching the preprocessed organization data to be matched and the extracted organization entities using the dictionary get method, in descending order of the configured weights;
step S413: adding the corresponding organization identifier to each successfully matched organization entity.
Specifically, the target organization table is built into a dictionary data type keyed by weight: each weight is a key whose value is an organization-information dictionary, in which each organization name from the target table is a key and the organization's unique identifier is the value. In descending order of the weights configured in step S3, the preprocessed organization data to be matched and the extracted organization entities are directly matched using the dictionary get method, and the corresponding organization identifier is added to each successfully matched organization.
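The dictionary structure and descending-weight lookup described above can be sketched as follows; all organization names and identifiers here are made up:

```python
# weight -> {organization name -> unique identifier}
target_table = {
    110: {"example university": "ORG001"},
    100: {"example research lab": "ORG002"},
    95:  {"school of science, example university": "ORG003"},
}


def direct_match(candidate: str):
    """Return the identifier of the first exact match, highest weight first."""
    for weight in sorted(target_table, reverse=True):
        org_id = target_table[weight].get(candidate)
        if org_id is not None:
            return org_id
    return None  # fall through to fuzzy matching (step S42)
```

Because the scan goes from the highest weight down, a name present under both a primary and a secondary organization resolves to the primary one first.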
In an embodiment of the present application, as shown in fig. 6, the fuzzy matching of the preprocessed mechanism data to be matched and the target mechanism table includes:
step S421: importing the target organization table and the configured weights into Elasticsearch for fuzzy matching;
step S422: using Elasticsearch in fuzzy matching to apply, in sequence, a fused multi-step, multi-source organization-name similarity calculation to the preprocessed organization data to be matched;
step S423: combining the custom weights to obtain a final matching-degree score between organization names;
step S424: performing fuzzy search against the target organization table using, in turn, the preprocessed organization data to be matched, its entity labeling result, and its punctuation-split segments, and selecting the organization name whose matching-degree score meets the corresponding threshold as the final matching result.
In the present application, in addition to direct matching, fuzzy matching can be performed by importing the target organization table and the weight data into Elasticsearch.
Specifically, Elasticsearch is used to match the organization data to be matched step by step from multiple sources, and the final similarity between organization names is obtained in combination with the custom weights. For example, the similarity is modeled and calculated as follows:
A. querying with the fuzzy matching provided by Elasticsearch, with the edit-distance parameter fuzziness set to AUTO:6,100. That is, no edits are allowed for terms shorter than 6 characters, 1 edit for lengths between 6 and 100, and 2 edits for terms longer than 100 characters; since organization data will not contain words longer than 100 characters, this configuration effectively allows at most 1 edit.
B. using the similarity module provided by Elasticsearch, selecting the built-in TF/IDF model to score the similarity between the organization data to be matched and the target organization table, giving an initial similarity score s1;
C. combining the similarity value from the fuzzy-matching and similarity modules with Elasticsearch's weighted-search function function_score, with the boost_mode parameter set to sum, i.e., adding the initial score s1 and the configured weight W to obtain the final matching score.
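Steps A–C correspond to an Elasticsearch request body along the following lines. The index field names (`org_name`, `weight`) are assumptions, and executing the query against a live cluster is not shown:

```python
def build_fuzzy_query(name: str, min_score: float) -> dict:
    """Build a function_score query: fuzzy match plus additive weight."""
    return {
        "min_score": min_score,  # threshold K from the multi-step flow
        "query": {
            "function_score": {
                "query": {
                    "match": {
                        "org_name": {"query": name, "fuzziness": "AUTO:6,100"}
                    }
                },
                # add the stored per-organization weight to the text score
                "functions": [
                    {"field_value_factor": {"field": "weight", "missing": 0}}
                ],
                "boost_mode": "sum",  # final score = s1 + W
            }
        },
    }
```

The body would be passed to an Elasticsearch client's `search` call against the index holding the target organization table.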
Further, based on the similarity model and calculation method selected above, multi-step matching is performed by introducing different texts to be matched, as follows:
first, fuzzy matching is performed on the preprocessed organization data to be matched; matching results whose final matching-degree score is higher than K1 = 30 are selected and the corresponding organization identifiers are added;
second, fuzzy matching is performed on the entity extraction results of the data not yet matched; matching results whose final matching-degree score is higher than K2 = 25 (K2 < K1) are selected and the corresponding organization identifiers are added;
finally, the preprocessed text of the data still unmatched is split on punctuation and fuzzy matching is applied; matching results whose final matching-degree score is higher than K3 = 20 (K3 < K2) are selected and the corresponding organization identifiers are added.
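The three-stage cascade can be sketched as a function over an abstract `search` callable standing in for the Elasticsearch query, with the thresholds K1 > K2 > K3 from the flow above:

```python
import re

K1, K2, K3 = 30, 25, 20  # thresholds from the three-stage flow


def cascade_match(preprocessed, entity, search):
    """search(text) -> (org_id, score) or None; stands in for Elasticsearch."""
    # stages 1 and 2: full preprocessed name, then extracted entity
    for text, threshold in [(preprocessed, K1), (entity, K2)]:
        hit = search(text)
        if hit and hit[1] > threshold:
            return hit[0]
    # stage 3: split the preprocessed text on punctuation
    for fragment in re.split(r"[,;:()]+", preprocessed):
        hit = search(fragment.strip())
        if hit and hit[1] > K3:
            return hit[0]
    return None
```

Each later stage only sees data the earlier stages failed to match, and accepts progressively lower scores.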
It should be noted that if a database such as MongoDB is used as the storage engine, the matching speed of this step can be significantly improved by using indexes.
For example, step S4 can match "Center for Disease Prevention and Control of XX Province" in the organization data to be matched with "XX Province Center for Disease Control and Prevention" in the target organization table.
To sum up, compared with the prior art, the multi-method fused organization matching method of the present application has the following advantages:
1) the organization-name data preprocessing and entity labeling processes are semi-automated, greatly reducing the labor cost of preprocessing and entity labeling; and the custom preprocessing configuration file supplied as input allows the method to adapt to the processing requirements of different data sets;
2) the method adapts well to non-standardized target organization tables and organization data to be matched; through data cleaning, organization-name standardization, and weight configuration, both sides share the same data format specification, which improves the model labeling effect and the matching accuracy;
3) an Elasticsearch-based fuzzy search module is constructed to fuzzy-match organization texts that cannot be matched directly, and the final matching score is obtained in combination with the configured weights;
4) both the preprocessed text and the entity labeling result are used as inputs to the matching stage, avoiding problems introduced by machine learning model labeling and achieving higher matching accuracy;
5) the matching of the organization data to be matched with the target organization table is realized, and the method can be applied to problems such as organization entity alignment and disambiguation of scholars with the same name.
Fig. 7 is a block diagram of a multi-method fused organization matching apparatus according to an embodiment of the present invention. As shown, the apparatus 700 includes:
the preprocessing module 701 is used for constructing a data preprocessing configuration file, and cleaning and standardizing mechanism data to be matched and a target mechanism table;
a processing module 702, configured to perform entity tagging on the preprocessed mechanism data to be matched by using a machine learning model, and extract a mechanism entity and a region entity from an entity tagging result by combining a custom rule; configuring weights for each mechanism data in the target mechanism table; and directly matching or fuzzy matching the preprocessed mechanism data to be matched with the target mechanism table based on the entity labeling result and the weight of the target mechanism table to obtain a matching result.
It should be noted that, because the contents of information interaction, execution process, and the like between the modules/units of the apparatus are based on the same concept as the method embodiment described in the present application, the technical effect brought by the contents is the same as the method embodiment of the present application, and specific contents may refer to the description in the foregoing method embodiment of the present application, and are not described herein again.
It should be further noted that the division of the above apparatus into modules is only a logical division; in an actual implementation, the modules may be wholly or partially integrated into one physical entity or physically separated. These units may all be implemented as software invoked by a processing element, or entirely in hardware, or some modules as software invoked by a processing element and others in hardware. For example, the processing module 702 may be a separately arranged processing element, or may be integrated into a chip of the apparatus, or may be stored in the memory of the apparatus in the form of program code that a processing element of the apparatus calls to execute the functions of the processing module 702. The other modules are implemented similarly. In addition, all or some of these modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each module above, may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a System-on-a-Chip (SoC).
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown, the computer device 800 includes: a memory 801, and a processor 802; the memory 801 is used for storing computer instructions; the processor 802 executes computer instructions to implement the method described in fig. 1.
In some embodiments, the number of the memories 801 in the computer device 800 may be one or more, the number of the processors 802 may be one or more, and fig. 8 is taken as an example.
In an embodiment of the present application, the processor 802 in the computer device 800 loads one or more instructions corresponding to the processes of the application program into the memory 801 according to the steps described in fig. 1, and the processor 802 executes the application program stored in the memory 801, thereby implementing the method described in fig. 1.
The Memory 801 may include a Random Access Memory (RAM), or may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The memory 801 stores an operating system and operating instructions, executable modules or data structures, or a subset thereof, or an expanded set thereof, wherein the operating instructions may include various operating instructions for implementing various operations. The operating system may include various system programs for implementing various basic services and for handling hardware-based tasks.
The processor 802 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In some specific applications, the various components of the computer device 800 are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. But for clarity of illustration the various buses have been referred to as a bus system in figure 8.
In an embodiment of the present application, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the method described in fig. 1.
The present application may be embodied as a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium bearing computer-readable program instructions for causing a processor to implement various aspects of the present application.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable programs described herein may be downloaded from a computer-readable storage medium to a variety of computing/processing devices, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present application may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, integrated circuit configuration data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++, and procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), can be personalized by utilizing state information of the computer-readable program instructions, and can execute those instructions to implement aspects of the present application.
In summary, the multi-method integrated mechanism matching method, device, equipment and storage medium provided by the application clean and standardize the mechanism data to be matched and the target mechanism table by constructing a data preprocessing configuration file; entity labeling is carried out on the preprocessed mechanism data to be matched by utilizing a machine learning model, and a mechanism entity and a region entity are extracted from an entity labeling result by combining a user-defined rule; configuring weights for each mechanism data in the target mechanism table; and directly matching or fuzzy matching the preprocessed mechanism data to be matched with the target mechanism table based on the entity labeling result and the weight of the target mechanism table to obtain a matching result.
The application effectively overcomes various defects in the prior art and has high industrial utilization value.
The above embodiments are merely illustrative of the principles and utilities of the present application and are not intended to limit the invention. Any person skilled in the art may modify or change the above-described embodiments without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas disclosed in the present application be covered by the claims of the present application.

Claims (10)

1. A multi-method fused mechanism matching method, the method comprising:
constructing a data preprocessing configuration file, and cleaning and standardizing mechanism data to be matched and a target mechanism table;
entity labeling is carried out on the preprocessed mechanism data to be matched by utilizing a machine learning model, and a mechanism entity and a region entity are extracted from an entity labeling result by combining a user-defined rule;
configuring weights for each mechanism data in the target mechanism table;
and directly matching or fuzzy matching the preprocessed mechanism data to be matched with the target mechanism table based on the entity labeling result and the weight of the target mechanism table to obtain a matching result.
2. The method of claim 1, wherein the constructing a data preprocessing profile, and the cleaning and standardizing the mechanism data to be matched and the target mechanism table comprises:
constructing a matching mode and a processing rule of invalid characters, special characters, irrelevant information and symbol specifications to be used as a data preprocessing configuration file for input;
cleaning and standardizing the mechanism data to be matched and the target mechanism table by utilizing a plurality of tools and combining regular expressions based on the data preprocessing configuration file;
and storing the preprocessed mechanism data to be matched and the target mechanism table into a MongoDB database for entity marking and mechanism matching.
3. The method according to claim 1 or 2, wherein the cleaning and standardizing comprises: any one or more of unresolved HTML content conversion, unification of symbols, cleansing of invalid characters located in the middle, removal of interfering characters at the beginning and end, cleansing of extraneous information, unification of spelling format into title specification, compaction of multiple spaces into one, specification of control space format, unification of full-angle characters into half-angle, and standardization of organization name for organization.
4. The method of claim 1, wherein the entity labeling is performed on the preprocessed mechanism data to be matched by using a machine learning model, and the extracting of the mechanism entity and the region entity from the entity labeling result by combining with the custom rule comprises:
segmenting a plurality of mechanism names in preprocessed mechanism data to be matched;
utilizing a machine learning model to label the entity of each organization name, and storing the labeling results corresponding to all the organization entities into a MongoDB database;
and processing the labeling result of the machine learning model according to a custom rule so as to add the unidentified organization name to the organization entity and store the organization name in the MongoDB database.
5. The method of claim 4, further comprising:
extracting region entities through a machine learning model;
expanding a regional entity containing an organization name into the organization entity;
and respectively storing the optimized organization entity and the optimized region entity into a MongoDB database.
6. The method according to claim 1, wherein the directly matching the preprocessed mechanism data to be matched with the target mechanism table comprises:
constructing a dictionary data type by the target mechanism table according to the weight;
directly matching the preprocessed mechanism data to be matched and the extracted mechanism entities using the dictionary get method, in descending order of the configured weights;
and adding corresponding organization identifications for the successfully matched organization entities.
7. The method of claim 1, wherein fuzzy matching the preprocessed mechanism data to be matched with the target mechanism table comprises:
importing the target mechanism table and the configured weights into Elasticsearch for fuzzy matching;
using Elasticsearch in fuzzy matching to apply, in sequence, a fused multi-step, multi-source mechanism-name similarity calculation to the preprocessed mechanism data to be matched;
obtaining the final matching degree score between the organization names by combining the user-defined weight;
and sequentially adopting the preprocessed mechanism data to be matched, the entity marking result and the partitioned result of the preprocessed mechanism data to be matched to perform fuzzy search on the target mechanism table, and selecting the mechanism name corresponding to the matching degree score meeting the corresponding threshold value as a final matching result.
8. A multi-method fused mechanism matching device, the device comprising:
the preprocessing module is used for constructing a data preprocessing configuration file, and cleaning and standardizing the mechanism data to be matched and the target mechanism table;
the processing module is used for carrying out entity labeling on the preprocessed mechanism data to be matched by utilizing the machine learning model and extracting mechanism entities and region entities from the entity labeling result by combining with a custom rule; configuring weights for each mechanism data in the target mechanism table; and directly matching or fuzzy matching the preprocessed mechanism data to be matched with the target mechanism table based on the entity labeling result and the weight of the target mechanism table to obtain a matching result.
9. A computer device, the device comprising: a memory, and a processor; the memory is to store computer instructions; the processor executes computer instructions to implement the method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon computer instructions which, when executed, perform the method of any one of claims 1 to 7.
CN202111192516.7A 2021-10-13 2021-10-13 Multi-method fused mechanism matching method, device, equipment and storage medium Pending CN113868312A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111192516.7A CN113868312A (en) 2021-10-13 2021-10-13 Multi-method fused mechanism matching method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111192516.7A CN113868312A (en) 2021-10-13 2021-10-13 Multi-method fused mechanism matching method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113868312A true CN113868312A (en) 2021-12-31

Family

ID=78999111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111192516.7A Pending CN113868312A (en) 2021-10-13 2021-10-13 Multi-method fused mechanism matching method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113868312A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117520611A (en) * 2024-01-05 2024-02-06 梅州客商银行股份有限公司 Customer name matching method for banking system comprising full angle and half angle


Similar Documents

Publication Publication Date Title
US20230142217A1 (en) Model Training Method, Electronic Device, And Storage Medium
US8630989B2 (en) Systems and methods for information extraction using contextual pattern discovery
CN107766483A (en) The interactive answering method and system of a kind of knowledge based collection of illustrative plates
CN110457676B (en) Evaluation information extraction method and device, storage medium and computer equipment
CN109947921B (en) Intelligent question-answering system based on natural language processing
CN111598702A (en) Knowledge graph-based method for searching investment risk semantics
CN105608232B (en) A kind of bug knowledge modeling method based on graphic data base
CN108228701A (en) A kind of system for realizing Chinese near-nature forest language inquiry interface
CN110555205B (en) Negative semantic recognition method and device, electronic equipment and storage medium
CN111177532A (en) Vertical search method, device, computer system and readable storage medium
CN115858758A (en) Intelligent customer service knowledge graph system with multiple unstructured data identification
CN112699645B (en) Corpus labeling method, apparatus and device
CN111143571B (en) Entity labeling model training method, entity labeling method and device
CN102214189A (en) Data mining-based word usage knowledge acquisition system and method
CN113157860A (en) Electric power equipment maintenance knowledge graph construction method based on small-scale data
Tapsai Information processing and retrieval from CSV file by natural language
CN116303537A (en) Data query method and device, electronic equipment and storage medium
Kalo et al. Knowlybert-hybrid query answering over language models and knowledge graphs
CN113836316B (en) Processing method, training method, device, equipment and medium for ternary group data
CN113868312A (en) Multi-method fused mechanism matching method, device, equipment and storage medium
CN109446277A (en) Relational data intelligent search method and system based on Chinese natural language
EP3432161A1 (en) Information processing system and information processing method
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
CN109062913B (en) Internationalization resource intelligent acquisition method and storage medium
CN110717025A (en) Question answering method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination