CN114625875B

CN114625875B - Pattern matching method, device, storage medium and equipment for multiple data source information

Info

Publication number: CN114625875B
Application number: CN202210233064.0A
Authority: CN
Inventors: 徐啸
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2022-03-09
Filing date: 2022-03-09
Publication date: 2024-03-29
Anticipated expiration: 2042-03-09
Also published as: CN114625875A

Abstract

The invention relates to the technical field of digital healthcare, and provides a pattern matching method, a device, a storage medium and computer equipment for multi-data source information. The method comprises the following steps: acquiring a first data set and a second data set to be matched, wherein the first data set comprises a plurality of sets of first user attribute information, and the second data set comprises a plurality of sets of second user attribute information; converting the plurality of sets of first user attribute information into a plurality of first feature vectors, and converting the plurality of sets of second user attribute information into a plurality of second feature vectors; calculating the feature similarity between each first feature vector and each second feature vector, and constructing a feature similarity matrix; and determining, in the feature similarity matrix, the bipartite graph matching result with the highest sum of total similarity, and obtaining the pattern matching results between the plurality of sets of first user attribute information and the plurality of sets of second user attribute information according to the bipartite graph matching result. The method can effectively simplify the pattern matching process and improve the accuracy of pattern matching.

Description

Pattern matching method, device, storage medium and equipment for multiple data source information

Technical Field

The present invention relates to the field of digital medical technology, and in particular, to a method, an apparatus, a storage medium, and a computer device for pattern matching of multiple data source information.

Background

With the development of medical big data and the growth of public health awareness, the volume of patient diagnosis, treatment and physical examination data is increasing rapidly. Since the same patient may be diagnosed and treated in multiple hospitals, obtaining all of the patient's information and integrating it into a complete portrait requires accurately fusing information from different data sources and in different formats. For example, the same attribute may be stored under different names in different hospitals; if it cannot be accurately identified which columns actually represent the same attribute, the fusion of patient information may be seriously affected, and misaligned data may even be produced.

Therefore, designing an accurate automatic pattern matching method for different information sources has considerable practical value. Existing methods generally perform pattern matching on information from different data sources iteratively: they repeatedly find a matching sub-attribute set and an entity set whose similarity is greater than a given threshold, and alternate between the two tasks so that each promotes the other. However, this approach has high time complexity, and the threshold is generally difficult to set, so an accurate and reliable threshold cannot be given. As a result, existing pattern matching methods have low matching efficiency and accuracy.

Disclosure of Invention

In view of this, the present application provides a pattern matching method, device, storage medium and computer device for multiple data source information, and mainly aims to solve the technical problems of low efficiency and accuracy of pattern matching in the prior art.

According to a first aspect of the present invention, there is provided a pattern matching method of multiple data source information, the method comprising:

acquiring a first data set and a second data set to be matched, wherein the first data set comprises a plurality of groups of first user attribute information acquired from a first data source, and the second data set comprises a plurality of groups of second user attribute information acquired from a second data source;

converting the plurality of sets of first user attribute information into a plurality of first feature vectors, and converting the plurality of sets of second user attribute information into a plurality of second feature vectors;

calculating the feature similarity between each first feature vector and each second feature vector respectively, and constructing a feature similarity matrix according to the feature similarity between each first feature vector and each second feature vector;

and determining, in the feature similarity matrix, the bipartite graph matching result with the highest sum of total similarity, and obtaining the pattern matching results between the plurality of sets of first user attribute information and the plurality of sets of second user attribute information according to the bipartite graph matching result.

According to a second aspect of the present invention, there is provided a pattern matching apparatus for multiple data source information, the apparatus comprising:

the data acquisition module is used for acquiring a first data set and a second data set to be matched, wherein the first data set comprises a plurality of groups of first user attribute information, and the second data set comprises a plurality of groups of second user attribute information;

the data conversion module is used for converting a plurality of groups of first user attribute information into a plurality of first feature vectors and converting a plurality of groups of second user attribute information into a plurality of second feature vectors;

the data processing module is used for respectively calculating the feature similarity between each first feature vector and each second feature vector and constructing a feature similarity matrix according to the feature similarity between each first feature vector and each second feature vector;

and the result output module is used for determining, in the feature similarity matrix, the bipartite graph matching result with the highest sum of total similarity, and obtaining the pattern matching result between the plurality of sets of first user attribute information and the plurality of sets of second user attribute information according to the bipartite graph matching result.

According to a third aspect of the present invention, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the pattern matching method of multiple data source information described above.

According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the pattern matching method of multiple data sources information described above when executing the program.

The invention provides a pattern matching method, a device, a storage medium and computer equipment for multi-data source information, which are characterized in that firstly, two data sets to be matched are obtained, then, multiple sets of user attribute information in the two data sets are respectively converted into multiple feature vectors, further, feature similarity between each feature vector is calculated, a feature similarity matrix is constructed, finally, a bipartite graph matching result with the highest sum of total similarity is determined in the feature similarity matrix, and the bipartite graph matching result is converted into a pattern matching result between the two sets of user attribute information. According to the method, the user attribute information of the two data sets is converted into the feature similarity matrix between the two sets of feature vectors, and the matching result between the two sets of user attribute information is obtained according to the feature similarity matrix, so that a lengthy iteration process can be eliminated, the continuous switching and iteration process between entity matching and attribute matching is avoided, and the efficiency of pattern matching is effectively improved. In addition, the method can obtain a globally optimal pattern matching result by taking the bipartite graph matching result with the highest sum of the total similarity in the characteristic similarity matrix as the pattern matching result, so that the pattern matching relationship of different data source information has the globally maximum similarity, and the accuracy of pattern matching is effectively improved.

The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the present application are more readily apparent, specific embodiments of the present application are set forth below.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:

Fig. 1 shows a schematic flow chart of a pattern matching method of multiple data source information according to an embodiment of the present invention;

Fig. 2 shows a schematic structural diagram of a pattern matching device for multiple data source information according to an embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.

In one embodiment, as shown in fig. 1, a pattern matching method of multiple data source information is provided. The method is described as applied to a computer device such as a server, where the server may be an independent server, or may be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms. The method comprises the following steps:

101. a first data set and a second data set to be matched are obtained, wherein the first data set comprises a plurality of groups of first user attribute information acquired from a first data source, and the second data set comprises a plurality of groups of second user attribute information acquired from a second data source.

Here, a data source refers to the origin of data, and multi-data source information refers to information from a plurality of different data sources. The information of each data source forms a data set, which is used for storing data. User attribute information is a specific item of information in the data set and comprises one or more attribute values. In this embodiment, the first data set and the second data set may specifically be patient information tables of two different medical institutions (such as hospitals or physical examination centers).

Specifically, when pattern matching is required for information from two data sources, a first data set P and a second data set Q to be matched may be acquired. In this embodiment, the first data set P includes a plurality of pieces of entity information (which may be understood as information on a plurality of patients) from the first data source. As shown in Table 1 below, the first data set P is illustrated with three pieces of entity information (the information of Zhang San, Li Si and Wang Wu respectively). This entity information forms four sets of first user attribute information in the first data set P, namely the attribute values corresponding to gender, identity card, native place and blood type, where the attribute values of each entity are shown in Table 1. Similarly, the second data set Q includes a plurality of pieces of entity information from the second data source. As shown in Table 2 below, the second data set Q is illustrated with three pieces of entity information (the information of Zhang San, Han Meimei and Li Lei respectively). This entity information forms four sets of second user attribute information in the second data set Q, namely the attribute values corresponding to certificate number, household location, blood type and gender, where the attribute values of each entity are shown in Table 2. The first data set P and the second data set Q to be matched thus each include four sets of user attribute information.

Name	Gender	Identity card	Native place	Blood type
Zhang San	Male	211202***	A province X City	A
Li Si	Male	211221***	A province X City	O
Wang Wu	Female	211202***	A province Y City	AB

TABLE 1

TABLE 2

102. The plurality of sets of first user attribute information are converted into a plurality of first feature vectors, and the plurality of sets of second user attribute information are converted into a plurality of second feature vectors.

The first feature vector refers to an integral feature expression corresponding to a group of first user attribute information (including attribute values of a plurality of entities), and the second feature vector refers to an integral feature expression corresponding to a group of second user attribute information (including attribute values of a plurality of entities). In this embodiment, for each set of first user attribute information in the first data set P and each set of second user attribute information in the second data set Q, one set of user attribute information is converted into a feature vector, i.e., each set of first user attribute information is expressed by one feature vector, and each set of second user attribute information is also expressed by one feature vector.

Specifically, the computer device may convert the multiple sets of first user attribute information in the first data set P into multiple first feature vectors, and the multiple sets of second user attribute information in the second data set Q into multiple second feature vectors, by means of feature extraction and feature fusion. In this embodiment, taking the data of step 101 as an example, the first data set P to be matched may be represented as $P = \{p_i \mid i = 1, 2, 3\}$, where the three data instances are $p_1$ = {Zhang San, male, 211202***, A province X city, A}; $p_2$ = {Li Si, male, 211221***, A province X city, O}; $p_3$ = {Wang Wu, female, 211202***, A province Y city, AB}. The attribute names of the four sets of user attribute information in the first data set P may be expressed as $R = \{r_j \mid j = 1, 2, 3, 4\}$, where $r_1$ = gender, $r_2$ = identity card, $r_3$ = native place and $r_4$ = blood type. One set of first user attribute information is {male, female}, and its corresponding feature vector is the feature vector corresponding to all the information in that set; the other sets of first user attribute information are represented in the same way and are not listed here. Similarly, the second data set Q may be represented as $Q = \{q_i \mid i = 1, 2, 3\}$, where the three data instances are $q_1$ = {Zhang San, 211202***, A province X city, type A, male}; $q_2$ = {Han Meimei, 211202***, A province Z city, type AB, female}; $q_3$ = {Li Lei, 211221***, A province Y city, type B, male}. The attribute names of the four sets of user attribute information in the second data set Q may be expressed as $S = \{s_j \mid j = 1, 2, 3, 4\}$, where $s_1$ = certificate number, $s_2$ = household location, $s_3$ = blood type and $s_4$ = gender. One set of second user attribute information is {type A, type AB, type B}, and its corresponding feature vector is the feature vector corresponding to all the information in that set; the other sets of second user attribute information are represented in the same way and are not listed here.

103. The feature similarity between each first feature vector and each second feature vector is calculated, and a feature similarity matrix is constructed according to the feature similarity between each first feature vector and each second feature vector.

The feature similarity refers to the similarity degree between feature vectors of two groups of data, and the feature similarity matrix refers to a matrix established by taking the feature similarity as an element.

In this embodiment, the similarity between each multidimensional feature vector of the first data set P and each multidimensional feature vector of the second data set Q may be calculated by an existing similarity calculation method, and an m×m similarity matrix may be constructed from the calculated similarities, where m is the number of sets of user attribute information in each data set (m = 4 in the example of step 101).

104. The bipartite graph matching result with the highest sum of total similarity is determined in the feature similarity matrix, and the pattern matching results between the plurality of sets of first user attribute information and the plurality of sets of second user attribute information are obtained according to the bipartite graph matching result.

Bipartite graph matching is defined as follows: given a bipartite graph G and a subgraph M of G, if no two edges in the edge set {E} of M are incident to the same vertex, then M is called a matching.
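The matching condition above can be checked mechanically. The following is a minimal sketch, assuming edges are given as (left vertex, right vertex) pairs; the function name `is_matching` and the vertex labels are illustrative and do not appear in the original text.

```python
from typing import Iterable, Tuple

Edge = Tuple[str, str]  # (vertex on the left side, vertex on the right side)


def is_matching(edges: Iterable[Edge]) -> bool:
    """Return True if no two edges share a left vertex or a right vertex."""
    seen_left, seen_right = set(), set()
    for left, right in edges:
        if left in seen_left or right in seen_right:
            return False  # two edges are incident to the same vertex
        seen_left.add(left)
        seen_right.add(right)
    return True


# Illustrative edge sets over attribute names r1..r4 and s1..s4.
valid = [("r1", "s4"), ("r2", "s1"), ("r3", "s2"), ("r4", "s3")]
invalid = [("r1", "s4"), ("r1", "s1")]  # r1 is used twice
```

Under this sketch, `valid` satisfies the definition, while `invalid` does not because both of its edges are attached to `r1`.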

In this embodiment, after an m×m similarity matrix is obtained through similarity calculation, a maximum bipartite graph matching M can be determined on the similarity matrix by a bipartite graph matching algorithm, such that the sum of the similarities of the matched pairs is the highest. After the maximum bipartite graph matching M is obtained, it can be translated into the corresponding matching relationship between user attribute information, that is, the pattern matching result between the first user attribute information in the first data set P and the second user attribute information in the second data set Q. Each matched pair $(r_i, s_j) \in M$ indicates that attribute $r_i$ and attribute $s_j$ represent the same attribute. In this embodiment, the pattern matching result between the plurality of sets of first user attribute information and the plurality of sets of second user attribute information may specifically be a correspondence between the attributes of the patient information tables of two different medical institutions.
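To make the objective of step 104 concrete, the sketch below enumerates every one-to-one assignment of a small similarity matrix and keeps the one with the highest total. This exhaustive search is not the algorithm of the embodiment; it only illustrates what "highest sum of total similarity" means and is practical only for small m. The similarity values and the function name `best_matching_bruteforce` are assumptions made for illustration.

```python
from itertools import permutations
from typing import List, Tuple


def best_matching_bruteforce(sim: List[List[float]]) -> Tuple[List[Tuple[int, int]], float]:
    """Try every one-to-one assignment of rows to columns and keep the best total."""
    m = len(sim)
    best_pairs: List[Tuple[int, int]] = []
    best_total = float("-inf")
    for perm in permutations(range(m)):
        total = sum(sim[i][perm[i]] for i in range(m))
        if total > best_total:
            best_total = total
            best_pairs = [(i, perm[i]) for i in range(m)]
    return best_pairs, best_total


# Hypothetical similarity values (not from the source).
# Rows:    r1 gender, r2 identity card, r3 native place, r4 blood type
# Columns: s1 certificate number, s2 household location, s3 blood type, s4 gender
SIM = [
    [0.10, 0.05, 0.20, 0.95],
    [0.90, 0.15, 0.05, 0.10],
    [0.10, 0.85, 0.10, 0.05],
    [0.05, 0.10, 0.80, 0.20],
]

pairs, total = best_matching_bruteforce(SIM)
```

With these assumed values, each resulting pair (i, j) reads as "attribute r_(i+1) and attribute s_(j+1) represent the same attribute", for example gender in P corresponding to gender in Q.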

According to the pattern matching method for the multi-data source information, firstly, two data sets to be matched are obtained, then, multiple groups of user attribute information in the two data sets are respectively converted into multiple feature vectors, further, feature similarity between each feature vector is calculated, a feature similarity matrix is constructed, finally, a bipartite graph matching result with the highest sum of total similarity is determined in the feature similarity matrix, and the bipartite graph matching result is converted into a pattern matching result between the two groups of user attribute information. According to the method, the user attribute information of the two data sets is converted into the feature similarity matrix between the two sets of feature vectors, and the matching result between the two sets of user attribute information is obtained according to the feature similarity matrix, so that a lengthy iteration process can be eliminated, the continuous switching and iteration process between entity matching and attribute matching is avoided, and the efficiency of pattern matching is effectively improved. In addition, the method can obtain a globally optimal pattern matching result by taking the bipartite graph matching result with the highest sum of the total similarity in the characteristic similarity matrix as the pattern matching result, so that the pattern matching relationship of different data source information has the globally maximum similarity, and the accuracy of pattern matching is effectively improved.

In one embodiment, each set of first user attribute information includes a plurality of first attribute values, e.g., one set of first user attribute information is {male, female}, and each set of second user attribute information includes a plurality of second attribute values, e.g., one set of second user attribute information is {type A, type AB, type B}. On this basis, step 102 may be implemented as follows: for each set of first user attribute information, each first attribute value is converted into a corresponding feature vector, and the feature vectors corresponding to the first attribute values are summed to obtain the first feature vector corresponding to that set; likewise, for each set of second user attribute information, each second attribute value is converted into a corresponding feature vector, and the feature vectors corresponding to the second attribute values are summed to obtain the second feature vector corresponding to that set. In this embodiment, when the user attribute information in a data set is arranged by column, each set of user attribute information may consist of the attribute values located in the same column of the data set. Each attribute value may then be converted into a feature vector of a predetermined length by an existing feature vector mapping method, and the feature vectors of the same column are summed to obtain a multidimensional feature vector. Each column of user attribute information in the first data set P and the second data set Q can thus be represented by a corresponding multidimensional feature vector.
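The column-wise summation described above can be sketched as follows. The source does not specify which feature vector mapping method is used, so this example substitutes a simple character-hashing embedding as a stand-in; `DIM`, `value_to_vector` and `column_to_vector` are hypothetical names, and a real system would more likely use a learned text embedding.

```python
import hashlib
from typing import List

DIM = 16  # assumed predetermined vector length


def value_to_vector(value: str, dim: int = DIM) -> List[float]:
    """Map one attribute value to a fixed-length vector by hashing its characters.

    Stand-in for the unspecified "existing feature vector mapping method".
    """
    vec = [0.0] * dim
    for ch in value.lower():
        digest = hashlib.md5(ch.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0  # count characters per hash bucket
    return vec


def column_to_vector(values: List[str], dim: int = DIM) -> List[float]:
    """Sum the vectors of all attribute values in one column (one set of attribute information)."""
    total = [0.0] * dim
    for value in values:
        for k, x in enumerate(value_to_vector(value, dim)):
            total[k] += x
    return total


# Gender columns of the two example data sets, in row order.
gender_p = column_to_vector(["male", "male", "female"])
gender_q = column_to_vector(["male", "female", "male"])
```

Because the per-value vectors are summed, the column vector does not depend on row order, which is why the two gender columns above map to the same vector even though the entities are listed differently.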

In one embodiment, the feature similarity matrix in step 103 may be constructed as follows: the feature similarity between each first feature vector and each second feature vector is calculated by a feature similarity algorithm, where the feature similarity may be any one of cosine similarity, the Pearson correlation coefficient and the Jaccard similarity coefficient, and the feature similarity matrix is then constructed by taking the feature similarity between each first feature vector and each second feature vector as an element. In this embodiment, similarity calculation compares how similar two feature vectors are, which is generally done by calculating the distance between them: a small distance indicates a large similarity between the two feature vectors, and a large distance indicates a small similarity. A common feature similarity calculation method is cosine similarity, which computes the cosine of the angle between two feature vectors; the closer the cosine value is to 1, the more similar the two feature vectors are. Based on the feature similarity between each first feature vector and each second feature vector, an m×m feature similarity matrix can be accurately constructed. Other algorithms, such as the Pearson correlation coefficient and the Jaccard similarity coefficient, may also be used in this embodiment.
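As an illustration of the cosine-similarity option, the sketch below computes $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$ for every pair of first and second feature vectors and arranges the results as a matrix. The function names and the toy vectors are assumptions; returning 0.0 for an all-zero vector is a convention chosen here, not something stated in the source.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for empty columns
    return dot / (norm_u * norm_v)


def similarity_matrix(first: List[Sequence[float]],
                      second: List[Sequence[float]]) -> List[List[float]]:
    """Element [i][j] is the similarity of first[i] and second[j]."""
    return [[cosine_similarity(u, v) for v in second] for u in first]


# Toy feature vectors (illustrative only).
first_vectors = [[1.0, 0.0], [0.0, 2.0]]
second_vectors = [[0.0, 3.0], [4.0, 0.0]]
matrix = similarity_matrix(first_vectors, second_vectors)
```

In this toy case the first vector of one side points in the same direction as the second vector of the other side, so the large entries sit off the diagonal.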

In one embodiment, the pattern matching result between the plurality of sets of first user attribute information and the plurality of sets of second user attribute information in step 104 may be obtained as follows: the matching with the highest sum of total similarity is found in the feature similarity matrix by the Hungarian algorithm and taken as the bipartite graph matching result, and the bipartite graph matching result is then converted into the pattern matching result between the plurality of sets of first user attribute information and the plurality of sets of second user attribute information. In this embodiment, the Hungarian algorithm is a combinatorial optimization algorithm that solves the task assignment problem in polynomial time. Starting from an existing matching, it repeatedly searches for an augmenting path; each augmenting path yields a new matching with one more edge than the previous one, until the maximum matching is found. Using the Hungarian algorithm, the optimal matching result can be found quickly and then converted into the pattern matching result between the multiple sets of user attribute information in the two data sets, thereby avoiding the continuous switching and iteration between entity matching and attribute matching found in the prior art.
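A minimal sketch of the assignment step follows, using the common potentials-based O(n³) formulation of the Hungarian (Kuhn-Munkres) algorithm in plain Python. It assumes a square similarity matrix and maximizes total similarity by minimizing the negated values; the function name `hungarian_max` is hypothetical, and the source does not state which variant of the algorithm the embodiment uses.

```python
from typing import List, Tuple


def hungarian_max(sim: List[List[float]]) -> List[Tuple[int, int]]:
    """Return (row, column) pairs of a one-to-one assignment with maximum total similarity."""
    n = len(sim)
    # Maximizing similarity is equivalent to minimizing negated similarity.
    cost = [[-x for x in row] for row in sim]
    inf = float("inf")
    u = [0.0] * (n + 1)  # row potentials (1-indexed)
    v = [0.0] * (n + 1)  # column potentials (1-indexed)
    p = [0] * (n + 1)    # p[j] = row currently matched to column j, 0 = none
    way = [0] * (n + 1)  # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:  # reached an unmatched column: augmenting path found
                break
        while j0:  # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, n + 1))


# A case where picking the single largest entry first is not optimal.
tricky = [[0.9, 0.8],
          [0.8, 0.1]]
assignment = hungarian_max(tricky)
```

In the `tricky` matrix, greedily taking 0.9 forces the remaining pair to 0.1 (total 1.0), whereas the assignment found here pairs row 0 with column 1 and row 1 with column 0 (total 1.6), which illustrates the benefit of a globally optimal matching.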

In one embodiment, after calculating the feature similarity between each first feature vector and each second feature vector in step 103, the method may further include the steps of: if the feature similarity between any one of the first feature vectors and each of the second feature vectors is smaller than the first similarity threshold, deleting the first feature vectors, and if the feature similarity between any one of the second feature vectors and each of the first feature vectors is smaller than the first similarity threshold, deleting the second feature vectors. In this embodiment, when the number of user attribute information sets in two data sets is unequal, after calculating the similarity between every two feature vectors in the two data sets, each feature similarity is sequentially determined, and if the similarity between one feature vector in one data set and all feature vectors in the other data set is lower than the first similarity threshold, the feature vector can be deleted to reduce the feature vectors in the data set which do not have a matching relationship, thereby reducing the calculation amount and improving the accuracy of pattern matching.
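The pruning step above might look like the following sketch, which drops every row and every column of the similarity matrix whose entries all fall below the first similarity threshold. The threshold value, the matrix entries and the name `prune_unmatched` are assumptions for illustration; rows and columns are both evaluated against the original matrix.

```python
from typing import List, Tuple


def prune_unmatched(sim: List[List[float]],
                    threshold: float) -> Tuple[List[int], List[int], List[List[float]]]:
    """Remove feature vectors that have no counterpart reaching the threshold.

    Returns the kept row indices, the kept column indices and the reduced matrix.
    """
    n_cols = len(sim[0]) if sim else 0
    keep_rows = [i for i, row in enumerate(sim) if any(x >= threshold for x in row)]
    keep_cols = [j for j in range(n_cols) if any(row[j] >= threshold for row in sim)]
    reduced = [[sim[i][j] for j in keep_cols] for i in keep_rows]
    return keep_rows, keep_cols, reduced


# Hypothetical case: 3 attributes in the first data set, 4 in the second.
unequal = [
    [0.9, 0.1, 0.2, 0.1],
    [0.2, 0.8, 0.1, 0.2],
    [0.1, 0.2, 0.7, 0.1],
]
rows, cols, reduced = prune_unmatched(unequal, threshold=0.3)
```

Here the fourth column has no similarity of at least 0.3 with any row, so it is removed and the remaining matrix becomes square, which suits the subsequent bipartite matching.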

In one embodiment, after determining a bipartite graph matching result with the highest sum of total similarity in step 104, the method may further include the following steps: judging whether the similarity value between the first feature vector and the second feature vector with the matching relationship in the bipartite graph matching result is smaller than a second similarity threshold value, and deleting the matching relationship between the first feature vector and the second feature vector if the similarity value between the first feature vector and the second feature vector with the matching relationship is smaller than the second similarity threshold value. In this embodiment, all the matching relationships in the bipartite graph matching M can be sequentially determined according to the set second similarity threshold, when the similarity between two feature vectors with the matching relationships in the bipartite graph matching M is lower than the second similarity threshold, the matching relationship can be deleted, and only the matching relationship with the user attribute information with high similarity is reserved.
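The post-filtering of matched pairs can be sketched as below, assuming the matching is given as (row, column) index pairs into the similarity matrix. The threshold of 0.5 and the name `filter_matches` are illustrative choices, not values from the source.

```python
from typing import List, Tuple


def filter_matches(pairs: List[Tuple[int, int]],
                   sim: List[List[float]],
                   threshold: float) -> List[Tuple[int, int]]:
    """Keep only the matched pairs whose similarity reaches the second threshold."""
    return [(i, j) for i, j in pairs if sim[i][j] >= threshold]


# Hypothetical matching over a 2×2 similarity matrix.
sim_small = [[0.9, 0.2],
             [0.1, 0.3]]
matched = [(0, 0), (1, 1)]
kept = filter_matches(matched, sim_small, threshold=0.5)
```

The pair (1, 1) is part of the optimal assignment only because every row must be assigned; its similarity of 0.3 is below the threshold, so the matching relationship is discarded.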

In one embodiment, the following step may also be included after step 104: the first data set and the second data set are combined according to the pattern matching result between the plurality of sets of first user attribute information and the plurality of sets of second user attribute information, to obtain a fused data set of the multiple data sources. In this embodiment, once the matching result of the user attribute information in the two data sets is obtained, the data in the two data sets can be fused through the matching relationship between their user attribute information to obtain the fused data set of the two data sources. The complete data of the same data instance across different data sets can then be obtained, so that the overall portrait of the data instance across all data sets is complete. Because the data sets are fused using the pattern matching result obtained by this method, the complete data obtained for each data instance, and hence the overall portrait, is more accurate. For example, by fusing the data in Table 1 with the data in Table 2, the other data of Zhang San can be accurately obtained even though the data sets of Table 1 and Table 2 have different headers.
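One simple way to picture the fusion step is to rename the attributes of the second data set to the matched attribute names of the first data set and then concatenate the rows. The sketch below does only that; it assumes the attribute mapping is already available from step 104, uses hypothetical names (`fuse`, `MAPPING`), and does not attempt entity resolution or value normalization (for example "type B" versus "B"), which the source does not describe.

```python
from typing import Dict, List

Row = Dict[str, str]


def fuse(first: List[Row], second: List[Row], mapping: Dict[str, str]) -> List[Row]:
    """Rename the second data set's attributes via the mapping, then concatenate the rows."""
    renamed = [{mapping.get(name, name): value for name, value in row.items()}
               for row in second]
    return first + renamed


# One row from each example data set.
P = [{"name": "Zhang San", "gender": "male", "identity card": "211202***",
      "native place": "A province X city", "blood type": "A"}]
Q = [{"name": "Li Lei", "certificate number": "211221***",
      "household location": "A province Y city", "blood type": "type B",
      "gender": "male"}]

# Assumed pattern matching result: attribute of Q -> attribute of P.
MAPPING = {"certificate number": "identity card",
           "household location": "native place",
           "blood type": "blood type",
           "gender": "gender"}

fused = fuse(P, Q, MAPPING)
```

After fusion, every row uses the header names of the first data set, so records from both sources can be queried with the same attribute names.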

Further, as a specific implementation of the method shown in fig. 1, the embodiment provides a pattern matching device for multiple data source information, as shown in fig. 2, where the device includes: a data acquisition module 21, a data conversion module 22, a data processing module 23 and a result output module 24.

A data acquisition module 21, configured to acquire a first data set and a second data set to be matched, where the first data set includes a plurality of sets of first user attribute information acquired from a first data source, and the second data set includes a plurality of sets of second user attribute information acquired from a second data source;

the data conversion module 22 is configured to convert a plurality of sets of first user attribute information into a plurality of first feature vectors, and convert a plurality of sets of second user attribute information into a plurality of second feature vectors;

the data processing module 23 is configured to calculate feature similarities between each first feature vector and each second feature vector, and construct a feature similarity matrix according to the feature similarities between each first feature vector and each second feature vector;

the result output module 24 is configured to determine a bipartite graph matching result with the highest sum of total similarities in the feature similarity matrix, and obtain a pattern matching result between the plurality of sets of first user attribute information and the plurality of sets of second user attribute information according to the bipartite graph matching result.

In a specific application scenario, each set of first user attribute information includes a plurality of first attribute values, each set of second user attribute information includes a plurality of second attribute values, and the data conversion module 22 is specifically configured to convert each first attribute value into a corresponding feature vector for each set of first user attribute information, and sum the feature vectors corresponding to each first attribute value to obtain a first feature vector corresponding to each set of first user attribute information.

In a specific application scenario, the data processing module 23 is specifically configured to calculate, by using a feature similarity algorithm, the feature similarity between each first feature vector and each second feature vector, where the feature similarity is any one of cosine similarity, the Pearson correlation coefficient and the Jaccard similarity coefficient; and to construct a feature similarity matrix by taking the feature similarity between each first feature vector and each second feature vector as an element.

In a specific application scenario, the result output module 24 is specifically configured to find, through the Hungarian algorithm, the matching result with the highest sum of total similarity in the feature similarity matrix as the bipartite graph matching result; and to convert the bipartite graph matching result to obtain the pattern matching results between the multiple sets of first user attribute information and the multiple sets of second user attribute information.

In a specific application scenario, the apparatus further includes a data deleting module 25, where the data deleting module 25 is specifically configured to delete a first feature vector if the feature similarities between that first feature vector and every second feature vector are all smaller than a first similarity threshold; and to delete a second feature vector if the feature similarities between that second feature vector and every first feature vector are all smaller than the first similarity threshold.
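The deletion rule above can be sketched as a filter on the rows and columns of the feature similarity matrix. The function name and the return format (kept row indices, kept column indices, reduced matrix) are assumptions made for illustration; the kept indices allow a later matching result on the reduced matrix to be mapped back to the original attributes.

```python
def prune_unmatchable(sim, threshold):
    """Drop rows/columns whose similarities are all below the threshold."""
    n_cols = len(sim[0]) if sim else 0
    # Keep a row (first feature vector) if at least one entry reaches the threshold
    keep_rows = [i for i, row in enumerate(sim)
                 if any(x >= threshold for x in row)]
    # Keep a column (second feature vector) under the same rule
    keep_cols = [j for j in range(n_cols)
                 if any(row[j] >= threshold for row in sim)]
    reduced = [[sim[i][j] for j in keep_cols] for i in keep_rows]
    return keep_rows, keep_cols, reduced
```

Removing vectors that cannot match anything before the assignment step shrinks the matrix and prevents such vectors from being forced into a low-similarity pairing.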

In a specific application scenario, the device further comprises a data comparison module 26, wherein the data comparison module 26 is specifically configured to determine whether a similarity value between a first feature vector and a second feature vector with a matching relationship in a bipartite graph matching result is smaller than a second similarity threshold; and if the similarity value between the first feature vector and the second feature vector with the matching relationship is smaller than the second similarity threshold value, deleting the matching relationship between the first feature vector and the second feature vector.
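The post-matching check described above can be sketched as a filter over the matched pairs. The function name and the representation of a matching result as a list of `(row, column)` index pairs are assumptions made for this illustration.

```python
def filter_matches(pairs, sim, threshold):
    """Keep only matched pairs whose similarity reaches the second threshold."""
    return [(i, j) for i, j in pairs if sim[i][j] >= threshold]
```

Because an assignment algorithm pairs as many vectors as possible, some pairs in a globally optimal result may still be weak; this step discards them individually without disturbing the remaining matches.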

In a specific application scenario, the apparatus further includes a data integration module 27, where the data integration module 27 is specifically configured to combine the first data set and the second data set according to a pattern matching result between multiple sets of first user attribute information and multiple sets of second user attribute information, so as to obtain a multiple-data-source fusion data set.
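One possible way to carry out the combination is sketched below. The source does not state how the two data sets are merged, so this sketch assumes each data set is a table stored as a dictionary from attribute name to a list of values, that matched attributes are unified under the first data set's attribute name, and that records are appended row-wise with `None` filling attributes a record lacks. The function name and the `_second` suffix used for name collisions are hypothetical.

```python
def merge_tables(first, second, matches):
    """Merge two column-oriented tables using (first_attr, second_attr) matches."""
    n1 = len(next(iter(first.values()), []))   # record count of first table
    n2 = len(next(iter(second.values()), []))  # record count of second table
    second_to_first = {b: a for a, b in matches}
    # Start from the first table, padded for the second table's records
    merged = {name: list(vals) + [None] * n2 for name, vals in first.items()}
    for name, vals in second.items():
        target = second_to_first.get(name)
        if target is not None:
            # Matched attribute: place values under the first table's name
            merged[target][n1:] = list(vals)
        else:
            # Unmatched attribute: add a new column, avoiding name collisions
            key = name if name not in merged else name + "_second"
            merged[key] = [None] * n1 + list(vals)
    return merged
```

With `first = {"name": ["Ann", "Bob"], "sex": ["F", "M"]}`, `second = {"patient_name": ["Cy"], "phone": ["123"]}` and `matches = [("name", "patient_name")]`, the fused table has three records under `name`, with `sex` and `phone` padded where a source lacks the attribute.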

It should be noted that, in the present embodiment, other corresponding descriptions of each functional unit related to the pattern matching device for multiple data source information may refer to corresponding descriptions in fig. 1, which are not described herein again.

Based on the method shown in fig. 1, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, which when executed by a processor, implements the pattern matching method of the multiple data source information shown in fig. 1.

Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and which includes several instructions for causing a computer device (such as a personal computer, a server, or a network device) to execute the pattern matching method for multiple data source information described in each implementation scenario of the present application.

Based on the method shown in fig. 1 and the embodiment of the pattern matching device for multiple data source information shown in fig. 2, and in order to achieve the above objective, this embodiment further provides a physical device for pattern matching of multiple data source information, which may specifically be a personal computer, a server, a smartphone, a tablet computer, a smart watch, or another network device. The physical device includes a storage medium and a processor: the storage medium stores a computer program, and the processor executes the computer program to implement the method described above and shown in fig. 1.

Optionally, the physical device may further include a user interface, a network interface, a camera, radio frequency (RF) circuitry, sensors, audio circuitry, a Wi-Fi module, and the like. The user interface may include a display screen and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Wi-Fi interface), and the like.

It will be appreciated by those skilled in the art that the structure of the physical device for pattern matching of multiple data source information provided in this embodiment does not constitute a limitation on the physical device, which may include more or fewer components, combine certain components, or adopt a different arrangement of components.

The storage medium may also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the physical device and supports the operation of the information processing program and other software and/or programs. The network communication module is used for realizing communication among the components within the storage medium, as well as communication with other hardware and software in the physical device.

From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus necessary general hardware platforms, or may be implemented by hardware. By applying the technical scheme, two data sets to be matched are firstly obtained, then multiple sets of user attribute information in the two data sets are respectively converted into multiple feature vectors, further feature similarity between each feature vector is calculated, a feature similarity matrix is constructed, finally a bipartite graph matching result with the highest sum of total similarity is determined in the feature similarity matrix, and the bipartite graph matching result is converted into a pattern matching result between the two sets of user attribute information. According to the method, the user attribute information of the two data sets is converted into the feature similarity matrix between the two sets of feature vectors, and the matching result between the two sets of user attribute information is obtained according to the feature similarity matrix, so that a lengthy iteration process can be eliminated, the continuous switching and iteration process between entity matching and attribute matching is avoided, and the efficiency of pattern matching is effectively improved. In addition, the method can obtain a globally optimal pattern matching result by taking the bipartite graph matching result with the highest sum of the total similarity in the characteristic similarity matrix as the pattern matching result, so that the pattern matching relationship of different data source information has the globally maximum similarity, and the accuracy of pattern matching is effectively improved.

Those skilled in the art will appreciate that the drawings are merely schematic illustrations of one preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to practice the present application. Those skilled in the art will also appreciate that the modules of an apparatus in an implementation scenario may be distributed in the apparatus as described in that implementation scenario, or may be changed accordingly and located in one or more apparatuses different from that of the implementation scenario. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.

The foregoing serial numbers of the present application are merely for description and do not represent the advantages or disadvantages of the implementation scenarios. The foregoing disclosure describes merely a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variation that a person skilled in the art can conceive of shall fall within the protection scope of the present application.

Claims

1. A pattern matching method for multiple data source information, the method comprising:

acquiring a first data set and a second data set to be matched, wherein the first data set comprises a plurality of groups of first user attribute information acquired from a first data source, the second data set comprises a plurality of groups of second user attribute information acquired from a second data source, the first data set and the second data set are patient information tables of two different diagnosis and treatment institutions, each group of first user attribute information comprises a plurality of first attribute values, and each group of second user attribute information comprises a plurality of second attribute values;

converting the multiple groups of first user attribute information into multiple first feature vectors, and converting the multiple groups of second user attribute information into multiple second feature vectors, wherein each first attribute value is converted into a corresponding feature vector for each group of first user attribute information, and the feature vectors corresponding to each first attribute value are summed to obtain a first feature vector corresponding to each group of first user attribute information; converting each second attribute value into a corresponding feature vector aiming at each group of second user attribute information, and summing the feature vectors corresponding to each second attribute value to obtain a second feature vector corresponding to each group of second user attribute information;

calculating the feature similarity between each first feature vector and each second feature vector, and constructing a feature similarity matrix according to the feature similarity between each first feature vector and each second feature vector; if the feature similarity between any one of the first feature vectors and each of the second feature vectors is smaller than a first similarity threshold, deleting the first feature vectors; if the feature similarity between any one of the second feature vectors and each of the first feature vectors is smaller than the first similarity threshold, deleting the second feature vectors;

determining a bipartite graph matching result with the highest sum of total similarity in the feature similarity matrix, and obtaining a pattern matching result between the multiple groups of first user attribute information and the multiple groups of second user attribute information according to the bipartite graph matching result; judging whether the similarity value between the first feature vector and the second feature vector with the matching relationship in the bipartite graph matching result is smaller than a second similarity threshold; if the similarity value between the first feature vector and the second feature vector with the matching relationship is smaller than the second similarity threshold, deleting the matching relationship between the first feature vector and the second feature vector; wherein each matched pair in the pattern matching result indicates that the matched first attribute and second attribute represent the same attribute.

2. The method of claim 1, wherein the calculating feature similarities between each of the first feature vectors and each of the second feature vectors, and constructing a feature similarity matrix based on feature similarities between each of the first feature vectors and each of the second feature vectors, comprises:

calculating feature similarity between each first feature vector and each second feature vector through a feature similarity algorithm, wherein the feature similarity is any one of cosine similarity, Pearson correlation coefficient and Jaccard similarity coefficient;

and constructing the feature similarity matrix by taking the feature similarity between each first feature vector and each second feature vector as an element.

3. The method of claim 1, wherein determining a bipartite graph matching result with highest sum of total similarity in the feature similarity matrix, and obtaining a pattern matching result between the plurality of sets of first user attribute information and the plurality of sets of second user attribute information according to the bipartite graph matching result, comprises:

searching, through the Hungarian algorithm, for a matching result with the highest sum of total similarity in the feature similarity matrix as the bipartite graph matching result;

and converting the bipartite graph matching result to obtain a pattern matching result between the plurality of groups of first user attribute information and the plurality of groups of second user attribute information.

4. The method according to claim 1, wherein the method further comprises:

and combining the first data set and the second data set according to the pattern matching result between the multiple groups of first user attribute information and the multiple groups of second user attribute information to obtain a multi-data-source fusion data set.

5. A pattern matching apparatus for multiple data source information, said apparatus comprising:

the data acquisition module is used for acquiring a first data set and a second data set to be matched, wherein the first data set comprises a plurality of groups of first user attribute information acquired from a first data source, the second data set comprises a plurality of groups of second user attribute information acquired from a second data source, the first data set and the second data set are patient information tables of two different diagnosis and treatment institutions, each group of first user attribute information comprises a plurality of first attribute values, and each group of second user attribute information comprises a plurality of second attribute values;

the data conversion module is used for converting the multiple groups of first user attribute information into multiple first feature vectors, converting the multiple groups of second user attribute information into multiple second feature vectors, wherein each first attribute value is converted into a corresponding feature vector for each group of first user attribute information, and summing the feature vectors corresponding to each first attribute value to obtain a first feature vector corresponding to each group of first user attribute information; converting each second attribute value into a corresponding feature vector aiming at each group of second user attribute information, and summing the feature vectors corresponding to each second attribute value to obtain a second feature vector corresponding to each group of second user attribute information;

the data deleting module is used for deleting the first feature vector if the feature similarity between any one of the first feature vector and each of the second feature vectors is smaller than a first similarity threshold; if the feature similarity between any one of the second feature vectors and each of the first feature vectors is smaller than the first similarity threshold, deleting the second feature vectors;

the result output module is used for determining a bipartite graph matching result with the highest sum of total similarity in a feature similarity matrix, and obtaining a pattern matching result between the multiple groups of first user attribute information and the multiple groups of second user attribute information according to the bipartite graph matching result, wherein each matched pair in the pattern matching result indicates that the matched first attribute and second attribute represent the same attribute;

the data comparison module is used for judging whether the similarity value between the first feature vector and the second feature vector with the matching relationship in the bipartite graph matching result is smaller than a second similarity threshold value; and if the similarity value between the first feature vector and the second feature vector with the matching relationship is smaller than a second similarity threshold value, deleting the matching relationship between the first feature vector and the second feature vector.

6. A storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the method of any of claims 1 to 4.

7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program when executed by the processor implements the steps of the method according to any one of claims 1 to 4.