CN114625875A

CN114625875A - Pattern matching method, device, storage medium and equipment for multi-data source information

Info

Publication number: CN114625875A
Application number: CN202210233064.0A
Authority: CN
Inventors: 徐啸
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2022-03-09
Filing date: 2022-03-09
Publication date: 2022-06-14
Anticipated expiration: 2042-03-09
Also published as: CN114625875B

Abstract

The invention relates to the technical field of digital medical treatment, and provides a pattern matching method and device for multi-data-source information, a storage medium and computer equipment. The method comprises the following steps: acquiring a first data set and a second data set to be matched, wherein the first data set comprises multiple groups of first user attribute information, and the second data set comprises multiple groups of second user attribute information; converting the multiple groups of first user attribute information into multiple first feature vectors, and converting the multiple groups of second user attribute information into multiple second feature vectors; calculating the feature similarity between each first feature vector and each second feature vector, and constructing a feature similarity matrix; and determining the bipartite graph matching result with the highest total similarity in the feature similarity matrix, and obtaining a pattern matching result between the multiple groups of first user attribute information and the multiple groups of second user attribute information according to the bipartite graph matching result. The method can effectively simplify the pattern matching process and improve the accuracy of pattern matching.

Description

Pattern matching method, device, storage medium and equipment for multi-data source information

Technical Field

The invention relates to the technical field of digital medical treatment, in particular to a pattern matching method and device for multi-data-source information, a storage medium and computer equipment.

Background

With the development of medical big data and the growth of public health awareness, the diagnosis, treatment and physical examination data of patients are increasing rapidly. Since the same patient may be diagnosed and treated in multiple hospitals, it is important to accurately merge information from different data sources and in different formats in order to obtain all information about the patient and integrate it into a complete picture of the patient. For example, the same attribute may be stored under different names in different hospitals; if it cannot be accurately identified which columns actually represent the same attribute, the fusion of patient information will be seriously affected, and misaligned data may even be generated.

Therefore, a method that automatically and accurately performs pattern matching across different information sources has very important practical application value. Existing methods usually perform pattern matching on information from different data sources iteratively: they repeatedly find a matching sub-attribute set and an entity set whose similarity is greater than a given threshold, and iterate the two tasks so that each promotes the other. However, such methods have high time complexity, and the threshold is usually difficult to set, so an accurate and reliable threshold cannot be given. As a result, the matching efficiency and accuracy of existing pattern matching methods are low.

Disclosure of Invention

In view of this, the present application provides a method, an apparatus, a storage medium, and a computer device for pattern matching of multiple data sources, and mainly aims to solve the technical problem of low efficiency and accuracy of pattern matching in the prior art.

According to a first aspect of the present invention, there is provided a pattern matching method for multiple data sources, the method comprising:

acquiring a first data set and a second data set to be matched, wherein the first data set comprises a plurality of groups of first user attribute information collected from a first data source, and the second data set comprises a plurality of groups of second user attribute information collected from a second data source;

converting the multiple groups of first user attribute information into multiple first feature vectors, and converting the multiple groups of second user attribute information into multiple second feature vectors;

respectively calculating the feature similarity between each first feature vector and each second feature vector, and constructing a feature similarity matrix according to the feature similarity between each first feature vector and each second feature vector;

and determining the bipartite graph matching result with the highest total similarity in the feature similarity matrix, and obtaining a pattern matching result between the multiple groups of first user attribute information and the multiple groups of second user attribute information according to the bipartite graph matching result.

According to a second aspect of the present invention, there is provided an apparatus for pattern matching of multiple data sources, the apparatus comprising:

the data acquisition module is used for acquiring a first data set and a second data set to be matched, wherein the first data set comprises multiple groups of first user attribute information, and the second data set comprises multiple groups of second user attribute information;

the data conversion module is used for converting the multiple groups of first user attribute information into a plurality of first feature vectors and converting the multiple groups of second user attribute information into a plurality of second feature vectors;

the data processing module is used for respectively calculating the feature similarity between each first feature vector and each second feature vector and constructing a feature similarity matrix according to the feature similarity between each first feature vector and each second feature vector;

and the result output module is used for determining the bipartite graph matching result with the highest total similarity in the feature similarity matrix and obtaining a pattern matching result between the multiple groups of first user attribute information and the multiple groups of second user attribute information according to the bipartite graph matching result.

According to a third aspect of the present invention, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described pattern matching method for multiple data source information.

According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-mentioned pattern matching method for multiple data source information when executing the program.

The invention provides a pattern matching method, a device, a storage medium and computer equipment for multi-data source information. According to the method, the user attribute information of the two data sets is converted into two groups of feature vectors, a feature similarity matrix between the two groups of feature vectors is constructed, and the matching result between the two groups of user attribute information is obtained from the feature similarity matrix. This eliminates the lengthy iterative process, avoids repeated switching and iteration between entity matching and attribute matching, and effectively improves the efficiency of pattern matching. In addition, because the bipartite graph matching result with the highest total similarity in the feature similarity matrix is used as the pattern matching result, a globally optimal pattern matching result can be obtained, that is, the pattern matching relationship between the information of the different data sources has the globally maximum similarity, which effectively improves the accuracy of pattern matching.

The above description is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented in accordance with the content of the description, and so that the above and other objects, features, and advantages of the present application are more readily apparent, specific embodiments of the present application are set forth below.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

Fig. 1 is a schematic flowchart illustrating a pattern matching method for multi-data-source information according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram illustrating a pattern matching apparatus for multi-data-source information according to an embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

In an embodiment, as shown in fig. 1, a pattern matching method for multiple data source information is provided, which is described by taking the method applied to computer devices such as a server as an example, where the server may be an independent server, or a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform. The method comprises the following steps:

101. Obtaining a first data set and a second data set to be matched, wherein the first data set comprises multiple groups of first user attribute information collected from a first data source, and the second data set comprises multiple groups of second user attribute information collected from a second data source.

A data source refers to the origin from which data is obtained, and multi-data-source information refers to information from multiple different data sources. The information from each data source forms a data set, which is used for storing data. User attribute information is the specific information held in a data set, and each group of user attribute information comprises one or more attribute values. In this embodiment, the first data set and the second data set may specifically be patient information tables of two different medical institutions (such as a hospital or a physical examination center).

Specifically, when the information of the two data sources needs to be pattern-matched, the first data set P and the second data set Q to be matched may be obtained. In this embodiment, the first data set P contains a plurality of pieces of entity information (which may be understood as information on a plurality of patients) from the first data source. As shown in Table 1 below, the first data set P is described by taking three pieces of entity information as an example (the information of Zhang San, Li Si and Wang Wu, respectively). These pieces of entity information constitute four groups of first user attribute information in the first data set P, namely the attribute values corresponding to gender, identity card, native place and blood type, where the attribute value corresponding to each entity is shown in Table 1. Further, the second data set Q contains a plurality of pieces of entity information from the second data source. As shown in Table 2 below, the second data set Q is likewise described by taking three pieces of entity information as an example (the information of Zhang San, Han Meimei and Li Lei, respectively). These pieces of entity information constitute four groups of second user attribute information in the second data set Q, namely the attribute values corresponding to certificate number, place of household registration, blood type and gender, where the attribute value corresponding to each entity is shown in Table 2. The first data set P and the second data set Q to be matched thus each include four groups of user attribute information.

Name	Sex	Identity card	Native place	Blood type
Zhang San	Male	211202***	X city, A province	A
Li Si	Male	211221***	X city, A province	O
Wang Wu	Female	211202***	Y city, A province	AB

TABLE 1

TABLE 2

102. Converting the multiple groups of first user attribute information into a plurality of first feature vectors, and converting the multiple groups of second user attribute information into a plurality of second feature vectors.

The first feature vector is an overall feature expression corresponding to a group of first user attribute information (including attribute values of a plurality of entities), and the second feature vector is an overall feature expression corresponding to a group of second user attribute information (including attribute values of a plurality of entities). In this embodiment, for each group of first user attribute information in the first data set P and each group of second user attribute information in the second data set Q, a group of user attribute information is converted into a feature vector, that is, each group of first user attribute information is expressed by a feature vector, and each group of second user attribute information is also expressed by a feature vector.

Specifically, the computer device may convert the multiple groups of first user attribute information in the first data set P into multiple first feature vectors, and convert the multiple groups of second user attribute information in the second data set Q into multiple second feature vectors, by means of feature extraction and feature fusion. In this embodiment, taking the example in step 101, the first data set P to be matched can be represented as P = {p_i | i = 1, 2, 3}, where the three data instances are p_1 = {Zhang San, male, 211202***, X city, A province, A}; p_2 = {Li Si, male, 211221***, X city, A province, O}; and p_3 = {Wang Wu, female, 211202***, Y city, A province, AB}. The attribute names of the four groups of user attribute information in the first data set P may be represented as R = {r_j | j = 1, 2, 3, 4}, where r_1 = gender, r_2 = identity card, r_3 = native place, and r_4 = blood type. For example, one group of first user attribute information is {male, female}, and the corresponding feature vector is the feature vector corresponding to all the information in that group; the other groups of first user attribute information are represented similarly and are not listed one by one here. Similarly, the second data set Q may be represented as Q = {q_i | i = 1, 2, 3}, where the three data instances are q_1 = {Zhang San, 211202***, X city, A province, type A, male}; q_2 = {Han Meimei, 211202***, Z city, type AB, female}; and q_3 = {Li Lei, 211221***, Y city, A province, type B, male}. The attribute names of the four groups of user attribute information in the second data set Q may be represented as S = {s_j | j = 1, 2, 3, 4}, where s_1 = certificate number, s_2 = place of household registration, s_3 = blood type, and s_4 = gender. For example, one group of second user attribute information is {type A, type AB, type B}, and the corresponding feature vector is the feature vector corresponding to all the information in that group; the other groups of second user attribute information are represented similarly and are not listed here.
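The column-wise grouping described above can be sketched as follows. This is only an illustration under stated assumptions: the list-of-dictionaries layout, the function name `attribute_groups` and the use of "Name" as the entity key are hypothetical and are not specified by the source.

```python
# Hypothetical in-memory form of Table 1: one dictionary per data instance.
P = [
    {"Name": "Zhang San", "Sex": "Male", "Identity card": "211202***",
     "Native place": "X city, A province", "Blood type": "A"},
    {"Name": "Li Si", "Sex": "Male", "Identity card": "211221***",
     "Native place": "X city, A province", "Blood type": "O"},
    {"Name": "Wang Wu", "Sex": "Female", "Identity card": "211202***",
     "Native place": "Y city, A province", "Blood type": "AB"},
]


def attribute_groups(records, key="Name"):
    """Return {attribute name: list of values in that column}.

    The entity key column (assumed to be "Name") is skipped, so each
    remaining entry is one group of user attribute information.
    """
    groups = {}
    for record in records:
        for name, value in record.items():
            if name == key:
                continue
            groups.setdefault(name, []).append(value)
    return groups


groups_P = attribute_groups(P)
```

Each entry of `groups_P` then corresponds to one attribute r_j together with the attribute values that would be turned into a single first feature vector.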

103. Calculating the feature similarity between each first feature vector and each second feature vector, and constructing a feature similarity matrix according to the feature similarity between each first feature vector and each second feature vector.

The feature similarity refers to the similarity between feature vectors of two groups of data, and the feature similarity matrix refers to a matrix established by taking the feature similarity as an element.

In this embodiment, the similarity between each group of multidimensional feature vectors in the first data set P and each group of multidimensional feature vectors in the second data set Q may be calculated by an existing similarity calculation method, and an m × m similarity matrix is constructed according to the calculated similarity between each group of feature vectors.

104. Determining the bipartite graph matching result with the highest total similarity in the feature similarity matrix, and obtaining a pattern matching result between the multiple groups of first user attribute information and the multiple groups of second user attribute information according to the bipartite graph matching result.

Bipartite graph matching is defined as follows: given a bipartite graph G, a subgraph M of G is called a matching if no two edges in the edge set {E} of M are incident to the same vertex.

In this embodiment, after an m × m similarity matrix is obtained through similarity calculation, a maximum bipartite graph matching M may be determined on the similarity matrix by a bipartite graph matching algorithm, such that the total similarity of the selected matches is the highest. After the maximum bipartite graph matching M is obtained, it may be translated into the corresponding matching relationship between user attribute information, that is, the pattern matching result between the first user attribute information in the first data set P and the second user attribute information in the second data set Q. Each matched pair (r_i, s_j) ∈ M in the matching result indicates that the attributes r_i and s_j represent the same attribute. In this embodiment, the pattern matching result between the multiple groups of first user attribute information and the multiple groups of second user attribute information may be the correspondence between the attributes of the patient information tables of two different medical institutions.
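To make the objective of step 104 concrete, the following sketch finds the matching with the highest total similarity by exhaustive search over all one-to-one assignments. It is an illustration only: exhaustive search is practical only for very small m, the function name `best_matching_bruteforce` is hypothetical, and the similarity values are invented for the example.

```python
from itertools import permutations


def best_matching_bruteforce(sim):
    """Return (pairs, total) maximizing the summed similarity.

    sim is a square matrix; pairs is a list of (i, j) meaning that
    attribute r_i of the first data set matches s_j of the second.
    """
    m = len(sim)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(m)):
        total = sum(sim[i][perm[i]] for i in range(m))
        if total > best_total:
            best_total, best_perm = total, perm
    return [(i, best_perm[i]) for i in range(m)], best_total


# Invented 3 x 3 similarity matrix (rows: r_1..r_3, columns: s_1..s_3).
sim = [
    [0.9, 0.8, 0.1],
    [0.8, 0.1, 0.2],
    [0.2, 0.3, 0.7],
]
matching, total = best_matching_bruteforce(sim)
```

In this example, greedily taking the largest entry 0.9 first would force the second row onto a weak match and give a total of 1.7, whereas the globally best assignment pairs row 0 with column 1 and row 1 with column 0 for a total of 2.3. This is the difference between a locally and a globally optimal pattern matching result.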

The pattern matching method for multi-data-source information provided by this embodiment obtains two data sets to be matched, converts the multiple groups of user attribute information in the two data sets into multiple feature vectors, calculates the feature similarity between each feature vector of one data set and each feature vector of the other, constructs a feature similarity matrix, determines the bipartite graph matching result with the highest total similarity in the feature similarity matrix, and converts the bipartite graph matching result into the pattern matching result between the two sets of user attribute information. Because the matching result between the two sets of user attribute information is obtained from the feature similarity matrix between the two groups of feature vectors, the tedious iterative process is eliminated, repeated switching and iteration between entity matching and attribute matching is avoided, and the efficiency of pattern matching is effectively improved. In addition, because the bipartite graph matching result with the highest total similarity in the feature similarity matrix is used as the pattern matching result, a globally optimal pattern matching result can be obtained, that is, the pattern matching relationship between the information of the different data sources has the globally maximum similarity, which effectively improves the accuracy of pattern matching.

In one embodiment, each group of first user attribute information includes a plurality of first attribute values, e.g., a group of first user attribute information is {male, female}, and each group of second user attribute information includes a plurality of second attribute values, e.g., a group of second user attribute information is {type A, type AB, type B}. Based on this, step 102 can be specifically implemented as follows: for each group of first user attribute information, converting each first attribute value into a corresponding feature vector, and summing the feature vectors corresponding to the first attribute values to obtain the first feature vector corresponding to that group; and for each group of second user attribute information, converting each second attribute value into a corresponding feature vector, and summing the feature vectors corresponding to the second attribute values to obtain the second feature vector corresponding to that group. In this embodiment, when the user attribute information in a data set is arranged in columns, each group of user attribute information may consist of the attribute values located in the same column of the data set. Each attribute value may be converted into a feature vector of a predetermined length by an existing embedding representation method, and the feature vectors located in the same column of each data set are then summed to obtain a multi-dimensional feature vector. In this way, the user attribute information in each column of the first data set P and the second data set Q may be represented by a corresponding multi-dimensional feature vector.
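The embed-then-sum step can be sketched as below. The source only says that an existing embedding representation method is used, so the embedding here (hashed character-bigram counts of length 16) is a stand-in chosen purely for illustration; `DIM`, `embed_value` and `embed_column` are hypothetical names.

```python
import hashlib

DIM = 16  # assumed predetermined vector length


def embed_value(value, dim=DIM):
    """Toy embedding: counts of hashed character bigrams.

    This is an assumption for illustration, not the embedding method
    used by the source.
    """
    vec = [0.0] * dim
    text = f" {value.lower()} "
    for a, b in zip(text, text[1:]):
        digest = hashlib.sha256((a + b).encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    return vec


def embed_column(values, dim=DIM):
    """Sum the per-value vectors of one column into one feature vector."""
    total = [0.0] * dim
    for value in values:
        for k, x in enumerate(embed_value(value, dim)):
            total[k] += x
    return total


sex_vector = embed_column(["Male", "Male", "Female"])
```

Because the per-value vectors are summed, the resulting column vector does not depend on the order of the rows, which suits data sets whose entities are listed in different orders.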

In one embodiment, the method for constructing the feature similarity matrix in step 103 may be implemented as follows: calculating the feature similarity between each first feature vector and each second feature vector by a feature similarity calculation method, wherein the feature similarity may be any one of cosine similarity, Pearson correlation coefficient and Jaccard similarity coefficient, and then constructing the feature similarity matrix by taking the feature similarity between each first feature vector and each second feature vector as an element. In this embodiment, similarity calculation compares how similar two feature vectors are, and may generally be implemented by calculating the distance between the feature vectors: if the calculated distance is small, the similarity between the two feature vectors is large; if the calculated distance is large, the similarity is small. A common feature similarity calculation method is cosine similarity: the larger the cosine of the angle between two feature vectors (that is, the smaller the angle), the more similar the two feature vectors are. In this embodiment, the Pearson correlation coefficient, the Jaccard similarity coefficient or other algorithms may also be used.
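A minimal sketch of building the feature similarity matrix follows, using cosine similarity and, as an alternative measure, the Pearson correlation coefficient (computed here as the cosine similarity of mean-centered vectors). The function names and the tiny example vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson_correlation(u, v):
    """Pearson coefficient, i.e. cosine similarity after mean-centering."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine_similarity([a - mean_u for a in u],
                             [b - mean_v for b in v])


def similarity_matrix(first_vectors, second_vectors, measure=cosine_similarity):
    """Entry [i][j] is the similarity of first vector i and second vector j."""
    return [[measure(u, v) for v in second_vectors] for u in first_vectors]


first = [[1.0, 0.0], [0.0, 2.0]]
second = [[0.0, 3.0], [4.0, 0.0]]
matrix = similarity_matrix(first, second)
```

Here the first vector of `first` points in the same direction as the second vector of `second`, so `matrix[0][1]` is 1.0, while orthogonal pairs score 0.0.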

In an embodiment, the method for obtaining the pattern matching result between the multiple groups of first user attribute information and the multiple groups of second user attribute information in step 104 may be implemented as follows: searching, by the Hungarian algorithm, for the matching result with the highest total similarity in the feature similarity matrix as the bipartite graph matching result, and then converting the bipartite graph matching result to obtain the pattern matching result between the multiple groups of first user attribute information and the multiple groups of second user attribute information. In this embodiment, the Hungarian algorithm is a combinatorial optimization algorithm that solves the assignment problem in polynomial time. Starting from an existing matching, it repeatedly finds an augmenting path with respect to that matching and uses the path to obtain a new, larger matching that has one more edge than the previous one, until no augmenting path remains and the maximum matching is found. The Hungarian algorithm can therefore find the optimal matching result quickly; converting that result yields the pattern matching result between the multiple groups of user attribute information in the two data sets, which removes the repeated switching and iteration between entity matching and attribute matching found in the prior art.
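The following is a sketch of one common O(n²m) formulation of the Hungarian algorithm (row and column potentials with augmenting paths), adapted to maximize total similarity by negating the matrix. It is not taken from the source; the function name `hungarian_max` is hypothetical, and the sketch assumes the matrix has no more rows than columns.

```python
def hungarian_max(sim):
    """Maximum-total-similarity assignment for an n x m matrix, n <= m.

    Returns (pairs, total), where pairs is a sorted list of (row, column).
    """
    n, m = len(sim), len(sim[0])
    inf = float("inf")
    cost = [[-x for x in row] for row in sim]  # maximize by minimizing -sim
    u = [0.0] * (n + 1)        # row potentials (1-based)
    v = [0.0] * (m + 1)        # column potentials (1-based)
    p = [0] * (m + 1)          # p[j]: row matched to column j, 0 if free
    way = [0] * (m + 1)        # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:     # reached a free column: path is complete
                break
        while j0:              # flip matches along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    pairs = sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j])
    return pairs, sum(sim[i][j] for i, j in pairs)


sim = [
    [0.9, 0.8, 0.1],
    [0.8, 0.1, 0.2],
    [0.2, 0.3, 0.7],
]
pairs, total = hungarian_max(sim)
```

Each outer iteration adds one row to the matching by extending it along an augmenting path, which mirrors the "one more edge than the previous matching" description above.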

In one embodiment, after the feature similarity between each first feature vector and each second feature vector is calculated in step 103, the following steps may be further included: if the feature similarities between any one first feature vector and all of the second feature vectors are smaller than a first similarity threshold, deleting that first feature vector; and if the feature similarities between any one second feature vector and all of the first feature vectors are smaller than the first similarity threshold, deleting that second feature vector. In this embodiment, when the two data sets contain different numbers of groups of user attribute information, each feature similarity may be checked in turn after the similarities between the feature vectors of the two data sets have been calculated. If the similarities between one feature vector in one data set and all feature vectors in the other data set are lower than the first similarity threshold, that feature vector may be deleted. This removes feature vectors that have no matching counterpart, thereby reducing the amount of calculation and improving the accuracy of pattern matching.
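The pruning step can be sketched as follows. The threshold value 0.5, the function name `prune_unmatchable` and the example matrix are assumptions for illustration; the source does not state a concrete threshold.

```python
def prune_unmatchable(sim, threshold):
    """Drop rows and columns whose similarities are all below threshold.

    Returns (pruned matrix, kept row indices, kept column indices) so that
    positions in the pruned matrix can be mapped back to the original
    attributes.
    """
    rows = [i for i, row in enumerate(sim)
            if any(x >= threshold for x in row)]
    cols = [j for j in range(len(sim[0]))
            if any(row[j] >= threshold for row in sim)]
    pruned = [[sim[i][j] for j in cols] for i in rows]
    return pruned, rows, cols


sim = [
    [0.9, 0.2, 0.1],
    [0.1, 0.8, 0.2],
    [0.2, 0.1, 0.1],  # this first-data-set attribute has no counterpart
]
pruned, rows, cols = prune_unmatchable(sim, 0.5)
```

Returning the kept indices alongside the smaller matrix lets a later matching step report its result in terms of the original attribute positions.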

In one embodiment, after determining a bipartite graph matching result with the highest sum of total similarities in step 104, the method may further include the following steps: and judging whether the similarity value between the first feature vector and the second feature vector with the matching relationship in the bipartite graph matching result is smaller than a second similarity threshold, and deleting the matching relationship between the first feature vector and the second feature vector if the similarity value between the first feature vector and the second feature vector with the matching relationship is smaller than the second similarity threshold. In this embodiment, according to the set second similarity threshold, all matching relationships in the bipartite graph matching M may be sequentially determined, and when the similarity between two feature vectors having matching relationships in the bipartite graph matching M is lower than the second similarity threshold, the matching relationships may be deleted, and only the matching relationship of the user attribute information having high similarity is retained.
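A short sketch of this post-filtering step is given below. The second threshold of 0.5, the function name `filter_matches` and the example values are assumptions for illustration only.

```python
def filter_matches(pairs, sim, threshold):
    """Keep only matched pairs whose similarity reaches the threshold."""
    return [(i, j) for i, j in pairs if sim[i][j] >= threshold]


sim = [
    [0.9, 0.2],
    [0.1, 0.3],
]
pairs = [(0, 0), (1, 1)]   # a complete matching over the 2 x 2 matrix
kept = filter_matches(pairs, sim, 0.5)
```

The pair (1, 1) belongs to the matching with the highest total similarity, but its own similarity of 0.3 is low, so it is discarded and only (0, 0) is kept.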

In one embodiment, step 104 may be followed by the following step: merging the first data set and the second data set according to the pattern matching result between the multiple groups of first user attribute information and the multiple groups of second user attribute information to obtain a fused data set of the multiple data sources. In this embodiment, after the matching result for the user attribute information in the two data sets is obtained, the data of the two data sets may be fused through the matching relationship between their user attribute information to obtain a fused data set of the two data sources. The complete data of the same data instance across the different data sets can then be obtained, so as to complete an overall portrait of the data instance over all data sets. Fusing the data sets through the pattern matching result obtained by the method makes the complete data of each data instance, and therefore the overall portrait, more accurate. For example, by fusing the data in Table 1 and Table 2, the other data of Zhang San can be accurately obtained even though the headers of the data sets in Table 1 and Table 2 are different.
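One possible way to apply the pattern matching result when merging is sketched below. Several things are assumed here and are not specified by the source: records are dictionaries, the "Name" column identifies the same entity in both data sets, the first data set wins on conflicting values, and the names `merge_datasets` and `mapping` are hypothetical.

```python
def merge_datasets(first, second, mapping, key="Name"):
    """Merge two record lists using a schema mapping.

    mapping maps a column name of the second data set to the matched
    column name of the first data set.
    """
    merged = {record[key]: dict(record) for record in first}
    for record in second:
        renamed = {mapping.get(name, name): value
                   for name, value in record.items()}
        target = merged.setdefault(renamed[key], {})
        for name, value in renamed.items():
            target.setdefault(name, value)  # keep first-source value if present
    return list(merged.values())


first = [
    {"Name": "Zhang San", "Sex": "Male", "Blood type": "A"},
    {"Name": "Li Si", "Sex": "Male", "Blood type": "O"},
]
second = [
    {"Name": "Zhang San", "Certificate number": "211202***", "Gender": "Male"},
    {"Name": "Li Lei", "Certificate number": "211221***", "Gender": "Male"},
]
# Pattern matching result expressed as second-source name -> first-source name.
mapping = {"Certificate number": "Identity card", "Gender": "Sex"}
fused = merge_datasets(first, second, mapping)
```

After the merge, the record for Zhang San carries the identity card number that only the second data set held, stored under the first data set's column name.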

Further, as a specific implementation of the method shown in fig. 1, this embodiment provides a device for pattern matching of multiple data sources of information, and as shown in fig. 2, the device includes: a data acquisition module 21, a data conversion module 22, a data processing module 23 and a result output module 24.

The data acquisition module 21 may be configured to obtain a first data set and a second data set to be matched, where the first data set includes multiple sets of first user attribute information collected from a first data source, and the second data set includes multiple sets of second user attribute information collected from a second data source;

the data conversion module 22 may be configured to convert multiple sets of first user attribute information into multiple first feature vectors, and convert multiple sets of second user attribute information into multiple second feature vectors;

the data processing module 23 is configured to calculate a feature similarity between each first feature vector and each second feature vector, and construct a feature similarity matrix according to the feature similarity between each first feature vector and each second feature vector;

the result output module 24 may be configured to determine a bipartite graph matching result with the highest total similarity sum in the feature similarity matrix, and obtain a pattern matching result between the multiple sets of first user attribute information and the multiple sets of second user attribute information according to the bipartite graph matching result.

In a specific application scenario, each group of first user attribute information includes a plurality of first attribute values, each group of second user attribute information includes a plurality of second attribute values, and the data conversion module 22 is specifically configured to, for each group of first user attribute information, convert each first attribute value into a corresponding feature vector, and sum the feature vectors corresponding to each first attribute value to obtain a first feature vector corresponding to each group of first user attribute information.

In a specific application scenario, the data processing module 23 may be specifically configured to calculate, by a feature similarity algorithm, the feature similarity between each first feature vector and each second feature vector, where the feature similarity is any one of a cosine similarity, a Pearson correlation coefficient, and a Jaccard similarity coefficient; and to construct the feature similarity matrix by taking the feature similarity between each first feature vector and each second feature vector as an element.

In a specific application scenario, the result output module 24 is specifically configured to find the matching result with the highest total similarity in the feature similarity matrix through the Hungarian algorithm, and use that matching result as the bipartite graph matching result; and to convert the bipartite graph matching result to obtain the pattern matching result between the multiple groups of first user attribute information and the multiple groups of second user attribute information.

In a specific application scenario, the apparatus further includes a data deleting module 25, where the data deleting module 25 is specifically configured to delete any one of the first feature vectors if the feature similarity between the first feature vector and each of the second feature vectors is smaller than a first similarity threshold; and if the feature similarity between any one second feature vector and each first feature vector is smaller than the first similarity threshold, deleting the second feature vectors.

In a specific application scenario, the apparatus further includes a data comparison module 26, where the data comparison module 26 is specifically configured to determine whether the similarity value between a first feature vector and a second feature vector that have a matching relationship in the bipartite graph matching result is smaller than a second similarity threshold; and, if that similarity value is smaller than the second similarity threshold, to delete the matching relationship between the first feature vector and the second feature vector.
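A minimal sketch of this post-filter, assuming the bipartite graph matching result is held as a list of `(i, j)` index pairs into the similarity matrix (a representation chosen for the example, not stated in the source):

```python
def drop_weak_matches(pairs, sim, threshold):
    """Keep only matched pairs whose similarity reaches `threshold`."""
    return [(i, j) for i, j in pairs if sim[i][j] >= threshold]
```

An optimal assignment must pair every row with some column when the matrix is square, so a pair can appear in the result even though its similarity is low. The second threshold removes such forced pairs.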

In a specific application scenario, the apparatus further includes a data integration module 27, where the data integration module 27 is specifically configured to perform merging processing on the first data set and the second data set according to a pattern matching result between the multiple sets of first user attribute information and the multiple sets of second user attribute information, so as to obtain a fused data set of multiple data sources.
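The source does not specify how the merge is carried out. One possible reading is sketched below, under the following assumptions: each data set is a list of records (dictionaries), the pattern matching result is a mapping from a field name in the second data set to its matched field name in the first, and unmatched fields are kept and filled with `None` where absent. All names are hypothetical.

```python
def merge_datasets(first_rows, second_rows, matching):
    """Rename matched fields of the second data set, then stack both sets.

    `matching` maps a second-data-set field name to the first-data-set
    field name it was matched with. Unmatched fields keep their own name.
    """
    renamed = [{matching.get(key, key): value for key, value in row.items()}
               for row in second_rows]
    all_rows = first_rows + renamed
    # Collect every field name in first-seen order.
    fields = []
    for row in all_rows:
        for key in row:
            if key not in fields:
                fields.append(key)
    # Give every record the full set of fields; missing values become None.
    return [{field: row.get(field) for field in fields} for row in all_rows]
```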

It should be noted that other corresponding descriptions of the functional units related to the pattern matching apparatus for multiple data source information provided in this embodiment may refer to the corresponding descriptions in fig. 1, and are not described herein again.

Based on the method shown in fig. 1, the present embodiment correspondingly further provides a storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the pattern matching method for multi-data-source information shown in fig. 1.

Based on such understanding, the technical solution of the present application may be embodied in the form of a software product. The software product may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, or the like) and includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, or the like) to execute the pattern matching method for multi-data-source information described in the implementation scenarios of the present application.

Based on the method shown in fig. 1 and the embodiment of the pattern matching apparatus for multi-data-source information shown in fig. 2, and to achieve the above object, the present embodiment further provides an entity device for pattern matching of multi-data-source information, which may specifically be a personal computer, a server, a smart phone, a tablet computer, a smart watch, or another network device. The entity device includes a storage medium and a processor: the storage medium is used for storing a computer program, and the processor is used for executing the computer program to implement the method shown in fig. 1.

Optionally, the entity device may further include a user interface, a network interface, a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and the like. The user interface may include a display screen and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Wi-Fi interface), and the like.

Those skilled in the art will appreciate that the structure of the entity device for pattern matching of multi-data-source information provided in the present embodiment does not constitute a limitation on the entity device, which may include more or fewer components, combine some components, or use a different arrangement of components.

The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the above-mentioned entity device, and supports the operation of the information processing program and other software and/or programs. The network communication module is used for realizing communication among components in the storage medium and communication with other hardware and software in the entity device.

Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, and can also be implemented by hardware. By applying the technical scheme of the application, two data sets to be matched are first obtained; the multiple groups of user attribute information in the two data sets are then converted into feature vectors; the feature similarity between each first feature vector and each second feature vector is calculated and a feature similarity matrix is constructed; finally, the bipartite graph matching result with the highest total similarity sum is determined in the feature similarity matrix and converted into a pattern matching result between the two sets of user attribute information. Because the user attribute information of the two data sets is converted into a feature similarity matrix between two groups of feature vectors, and the matching result is obtained from that matrix, the tedious process of repeatedly switching and iterating between entity matching and attribute matching is avoided, and the efficiency of pattern matching is effectively improved. In addition, using the bipartite graph matching result with the highest total similarity sum in the feature similarity matrix as the pattern matching result yields a globally optimal pattern matching result, so that the pattern matching relations between different data sources have the globally maximum similarity, and the accuracy of pattern matching is effectively improved.

Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.

The serial numbers used in the present application are merely for description and do not represent the relative merits of the implementation scenarios. The above disclosure covers only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.

Claims

1. A method for pattern matching of multiple data source information, the method comprising:

converting the multiple groups of first user attribute information into multiple first feature vectors, and converting the multiple groups of second user attribute information into multiple second feature vectors;

2. The method of claim 1, wherein each set of the first user attribute information comprises a plurality of first attribute values, and each set of the second user attribute information comprises a plurality of second attribute values;

the converting the multiple sets of first user attribute information into multiple first feature vectors and the converting the multiple sets of second user attribute information into multiple second feature vectors includes:

for each group of first user attribute information, converting each first attribute value into a corresponding feature vector, and summing the feature vectors corresponding to each first attribute value to obtain a first feature vector corresponding to each group of first user attribute information;

and, for each group of second user attribute information, converting each second attribute value into a corresponding feature vector, and summing the feature vectors corresponding to each second attribute value to obtain a second feature vector corresponding to each group of second user attribute information.

3. The method according to claim 1, wherein the calculating the feature similarity between each of the first feature vectors and each of the second feature vectors, respectively, and the constructing the feature similarity matrix according to the feature similarity between each of the first feature vectors and each of the second feature vectors comprises:

respectively calculating the feature similarity between each first feature vector and each second feature vector by a feature similarity algorithm, wherein the feature similarity is any one of cosine similarity, Pearson correlation coefficient and Jaccard similarity coefficient;

and constructing the feature similarity matrix by taking the feature similarity between each first feature vector and each second feature vector as an element.

4. The method according to claim 1, wherein the determining a bipartite graph matching result with a highest total similarity sum in the feature similarity matrix, and obtaining a pattern matching result between the plurality of first user attribute information sets and the plurality of second user attribute information sets according to the bipartite graph matching result comprises:

finding, through the Hungarian algorithm, the matching result with the highest total similarity sum in the feature similarity matrix as the bipartite graph matching result;

and converting the bipartite graph matching result to obtain a pattern matching result between the multiple groups of first user attribute information and the multiple groups of second user attribute information.

5. The method of claim 1, wherein after calculating the feature similarity between each of the first feature vectors and each of the second feature vectors, respectively, the method further comprises:

if the feature similarity between any one first feature vector and each second feature vector is smaller than a first similarity threshold, deleting the first feature vector;

and if the feature similarity between any one second feature vector and each first feature vector is smaller than the first similarity threshold, deleting that second feature vector.

6. The method of claim 1, wherein after determining the bipartite graph matching result with the highest sum of total similarities, the method further comprises:

judging whether the similarity value between the first feature vector and the second feature vector with the matching relation in the bipartite graph matching result is smaller than a second similarity threshold value or not;

and if the similarity value between the first feature vector and the second feature vector with the matching relationship is smaller than a second similarity threshold value, deleting the matching relationship between the first feature vector and the second feature vector.

7. The method of claim 1, further comprising:

merging the first data set and the second data set according to the pattern matching result between the multiple groups of first user attribute information and the multiple groups of second user attribute information to obtain a fused data set of multiple data sources.

8. An apparatus for pattern matching of multiple data sources, the apparatus comprising:

the data acquisition module is used for acquiring a first data set and a second data set to be matched, wherein the first data set comprises a plurality of groups of first user attribute information collected from a first data source, and the second data set comprises a plurality of groups of second user attribute information collected from a second data source;

9. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, realizing the steps of the method of any one of claims 1 to 7.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 7 when executed by the processor.