CN113269248B - Method, device, equipment and storage medium for data standardization - Google Patents

Method, device, equipment and storage medium for data standardization

Info

Publication number
CN113269248B
CN113269248B (application CN202110567575.1A)
Authority
CN
China
Prior art keywords
data
data item
vector
item
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110567575.1A
Other languages
Chinese (zh)
Other versions
CN113269248A (en)
Inventor
唐蕊
蒋雪涵
孙行智
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110567575.1A priority Critical patent/CN113269248B/en
Publication of CN113269248A publication Critical patent/CN113269248A/en
Application granted granted Critical
Publication of CN113269248B publication Critical patent/CN113269248B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the field of big data, and discloses a data standardization method, which comprises the following steps: acquiring a first data item and a second data item; converting the first data item into a corresponding first vector representing its literal form and a corresponding second vector representing its meaning, and converting the second data item into a corresponding third vector representing its literal form and a corresponding fourth vector representing its meaning; calculating the similarity between the first vector and the third vector and between the second vector and the fourth vector; calculating, in the same manner as the similarity between the first data item and the second data item, the similarity between every two data items in the data set to be standardized, so as to obtain a first similarity matrix for the literal representations of the data items and a second similarity matrix for their meaning representations; performing weighted fusion of the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized; and performing standardized classification of the data items in the data set to be standardized accordingly. The efficiency of data standardization and integration is thereby improved.

Description

Method, device, equipment and storage medium for data standardization
Technical Field
The present invention relates to the field of big data, and in particular, to a method, an apparatus, a device, and a storage medium for data standardization.
Background
Nowadays, with the rapid development of medical informatization, more and more medical data are stored in electronic form. However, different medical systems follow different data specifications, that is, they represent the same data in different ways, so the same medical entity may have several different names. Because the systems name medical entities differently, problems arise when data are exchanged between them. The prevailing approach at present is to organize the data manually, associating medical entities that have different names but the same meaning in different systems with a single standardized entity name so as to unify the data. This approach, however, requires considerable manpower, makes comprehensive integration of the data difficult, and is time-consuming, laborious and inefficient.
Disclosure of Invention
The main purpose of the application is to provide a data standardization method, which aims to solve the technical problem that comprehensive and automatic standardization of medical data under different systems cannot be realized at present.
The application provides a data standardization method, which comprises the following steps:
acquiring a first data item and a second data item, wherein the first data item and the second data item are any two data items in a data set to be standardized;
converting the first data item into a corresponding first vector representing its literal form and a corresponding second vector representing its meaning, and converting the second data item into a corresponding third vector representing its literal form and a corresponding fourth vector representing its meaning;
calculating the similarity between the first vector and the third vector, and calculating the similarity between the second vector and the fourth vector;
according to the calculation mode of the similarity between the first data item and the second data item, calculating the similarity between every two data items in the data set to be standardized respectively, and obtaining a first similarity matrix of literal representation of the data set to be standardized and a second similarity matrix of meaning representation of the data set to be standardized;
weighting and fusing the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized;
and carrying out standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized.
Preferably, the step of converting the first data item into a corresponding first vector representing its literal form and a corresponding second vector representing its meaning includes:
Acquiring an n-gram representation mode corresponding to the first data item and an item category corresponding to the first data item;
constructing the first vector according to the n-gram representation mode corresponding to the first data item;
capturing a context relation corresponding to the first data item in the item category;
and constructing the second vector according to the context relation corresponding to the first data item.
Preferably, the step of constructing the second vector according to the context relation corresponding to the first data item includes:
all data items contained in the item category corresponding to the first data item are formed into a data item set;
constructing data item pairs in the data item set in a pairwise combination manner;
and inputting the data item pair into a first neural network to obtain the second vector.
Preferably, the n-gram includes 1-gram, 2-gram and 3-gram, and the step of constructing the first vector according to the n-gram representation corresponding to the first data item includes:
correspondingly converting the first data item into a 1-gram representation mode, a 2-gram representation mode and a 3-gram representation mode respectively;
Sequentially combining the 1-gram representation, the 2-gram representation and the 3-gram representation into a feature combination;
and inputting the characteristic combination into a second neural network to obtain the first vector.
Preferably, before the step of converting the first data item into a 1-gram representation, a 2-gram representation and a 3-gram representation, respectively, the method comprises:
judging whether preset characters exist in the first data item, wherein the preset characters are characters except Chinese and English;
if yes, deleting the preset character;
judging whether capital English characters exist in the first data item from which the preset characters have been deleted;
if yes, modifying the uppercase English character into lowercase English character.
Preferably, the step of calculating the similarity between the first vector and the third vector includes:
substituting the first vector and the third vector into a specified calculation formula, wherein the specified calculation formula is:
similarity = (Σ_{i=1}^{n} A_i × B_i) / (√(Σ_{i=1}^{n} A_i²) × √(Σ_{i=1}^{n} B_i²))
wherein similarity represents the similarity, A represents the first vector, B represents the third vector, the first vector and the third vector have the same vector dimension, n represents the number of vector dimensions, and i represents the ith vector dimension;
And taking the output value of the specified calculation formula as the similarity between the first vector and the third vector.
Preferably, the step of performing standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized includes:
judging whether a third data item exists in the similarity matrix of the data set to be standardized, wherein the similarity between the third data item and data items except the third data item of the data set to be standardized is smaller than a preset threshold value;
if yes, the third data item is used as a new item category;
the method comprises the steps of obtaining a designated data item with maximum similarity to a fourth data item in a similarity matrix of a data set to be standardized, wherein the fourth data item and the designated data item are any data item except a third data item of the data set to be standardized;
and combining the fourth data item and the specified data item into the same item category.
The application also provides a device for data standardization, which comprises:
an acquisition module, configured to acquire a first data item and a second data item, wherein the first data item and the second data item are any two data items in a data set to be standardized;
a conversion module, configured to convert the first data item into a corresponding first vector representing its literal form and a corresponding second vector representing its meaning, and to convert the second data item into a corresponding third vector representing its literal form and a corresponding fourth vector representing its meaning;
a first calculation module, configured to calculate a similarity between the first vector and the third vector, and calculate a similarity between the second vector and the fourth vector;
the second calculation module is used for respectively calculating the similarity between every two data items in the data set to be standardized according to the calculation mode of the similarity between the first data item and the second data item to obtain a first similarity matrix of literal representation of the data set to be standardized and a second similarity matrix of meaning representation of the data set to be standardized;
the fusion module is used for carrying out weighted fusion on the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized;
and the classification module is used for carrying out standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized.
The present application also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.
The present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above-described method.
According to the method, through natural language processing and deep neural network learning, the data items corresponding to entity data in the medical field, such as medical entity names, are vectorized using their literal representations as features, the contextual meaning representations of the medical entity names are vectorized as well, the similarity between every two vectors is calculated, and a similarity matrix over all data items is established to obtain a data standardization criterion for medical entities. Data standardization is thus achieved, automatic standardization of data between different systems is facilitated, manpower is freed, and the efficiency of data standardization and integration is improved.
Drawings
FIG. 1 is a schematic flow chart of a method for data normalization according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a learning model structure represented by the meaning of an embodiment of the present application;
FIG. 3 is a schematic diagram of a literally-represented learning model structure according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an apparatus for data normalization according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an internal structure of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Referring to fig. 1, a method for data normalization according to an embodiment of the present application includes:
s1: acquiring a first data item and a second data item, wherein the first data item and the second data item are any two data items in a data set to be standardized;
s2: converting the first data item into a corresponding first vector representing its literal form and a corresponding second vector representing its meaning, and converting the second data item into a corresponding third vector representing its literal form and a corresponding fourth vector representing its meaning;
s3: calculating the similarity between the first vector and the third vector, and calculating the similarity between the second vector and the fourth vector;
s4: according to the calculation mode of the similarity between the first data item and the second data item, calculating the similarity between every two data items in the data set to be standardized respectively, and obtaining a first similarity matrix of literal representation of the data set to be standardized and a second similarity matrix of meaning representation of the data set to be standardized;
s5: weighting and fusing the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized;
s6: and carrying out standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized.
In this embodiment of the present application, the data set to be standardized is a data set in the medical field; because different hospitals and different doctors describe the same medical scene in different ways, the medical data are not uniform. The data items include, but are not limited to, medical entities such as disease names, drug names, examination names, or diagnostic conclusions. The first vector, corresponding to the literal representation, is obtained from character combination features: for example, n-gram-based word segmentation yields the character combination features of the data item's character string, which are converted into a vector by a deep neural network. The second vector, corresponding to the meaning representation, is obtained from context-based features of the meaning expressed by the character string, which are input into another deep neural network. The two deep neural networks have the same structure but different training data, and hence different network parameters; they serve different functions, realizing literal vector conversion and meaning vector conversion respectively. The similarity of two data items considers both the literal and the meaning representations so as to improve the accuracy of data classification. The similarity between every two data items is calculated from their literal vectors, giving a first similarity matrix whose row information and column information are both the set of all data items. Similarly, the similarity between every two data items is calculated from their meaning vectors, giving a second similarity matrix whose row information and column information are likewise all data items.
The data items are arranged in the same order in the row information and column information of the two similarity matrices. A similarity matrix that considers the literal representation and the meaning representation simultaneously is obtained through weighted fusion, and the data items are then classified and combined according to a preset classification threshold and the maximum similarity between data items in that matrix, thereby realizing data standardization.
According to the method, through natural language processing and deep neural network learning, the data items corresponding to entity data in the medical field, such as medical entity names, are vectorized using their literal representations as features, the contextual meaning representations of the medical entity names are vectorized as well, the similarity between every two vectors is calculated, and a similarity matrix over all data items is established to obtain a data standardization criterion for medical entities. Data standardization is thus achieved, automatic standardization of data between different systems is facilitated, manpower is freed, and the efficiency of data standardization and integration is improved.
Further, the step S2 of converting the first data item into a corresponding first vector representing its literal form and a corresponding second vector representing its meaning includes:
s21: acquiring an n-gram representation mode corresponding to the first data item and an item category corresponding to the first data item;
s22: constructing the first vector according to the n-gram representation mode corresponding to the first data item;
s23: capturing a context relation corresponding to the first data item in the item category;
s24: and constructing the second vector according to the context relation corresponding to the first data item.
The n-gram representation mode can be selected according to requirements. For example, weighing calculation amount against classification accuracy, 1-gram, 2-gram and 3-gram are preferably used simultaneously: the 1-gram, 2-gram and 3-gram representations are spliced into an input feature, which is input into a deep neural network to obtain the first vector.
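The splicing of 1-gram, 2-gram and 3-gram representations described above can be sketched as follows (a minimal illustration, not part of the patent text; the function names and the representation of a data item as a list of segmentation units are assumptions):

```python
def ngrams(units, n):
    # Contiguous runs of n segmentation units, joined into one n-gram string.
    return [''.join(units[i:i + n]) for i in range(len(units) - n + 1)]

def gram_features(units):
    # Splice the 1-gram, 2-gram and 3-gram representations into one input feature.
    return ngrams(units, 1) + ngrams(units, 2) + ngrams(units, 3)
```

For a data item segmented into units ['a', 'b', 'c'], this yields ['a', 'b', 'c', 'ab', 'bc', 'abc'].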
The context relation refers to the meaning association between data items: if the meanings of two data items are similar, they are considered data items with a stronger context relation. Data item pairs representing the context relation are input into a deep neural network as features to obtain the second vector.
Further, the step S24 of constructing the second vector according to the context corresponding to the first data item includes:
s241: all data items contained in the item category corresponding to the first data item are formed into a data item set;
s242: constructing data item pairs in the data item set in a pairwise combination manner;
s243: and inputting the data item pair into a first neural network to obtain the second vector.
The item categories include classification criteria such as diagnosis, disease, or patient; for example, medical information having the same diagnosis may be considered one category, and medical information belonging to the same patient may be considered one category. For example, suppose the data are as follows: {patient 1: diagnosis 1}: {category 1: {data item A, data item B}, category 2: {data item C, data item D}}; {patient 2: diagnosis 1}: {category 1: {data item E, data item F}, category 2: {data item G, data item H}}. The diagnoses of the two patients above are identical, both "diagnosis 1", so the data of the two patients are combined in order to consider the context relation between the data items within each category under the same diagnosis. Combining the data of the same diagnosis gives: diagnosis 1: {category 1: {data item A, data item B, data item E, data item F}, category 2: {data item C, data item D, data item G, data item H}}. Data item pairs (input feature, output label) are then constructed by considering the context of the data items within "diagnosis 1-category 1" and within "diagnosis 1-category 2", i.e. within each category of the same diagnosis. The constructed data item pairs are as follows. In "diagnosis 1-category 1": (data item A, data item B), (data item A, data item E), (data item B, data item F), etc., i.e., all data items in "diagnosis 1-category 1" are combined two by two. In "diagnosis 1-category 2": (data item C, data item D), (data item C, data item G), (data item D, data item H), etc., i.e., all data items in "diagnosis 1-category 2" are combined two by two.
The first neural network is a deep learning network with three fully-connected hidden layers, as shown in fig. 2, trained in an unsupervised manner; it is the learning model of the meaning representation. The input vector of the first neural network is the input feature and the output vector is the output label, i.e. the input and the output differ. The input features and output labels are represented as one-hot encoding vectors, where each dimension of the vector corresponds to a data item: a value of 1 in a dimension indicates that the data item is present, and a value of 0 indicates that it is absent. Specifically, the training data of the first neural network construct features and their corresponding labels in two ways, both capturing the context that carries the meaning of the data. The first way: for data items associated with the same diagnosis or illness, the context is built within each category obtained by category division, i.e. data item pairs are constructed, the two data items of each pair forming training data as input feature and output label respectively. For example, if different data items under the same diagnosis involve different drug names, the drug names of all the data items are collected and de-duplicated to obtain the full drug set corresponding to that diagnosis, and every pair of drug names in that set (input feature, output label) is used as training data. The second way: for data items associated with one visit of the same patient, the context is built within each category obtained by category division, i.e. data item pairs are constructed as training data.
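The pairwise construction of (input feature, output label) training pairs within one category can be sketched as follows (illustrative only; whether both orderings of each pair are used as training data is an assumption):

```python
from itertools import permutations

def build_training_pairs(category_items):
    # Combine all data items of one category two by two into
    # (input feature, output label) pairs; both orderings are kept here.
    return list(permutations(category_items, 2))
```

For a category {A, B, C} this produces the six ordered pairs (A, B), (A, C), (B, A), (B, C), (C, A), (C, B).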
Each data item is expressed as a one-hot encoding vector, and the neural network is trained in an unsupervised manner to obtain the learning model of the meaning representation, so that each data item can be vectorized from its meaning representation to obtain a meaning-representation vector.
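The one-hot encoding described above can be sketched as follows (illustrative; how each data item is assigned its index in the vocabulary is an assumption):

```python
def one_hot(index, vocab_size):
    # One-hot encoding: the dimension corresponding to the given
    # data item is 1, and every other dimension is 0.
    vec = [0] * vocab_size
    vec[index] = 1
    return vec
```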
Further, the n-gram includes 1-gram, 2-gram and 3-gram, and the step S22 of constructing the first vector according to the n-gram representation corresponding to the first data item includes:
s221: correspondingly converting the first data item into a 1-gram representation mode, a 2-gram representation mode and a 3-gram representation mode respectively;
s222: sequentially combining the 1-gram representation, the 2-gram representation and the 3-gram representation into a feature combination;
s223: and inputting the characteristic combination into a second neural network to obtain the first vector.
For example, for the "hepatitis b e antibody assay anti hbe" data item, the corresponding 1-gram, 2-gram and 3-gram representations are, respectively, "b, liver, e, antibody, body, assay, anti hbe", "hepatitis b, liver e, e-antibody, body assay, anti hbe" and "hepatitis b e, liver e antibody, e-antibody, body assay, assay anti hbe" (in the original Chinese name, each single character is one segmentation unit and the continuous English segment is one unit, and each n-gram combines n adjacent units). The 1-gram, 2-gram and 3-gram representations are sequentially combined into an input feature vector, which is used as training data to train the second neural network; as shown in fig. 3, the low-dimensional dense vector output by the middle hidden layer is taken as the output vector, i.e. the literal-representation vector corresponding to the data item.
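The segmentation rule described here — each Chinese character as one unit, each continuous English segment as one unit — could be sketched like this (illustrative; the function name and the exact regular expression are assumptions):

```python
import re

def segment(name):
    # Each continuous run of English letters is one unit; every
    # Chinese character (CJK Unified Ideographs) is a unit of its own.
    return re.findall(r'[A-Za-z]+|[\u4e00-\u9fff]', name)
```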
The second neural network is a deep learning network with an autoencoder structure of three fully-connected hidden layers, and the literal-representation learning model is obtained by training it on the training data. The second neural network constructs features from the literal representation of the data items, i.e. their character-string composition, using the three representation modes 1-gram, 2-gram and 3-gram; by training the network in an unsupervised manner, the data items can be vectorized from their literal representation to obtain literal-representation vectors.
The input features and the output vectors of the second neural network are identical, corresponding to the encoding of an autoencoder: by automatically learning from the input features, the network converts the high-dimensional sparse input vector into a low-dimensional dense vector, corresponding to the output of the middle hidden layer; this low-dimensional dense vector captures the important features of the input high-dimensional sparse vector, and the network then restores the low-dimensional dense vector to the high-dimensional sparse vector, completing model training. After network training is completed, the feature vector constructed from the 1-gram, 2-gram and 3-gram of a given data item is input, and the middle hidden layer of the network outputs a low-dimensional dense vector as the literal representation of that data item.
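A shape-level sketch of the autoencoder just described (the layer sizes, random initialization, and function names are assumptions; a real implementation would train the weights so that the output reconstructs the input):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def make_autoencoder(input_dim, hidden_dims=(256, 64, 256), seed=0):
    # Three fully-connected hidden layers; the middle (bottleneck) layer's
    # output serves as the low-dimensional dense literal-representation vector.
    rng = np.random.default_rng(seed)
    dims = (input_dim,) + tuple(hidden_dims) + (input_dim,)
    return [rng.normal(scale=0.01, size=(a, b)) for a, b in zip(dims, dims[1:])]

def encode(weights, x):
    # Forward pass up to the middle hidden layer: high-dimensional sparse
    # n-gram features in, low-dimensional dense vector out.
    h = x
    for w in weights[:2]:
        h = relu(h @ w)
    return h
```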
Further, before step S221 of converting the first data item into a 1-gram representation, a 2-gram representation, and a 3-gram representation, the method includes:
s2201: judging whether preset characters exist in the first data item, wherein the preset characters are characters except Chinese and English;
s2202: if yes, deleting the preset character;
s2203: judging whether capital English characters exist in the first data item from which the preset characters have been deleted;
s2204: if yes, modifying the uppercase English character into lowercase English character.
In the embodiment of the application, in order to improve the accuracy of the constructed vectors, characters other than Chinese characters and English letters are first removed from the names, the English is unified into lowercase form, and the words are then segmented. For example, the "hepatitis b e antibody assay (Anti-HBe)" data item is first preprocessed into "hepatitis b e antibody assay anti hbe"; then each continuous English segment in the name is treated as a single character for word segmentation, yielding the corresponding 1-gram, 2-gram and 3-gram representations.
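The preprocessing step above can be sketched as follows (illustrative; treating each removed character as a word boundary is an assumption based on the example, in which "(Anti-HBe)" becomes "anti hbe"):

```python
import re

def normalize_name(name):
    # Replace any character that is neither a Chinese character nor an English
    # letter with a space, collapse whitespace, and lowercase the English.
    cleaned = re.sub(r'[^\u4e00-\u9fffA-Za-z]+', ' ', name)
    return ' '.join(cleaned.split()).lower()
```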
Further, the step S3 of calculating the similarity between the first vector and the third vector includes:
S31: substituting the first vector and the third vector into a specified calculation formula, wherein the specified calculation formula is:
Figure BDA0003081461740000091
wherein similarity represents similarity, a represents the first vector, B represents the third vector, the first vector and the third vector have the same vector dimension, n represents a vector dimension, and i represents an ith vector dimension;
s32: and taking the output value of the specified calculation formula as the similarity between the first vector and the third vector.
In the embodiment of the present application, each data item is represented by two vectors: a vector for the literal representation and a vector for the meaning representation. The similarity between every two data items is calculated from their literal vectors and from their meaning vectors respectively, in the manner described above, yielding a literal similarity matrix and a meaning similarity matrix over all data items. The two similarity matrices are then weighted and fused to obtain the final comprehensive similarity matrix. The weight ratio of the two matrices can be adjusted according to the requirements of the data classification task, provided that the two weights sum to 1. The preferred weight ratio of the two is 0.5:0.5.
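A brief sketch of this computation in pure Python (the function names are illustrative): cosine similarity between two equal-length vectors, plus the weighted fusion with weights constrained to sum to 1:

```python
import math

def cosine_similarity(a, b):
    """similarity = (Σ a_i·b_i) / (√Σ a_i² · √Σ b_i²) for equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def fuse(literal_m, meaning_m, w_literal=0.5):
    """Weighted fusion of the literal and meaning similarity matrices."""
    w_meaning = 1.0 - w_literal          # the two weights must sum to 1
    return [[w_literal * l + w_meaning * m
             for l, m in zip(lrow, mrow)]
            for lrow, mrow in zip(literal_m, meaning_m)]

print(cosine_similarity([1, 0], [1, 0]))   # 1.0
```

With `w_literal=0.5` this reproduces the preferred 0.5:0.5 weighting; other task-dependent ratios only require changing that one argument.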
Further, the step S6 of performing standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized includes:
s61: judging whether a third data item exists in the similarity matrix of the data set to be standardized, wherein the similarity between the third data item and data items except the third data item of the data set to be standardized is smaller than a preset threshold value;
s62: if yes, the third data item is used as a new item category;
s63: the method comprises the steps of obtaining a designated data item with maximum similarity to a fourth data item in a similarity matrix of a data set to be standardized, wherein the fourth data item and the designated data item are any data item except a third data item of the data set to be standardized;
s64: and combining the fourth data item and the specified data item into the same item category.
For example, if the preset threshold is 0.5, the data items whose similarity to all other data items in the comprehensive similarity matrix is lower than 0.5 are separated out; such a data item has low similarity with every other data item and should therefore stand alone as a standard data item. For each of the remaining data items, the item with the highest similarity to it is then selected for classification and integration, and the data items are aggregated to complete the data standardization.
Table 1

             Data item 1  Data item 2  Data item 3  Data item 4  Data item 5  Data item 6
Data item 1      1            0.9          0.8          0.7          0.6          0.2
Data item 2      0.9          1            0.6          0.8          0.8          0.1
Data item 3      0.8          0.6          1            0.6          0.7          0.2
Data item 4      0.7          0.8          0.6          1            0.9          0.3
Data item 5      0.6          0.8          0.7          0.9          1            0.4
Data item 6      0.2          0.1          0.2          0.3          0.4          1
For example, in Table 1 the similarity between data item 6 and every other data item is below the threshold value 0.5, i.e. data item 6 is not similar to any other data item, so data item 6 forms a category on its own and cannot be merged with the other data items. For each of the remaining five data items, the data item with the highest similarity is selected as its similar item, i.e. for each row the data item corresponding to the maximum off-diagonal value in that row is chosen for classification. Data items 1, 2 and 3 are thus aggregated together, data items 4 and 5 are aggregated together, and data item 6, which falls below the preset threshold, forms a cluster on its own, finally yielding three standardized data items and completing the standardization of the data items. Merge 1: data item 1, data item 2, data item 3; merge 2: data item 4, data item 5; merge 3: data item 6.
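The classification above can be sketched as follows on the Table 1 matrix; the union-find helper and the restriction of the row maximum to non-singleton items are implementation choices not spelled out in the text:

```python
# Items whose similarity to every other item is below the threshold become
# singleton categories; each remaining item is merged with its most similar
# item (the row maximum), and a small union-find collects the clusters.
THRESHOLD = 0.5
M = [
    [1.0, 0.9, 0.8, 0.7, 0.6, 0.2],
    [0.9, 1.0, 0.6, 0.8, 0.8, 0.1],
    [0.8, 0.6, 1.0, 0.6, 0.7, 0.2],
    [0.7, 0.8, 0.6, 1.0, 0.9, 0.3],
    [0.6, 0.8, 0.7, 0.9, 1.0, 0.4],
    [0.2, 0.1, 0.2, 0.3, 0.4, 1.0],
]

n = len(M)
parent = list(range(n))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]    # path halving
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

# Step 1: items with all off-diagonal similarities below the threshold.
singletons = {i for i in range(n)
              if all(M[i][j] < THRESHOLD for j in range(n) if j != i)}

# Step 2: merge every remaining item with its most similar non-singleton peer.
for i in range(n):
    if i in singletons:
        continue
    best = max((j for j in range(n) if j != i and j not in singletons),
               key=lambda j: M[i][j])
    union(i, best)

clusters = {}
for i in range(n):
    clusters.setdefault(find(i), []).append(i + 1)   # 1-based item numbers
print(sorted(clusters.values()))   # [[1, 2, 3], [4, 5], [6]]
```

Running this on the Table 1 values reproduces the three standardized categories described in the example.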
Referring to fig. 4, an apparatus for data normalization according to an embodiment of the present application includes:
an acquisition module 1, configured to acquire a first data item and a second data item, where the first data item and the second data item are any two data items in a data set to be standardized;
a conversion module 2, configured to convert the first data items into a first vector corresponding to a literal representation and a second vector corresponding to a meaning representation, and convert the second data items into a third vector corresponding to a literal representation and a fourth vector corresponding to a meaning representation;
a first calculating module 3, configured to calculate a similarity between the first vector and the third vector, and calculate a similarity between the second vector and the fourth vector;
the second calculating module 4 is configured to calculate, according to the calculation manner of the similarity between the first data item and the second data item, the similarity between every two data items in the data set to be standardized, so as to obtain a first similarity matrix of the literal representation of the data set to be standardized and a second similarity matrix of the meaning representation of the data set to be standardized;
The fusion module 5 is used for carrying out weighted fusion on the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized;
and the classification module 6 is used for carrying out standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized.
The explanations given in the corresponding method embodiments also apply to the embodiments of the present apparatus and are not repeated here.
Further, the conversion module 2 includes:
the first acquisition unit is used for acquiring the n-gram representation mode corresponding to the first data item and the item category corresponding to the first data item;
the first construction unit is used for constructing the first vector according to the n-gram representation mode corresponding to the first data item;
a capturing unit, configured to capture, in the item category, a context relationship corresponding to the first data item;
and the second construction unit is used for constructing the second vector according to the context relation corresponding to the first data item.
Further, the second construction unit includes:
a composition subunit, configured to compose all the data items included in the item category corresponding to the first data item into a data item set;
A construction subunit, configured to construct pairs of data items in a pairwise combination manner in the data item set;
and the first input subunit is used for inputting the data item pair into a first neural network to obtain the second vector.
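The pairwise combination step might look like the following sketch; the sample item category and the downstream word2vec-style model are assumptions, since the text does not specify the architecture of the first neural network:

```python
from itertools import combinations

# All data items in the item category of the first data item are combined
# pairwise; each pair would then be fed to the first neural network to learn
# the meaning (context) vector of the data item.
item_category = ["乙肝e抗体", "Anti-HBe", "乙型肝炎e抗体测定"]   # toy category

pairs = list(combinations(item_category, 2))
print(pairs)
# [('乙肝e抗体', 'Anti-HBe'), ('乙肝e抗体', '乙型肝炎e抗体测定'),
#  ('Anti-HBe', '乙型肝炎e抗体测定')]
```

A category of k items yields k·(k-1)/2 pairs, so each data item appears in context with every other item of its category during training.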
Further, the n-gram includes 1-gram, 2-gram, and 3-gram, and the first construction unit includes:
a conversion subunit, configured to correspondingly convert the first data item into a 1-gram representation, a 2-gram representation, and a 3-gram representation, respectively;
a combination subunit, configured to sequentially combine the 1-gram representation, the 2-gram representation, and the 3-gram representation into a feature combination;
and the second input subunit is used for inputting the characteristic combination into a second neural network to obtain the first vector.
Further, the first construction unit includes:
a first judging subunit, configured to judge whether a preset character exists in the first data item, where the preset character is a character other than chinese and english;
a deleting subunit, configured to delete a preset character if the preset character exists;
a second judging subunit, configured to judge whether uppercase English characters exist in the first data item from which the preset characters have been deleted;
and a modification subunit, configured to modify the uppercase English characters into lowercase English characters if the uppercase English characters exist.
Further, the first computing module 3 includes:
a substituting unit, configured to substitute the first vector and the third vector into a specified calculation formula, where the specified calculation formula is:
similarity = ( Σ_{i=1}^{n} A_i × B_i ) / ( √( Σ_{i=1}^{n} A_i² ) × √( Σ_{i=1}^{n} B_i² ) )
wherein similarity represents similarity, a represents the first vector, B represents the third vector, the first vector and the third vector have the same vector dimension, n represents a vector dimension, and i represents an ith vector dimension;
and a first unit configured to use an output value of the specified calculation formula as a similarity between the first vector and the third vector.
Further, the classification module 6 includes:
the judging unit is used for judging whether a third data item exists in the similarity matrix of the data set to be standardized, wherein the similarity between the third data item and data items except the third data item of the data set to be standardized is smaller than a preset threshold value;
the second unit is used for taking the third data item as a new item category if yes;
A second obtaining unit, configured to obtain a specified data item with a maximum similarity to a fourth data item in a similarity matrix of the data set to be standardized, where the fourth data item and the specified data item are any data item except a third data item of the data set to be standardized;
and the merging unit is used for merging the fourth data item and the appointed data item into the same item category.
Referring to fig. 5, a computer device is further provided in an embodiment of the present application; the computer device may be a server, and its internal structure may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store all the data required by the data normalization process. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a method of data normalization.
The method for the processor to perform the data normalization comprises the following steps: acquiring a first data item and a second data item, wherein the first data item and the second data item are any two data items in a data set to be standardized; converting the first data item into a first vector corresponding to a literal representation and a second vector corresponding to a meaning representation, and converting the second data item into a third vector corresponding to a literal representation and a fourth vector corresponding to a meaning representation; calculating the similarity between the first vector and the third vector, and calculating the similarity between the second vector and the fourth vector; according to the calculation manner of the similarity between the first data item and the second data item, respectively calculating the similarity between every two data items in the data set to be standardized, to obtain a first similarity matrix of the literal representation of the data set to be standardized and a second similarity matrix of the meaning representation of the data set to be standardized; weighting and fusing the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized; and performing standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized.
Through natural language processing and deep neural network learning, the computer device vectorizes data items such as the medical entity names corresponding to entity data in the medical field, using their literal representations as features, likewise vectorizes the contextual meaning representations of the medical entity names, calculates the similarity between every two vectors, and establishes a similarity matrix between all data items to obtain a data standardization standard for the medical entities. Data standardization is thereby achieved, automatic data standardization between different systems is facilitated, manpower is freed, and the efficiency of data standardization and integration is improved.
In one embodiment, the step of converting the first data item into a first vector corresponding to a literal representation and a second vector corresponding to a meaning representation includes: acquiring an n-gram representation mode corresponding to the first data item and an item category corresponding to the first data item; constructing the first vector according to the n-gram representation mode corresponding to the first data item; capturing a context relation corresponding to the first data item in the item category; and constructing the second vector according to the context relation corresponding to the first data item.
In one embodiment, the step of constructing the second vector by the processor according to the context corresponding to the first data item includes: all data items contained in the item category corresponding to the first data item are formed into a data item set; constructing data item pairs in the data item set in a pairwise combination manner; and inputting the data item pair into a first neural network to obtain the second vector.
In one embodiment, the n-gram includes a 1-gram, a 2-gram, and a 3-gram, and the processor constructs the first vector according to the n-gram representation corresponding to the first data item, including: correspondingly converting the first data item into a 1-gram representation mode, a 2-gram representation mode and a 3-gram representation mode respectively; sequentially combining the 1-gram representation, the 2-gram representation and the 3-gram representation into a feature combination; and inputting the characteristic combination into a second neural network to obtain the first vector.
In one embodiment, before the step of converting the first data item into the 1-gram representation, the 2-gram representation, and the 3-gram representation, respectively, the method performed by the processor includes: judging whether preset characters exist in the first data item, wherein the preset characters are characters other than Chinese and English; if yes, deleting the preset characters; judging whether uppercase English characters exist in the first data item from which the preset characters have been deleted; and if yes, modifying the uppercase English characters into lowercase English characters.
In one embodiment, the step of calculating the similarity between the first vector and the third vector includes: substituting the first vector and the third vector into a specified calculation formula, wherein the specified calculation formula is:
similarity = ( Σ_{i=1}^{n} A_i × B_i ) / ( √( Σ_{i=1}^{n} A_i² ) × √( Σ_{i=1}^{n} B_i² ) )
wherein similarity represents the similarity, A represents the first vector, B represents the third vector, the first vector and the third vector have the same vector dimension, n represents the vector dimension, and i represents the ith vector dimension; and taking the output value of the specified calculation formula as the similarity between the first vector and the third vector.
In one embodiment, the step of performing, by the processor, standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized includes: judging whether a third data item exists in the similarity matrix of the data set to be standardized, wherein the similarity between the third data item and the data items of the data set to be standardized other than the third data item is smaller than a preset threshold value; if yes, taking the third data item as a new item category; obtaining a specified data item having the maximum similarity to a fourth data item in the similarity matrix of the data set to be standardized, wherein the fourth data item and the specified data item are each any data item of the data set to be standardized other than the third data item; and merging the fourth data item and the specified data item into the same item category.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of a portion of the architecture in connection with the present application and is not intended to limit the computer device to which the present application is applied.
An embodiment of the present application further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of data normalization, comprising: acquiring a first data item and a second data item, wherein the first data item and the second data item are any two data items in a data set to be standardized; converting the first data item into a first vector corresponding to a literal representation and a second vector corresponding to a meaning representation, and converting the second data item into a third vector corresponding to a literal representation and a fourth vector corresponding to a meaning representation; calculating the similarity between the first vector and the third vector, and calculating the similarity between the second vector and the fourth vector; according to the calculation manner of the similarity between the first data item and the second data item, respectively calculating the similarity between every two data items in the data set to be standardized, to obtain a first similarity matrix of the literal representation of the data set to be standardized and a second similarity matrix of the meaning representation of the data set to be standardized; weighting and fusing the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized; and performing standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized.
Through natural language processing and deep neural network learning, the computer readable storage medium enables data items such as the medical entity names corresponding to entity data in the medical field to be vectorized using their literal representations as features, the contextual meaning representations of the medical entity names to be likewise vectorized, the similarity between every two vectors to be calculated, and a similarity matrix between all data items to be established, obtaining a data standardization standard for the medical entities. Data standardization is thereby achieved, automatic data standardization between different systems is facilitated, manpower is freed, and the efficiency of data standardization and integration is improved.
In one embodiment, the step of converting the first data item into a first vector corresponding to a literal representation and a second vector corresponding to a meaning representation includes: acquiring an n-gram representation mode corresponding to the first data item and an item category corresponding to the first data item; constructing the first vector according to the n-gram representation mode corresponding to the first data item; capturing a context relation corresponding to the first data item in the item category; and constructing the second vector according to the context relation corresponding to the first data item.
In one embodiment, the step of constructing the second vector by the processor according to the context corresponding to the first data item includes: all data items contained in the item category corresponding to the first data item are formed into a data item set; constructing data item pairs in the data item set in a pairwise combination manner; and inputting the data item pair into a first neural network to obtain the second vector.
In one embodiment, the n-gram includes a 1-gram, a 2-gram, and a 3-gram, and the processor constructs the first vector according to the n-gram representation corresponding to the first data item, including: correspondingly converting the first data item into a 1-gram representation mode, a 2-gram representation mode and a 3-gram representation mode respectively; sequentially combining the 1-gram representation, the 2-gram representation and the 3-gram representation into a feature combination; and inputting the characteristic combination into a second neural network to obtain the first vector.
In one embodiment, before the step of converting the first data item into the 1-gram representation, the 2-gram representation, and the 3-gram representation, respectively, the method performed by the processor includes: judging whether preset characters exist in the first data item, wherein the preset characters are characters other than Chinese and English; if yes, deleting the preset characters; judging whether uppercase English characters exist in the first data item from which the preset characters have been deleted; and if yes, modifying the uppercase English characters into lowercase English characters.
In one embodiment, the step of calculating the similarity between the first vector and the third vector includes: substituting the first vector and the third vector into a specified calculation formula,
wherein, the specified calculation formula is:
similarity = ( Σ_{i=1}^{n} A_i × B_i ) / ( √( Σ_{i=1}^{n} A_i² ) × √( Σ_{i=1}^{n} B_i² ) )
wherein similarity represents similarity, a represents the first vector, B represents the third vector, the first vector and the third vector have the same vector dimension, n represents a vector dimension, and i represents an ith vector dimension; and taking the output value of the specified calculation formula as the similarity between the first vector and the third vector.
In one embodiment, the step of performing, by the processor, standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized includes: judging whether a third data item exists in the similarity matrix of the data set to be standardized, wherein the similarity between the third data item and the data items of the data set to be standardized other than the third data item is smaller than a preset threshold value; if yes, taking the third data item as a new item category; obtaining a specified data item having the maximum similarity to a fourth data item in the similarity matrix of the data set to be standardized, wherein the fourth data item and the specified data item are each any data item of the data set to be standardized other than the third data item; and merging the fourth data item and the specified data item into the same item category.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in embodiments may include non-volatile and/or volatile memory. The non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (7)

1. A method of data normalization, comprising:
acquiring a first data item and a second data item, wherein the first data item and the second data item are any two data items in a data set to be standardized;
converting the first data item into a first vector corresponding to a literal representation and a second vector corresponding to a meaning representation, and converting the second data item into a third vector corresponding to a literal representation and a fourth vector corresponding to a meaning representation;
calculating the similarity between the first vector and the third vector, and calculating the similarity between the second vector and the fourth vector;
according to the calculation mode of the similarity between the first data item and the second data item, calculating the similarity between every two data items in the data set to be standardized respectively, and obtaining a first similarity matrix of literal representation of the data set to be standardized and a second similarity matrix of meaning representation of the data set to be standardized;
weighting and fusing the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized;
according to the similarity matrix of the data set to be standardized, carrying out standardized classification on data items in the data set to be standardized;
the step of converting the first data item into a first vector corresponding to a literal representation and a second vector corresponding to a meaning representation includes:
Acquiring an n-gram representation mode corresponding to the first data item and an item category corresponding to the first data item;
constructing the first vector according to the n-gram representation mode corresponding to the first data item;
capturing a context relation corresponding to the first data item in the item category;
constructing the second vector according to the context relation corresponding to the first data item;
the step of constructing the second vector according to the context relation corresponding to the first data item includes:
all data items contained in the item category corresponding to the first data item are formed into a data item set;
constructing data item pairs in the data item set in a pairwise combination manner;
inputting the data item pair into a first neural network to obtain the second vector;
the n-gram comprises a 1-gram, a 2-gram and a 3-gram, and the step of constructing the first vector according to the n-gram representation corresponding to the first data item comprises the following steps:
correspondingly converting the first data item into a 1-gram representation mode, a 2-gram representation mode and a 3-gram representation mode respectively;
sequentially combining the 1-gram representation, the 2-gram representation and the 3-gram representation into a feature combination;
And inputting the characteristic combination into a second neural network to obtain the first vector.
2. The method of data normalization according to claim 1, wherein, prior to the step of converting the first data item into a 1-gram representation, a 2-gram representation, and a 3-gram representation, respectively, the method comprises:
judging whether preset characters exist in the first data item, wherein the preset characters are characters except Chinese and English;
if yes, deleting the preset character;
judging whether uppercase English characters exist in the first data item from which the preset characters have been deleted;
if yes, modifying the uppercase English characters into lowercase English characters.
3. The method of data normalization according to claim 1, characterized in that the step of calculating the similarity between the first vector and the third vector comprises:
substituting the first vector and the third vector into a specified calculation formula, wherein the specified calculation formula is:
similarity = ( Σ_{i=1}^{n} A_i × B_i ) / ( √( Σ_{i=1}^{n} A_i² ) × √( Σ_{i=1}^{n} B_i² ) )
wherein similarity represents similarity, a represents the first vector, B represents the third vector, the first vector and the third vector have the same vector dimension, n represents a vector dimension, and i represents an ith vector dimension;
And taking the output value of the specified calculation formula as the similarity between the first vector and the third vector.
4. The method of claim 1, wherein the step of normalized classifying data items in the data set to be normalized according to a similarity matrix of the data set to be normalized comprises:
judging whether a third data item exists in the similarity matrix of the data set to be standardized, wherein the similarity between the third data item and data items except the third data item of the data set to be standardized is smaller than a preset threshold value;
if yes, the third data item is used as a new item category;
obtaining a specified data item having the maximum similarity to a fourth data item in the similarity matrix of the data set to be standardized, wherein the fourth data item and the specified data item are each any data item of the data set to be standardized other than the third data item;
and combining the fourth data item and the specified data item into the same item category.
5. An apparatus for normalizing data, comprising:
The system comprises an acquisition module, a data storage module and a data storage module, wherein the acquisition module is used for acquiring a first data item and a second data item, and the first data item and the second data item are any two data items in a data set to be standardized;
the conversion module is used for respectively converting the first data item into a first vector corresponding to literal representation and a second vector corresponding to meaning representation, and respectively converting the second data item into a third vector corresponding to literal representation and a fourth vector corresponding to meaning representation;
a first calculation module, configured to calculate a similarity between the first vector and the third vector, and calculate a similarity between the second vector and the fourth vector;
the second calculation module is used for respectively calculating the similarity between every two data items in the data set to be standardized according to the calculation mode of the similarity between the first data item and the second data item to obtain a first similarity matrix of literal representation of the data set to be standardized and a second similarity matrix of meaning representation of the data set to be standardized;
the fusion module is used for carrying out weighted fusion on the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized;
The classification module is used for carrying out standardized classification on data items in the data set to be standardized according to the similarity matrix of the data set to be standardized;
wherein the conversion module further comprises:
a first acquisition unit, configured to acquire the n-gram representation corresponding to the first data item and the item category corresponding to the first data item;
a first construction unit, configured to construct the first vector according to the n-gram representation corresponding to the first data item;
a capturing unit, configured to capture, within the item category, the context relationship corresponding to the first data item;
a second construction unit, configured to construct the second vector according to the context relationship corresponding to the first data item;
wherein the second construction unit comprises:
a composition subunit, configured to compose all the data items included in the item category corresponding to the first data item into a data item set;
a construction subunit, configured to construct pairs of data items from the data item set by pairwise combination;
a first input subunit, configured to input the pair of data items into a first neural network, and obtain the second vector;
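The composition and construction subunits amount to forming every unordered pair of data items within one item category; this can be sketched with the standard library (the downstream neural-network call is omitted, and the example item names are hypothetical):

```python
from itertools import combinations

def build_pairs(item_category: list[str]) -> list[tuple[str, str]]:
    """Combine all data items of one item category into unordered pairs,
    which would then be input to the first neural network."""
    return list(combinations(item_category, 2))

# hypothetical item category containing three synonymous data items
pairs = build_pairs(["blood pressure", "BP", "arterial pressure"])
```

Three items yield three pairs; in general, a category of n items yields n·(n−1)/2 pairs.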
wherein the n-gram comprises a 1-gram, a 2-gram, and a 3-gram, and the first construction unit comprises:
a conversion subunit, configured to correspondingly convert the first data item into a 1-gram representation, a 2-gram representation, and a 3-gram representation, respectively;
a combination subunit, configured to sequentially combine the 1-gram representation, the 2-gram representation, and the 3-gram representation into a feature combination;
and a second input subunit, configured to input the feature combination into a second neural network to obtain the first vector.
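The 1-gram/2-gram/3-gram conversion and sequential combination can be illustrated at the character level (a sketch under the assumption that the grams are character n-grams, which the claims do not specify):

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """All contiguous character n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_feature_combination(text: str) -> list[str]:
    """Sequentially combine the 1-gram, 2-gram, and 3-gram representations
    into one feature combination, as the combination subunit does, before
    the result is input to a second neural network."""
    return char_ngrams(text, 1) + char_ngrams(text, 2) + char_ngrams(text, 3)

features = ngram_feature_combination("data")
```

For the four-character string "data" this yields four 1-grams, three 2-grams, and two 3-grams, nine features in total.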
6. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 4.
7. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 4.
CN202110567575.1A 2021-05-24 2021-05-24 Method, device, equipment and storage medium for data standardization Active CN113269248B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110567575.1A CN113269248B (en) 2021-05-24 2021-05-24 Method, device, equipment and storage medium for data standardization


Publications (2)

Publication Number Publication Date
CN113269248A CN113269248A (en) 2021-08-17
CN113269248B true CN113269248B (en) 2023-06-23

Family

ID=77232564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110567575.1A Active CN113269248B (en) 2021-05-24 2021-05-24 Method, device, equipment and storage medium for data standardization

Country Status (1)

Country Link
CN (1) CN113269248B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109597856A (en) * 2018-12-05 2019-04-09 北京知道创宇信息技术有限公司 A kind of data processing method, device, electronic equipment and storage medium
US10565498B1 (en) * 2017-02-28 2020-02-18 Amazon Technologies, Inc. Deep neural network-based relationship analysis with multi-feature token model
CN112527970A (en) * 2020-12-24 2021-03-19 上海浦东发展银行股份有限公司 Data dictionary standardization processing method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8832655B2 (en) * 2011-09-29 2014-09-09 Accenture Global Services Limited Systems and methods for finding project-related information by clustering applications into related concept categories



Similar Documents

Publication Publication Date Title
CN110807154B (en) Recommendation method and system based on hybrid deep learning model
CN108563653B (en) Method and system for constructing knowledge acquisition model in knowledge graph
CN112036154B (en) Electronic medical record generation method and device based on inquiry dialogue and computer equipment
CN108108354B (en) Microblog user gender prediction method based on deep learning
CN111985228B (en) Text keyword extraction method, text keyword extraction device, computer equipment and storage medium
CN111858940A (en) Multi-head attention-based legal case similarity calculation method and system
CN112131883A (en) Language model training method and device, computer equipment and storage medium
CN113821635A (en) Text abstract generation method and system for financial field
CN115687609A (en) Zero sample relation extraction method based on Prompt multi-template fusion
CN111768820A (en) Paper medical record digitization and target detection model training method, device and storage medium
CN111428502A (en) Named entity labeling method for military corpus
CN113297374B (en) Text classification method based on BERT and word feature fusion
CN113269248B (en) Method, device, equipment and storage medium for data standardization
CN117290478A (en) Knowledge graph question-answering method, device, equipment and storage medium
CN111723572A (en) Chinese short text correlation measurement method based on CNN convolutional layer and BilSTM
CN116910190A (en) Method, device and equipment for acquiring multi-task perception model and readable storage medium
CN113779994B (en) Element extraction method, element extraction device, computer equipment and storage medium
CN113468874B (en) Biomedical relation extraction method based on graph convolution self-coding
CN115238645A (en) Asset data identification method and device, electronic equipment and computer storage medium
CN114491122A (en) Graph matching method for searching similar images
CN113553326A (en) Spreadsheet data processing method, device, computer equipment and storage medium
CN114064888A (en) Financial text classification method and system based on BERT-CNN
Chen et al. Text classification based on a new joint network
CN114626378A (en) Named entity recognition method and device, electronic equipment and computer readable storage medium
CN110633363A (en) Text entity recommendation method based on NLP and fuzzy multi-criterion decision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant