CN113269248B - Method, device, equipment and storage medium for data standardization - Google Patents

Method, device, equipment and storage medium for data standardization

Info

Publication number
CN113269248B
CN113269248B (application CN202110567575.1A)
Authority
CN
China
Prior art keywords
data
data item
vector
item
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110567575.1A
Other languages
Chinese (zh)
Other versions
CN113269248A (en)
Inventor
唐蕊
蒋雪涵
孙行智
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110567575.1A priority Critical patent/CN113269248B/en
Publication of CN113269248A publication Critical patent/CN113269248A/en
Application granted granted Critical
Publication of CN113269248B publication Critical patent/CN113269248B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the field of big data, and discloses a data standardization method, which comprises the following steps: acquiring a first data item and a second data item; converting the first data item into a corresponding first vector representing its literal form and a corresponding second vector representing its meaning, and converting the second data item into a corresponding third vector representing its literal form and a corresponding fourth vector representing its meaning; calculating the similarity between the first vector and the third vector and between the second vector and the fourth vector; calculating, in the same manner as the similarity between the first data item and the second data item, the similarity between every two data items in the data set to be standardized, so as to obtain a first similarity matrix for the literal representations of the data items and a second similarity matrix for their meaning representations; performing weighted fusion of the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized; and performing standardized classification of the data items in the data set to be standardized accordingly. The efficiency of data standardization and integration is thereby improved.

Description

Method, device, equipment and storage medium for data standardization
Technical Field
The present invention relates to the field of big data, and in particular, to a method, an apparatus, a device, and a storage medium for data standardization.
Background
Nowadays, with the rapid development of medical informatization, more and more medical data are stored in electronic form. However, different medical systems follow different data specifications, that is, they represent the same data in different ways, so the same medical entity may have several different names. Because the systems name medical entities differently, problems arise when data are exchanged between them. The prevailing approach at present is to organize the data manually, associating medical entities that have different names but the same meaning in different systems with a single standardized entity name so as to unify the data. This approach, however, requires considerable manpower, makes comprehensive integration of the data difficult, and is time-consuming, laborious and inefficient.
Disclosure of Invention
The main purpose of the application is to provide a data standardization method, which aims to solve the technical problem that comprehensive and automatic standardization of medical data under different systems cannot be realized at present.
The application provides a data standardization method, which comprises the following steps:
acquiring a first data item and a second data item, wherein the first data item and the second data item are any two data items in a data set to be standardized;
converting the first data item into a corresponding first vector representing its literal form and a corresponding second vector representing its meaning, and converting the second data item into a corresponding third vector representing its literal form and a corresponding fourth vector representing its meaning;
calculating the similarity between the first vector and the third vector, and calculating the similarity between the second vector and the fourth vector;
according to the calculation mode of the similarity between the first data item and the second data item, calculating the similarity between every two data items in the data set to be standardized respectively, and obtaining a first similarity matrix of literal representation of the data set to be standardized and a second similarity matrix of meaning representation of the data set to be standardized;
weighting and fusing the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized;
and carrying out standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized.
Preferably, the step of converting the first data item into a corresponding first vector representing its literal form and a corresponding second vector representing its meaning includes:
Acquiring an n-gram representation mode corresponding to the first data item and an item category corresponding to the first data item;
constructing the first vector according to the n-gram representation mode corresponding to the first data item;
capturing a context relation corresponding to the first data item in the item category;
and constructing the second vector according to the context relation corresponding to the first data item.
Preferably, the step of constructing the second vector according to the context relation corresponding to the first data item includes:
all data items contained in the item category corresponding to the first data item are formed into a data item set;
constructing data item pairs in the data item set in a pairwise combination manner;
and inputting the data item pair into a first neural network to obtain the second vector.
Preferably, the n-gram includes 1-gram, 2-gram and 3-gram, and the step of constructing the first vector according to the n-gram representation corresponding to the first data item includes:
correspondingly converting the first data item into a 1-gram representation mode, a 2-gram representation mode and a 3-gram representation mode respectively;
Sequentially combining the 1-gram representation, the 2-gram representation and the 3-gram representation into a feature combination;
and inputting the characteristic combination into a second neural network to obtain the first vector.
Preferably, before the step of converting the first data item into a 1-gram representation, a 2-gram representation and a 3-gram representation, respectively, the method comprises:
judging whether preset characters exist in the first data item, wherein the preset characters are characters except Chinese and English;
if yes, deleting the preset character;
judging whether capital English characters exist in the first data item from which the preset characters have been deleted;
if yes, modifying the uppercase English character into lowercase English character.
Preferably, the step of calculating the similarity between the first vector and the third vector includes:
substituting the first vector and the third vector into a specified calculation formula, wherein the specified calculation formula is:
similarity = (Σ_{i=1}^{n} A_i × B_i) / (√(Σ_{i=1}^{n} A_i²) × √(Σ_{i=1}^{n} B_i²))
wherein similarity represents the similarity, A represents the first vector, B represents the third vector, the first vector and the third vector have the same vector dimension, n represents the number of vector dimensions, and i represents the ith vector dimension;
And taking the output value of the specified calculation formula as the similarity between the first vector and the third vector.
Preferably, the step of performing standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized includes:
judging whether a third data item exists in the similarity matrix of the data set to be standardized, wherein the similarity between the third data item and data items except the third data item of the data set to be standardized is smaller than a preset threshold value;
if yes, the third data item is used as a new item category;
the method comprises the steps of obtaining a designated data item with maximum similarity to a fourth data item in a similarity matrix of a data set to be standardized, wherein the fourth data item and the designated data item are any data item except a third data item of the data set to be standardized;
and combining the fourth data item and the specified data item into the same item category.
The application also provides a device for data standardization, which comprises:
an acquisition module, configured to acquire a first data item and a second data item, wherein the first data item and the second data item are any two data items in a data set to be standardized;
a conversion module, configured to convert the first data item into a corresponding first vector representing its literal form and a corresponding second vector representing its meaning, and to convert the second data item into a corresponding third vector representing its literal form and a corresponding fourth vector representing its meaning;
a first calculation module, configured to calculate a similarity between the first vector and the third vector, and calculate a similarity between the second vector and the fourth vector;
the second calculation module is used for respectively calculating the similarity between every two data items in the data set to be standardized according to the calculation mode of the similarity between the first data item and the second data item to obtain a first similarity matrix of literal representation of the data set to be standardized and a second similarity matrix of meaning representation of the data set to be standardized;
the fusion module is used for carrying out weighted fusion on the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized;
and the classification module is used for carrying out standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized.
The present application also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.
The present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above-described method.
According to the method, through natural language processing and deep neural network learning, the data items corresponding to entity data in the medical field, such as medical entity names, are vectorized using their literal representations as features, the contextual meaning representations of the medical entity names are vectorized as well, the similarity between every two vectors is calculated, and a similarity matrix over all data items is established to obtain a data standardization criterion for medical entities. Data standardization is thus achieved, automatic standardization of data between different systems is facilitated, manpower is freed, and the efficiency of data standardization and integration is improved.
Drawings
FIG. 1 is a schematic flow chart of a method for data normalization according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a learning model structure represented by the meaning of an embodiment of the present application;
FIG. 3 is a schematic diagram of a literally-represented learning model structure according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an apparatus for data normalization according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an internal structure of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Referring to fig. 1, a method for data normalization according to an embodiment of the present application includes:
s1: acquiring a first data item and a second data item, wherein the first data item and the second data item are any two data items in a data set to be standardized;
s2: converting the first data item into a corresponding first vector representing its literal form and a corresponding second vector representing its meaning, and converting the second data item into a corresponding third vector representing its literal form and a corresponding fourth vector representing its meaning;
s3: calculating the similarity between the first vector and the third vector, and calculating the similarity between the second vector and the fourth vector;
s4: according to the calculation mode of the similarity between the first data item and the second data item, calculating the similarity between every two data items in the data set to be standardized respectively, and obtaining a first similarity matrix of literal representation of the data set to be standardized and a second similarity matrix of meaning representation of the data set to be standardized;
s5: weighting and fusing the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized;
s6: and carrying out standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized.
In this embodiment of the present application, the data set to be standardized is a data set in the medical field; because different hospitals and different doctors describe the same medical scene in different ways, the medical data are not uniform. The data items include, but are not limited to, medical entities such as disease names, drug names, examination names, or diagnostic conclusions. The first vector, corresponding to the literal representation, is obtained from character combination features: for example, n-gram-based word segmentation yields the character combination features of the data item's character string, which are converted into a vector by a deep neural network. The second vector, corresponding to the meaning representation, is obtained from context-based features of the meaning expressed by the character string, which are input into another deep neural network. The two deep neural networks have the same structure but different training data, and hence different network parameters; they serve different functions, realizing literal vector conversion and meaning vector conversion respectively. The similarity of two data items considers both the literal and the meaning representations so as to improve the accuracy of data classification. The similarity between every two data items is calculated from their literal vectors, giving a first similarity matrix whose row information and column information are both the set of all data items. Similarly, the similarity between every two data items is calculated from their meaning vectors, giving a second similarity matrix whose row information and column information are likewise all data items.
The data items are arranged in the same order in the row information and column information of the two similarity matrices. A similarity matrix that considers the literal representation and the meaning representation simultaneously is obtained through weighted fusion, and the data items are then classified and combined according to a preset classification threshold and the maximum similarity between data items in that matrix, thereby realizing data standardization.
According to the method, through natural language processing and deep neural network learning, the data items corresponding to entity data in the medical field, such as medical entity names, are vectorized using their literal representations as features, the contextual meaning representations of the medical entity names are vectorized as well, the similarity between every two vectors is calculated, and a similarity matrix over all data items is established to obtain a data standardization criterion for medical entities. Data standardization is thus achieved, automatic standardization of data between different systems is facilitated, manpower is freed, and the efficiency of data standardization and integration is improved.
Further, the step S2 of converting the first data item into a corresponding first vector representing its literal form and a corresponding second vector representing its meaning includes:
s21: acquiring an n-gram representation mode corresponding to the first data item and an item category corresponding to the first data item;
s22: constructing the first vector according to the n-gram representation mode corresponding to the first data item;
s23: capturing a context relation corresponding to the first data item in the item category;
s24: and constructing the second vector according to the context relation corresponding to the first data item.
The n-gram representation mode can be selected according to requirements. For example, weighing calculation amount against classification accuracy, 1-gram, 2-gram and 3-gram are preferably used simultaneously: the 1-gram, 2-gram and 3-gram representations are spliced into an input feature, which is input into a deep neural network to obtain the first vector.
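The splicing of 1-gram, 2-gram and 3-gram representations described above can be sketched as follows (a minimal illustration, not part of the patent text; the function names and the representation of a data item as a list of segmentation units are assumptions):

```python
def ngrams(units, n):
    # Contiguous runs of n segmentation units, joined into one n-gram string.
    return [''.join(units[i:i + n]) for i in range(len(units) - n + 1)]

def gram_features(units):
    # Splice the 1-gram, 2-gram and 3-gram representations into one input feature.
    return ngrams(units, 1) + ngrams(units, 2) + ngrams(units, 3)
```

For a data item segmented into units ['a', 'b', 'c'], this yields ['a', 'b', 'c', 'ab', 'bc', 'abc'].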
The context relation refers to the meaning association between data items: if the meanings of two data items are similar, they are considered data items with a stronger context relation. Data item pairs representing the context relation are input into a deep neural network as features to obtain the second vector.
Further, the step S24 of constructing the second vector according to the context corresponding to the first data item includes:
s241: all data items contained in the item category corresponding to the first data item are formed into a data item set;
s242: constructing data item pairs in the data item set in a pairwise combination manner;
s243: and inputting the data item pair into a first neural network to obtain the second vector.
The item categories include classification criteria such as diagnosis, disease, or patient; for example, medical information having the same diagnosis may be considered one category, and medical information belonging to the same patient may be considered one category. For example, suppose the data are as follows: {patient 1: diagnosis 1}: {category 1: {data item A, data item B}, category 2: {data item C, data item D}}; {patient 2: diagnosis 1}: {category 1: {data item E, data item F}, category 2: {data item G, data item H}}. The diagnoses of the two patients above are identical, both "diagnosis 1", so the data of the two patients are combined in order to consider the context relation between the data items within each category under the same diagnosis. Combining the data of the same diagnosis gives: diagnosis 1: {category 1: {data item A, data item B, data item E, data item F}, category 2: {data item C, data item D, data item G, data item H}}. Data item pairs (input feature, output label) are then constructed by considering the context of the data items within "diagnosis 1-category 1" and within "diagnosis 1-category 2", i.e. within each category of the same diagnosis. The constructed data item pairs are as follows. In "diagnosis 1-category 1": (data item A, data item B), (data item A, data item E), (data item B, data item F), etc., i.e., all data items in "diagnosis 1-category 1" are combined two by two. In "diagnosis 1-category 2": (data item C, data item D), (data item C, data item G), (data item D, data item H), etc., i.e., all data items in "diagnosis 1-category 2" are combined two by two.
The first neural network is a deep learning network with three fully-connected hidden layers, as shown in fig. 2, trained in an unsupervised manner; it is the learning model of the meaning representation. The input vector of the first neural network is the input feature and the output vector is the output label, i.e. the input and the output differ. The input features and output labels are represented as one-hot encoding vectors, where each dimension of the vector corresponds to a data item: a value of 1 in a dimension indicates that the data item is present, and a value of 0 indicates that it is absent. Specifically, the training data of the first neural network construct features and their corresponding labels in two ways, both capturing the context that carries the meaning of the data. The first way: for data items associated with the same diagnosis or illness, the context is built within each category obtained by category division, i.e. data item pairs are constructed, the two data items of each pair forming training data as input feature and output label respectively. For example, if different data items under the same diagnosis involve different drug names, the drug names of all the data items are collected and de-duplicated to obtain the full drug set corresponding to that diagnosis, and every pair of drug names in that set (input feature, output label) is used as training data. The second way: for data items associated with one visit of the same patient, the context is built within each category obtained by category division, i.e. data item pairs are constructed as training data.
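The pairwise construction of (input feature, output label) training pairs within one category can be sketched as follows (illustrative only; whether both orderings of each pair are used as training data is an assumption):

```python
from itertools import permutations

def build_training_pairs(category_items):
    # Combine all data items of one category two by two into
    # (input feature, output label) pairs; both orderings are kept here.
    return list(permutations(category_items, 2))
```

For a category {A, B, C} this produces the six ordered pairs (A, B), (A, C), (B, A), (B, C), (C, A), (C, B).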
Each data item is expressed as a one-hot encoding vector, and the neural network is trained in an unsupervised manner to obtain the learning model of the meaning representation, so that each data item can be vectorized from its meaning representation to obtain a meaning-representation vector.
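The one-hot encoding described above can be sketched as follows (illustrative; how each data item is assigned its index in the vocabulary is an assumption):

```python
def one_hot(index, vocab_size):
    # One-hot encoding: the dimension corresponding to the given
    # data item is 1, and every other dimension is 0.
    vec = [0] * vocab_size
    vec[index] = 1
    return vec
```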
Further, the n-gram includes 1-gram, 2-gram and 3-gram, and the step S22 of constructing the first vector according to the n-gram representation corresponding to the first data item includes:
s221: correspondingly converting the first data item into a 1-gram representation mode, a 2-gram representation mode and a 3-gram representation mode respectively;
s222: sequentially combining the 1-gram representation, the 2-gram representation and the 3-gram representation into a feature combination;
s223: and inputting the characteristic combination into a second neural network to obtain the first vector.
For example, for the "hepatitis b e antibody assay anti hbe" data item, the corresponding 1-gram, 2-gram and 3-gram representations are, respectively, "b, liver, e, antibody, body, assay, anti hbe", "hepatitis b, liver e, e-antibody, body assay, anti hbe" and "hepatitis b e, liver e antibody, e-antibody, body assay, assay anti hbe" (in the original Chinese name, each single character is one segmentation unit and the continuous English segment is one unit, and each n-gram combines n adjacent units). The 1-gram, 2-gram and 3-gram representations are sequentially combined into an input feature vector, which is used as training data to train the second neural network; as shown in fig. 3, the low-dimensional dense vector output by the middle hidden layer is taken as the output vector, i.e. the literal-representation vector corresponding to the data item.
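The segmentation rule described here — each Chinese character as one unit, each continuous English segment as one unit — could be sketched like this (illustrative; the function name and the exact regular expression are assumptions):

```python
import re

def segment(name):
    # Each continuous run of English letters is one unit; every
    # Chinese character (CJK Unified Ideographs) is a unit of its own.
    return re.findall(r'[A-Za-z]+|[\u4e00-\u9fff]', name)
```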
The second neural network is a deep learning network with an autoencoder structure of three fully-connected hidden layers, and the literal-representation learning model is obtained by training it on the training data. The second neural network constructs features from the literal representation of the data items, i.e. their character-string composition, using the three representation modes 1-gram, 2-gram and 3-gram; by training the network in an unsupervised manner, the data items can be vectorized from their literal representation to obtain literal-representation vectors.
The input features and the output vectors of the second neural network are identical, corresponding to the encoding of an autoencoder: by automatically learning from the input features, the network converts the high-dimensional sparse input vector into a low-dimensional dense vector, corresponding to the output of the middle hidden layer; this low-dimensional dense vector captures the important features of the input high-dimensional sparse vector, and the network then restores the low-dimensional dense vector to the high-dimensional sparse vector, completing model training. After network training is completed, the feature vector constructed from the 1-gram, 2-gram and 3-gram of a given data item is input, and the middle hidden layer of the network outputs a low-dimensional dense vector as the literal representation of that data item.
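A shape-level sketch of the autoencoder just described (the layer sizes, random initialization, and function names are assumptions; a real implementation would train the weights so that the output reconstructs the input):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def make_autoencoder(input_dim, hidden_dims=(256, 64, 256), seed=0):
    # Three fully-connected hidden layers; the middle (bottleneck) layer's
    # output serves as the low-dimensional dense literal-representation vector.
    rng = np.random.default_rng(seed)
    dims = (input_dim,) + tuple(hidden_dims) + (input_dim,)
    return [rng.normal(scale=0.01, size=(a, b)) for a, b in zip(dims, dims[1:])]

def encode(weights, x):
    # Forward pass up to the middle hidden layer: high-dimensional sparse
    # n-gram features in, low-dimensional dense vector out.
    h = x
    for w in weights[:2]:
        h = relu(h @ w)
    return h
```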
Further, before step S221 of converting the first data item into a 1-gram representation, a 2-gram representation, and a 3-gram representation, the method includes:
s2201: judging whether preset characters exist in the first data item, wherein the preset characters are characters except Chinese and English;
s2202: if yes, deleting the preset character;
s2203: judging whether capital English characters exist in the first data item from which the preset characters have been deleted;
s2204: if yes, modifying the uppercase English character into lowercase English character.
In the embodiment of the application, in order to improve the accuracy of the constructed vectors, characters other than Chinese characters and English letters are first removed from the names, the English is unified into lowercase form, and the words are then segmented. For example, the "hepatitis b e antibody assay (Anti-HBe)" data item is first preprocessed into "hepatitis b e antibody assay anti hbe"; then each continuous English segment in the name is treated as a single character for word segmentation, yielding the corresponding 1-gram, 2-gram and 3-gram representations.
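The preprocessing step above can be sketched as follows (illustrative; treating each removed character as a word boundary is an assumption based on the example, in which "(Anti-HBe)" becomes "anti hbe"):

```python
import re

def normalize_name(name):
    # Replace any character that is neither a Chinese character nor an English
    # letter with a space, collapse whitespace, and lowercase the English.
    cleaned = re.sub(r'[^\u4e00-\u9fffA-Za-z]+', ' ', name)
    return ' '.join(cleaned.split()).lower()
```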
Further, the step S3 of calculating the similarity between the first vector and the third vector includes:
S31: substituting the first vector and the third vector into a specified calculation formula, wherein the specified calculation formula is:
Figure BDA0003081461740000091
wherein similarity represents similarity, a represents the first vector, B represents the third vector, the first vector and the third vector have the same vector dimension, n represents a vector dimension, and i represents an ith vector dimension;
s32: and taking the output value of the specified calculation formula as the similarity between the first vector and the third vector.
In the embodiment of the present application, each data item is represented by two vectors: a vector for the literal representation and a vector for the meaning representation. The similarity between every two data items is calculated from their literal vectors and from their meaning vectors respectively, in the manner described above, yielding a literal similarity matrix and a meaning similarity matrix over all data items. The two similarity matrices are then weighted and fused to obtain the final comprehensive similarity matrix. The weight ratio of the two matrices can be adjusted according to the requirements of the data classification task, provided that the two weights sum to 1. The preferred weight ratio of the two is 0.5:0.5.
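A brief sketch of this computation in pure Python (the function names are illustrative): cosine similarity between two equal-length vectors, plus the weighted fusion with weights constrained to sum to 1:

```python
import math

def cosine_similarity(a, b):
    """similarity = (Σ a_i·b_i) / (√Σ a_i² · √Σ b_i²) for equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def fuse(literal_m, meaning_m, w_literal=0.5):
    """Weighted fusion of the literal and meaning similarity matrices."""
    w_meaning = 1.0 - w_literal          # the two weights must sum to 1
    return [[w_literal * l + w_meaning * m
             for l, m in zip(lrow, mrow)]
            for lrow, mrow in zip(literal_m, meaning_m)]

print(cosine_similarity([1, 0], [1, 0]))   # 1.0
```

With `w_literal=0.5` this reproduces the preferred 0.5:0.5 weighting; other task-dependent ratios only require changing that one argument.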
Further, the step S6 of performing standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized includes:
s61: judging whether a third data item exists in the similarity matrix of the data set to be standardized, wherein the similarity between the third data item and data items except the third data item of the data set to be standardized is smaller than a preset threshold value;
s62: if yes, the third data item is used as a new item category;
s63: the method comprises the steps of obtaining a designated data item with maximum similarity to a fourth data item in a similarity matrix of a data set to be standardized, wherein the fourth data item and the designated data item are any data item except a third data item of the data set to be standardized;
s64: and combining the fourth data item and the specified data item into the same item category.
For example, if the preset threshold is 0.5, the data items whose similarity to all other data items in the comprehensive similarity matrix is lower than 0.5 are separated out; such a data item has low similarity with every other data item and should therefore stand alone as a standard data item. For each of the remaining data items, the item with the highest similarity to it is then selected for classification and integration, and the data items are aggregated to complete the data standardization.
Table 1

             Data item 1  Data item 2  Data item 3  Data item 4  Data item 5  Data item 6
Data item 1      1            0.9          0.8          0.7          0.6          0.2
Data item 2      0.9          1            0.6          0.8          0.8          0.1
Data item 3      0.8          0.6          1            0.6          0.7          0.2
Data item 4      0.7          0.8          0.6          1            0.9          0.3
Data item 5      0.6          0.8          0.7          0.9          1            0.4
Data item 6      0.2          0.1          0.2          0.3          0.4          1
For example, in Table 1 the similarity between data item 6 and every other data item is below the threshold value 0.5, i.e. data item 6 is not similar to any other data item, so data item 6 forms a category on its own and cannot be merged with the other data items. For each of the remaining five data items, the data item with the highest similarity is selected as its similar item, i.e. for each row the data item corresponding to the maximum off-diagonal value in that row is chosen for classification. Data items 1, 2 and 3 are thus aggregated together, data items 4 and 5 are aggregated together, and data item 6, which falls below the preset threshold, forms a cluster on its own, finally yielding three standardized data items and completing the standardization of the data items. Merge 1: data item 1, data item 2, data item 3; merge 2: data item 4, data item 5; merge 3: data item 6.
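The classification above can be sketched as follows on the Table 1 matrix; the union-find helper and the restriction of the row maximum to non-singleton items are implementation choices not spelled out in the text:

```python
# Items whose similarity to every other item is below the threshold become
# singleton categories; each remaining item is merged with its most similar
# item (the row maximum), and a small union-find collects the clusters.
THRESHOLD = 0.5
M = [
    [1.0, 0.9, 0.8, 0.7, 0.6, 0.2],
    [0.9, 1.0, 0.6, 0.8, 0.8, 0.1],
    [0.8, 0.6, 1.0, 0.6, 0.7, 0.2],
    [0.7, 0.8, 0.6, 1.0, 0.9, 0.3],
    [0.6, 0.8, 0.7, 0.9, 1.0, 0.4],
    [0.2, 0.1, 0.2, 0.3, 0.4, 1.0],
]

n = len(M)
parent = list(range(n))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]    # path halving
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

# Step 1: items with all off-diagonal similarities below the threshold.
singletons = {i for i in range(n)
              if all(M[i][j] < THRESHOLD for j in range(n) if j != i)}

# Step 2: merge every remaining item with its most similar non-singleton peer.
for i in range(n):
    if i in singletons:
        continue
    best = max((j for j in range(n) if j != i and j not in singletons),
               key=lambda j: M[i][j])
    union(i, best)

clusters = {}
for i in range(n):
    clusters.setdefault(find(i), []).append(i + 1)   # 1-based item numbers
print(sorted(clusters.values()))   # [[1, 2, 3], [4, 5], [6]]
```

Running this on the Table 1 values reproduces the three standardized categories described in the example.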
Referring to fig. 4, an apparatus for data normalization according to an embodiment of the present application includes:
an acquisition module 1, configured to acquire a first data item and a second data item, where the first data item and the second data item are any two data items in a data set to be standardized;
a conversion module 2, configured to convert the first data items into a first vector corresponding to a literal representation and a second vector corresponding to a meaning representation, and convert the second data items into a third vector corresponding to a literal representation and a fourth vector corresponding to a meaning representation;
a first calculating module 3, configured to calculate a similarity between the first vector and the third vector, and calculate a similarity between the second vector and the fourth vector;
the second calculating module 4 is configured to calculate, according to the calculation manner of the similarity between the first data item and the second data item, the similarity between every two data items in the data set to be standardized, so as to obtain a first similarity matrix of the literal representation of the data set to be standardized and a second similarity matrix of the meaning representation of the data set to be standardized;
The fusion module 5 is used for carrying out weighted fusion on the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized;
and the classification module 6 is used for carrying out standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized.
The explanations given in the corresponding method embodiments also apply to the embodiments of the present apparatus and are not repeated here.
Further, the conversion module 2 includes:
the first acquisition unit is used for acquiring the n-gram representation mode corresponding to the first data item and the item category corresponding to the first data item;
the first construction unit is used for constructing the first vector according to the n-gram representation mode corresponding to the first data item;
a capturing unit, configured to capture, in the item category, a context relationship corresponding to the first data item;
and the second construction unit is used for constructing the second vector according to the context relation corresponding to the first data item.
Further, the second construction unit includes:
a composition subunit, configured to compose all the data items included in the item category corresponding to the first data item into a data item set;
A construction subunit, configured to construct pairs of data items in a pairwise combination manner in the data item set;
and the first input subunit is used for inputting the data item pair into a first neural network to obtain the second vector.
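The pairwise combination step might look like the following sketch; the sample item category and the downstream word2vec-style model are assumptions, since the text does not specify the architecture of the first neural network:

```python
from itertools import combinations

# All data items in the item category of the first data item are combined
# pairwise; each pair would then be fed to the first neural network to learn
# the meaning (context) vector of the data item.
item_category = ["乙肝e抗体", "Anti-HBe", "乙型肝炎e抗体测定"]   # toy category

pairs = list(combinations(item_category, 2))
print(pairs)
# [('乙肝e抗体', 'Anti-HBe'), ('乙肝e抗体', '乙型肝炎e抗体测定'),
#  ('Anti-HBe', '乙型肝炎e抗体测定')]
```

A category of k items yields k·(k-1)/2 pairs, so each data item appears in context with every other item of its category during training.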
Further, the n-gram includes 1-gram, 2-gram, and 3-gram, and the first construction unit includes:
a conversion subunit, configured to correspondingly convert the first data item into a 1-gram representation, a 2-gram representation, and a 3-gram representation, respectively;
a combination subunit, configured to sequentially combine the 1-gram representation, the 2-gram representation, and the 3-gram representation into a feature combination;
and the second input subunit is used for inputting the characteristic combination into a second neural network to obtain the first vector.
Further, the first construction unit includes:
a first judging subunit, configured to judge whether a preset character exists in the first data item, where the preset character is a character other than chinese and english;
a deleting subunit, configured to delete a preset character if the preset character exists;
a second judging subunit, configured to judge whether uppercase English characters exist in the first data item from which the preset characters have been deleted;
and a modification subunit, configured to modify the uppercase English characters into lowercase English characters if the uppercase English characters exist.
Further, the first computing module 3 includes:
a substituting unit, configured to substitute the first vector and the third vector into a specified calculation formula, where the specified calculation formula is:
similarity = ( Σ_{i=1}^{n} A_i × B_i ) / ( √( Σ_{i=1}^{n} A_i² ) × √( Σ_{i=1}^{n} B_i² ) )
wherein similarity represents similarity, a represents the first vector, B represents the third vector, the first vector and the third vector have the same vector dimension, n represents a vector dimension, and i represents an ith vector dimension;
and a first unit configured to use an output value of the specified calculation formula as a similarity between the first vector and the third vector.
Further, the classification module 6 includes:
the judging unit is used for judging whether a third data item exists in the similarity matrix of the data set to be standardized, wherein the similarity between the third data item and data items except the third data item of the data set to be standardized is smaller than a preset threshold value;
the second unit is used for taking the third data item as a new item category if yes;
A second obtaining unit, configured to obtain a specified data item with a maximum similarity to a fourth data item in a similarity matrix of the data set to be standardized, where the fourth data item and the specified data item are any data item except a third data item of the data set to be standardized;
and the merging unit is used for merging the fourth data item and the appointed data item into the same item category.
Referring to fig. 5, a computer device is further provided in an embodiment of the present application; the computer device may be a server, and its internal structure may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store all the data required by the data normalization process. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a method of data normalization.
The method for the processor to perform the data normalization comprises the following steps: acquiring a first data item and a second data item, wherein the first data item and the second data item are any two data items in a data set to be standardized; converting the first data item into a first vector corresponding to a literal representation and a second vector corresponding to a meaning representation, and converting the second data item into a third vector corresponding to a literal representation and a fourth vector corresponding to a meaning representation; calculating the similarity between the first vector and the third vector, and calculating the similarity between the second vector and the fourth vector; according to the calculation manner of the similarity between the first data item and the second data item, respectively calculating the similarity between every two data items in the data set to be standardized, to obtain a first similarity matrix of the literal representation of the data set to be standardized and a second similarity matrix of the meaning representation of the data set to be standardized; weighting and fusing the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized; and performing standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized.
Through natural language processing and deep neural network learning, the computer device vectorizes data items such as the medical entity names corresponding to entity data in the medical field, using their literal representations as features, likewise vectorizes the contextual meaning representations of the medical entity names, calculates the similarity between every two vectors, and establishes a similarity matrix between all data items to obtain a data standardization standard for the medical entities. Data standardization is thereby achieved, automatic data standardization between different systems is facilitated, manpower is freed, and the efficiency of data standardization and integration is improved.
In one embodiment, the step of converting the first data item into a first vector corresponding to a literal representation and a second vector corresponding to a meaning representation includes: acquiring an n-gram representation mode corresponding to the first data item and an item category corresponding to the first data item; constructing the first vector according to the n-gram representation mode corresponding to the first data item; capturing a context relation corresponding to the first data item in the item category; and constructing the second vector according to the context relation corresponding to the first data item.
In one embodiment, the step of constructing the second vector by the processor according to the context corresponding to the first data item includes: all data items contained in the item category corresponding to the first data item are formed into a data item set; constructing data item pairs in the data item set in a pairwise combination manner; and inputting the data item pair into a first neural network to obtain the second vector.
In one embodiment, the n-gram includes a 1-gram, a 2-gram, and a 3-gram, and the processor constructs the first vector according to the n-gram representation corresponding to the first data item, including: correspondingly converting the first data item into a 1-gram representation mode, a 2-gram representation mode and a 3-gram representation mode respectively; sequentially combining the 1-gram representation, the 2-gram representation and the 3-gram representation into a feature combination; and inputting the characteristic combination into a second neural network to obtain the first vector.
In one embodiment, before the step of converting the first data item into the 1-gram representation, the 2-gram representation, and the 3-gram representation, respectively, the method performed by the processor includes: judging whether preset characters exist in the first data item, wherein the preset characters are characters other than Chinese and English; if yes, deleting the preset characters; judging whether uppercase English characters exist in the first data item from which the preset characters have been deleted; and if yes, modifying the uppercase English characters into lowercase English characters.
In one embodiment, the step of calculating the similarity between the first vector and the third vector includes: substituting the first vector and the third vector into a specified calculation formula, wherein the specified calculation formula is:
similarity = ( Σ_{i=1}^{n} A_i × B_i ) / ( √( Σ_{i=1}^{n} A_i² ) × √( Σ_{i=1}^{n} B_i² ) )
wherein similarity represents the similarity, A represents the first vector, B represents the third vector, the first vector and the third vector have the same vector dimension, n represents the vector dimension, and i represents the ith vector dimension; and taking the output value of the specified calculation formula as the similarity between the first vector and the third vector.
In one embodiment, the step of performing, by the processor, standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized includes: judging whether a third data item exists in the similarity matrix of the data set to be standardized, wherein the similarity between the third data item and the data items of the data set to be standardized other than the third data item is smaller than a preset threshold value; if yes, taking the third data item as a new item category; obtaining a specified data item having the maximum similarity to a fourth data item in the similarity matrix of the data set to be standardized, wherein the fourth data item and the specified data item are each any data item of the data set to be standardized other than the third data item; and merging the fourth data item and the specified data item into the same item category.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of a portion of the architecture in connection with the present application and is not intended to limit the computer device to which the present application is applied.
An embodiment of the present application further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of data normalization, comprising: acquiring a first data item and a second data item, wherein the first data item and the second data item are any two data items in a data set to be standardized; converting the first data item into a first vector corresponding to a literal representation and a second vector corresponding to a meaning representation, and converting the second data item into a third vector corresponding to a literal representation and a fourth vector corresponding to a meaning representation; calculating the similarity between the first vector and the third vector, and calculating the similarity between the second vector and the fourth vector; according to the calculation manner of the similarity between the first data item and the second data item, respectively calculating the similarity between every two data items in the data set to be standardized, to obtain a first similarity matrix of the literal representation of the data set to be standardized and a second similarity matrix of the meaning representation of the data set to be standardized; weighting and fusing the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized; and performing standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized.
Through natural language processing and deep neural network learning, the computer readable storage medium enables data items such as the medical entity names corresponding to entity data in the medical field to be vectorized using their literal representations as features, the contextual meaning representations of the medical entity names to be likewise vectorized, the similarity between every two vectors to be calculated, and a similarity matrix between all data items to be established, obtaining a data standardization standard for the medical entities. Data standardization is thereby achieved, automatic data standardization between different systems is facilitated, manpower is freed, and the efficiency of data standardization and integration is improved.
In one embodiment, the step of converting the first data item into a first vector corresponding to a literal representation and a second vector corresponding to a meaning representation includes: acquiring an n-gram representation mode corresponding to the first data item and an item category corresponding to the first data item; constructing the first vector according to the n-gram representation mode corresponding to the first data item; capturing a context relation corresponding to the first data item in the item category; and constructing the second vector according to the context relation corresponding to the first data item.
In one embodiment, the step of constructing the second vector by the processor according to the context corresponding to the first data item includes: all data items contained in the item category corresponding to the first data item are formed into a data item set; constructing data item pairs in the data item set in a pairwise combination manner; and inputting the data item pair into a first neural network to obtain the second vector.
In one embodiment, the n-gram includes a 1-gram, a 2-gram, and a 3-gram, and the processor constructs the first vector according to the n-gram representation corresponding to the first data item, including: correspondingly converting the first data item into a 1-gram representation mode, a 2-gram representation mode and a 3-gram representation mode respectively; sequentially combining the 1-gram representation, the 2-gram representation and the 3-gram representation into a feature combination; and inputting the characteristic combination into a second neural network to obtain the first vector.
In one embodiment, before the step of converting the first data item into the 1-gram representation, the 2-gram representation, and the 3-gram representation, respectively, the method performed by the processor includes: judging whether preset characters exist in the first data item, wherein the preset characters are characters other than Chinese and English; if yes, deleting the preset characters; judging whether uppercase English characters exist in the first data item from which the preset characters have been deleted; and if yes, modifying the uppercase English characters into lowercase English characters.
In one embodiment, the step of calculating the similarity between the first vector and the third vector includes: substituting the first vector and the third vector into a specified calculation formula,
wherein, the specified calculation formula is:
similarity = ( Σ_{i=1}^{n} A_i × B_i ) / ( √( Σ_{i=1}^{n} A_i² ) × √( Σ_{i=1}^{n} B_i² ) )
wherein similarity represents similarity, a represents the first vector, B represents the third vector, the first vector and the third vector have the same vector dimension, n represents a vector dimension, and i represents an ith vector dimension; and taking the output value of the specified calculation formula as the similarity between the first vector and the third vector.
In one embodiment, the step of performing, by the processor, standardized classification on the data items in the data set to be standardized according to the similarity matrix of the data set to be standardized includes: judging whether a third data item exists in the similarity matrix of the data set to be standardized, wherein the similarity between the third data item and the data items of the data set to be standardized other than the third data item is smaller than a preset threshold value; if yes, taking the third data item as a new item category; obtaining a specified data item having the maximum similarity to a fourth data item in the similarity matrix of the data set to be standardized, wherein the fourth data item and the specified data item are each any data item of the data set to be standardized other than the third data item; and merging the fourth data item and the specified data item into the same item category.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in embodiments may include non-volatile and/or volatile memory. The non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (7)

1. A method of data normalization, comprising:
acquiring a first data item and a second data item, wherein the first data item and the second data item are any two data items in a data set to be standardized;
converting the first data item into a first vector corresponding to a literal representation and a second vector corresponding to a meaning representation, and converting the second data item into a third vector corresponding to a literal representation and a fourth vector corresponding to a meaning representation;
calculating the similarity between the first vector and the third vector, and calculating the similarity between the second vector and the fourth vector;
according to the calculation mode of the similarity between the first data item and the second data item, calculating the similarity between every two data items in the data set to be standardized respectively, and obtaining a first similarity matrix of literal representation of the data set to be standardized and a second similarity matrix of meaning representation of the data set to be standardized;
weighting and fusing the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized;
according to the similarity matrix of the data set to be standardized, carrying out standardized classification on data items in the data set to be standardized;
the step of converting the first data item into a first vector corresponding to a literal representation and a second vector corresponding to a meaning representation includes:
Acquiring an n-gram representation mode corresponding to the first data item and an item category corresponding to the first data item;
constructing the first vector according to the n-gram representation mode corresponding to the first data item;
capturing a context relation corresponding to the first data item in the item category;
constructing the second vector according to the context relation corresponding to the first data item;
the step of constructing the second vector according to the context relation corresponding to the first data item includes:
all data items contained in the item category corresponding to the first data item are formed into a data item set;
constructing data item pairs in the data item set in a pairwise combination manner;
inputting the data item pair into a first neural network to obtain the second vector;
the n-gram comprises a 1-gram, a 2-gram and a 3-gram, and the step of constructing the first vector according to the n-gram representation corresponding to the first data item comprises the following steps:
correspondingly converting the first data item into a 1-gram representation mode, a 2-gram representation mode and a 3-gram representation mode respectively;
sequentially combining the 1-gram representation, the 2-gram representation and the 3-gram representation into a feature combination;
And inputting the characteristic combination into a second neural network to obtain the first vector.
2. The method of data normalization according to claim 1, wherein, prior to the step of converting the first data item into a 1-gram representation, a 2-gram representation, and a 3-gram representation, respectively, the method comprises:
judging whether preset characters exist in the first data item, wherein the preset characters are characters except Chinese and English;
if yes, deleting the preset character;
judging whether uppercase English characters exist in the first data item from which the preset characters have been deleted;
if yes, modifying the uppercase English characters into lowercase English characters.
3. The method of data normalization according to claim 1, characterized in that the step of calculating the similarity between the first vector and the third vector comprises:
substituting the first vector and the third vector into a specified calculation formula, wherein the specified calculation formula is:
similarity = ( Σ_{i=1}^{n} A_i × B_i ) / ( √( Σ_{i=1}^{n} A_i² ) × √( Σ_{i=1}^{n} B_i² ) )
wherein similarity represents similarity, a represents the first vector, B represents the third vector, the first vector and the third vector have the same vector dimension, n represents a vector dimension, and i represents an ith vector dimension;
And taking the output value of the specified calculation formula as the similarity between the first vector and the third vector.
4. The method of claim 1, wherein the step of normalized classifying data items in the data set to be normalized according to a similarity matrix of the data set to be normalized comprises:
judging whether a third data item exists in the similarity matrix of the data set to be standardized, wherein the similarity between the third data item and data items except the third data item of the data set to be standardized is smaller than a preset threshold value;
if yes, the third data item is used as a new item category;
obtaining a specified data item having the maximum similarity to a fourth data item in the similarity matrix of the data set to be standardized, wherein the fourth data item and the specified data item are each any data item of the data set to be standardized other than the third data item;
and combining the fourth data item and the specified data item into the same item category.
5. An apparatus for normalizing data, comprising:
The system comprises an acquisition module, a data storage module and a data storage module, wherein the acquisition module is used for acquiring a first data item and a second data item, and the first data item and the second data item are any two data items in a data set to be standardized;
the conversion module is used for respectively converting the first data item into a first vector corresponding to literal representation and a second vector corresponding to meaning representation, and respectively converting the second data item into a third vector corresponding to literal representation and a fourth vector corresponding to meaning representation;
a first calculation module, configured to calculate a similarity between the first vector and the third vector, and calculate a similarity between the second vector and the fourth vector;
the second calculation module is used for respectively calculating the similarity between every two data items in the data set to be standardized according to the calculation mode of the similarity between the first data item and the second data item to obtain a first similarity matrix of literal representation of the data set to be standardized and a second similarity matrix of meaning representation of the data set to be standardized;
the fusion module is used for carrying out weighted fusion on the first similarity matrix and the second similarity matrix to obtain a similarity matrix of the data set to be standardized;
The classification module is used for carrying out standardized classification on data items in the data set to be standardized according to the similarity matrix of the data set to be standardized;
wherein the conversion module further comprises:
a first acquisition unit, configured to acquire the n-gram representation corresponding to the first data item and the item category corresponding to the first data item;
a first construction unit, configured to construct the first vector according to the n-gram representation corresponding to the first data item;
a capturing unit, configured to capture, within the item category, the context relationship corresponding to the first data item;
a second construction unit, configured to construct the second vector according to the context relationship corresponding to the first data item;
wherein the second construction unit comprises:
a composition subunit, configured to compose all the data items included in the item category corresponding to the first data item into a data item set;
a construction subunit, configured to construct pairs of data items from the data item set by pairwise combination;
a first input subunit, configured to input the pair of data items into a first neural network, and obtain the second vector;
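The composition and construction subunits amount to forming every unordered pair of data items within one item category; this can be sketched with the standard library (the downstream neural-network call is omitted, and the example item names are hypothetical):

```python
from itertools import combinations

def build_pairs(item_category: list[str]) -> list[tuple[str, str]]:
    """Combine all data items of one item category into unordered pairs,
    which would then be input to the first neural network."""
    return list(combinations(item_category, 2))

# hypothetical item category containing three synonymous data items
pairs = build_pairs(["blood pressure", "BP", "arterial pressure"])
```

Three items yield three pairs; in general, a category of n items yields n·(n−1)/2 pairs.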
wherein the n-gram comprises a 1-gram, a 2-gram, and a 3-gram, and the first construction unit comprises:
a conversion subunit, configured to correspondingly convert the first data item into a 1-gram representation, a 2-gram representation, and a 3-gram representation, respectively;
a combination subunit, configured to sequentially combine the 1-gram representation, the 2-gram representation, and the 3-gram representation into a feature combination;
and a second input subunit, configured to input the feature combination into a second neural network to obtain the first vector.
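The 1-gram/2-gram/3-gram conversion and sequential combination can be illustrated at the character level (a sketch under the assumption that the grams are character n-grams, which the claims do not specify):

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """All contiguous character n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_feature_combination(text: str) -> list[str]:
    """Sequentially combine the 1-gram, 2-gram, and 3-gram representations
    into one feature combination, as the combination subunit does, before
    the result is input to a second neural network."""
    return char_ngrams(text, 1) + char_ngrams(text, 2) + char_ngrams(text, 3)

features = ngram_feature_combination("data")
```

For the four-character string "data" this yields four 1-grams, three 2-grams, and two 3-grams, nine features in total.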
6. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 4.
7. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 4.
CN202110567575.1A 2021-05-24 2021-05-24 Method, device, equipment and storage medium for data standardization Active CN113269248B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110567575.1A CN113269248B (en) 2021-05-24 2021-05-24 Method, device, equipment and storage medium for data standardization


Publications (2)

Publication Number Publication Date
CN113269248A CN113269248A (en) 2021-08-17
CN113269248B true CN113269248B (en) 2023-06-23

Family

ID=77232564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110567575.1A Active CN113269248B (en) 2021-05-24 2021-05-24 Method, device, equipment and storage medium for data standardization

Country Status (1)

Country Link
CN (1) CN113269248B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109597856A (en) * 2018-12-05 2019-04-09 北京知道创宇信息技术有限公司 A kind of data processing method, device, electronic equipment and storage medium
US10565498B1 (en) * 2017-02-28 2020-02-18 Amazon Technologies, Inc. Deep neural network-based relationship analysis with multi-feature token model
CN112527970A (en) * 2020-12-24 2021-03-19 上海浦东发展银行股份有限公司 Data dictionary standardization processing method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8832655B2 (en) * 2011-09-29 2014-09-09 Accenture Global Services Limited Systems and methods for finding project-related information by clustering applications into related concept categories



Similar Documents

Publication Publication Date Title
CN110807154B (en) Recommendation method and system based on hybrid deep learning model
CN108563653B (en) Method and system for constructing knowledge acquisition model in knowledge graph
CN112036154B (en) Electronic medical record generation method and device based on inquiry dialogue and computer equipment
CN108108354B (en) Microblog user gender prediction method based on deep learning
CN111985228B (en) Text keyword extraction method, text keyword extraction device, computer equipment and storage medium
CN111858940A (en) Multi-head attention-based legal case similarity calculation method and system
CN112131883A (en) Language model training method and device, computer equipment and storage medium
CN113821635A (en) Text abstract generation method and system for financial field
CN115687609A (en) Zero sample relation extraction method based on Prompt multi-template fusion
CN111768820A (en) Paper medical record digitization and target detection model training method, device and storage medium
CN111428502A (en) Named entity labeling method for military corpus
CN113297374B (en) Text classification method based on BERT and word feature fusion
CN113269248B (en) Method, device, equipment and storage medium for data standardization
CN117290478A (en) Knowledge graph question-answering method, device, equipment and storage medium
CN111723572A (en) Chinese short text correlation measurement method based on CNN convolutional layer and BilSTM
CN116910190A (en) Method, device and equipment for acquiring multi-task perception model and readable storage medium
CN113779994B (en) Element extraction method, element extraction device, computer equipment and storage medium
CN113468874B (en) Biomedical relation extraction method based on graph convolution self-coding
CN115238645A (en) Asset data identification method and device, electronic equipment and computer storage medium
CN114491122A (en) Graph matching method for searching similar images
CN113553326A (en) Spreadsheet data processing method, device, computer equipment and storage medium
CN114064888A (en) Financial text classification method and system based on BERT-CNN
Chen et al. Text classification based on a new joint network
CN114626378A (en) Named entity recognition method and device, electronic equipment and computer readable storage medium
CN110633363A (en) Text entity recommendation method based on NLP and fuzzy multi-criterion decision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant