CN111696636B - Data processing method and device based on deep neural network - Google Patents

Data processing method and device based on deep neural network

Info

Publication number
CN111696636B
Authority
CN
China
Prior art keywords
vector
category
medical record
quality
record data
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010412571.1A
Other languages
Chinese (zh)
Other versions
CN111696636A (en
Inventor
李彦轩
唐蕊
孙行智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202010412571.1A (CN111696636B)
Priority to PCT/CN2020/099539 (WO2021114637A1)
Publication of CN111696636A
Application granted
Publication of CN111696636B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The application relates to the technical field of artificial intelligence and discloses a data processing method and device based on a deep neural network. The method comprises the following steps: obtaining at least 2 training samples; sequentially inputting the at least 2 training samples into a constructed deep neural network (DNN) model for training, so that the loss function of the trained DNN model is reduced to a preset fluctuation range, the loss function of the DNN model being a four-tuple loss function; inputting the feature vector of medical record data to be predicted into the trained DNN model for processing to obtain a target embedded vector corresponding to the medical record data to be predicted; and determining the quality of the medical record data to be predicted according to the distance between the target embedded vector and a quality embedded vector and a preset quality abnormal distance. By adopting the embodiment of the application, the quality of medical record data can be screened from multiple aspects/angles, and the accuracy of quality screening is improved. In addition, the application can be applied to the field of intelligent medical treatment, thereby promoting the construction of intelligent cities.

Description

Data processing method and device based on deep neural network
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a data processing method and device based on a deep neural network.
Background
Electronic medical records are digitized medical records that are stored, managed, transmitted, and reproduced using electronic equipment, and they record the overall process of a patient undergoing diagnosis and treatment at a hospital. However, during the recording of an electronic medical record, quality problems such as unqualified or abnormal medical records are often caused by professional factors, such as misdiagnosis, or by non-professional factors, such as recording errors.
With the development of computer technology, computers can be used to screen electronic medical records for quality problems. At present, however, such screening is mainly based on objective rules formulated by humans, so its coverage is narrow and its accuracy is low.
Disclosure of Invention
The embodiment of the application provides a data processing method and device based on a deep neural network, which can screen the quality of medical record data from multiple aspects/angles and improve the accuracy of quality screening.
In a first aspect, an embodiment of the present application provides a data processing method based on a deep neural network, where the method includes:
acquiring at least 2 training samples, wherein each training sample in the at least 2 training samples is a quadruple, the quadruple comprises a feature vector of an anchor point, a feature vector of a positive sample, a feature vector of a negative sample and a feature vector of a false sample, the anchor point is medical record data with qualified quality, the positive sample is medical record data with the same category as the anchor point and qualified quality, the negative sample is medical record data with a different category from the anchor point and qualified quality, and the false sample is medical record data with unqualified quality;
sequentially inputting the at least 2 training samples into a constructed deep neural network DNN model for training, so that the loss function of the DNN model after training is reduced to a preset fluctuation range, wherein the loss function of the DNN model is a four-tuple loss function, and the four-tuple loss function is determined by the differences between the embedded vector obtained by inputting the feature vector of the anchor point into the DNN model and the embedded vectors obtained by respectively inputting the feature vector of the positive sample, the feature vector of the negative sample and the feature vector of the false sample into the DNN model;
inputting the feature vector of the medical record data to be predicted into a trained DNN model for processing to obtain a target embedded vector corresponding to the medical record data to be predicted;
and determining the quality of the medical record data to be predicted according to the distance between the target embedded vector and the quality embedded vector and the preset quality abnormal distance.
With reference to the first aspect, in one possible implementation manner, the four-tuple loss function is:
L=d(a,p)-d(a,n)-k*d(a,F);
wherein L represents the four-tuple loss function, a represents the embedded vector obtained after the feature vector of the anchor point is input into the DNN model, p represents the embedded vector obtained after the feature vector of the positive sample is input into the DNN model, n represents the embedded vector obtained after the feature vector of the negative sample is input into the DNN model, F represents the embedded vector obtained after the feature vector of the false sample is input into the DNN model, k is a coefficient, d(a,p) represents the distance between a and p, d(a,n) represents the distance between a and n, and d(a,F) represents the distance between a and F.
With reference to the first aspect, in one possible implementation manner, determining the quality of the medical record data to be predicted according to the distance between the target embedding vector and the quality embedding vector and the preset quality anomaly distance includes:
if the distance between the target embedded vector and the quality embedded vector is greater than or equal to the preset quality abnormal distance, determining that the quality of the medical record data to be predicted is unqualified; and if the distance between the target embedded vector and the quality embedded vector is smaller than the quality abnormal distance, determining that the quality of the medical record data to be predicted is qualified.
With reference to the first aspect, in one possible implementation manner, before determining the quality of the medical record data to be predicted according to the distance between the target embedding vector and the quality embedding vector and the preset quality anomaly distance, the method further includes:
sequentially inputting the feature vectors of all the false samples in the at least 2 training samples into the trained DNN model for processing to obtain the embedded vectors corresponding to all the false samples, wherein one false sample corresponds to one embedded vector; and determining the mean vector of the embedded vectors corresponding to all the false samples as the quality embedded vector.
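The two steps described above (averaging the embedded vectors of the false samples, then comparing a distance against the preset quality abnormal distance) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the function names, the two-dimensional embedded vectors and the threshold value 5.0 are hypothetical, and the L2 distance is assumed as the distance measure.

```python
import math


def l2_distance(x, y):
    """L2 (Euclidean) distance between two equal-length vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def is_quality_qualified(target, quality_embedding, anomaly_distance):
    """Decision rule as stated in the text: a distance greater than or equal
    to the preset quality abnormal distance means unqualified, a smaller
    distance means qualified."""
    return l2_distance(target, quality_embedding) < anomaly_distance


# Hypothetical embedded vectors of three false samples (illustrative values).
false_embeddings = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]]
quality_embedding = mean_vector(false_embeddings)

# Hypothetical target embedded vectors and threshold.
first_is_qualified = is_quality_qualified([2.0, 3.0], quality_embedding, 5.0)
second_is_qualified = is_quality_qualified([2.0, 5.0], quality_embedding, 5.0)
```

In this sketch the first target lies at distance 3 from the quality embedded vector and is treated as qualified, while the second lies at exactly the threshold distance and is treated as unqualified, matching the "greater than or equal to" wording of the rule.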
With reference to the first aspect, in one possible implementation manner, after determining that the quality of the medical record data to be predicted is qualified, the method further includes: and determining the category of the medical record data to be predicted according to the distance between the target embedded vector and each category embedded vector and the category distance corresponding to each category embedded vector.
With reference to the first aspect, in one possible implementation manner, determining the category of the medical record data to be predicted according to the distance between the target embedded vector and each category embedded vector and the category distance corresponding to each category embedded vector includes: if the distance between the target embedded vector and the class embedded vector w in each class embedded vector is smaller than or equal to the class distance corresponding to the class embedded vector w, determining the class of the medical record data to be predicted as a first class, wherein the first class is the class corresponding to the class embedded vector w.
With reference to the first aspect, in a possible implementation manner, the method further includes: if the distance between the target embedded vector and each category embedded vector is larger than the category distance corresponding to each category embedded vector, determining the category of the medical record data to be predicted as a second category, wherein the second category is different from the category corresponding to each category embedded vector.
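The category rule above can be illustrated with a short Python sketch. It is a hedged example only: the category names, the category embedded vectors, the per-category distances and the label "other" for the second category are hypothetical, the L2 distance is assumed, and the tie-break used when several categories match (pick the nearest one) is an assumption that the text does not specify.

```python
import math


def l2_distance(x, y):
    """L2 (Euclidean) distance between two equal-length vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def determine_category(target, category_embeddings, category_distances,
                       second_category="other"):
    """Return the category whose embedded vector lies within its own category
    distance of the target; fall back to the second category otherwise."""
    matches = []
    for name, embedding in category_embeddings.items():
        distance = l2_distance(target, embedding)
        if distance <= category_distances[name]:
            matches.append((distance, name))
    if not matches:
        return second_category
    # Assumption: if several categories match, choose the nearest one.
    return min(matches)[1]


# Hypothetical category embedded vectors and category distances.
category_embeddings = {"internal medicine": [0.0, 0.0], "surgery": [10.0, 0.0]}
category_distances = {"internal medicine": 2.0, "surgery": 3.0}

first = determine_category([1.0, 0.0], category_embeddings, category_distances)
second = determine_category([8.0, 0.0], category_embeddings, category_distances)
third = determine_category([5.0, 0.0], category_embeddings, category_distances)
```

Here the third target is farther from every category embedded vector than the corresponding category distance, so it falls into the second category.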
In a second aspect, an embodiment of the present application provides a data classification apparatus, including:
the acquisition unit is used for acquiring at least 2 training samples, wherein each training sample in the at least 2 training samples is a quadruple, the quadruple comprises a feature vector of an anchor point, a feature vector of a positive sample, a feature vector of a negative sample and a feature vector of a false sample, the anchor point is medical record data with qualified quality, the positive sample is medical record data with the same category as the anchor point and qualified quality, the negative sample is medical record data with a different category from the anchor point and qualified quality, and the false sample is medical record data with unqualified quality;
the training unit is used for sequentially inputting the at least 2 training samples into the constructed deep neural network DNN model for training, so that the loss function of the DNN model after training is reduced to a preset fluctuation range, wherein the loss function of the DNN model is a four-tuple loss function, and the four-tuple loss function is determined by the differences between the embedded vector obtained by inputting the feature vector of the anchor point into the DNN model and the embedded vectors obtained by respectively inputting the feature vector of the positive sample, the feature vector of the negative sample and the feature vector of the false sample into the DNN model;
The processing unit is used for inputting the feature vector of the medical record data to be predicted into the trained DNN model for processing to obtain a target embedded vector corresponding to the medical record data to be predicted;
the first determining unit is used for determining the quality of the medical record data to be predicted according to the distance between the target embedded vector and the quality embedded vector and the preset quality abnormal distance.
With reference to the second aspect, in one possible implementation manner, the four-tuple loss function is:
L=d(a,p)-d(a,n)-k*d(a,F);
wherein L represents the four-tuple loss function, a represents the embedded vector obtained after the feature vector of the anchor point is input into the DNN model, p represents the embedded vector obtained after the feature vector of the positive sample is input into the DNN model, n represents the embedded vector obtained after the feature vector of the negative sample is input into the DNN model, F represents the embedded vector obtained after the feature vector of the false sample is input into the DNN model, k is a coefficient, d(a,p) represents the distance between a and p, d(a,n) represents the distance between a and n, and d(a,F) represents the distance between a and F.
With reference to the second aspect, in one possible implementation manner, the first determining unit is specifically configured to: when the distance between the target embedded vector and the quality embedded vector is greater than or equal to a preset quality abnormal distance, determining that the quality of the medical record data to be predicted is unqualified; and when the distance between the target embedded vector and the quality embedded vector is smaller than the quality abnormal distance, determining that the quality of the medical record data to be predicted is qualified.
With reference to the second aspect, in one possible implementation manner, the processing unit is further configured to sequentially input the feature vectors of all the false samples in the at least 2 training samples into the trained DNN model to obtain the embedded vectors corresponding to all the false samples, where one false sample corresponds to one embedded vector; the data classification device further comprises a second determining unit, configured to determine the mean vector of the embedded vectors corresponding to all the false samples as the quality embedded vector.
With reference to the second aspect, in a possible implementation manner, the first determining unit is further configured to: and determining the category of the medical record data to be predicted according to the distance between the target embedded vector and each category embedded vector and the category distance corresponding to each category embedded vector.
With reference to the second aspect, in a possible implementation manner, the first determining unit is further specifically configured to: when the distance between the target embedded vector and the class embedded vector w in each class embedded vector is smaller than or equal to the class distance corresponding to the class embedded vector w, determining the class of the medical record data to be predicted as a first class, wherein the first class is the class corresponding to the class embedded vector w.
With reference to the second aspect, in a possible implementation manner, the first determining unit is further configured to: when the distance between the target embedded vector and each category embedded vector is larger than the category distance corresponding to each category embedded vector, determining the category of the medical record data to be predicted as a second category, wherein the second category is different from the category corresponding to each category embedded vector.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the processor and the memory are connected to each other, where the memory is configured to store a computer program supporting a terminal to perform the method described above, the computer program including program instructions, and the processor is configured to invoke the program instructions to perform the data processing method based on a deep neural network of the first aspect described above.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the deep neural network-based data processing method of the first aspect described above.
According to the embodiment of the application, at least 2 training samples are acquired, the at least 2 training samples are sequentially input into a constructed deep neural network DNN model for training, so that the loss function of the DNN model after training is reduced to a preset fluctuation range, the loss function of the DNN model is a four-tuple loss function, the feature vector of medical record data to be predicted is input into the trained DNN model for processing, the target embedded vector corresponding to the medical record data to be predicted is obtained, the quality of the medical record data to be predicted is determined according to the distance between the target embedded vector and the quality embedded vector and the preset quality abnormal distance, the quality of the medical record data to be predicted can be screened from multiple aspects/angles, and the quality screening accuracy is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a DNN model according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a data processing method based on a deep neural network according to an embodiment of the present application;
FIG. 3 is another schematic flow chart of a data processing method based on a deep neural network according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 5 is a schematic block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be understood that the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
It should be further appreciated that reference herein to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
In order to better understand the data processing method based on the deep neural network provided by the embodiment of the present application, the architecture of the deep neural network (Deep Neural Network, DNN) provided by the embodiment of the present application will be briefly described below.
Referring to fig. 1, fig. 1 is a schematic diagram of an architecture of a DNN model according to an embodiment of the present application. The neural network layers within a DNN may be divided by position into three categories: an input layer, hidden layers, and an output layer. As shown in fig. 1, the first layer of the DNN model is the input layer, the last layer is the output layer, and the intermediate layers are all hidden layers, such as hidden layer 1, hidden layer 2, and hidden layer 3 in fig. 1. The layers of the DNN model are fully connected, that is, any neuron of the i-th layer is connected to any neuron of the (i+1)-th layer. The output layer is not limited to a single output and may have multiple outputs, so such DNN models can be flexibly applied to classification and regression, as well as to other machine learning fields such as dimension reduction and clustering. It can be appreciated that the output layer of the DNN model of the embodiments of the present application has multiple outputs and is mainly used for dimension reduction and clustering in the machine learning field. It is further understood that fig. 1 is only a schematic diagram, and the number of hidden layers of the DNN model is not limited in the embodiment of the present application.
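The fully connected structure described above can be illustrated with a small forward pass in Python. This is a sketch under assumptions and not the model of fig. 1: the layer sizes and weight values are hypothetical, and the activation function (ReLU on the hidden layer, none on the output layer) is an assumption, since the text does not name one.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: every input feeds every output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Assumed hidden-layer activation; the source does not specify one."""
    return [max(0.0, v) for v in values]


def forward(features, layers):
    """Map an input feature vector to an output embedded vector."""
    activations = features
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:  # no activation on the output layer
            activations = relu(activations)
    return activations


# Hypothetical network: 3 inputs -> 2 hidden neurons -> 2 outputs.
layers = [
    ([[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 1.0], [1.0, -1.0]], [0.5, 0.0]),            # output layer
]
embedding = forward([1.0, 2.0, 3.0], layers)
```

The output layer here has two neurons, so the 3-dimensional feature vector is mapped to a 2-dimensional embedded vector, which is the dimension-reducing use of multiple outputs that the paragraph mentions.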
The following will describe a data processing method and apparatus based on a deep neural network according to an embodiment of the present application with reference to fig. 2 to 5. The data processing method based on the deep neural network can be applied to the intelligent medical field, and can overcome the problems of low coverage, efficiency and accuracy in the traditional manual screening process in the screening process of the electronic medical record, realize screening the quality of medical record data in multiple aspects/angles, improve the accuracy of quality screening, and further promote the construction of intelligent cities.
It can be understood that, the medical record data with qualified quality mentioned in the embodiment of the application refers to medical record data without quality problems such as misdiagnosis and recording errors, and the medical record data with unqualified quality refers to medical record data with quality problems such as misdiagnosis or recording errors.
Referring to fig. 2, fig. 2 is a schematic flowchart of a data processing method based on a deep neural network according to an embodiment of the present application. As shown in fig. 2, the data processing method based on the deep neural network may include:
S201, the electronic device acquires at least 2 training samples.
In some possible embodiments, each of the at least 2 training samples is a quadruple. Each quadruple comprises 4 characteristics, namely, a characteristic vector of an anchor point, a characteristic vector of a positive sample, a characteristic vector of a negative sample and a characteristic vector of a false sample. The anchor point in the embodiment of the application refers to medical record data with qualified quality, the positive sample refers to medical record data with the same category as the anchor point and qualified quality, the negative sample refers to medical record data with different categories as the anchor point and qualified quality, and the false sample refers to medical record data with unqualified quality.
In some possible implementations, the electronic device can randomly extract N pieces of medical record data from a medical record database. A developer can label the N pieces of medical record data, marking whether the quality of each piece is qualified and, for each piece with qualified quality, its category. For ease of description, medical record data with qualified quality is hereinafter referred to as a qualified medical record, and medical record data with unqualified quality is referred to as an abnormal medical record. Optionally, the developer may classify the qualified medical records among the N pieces of medical record data according to the department or the disease site recorded in the medical record data. For example, classification by department may yield internal medicine, surgery, gynecology, pediatrics, ophthalmology and otorhinolaryngology, oncology, infectious diseases, and so on. It is understood that the department classification may be further refined: for example, internal medicine may be divided into respiratory medicine, gastroenterology, hematology, etc., and surgery may be divided into general surgery, cardiothoracic surgery, cardiovascular surgery, breast surgery, hepatobiliary surgery, etc. As another example, classification by disease site may yield heart, liver, spleen, lung, kidney, ear, nose, throat, eye, etc.
The electronic device can use the N pieces of medical record data labeled with quality and category as a training data set. The electronic device can randomly select one qualified medical record from the K qualified medical records in the training data set as an anchor point, and extract the feature vector of the anchor point. The electronic device can randomly select, from the remaining K-1 qualified medical records in the training data set, a qualified medical record with the same category as the anchor point as a positive sample, and extract the feature vector of the positive sample. The electronic device can randomly select, from the remaining K-2 qualified medical records in the training data set, a qualified medical record with a different category from the anchor point as a negative sample, and extract the feature vector of the negative sample. The electronic device can randomly select an abnormal medical record from the N-K abnormal medical records in the training data set as a false sample, and extract the feature vector of the false sample. The electronic device may combine the feature vector of the anchor point, the feature vector of the positive sample, the feature vector of the negative sample, and the feature vector of the false sample into a quadruple, and may use the quadruple as a training sample. In this way, the electronic device determines at least 2 training samples from the training data set, each training sample being a quadruple. A feature vector describes the feature information of medical record data; for example, it can include feature information such as symptoms, examination results, and diagnosis. The feature vectors of the anchor point, the positive sample, the negative sample and the false sample all have the same feature dimension and the same feature categories.
For example, assume that each piece of medical record data includes 5 features: feature A, feature B, feature C, feature D, and feature E. The feature vector of anchor point i is $X_i=(A_i,B_i,C_i,D_i,E_i)$, the feature vector of positive sample j is $X_j=(A_j,B_j,C_j,D_j,E_j)$, the feature vector of negative sample h is $X_h=(A_h,B_h,C_h,D_h,E_h)$, and the feature vector of false sample g is $X_g=(A_g,B_g,C_g,D_g,E_g)$. Therefore, the quadruple formed by the feature vector of the anchor point i, the feature vector of the positive sample j, the feature vector of the negative sample h, and the feature vector of the false sample g is $(X_i,X_j,X_h,X_g)$, i.e. the training sample is $(X_i,X_j,X_h,X_g)$.
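The sampling procedure of S201 can be sketched in Python as follows. This is an illustrative sketch only: the record layout (dictionaries with the keys "features", "qualified" and "category"), the feature values and the function name are hypothetical, and the sketch assumes that every category has at least two qualified medical records and that at least one abnormal medical record exists.

```python
import random


def sample_quadruple(records, rng):
    """Draw one (anchor, positive, negative, false sample) group of records."""
    qualified = [r for r in records if r["qualified"]]
    abnormal = [r for r in records if not r["qualified"]]
    anchor = rng.choice(qualified)
    # Positive: another qualified record of the same category as the anchor.
    positives = [r for r in qualified
                 if r is not anchor and r["category"] == anchor["category"]]
    # Negative: a qualified record of a different category.
    negatives = [r for r in qualified if r["category"] != anchor["category"]]
    return (anchor, rng.choice(positives),
            rng.choice(negatives), rng.choice(abnormal))


# Hypothetical labeled training data set with 5 features per record.
records = [
    {"features": [1.0, 0.0, 0.2, 0.1, 0.0], "qualified": True, "category": "internal"},
    {"features": [0.9, 0.1, 0.3, 0.1, 0.0], "qualified": True, "category": "internal"},
    {"features": [0.0, 1.0, 0.8, 0.5, 0.1], "qualified": True, "category": "surgery"},
    {"features": [0.1, 0.9, 0.7, 0.6, 0.1], "qualified": True, "category": "surgery"},
    {"features": [0.5, 0.5, 9.0, 0.0, 1.0], "qualified": False, "category": None},
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
anchor, positive, negative, false_sample = sample_quadruple(records, rng)
# The training sample is the quadruple of the four feature vectors.
training_sample = tuple(
    r["features"] for r in (anchor, positive, negative, false_sample))
```

Calling `sample_quadruple` repeatedly would yield the "at least 2 training samples" that the method requires.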
S202, the electronic equipment sequentially inputs at least 2 training samples into the constructed deep neural network DNN model for training.
In some possible embodiments, the loss function of the DNN model described above may be a four-tuple loss function. The four-tuple loss function can be determined by the differences between the embedded vector obtained after the feature vector of the anchor point is input into the DNN model and the embedded vectors obtained after the feature vector of the positive sample, the feature vector of the negative sample and the feature vector of the false sample are respectively input into the DNN model.
In some possible implementations, the electronic device may construct, according to developer settings (e.g., the number of hidden layers, the number of neurons of the input layer, the number of neurons of the output layer, or the loss function), a DNN model that includes an input layer, one or more hidden layers, and an output layer, with full connectivity between the layers. The electronic device can sequentially input the at least 2 training samples (namely the quadruples) into the constructed DNN model for training, so that the loss function of the DNN model after training is reduced to a preset fluctuation range. The loss function of the DNN model is a four-tuple loss function, which constrains the four embedded vectors output by the DNN model for each quadruple during training: the embedded vector corresponding to the anchor point, the embedded vector corresponding to the positive sample, the embedded vector corresponding to the negative sample, and the embedded vector corresponding to the false sample.
Optionally, the four-tuple loss function satisfies formula (1-1):
L=d(a,p)-d(a,n)-k*d(a,F), (1-1)
where L represents the four-tuple loss function and d (x, y) represents the L2 distance of x and y in sample space. a represents an embedded vector obtained after the feature vector of the anchor point is input into the DNN model, p represents an embedded vector obtained after the feature vector of the positive sample is input into the DNN model, n represents an embedded vector obtained after the feature vector of the negative sample is input into the DNN model, F represents an embedded vector obtained after the feature vector of the false sample is input into the DNN model, and k is a coefficient. d (a, p) represents the L2 distance between a and p, d (a, n) represents the L2 distance between a and n, and d (a, F) represents the L2 distance between a and F.
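Formula (1-1) can be evaluated directly on four embedded vectors, as in the Python sketch below. The function names, the embedded vectors and the value k=0.5 are hypothetical and chosen only so that the distances come out as whole numbers.

```python
import math


def l2_distance(x, y):
    """L2 (Euclidean) distance between two equal-length vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def four_tuple_loss(a, p, n, f, k):
    """Formula (1-1): L = d(a,p) - d(a,n) - k*d(a,F)."""
    return l2_distance(a, p) - l2_distance(a, n) - k * l2_distance(a, f)


# Hypothetical embedded vectors for one quadruple.
a = [0.0, 0.0]   # anchor point
p = [3.0, 4.0]   # positive sample, d(a,p) = 5
n = [6.0, 8.0]   # negative sample, d(a,n) = 10
f = [0.0, 12.0]  # false sample,    d(a,F) = 12

loss = four_tuple_loss(a, p, n, f, k=0.5)  # 5 - 10 - 0.5*12
```

The coefficient k weights the term that pushes the false sample away from the anchor point relative to the term that pushes the negative sample away.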
Optionally, the L2 distance satisfies formula (1-2):
d(x,y)=\sqrt{\sum_{i=1}^{Q}(x_i-y_i)^2}, (1-2)
where Q represents the number of elements included in each of x and y, x_i represents the i-th element in vector x, and y_i represents the i-th element in vector y. For example, assuming x=(1,2,3,4) and y=(5,6,7,8), then d(x,y)=\sqrt{(1-5)^2+(2-6)^2+(3-7)^2+(4-8)^2}=\sqrt{64}=8.
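As a worked illustration of formulas (1-1) and (1-2), the following sketch computes the L2 distance and the four-tuple loss for given embedded vectors. The function names and the default value of the coefficient k are assumptions; the text does not specify k.

```python
import numpy as np

def l2_distance(x, y):
    """L2 distance of formula (1-2): square root of the summed squared differences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def four_tuple_loss(a, p, n, f, k=1.0):
    """Formula (1-1): L = d(a,p) - d(a,n) - k*d(a,F).

    a, p, n, f are the embedded vectors of the anchor point, positive sample,
    negative sample and false sample. k=1.0 is an assumed default.
    """
    return l2_distance(a, p) - l2_distance(a, n) - k * l2_distance(a, f)

# The example from the text: x=(1,2,3,4), y=(5,6,7,8).
print(l2_distance((1, 2, 3, 4), (5, 6, 7, 8)))  # 8.0
```

Because d(a,n) and d(a,F) enter with negative signs, the loss decreases both when the positive sample moves closer to the anchor point and when the negative and false samples move further away.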
In the training process, the DNN model minimizes the value of the four-tuple loss function L, so that the distance in the sample space between the embedded vector corresponding to the anchor point and the embedded vector corresponding to the positive sample is as small as possible, namely: the value of d(a, p) in the four-tuple loss function L is made as small as possible (indicating that the embedded vectors corresponding to qualified medical records of the same category are close together in the sample space). Meanwhile, the distance between the embedded vector corresponding to the anchor point and the embedded vector corresponding to the negative sample, and the distance between the embedded vector corresponding to the anchor point and the embedded vector corresponding to the false sample, are made as large as possible, namely: d(a, n) and k*d(a, F) in the four-tuple loss function L are made as large as possible (indicating that the embedded vectors corresponding to qualified medical records of different categories are far apart in the sample space, and that the embedded vectors corresponding to qualified medical records and abnormal medical records are far apart in the sample space). By minimizing the value of the four-tuple loss function L, the DNN model may further make the distances between the anchor point and the false sample, between the positive sample and the false sample, and between the negative sample and the false sample far greater than the distance between the embedded vectors corresponding to the anchor point and the negative sample, that is: d(a, F), d(p, F) and d(n, F) are made much larger than d(a, n).
It can be appreciated that since the anchor point, the positive sample and the negative sample are all qualified medical records and the false sample is an abnormal medical record, when the DNN model minimizes the four-tuple loss function, d(a, F), d(p, F) and d(n, F) are far greater than d(a, n), so that the DNN model can learn the difference between qualified medical records and abnormal medical records, thereby identifying abnormal medical records. Since the anchor point is a qualified medical record and the positive sample is a qualified medical record of the same category as the anchor point, the DNN model can learn the distribution (or characteristics) of qualified medical records of the same category by making the value of d(a, p) as small as possible when minimizing the four-tuple loss function. It can be further understood that the DNN model learns the mapping relationship between the input feature vector and the output embedded vector by minimizing the value of the four-tuple loss function L in the training process, that is, it adjusts the value of each element of the embedded vector, so as to gradually constrain the output of the DNN model to the distribution corresponding to the model's learning target.
The dimension of the feature vector input into the DNN model is larger than that of the embedded vector output by the DNN model, and the features in the embedded vector belong to the features of the feature vector. For example, the feature vector is a 1000-dimensional vector, and the embedded vector is a fixed 100-dimensional vector. As another example, the feature vector input to the DNN model includes five features, A, B, C, D and E, and the embedded vector output from the DNN model may include three features, B, D and E.
It can be understood that, according to the training process, qualified medical records of the same category are concentrated in the same sample cluster, qualified medical records of different categories are distributed in different sample clusters within a certain range, and abnormal medical records are distributed at positions far away from the qualified medical records.
In some possible embodiments, when the value of the four-tuple loss function L is no longer reduced (or fluctuates within a certain preset fluctuation range) in the training process, it is indicated that the DNN model tends to be stable at this time, and the constraint condition of the four-tuple loss function L is satisfied, and the DNN model training is completed. It can be appreciated that the more training samples used in the training process, the better the performance of the trained DNN model.
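The stopping condition described above (the loss no longer decreasing, or fluctuating within a preset range) could be checked as in the sketch below. The function name `has_converged`, the window length and the tolerance are hypothetical; the text only states that a preset fluctuation range exists.

```python
def has_converged(loss_history, window=5, tolerance=1e-3):
    """Return True when the last `window` loss values fluctuate within `tolerance`.

    window and tolerance are assumed values standing in for the
    "preset fluctuation range" mentioned in the text.
    """
    if len(loss_history) < window:
        return False
    recent = loss_history[-window:]
    return max(recent) - min(recent) <= tolerance

# Example: the loss has flattened out around 0.2.
losses = [1.0, 0.5, 0.2, 0.2001, 0.2002, 0.2001, 0.2]
print(has_converged(losses))  # True
```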
S203, the electronic equipment inputs the feature vector of the medical record data to be predicted into a trained DNN model for processing, and a target embedded vector corresponding to the medical record data to be predicted is obtained.
In some possible embodiments, the electronic device may randomly acquire a piece of medical record data to be predicted from the medical record database, and may extract a feature vector of the medical record data to be predicted. The medical record data to be predicted refers to medical record data of unknown quality and/or unknown category. The feature vector of the medical record data to be predicted comprises the same features as the feature vectors of the anchor point, the positive sample, the negative sample and the false sample used in the training process (that is, the feature types and the feature order are the same), and has the same dimension. For example, if the dimension of the feature vectors used in the training process is 1000, the dimension of the feature vector of the medical record data to be predicted is also 1000; assuming that the feature vectors used in the training process include five features A, B, C, D and E, the feature vector of the medical record data to be predicted also includes the five features A, B, C, D and E.
The electronic equipment can input the feature vector of the medical record data to be predicted into the trained DNN model for processing, and the trained DNN model maps the input feature vector and outputs a target embedded vector corresponding to the medical record data to be predicted. It will be appreciated that the dimension of the embedded vector is lower and denser than the dimension of the feature vector, and that the DNN model projects the feature vector into a feature space of lower dimension to yield the embedded vector.
S204, the electronic equipment determines the quality of the medical record data to be predicted according to the distance between the target embedded vector and the quality embedded vector and the preset quality abnormal distance.
In some possible embodiments, the electronic device may obtain the quality embedded vector and the preset quality anomaly distance. The electronic device may calculate the distance between the target embedded vector and the quality embedded vector, and may compare this distance with the preset quality anomaly distance. If the distance between the target embedded vector and the quality embedded vector is greater than or equal to the preset quality anomaly distance, the electronic device determines that the quality of the medical record data to be predicted is unqualified, that is, the medical record data to be predicted is an abnormal medical record. If the distance between the target embedded vector and the quality embedded vector is smaller than the preset quality anomaly distance, the electronic device determines that the quality of the medical record data to be predicted is qualified, that is, the medical record data to be predicted is a qualified medical record. The quality embedded vector can be used to reflect characteristics of abnormal medical records. The distance in the embodiment of the present application may refer to an L2 distance. The preset quality anomaly distance may be determined based on medical record data in the training dataset.
According to the embodiment of the application, the deep learning model is trained by using the electronic medical record data, so that the model can learn potential distribution of medical record data with qualified quality, and quality evaluation is carried out according to whether the medical record data accords with the distribution learned by the model, thereby expanding coverage of quality evaluation, screening the quality of the medical record data from multiple aspects/angles and improving accuracy of quality screening.
In some possible embodiments, the method for the electronic device to acquire the quality embedded vector specifically includes: the electronic device may extract the feature vectors of all the dummy samples (i.e., the N-K abnormal medical records) in the at least 2 training samples, and may sequentially input the feature vectors of all the dummy samples in the at least 2 training samples into the trained DNN model for processing, to obtain the embedded vectors corresponding to all the dummy samples. One of the dummy samples corresponds to one of the embedded vectors (N-K abnormal medical records correspond to N-K embedded vectors). The electronic device can calculate the mean value vector between the embedded vectors corresponding to all the false samples (i.e., N-K embedded vectors corresponding to N-K abnormal medical records), and can use the mean value vector as the quality embedded vector.
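The two steps described for step S204 (averaging the embedded vectors of the false samples, then comparing a distance against the preset quality anomaly distance) can be sketched as follows. The function names and the numeric values are hypothetical, and the comparison direction simply follows the rule as stated in the text.

```python
import numpy as np

def quality_embedding(false_sample_embeddings):
    """Mean vector of the embedded vectors of all false samples."""
    return np.mean(np.asarray(false_sample_embeddings, dtype=float), axis=0)

def is_qualified(target, quality_vec, quality_anomaly_distance):
    """Apply the rule as stated in the text: the record is unqualified when the
    L2 distance is >= the preset quality anomaly distance, qualified otherwise."""
    distance = np.linalg.norm(np.asarray(target, dtype=float) - quality_vec)
    return bool(distance < quality_anomaly_distance)

# Illustrative 2-dimensional embeddings (made-up values).
q = quality_embedding([[0.0, 0.0], [2.0, 4.0]])   # mean is (1, 2)
print(is_qualified([4.0, 6.0], q, 5.0))           # distance 5.0 -> False
```

The threshold passed as `quality_anomaly_distance` stands in for the preset value that, according to the text, may be determined from medical record data in the training dataset.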
Optionally, in order to ensure the reliability and privacy of medical record data, the medical record data (including the training samples and the medical record data to be predicted) may be uploaded in advance to a blockchain node in a blockchain system. When the data processing method based on the deep neural network of the present application is executed, the relevant data of the training samples may be obtained from the blockchain node to train the DNN model; the medical record data to be predicted may then be obtained from the blockchain node and input into the DNN model to determine the target embedded vector, and the quality of the medical record data to be predicted may be determined according to the target embedded vector. In this way, quality evaluation of a patient's medical record data is realized accurately, securely and privately.
Optionally, the data processing method based on the deep neural network in the present application can also be executed based on smart contracts deployed in the blockchain system. For example, after the DNN model training is completed, the distance between the target embedded vector and the quality embedded vector can be calculated through a smart contract, and the quality of the medical record data to be predicted can be determined through the smart contract according to this distance and the preset quality anomaly distance. Further optionally, after the quality of the medical record data to be predicted is determined, the quality determined by the smart contract can be uploaded to the blockchain, so that the reliability and privacy of the medical record data are ensured.
It should be noted that, the blockchain referred to in the present application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm, etc. The blockchain is essentially a decentralised database, which is a series of data blocks generated by cryptographic methods, each data block containing a batch of information of network transactions for verifying the validity (anti-counterfeiting) of the information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
In the embodiment of the application, the electronic equipment acquires at least 2 training samples, sequentially inputs the at least 2 training samples into the constructed DNN model for training, reduces the loss function of the DNN model to a preset fluctuation range after training, inputs the feature vector of the medical record data to be predicted into the trained DNN model for processing, and obtains the target embedded vector corresponding to the medical record data to be predicted, and determines the quality of the medical record data to be predicted according to the distance between the target embedded vector and the quality embedded vector and the preset quality abnormal distance, so that the coverage of quality evaluation can be enlarged, the quality of the medical record data can be screened from multiple aspects/angles, and the accuracy of quality screening is improved.
Referring to fig. 3, fig. 3 is another schematic flowchart of a data processing method based on a deep neural network according to an embodiment of the present application. As shown in fig. 3, the data processing method based on the deep neural network may include:
S301, the electronic device acquires at least 2 training samples.
S302, the electronic equipment sequentially inputs at least 2 training samples into the constructed deep neural network DNN model for training.
S303, the electronic equipment inputs the feature vector of the medical record data to be predicted into a trained DNN model for processing, and a target embedded vector corresponding to the medical record data to be predicted is obtained.
In some possible implementations, the implementation manners of step S301 to step S303 in the embodiment of the present application may refer to the implementation manners of step S201 to step S203 in the embodiment shown in fig. 2, which are not described herein.
S304, the electronic equipment sequentially inputs the feature vectors of all the false samples in the at least 2 training samples into the trained DNN model for processing to obtain embedded vectors corresponding to all the false samples, wherein one false sample corresponds to one embedded vector.
S305, the electronic device determines the mean vector of the embedded vectors corresponding to all the false samples as the quality embedded vector.
In some possible embodiments, each of the at least 2 training samples is a quadruple. Each quadruple comprises 4 feature vectors, namely, a feature vector of an anchor point, a feature vector of a positive sample, a feature vector of a negative sample and a feature vector of a false sample. The anchor point in the embodiment of the application refers to medical record data of qualified quality, the positive sample refers to medical record data of the same category as the anchor point and of qualified quality, the negative sample refers to medical record data of a different category from the anchor point and of qualified quality, and the false sample refers to medical record data of unqualified quality. For ease of description, medical record data of qualified quality is hereinafter referred to as a qualified medical record, and medical record data of unqualified quality is referred to as an abnormal medical record.
In some possible embodiments, the electronic device may extract feature vectors of all the dummy samples in the at least 2 training samples, and may sequentially input the feature vectors of all the dummy samples in the at least 2 training samples into the trained DNN model for processing, to obtain the embedded vectors corresponding to all the dummy samples. One of the dummy samples corresponds to one of the embedded vectors. The electronic device may calculate a mean vector between the embedded vectors corresponding to all the dummy samples, and may use the mean vector as a quality embedded vector.
S306, if the distance between the target embedded vector and the quality embedded vector is greater than or equal to the preset quality abnormal distance, the electronic equipment determines that the quality of the medical record data to be predicted is unqualified.
S307, if the distance between the target embedded vector and the quality embedded vector is smaller than the quality abnormal distance, the electronic device determines that the quality of the medical record data to be predicted is qualified.
In some possible embodiments, after obtaining the quality embedded vector, the electronic device may obtain the preset quality anomaly distance. The electronic device may calculate the distance between the target embedded vector and the quality embedded vector, and may compare this distance with the preset quality anomaly distance. If the distance between the target embedded vector and the quality embedded vector is greater than or equal to the preset quality anomaly distance, the electronic device determines that the quality of the medical record data to be predicted is unqualified, that is, the medical record data to be predicted is an abnormal medical record. If the distance between the target embedded vector and the quality embedded vector is smaller than the preset quality anomaly distance, the electronic device determines that the quality of the medical record data to be predicted is qualified, that is, the medical record data to be predicted is a qualified medical record. The quality embedded vector can be used to reflect characteristics of abnormal medical records. The distance in the embodiment of the present application may refer to an L2 distance. The preset quality anomaly distance may be determined based on medical record data in the training dataset.
S308, when the quality of the medical record data to be predicted is determined to be qualified, the electronic device determines the category of the medical record data to be predicted according to the distance between the target embedded vector and each category embedded vector and the category distance corresponding to each category embedded vector.
In some possible embodiments, when the quality of the medical record data to be predicted is determined to be qualified, the electronic device may obtain each category embedded vector, and may obtain the preset category distance corresponding to each category embedded vector. One category embedded vector corresponds to one category distance. The electronic device may calculate the distance between the target embedded vector and each category embedded vector, and may compare each of these distances with the category distance corresponding to the respective category embedded vector. If the distance between the target embedded vector and a category embedded vector w among the category embedded vectors is smaller than or equal to the category distance corresponding to the category embedded vector w, the electronic device may determine that the category of the medical record data to be predicted is a first category, where the first category is the category corresponding to the category embedded vector w. If the distance between the target embedded vector and the category embedded vector w is greater than the category distance corresponding to the category embedded vector w, the category of the medical record data to be predicted is different from the category corresponding to the category embedded vector w.
According to the embodiment of the application, on one hand, the deep learning model is trained by using the electronic medical record data, so that the model can learn potential distribution of medical record data with qualified quality, and quality evaluation is carried out according to whether the medical record data accords with the distribution learned by the model, thereby expanding coverage of quality evaluation, screening the quality of the medical record data from multiple aspects/angles and improving accuracy of quality screening. On the other hand, according to the embodiment of the application, the embedding vector output by the model is constrained through the four-element loss function, so that qualified medical record data can be distinguished from unqualified medical record data, and the qualified medical record data can be classified according to medical record categories.
Optionally, if the distances between the target embedded vector and the category embedded vectors are all greater than the category distances corresponding to the category embedded vectors, which indicates that the category of the medical record data to be predicted does not belong to any existing category, the electronic device takes the category of the medical record data to be predicted as the second category. The second category is different from the category corresponding to each category embedding vector.
For example, it is assumed that there are 4 category embedded vectors, namely, category embedded vector S1, category embedded vector S2, category embedded vector S3, and category embedded vector S4. Assume that there are 4 category distances, namely category distances 1, 2, 3, and 4; category embedded vector S1 corresponds to category distance 1, category embedded vector S2 corresponds to category distance 2, category embedded vector S3 corresponds to category distance 3, and category embedded vector S4 corresponds to category distance 4. Let the categories corresponding to the category embedded vectors S1, S2, S3, and S4 be category 1, category 2, category 3, and category 4, respectively. The electronic device sequentially calculates the distances D1, D2, D3, and D4 between the target embedded vector and the category embedded vectors S1, S2, S3, and S4. The electronic device compares the distances D1, D2, D3, and D4 with the category distances 1, 2, 3, and 4, respectively. If D1 is less than or equal to category distance 1, D2 is greater than category distance 2, D3 is greater than category distance 3, and D4 is greater than category distance 4, the electronic device determines that the category of the medical record data to be predicted is the category corresponding to the category embedded vector S1, i.e., category 1.
If the distance D1 is greater than category distance 1, the distance D2 is greater than category distance 2, the distance D3 is greater than category distance 3, and the distance D4 is greater than category distance 4, the category of the medical record data to be predicted is different from the categories corresponding to the category embedded vectors S1, S2, S3, and S4, that is, the medical record data to be predicted does not belong to any existing category, and the electronic device takes the category of the medical record data to be predicted as a separate category, such as a second category. It can be appreciated that if D1 is less than or equal to category distance 1, D2 is greater than category distance 2, D3 is also less than or equal to category distance 3, and D4 is greater than category distance 4, the electronic device determines that the category of the medical record data to be predicted is both the category corresponding to the category embedded vector S1, i.e., category 1, and the category corresponding to the category embedded vector S3, i.e., category 3.
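The comparison in this example can be sketched as a function that checks every category embedded vector and collects all categories whose category distance is satisfied. The function name, the dictionary layout and the numeric values are hypothetical; an empty result corresponds to the case in which the record is assigned to a separate (second) category.

```python
import numpy as np

def match_categories(target, category_vectors, category_distances):
    """Return every category whose distance to the target is <= its category distance.

    category_vectors maps a category name to its category embedded vector;
    category_distances maps the same names to the preset category distances.
    """
    target = np.asarray(target, dtype=float)
    matches = []
    for name, vec in category_vectors.items():
        d = np.linalg.norm(target - np.asarray(vec, dtype=float))
        if d <= category_distances[name]:
            matches.append(name)
    return matches

# Made-up 2-dimensional values: the target lies within categories 1 and 3.
vectors = {"category 1": [0, 0], "category 2": [10, 0], "category 3": [0, 2]}
distances = {"category 1": 1.5, "category 2": 1.0, "category 3": 1.5}
print(match_categories([0, 1], vectors, distances))  # ['category 1', 'category 3']
```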
Alternatively, the electronic device may compare the magnitude relation between the distance between the target embedded vector and one of the class embedded vectors and the class distance corresponding to the one of the class embedded vectors, every time the distance between the target embedded vector and the one of the class embedded vectors is calculated. For example, the electronic device calculates a distance D1 between the target embedding vector and the category embedding vector S1, and compares the magnitude relation between the distance D1 and the category distance 1 corresponding to the category embedding vector S1. If D1 is smaller than or equal to the category distance 1, the electronic device determines that the category of the medical record data to be predicted is the category corresponding to the category embedding vector S1, namely the category 1. If D1 is greater than the category distance 1, the electronic device calculates a distance D2 between the target embedded vector and the category embedded vector S2, and compares a magnitude relationship between the distance D2 and the category distance 2 corresponding to the category embedded vector S2. If D2 is less than or equal to the category distance 2, the electronic device determines that the category of the medical record data to be predicted is the category corresponding to the category embedding vector S2, namely the category 2. If D2 is greater than the category distance 2, the electronic device calculates a distance D3 between the target embedded vector and the category embedded vector S3, compares the magnitude relation between the distance D3 and the category distance 3 corresponding to the category embedded vector S3, and so on until the electronic device determines the category of the medical record data to be predicted.
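The one-at-a-time variant described above stops at the first category whose category distance is satisfied, so later distances are never computed. A minimal sketch follows, with a hypothetical function name and made-up values; returning `None` when no category matches is an assumption about how the caller would be told to fall back.

```python
import numpy as np

def first_matching_category(target, categories):
    """Check categories in order and return the first one that matches.

    categories is a list of (name, category_embedded_vector, category_distance).
    Returns None if no category matches (assumed convention).
    """
    target = np.asarray(target, dtype=float)
    for name, vec, category_distance in categories:
        d = np.linalg.norm(target - np.asarray(vec, dtype=float))
        if d <= category_distance:
            return name  # later categories are not evaluated
    return None

categories = [
    ("category 1", [5, 5], 1.0),
    ("category 2", [0, 0], 2.0),
    ("category 3", [0, 1], 1.0),
]
print(first_matching_category([0, 1], categories))  # category 2
```

Unlike the exhaustive comparison, this variant yields at most one category, and the result depends on the order in which the category embedded vectors are visited.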
In some possible embodiments, if the distances between the target embedded vector and each of the class embedded vectors are greater than the class distances corresponding to each of the class embedded vectors, which indicates that the class of the medical record data to be predicted does not belong to any existing class, the electronic device may calculate an absolute difference between the distances between the target embedded vector and each of the class embedded vectors and the class distances corresponding to each of the class embedded vectors. The electronic equipment determines the category of the medical record data to be predicted as the category corresponding to the minimum absolute difference value in the absolute difference values. For example, assume that a distance D1 between the target embedding vector and the category embedding vector S1 is greater than a category distance 1, a distance D2 between the target embedding vector and the category embedding vector S2 is greater than a category distance 2, a distance D3 between the target embedding vector and the category embedding vector S3 is greater than a category distance 3, and a distance D4 between the target embedding vector and the category embedding vector S4 is greater than a category distance 4. The electronic device calculates an absolute difference A1 of the distance D1 and the category distance 1, an absolute difference A2 of the distance D2 and the category distance 2, an absolute difference A3 of the distance D3 and the category distance 3, and an absolute difference A4 of the distance D4 and the category distance 4, respectively. The electronic device determines the minimum absolute difference value from the absolute difference values A1, A2, A3 and A4, and determines the category of the medical record data to be predicted as the category corresponding to the minimum absolute difference value. 
Assuming that the minimum absolute difference value is A3, and the category corresponding to the A3 is the category 3, the category of the medical record data to be predicted is the category corresponding to the A3, namely the category 3.
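The fallback described in this example, used when every distance exceeds its category distance, can be sketched as choosing the category with the smallest absolute difference between the distance and the category distance. The function name and the numeric values are hypothetical.

```python
import numpy as np

def closest_category_by_margin(target, category_vectors, category_distances):
    """Pick the category minimising |D_i - category distance i|."""
    target = np.asarray(target, dtype=float)

    def absolute_difference(name):
        d = np.linalg.norm(target - np.asarray(category_vectors[name], dtype=float))
        return abs(d - category_distances[name])

    return min(category_vectors, key=absolute_difference)

# Made-up values: D1 = 5 vs category distance 1 (difference 4),
#                 D2 = 5 vs category distance 3 (difference 2).
vectors = {"category 1": [0, 0], "category 2": [10, 0]}
distances = {"category 1": 1.0, "category 2": 3.0}
print(closest_category_by_margin([5, 0], vectors, distances))  # category 2
```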
In some possible embodiments, the electronic device obtaining each category embedded vector specifically includes: the electronic device can extract the feature vectors of the plurality of qualified medical records belonging to the same category in the at least 2 training samples, and can sequentially input the feature vectors of the plurality of qualified medical records belonging to the same category into the trained DNN model for processing to obtain a plurality of embedded vectors corresponding to the plurality of qualified medical records of the same category. One qualified medical record corresponds to one embedded vector. The electronic device can calculate the mean vector of the plurality of embedded vectors corresponding to the plurality of qualified medical records of the same category, and use the mean vector as the category embedded vector of that category. For example, assuming that the at least 2 training samples include 4 categories, namely category 1, category 2, category 3, and category 4, the electronic device determines 4 category embedded vectors, namely the category embedded vectors of category 1, category 2, category 3, and category 4.
In the embodiment of the application, the electronic device acquires at least 2 training samples and sequentially inputs the at least 2 training samples into the constructed DNN model for training, so that the loss function of the DNN model after training is reduced to a preset fluctuation range, where the loss function of the DNN model is a four-tuple loss function. The electronic device inputs the feature vector of the medical record data to be predicted into the trained DNN model for processing to obtain the target embedded vector corresponding to the medical record data to be predicted. The electronic device sequentially inputs the feature vectors of all the false samples in the at least 2 training samples into the trained DNN model for processing to obtain the embedded vectors corresponding to all the false samples, and determines the mean vector of these embedded vectors as the quality embedded vector. When the distance between the target embedded vector and the quality embedded vector is greater than or equal to the preset quality anomaly distance, the electronic device determines that the quality of the medical record data to be predicted is unqualified; when the distance is smaller than the preset quality anomaly distance, the electronic device determines that the quality of the medical record data to be predicted is qualified. When the quality is determined to be qualified, the electronic device determines the category of the medical record data to be predicted according to the distance between the target embedded vector and each category embedded vector and the category distance corresponding to each category embedded vector. In this way, the coverage of quality evaluation can be enlarged, the quality of medical record data can be screened from multiple aspects/angles, the accuracy of quality screening is improved, and medical record data of qualified quality can be classified according to medical record categories.
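Computing one category embedded vector per category, as described above, amounts to grouping the embedded vectors of qualified medical records by category and averaging each group. The sketch below assumes the embedded vectors have already been produced by the trained DNN model; the function name and the sample values are hypothetical.

```python
import numpy as np

def category_embeddings(embeddings, labels):
    """Return {category: mean embedded vector} for qualified medical records.

    embeddings: embedded vectors output by the trained model (one per record).
    labels: the category of each record, in the same order.
    """
    grouped = {}
    for vec, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(np.asarray(vec, dtype=float))
    return {label: np.mean(vecs, axis=0) for label, vecs in grouped.items()}

# Made-up 2-dimensional embeddings for three qualified records.
result = category_embeddings(
    [[0, 0], [2, 2], [10, 10]],
    ["category 1", "category 1", "category 2"],
)
print(result["category 1"])  # [1. 1.]
```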
Referring to fig. 4, fig. 4 is a schematic block diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 4, the data processing apparatus according to the embodiment of the present application may include: an acquisition unit 10, a training unit 20, a processing unit 30 and a first determination unit 40.
An obtaining unit 10, configured to obtain at least 2 training samples, where each training sample in the at least 2 training samples is a quadruple, and the quadruple includes a feature vector of an anchor point, a feature vector of a positive sample, a feature vector of a negative sample, and a feature vector of a dummy sample, where the anchor point is medical record data with qualified quality, the positive sample is medical record data with the same category as the anchor point and qualified quality, the negative sample is medical record data with different category from the anchor point and qualified quality, and the dummy sample is medical record data with unqualified quality;
the training unit 20 is configured to sequentially input the at least 2 training samples into a constructed deep neural network DNN model for training, so that a loss function of the DNN model after training is reduced to a preset fluctuation range, where the loss function of the DNN model is a four-tuple loss function, and the four-tuple loss function is determined by the differences between the embedded vector obtained by inputting the feature vector of the anchor point into the DNN model and the embedded vectors obtained by respectively inputting the feature vector of the positive sample, the feature vector of the negative sample, and the feature vector of the false sample into the DNN model;
The processing unit 30 is configured to input the feature vector of the medical record data to be predicted into a trained DNN model for processing, so as to obtain a target embedded vector corresponding to the medical record data to be predicted;
the first determining unit 40 is configured to determine the quality of the medical record data to be predicted according to the distance between the target embedded vector and the quality embedded vector and a preset quality anomaly distance.
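The constraints that the acquisition unit places on each quadruple can be illustrated with a small data-structure sketch. The names `Record` and `Quadruple`, the field names, and the use of `None` as the category of an unqualified record are hypothetical choices made only for this illustration; the application does not prescribe any particular representation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical container: one medical record's feature vector and labels."""
    features: Tuple[float, ...]
    category: Optional[str]  # assumption: None when the quality is unqualified
    qualified: bool


@dataclass(frozen=True)
class Quadruple:
    """One training sample: anchor point, positive, negative and false sample."""
    anchor: Record
    positive: Record
    negative: Record
    false_sample: Record

    def __post_init__(self):
        # Anchor, positive and negative samples must all have qualified quality.
        for name in ("anchor", "positive", "negative"):
            if not getattr(self, name).qualified:
                raise ValueError(f"{name} must have qualified quality")
        # The false sample is medical record data with unqualified quality.
        if self.false_sample.qualified:
            raise ValueError("false sample must have unqualified quality")
        # Positive: same category as the anchor; negative: different category.
        if self.positive.category != self.anchor.category:
            raise ValueError("positive sample must share the anchor's category")
        if self.negative.category == self.anchor.category:
            raise ValueError("negative sample must differ from the anchor's category")
```

Validating at construction time keeps malformed quadruples out of the training set, which is one possible design; the same checks could equally be applied while sampling.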
In some possible embodiments, the four-tuple loss function is:
L=d(a,p)-d(a,n)-k*d(a,F);
wherein L represents the four-tuple loss function, a represents the embedded vector obtained after the feature vector of the anchor point is input into the DNN model, p represents the embedded vector obtained after the feature vector of the positive sample is input into the DNN model, n represents the embedded vector obtained after the feature vector of the negative sample is input into the DNN model, F represents the embedded vector obtained after the feature vector of the false sample is input into the DNN model, k is a coefficient, d(a,p) represents the distance between a and p, d(a,n) represents the distance between a and n, and d(a,F) represents the distance between a and F.
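As a minimal sketch, the formula above can be evaluated for a single quadruple of embedded vectors as follows. The Euclidean distance is an assumption made here for illustration, since this passage does not fix the distance d; the function name `four_tuple_loss` and the default `k=1.0` are likewise hypothetical.

```python
import math


def four_tuple_loss(a, p, n, f, k=1.0, d=math.dist):
    """Evaluate L = d(a,p) - d(a,n) - k*d(a,F) for one quadruple of embeddings.

    a, p, n, f: embedded vectors of the anchor point, positive sample,
    negative sample and false sample. d defaults to the Euclidean distance,
    which is an assumption of this sketch.
    """
    return d(a, p) - d(a, n) - k * d(a, f)


# Worked example: d(a,p) = 5, d(a,n) = 10, d(a,F) = 5, k = 0.5
loss = four_tuple_loss([0, 0], [3, 4], [6, 8], [0, 5], k=0.5)
print(loss)  # 5 - 10 - 0.5 * 5 = -7.5
```

Minimizing this value pulls the positive sample towards the anchor point while pushing the negative and false samples away from it, with k weighting the false-sample term.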
In some possible embodiments, the first determining unit 40 is specifically configured to: when the distance between the target embedded vector and the quality embedded vector is greater than or equal to a preset quality abnormal distance, determining that the quality of the medical record data to be predicted is unqualified; and when the distance between the target embedded vector and the quality embedded vector is smaller than the quality abnormal distance, determining that the quality of the medical record data to be predicted is qualified.
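The threshold rule of the first determining unit 40 can be sketched as below. The function name is hypothetical and the Euclidean distance is again an assumption; the comparison itself follows the rule exactly as stated above, including the boundary case in which the distance equals the quality abnormal distance.

```python
import math


def is_quality_qualified(target_embedding, quality_embedding, quality_abnormal_distance):
    """Return True when the medical record data is judged to be of qualified quality.

    Rule as stated in the text: distance >= threshold -> unqualified,
    distance < threshold -> qualified. Euclidean distance is assumed.
    """
    distance = math.dist(target_embedding, quality_embedding)
    return distance < quality_abnormal_distance
```

For example, with a target embedded vector of `[0, 0]` and a quality embedded vector of `[3, 4]` the distance is 5, so a quality abnormal distance of 5 yields unqualified, while any larger threshold yields qualified.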
In some possible embodiments, the above-mentioned data processing apparatus further comprises a second determining unit 50. The processing unit 30 is further configured to sequentially input the feature vectors of all the false samples in the at least 2 training samples into the trained DNN model for processing, so as to obtain the embedded vector corresponding to each false sample, where one false sample corresponds to one embedded vector; the second determining unit 50 is configured to determine the mean vector of the embedded vectors corresponding to all the false samples as the quality embedded vector.
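The mean-vector computation performed by the second determining unit 50 can be sketched in a few lines. The function name `quality_embedding` and the representation of embedded vectors as plain Python lists are assumptions of this illustration.

```python
def quality_embedding(false_embeddings):
    """Element-wise mean of the embedded vectors of all false samples."""
    if not false_embeddings:
        raise ValueError("at least one false-sample embedding is required")
    dim = len(false_embeddings[0])
    if any(len(e) != dim for e in false_embeddings):
        raise ValueError("all embeddings must have the same dimension")
    count = len(false_embeddings)
    return [sum(e[i] for e in false_embeddings) / count for i in range(dim)]


print(quality_embedding([[1.0, 2.0], [3.0, 6.0]]))  # [2.0, 4.0]
```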
In some possible embodiments, the first determining unit 40 is further configured to: and determining the category of the medical record data to be predicted according to the distance between the target embedded vector and each category embedded vector and the category distance corresponding to each category embedded vector.
In some possible embodiments, the first determining unit 40 is further specifically configured to: when the distance between the target embedded vector and the class embedded vector w in each class embedded vector is smaller than or equal to the class distance corresponding to the class embedded vector w, determining the class of the medical record data to be predicted as a first class, wherein the first class is the class corresponding to the class embedded vector w.
In some possible embodiments, the first determining unit 40 is further configured to: when the distance between the target embedded vector and each category embedded vector is larger than the category distance corresponding to each category embedded vector, determining the category of the medical record data to be predicted as a second category, wherein the second category is different from the category corresponding to each category embedded vector.
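The category determination described in the preceding embodiments can be sketched as follows. The dictionary-based interface, the label `"other"` for the second category, and the Euclidean distance are all hypothetical choices. The text does not say what happens when the target embedded vector lies within the category distance of more than one category embedded vector; this sketch assumes the nearest such category is chosen.

```python
import math

OTHER_CATEGORY = "other"  # hypothetical label for the "second category"


def classify(target_embedding, category_embeddings, category_distances,
             other=OTHER_CATEGORY):
    """Assign a category to medical record data already judged qualified.

    category_embeddings: dict mapping category name -> category embedded vector
    category_distances:  dict mapping category name -> category distance
    """
    best_name, best_distance = None, math.inf
    for name, embedding in category_embeddings.items():
        distance = math.dist(target_embedding, embedding)
        # First category: the distance is within this category's category distance.
        if distance <= category_distances[name] and distance < best_distance:
            best_name, best_distance = name, distance  # tie-break: nearest (assumption)
    # Second category: outside the category distance of every known category.
    return best_name if best_name is not None else other
```

A record whose target embedded vector falls outside every category distance is therefore not forced into an existing category but reported as belonging to a category different from all of them.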
The acquisition unit 10, the training unit 20, the processing unit 30, the first determining unit 40, and the second determining unit 50 may be integrated into one module, such as a processing module.
In a specific implementation, the data processing apparatus may use the above modules to perform each step of the implementations provided in fig. 2 or fig. 3 and thereby implement the functions of the above embodiments. For details, refer to the corresponding descriptions of each step in the method embodiments shown in fig. 2 or fig. 3, which are not repeated herein.
In the embodiment of the application, the data processing apparatus sequentially inputs at least 2 training samples into the constructed deep neural network (DNN) model for training, so that the loss function of the trained DNN model is reduced to a preset fluctuation range, wherein the loss function of the DNN model is a four-tuple loss function. The apparatus inputs the feature vector of the medical record data to be predicted into the trained DNN model for processing, so as to obtain the target embedded vector corresponding to the medical record data to be predicted, and determines the quality of the medical record data to be predicted according to the distance between the target embedded vector and the quality embedded vector and the preset quality abnormal distance. In this way, the quality of the medical record data to be predicted can be screened from multiple aspects/angles, and the accuracy of quality screening is improved.
Referring to fig. 5, fig. 5 is a schematic block diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic device in the embodiment of the present application may include: one or more processors 501 and a memory 502. The processor 501 and the memory 502 are connected via a bus 503. The memory 502 is used for storing a computer program comprising program instructions, and the processor 501 is used for executing the program instructions stored in the memory 502. The processor 501 is configured to invoke the program instructions to perform:
acquiring at least 2 training samples, wherein each training sample in the at least 2 training samples is a quadruple, the quadruple comprises a feature vector of an anchor point, a feature vector of a positive sample, a feature vector of a negative sample and a feature vector of a false sample, the anchor point is medical record data with qualified quality, the positive sample is medical record data with the same category as the anchor point and qualified quality, the negative sample is medical record data with a different category from the anchor point and qualified quality, and the false sample is medical record data with unqualified quality;
sequentially inputting the at least 2 training samples into a constructed deep neural network (DNN) model for training, so that the loss function of the trained DNN model is reduced to a preset fluctuation range, wherein the loss function of the DNN model is a four-tuple loss function determined by the differences between the embedded vector obtained by inputting the feature vector of the anchor point into the DNN model and the embedded vectors respectively obtained by inputting the feature vector of the positive sample, the feature vector of the negative sample and the feature vector of the false sample into the DNN model;
Inputting the feature vector of the medical record data to be predicted into a trained DNN model for processing to obtain a target embedded vector corresponding to the medical record data to be predicted;
and determining the quality of the medical record data to be predicted according to the distance between the target embedded vector and the quality embedded vector and the preset quality abnormal distance.
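This passage does not spell out how "reduced to a preset fluctuation range" is measured. One possible reading, sketched below purely as an assumption, is that training stops once the loss values of the most recent training steps vary by no more than a preset amount. The function name and the parameters `window` and `fluctuation_range` are hypothetical.

```python
def loss_within_fluctuation_range(loss_history, window, fluctuation_range):
    """Check whether the last `window` loss values stay within `fluctuation_range`.

    Assumed stopping criterion: max - min of the recent losses must not
    exceed the preset fluctuation range.
    """
    if window < 2:
        raise ValueError("window must cover at least two loss values")
    if len(loss_history) < window:
        return False  # not enough history yet to judge stability
    recent = loss_history[-window:]
    return max(recent) - min(recent) <= fluctuation_range
```

A training loop could call this check after each step and stop feeding further training samples once it returns `True`.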
It should be appreciated that in embodiments of the present application, the processor 501 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 502 may include read-only memory and random access memory, and provides instructions and data to the processor 501. A portion of the memory 502 may also include non-volatile random access memory. For example, the memory 502 may also store information about the device type.
In a specific implementation, the processor 501 described in the embodiment of the present application may execute an implementation manner of the data processing method based on the deep neural network provided in the embodiment of the present application, or may execute an implementation manner of the data processing apparatus described in the embodiment of the present application, which is not described herein again.
The embodiment of the present application further provides a computer-readable storage medium storing a computer program, where the computer program includes program instructions that, when executed by a processor, implement the data processing method based on a deep neural network shown in fig. 2 or fig. 3. For details, refer to the description of the embodiment shown in fig. 2 or fig. 3, which is not repeated herein.
The computer-readable storage medium may be an internal storage unit of the data processing apparatus or of the electronic device according to any of the foregoing embodiments, for example, a hard disk or a memory of the electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the electronic device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used to store the computer program and other programs and data required by the electronic device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Although the application has been described in connection with specific features and embodiments thereof, it will be apparent that various modifications and combinations can be made without departing from the spirit and scope of the application. Accordingly, the specification and drawings are merely exemplary illustrations of the present application as defined in the appended claims and are considered to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the application. It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (7)

1. A data processing method based on a deep neural network, comprising:
acquiring at least 2 training samples, wherein each training sample in the at least 2 training samples is a quadruple, the quadruple comprises a feature vector of an anchor point, a feature vector of a positive sample, a feature vector of a negative sample and a feature vector of a false sample, the anchor point is medical record data with qualified quality, the positive sample is medical record data with the same category as the anchor point and qualified quality, the negative sample is medical record data with a different category from the anchor point and qualified quality, and the false sample is medical record data with unqualified quality;
sequentially inputting the at least 2 training samples into a constructed deep neural network (DNN) model for training, so that the loss function of the trained DNN model is reduced to a preset fluctuation range, wherein the loss function of the DNN model is a four-tuple loss function determined by the differences between the embedded vector obtained by inputting the feature vector of the anchor point into the DNN model and the embedded vectors respectively obtained by inputting the feature vector of the positive sample, the feature vector of the negative sample and the feature vector of the false sample into the DNN model; the four-tuple loss function is:
L=d(a,p)-d(a,n)-k*d(a,F);
wherein L represents the four-tuple loss function, a represents the embedded vector obtained after the feature vector of the anchor point is input into the DNN model, p represents the embedded vector obtained after the feature vector of the positive sample is input into the DNN model, n represents the embedded vector obtained after the feature vector of the negative sample is input into the DNN model, F represents the embedded vector obtained after the feature vector of the false sample is input into the DNN model, k is a coefficient, d(a,p) represents the distance between a and p, d(a,n) represents the distance between a and n, and d(a,F) represents the distance between a and F;
inputting the feature vector of the medical record data to be predicted into a trained DNN model for processing to obtain a target embedded vector corresponding to the medical record data to be predicted;
sequentially inputting the feature vectors of all the false samples in the at least 2 training samples into a trained DNN model for processing to obtain embedded vectors corresponding to all the false samples, wherein one false sample corresponds to one embedded vector;
determining average value vectors among the embedded vectors corresponding to all the false samples as quality embedded vectors;
if the distance between the target embedded vector and the quality embedded vector is greater than or equal to a preset quality abnormal distance, determining that the quality of the medical record data to be predicted is unqualified;
And if the distance between the target embedded vector and the quality embedded vector is smaller than the quality abnormal distance, determining that the quality of the medical record data to be predicted is qualified.
2. The method of claim 1, wherein after the determining that the quality of the medical record data to be predicted is acceptable, the method further comprises:
and determining the category of the medical record data to be predicted according to the distance between the target embedded vector and each category embedded vector and the category distance corresponding to each category embedded vector.
3. The method according to claim 2, wherein determining the category of the medical record data to be predicted according to the distance between the target embedding vector and each category embedding vector and the category distance corresponding to each category embedding vector includes:
if the distance between the target embedded vector and the category embedded vector w in each category embedded vector is smaller than or equal to the category distance corresponding to the category embedded vector w, determining the category of the medical record data to be predicted as a first category, wherein the first category is the category corresponding to the category embedded vector w.
4. A method according to claim 3, characterized in that the method further comprises:
if the distance between the target embedded vector and each category embedded vector is larger than the category distance corresponding to that category embedded vector, determining the category of the medical record data to be predicted as a second category, wherein the second category is different from the category corresponding to each category embedded vector.
5. A data processing apparatus, comprising:
an acquisition unit, configured to acquire at least 2 training samples, wherein each training sample in the at least 2 training samples is a quadruple, the quadruple comprises a feature vector of an anchor point, a feature vector of a positive sample, a feature vector of a negative sample and a feature vector of a false sample, the anchor point is medical record data with qualified quality, the positive sample is medical record data with the same category as the anchor point and qualified quality, the negative sample is medical record data with a different category from the anchor point and qualified quality, and the false sample is medical record data with unqualified quality;
a training unit, configured to sequentially input the at least 2 training samples into a constructed deep neural network (DNN) model for training, so that the loss function of the trained DNN model is reduced to a preset fluctuation range, wherein the loss function of the DNN model is a four-tuple loss function determined by the differences between the embedded vector obtained by inputting the feature vector of the anchor point into the DNN model and the embedded vectors respectively obtained by inputting the feature vector of the positive sample, the feature vector of the negative sample and the feature vector of the false sample into the DNN model; the four-tuple loss function is:
L=d(a,p)-d(a,n)-k*d(a,F);
wherein L represents the four-tuple loss function, a represents the embedded vector obtained after the feature vector of the anchor point is input into the DNN model, p represents the embedded vector obtained after the feature vector of the positive sample is input into the DNN model, n represents the embedded vector obtained after the feature vector of the negative sample is input into the DNN model, F represents the embedded vector obtained after the feature vector of the false sample is input into the DNN model, k is a coefficient, d(a,p) represents the distance between a and p, d(a,n) represents the distance between a and n, and d(a,F) represents the distance between a and F;
the processing unit is used for inputting the feature vector of the medical record data to be predicted into the trained DNN model for processing to obtain a target embedded vector corresponding to the medical record data to be predicted;
the processing unit is further configured to sequentially input the feature vectors of all the false samples in the at least 2 training samples into the trained DNN model for processing, so as to obtain the embedded vector corresponding to each false sample, wherein one false sample corresponds to one embedded vector; and to determine the mean vector of the embedded vectors corresponding to all the false samples as the quality embedded vector;
The first determining unit is used for determining that the quality of the medical record data to be predicted is unqualified when the distance between the target embedded vector and the quality embedded vector is larger than or equal to a preset quality abnormal distance; and when the distance between the target embedded vector and the quality embedded vector is smaller than the quality abnormal distance, determining that the quality of the medical record data to be predicted is qualified.
6. An electronic device comprising a processor and a memory, the processor and the memory being interconnected, wherein the memory is adapted to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1-4.
7. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-4.
CN202010412571.1A 2020-05-15 2020-05-15 Data processing method and device based on deep neural network Active CN111696636B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010412571.1A CN111696636B (en) 2020-05-15 2020-05-15 Data processing method and device based on deep neural network
PCT/CN2020/099539 WO2021114637A1 (en) 2020-05-15 2020-06-30 Deep neural network-based method and device for data processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010412571.1A CN111696636B (en) 2020-05-15 2020-05-15 Data processing method and device based on deep neural network

Publications (2)

Publication Number Publication Date
CN111696636A CN111696636A (en) 2020-09-22
CN111696636B true CN111696636B (en) 2023-09-22

Family

ID=72477848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010412571.1A Active CN111696636B (en) 2020-05-15 2020-05-15 Data processing method and device based on deep neural network

Country Status (2)

Country Link
CN (1) CN111696636B (en)
WO (1) WO2021114637A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883222B (en) * 2020-09-28 2020-12-22 平安科技(深圳)有限公司 Text data error detection method and device, terminal equipment and storage medium
CN112099739B (en) * 2020-11-10 2021-02-23 大象慧云信息技术有限公司 Classified batch printing method and system for paper invoices

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359669A (en) * 2018-09-10 2019-02-19 平安科技(深圳)有限公司 Method for detecting abnormality, device, computer equipment and storage medium are submitted an expense account in medical insurance
CN110597878A (en) * 2019-09-16 2019-12-20 广东工业大学 Cross-modal retrieval method, device, equipment and medium for multi-modal data
CN110598006A (en) * 2019-09-17 2019-12-20 南京医渡云医学技术有限公司 Model training method, triplet embedding method, apparatus, medium, and device
WO2020073507A1 (en) * 2018-10-11 2020-04-16 平安科技(深圳)有限公司 Text classification method and terminal
CN111062495A (en) * 2019-11-28 2020-04-24 深圳市华尊科技股份有限公司 Machine learning method and related device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103076334B (en) * 2013-01-25 2014-12-17 上海理工大学 Method for quantitatively evaluating perceived quality of digital printed lines and texts
CN106484681B (en) * 2015-08-25 2019-07-09 阿里巴巴集团控股有限公司 A kind of method, apparatus and electronic equipment generating candidate translation
CN108615044A (en) * 2016-12-12 2018-10-02 腾讯科技(深圳)有限公司 A kind of method of disaggregated model training, the method and device of data classification
CN110232675B (en) * 2019-03-28 2022-11-11 昆明理工大学 Texture surface defect detection and segmentation device and method in industrial environment

Also Published As

Publication number Publication date
WO2021114637A1 (en) 2021-06-17
CN111696636A (en) 2020-09-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40030005; Country of ref document: HK)
GR01 Patent grant