CN111599431A - Report sheet-based data coding model generation method, system and equipment - Google Patents

Report sheet-based data coding model generation method, system and equipment

Info

Publication number
CN111599431A
CN111599431A CN202010242017.3A
Authority
CN
China
Prior art keywords
report
initial training
data
training model
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010242017.3A
Other languages
Chinese (zh)
Inventor
陶然
赵利伟
杨苗
刘敏
吴佳丽
续静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan Kingmed Clinic Examination Co., Ltd.
Original Assignee
Taiyuan Kingmed Clinic Examination Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan Kingmed Clinic Examination Co., Ltd.
Priority to CN202010242017.3A
Publication of CN111599431A
Legal status: Pending (Current)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00 ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a report-sheet-based data coding model generation method, which comprises the following steps: initializing network parameters in a pre-constructed initial training model, wherein the initial training model comprises an encoder and a decoder, and the network parameters comprise encoder parameters and decoder parameters; causing the initial training model to enter a first loop iteration according to a first preset number of loops; calculating a loss value of a preset loss function; using the loss value to correct the network parameters through a back propagation algorithm; causing the initial training model to enter a second loop iteration according to a second preset number of loops; and splitting the initial training model, so as to split off the encoder from the initial training model as a data coding model. The invention also discloses a report-sheet-based data coding model generation system, device and storage medium. The data coding model generated by the embodiments of the invention can learn nonlinear feature representations, which helps improve the effect of subsequent task algorithms.

Description

Report sheet-based data coding model generation method, system and equipment
Technical Field
The invention relates to the field of data coding, in particular to a report-sheet-based data coding model generation method, system and device.
Background
Currently, result analysis for a medical detection report mainly analyzes the result values of the detection items in a certain type of report, and compares the detected result values with statistical reference values to obtain a final report result. Most report results are documented through extensive testing and clinical performance during patient treatment, but there is still considerable research and mining space for examining report results. Detecting an examinee with multiple detection methods at one specific time point can improve the accuracy of the detection result, give a more comprehensive view of the current state of the organism, and provide more detailed physical data of the patient for clinical treatment. However, as the number of detection items and accumulated reports increases, the challenge grows. The main reason is that human biological state information is projected into a high-dimensional data space through the detection results; it is increasingly difficult to analyze the correlations between detection items and clinical manifestations with conventional statistical methods, and feature engineering on the detection items is inefficient, making the whole detection item data analysis process long and expensive. There is therefore an urgent need for a data coding model that codes the detection item data to extract the data characteristics of a detection report.
Disclosure of Invention
The embodiments of the invention aim to provide a report-sheet-based data coding model generation method, system, device and storage medium, wherein the generated data coding model can learn nonlinear feature representations, which helps improve the effect of subsequent task algorithms.
In order to achieve the above object, an embodiment of the present invention provides a method for generating a data coding model based on a report, including:
initializing network parameters in a pre-constructed initial training model; wherein the initial training model comprises an encoder and a decoder, and the network parameters comprise encoder parameters and decoder parameters;
causing the initial training model to enter a first loop iteration according to a first preset number of loops;
calculating a loss value of a preset loss function;
using the loss value to correct the network parameters through a back propagation algorithm;
causing the initial training model to enter a second loop iteration according to a second preset number of loops;
splitting the initial training model, so as to split off the encoder from the initial training model as a data coding model.
Compared with the prior art, the report-sheet-based data coding model generation method disclosed by the embodiment of the invention proceeds as follows: first, network parameters in a pre-constructed initial training model are initialized; then the initial training model enters a first loop iteration according to a first preset number of loops, the loss value of a preset loss function is calculated, the loss value is used to correct the network parameters through a back propagation algorithm, and the initial training model enters a second loop iteration according to a second preset number of loops; finally, the initial training model is split, and the encoder is split off from the initial training model as the data coding model. The data coding model generated by this method can learn nonlinear feature representations, helps improve the effect of subsequent task algorithms, adopts an unsupervised algorithm, is convenient to operate, and can save a large amount of manual labeling cost.
As an improvement of the above solution, the encoder is configured to input result list data obtained by encoding the data in a report in advance, so as to output a mean and a variance of the result list data; the data in the report sheet comprise nominal variables of the detection items and detection result data, and the nominal variables comprise at least one of the units of the detection items, the names of the reagents adopted, and the names of the detection equipment used in the detection process.
As an improvement of the above scheme, the calculating a loss value of the preset loss function specifically includes:
sampling a group of random numbers in a preset standard normal distribution;
adding the mean value and the variance to the random number respectively to obtain a latent variable;
and inputting the latent variable into the decoder, and calculating a loss value through a preset loss function.
As an improvement of the above, the decoder is configured to input the latent variable to output regenerated result list data.
As an improvement of the above, the method further comprises:
and adjusting network parameters of the initial training model by using a random gradient descent algorithm.
In order to achieve the above object, an embodiment of the present invention further provides a system for generating a data coding model based on a report, including:
the network parameter initialization module is used for initializing network parameters in a pre-constructed initial training model; wherein the initial training model comprises an encoder and a decoder, and the network parameters comprise encoder parameters and decoder parameters;
a first loop iteration module, configured to cause the initial training model to enter a first loop iteration according to a first preset number of loops;
a loss value calculation module, configured to calculate the loss value of a preset loss function;
a network parameter correction module, configured to use the loss value to correct the network parameters through a back propagation algorithm;
a second loop iteration module, configured to cause the initial training model to enter a second loop iteration according to a second preset number of loops;
and a data coding model generation module, configured to split the initial training model, so as to split off the encoder from the initial training model as a data coding model.
Compared with the prior art, the report-sheet-based data coding model generation system disclosed by the embodiment of the invention works as follows: first, the network parameter initialization module initializes network parameters in a pre-constructed initial training model; then the first loop iteration module causes the initial training model to enter a first loop iteration according to a first preset number of loops, the loss value calculation module calculates the loss value of a preset loss function, the network parameter correction module uses the loss value to correct the network parameters through a back propagation algorithm, and the second loop iteration module causes the initial training model to enter a second loop iteration according to a second preset number of loops; finally, the data coding model generation module splits the initial training model and splits off the encoder from the initial training model as the data coding model. The data coding model generated by this system can learn nonlinear feature representations, helps improve the effect of subsequent report task algorithms, adopts an unsupervised algorithm, is convenient to operate, and can save a large amount of manual labeling cost.
As an improvement of the above solution, the encoder is configured to input result list data obtained by encoding the data in a report in advance, so as to output a mean and a variance of the result list data; the data in the report sheet comprise nominal variables of the detection items and detection result data, and the nominal variables comprise at least one of the units of the detection items, the names of the reagents adopted, and the names of the detection equipment used in the detection process.
As an improvement of the above scheme, the loss value calculation module is specifically configured to:
sampling a group of random numbers in a preset standard normal distribution;
adding the mean value and the variance to the random number respectively to obtain a latent variable;
and inputting the latent variable into the decoder, and calculating a loss value through a preset loss function.
In order to achieve the above object, an embodiment of the present invention further provides a report-based data coding model generation device, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor; when the processor executes the computer program, it implements the report-based data coding model generation method according to any one of the above embodiments.
In order to achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, which includes a stored computer program; when the computer program runs, the device where the computer-readable storage medium is located is controlled to execute the report-sheet-based data coding model generation method according to any one of the above embodiments.
Drawings
FIG. 1 is a flow chart of a report-based data coding model generation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the main framework of a variational autoencoder according to an embodiment of the present invention;
FIG. 3 is a block diagram of an initial training model according to an embodiment of the present invention;
FIG. 4 is a block diagram of a report-based data coding model generation system according to an embodiment of the present invention;
fig. 5 is a block diagram of a data coding model generating device based on a report according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a report-based data coding model generation method according to an embodiment of the present invention; the report-sheet-based data coding model generation method comprises the following steps:
s1, initializing network parameters in a pre-constructed initial training model; wherein the initial training model comprises an encoder and a decoder, and the network parameters comprise encoder parameters and decoder parameters;
s2, enabling the initial training model to enter one cycle iteration according to the first preset cycle times;
s3, calculating the loss value of the preset loss function;
s4, using the loss value for correcting the network parameter through a back propagation algorithm;
s5, enabling the initial training model to enter secondary cycle iteration according to a second preset cycle number;
and S6, splitting the initial training model to split the encoder as a data coding model from the initial training model.
It should be noted that the report-sheet-based data coding model generation method according to the embodiment of the present invention is used to generate a data coding model, and the data coding model can code the data in a report sheet so as to complete the analysis of the feature information in the report sheet. Illustratively, the report sheet is a detection report of a patient; it can be an electronic report sheet, or an electronic report sheet generated after a paper report sheet (handwritten by doctors/patients) is automatically recognized by a machine, so that the information in the report sheet can be automatically extracted and the detailed data in the report sheet determined. The process of recognizing/extracting information from the report sheet may follow data processing in the prior art, and the present invention is not limited in this respect.
It should be noted that the initial training model evolves from a variational autoencoder (VAE), one of the most promising methods for unsupervised learning. The overall VAE architecture is shown in fig. 2; it inherits the architecture of a conventional autoencoder, which consists of two parts, an encoder and a decoder. The encoder is an inference model that mainly completes data encoding and feature extraction, and the decoder is a generation model that completes data sampling and generation. A variational autoencoder (VAE) learns a data-generating distribution and allows samples to be drawn at random from the latent space; these samples can then be decoded using the decoder network to generate new data with features similar to those of the training data.
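For concreteness, the following is a minimal sketch of such an encoder/decoder pair in Python, assuming PyTorch and fully connected layers; the input width, hidden width and latent dimension are illustrative assumptions, not values fixed by this embodiment.

    # Minimal VAE encoder/decoder sketch, assuming PyTorch; all layer sizes
    # are illustrative assumptions, not values fixed by this embodiment.
    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        """Inference model: encodes result list data into a mean and a log-variance."""
        def __init__(self, input_dim=256, hidden_dim=128, latent_dim=32):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
            self.fc_mean = nn.Linear(hidden_dim, latent_dim)    # mean vector
            self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance vector

        def forward(self, x):
            h = self.body(x)
            return self.fc_mean(h), self.fc_logvar(h)

    class Decoder(nn.Module):
        """Generation model: decodes a latent variable back into result list data."""
        def __init__(self, latent_dim=32, hidden_dim=128, output_dim=256):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, output_dim), nn.Sigmoid())  # no BN, per the text

        def forward(self, z):
            return self.body(z)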
To solve the variational inference problem, the main approaches are Markov chain Monte Carlo (MCMC) and variational inference (VI). In variational inference, a distribution q(z) is used to approximate the posterior distribution p(z|x), and the model is optimized by minimizing the KL divergence between the two distributions q(z) and p(z|x). The formula for the KL divergence is shown below:
KL(q(z) || p(z|x)) = E_{z~q(z)}[log q(z) - log p(z|x)] = log p(x) - E_{z~q(z)}[log p(x, z) - log q(z)]    (1)

where the last expectation, E_{z~q(z)}[log p(x, z) - log q(z)], is the evidence lower bound (ELBO).
since p (x) is an unknown constant, ELBO can be indirectly maximized to optimize the entire network. Since the VAE optimizes the model by optimizing the ELBO, its encoding mode and decoding model can be trained simultaneously. The variational encoder uses a re-parameterization technique to solve the problems of calculation and gradient back-transmission of the KL divergence. Unlike conventional self-coder models, instead of generating one implicit vector at a time, two vectors are generated, one representing the mean and one representing the standard deviation, and the implicit vector is then synthesized from the two statistics and a random noise that follows a standard normal distribution.
Referring to fig. 3, fig. 3 is a block diagram of an initial training model according to an embodiment of the present invention. The initial training model consists of an encoder and a decoder, both of which process data using deep neural networks. Illustratively, in fig. 3 the encoder is on the left side of the dashed box below N(0, 1) and the decoder is on the right.
The encoder employs conventional 2D convolution operations or fully connected operators. The encoder is used to input result list data obtained by encoding the data in a report in advance, so as to output the mean and variance of the result list data; the data in the report sheet comprise nominal variables of the detection items and detection result data, and the nominal variables comprise at least one of the units of the detection items, the names of the reagents adopted, and the names of the detection equipment used in the detection process. The nominal variables and the detection result data are encoded in advance; when the encoder takes the encoded nominal variables and the encoded detection result data as input, it encodes them again to generate the mean and variance of the result list data.
Specifically, the process of encoding the nominal variables and the detection result data includes steps S11 to S15.
And S11, acquiring the nominal variables in the detection items, and coding each nominal variable according to its number of values.
The number of values of each nominal variable is determined according to a preset value rule; whether the number of values of the current nominal variable is greater than or equal to a preset threshold is judged; if so, the nominal variable is encoded with hash encoding; if not, the nominal variable is encoded with one-hot encoding.
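A sketch of this branching rule follows (Python; the threshold and the hash bucket count are hypothetical values, since the preset threshold is not fixed in the text):

    import hashlib

    def encode_nominal(value, vocab, threshold=50, hash_buckets=64):
        # vocab maps each known value of this nominal variable to an index.
        if len(vocab) >= threshold:
            # Hash encoding: stable bucket index for high-cardinality variables.
            digest = hashlib.md5(value.encode("utf-8")).hexdigest()
            bucket = int(digest, 16) % hash_buckets
            vec = [0] * hash_buckets
            vec[bucket] = 1
            return vec
        # One-hot encoding for low-cardinality variables.
        vec = [0] * len(vocab)
        vec[vocab[value]] = 1
        return vec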
And S12, acquiring detection result data in the detection items, and preprocessing the detection result data according to the type of the detection result data.
When the detection result data is continuous, it is normalized; when the detection result data is discrete, it is given spatially equidistant codes within a preset range.
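A minimal sketch of this preprocessing (Python; min-max normalization and the [0, preset_max] code range are assumptions, since the text only specifies "normalization" and "spatially equidistant coding within a preset value"):

    def normalize_continuous(x, x_min, x_max):
        # Min-max normalization of a continuous detection result into [0, 1].
        return (x - x_min) / (x_max - x_min)

    def encode_discrete(level, num_levels, preset_max=1.0):
        # Spatially equidistant coding: spread the discrete levels evenly
        # within [0, preset_max]; `level` is the 0-based index of the value.
        return preset_max * level / (num_levels - 1)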
And S13, encoding the preprocessed detection result data. There are four coding modes: vector dimension coding, time dimension coding, matrix dimension coding, and tensor dimension coding.
Scheme 1: vector dimension coding, i.e., arranging the detection result data transversely according to preset detection items. The detection result data corresponding to detection items not currently detected are left empty, and their positions in the arrangement are reserved. The detection items, i.e. the unique identifiers of the detection items in the laboratory, are generally arranged in order, which makes writing and reading back the program coding results convenient.
Scheme 2: time dimension coding, i.e., ordering the detection result data according to the time at which they were generated. Items without detection results must be eliminated. For example, if a barcode covers 7 of 2000 detection items, then the vector contains only the 7 detection result data that have undergone normalization/spatially equidistant coding.
Scheme 3: matrix dimension coding, i.e., arranging the detection result data according to a preset arrangement rule, where the preset arrangement rule divides hierarchically according to the category, department and/or subject of the detection item corresponding to each detection result datum. Specifically, the detection result data of the master barcode are arranged as a two-dimensional table. Because the results of the detection items are correlated, an unreasonable arrangement of the detection items in the two-dimensional table may hinder the neural network from extracting the relevant information, so the arrangement rule of the detection items needs to be specially designed.
Scheme 4: tensor dimension coding, i.e., ordering the detection result data according to a preset three-dimensional model. The three-dimensional model is presented in the form of a three-dimensional table (tensor) and comprises a number of slices (channels) representing different test packages, each slice containing a number of the detection result data.
And S14, randomly scrambling the coded detection result data.
The analysis result of the report sheet for the same master barcode should not be influenced by the arrangement of the detection items; that is, the arrangement order in schemes 1 to 4 should not influence the overall analysis result. The encoded data is therefore allowed to be randomly scrambled along different dimensions before being fed to the deep learning model. For example, in scheme 2 the order of the detection items should be randomly adjustable, in scheme 3 the subjects may be randomly scrambled left and right, and in scheme 4 the subjects are randomly scrambled along the slice dimension (channel); the analysis values before and after scrambling remain self-consistent.
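A sketch of this scrambling step (Python, assuming NumPy; only the arrangement along the shuffled axis changes, the values themselves are untouched):

    import numpy as np

    rng = np.random.default_rng()

    def shuffle_items(vec):
        # Scheme 2: randomly reorder the detection result vector.
        return rng.permutation(vec)

    def shuffle_channels(tensor):
        # Scheme 4: shuffle along the slice (channel) dimension only.
        idx = rng.permutation(tensor.shape[0])
        return tensor[idx]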
And S15, combining the coded nominal variable, the coded detection result data and the randomly scrambled coded detection result data to output the coding result of the detection item.
The decoder network employs conventional 2D convolution operations or fully connected operators, and contains no batch normalization (BN) operators. The decoder is used to input the latent variable so as to output regenerated result list data.
Illustratively, the loss function of the whole network model comprises two parts, a reconstruction loss function and a regularization loss function. The reconstruction loss function mainly ensures that the distribution of the data generated by the decoder is as consistent as possible with the distribution of the real data, and is calculated using cross entropy; the regularization loss function mainly constrains the distribution of the latent variables sampled by the encoder to be consistent with the standard normal distribution. The overall loss function is expressed by the following formula:
L(θ, φ) = E_{x~p_data}[ E_{z~q_φ(z|x)}[-log p_θ(x|z)] + β · KL(q_φ(z|x) || p(z)) ]    (2)
where x represents the input data, i.e. the encoded result list; z represents the encoded latent variable, i.e. the result of the normal-distribution sampling step in the structure diagram; θ represents the parameters of the decoder; φ represents the parameters of the encoder; p_θ represents the decoder, or generation network; q_φ represents the encoder; p(z) represents the standard normal distribution from which samples are drawn; x ~ p_data represents data sampled from the result list dataset to be trained; z ~ q_φ(z|x) represents the latent variable z sampled when the input data is x; and β is a hyper-parameter, mainly used to adjust the weight of the KL divergence loss in the overall loss, which can be used to control the disentanglement strength between different dimensions of the result list latent variables.
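A sketch of equation (2) in code (assuming PyTorch, cross entropy for the reconstruction term, and the closed-form KL divergence between the encoder's Gaussian and N(0, I)):

    import torch
    import torch.nn.functional as F

    def vae_loss(x, x_recon, mean, logvar, beta=1.0):
        # Reconstruction loss: cross entropy between regenerated and real data.
        recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
        # Regularization loss: KL(q_phi(z|x) || N(0, I)) in closed form.
        kl = -0.5 * torch.sum(1 + logvar - mean.pow(2) - logvar.exp())
        # beta weights the KL term and controls latent disentanglement strength.
        return recon + beta * kl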
Specifically, in step S1, the encoder parameters and decoder parameters are initialized using a truncated random Gaussian distribution.
Specifically, in step S2, an iterative loop is entered; the first preset number of loops is n epochs, where the specific value of n is an empirical parameter.
Specifically, in step S3, the result list data of one batch in the training set is read into memory, and the loss value of the preset loss function is calculated. The preset loss function is the loss function in equation (2) above.
Preferably, the calculating the loss value of the preset loss function specifically includes: sampling a group of random numbers in a preset standard normal distribution; adding the mean value and the variance to the random number respectively to obtain a latent variable; and inputting the latent variable into the decoder, and calculating a loss value through a preset loss function.
Specifically, in step S4, the loss value is used to correct the encoder parameter and the decoder parameter by a back propagation algorithm.
Specifically, in step S5, after the back propagation algorithm is completed, the loop continues until the number of iterations reaches the second preset number of loops.
Specifically, in step S6, the trained initial training model is frozen and pruned. The split-off encoder is the finished data encoder model: it takes result list data as input and outputs a dense, dimensionality-reduced feature vector. The split-off decoder is a result list generation model: it takes multivariate Gaussian random noise as input and outputs a generated result list.
Further, a stochastic gradient descent algorithm is used to optimize the network parameters of the initial training model. Illustratively, the stochastic gradient descent algorithm is SGD with a learning rate of 0.0001.
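Putting steps S1 to S6 together, a hedged end-to-end sketch follows (PyTorch, reusing the Encoder/Decoder, reparameterize and vae_loss sketches above; the epoch counts and the `loader` iterable are assumptions, while the truncated Gaussian initialization and the SGD learning rate of 0.0001 come from the text):

    import torch
    import torch.nn as nn

    def init_truncated(module):
        # S1: initialize weights with a truncated random Gaussian.
        if isinstance(module, nn.Linear):
            nn.init.trunc_normal_(module.weight, std=0.02)

    def train(encoder, decoder, loader, first_epochs=100, second_epochs=100, beta=1.0):
        encoder.apply(init_truncated)
        decoder.apply(init_truncated)
        params = list(encoder.parameters()) + list(decoder.parameters())
        optimizer = torch.optim.SGD(params, lr=0.0001)
        for epoch in range(first_epochs + second_epochs):      # S2/S5: loop iterations
            for x in loader:                                   # one batch of result lists
                mean, logvar = encoder(x)
                z = reparameterize(mean, logvar)
                x_recon = decoder(z)
                loss = vae_loss(x, x_recon, mean, logvar, beta)  # S3: loss value
                optimizer.zero_grad()
                loss.backward()                                # S4: back propagation
                optimizer.step()
        return encoder  # S6: split off the encoder as the data coding model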
Compared with the prior art, the report-sheet-based data coding model generation method disclosed by the embodiment of the invention proceeds as follows: first, network parameters in a pre-constructed initial training model are initialized; then the initial training model enters a first loop iteration according to a first preset number of loops, the loss value of a preset loss function is calculated, the loss value is used to correct the network parameters through a back propagation algorithm, and the initial training model enters a second loop iteration according to a second preset number of loops; finally, the initial training model is split, and the encoder is split off from the initial training model as the data coding model.
The data coding model generated by this method can learn nonlinear feature representations, helps improve the effect of subsequent task algorithms, adopts an unsupervised algorithm, is convenient to operate, and can save a large amount of manual labeling cost. Compared with earlier self-coding feature learning methods, it can extract more feature information, and the spatial disentangling of latent variables is superior to that of a conventional autoencoder. Compared with GAN-based methods, the variational autoencoder method has a more stable training process and requires less time. It can also reduce the dimensionality of the data, and the length of the learned feature variable can be adjusted according to actual requirements.
Referring to fig. 4, fig. 4 is a block diagram of a report-sheet-based data coding model generation system 10 according to an embodiment of the present invention, where the system 10 includes:
a network parameter initialization module 11, configured to initialize network parameters in a pre-constructed initial training model; wherein the initial training model comprises an encoder and a decoder, and the network parameters comprise encoder parameters and decoder parameters;
a first loop iteration module 12, configured to cause the initial training model to enter a first loop iteration according to a first preset number of loops;
a loss value calculation module 13, configured to calculate a loss value of a preset loss function;
a network parameter correction module 14, configured to use the loss value to correct the network parameters through a back propagation algorithm;
a second loop iteration module 15, configured to cause the initial training model to enter a second loop iteration according to a second preset number of loops;
and a data coding model generation module 16, configured to split the initial training model, so as to split off the encoder from the initial training model as a data coding model.
It should be noted that the report-sheet-based data coding model generation system 10 according to the embodiment of the present invention is used to generate a data coding model capable of coding the data in a report sheet so as to complete the analysis of the feature information in the report sheet. Illustratively, the report sheet is a detection report of a patient; it can be an electronic report sheet, or an electronic report sheet generated after a paper report sheet (handwritten by doctors/patients) is automatically recognized by a machine, so that the information in the report sheet can be automatically extracted and the detailed data in the report sheet determined. The process of recognizing/extracting information from the report sheet may follow data processing in the prior art, and the present invention is not limited in this respect.
The initial training model includes an encoder and a decoder, both of which process data using deep neural networks. The encoder employs conventional 2D convolution operations or fully connected operators. The encoder is used to input result list data obtained by encoding the data in a report in advance, so as to output the mean and variance of the result list data; the data in the report sheet comprise nominal variables of the detection items and detection result data, and the nominal variables comprise at least one of the units of the detection items, the names of the reagents adopted, and the names of the detection equipment used in the detection process. The nominal variables and the detection result data are encoded in advance; when the encoder takes the encoded nominal variables and the encoded detection result data as input, it encodes them again to generate the mean and variance of the result list data.
Specifically, the process of encoding the nominal variables and the detection result data includes steps S11 to S15.
And S11, acquiring the nominal variables in the detection items, and coding each nominal variable according to its number of values.
The number of values of each nominal variable is determined according to a preset value rule; whether the number of values of the current nominal variable is greater than or equal to a preset threshold is judged; if so, the nominal variable is encoded with hash encoding; if not, the nominal variable is encoded with one-hot encoding.
And S12, acquiring detection result data in the detection items, and preprocessing the detection result data according to the type of the detection result data.
When the detection result data is continuous, it is normalized; when the detection result data is discrete, it is given spatially equidistant codes within a preset range.
And S13, encoding the preprocessed detection result data. There are four coding modes: vector dimension coding, time dimension coding, matrix dimension coding, and tensor dimension coding.
Scheme 1: vector dimension coding, i.e., arranging the detection result data transversely according to preset detection items. The detection result data corresponding to detection items not currently detected are left empty, and their positions in the arrangement are reserved. The detection items, i.e. the unique identifiers of the detection items in the laboratory, are generally arranged in order, which makes writing and reading back the program coding results convenient.
Scheme 2: time dimension coding, i.e., ordering the detection result data according to the time at which they were generated. Items without detection results must be eliminated. For example, if a barcode covers 7 of 2000 detection items, then the vector contains only the 7 detection result data that have undergone normalization/spatially equidistant coding.
Scheme 3: matrix dimension coding, i.e., arranging the detection result data according to a preset arrangement rule, where the preset arrangement rule divides hierarchically according to the category, department and/or subject of the detection item corresponding to each detection result datum. Specifically, the detection result data of the master barcode are arranged as a two-dimensional table. Because the results of the detection items are correlated, an unreasonable arrangement of the detection items in the two-dimensional table may hinder the neural network from extracting the relevant information, so the arrangement rule of the detection items needs to be specially designed.
Scheme 4: tensor dimension coding, i.e., ordering the detection result data according to a preset three-dimensional model. The three-dimensional model is presented in the form of a three-dimensional table (tensor) and comprises a number of slices (channels) representing different test packages, each slice containing a number of the detection result data.
And S14, randomly scrambling the coded detection result data.
The analysis result of the report sheet for the same master barcode should not be influenced by the arrangement of the detection items; that is, the arrangement order in schemes 1 to 4 should not influence the overall analysis result. The encoded data is therefore allowed to be randomly scrambled along different dimensions before being fed to the deep learning model. For example, in scheme 2 the order of the detection items should be randomly adjustable, in scheme 3 the subjects may be randomly scrambled left and right, and in scheme 4 the subjects are randomly scrambled along the slice dimension (channel); the analysis values before and after scrambling remain self-consistent.
And S15, combining the coded nominal variable, the coded detection result data and the randomly scrambled coded detection result data to output the coding result of the detection item.
The decoder network employs conventional 2D convolution operations or fully connected operators, and contains no batch normalization (BN) operators. The decoder is used to input the latent variable so as to output regenerated result list data.
Illustratively, the loss function of the whole network model comprises two parts, a reconstruction loss function and a regularization loss function. The reconstruction loss function mainly ensures that the distribution of the data generated by the decoder is as consistent as possible with the distribution of the real data, and is calculated using cross entropy; the regularization loss function mainly constrains the distribution of the latent variables sampled by the encoder to be consistent with the standard normal distribution. The overall loss function is expressed by the following formula:
L(θ, φ) = E_{x~p_data}[ E_{z~q_φ(z|x)}[-log p_θ(x|z)] + β · KL(q_φ(z|x) || p(z)) ]    (2)
where x represents the input data, i.e. the encoded result list; z represents the encoded latent variable, i.e. the result of the normal-distribution sampling step in the structure diagram; θ represents the parameters of the decoder; φ represents the parameters of the encoder; p_θ represents the decoder, or generation network; q_φ represents the encoder; p(z) represents the standard normal distribution from which samples are drawn; x ~ p_data represents data sampled from the result list dataset to be trained; z ~ q_φ(z|x) represents the latent variable z sampled when the input data is x; and β is a hyper-parameter, mainly used to adjust the weight of the KL divergence loss in the overall loss, which can be used to control the disentanglement strength between different dimensions of the result list latent variables.
Specifically, the network parameter initialization module 11 initializes the encoder parameters and decoder parameters using a truncated random Gaussian distribution. The first loop iteration module 12 causes the initial training model to enter a first loop iteration according to a first preset number of loops; the first preset number of loops is n epochs, where the specific value of n is an empirical parameter. The loss value calculation module 13 reads the result list data of one batch in the training set into memory and calculates the loss value of the preset loss function: it first samples a group of random numbers from a preset standard normal distribution, then adds the mean and the variance to the random numbers respectively to obtain a latent variable, and finally inputs the latent variable into the decoder to calculate the loss value through the preset loss function.
The network parameter correction module 14 uses the loss value to correct the encoder parameters and the decoder parameters through a back propagation algorithm. The second loop iteration module 15 causes the initial training model to enter a second loop iteration according to a second preset number of loops. The data coding model generation module 16 freezes and prunes the trained initial training model. The split-off encoder is the finished data encoder model: it takes result list data as input and outputs a dense, dimensionality-reduced feature vector. The split-off decoder is a result list generation model: it takes multivariate Gaussian random noise as input and outputs a generated result list.
Further, the network parameter correction module 14 is further configured to optimize the network parameters of the initial training model using a stochastic gradient descent algorithm. Illustratively, the stochastic gradient descent algorithm is SGD with a learning rate of 0.0001.
Compared with the prior art, the report-sheet-based data coding model generation system 10 disclosed by the embodiment of the invention works as follows: first, the network parameter initialization module 11 initializes network parameters in a pre-constructed initial training model; then the first loop iteration module 12 causes the initial training model to enter a first loop iteration according to a first preset number of loops, the loss value calculation module 13 calculates the loss value of a preset loss function, the network parameter correction module 14 uses the loss value to correct the network parameters through a back propagation algorithm, and the second loop iteration module 15 causes the initial training model to enter a second loop iteration according to a second preset number of loops; finally, the data coding model generation module 16 splits the initial training model and splits off the encoder from the initial training model as the data coding model.
The data coding model generated by the report-based data coding model generation system 10 can learn nonlinear feature representations, helps improve the effect of subsequent task algorithms, adopts an unsupervised algorithm, is convenient to operate, and can save a large amount of manual labeling cost. Compared with earlier self-coding feature learning methods, it can extract more feature information, and the spatial disentangling of latent variables is superior to that of a conventional autoencoder. Compared with GAN-based methods, the variational autoencoder method has a more stable training process and requires less time. It can also reduce the dimensionality of the data, and the length of the learned feature variable can be adjusted according to actual requirements.
Referring to fig. 5, fig. 5 is a block diagram of a data coding model generating device 20 based on a report according to an embodiment of the present invention. The report sheet-based data encoding model generation device 20 of this embodiment includes: a processor 21, a memory 22 and a computer program stored in said memory 22 and executable on said processor 21. The processor 21, when executing the computer program, implements the steps of the above-mentioned report-based data coding model generation method embodiment, such as steps S1 to S6 shown in fig. 1. Alternatively, the processor 21, when executing the computer program, implements the functions of the modules/units in the above-mentioned device embodiments, such as the network parameter initialization module 11.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 22 and executed by the processor 21 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, used to describe the execution process of the computer program in the report-based data coding model generation apparatus 20. For example, the computer program may be divided into the network parameter initialization module 11, the first loop iteration module 12, the loss value calculation module 13, the network parameter correction module 14, the second loop iteration module 15, and the data coding model generation module 16; for the specific functions of each module, refer to the working process of the report-based data coding model generation system 10 described in the foregoing embodiment, which is not repeated here.
The report-based data encoding model generating device 20 may be a desktop computer, a notebook, a palm computer, a cloud server, or another computing device. It may include, but is not limited to, a processor 21 and a memory 22. Those skilled in the art will appreciate that the schematic diagram is merely an example of the report-based data coding model generating device 20 and does not constitute a limitation of it; the device may include more or fewer components than those shown, combine some components, or use different components; for example, it may further include input/output devices, network access devices, a bus, etc.
The processor 21 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general purpose processor may be a microprocessor, or the processor 21 may be any conventional processor; the processor 21 is the control center of the report-based data coding model generating device 20, and various interfaces and lines connect the various parts of the entire device.
The memory 22 may be used to store the computer programs and/or modules, and the processor 21 implements the various functions of the report-based data coding model generation apparatus 20 by running or executing the computer programs and/or modules stored in the memory 22 and calling the data stored in the memory 22. The memory 22 may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function or an image playing function); the data storage area may store data created according to the use of the device (such as audio data or a phonebook). In addition, the memory 22 may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or another non-volatile solid-state storage device.
Wherein, the modules/units integrated by the report-based data coding model generation device 20 can be stored in a computer readable storage medium if they are implemented in the form of software functional units and sold or used as independent products. Based on such understanding, all or part of the flow of the method according to the above embodiments may be implemented by a computer program, which may be stored in a computer readable storage medium and used by the processor 21 to implement the steps of the above embodiments of the method. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A report-based data coding model generation method is characterized by comprising the following steps:
initializing network parameters in a pre-constructed initial training model; wherein the initial training model comprises an encoder and a decoder, and the network parameters comprise encoder parameters and decoder parameters;
causing the initial training model to enter a first loop iteration according to a first preset number of loops;
calculating a loss value of a preset loss function;
using the loss value to correct the network parameters through a back propagation algorithm;
causing the initial training model to enter a second loop iteration according to a second preset number of loops; and
splitting the initial training model, so as to split off the encoder from the initial training model as a data coding model.
2. The report-based data coding model generation method of claim 1, wherein the encoder is configured to input result list data obtained by encoding the data in a report in advance, so as to output a mean and a variance of the result list data; wherein the data in the report sheet comprise nominal variables of the detection items and detection result data, and the nominal variables comprise at least one of the units of the detection items, the names of the reagents adopted, and the names of the detection equipment used in the detection process.
3. The report-sheet-based data coding model generation method according to claim 2, wherein the calculating a loss value of the preset loss function specifically includes:
sampling a group of random numbers in a preset standard normal distribution;
adding the mean value and the variance to the random number respectively to obtain a latent variable;
and inputting the latent variable into the decoder, and calculating a loss value through a preset loss function.
4. The report-based data coding model generation method of claim 3, wherein the decoder is configured to input the latent variable to output regenerated result manifest data.
5. The report-sheet based data coding model generation method of claim 1, wherein the method further comprises:
and adjusting network parameters of the initial training model by using a random gradient descent algorithm.
6. A report-based data coding model generation system, comprising:
the network parameter initialization module is used for initializing network parameters in a pre-constructed initial training model; wherein the initial training model comprises an encoder and a decoder, and the network parameters comprise encoder parameters and decoder parameters;
a first loop iteration module, configured to cause the initial training model to enter a first loop iteration according to a first preset number of loops;
a loss value calculation module, configured to calculate the loss value of a preset loss function;
a network parameter correction module, configured to use the loss value to correct the network parameters through a back propagation algorithm;
a second loop iteration module, configured to cause the initial training model to enter a second loop iteration according to a second preset number of loops; and
a data coding model generation module, configured to split the initial training model, so as to split off the encoder from the initial training model as a data coding model.
7. The report-based data coding model generation system of claim 6, wherein the encoder is configured to input result list data obtained by encoding the data in a report in advance, so as to output a mean and a variance of the result list data; wherein the data in the report sheet comprise nominal variables of the detection items and detection result data, and the nominal variables comprise at least one of the units of the detection items, the names of the reagents adopted, and the names of the detection equipment used in the detection process.
8. The report-based data coding model generation system of claim 7, wherein the loss value calculation module is specifically configured to:
sampling a group of random numbers in a preset standard normal distribution;
adding the mean value and the variance to the random number respectively to obtain a latent variable;
and inputting the latent variable into the decoder, and calculating a loss value through a preset loss function.
9. A report-based data coding model generation device, comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the report-based data coding model generation method according to any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the report-based data coding model generation method according to any one of claims 1 to 5.
CN202010242017.3A 2020-03-31 2020-03-31 Report sheet-based data coding model generation method, system and equipment Pending CN111599431A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010242017.3A CN111599431A (en) 2020-03-31 2020-03-31 Report sheet-based data coding model generation method, system and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010242017.3A CN111599431A (en) 2020-03-31 2020-03-31 Report sheet-based data coding model generation method, system and equipment

Publications (1)

Publication Number Publication Date
CN111599431A true CN111599431A (en) 2020-08-28

Family

ID=72181612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010242017.3A Pending CN111599431A (en) 2020-03-31 2020-03-31 Report sheet-based data coding model generation method, system and equipment

Country Status (1)

Country Link
CN (1) CN111599431A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107274029A (en) * 2017-06-23 2017-10-20 深圳市唯特视科技有限公司 A kind of future anticipation method of interaction medium in utilization dynamic scene
CN109543943A (en) * 2018-10-17 2019-03-29 国网辽宁省电力有限公司电力科学研究院 A kind of electricity price inspection execution method based on big data deep learning
CN109784249A (en) * 2019-01-04 2019-05-21 华南理工大学 A kind of scramble face identification method based on variation cascaded message bottleneck
CN109886388A (en) * 2019-01-09 2019-06-14 平安科技(深圳)有限公司 A kind of training sample data extending method and device based on variation self-encoding encoder
CN110910982A (en) * 2019-11-04 2020-03-24 广州金域医学检验中心有限公司 Self-coding model training method, device, equipment and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735544A (en) * 2020-12-30 2021-04-30 杭州依图医疗技术有限公司 Medical record data processing method and device and storage medium
CN112837739A (en) * 2021-01-29 2021-05-25 西北大学 Hierarchical feature phylogenetic model based on self-encoder and Monte Carlo tree
CN112837739B (en) * 2021-01-29 2022-12-02 西北大学 Hierarchical feature phylogenetic model based on self-encoder and Monte Carlo tree
CN112988854A (en) * 2021-05-20 2021-06-18 创新奇智(成都)科技有限公司 Complaint data mining method and device, electronic equipment and storage medium
CN113642716A (en) * 2021-08-31 2021-11-12 南方电网数字电网研究院有限公司 Depth variation autoencoder model training method, device, equipment and storage medium
CN117312161A (en) * 2023-10-07 2023-12-29 中国通信建设集团有限公司数智科创分公司 Intelligent detection system and method based on automatic login technology
CN117155403A (en) * 2023-10-31 2023-12-01 广东鑫钻节能科技股份有限公司 Data coding method of digital energy air compression station
CN117155403B (en) * 2023-10-31 2023-12-29 广东鑫钻节能科技股份有限公司 Data coding method of digital energy air compression station


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
    Address after: 030000 floor 3-6, block a, No.2, Longsheng street, Tanghuai Park, Taiyuan comprehensive reform demonstration area, Shanxi Province
    Applicant after: Taiyuan Jinyu clinical laboratory Co.,Ltd.
    Address before: 030000 floor 3-6, block a, No.2, Longsheng street, Tanghuai Park, Taiyuan comprehensive reform demonstration area, Shanxi Province
    Applicant before: TAIYUAN KINGMED CLINIC EXAMINATION Co.,Ltd.
RJ01 Rejection of invention patent application after publication
    Application publication date: 20200828