CN115424725A

CN115424725A - Data analysis method and device, storage medium and processor

Info

Publication number: CN115424725A
Application number: CN202210986021.XA
Authority: CN
Inventors: 孟令康; 贺勇; 张惠顺; 张顺; 叶旭辉; 曾震宇
Original assignee: Alibaba Cloud Computing Ltd
Current assignee: Alibaba Cloud Computing Ltd
Priority date: 2022-08-16
Filing date: 2022-08-16
Publication date: 2022-12-02

Abstract

The application discloses a data analysis method and device, a storage medium, and a processor. The method comprises: acquiring first data information from a target medical record of a target object, wherein the first data information at least comprises first attribute information and first visit information; mapping the first attribute information and the first visit information through a feature embedding module in a target prediction model to obtain a first target vector sequence; performing time sequence coding on the first target vector sequence through a multi-head self-attention module in the target prediction model to obtain a second target vector sequence; and obtaining target information corresponding to the target medical record according to the second target vector sequence, wherein the target information at least comprises a first target case label and a probability value for the first target case label. The application solves the technical problem in the related art that a neural network model predicting case labels from the characteristic information in a medical record has low accuracy.

Description

Data analysis method and device, storage medium and processor

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a data analysis method and apparatus, a storage medium, and a processor.

Background

As China's population ages year by year, predicting disease risk is becoming increasingly important. Disease risk prediction arises from the combination of artificial intelligence and medicine. With the development and application of machine learning technology, machine-learning-based methods have gradually become the mainstream approach. However, neural network models in the related art are trained on individual historical medical record data and do not adequately consider group attributes, and existing neural network models predict a case label only from the characteristic information of diseases in the medical record. As a result, the accuracy of the predicted probability values is not high, and the disease risk for a given medical record is difficult to predict.

No effective solution has yet been proposed for the problem in the related art that a neural network model predicting case labels from the characteristic information in a medical record has low accuracy.

Disclosure of Invention

The embodiment of the application provides a data analysis method and device, a storage medium and a processor, which are used for at least solving the technical problem that the accuracy of case label prediction of a neural network model through characteristic information in a medical record in the related technology is low.

According to an aspect of an embodiment of the present application, there is provided a method for analyzing data, including: acquiring first data information from a target medical record of a target object, wherein the first data information at least comprises: first attribute information and first visit information; mapping the first attribute information and the first visit information through a feature embedding module in a target prediction model to obtain a first target vector sequence; performing time sequence coding on the first target vector sequence through a multi-head self-attention module in the target prediction model to obtain a second target vector sequence; obtaining target information corresponding to the target medical record according to the second target vector sequence, wherein the target information at least comprises: a first target case label and a probability value for the first target case label.

Further, mapping the first attribute information and the first visit information through a feature embedding module in the target prediction model to obtain a first target vector sequence includes: randomly generating a first vector set of a preset dimension based on the diagnostic data information stored in a target storage area; mapping the first visit information in the target medical record, other than the visit date, into a first initial vector sequence according to the first vector set; encoding dates by absolute time to obtain a second vector set; mapping the visit date into a second initial vector sequence according to the second vector set; randomly generating a third vector set of the preset dimension based on the attribute information stored in the target storage area; mapping the first attribute information into a third initial vector sequence according to the third vector set; and splicing the first initial vector sequence, the second initial vector sequence and the third initial vector sequence to obtain the first target vector sequence.
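The embedding steps above can be illustrated with a minimal sketch. It is not the claimed implementation: the vocabularies `DIAG_VOCAB` and `ATTR_VOCAB`, the dimension `DIM`, the sinusoidal form of the absolute-time encoding, the summing of attribute vectors, and the choice to splice along the feature axis are all assumptions made only for illustration.

```python
import numpy as np
from datetime import date

rng = np.random.default_rng(0)
DIM = 8  # the "preset dimension"; value chosen arbitrarily, must be even here

# Hypothetical vocabularies standing in for the diagnostic and attribute
# information held in the target storage area.
DIAG_VOCAB = {"I10": 0, "E11": 1, "J45": 2}
ATTR_VOCAB = {"gender=F": 0, "gender=M": 1, "age=60-69": 2}

# First and third vector sets: randomly generated lookup tables.
diag_table = rng.normal(size=(len(DIAG_VOCAB), DIM))
attr_table = rng.normal(size=(len(ATTR_VOCAB), DIM))


def encode_date(d: date, dim: int = DIM) -> np.ndarray:
    """Absolute-time encoding (assumed sinusoidal over the day ordinal)."""
    t = d.toordinal()
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    out = np.empty(dim)
    out[0::2] = np.sin(t * freqs)
    out[1::2] = np.cos(t * freqs)
    return out


def embed_record(attrs, visits):
    """Return one spliced vector per visit, shape (n_visits, 3 * DIM)."""
    # Third initial vector: sum of the attribute embeddings (an assumption).
    attr_vec = attr_table[[ATTR_VOCAB[a] for a in attrs]].sum(axis=0)
    rows = []
    for code, visit_date in visits:
        diag_vec = diag_table[DIAG_VOCAB[code]]  # first initial vector
        date_vec = encode_date(visit_date)       # second initial vector
        rows.append(np.concatenate([diag_vec, date_vec, attr_vec]))
    return np.stack(rows) if rows else np.empty((0, 3 * DIM))


seq = embed_record(
    ["gender=F", "age=60-69"],
    [("I10", date(2022, 1, 5)), ("E11", date(2022, 3, 1))],
)
print(seq.shape)
```

In a trained model the lookup tables would be learned parameters; the random initialisation here only mirrors the "randomly generating" wording of the text.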

Further, obtaining target information corresponding to the target medical record according to the second target vector sequence includes: predicting through the second target vector sequence to obtain initial information, wherein the initial information consists of a plurality of initial case labels and probability values of the initial case labels; taking an initial case label with a probability value meeting a preset requirement as the first target case label; the first target case label and the probability value of the first target case label are taken as the target information.

Further, after obtaining the initial information, the method further includes: and if the probability values of the initial case labels do not accord with the preset requirements, the target prediction model outputs a first preset prompt.
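As a rough illustration of the selection step and the first preset prompt, the sketch below assumes that the "preset requirement" is a simple probability threshold. The function name `select_target_labels`, the threshold value, and the prompt text are hypothetical.

```python
FIRST_PRESET_PROMPT = "no case label meets the preset requirement"  # hypothetical text


def select_target_labels(initial_info, threshold=0.5):
    """Keep the initial case labels whose probability meets the threshold.

    initial_info maps each initial case label to its probability value.
    Returns (target_info, prompt); exactly one of the two is None.
    """
    target_info = {label: p for label, p in initial_info.items() if p >= threshold}
    if not target_info:
        # None of the initial case labels meets the requirement.
        return None, FIRST_PRESET_PROMPT
    return target_info, None


print(select_target_labels({"hypertension": 0.82, "heart disease": 0.31}))
print(select_target_labels({"hypertension": 0.12, "heart disease": 0.31}))
```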

Further, after obtaining the target information corresponding to the target medical record according to the second target vector sequence, the method further includes: selecting a vector sequence corresponding to each first target case label from the first target vector sequences to obtain a plurality of third target vector sequences; and predicting the time information of each first target case label through the target prediction model according to the second target vector sequence and the third target vector sequence to obtain a plurality of prediction times.

Further, the target prediction model is obtained by training through the following steps: determining a sample medical record; acquiring a fourth target vector sequence corresponding to second data information of the sample medical record, wherein the second data information at least comprises: second attribute information and second visit information; acquiring historical medical data information, wherein the historical medical data information at least comprises a medical record set; obtaining, by statistics over the historical medical data information, the probability values with which preset case labels of various types occur in the medical record set of a first target group, wherein the first target group is the case group corresponding to the case in the sample medical record; calculating the historical visit time interval in the sample medical record according to the fourth target vector sequence; constructing a mask according to the number of fourth target vector sequences, and masking the fourth target vector sequence according to the mask to obtain a processed fourth target vector sequence; and taking the processed fourth target vector sequence, the probability values with which the preset case labels of various types occur in the medical record set of the first target group, and the historical visit time interval as a training set, and training an initial prediction model according to the training set to generate the target prediction model.
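The three pieces of training data named above can be sketched as follows. This is an illustration under stated assumptions: the group statistic is taken to be the fraction of the group's records containing each label, the intervals are computed directly from visit dates rather than from the vector sequence, and the mask is assumed to be a lower-triangular (causal) mask whose size equals the number of vectors. The text does not specify the mask's form, and all function names are hypothetical.

```python
from collections import Counter
from datetime import date

import numpy as np


def group_label_probabilities(record_set, labels):
    """Fraction of records in the group's record set that contain each label."""
    wanted = set(labels)
    counts = Counter()
    for record_labels in record_set:
        counts.update(set(record_labels) & wanted)
    n = len(record_set)
    return {lab: (counts[lab] / n if n else 0.0) for lab in labels}


def visit_intervals(visit_dates):
    """Days between consecutive visits, taken in chronological order."""
    ordered = sorted(visit_dates)
    return [(b - a).days for a, b in zip(ordered, ordered[1:])]


def causal_mask(n):
    """Assumed mask form: row i keeps positions 0..i and hides later ones."""
    return np.tril(np.ones((n, n), dtype=bool))


def apply_mask(sequence, mask_row):
    """Zero out the vectors of a (n, d) sequence hidden by one mask row."""
    return sequence * mask_row[:, None]


probs = group_label_probabilities(
    [["I10", "E11"], ["I10"], [], ["J45"]], ["I10", "E11", "J45"]
)
gaps = visit_intervals([date(2022, 3, 1), date(2022, 1, 5), date(2022, 1, 15)])
print(probs, gaps)
```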

Further, training an initial prediction model according to the training set, and generating the target prediction model includes: inputting the training set into the initial prediction model to obtain the preset prediction probability values of various case labels and the time information of the various case labels; performing loss calculation according to each predicted probability value and probability values of various preset case labels appearing in the medical record set of the first target group to obtain a first predicted loss function; performing loss calculation according to the time information of the various types of case labels and the historical visit time interval to obtain a second prediction loss function; taking the first predictive loss function and the second predictive loss function as target loss functions; and training the initial prediction model according to the target loss function to obtain the target prediction model.
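One way to read the two-part target loss is sketched below. The text does not name the loss forms or how they are combined, so the binary cross-entropy for the label term, the mean squared error for the time term, and the weighted sum with `weight` are assumptions chosen for illustration.

```python
import numpy as np


def label_loss(pred_probs, group_probs, eps=1e-9):
    """First loss (assumed form): cross-entropy between the predicted
    probabilities and the group-level probabilities used as ground truth."""
    p = np.clip(np.asarray(pred_probs, dtype=float), eps, 1.0 - eps)
    t = np.asarray(group_probs, dtype=float)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))


def time_loss(pred_days, true_days):
    """Second loss (assumed form): mean squared error between the predicted
    time information and the historical visit time intervals, in days."""
    diff = np.asarray(pred_days, dtype=float) - np.asarray(true_days, dtype=float)
    return float(np.mean(diff ** 2))


def target_loss(pred_probs, group_probs, pred_days, true_days, weight=1.0):
    """Target loss as a weighted sum of the two terms (an assumption)."""
    return label_loss(pred_probs, group_probs) + weight * time_loss(pred_days, true_days)


print(target_loss([0.6, 0.2], [0.5, 0.25], [12, 45], [10, 45]))
```

With soft targets the cross-entropy term is smallest when the predicted probabilities equal the group probabilities, which matches the role of the group statistics as the value the model is pulled toward.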

Further, after obtaining the target information corresponding to the target medical record according to the second target vector sequence, the method further includes: and predicting case labels of a second target group and probability values of the case labels according to target information corresponding to the target medical records to obtain a prediction result, wherein the second target group is a case group corresponding to cases in the target medical records.

Further, after obtaining the target information corresponding to the target medical record according to the second target vector sequence, the method further includes: randomly sampling from the first target case labels according to the probability value of each first target case label of the target medical record to obtain a second target case label; inputting the vector sequence corresponding to the second target case label, obtained from the first target vector sequence, into the target prediction model for time prediction to obtain time information of the second target case label; and updating the second target vector sequence according to the second target case label and the time information of the second target case label to obtain an updated second target vector sequence, and continuing to process the updated second target vector sequence through the target prediction model until the target prediction model outputs a second preset prompt or the time information output by the target prediction model is greater than a preset time limit, wherein the second preset prompt indicates that the case label currently predicted by the target prediction model does not exist in the target medical record.
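The iterative procedure above can be sketched as a sampling loop. `ToyModel` is a hypothetical stand-in for the target prediction model, with invented method names `predict_labels` and `predict_time`. Treating the time information as cumulative days since the start, and representing the second preset prompt by a `None` return value, are also assumptions.

```python
import random


class ToyModel:
    """Hypothetical stand-in for the trained target prediction model."""

    def predict_labels(self, history):
        # Returning None plays the role of the second preset prompt.
        if len(history) >= 5:
            return None
        return {"I10": 0.7, "E11": 0.3}

    def predict_time(self, label, history):
        return 30.0  # predicted days until the sampled label occurs


def simulate_future(model, history, time_limit_days, rng=None, max_steps=100):
    """Sample a label, predict its time, update the history, and repeat."""
    rng = rng or random.Random(0)
    history = list(history)
    elapsed = 0.0
    for _ in range(max_steps):
        probs = model.predict_labels(history)
        if not probs:  # second preset prompt: stop
            break
        labels = list(probs)
        # Random selection weighted by each label's probability value.
        label = rng.choices(labels, weights=[probs[l] for l in labels], k=1)[0]
        elapsed += model.predict_time(label, history)
        if elapsed > time_limit_days:  # beyond the preset time limit: stop
            break
        history.append((label, elapsed))
    return history


print(simulate_future(ToyModel(), [], time_limit_days=100))
```

The `max_steps` guard is not part of the described procedure; it only prevents an unbounded loop if neither stopping condition is ever reached.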

According to another aspect of the embodiments of the present application, there is also provided a method for analyzing data, including: acquiring first data information which is sent by a client and acquired from a target medical record of a target object, wherein the first data information at least comprises: first attribute information and first visit information; mapping the first attribute information and the first visit information in a cloud server through a feature embedding module in a target prediction model to obtain a first target vector sequence; performing time sequence coding on the first target vector sequence through a multi-head self-attention module in the target prediction model to obtain a second target vector sequence; obtaining target information corresponding to the target medical record according to the second target vector sequence, wherein the target information at least comprises: a first target case label and a probability value for the first target case label; and returning the target information to the client.

According to another aspect of the embodiments of the present application, there is also provided an apparatus for analyzing data, including: a first obtaining unit, configured to obtain first data information from a target medical record of a target object, where the first data information at least includes: first attribute information and first visit information; the mapping unit is used for mapping the first attribute information and the first visit information through a feature embedding module in a target prediction model to obtain a first target vector sequence; the coding unit is used for carrying out time sequence coding on the first target vector sequence through a multi-head self-attention module in the target prediction model to obtain a second target vector sequence; an analysis unit, configured to obtain target information corresponding to the target medical record according to the second target vector sequence, where the target information at least includes: a first target case label and a probability value for the first target case label.

Further, the mapping unit includes: a first generation module, configured to randomly generate a first vector set of a preset dimension based on the diagnostic data information stored in a target storage area; a first mapping module, configured to map the first visit information in the target medical record, other than the visit date, into a first initial vector sequence according to the first vector set; an encoding module, configured to encode dates by absolute time to obtain a second vector set; a second mapping module, configured to map the visit date into a second initial vector sequence according to the second vector set; a second generation module, configured to randomly generate a third vector set of the preset dimension based on the attribute information stored in the target storage area; a third mapping module, configured to map the first attribute information into a third initial vector sequence according to the third vector set; and a splicing module, configured to splice the first initial vector sequence, the second initial vector sequence and the third initial vector sequence to obtain the first target vector sequence.

Further, the analysis unit includes: the prediction module is used for predicting through the second target vector sequence to obtain initial information, wherein the initial information consists of a plurality of initial case labels and probability values of the initial case labels; the first determination module is used for taking an initial case label with a probability value meeting a preset requirement as the first target case label; a second determination module to take the first target case label and a probability value of the first target case label as the target information.

Further, the apparatus further comprises: and the output unit is used for outputting a first preset prompt by the target prediction model after the initial information is obtained and if the probability values of the initial case labels do not accord with the preset requirements.

Further, the apparatus further comprises: the selecting unit is used for selecting a vector sequence corresponding to each first target case label from the first target vector sequences to obtain a plurality of third target vector sequences after target information corresponding to the target medical records is obtained according to the second target vector sequences; and the first prediction unit is used for predicting the time information of each first target case label through the target prediction model according to a second target vector sequence and the third target vector sequence to obtain a plurality of prediction times.

Further, the target prediction model is obtained by training through the following units: a determining unit, configured to determine a sample medical record; a second obtaining unit, configured to obtain a fourth target vector sequence corresponding to second data information of the sample medical record, where the second data information at least includes: second attribute information and second visit information; a third obtaining unit, configured to obtain historical medical data information, where the historical medical data information at least includes a medical record set; a statistics unit, configured to obtain, by statistics over the historical medical data information, the probability values with which preset case labels of various types occur in the medical record set of a first target group, where the first target group is the case group corresponding to the case in the sample medical record; a calculation unit, configured to calculate the historical visit time interval in the sample medical record according to the fourth target vector sequence; a construction unit, configured to construct a mask according to the number of fourth target vector sequences and to mask the fourth target vector sequence according to the mask to obtain a processed fourth target vector sequence; and a training unit, configured to take the processed fourth target vector sequence, the probability values with which the preset case labels of various types occur in the medical record set of the first target group, and the historical visit time interval as a training set, and to train an initial prediction model according to the training set to generate the target prediction model.

Further, the training unit comprises: the output module is used for inputting the training set into the initial prediction model to obtain the preset prediction probability values of various case labels and the time information of the various case labels; the first calculation module is used for performing loss calculation according to each predicted probability value and probability values of various preset case labels appearing in the medical record sets of the first target group to obtain a first predicted loss function; the second calculation module is used for performing loss calculation according to the time information of the various types of case labels and the historical visit time interval to obtain a second predicted loss function; a third determination module to take the first predicted loss function and the second predicted loss function as target loss functions; and the training module is used for training the initial prediction model according to the target loss function to obtain the target prediction model.

Further, the apparatus further comprises: and the second prediction unit is used for predicting case labels and probability values of the case labels of a second target population according to the target information corresponding to the target medical record after the target information corresponding to the target medical record is obtained according to the second target vector sequence, so as to obtain a prediction result, wherein the second target population is a case population corresponding to cases in the target medical record.

Further, the apparatus further comprises: the selecting unit is used for randomly selecting the first target case labels according to the probability value of each first target case label of the target medical record after target information corresponding to the target medical record is obtained according to the second target vector sequence to obtain second target case labels; a third prediction unit, configured to input a vector sequence corresponding to the second target case label obtained from the first target vector sequence into the target prediction model for time prediction, so as to obtain time information of the second target case label; and the updating unit is used for updating the second target vector sequence according to the second target case label and the time information of the second target case label to obtain an updated second target vector sequence, and continuing to process the updated second target vector sequence through the target prediction model until the target prediction model outputs a second preset prompt, or the time information output by the target prediction model is greater than a preset time limit, wherein the second preset prompt is used for representing that the case label currently predicted by the target prediction model does not exist in the target medical record.

According to another aspect of the embodiments of the present invention, there is further provided a computer-readable storage medium storing a program, where when the program runs, the apparatus on which the storage medium is located is controlled to execute the method for analyzing data described in any one of the above.

According to another aspect of the embodiments of the present invention, there is also provided a processor, configured to execute a program, where the program executes a method for analyzing data according to any one of the above methods.

In the embodiments of the present application, the following steps are adopted: acquiring first data information from a target medical record, wherein the first data information at least comprises first attribute information and first visit information; mapping the first attribute information and the first visit information through a feature embedding module in a target prediction model to obtain a first target vector sequence; performing time sequence coding on the first target vector sequence through a multi-head self-attention module in the target prediction model to obtain a second target vector sequence; and obtaining target information corresponding to the target medical record according to the second target vector sequence, wherein the target information at least comprises a first target case label and a probability value of the first target case label. This solves the technical problem in the related art that a neural network model predicting case labels from the characteristic information in a medical record has low accuracy. The first data information in the target medical record is mapped into a first target vector sequence by the feature embedding module of the target prediction model, the first target vector sequence is time-sequence coded by the multi-head self-attention module of the target prediction model to obtain a second target vector sequence, and case label prediction is finally performed using the second target vector sequence, so that the predicted probability values of the various case labels of the target medical record can be obtained accurately, thereby improving the accuracy of the predicted case labels.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a schematic diagram of a computer terminal according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for analyzing data provided according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a multi-head self-attention module according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of a feature embedding module data process according to an embodiment of the present application;

FIG. 5 is a schematic illustration of a predicted visit time provided according to an embodiment of the present application;

FIG. 6 is a schematic diagram of an alternative data analysis method according to an embodiment of the present application;

FIG. 7 is a flow chart of a method for analyzing data provided according to embodiment two of the present application;

FIG. 8 is a schematic diagram of an apparatus for analyzing data provided according to the third embodiment of the present application;

FIG. 9 is a schematic diagram of a computer terminal according to the fourth embodiment of the present application.

Detailed Description

In order to make the technical solutions of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be implemented in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

First, some of the nouns or terms appearing in the description of the embodiments of the present application are explained as follows:

ICD-10: the International Classification of Diseases (ICD) is a system that classifies diseases by rule according to certain of their characteristics and represents them by codes. ICD-10 is its tenth revision.

Feature embedding: feature embedding is used mostly in natural language processing algorithms and refers to the process of mapping each word or attribute into a vector of a specified dimension.

Multi-head self-attention model: a self-attention model is a deep encoding model for sequence features; the output for the next layer of the model is generated by calculating the relative contribution between the current position and all other positions. A multi-head self-attention model is obtained by running several self-attention models in parallel and splicing (concatenating) their outputs.

Example 1

According to an embodiment of the present application, a method for analyzing data is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from the one presented here.

The method provided by the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. FIG. 1 shows a hardware configuration block diagram of a computer terminal (or mobile device) for implementing the method for analyzing data. As shown in FIG. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown in FIG. 1 as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, the computer terminal may also include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in FIG. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.

It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single stand-alone processing module, or may be incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuitry acts as a kind of processor control (e.g. selection of variable resistance termination paths connected to the interface).

The memory 104 may be used to store software programs and modules of application software, such as the program instructions/data storage devices corresponding to the method for analyzing data in the embodiments of the present application. The processors 102 execute the software programs and modules stored in the memory 104 to perform various functional applications and data processing, that is, to implement the method for analyzing data described above. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processors 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can be connected to other network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.

The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).

Under the above operating environment, the present application provides a method for analyzing data as shown in FIG. 2. FIG. 2 is a flowchart of a method for analyzing data according to Embodiment 1 of the present application.

Step S201, acquiring first data information from a target medical record of a target object, where the first data information at least includes: first attribute information and first visit information.

Step S202, the first attribute information and the first visit information are mapped through a feature embedding module in the target prediction model, and a first target vector sequence is obtained.

Step S203, the first target vector sequence is subjected to time sequence coding through a multi-head self-attention module in the target prediction model, and a second target vector sequence is obtained.

Step S204, obtaining target information corresponding to the target medical record according to the second target vector sequence, wherein the target information at least comprises: a first target case label and a probability value for the first target case label.

Specifically, first data information is obtained from a target medical record for which case label prediction is required. The first data information may include first attribute information and first visit information of a case in the target medical record. The first attribute information may include one or more attributes of the case, such as gender, age, overall area, and medical insurance type, and the first visit information may include one or more of diagnosis, operation, cost, duration of visit, date of visit, and the like. A medical record may contain zero or more visits.
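For concreteness, the first data information might be held in a structure like the one below. The class and field names are hypothetical; only the kinds of information (attributes such as gender and age, and per-visit items such as diagnosis and date) come from the description.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Visit:
    """One visit; every item other than the date is optional."""
    visit_date: date
    diagnosis: Optional[str] = None      # e.g. an ICD-10 code
    operation: Optional[str] = None
    cost: Optional[float] = None
    duration_days: Optional[int] = None


@dataclass
class MedicalRecord:
    """Attribute information plus zero or more visits."""
    gender: Optional[str] = None
    age: Optional[int] = None
    area: Optional[str] = None
    insurance_type: Optional[str] = None
    visits: List[Visit] = field(default_factory=list)


record = MedicalRecord(gender="F", age=63)
record.visits.append(Visit(date(2022, 1, 5), diagnosis="I10"))
print(record)
```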

The first data information is then input into a target prediction model, which outputs target information of the target medical record. The target information at least includes a first target case label and a probability value of the first target case label, where the first target case label may refer to a certain type of disease, such as hypertension or heart disease.

Within the target prediction model, the first attribute information and the first visit information are mapped through a feature embedding module to obtain a first target vector sequence; the first target vector sequence is time-sequence coded through a multi-head self-attention module to obtain a second target vector sequence; and the target medical record is predicted through the second target vector sequence to obtain the target information.

Specifically, the first target vector sequence (the feature embedding sequence) is obtained by converting the various types of data through the feature embedding module in the target prediction model. The second target vector sequence is obtained by performing time sequence coding on the feature embedding sequence (namely the first target vector sequence) through the multi-head self-attention module in the target prediction model. As shown in fig. 3, the first target vector sequence is processed through three different linear layers, weights are then calculated through scaled dot-product attention, and finally the second target vector sequence is obtained through a concatenation layer and a linear layer. The target medical record is then predicted through the second target vector sequence, and the target information of the target medical record (namely the various case labels and their corresponding probability values) is output.
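As a non-limiting illustration, the processing described for fig. 3 may be sketched as follows. The sketch assumes the commonly used multi-head formulation (per-head query, key, and value projections, scaled dot-product attention, splicing of the heads, and a final linear projection). The function names, the dimensions, and the randomly initialized weights are hypothetical placeholders and are not specified by the present application.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Placeholder weights; a trained model would learn these values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def scaled_dot_product_attention(q, k, v):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_self_attention(x, num_heads, rng):
    """Three linear layers per head, attention, splicing of heads, output linear layer."""
    d_model = len(x[0])
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        out, _ = scaled_dot_product_attention(
            matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
        )
        head_outputs.append(out)
    # Splice the heads position by position along the feature axis.
    spliced = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    w_o = random_matrix(num_heads * d_head, d_model, rng)
    return matmul(spliced, w_o)


rng = random.Random(0)
# Four positions (e.g., X0..X3), each an 8-dimensional embedding (assumed sizes).
first_target_vector_sequence = random_matrix(4, 8, rng)
second_target_vector_sequence = multi_head_self_attention(
    first_target_vector_sequence, num_heads=2, rng=rng
)
```

In this sketch the output keeps one vector per input position, so each position of the second target vector sequence mixes information from all positions of the first target vector sequence.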

It should be noted that, when the target prediction model is trained, the probability values with which the various case labels occur in the case group corresponding to the case in the target medical record are used as true values; the target loss function is calculated from these probability values, and the target prediction model is then trained with the target loss function. A target prediction model obtained in this way fully takes into account the group attributes corresponding to the target medical record. Moreover, the model predicts the likelihood of disease onset from the second target vector sequence, which carries the time feature information derived from the diagnosis feature information in the target medical record, so the accuracy of disease prediction can be improved.

Converting various types of data into a first target vector sequence by a feature embedding module in a target prediction model comprises the following steps: randomly generating a first vector set of preset dimensions based on the diagnostic data information stored in the target storage area; mapping first visit information except the visit date into a first initial vector sequence according to the first vector set; encoding the date through absolute time to obtain a second vector set; mapping the visit date into a second initial vector sequence according to the second vector set; randomly generating a third vector set with preset dimensions based on the attribute information stored in the target storage area; mapping the first attribute information into a third initial vector sequence according to a third vector set; and splicing the first initial vector sequence, the second initial vector sequence and the third initial vector sequence to obtain a first target vector sequence.

Specifically, each item of data within the same type of data (for example, all diagnosis data or all attribute data, as shown in fig. 4) is mapped into a vector according to its embedding dictionary, and the resulting vectors are then spliced into the final embedding vector of the data. The embedding dictionaries are the first vector set, the second vector set, and the third vector set. They are generated as follows: the embedding dictionary for the visit date in the visit data is obtained by absolute time coding (the second vector set), while the embedding dictionaries for the other types of data are randomly generated, namely the first vector set of preset dimensions based on the diagnostic data information stored in the target storage area and the third vector set of preset dimensions based on the attribute information stored in the target storage area.

Specifically, each item of the diagnostic data information stored in the target storage area is converted into a vector of the preset dimension to obtain the first vector set, and the first visit information is then matched against the first vector set to obtain the corresponding first initial vector sequence. For example, if the vector corresponding to the disease hypertension in the first vector set is a first vector, then hypertension appearing in the first visit information is mapped to that first vector. Similarly, all dates are converted into vectors by absolute time coding to obtain the second vector set, and the visit date in the first visit information is matched against the second vector set to obtain the second initial vector sequence. The third vector set and the third initial vector sequence are obtained in the same manner, which is not described again here.
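By way of a hedged example, the embedding dictionaries and the splicing step may be sketched as below. The present application does not specify the form of the absolute time coding; the sinusoidal coding of the number of days since a fixed epoch used here is an assumption, as are the vocabulary entries, the preset dimension `DIM`, and all function names.

```python
import math
import random
from datetime import date

DIM = 4  # preset dimension (hypothetical value)
rng = random.Random(0)


def random_dictionary(vocabulary, dim, rng):
    """Randomly generate one vector of the preset dimension per known item."""
    return {item: [rng.gauss(0.0, 1.0) for _ in range(dim)] for item in vocabulary}


def encode_absolute_date(day, dim, epoch=date(2000, 1, 1)):
    """Assumed absolute time coding: sinusoids of the days elapsed since an epoch."""
    days = (day - epoch).days
    vector = []
    for i in range(dim // 2):
        frequency = 1.0 / (10000 ** (2 * i / dim))
        vector.extend([math.sin(days * frequency), math.cos(days * frequency)])
    return vector


# First vector set: known diagnoses. Third vector set: known attribute values.
diagnosis_dictionary = random_dictionary(
    ["hypertension", "heart disease", "diabetes"], DIM, rng
)
attribute_dictionary = random_dictionary(
    ["female", "male", "age 20-30", "age 30-40"], DIM, rng
)


def embed_attributes(attributes):
    """Look up each attribute value and splice the vectors (element X0)."""
    return sum((attribute_dictionary[a] for a in attributes), [])


def embed_visit(diagnosis, visit_date):
    """Splice the diagnosis vector with the coded visit date (elements X1..Xn)."""
    return diagnosis_dictionary[diagnosis] + encode_absolute_date(visit_date, DIM)


record = {
    "attributes": ["female", "age 20-30"],
    "visits": [
        {"diagnosis": "hypertension", "date": date(2022, 3, 1)},
        {"diagnosis": "heart disease", "date": date(2022, 6, 15)},
    ],
}

first_target_vector_sequence = [embed_attributes(record["attributes"])] + [
    embed_visit(v["diagnosis"], v["date"]) for v in record["visits"]
]
```

Here hypertension in the visit information is mapped to exactly the vector stored for hypertension in the dictionary, which mirrors the matching step described above.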

It should be noted that the diagnostic data information stored in the target storage area refers to data on currently known diseases, for example the diseases listed in ICD-10, the tenth revision of the International Classification of Diseases.

In an optional embodiment, in the generation process of the feature embedding sequence (i.e., the first target vector sequence) shown in fig. 4, each piece of data in the visit data and the attribute information is mapped to a corresponding vector, and the vectors are then spliced to obtain the first target vector sequence. Specifically, as shown in fig. 4, the attribute information corresponds to the basic information in fig. 4 and includes gender, age, pooling area, medical insurance type, and the like; the visit data mainly includes diagnosis, operation, cost, date, duration, and the like. Word segmentation is performed on the basic information and the visit data to obtain the corresponding words, each word is matched against the first vector set, the second vector set, and the third vector set to obtain the corresponding vector, and the obtained vectors are spliced to obtain the first target vector sequence.

By converting the data into the corresponding vector sequences, the model is facilitated to identify the characteristic information contained in the data, and the probability value of the disease is accurately predicted.

During prediction, the target prediction model may produce probability values for a large number of existing case labels, but some of these probability values are very low, so it is unnecessary to output all of them. Therefore, the data analysis method provided in the first embodiment of the present invention further includes: predicting through the second target vector sequence to obtain initial information, wherein the initial information consists of a plurality of initial case labels and probability values of the initial case labels; taking an initial case label whose probability value meets a preset requirement as a first target case label; and taking the first target case label and the probability value of the first target case label as the target information.

Specifically, according to the second target vector sequence, the target prediction model obtains all the kinds of case labels that it can predict, together with their probability values (i.e., the above-mentioned plurality of initial case labels and the probability values of the initial case labels). These case labels and probability values are then screened to obtain the target information. For example, the case labels with the 10 highest probability values are taken as first target case labels, or the case labels whose probability value exceeds 30% are taken as first target case labels (corresponding to taking an initial case label whose probability value meets the preset requirement as a first target case label). If none of the initial case labels meets the requirement, a first preset prompt is output, indicating that the probability that any type of case label (any type of disease) appears in the target medical record is very low (e.g., less than 0.1%).
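The screening described above can be illustrated with a short sketch. The function name, the return structure, and the wording of the prompt are hypothetical; the top-k and threshold rules correspond to the two examples of a preset requirement given above.

```python
FIRST_PRESET_PROMPT = "probability of every case label is very low"  # hypothetical text


def screen_case_labels(initial_info, top_k=None, min_probability=None):
    """Keep the initial case labels whose probability meets the preset requirement.

    initial_info maps each initial case label to its predicted probability.
    Returns the kept (label, probability) pairs, or the first preset prompt
    when no label meets the requirement.
    """
    ranked = sorted(initial_info.items(), key=lambda item: item[1], reverse=True)
    if min_probability is not None:
        ranked = [(label, p) for label, p in ranked if p > min_probability]
    if top_k is not None:
        ranked = ranked[:top_k]
    if not ranked:
        return {"target_info": [], "prompt": FIRST_PRESET_PROMPT}
    return {"target_info": ranked, "prompt": None}


initial_info = {"hypertension": 0.62, "heart disease": 0.35, "diabetes": 0.08}
above_30_percent = screen_case_labels(initial_info, min_probability=0.30)
top_one = screen_case_labels(initial_info, top_k=1)
nothing_kept = screen_case_labels(initial_info, min_probability=0.90)
```

With these example values, the 30% threshold keeps hypertension and heart disease, the top-1 rule keeps only hypertension, and the 90% threshold keeps nothing and therefore yields the prompt.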

By screening all kinds of case labels and corresponding probability values, the output efficiency of the target prediction model can be improved, and the user experience is improved.

The target prediction model can also be used for predicting the time information of various future case labels in the target medical record, and mainly comprises the following steps: selecting a vector sequence corresponding to each first target case label from the first target vector sequences to obtain a plurality of third target vector sequences; and predicting the time information of each first target case label through a target prediction model according to the second target vector sequence and the third target vector sequence to obtain a plurality of prediction times.

Specifically, the time at which a case label will appear in the target medical record in the future can be predicted through the target prediction model (i.e., a plurality of prediction times can be obtained through the target prediction model). For example, the target prediction model can predict that hypertension will appear in the target medical record at a certain time in the future; in plain terms, the target object corresponding to the target medical record is expected to visit a hospital for hypertension at that time.

The time prediction is to predict the time information of each target case label, and select a vector sequence corresponding to each first target case label from the first target vector sequences to obtain a corresponding third target vector sequence (which may also be called a feature embedding sequence). And obtaining the time information of the first target case label by using the third target vector sequence and the second target vector sequence with the time sequence characteristics.

As shown in fig. 5, a first target vector sequence is obtained by a feature embedding module, a second target vector sequence is obtained by a multi-head self-attention module, then a vector sequence of a first target case label object to be predicted (i.e. a third target vector sequence) is selected from the first target vector sequence, the second target vector sequence and the third target vector sequence are spliced, and finally the visit time of the first target case label is predicted by using the spliced sequence, so as to obtain the prediction time.

The target prediction model thus predicts the time at which a case label will occur, so that the case in the target medical record can attend to the health problem in time. The method can be used in scenarios such as health care, medicine, and commercial insurance that need to analyze the occurrence patterns of cases in a group.

The target prediction model is obtained by training through the following steps: determining a sample medical record; acquiring a fourth target vector sequence corresponding to second data information of the sample medical record, wherein the second data information at least comprises: second attribute information and second visit information; acquiring historical medical data information, wherein the historical medical data information at least comprises a medical record set; statistically obtaining, from the historical medical data information, the probability values with which preset various types of case labels occur in the medical record set of a first target group, wherein the first target group is the case group corresponding to the case in the sample medical record; calculating the historical visit time interval in the sample medical record according to the fourth target vector sequence; constructing a mask according to the number of vectors in the fourth target vector sequence, and masking the fourth target vector sequence according to the mask to obtain a processed fourth target vector sequence; and taking the processed fourth target vector sequence, the probability values with which the preset various types of case labels occur in the medical record set of the first target group, and the historical visit time interval as a training set, and training the initial prediction model according to the training set to generate the target prediction model.

Training the initial prediction model according to the training set to generate the target prediction model includes: inputting the training set into the initial prediction model to obtain predicted probability values of the preset various types of case labels and time information of the various types of case labels; performing loss calculation according to each predicted probability value and the probability values with which the preset various types of case labels occur in the medical record set of the first target group to obtain a first prediction loss function; performing loss calculation according to the time information of the various types of case labels and the historical visit time interval to obtain a second prediction loss function; taking the first prediction loss function and the second prediction loss function as the target loss function; and training the initial prediction model according to the target loss function to obtain the target prediction model.

Specifically, when the target prediction model is trained, the probability values with which the various case labels appear in the medical record set of the group corresponding to the case in the medical record are used as true values. In plain terms, after a sample medical record is selected, the corresponding target group is selected according to the various types of case labels appearing in the sample medical record or the various types of attribute information in the sample medical record. For example, if the case in the sample medical record is female, aged 20 to 30, and covered by employee medical insurance, the population with the same attributes can be obtained from this information. Alternatively, if the sample medical record has a case label such as hypertension or heart disease, the group having that case label can be found from the historical medical data information, and that group can serve as the first target group. Case labels refer to various types of diseases, such as hypertension and heart disease; the case set refers to a set of various diseases, and case labels such as hypertension and heart disease can together form the case set.
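As an illustration of how such group statistics might be computed, the following sketch selects the group by attribute equality and takes, as the probability of a case label, the fraction of the group's medical records that contain the label. Both choices, as well as the record layout and the function name, are assumptions made for the example; the present application does not prescribe a particular counting rule.

```python
from collections import Counter


def group_label_probabilities(historical_records, group_attributes, preset_labels):
    """Fraction of medical records in the matching group that contain each label."""
    group = [
        record
        for record in historical_records
        if group_attributes.items() <= record["attributes"].items()
    ]
    if not group:
        return {label: 0.0 for label in preset_labels}
    # Count each label at most once per medical record.
    counts = Counter(label for record in group for label in set(record["labels"]))
    return {label: counts[label] / len(group) for label in preset_labels}


historical_records = [
    {"attributes": {"gender": "female", "age_band": "20-30"}, "labels": ["hypertension"]},
    {"attributes": {"gender": "female", "age_band": "20-30"},
     "labels": ["hypertension", "heart disease"]},
    {"attributes": {"gender": "female", "age_band": "20-30"}, "labels": []},
    {"attributes": {"gender": "male", "age_band": "20-30"}, "labels": ["heart disease"]},
]

probability_p = group_label_probabilities(
    historical_records,
    {"gender": "female", "age_band": "20-30"},
    ["hypertension", "heart disease"],
)
```

In this toy data the group contains three medical records, two of which carry hypertension and one of which carries heart disease.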

Specifically, the training set is generated using two parts, dynamic masking and probability generation. For any sample medical record, let the feature embedding sequence be X0, X1, X2, ..., Xn (corresponding to the fourth target vector sequence described above), where X0 is the attribute information in the sample medical record and X1 to Xn are the visit data information in the sample medical record. For each k = 0, 1, ..., n-1, where k is a natural number, the training set is constructed as follows:

a. If k = 0, that is, if the feature embedding sequence of the sample medical record includes only attribute information, the probability P that a group having the same attributes as the case in the sample medical record (i.e., the first target group described above) has the preset various types of case labels is counted from the historical data.

b. If k >= 1, the probability P that the group consistent with the current disease sequence X_{k-2}, X_{k-1}, X_k (i.e., the first target group) has the preset various types of case labels is counted from the historical data.

c. The time interval T between the time at which the case label of X_{k+1} appears and the last visit time (i.e., the historical visit time interval described above) is calculated.

d. A mask [M0, M1, M2, ..., Mn] with one element per vector of the fourth target vector sequence is constructed, where each M_i takes the value 0 or 1. One construction rule is to first set [M0, M1, ..., M_k] to 1 and [M_{k+1}, ..., M_n] to 0, and then randomly flip the values of M_i, turning an original 1 into 0 and an original 0 into 1, so that the training set varies. The fourth target vector sequence is then masked according to the mask: a position whose mask value is 1 is left unmasked, and a position whose mask value is 0 is masked.

It should be noted that randomly flipping the value of M_i means that the positions to be masked change during training, so that the same positions are not always masked.
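The mask construction in item d can be sketched as follows. The flip probability and the choice to mask a position by replacing its vector with zeros are assumptions made for the example; the present application only states that positions with mask value 0 are masked.

```python
import random


def build_mask(n, k, flip_probability, rng):
    """Build [M0..Mn]: positions 0..k start at 1, positions k+1..n start at 0,
    then each value is flipped independently with the given probability."""
    mask = [1 if i <= k else 0 for i in range(n + 1)]
    return [1 - m if rng.random() < flip_probability else m for m in mask]


def apply_mask(sequence, mask):
    """Keep vectors whose mask value is 1; zero out vectors whose mask value is 0."""
    return [
        list(vector) if m == 1 else [0.0] * len(vector)
        for vector, m in zip(sequence, mask)
    ]


rng = random.Random(0)
# Toy fourth target vector sequence X0..X4 (n = 4), two dimensions per vector.
fourth_target_vector_sequence = [[float(i), float(i) + 0.5] for i in range(5)]

base_mask = build_mask(n=4, k=2, flip_probability=0.0, rng=rng)
varied_mask = build_mask(n=4, k=2, flip_probability=0.2, rng=rng)
processed_sequence = apply_mask(fourth_target_vector_sequence, base_mask)
```

Calling `build_mask` again for each training pass yields a different mask, which is one way to realize the variation of the training set described above.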

When the target prediction model is trained, the weighted average of the prediction loss of the probability P and the prediction loss of the time interval T is used as the final loss function. That is, the training set is input into the initial prediction model to obtain the predicted probability values of the preset various types of diseases and the predicted visit times of the various types of diseases, where the preset various types of diseases refer to currently known diseases. The first prediction loss function is calculated from the predicted probability values and the probability P, the second prediction loss function is calculated from the predicted visit time and the time interval T, and the weighted average of the first prediction loss function and the second prediction loss function gives the target loss function. Finally, the model is trained with the target loss function to obtain the target prediction model.
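A minimal sketch of the weighted target loss is given below. The present application does not state the form of the two losses; the cross-entropy against the group probability P (treated as a soft target), the absolute error on the time interval T in days, and the weight value are all assumptions chosen for illustration.

```python
import math


def probability_loss(predicted, group_probability, eps=1e-9):
    """Assumed first loss: mean binary cross-entropy against the soft targets P."""
    total = 0.0
    for label, p_true in group_probability.items():
        p_pred = min(max(predicted[label], eps), 1.0 - eps)
        total -= p_true * math.log(p_pred) + (1.0 - p_true) * math.log(1.0 - p_pred)
    return total / len(group_probability)


def time_loss(predicted_interval_days, true_interval_days):
    """Assumed second loss: absolute error between predicted and true interval T."""
    return abs(predicted_interval_days - true_interval_days)


def target_loss(first_loss, second_loss, weight=0.5):
    """Weighted average of the two prediction losses."""
    return weight * first_loss + (1.0 - weight) * second_loss


predicted_probabilities = {"hypertension": 0.6, "heart disease": 0.2}
group_probability_p = {"hypertension": 0.5, "heart disease": 0.25}

first_loss = probability_loss(predicted_probabilities, group_probability_p)
second_loss = time_loss(predicted_interval_days=40.0, true_interval_days=30.0)
loss = target_loss(first_loss, second_loss, weight=0.5)
```

Because the two losses are on different scales in this sketch, the weight would in practice need to be tuned or the time loss normalized; that choice is left open here.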

The model is trained according to the probability values of various preset case labels appearing in the case history sets of the target group to obtain a target prediction model, so that the accuracy of case label prediction is improved, and the obtained prediction probability value can also be used for evaluating the probability values of various case labels appearing in the case history sets of the target group, so that a bridge from micro to macro is obtained. The group is evaluated by the individual, and the running condition and the development trend of the medical insurance fund are effectively monitored and predicted according to the morbidity of the group.

In the method for analyzing data provided in the first embodiment of the present invention, the evaluating a population by an individual includes: and predicting case labels of a second target group and probability values of the case labels through target information corresponding to the target medical records to obtain a prediction result, wherein the second target group is a case group corresponding to cases in the target medical records.

Specifically, the model is trained according to the probability values of various preset case labels appearing in the case history set of the target group to obtain the target prediction model, so that the obtained target information of the target case history contains the incidence probability information of the group to which the target case history belongs, so that the case labels of the case group corresponding to the cases in the target case history and the probability values of the case labels can be predicted according to the target information of the target case history to obtain a prediction result, and the operation condition and the development trend of the medical insurance fund can be effectively monitored and predicted according to the prediction result.

The target prediction model provided by the invention generates, under fine-grained conditions, the onset probability of the population that matches an individual's historical characteristics, thereby building a bridge from micro to macro. The target prediction model can be used for medical insurance data analysis, and can also be used in scenarios such as health care, medicine, and commercial insurance that need to analyze the disease incidence patterns of segmented populations.

The target prediction model provided by the invention can also predict and obtain an annual visit time sequence (annual time information sequence) of the target medical record, and specifically comprises the following contents: randomly selecting the first target case label according to the probability value of each first target case label of the target medical record to obtain a second target case label; inputting a vector sequence corresponding to a second target case label obtained from the first target vector sequence into a target prediction model for time prediction to obtain time information of the second target case label; and updating the second target vector sequence according to the second target case label and the time information of the second target case label to obtain an updated second target vector sequence, and continuously executing the processing of the updated second target vector sequence through the target prediction model until the target prediction model outputs a second preset prompt, or the time information output by the target prediction model is greater than a preset time limit, wherein the second preset prompt is used for representing that the case label currently predicted by the target prediction model does not exist in the target case history.

Specifically, the trained target prediction model is run on the embedded feature sequence X0, X1, X2, ..., Xn of a target medical record to obtain the predicted probabilities of all case labels (the probability values corresponding to the plurality of first target case labels). One case label is randomly selected according to the predicted probabilities, the feature vector (embedding) of that case label is input into the target prediction model to predict the time information of the case label, and the embedded feature sequence is updated to X0, X1, X2, ..., X_{n+1} according to the selected case label and the corresponding predicted time information. This process is repeated until the next case label output is "healthy" (i.e., the second preset prompt mentioned above) or the visit time exceeds the preset time limit (e.g., one year). Through these steps, the annual visit time series of the target medical record can be predicted.
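The iterative procedure above may be sketched as follows. The two callables stand in for the trained target prediction model and return fixed values purely for demonstration; accumulating the predicted intervals into elapsed days, the step limit, and all names are assumptions made for the example.

```python
import random

HEALTHY = "healthy"  # stands for the second preset prompt


def generate_visit_series(sequence, predict_probabilities, predict_interval_days,
                          time_limit_days, rng, max_steps=100):
    """Sample a label, predict its time, extend the sequence, and repeat until
    the label is "healthy" or the preset time limit is exceeded."""
    history = list(sequence)
    series = []
    elapsed_days = 0
    for _ in range(max_steps):
        probabilities = predict_probabilities(history)
        labels = list(probabilities)
        weights = [probabilities[label] for label in labels]
        label = rng.choices(labels, weights=weights, k=1)[0]
        if label == HEALTHY:
            break
        elapsed_days += predict_interval_days(history, label)
        if elapsed_days > time_limit_days:
            break
        series.append((label, elapsed_days))
        history.append((label, elapsed_days))  # becomes X_{n+1}
    return series


def stub_probabilities(history):
    """Placeholder for the model's case label probabilities."""
    return {"hypertension": 0.5, "heart disease": 0.3, HEALTHY: 0.2}


def stub_interval_days(history, label):
    """Placeholder for the model's predicted interval since the last visit."""
    return 60 if label == "hypertension" else 90


annual_series = generate_visit_series(
    sequence=["X0"],
    predict_probabilities=stub_probabilities,
    predict_interval_days=stub_interval_days,
    time_limit_days=365,
    rng=random.Random(0),
)
```

Each entry of `annual_series` pairs a sampled case label with the day, counted from the start of the forecast, on which the visit is predicted to occur.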

In an alternative embodiment, the data analysis may be implemented as shown in the schematic diagram of fig. 6. As shown in fig. 6, corresponding sample medical records are selected to train the model and obtain the target prediction model. A target vector sequence of the medical record to be predicted is obtained through the feature embedding and time sequence coding in the target prediction model, diseases and visit times are predicted from the target vector sequence, and the annual visit time series of the target medical record is obtained through model inference with the target prediction model.

The target prediction model mainly has the following technical effects. In the training stage, the model learns the relationship between the individual and the group, which gives the model fine granularity. The occurrence probability of a case label can be analyzed from an individual's historical information. In addition, through the built-in time feature embedding and the disease onset time prediction, the time interval until a case label occurs in the future can be predicted at a granularity of days.

In the data analysis method provided in the first embodiment of the present invention, first data information is obtained from a target medical record, where the first data information at least includes: first attribute information and first visit information; the first attribute information and the first visit information are mapped through a feature embedding module in the target prediction model to obtain a first target vector sequence; the first target vector sequence is time-sequence coded through a multi-head self-attention module in the target prediction model to obtain a second target vector sequence; and target information corresponding to the target medical record is obtained according to the second target vector sequence, where the target information at least includes: a first target case label and a probability value of the first target case label. This solves the technical problem in the related art that a neural network model predicting case labels from the feature information in a medical record has low accuracy. The first data information in the target medical record is mapped into the first target vector sequence by the feature embedding module in the target prediction model, the first target vector sequence is time-sequence coded by the multi-head self-attention module of the target prediction model to obtain the second target vector sequence, and case label prediction is finally performed using the second target vector sequence.

It should be noted that for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.

Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method according to the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method of the embodiments of the present application.

Example 2

According to an embodiment of the present application, there is also provided a method for analyzing data, as shown in fig. 7, the method including:

Step S701, acquiring first data information, which is sent by a client and acquired from a target medical record of a target object, where the first data information at least includes: first attribute information and first visit information.

Step S702, mapping the first attribute information and the first visit information in a cloud server through a feature embedding module in a target prediction model to obtain a first target vector sequence; carrying out time sequence coding on the first target vector sequence through a multi-head self-attention module in a target prediction model to obtain a second target vector sequence; and obtaining target information corresponding to the target medical record according to the second target vector sequence, wherein the target information at least comprises: a first target case label and a probability value for the first target case label.

Step S703, returning the target information to the client.

Specifically, first data information of the target medical record is sent to the cloud server through the client, and the first data information is input into the target prediction model in the cloud server to obtain target information of the target medical record.

Predicting the case label of the target medical record through the cloud server improves the efficiency of the data analysis method and reduces the storage pressure on the local terminal.

In the cloud server, the specific method for analyzing the data is the same as that in the first embodiment, and is not described herein again.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art will recognize that the embodiments described in this specification are preferred embodiments and that acts or modules referred to are not necessarily required for this application.

Example 3

According to an embodiment of the present application, there is also provided a data analysis apparatus for implementing the above data analysis method. As shown in fig. 8, the apparatus includes: a first obtaining unit 801, a mapping unit 802, an encoding unit 803, and an analysis unit 804.

A first obtaining unit 801, configured to obtain first data information from a target medical record of a target object, where the first data information at least includes: first attribute information and first visit information;

the mapping unit 802 is configured to perform mapping processing on the first attribute information and the first visit information through a feature embedding module in the target prediction model to obtain a first target vector sequence;

an encoding unit 803, configured to perform time-series encoding on the first target vector sequence through a multi-head self-attention module in the target prediction model to obtain a second target vector sequence;

an analyzing unit 804, configured to obtain target information corresponding to the target medical record according to the second target vector sequence, where the target information at least includes: a first target case label and a probability value of the first target case label.

In the data analysis apparatus provided in the third embodiment of the present invention, the first obtaining unit 801 obtains first data information from a target medical record, where the first data information at least includes: first attribute information and first visit information; the mapping unit 802 performs mapping processing on the first attribute information and the first visit information through a feature embedding module in the target prediction model to obtain a first target vector sequence; the encoding unit 803 performs time-sequence coding on the first target vector sequence through a multi-head self-attention module in the target prediction model to obtain a second target vector sequence; and the analysis unit 804 obtains target information corresponding to the target medical record according to the second target vector sequence, where the target information at least includes: a first target case label and a probability value of the first target case label. This solves the technical problem in the related art that a neural network model predicting case labels from the feature information in a medical record has low accuracy. The first data information in the target medical record is mapped into the first target vector sequence by the feature embedding module in the target prediction model, the first target vector sequence is time-sequence coded by the multi-head self-attention module of the target prediction model to obtain the second target vector sequence, and case label prediction is finally performed using the second target vector sequence, so that the predicted probability values of the various case labels of the target medical record can be obtained accurately, thereby improving the accuracy of the predicted probability values.

Optionally, in the data analysis apparatus provided in the third embodiment of the present invention, the mapping unit includes: a first generation module, configured to randomly generate a first vector set of a preset dimension based on the diagnostic data information stored in the target storage area; a first mapping module, configured to map the first visit information other than the visit date in the target medical record into a first initial vector sequence according to the first vector set; an encoding module, configured to encode dates by absolute time to obtain a second vector set; a second mapping module, configured to map the visit date into a second initial vector sequence according to the second vector set; a second generation module, configured to randomly generate a third vector set of a preset dimension based on the attribute information stored in the target storage area; a third mapping module, configured to map the first attribute information into a third initial vector sequence according to the third vector set; and a splicing module, configured to splice the first initial vector sequence, the second initial vector sequence and the third initial vector sequence to obtain the first target vector sequence.
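
The mapping performed by these modules can be sketched as follows. This is a minimal illustration under stated assumptions: the sinusoidal form of the absolute-time encoding, the reference date, the averaging of attribute vectors, the per-visit concatenation, and all names and sample codes are hypothetical and are not taken from the application.

```python
import math
import random
from datetime import date


def random_table(keys, dim, rng):
    """Randomly generate one vector of the preset dimension per key."""
    return {key: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for key in keys}


def encode_absolute_time(day, dim, epoch=date(2000, 1, 1)):
    """Assumed scheme: sinusoidal encoding of the number of days since a reference date."""
    t = (day - epoch).days
    vec = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        vec.extend([math.sin(t * freq), math.cos(t * freq)])
    return vec[:dim]


def embed_record(visits, attributes, diag_table, attr_table, dim):
    """Splice diagnosis, date and attribute vectors into one vector per visit."""
    # Assumption: several attribute vectors are averaged into a single vector.
    attr_vec = [sum(col) / len(attributes)
                for col in zip(*(attr_table[a] for a in attributes))]
    sequence = []
    for visit_date, diagnosis in visits:
        sequence.append(diag_table[diagnosis]
                        + encode_absolute_time(visit_date, dim)
                        + attr_vec)
    return sequence


rng = random.Random(0)
DIM = 4  # hypothetical preset dimension
diag_table = random_table(["I10", "E11"], DIM, rng)           # first vector set
attr_table = random_table(["female", "age_60_69"], DIM, rng)  # third vector set
visits = [(date(2021, 3, 1), "I10"), (date(2021, 9, 15), "E11")]
first_target_sequence = embed_record(visits, ["female", "age_60_69"],
                                     diag_table, attr_table, DIM)
```

Each visit yields one vector of length 3 × DIM, and the attribute part is identical across visits because the attribute information describes the target object rather than a single visit.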

Optionally, in the data analysis apparatus provided in the third embodiment of the present invention, the analysis unit includes: the prediction module is used for predicting through a second target vector sequence to obtain initial information, wherein the initial information consists of a plurality of initial case labels and probability values of the initial case labels; the first determining module is used for taking an initial case label with the probability value meeting the preset requirement as a first target case label; a second determination module to take the first target case label and the probability value of the first target case label as target information.

Optionally, in the data analysis apparatus provided in the third embodiment of the present invention, the apparatus further includes: an output unit, configured to cause the target prediction model to output a first preset prompt if, after the initial information is obtained, none of the probability values of the initial case labels meets the preset requirement.
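
By way of example only, the selection of first target case labels and the fallback to the first preset prompt might look like the sketch below. The threshold of 0.5, the prompt text, and the function name are assumptions; the application leaves the "preset requirement" unspecified.

```python
FIRST_PRESET_PROMPT = "no case label meets the preset requirement"  # hypothetical text


def select_target_labels(initial_info, threshold=0.5):
    """Keep the initial case labels whose probability reaches the assumed threshold.

    Returns (target_info, prompt): exactly one of the two is None.
    """
    target_info = {label: p for label, p in initial_info.items() if p >= threshold}
    if not target_info:
        return None, FIRST_PRESET_PROMPT
    return target_info, None


initial_info = {"hypertension": 0.82, "diabetes": 0.41, "asthma": 0.07}
target_info, prompt = select_target_labels(initial_info)
no_info, fallback = select_target_labels({"asthma": 0.07})
```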

Optionally, in the apparatus for analyzing data provided in the third embodiment of the present invention, the apparatus further includes: the selecting unit is used for selecting a vector sequence corresponding to each first target case label from the first target vector sequences after target information corresponding to the target medical records is obtained according to the second target vector sequences to obtain a plurality of third target vector sequences; and the first prediction unit is used for predicting the time information of each first target case label through the target prediction model according to the second target vector sequence and the third target vector sequence to obtain a plurality of prediction times.

Optionally, in the data analysis apparatus provided in the third embodiment of the present invention, the target prediction model is obtained by training through the following units: a determining unit, configured to determine a sample medical record; a second obtaining unit, configured to obtain a fourth target vector sequence corresponding to second data information of the sample medical record, where the second data information at least includes: second attribute information and second visit information; a third obtaining unit, configured to obtain historical medical data information, where the historical medical data information at least includes a medical record set; the probability value with which each preset type of case label occurs in the medical record set of a first target group is obtained by statistics from the historical medical data information, where the first target group is the case group corresponding to the case in the sample medical record; a calculation unit, configured to calculate the historical visit time intervals in the sample medical record according to the fourth target vector sequence; a construction unit, configured to construct a mask according to the number of fourth target vector sequences and to mask the fourth target vector sequence according to the mask to obtain a processed fourth target vector sequence; and a training unit, configured to take the processed fourth target vector sequence, the probability values with which the preset types of case labels occur in the medical record set of the first target group, and the historical visit time intervals as a training set, and to train an initial prediction model according to the training set to generate the target prediction model.
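
The preparation of the training set can be illustrated with the following sketch. It assumes, purely for illustration, that the mask hides a random fraction of positions by replacing them with zero vectors, that visit intervals are measured in days, and that the group probability of a label is the share of the group's medical records containing it; none of these choices is stated in the application.

```python
import random
from datetime import date


def build_mask(length, mask_ratio, rng):
    """Return a 0/1 mask over `length` positions; 1 marks a masked position."""
    n_masked = max(1, round(length * mask_ratio))
    masked = set(rng.sample(range(length), n_masked))
    return [1 if i in masked else 0 for i in range(length)]


def apply_mask(vectors, mask):
    """Replace masked vectors with zero vectors (an assumed masking scheme)."""
    return [[0.0] * len(v) if m else list(v) for v, m in zip(vectors, mask)]


def visit_intervals(dates):
    """Days elapsed between consecutive visits."""
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]


def label_frequencies(group_records):
    """Share of the group's medical records in which each case label occurs."""
    counts = {}
    for labels in group_records:
        for label in set(labels):
            counts[label] = counts.get(label, 0) + 1
    return {label: n / len(group_records) for label, n in counts.items()}


rng = random.Random(0)
fourth_sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mask = build_mask(len(fourth_sequence), mask_ratio=0.25, rng=rng)
processed_sequence = apply_mask(fourth_sequence, mask)
intervals = visit_intervals([date(2021, 1, 1), date(2021, 1, 11), date(2021, 2, 10)])
group_probs = label_frequencies([{"hypertension", "diabetes"}, {"hypertension"}])
```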

Optionally, in the data analysis apparatus provided in the third embodiment of the present invention, the training unit includes: an output module, configured to input the training set into the initial prediction model to obtain predicted probability values of the preset types of case labels and time information of those case labels; a first calculation module, configured to perform loss calculation according to each predicted probability value and the probability values with which the preset types of case labels occur in the medical record set of the first target group to obtain a first prediction loss function; a second calculation module, configured to perform loss calculation according to the time information of the case labels and the historical visit time intervals to obtain a second prediction loss function; a third determining module, configured to take the first prediction loss function and the second prediction loss function as the target loss function; and a training module, configured to train the initial prediction model according to the target loss function to obtain the target prediction model.
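
One possible form of the two prediction losses and their combination is sketched below. The choice of a binary cross-entropy against the group probabilities, a mean squared error on the time information, and a weighted sum as the target loss are assumptions for illustration; the application does not name specific loss formulas.

```python
import math


def soft_cross_entropy(predicted, target, eps=1e-12):
    """First loss (assumed form): binary cross-entropy against group probabilities."""
    terms = [t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
             for p, t in zip(predicted, target)]
    return -sum(terms) / len(terms)


def mean_squared_error(predicted, actual):
    """Second loss (assumed form): squared error between predicted and observed intervals."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


def target_loss(pred_probs, group_probs, pred_times, intervals, time_weight=1.0):
    """Combine both losses; the weighted sum and `time_weight` are hypothetical."""
    return (soft_cross_entropy(pred_probs, group_probs)
            + time_weight * mean_squared_error(pred_times, intervals))


group_probs = [0.8, 0.1]       # label frequencies in the first target group
intervals = [30.0, 90.0]       # historical visit time intervals, in days
close = target_loss([0.75, 0.15], group_probs, [28.0, 95.0], intervals)
far = target_loss([0.20, 0.90], group_probs, [5.0, 300.0], intervals)
```

Predictions that are closer to the group statistics and to the observed intervals give a smaller combined loss, which is the signal used to update the initial prediction model.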

Optionally, in the data analysis apparatus provided in the third embodiment of the present invention, the apparatus further includes: and the second prediction unit is used for predicting the case labels of a second target group and the probability values of the case labels according to the target information corresponding to the target medical record after obtaining the target information corresponding to the target medical record according to the second target vector sequence, so as to obtain a prediction result, wherein the second target group is a case group corresponding to the cases in the target medical record.

Optionally, in the data analysis apparatus provided in the third embodiment of the present invention, the apparatus further includes: the selecting unit is used for randomly selecting the first target case labels according to the probability value of each first target case label of the target medical record after obtaining target information corresponding to the target medical record according to the second target vector sequence to obtain second target case labels; a third prediction unit, configured to input a vector sequence corresponding to a second target case label obtained from the first target vector sequence into the target prediction model for time prediction, so as to obtain time information of the second target case label; and the updating unit is used for updating the second target vector sequence according to the second target case label and the time information of the second target case label to obtain an updated second target vector sequence, and continuously executing the processing of the updated second target vector sequence through the target prediction model until the target prediction model outputs a second preset prompt, or the time information output by the target prediction model is greater than a preset time limit, wherein the second preset prompt is used for representing that the case label currently predicted by the target prediction model does not exist in the target case history.
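
The iterative procedure carried out by the selecting unit, the third prediction unit and the updating unit can be sketched as a simple loop. In this illustration the two model calls are replaced by hypothetical stand-in functions, a return value of `None` stands in for the second preset prompt, and the predicted time is treated as an interval that accumulates toward the preset time limit; all of these are assumptions.

```python
import random


def simulate_future_visits(sequence, predict_labels, predict_interval,
                           time_limit, rng, max_steps=100):
    """Sample a label, predict its time, update the sequence, and repeat."""
    sequence = list(sequence)
    elapsed = 0.0
    predicted = []
    for _ in range(max_steps):
        probabilities = predict_labels(sequence)
        if probabilities is None:  # stands in for the second preset prompt
            break
        labels = list(probabilities)
        # Random selection weighted by the probability value of each label.
        label = rng.choices(labels, weights=[probabilities[l] for l in labels], k=1)[0]
        elapsed += predict_interval(sequence, label)
        if elapsed > time_limit:   # preset time limit exceeded
            break
        predicted.append((label, elapsed))
        sequence.append((label, elapsed))
    return predicted


def toy_labels(sequence):
    """Hypothetical model stub: stops predicting once the sequence has 5 entries."""
    return {"hypertension": 0.7, "diabetes": 0.3} if len(sequence) < 5 else None


def toy_interval(sequence, label):
    """Hypothetical model stub: always predicts a 30-day interval."""
    return 30.0


history = [("visit", 0.0), ("visit", 0.0)]
future = simulate_future_visits(history, toy_labels, toy_interval,
                                time_limit=365.0, rng=random.Random(0))
short = simulate_future_visits(history, toy_labels, toy_interval,
                               time_limit=50.0, rng=random.Random(0))
```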

It should be noted here that the first obtaining unit 801, the mapping unit 802, the encoding unit 803, and the analyzing unit 804 correspond to steps S201 to S204 in the first embodiment; the four units share the implementation examples and application scenarios of the corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the above units, as a part of the apparatus, may operate in the computer terminal 10 provided in the first embodiment.

It should be noted that the preferred embodiments described in the above examples of the present application are the same as the schemes, application scenarios, and implementation procedures provided in example 1, but are not limited to the schemes provided in example 1.

Example 4

The embodiment of the application can provide a computer terminal, and the computer terminal can be any one computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may be replaced with a terminal device such as a mobile terminal.

Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.

In this embodiment, the computer terminal described above may execute the program code of the following steps in the data analysis method: acquiring first data information from a target medical record of a target object, wherein the first data information at least comprises: first attribute information and first visit information; mapping the first attribute information and the first visit information through a feature embedding module in the target prediction model to obtain a first target vector sequence; performing time sequence coding on the first target vector sequence through a multi-head self-attention module in a target prediction model to obtain a second target vector sequence; and obtaining target information corresponding to the target medical record according to the second target vector sequence, wherein the target information at least comprises: a first target case label and a probability value for the first target case label.

The computer terminal described above may further execute the program code of the following steps in the data analysis method: mapping the first attribute information and the first visit information through the feature embedding module in the target prediction model to obtain the first target vector sequence comprises: randomly generating a first vector set of a preset dimension based on the diagnostic data information stored in the target storage area; mapping the first visit information other than the visit date in the target medical record into a first initial vector sequence according to the first vector set; encoding dates by absolute time to obtain a second vector set; mapping the visit date into a second initial vector sequence according to the second vector set; randomly generating a third vector set of a preset dimension based on the attribute information stored in the target storage area; mapping the first attribute information into a third initial vector sequence according to the third vector set; and splicing the first initial vector sequence, the second initial vector sequence and the third initial vector sequence to obtain the first target vector sequence.

The computer terminal described above may also execute the program code of the following steps in the data analysis method: obtaining target information corresponding to the target medical record according to the second target vector sequence comprises: predicting through a second target vector sequence to obtain initial information, wherein the initial information consists of a plurality of initial case labels and probability values of the initial case labels; taking an initial case label with a probability value meeting a preset requirement as a first target case label; the first target case label and the probability value of the first target case label are taken as target information.

The computer terminal described above may also execute the program code of the following steps in the data analysis method: after obtaining the initial information, the method further comprises: and if the probability values of the initial case labels do not meet the preset requirements, the target prediction model outputs a first preset prompt.

The computer terminal described above may also execute the program code of the following steps in the data analysis method: after obtaining the target information corresponding to the target medical record according to the second target vector sequence, the method further includes: selecting a vector sequence corresponding to each first target case label from the first target vector sequences to obtain a plurality of third target vector sequences; and predicting the time information of each first target case label through a target prediction model according to the second target vector sequence and the third target vector sequence to obtain a plurality of prediction times.

The computer terminal described above may also execute the program code of the following steps in the data analysis method: the target prediction model is obtained by training through the following steps: determining a sample medical record; acquiring a fourth target vector sequence corresponding to second data information of the sample medical record, wherein the second data information at least comprises: second attribute information and second visit information; acquiring historical medical data information, wherein the historical medical data information at least comprises a medical record set; obtaining, by statistics from the historical medical data information, the probability value with which each preset type of case label occurs in the medical record set of a first target group, wherein the first target group is the case group corresponding to the case in the sample medical record; calculating the historical visit time intervals in the sample medical record according to the fourth target vector sequence; constructing a mask according to the number of fourth target vector sequences, and masking the fourth target vector sequence according to the mask to obtain a processed fourth target vector sequence; and taking the processed fourth target vector sequence, the probability values with which the preset types of case labels occur in the medical record set of the first target group, and the historical visit time intervals as a training set, and training an initial prediction model according to the training set to generate the target prediction model.

The computer terminal described above may further execute the program code of the following steps in the data analysis method: training the initial prediction model according to the training set to generate the target prediction model comprises: inputting the training set into the initial prediction model to obtain predicted probability values of the preset types of case labels and time information of those case labels; performing loss calculation according to each predicted probability value and the probability values with which the preset types of case labels occur in the medical record set of the first target group to obtain a first prediction loss function; performing loss calculation according to the time information of the case labels and the historical visit time intervals to obtain a second prediction loss function; taking the first prediction loss function and the second prediction loss function as the target loss function; and training the initial prediction model according to the target loss function to obtain the target prediction model.

The computer terminal described above may further execute a program code of the following steps in the data analysis method: after obtaining the target information corresponding to the target medical record according to the second target vector sequence, the method further includes: and predicting case labels of a second target population and probability values of the case labels according to target information corresponding to the target medical records to obtain a prediction result, wherein the second target population is a case population corresponding to cases in the target medical records.

The computer terminal described above may further execute a program code of the following steps in the data analysis method: after obtaining the target information corresponding to the target medical record according to the second target vector sequence, the method further includes: randomly selecting the first target case label according to the probability value of each first target case label of the target medical record to obtain a second target case label; inputting a vector sequence corresponding to a second target case label obtained from the first target vector sequence into a target prediction model for time prediction to obtain time information of the second target case label; and updating the second target vector sequence according to the second target case label and the time information of the second target case label to obtain an updated second target vector sequence, and continuously executing the processing of the updated second target vector sequence through the target prediction model until the target prediction model outputs a second preset prompt, or the time information output by the target prediction model is greater than a preset time limit, wherein the second preset prompt is used for representing that the case label currently predicted by the target prediction model does not exist in the target case history.

Optionally, fig. 9 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in fig. 9, the computer terminal 10 may include one or more processors (only one is shown in fig. 9) and a memory.

The memory may be configured to store software programs and modules, such as the program instructions/modules corresponding to the data analysis method and apparatus in the embodiments of the present application. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the data analysis method. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory located remotely from the processor, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: acquiring first data information from a target medical record of a target object, wherein the first data information at least comprises the following components: first attribute information and first visit information; mapping the first attribute information and the first visit information through a feature embedding module in the target prediction model to obtain a first target vector sequence; performing time sequence coding on the first target vector sequence through a multi-head self-attention module in a target prediction model to obtain a second target vector sequence; obtaining target information corresponding to the target medical record according to the second target vector sequence, wherein the target information at least comprises: a first target case label and a probability value for the first target case label.

Optionally, the processor may further perform the following steps: mapping the first attribute information and the first visit information through the feature embedding module in the target prediction model to obtain the first target vector sequence comprises: randomly generating a first vector set of a preset dimension based on the diagnostic data information stored in the target storage area; mapping the first visit information other than the visit date in the target medical record into a first initial vector sequence according to the first vector set; encoding dates by absolute time to obtain a second vector set; mapping the visit date into a second initial vector sequence according to the second vector set; randomly generating a third vector set of a preset dimension based on the attribute information stored in the target storage area; mapping the first attribute information into a third initial vector sequence according to the third vector set; and splicing the first initial vector sequence, the second initial vector sequence and the third initial vector sequence to obtain the first target vector sequence.

Optionally, the processor may further perform the following steps: obtaining target information corresponding to the target medical record according to the second target vector sequence comprises: predicting through a second target vector sequence to obtain initial information, wherein the initial information consists of a plurality of initial case labels and probability values of the initial case labels; taking an initial case label with a probability value meeting a preset requirement as a first target case label; the first target case label and the probability value of the first target case label are taken as target information.

Optionally, the processor may further perform the following steps: after obtaining the initial information, the method further comprises: and if the probability values of the initial case labels do not accord with the preset requirements, the target prediction model outputs a first preset prompt.

Optionally, the processor may further perform the following steps: after obtaining the target information corresponding to the target medical record according to the second target vector sequence, the method further includes: selecting a vector sequence corresponding to each first target case label from the first target vector sequences to obtain a plurality of third target vector sequences; and predicting the time information of each first target case label through a target prediction model according to the second target vector sequence and the third target vector sequence to obtain a plurality of prediction times.

Optionally, the processor may further perform the following steps: the target prediction model is obtained by training through the following steps: determining a sample medical record; acquiring a fourth target vector sequence corresponding to second data information of the sample medical record, wherein the second data information at least comprises: second attribute information and second visit information; acquiring historical medical data information, wherein the historical medical data information at least comprises a medical record set; obtaining, by statistics from the historical medical data information, the probability value with which each preset type of case label occurs in the medical record set of a first target group, wherein the first target group is the case group corresponding to the case in the sample medical record; calculating the historical visit time intervals in the sample medical record according to the fourth target vector sequence; constructing a mask according to the number of fourth target vector sequences, and masking the fourth target vector sequence according to the mask to obtain a processed fourth target vector sequence; and taking the processed fourth target vector sequence, the probability values with which the preset types of case labels occur in the medical record set of the first target group, and the historical visit time intervals as a training set, and training an initial prediction model according to the training set to generate the target prediction model.

Optionally, the processor may further perform the following steps: training the initial prediction model according to the training set to generate the target prediction model comprises: inputting the training set into the initial prediction model to obtain predicted probability values of the preset types of case labels and time information of those case labels; performing loss calculation according to each predicted probability value and the probability values with which the preset types of case labels occur in the medical record set of the first target group to obtain a first prediction loss function; performing loss calculation according to the time information of the case labels and the historical visit time intervals to obtain a second prediction loss function; taking the first prediction loss function and the second prediction loss function as the target loss function; and training the initial prediction model according to the target loss function to obtain the target prediction model.

Optionally, the processor may further perform the following steps: after obtaining the target information corresponding to the target medical record according to the second target vector sequence, the method further includes: and predicting case labels of a second target population and probability values of the case labels according to target information corresponding to the target medical records to obtain a prediction result, wherein the second target population is a case population corresponding to cases in the target medical records.

Optionally, the processor may further perform the following steps: after obtaining the target information corresponding to the target medical record according to the second target vector sequence, the method further includes: randomly selecting the first target case label according to the probability value of each first target case label of the target medical record to obtain a second target case label; inputting a vector sequence corresponding to a second target case label obtained from the first target vector sequence into a target prediction model for time prediction to obtain time information of the second target case label; and updating the second target vector sequence according to the second target case label and the time information of the second target case label to obtain an updated second target vector sequence, and continuously executing the processing of the updated second target vector sequence through the target prediction model until the target prediction model outputs a second preset prompt, or the time information output by the target prediction model is greater than a preset time limit, wherein the second preset prompt is used for representing that the case label currently predicted by the target prediction model does not exist in the target case history.

It can be understood by those skilled in the art that the structure shown in fig. 9 is only an illustration, and the computer terminal may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. Fig. 9 does not limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 9, or have a different configuration than shown in fig. 9.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only Memories (ROMs), Random Access Memories (RAMs), magnetic disks, optical disks, and the like.

Embodiments of the present application also provide a computer-readable storage medium. Optionally, in this embodiment, the storage medium may be configured to store a program code executed by the data analysis method provided in the first embodiment.

Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring first data information from a target medical record of a target object, wherein the first data information at least comprises: first attribute information and first visit information; mapping the first attribute information and the first visit information through a feature embedding module in the target prediction model to obtain a first target vector sequence; performing time sequence coding on the first target vector sequence through a multi-head self-attention module in a target prediction model to obtain a second target vector sequence; and obtaining target information corresponding to the target medical record according to the second target vector sequence, wherein the target information at least comprises: a first target case label and a probability value of the first target case label.

The storage medium is further configured to store program code for performing the steps of: mapping the first attribute information and the first visit information through the feature embedding module in the target prediction model to obtain the first target vector sequence comprises: randomly generating a first vector set of a preset dimension based on the diagnostic data information stored in the target storage area; mapping the first visit information other than the visit date in the target medical record into a first initial vector sequence according to the first vector set; encoding dates by absolute time to obtain a second vector set; mapping the visit date into a second initial vector sequence according to the second vector set; randomly generating a third vector set of a preset dimension based on the attribute information stored in the target storage area; mapping the first attribute information into a third initial vector sequence according to the third vector set; and splicing the first initial vector sequence, the second initial vector sequence and the third initial vector sequence to obtain the first target vector sequence.

The storage medium described above is further configured to store program code for performing the steps of: obtaining target information corresponding to the target medical record according to the second target vector sequence comprises: predicting through a second target vector sequence to obtain initial information, wherein the initial information consists of a plurality of initial case labels and probability values of the initial case labels; taking an initial case label with a probability value meeting a preset requirement as a first target case label; the first target case label and the probability value of the first target case label are taken as target information.

The storage medium is further configured to store program code for performing the steps of: after obtaining the initial information, the method further comprises: and if the probability values of the initial case labels do not meet the preset requirements, the target prediction model outputs a first preset prompt.

The storage medium is further configured to store program code for performing the steps of: after obtaining the target information corresponding to the target medical record according to the second target vector sequence, the method further includes: selecting a vector sequence corresponding to each first target case label from the first target vector sequences to obtain a plurality of third target vector sequences; and predicting the time information of each first target case label through a target prediction model according to the second target vector sequence and the third target vector sequence to obtain a plurality of prediction times.

The storage medium is further configured to store program code for performing the following steps: the target prediction model is obtained by training through the following steps: determining a sample medical record; acquiring a fourth target vector sequence corresponding to second data information of the sample medical record, wherein the second data information at least comprises: second attribute information and second visit information; acquiring historical medical data information, wherein the historical medical data information at least comprises a medical record set; statistically obtaining, from the historical medical data information, the probability values of preset case labels of various types occurring in a medical record set of a first target group, wherein the first target group is a case group corresponding to the case in the sample medical record; calculating historical visit time intervals in the sample medical record according to the fourth target vector sequence; constructing a mask according to the number of the fourth target vector sequences, and masking the fourth target vector sequences according to the mask to obtain a processed fourth target vector sequence; and taking the processed fourth target vector sequence, the probability values of the preset case labels of various types occurring in the medical record set of the first target group, and the historical visit time intervals as a training set, and training the initial prediction model according to the training set to generate the target prediction model.
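As a non-limiting illustration, the three quantities prepared for training (group-level label probabilities, historical visit time intervals, and the mask) can be sketched as follows. The sketch assumes that the mask is a causal mask whose size equals the sequence length, so that each position can only use earlier visits; the text does not state the mask type. All function names are hypothetical.

```python
from collections import Counter
from datetime import date

def visit_intervals(dates):
    """Days between consecutive visits in a sample medical record."""
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]

def causal_mask(n):
    """n x n mask; entry [i][j] is True when position i may use position j (j <= i).

    A causal mask is an assumption made for this sketch.
    """
    return [[j <= i for j in range(n)] for i in range(n)]

def label_frequencies(record_set, labels):
    """Fraction of medical records in the group that contain each preset case label."""
    counts = Counter(label for record in record_set for label in set(record))
    total = len(record_set)
    return {label: counts[label] / total for label in labels}
```

Here `record_set` stands for the medical record set of the first target group, with each record given as a list of case labels.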

The storage medium is further configured to store program code for performing the following steps: training the initial prediction model according to the training set to generate the target prediction model comprises: inputting the training set into the initial prediction model to obtain predicted probability values of the preset case labels of various types and time information of the case labels of various types; performing loss calculation according to each predicted probability value and the probability values of the preset case labels of various types occurring in the medical record set of the first target group to obtain a first prediction loss function; performing loss calculation according to the time information of the case labels of various types and the historical visit time intervals to obtain a second prediction loss function; taking the first prediction loss function and the second prediction loss function as the target loss function; and training the initial prediction model according to the target loss function to obtain the target prediction model.
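As a non-limiting illustration, the two losses and their combination can be sketched as follows. The text does not name the loss forms, so the sketch assumes binary cross-entropy with the group probabilities as soft targets for the first loss, mean squared error for the second loss, and a weighted sum as the target loss. The function names and the `weight` parameter are hypothetical.

```python
import math

def label_loss(predicted, group_probs, eps=1e-12):
    """First loss: binary cross-entropy against group probabilities, averaged over labels."""
    total = 0.0
    for label, q in group_probs.items():
        p = min(max(predicted[label], eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    return total / len(group_probs)

def time_loss(predicted_times, intervals):
    """Second loss: mean squared error between predicted and historical intervals."""
    return sum((p - t) ** 2 for p, t in zip(predicted_times, intervals)) / len(intervals)

def target_loss(predicted, group_probs, predicted_times, intervals, weight=1.0):
    """Target loss: first loss plus `weight` times second loss (assumed combination)."""
    return label_loss(predicted, group_probs) + weight * time_loss(predicted_times, intervals)
```

The weighted sum is one common way to train on two objectives jointly; other combinations would equally fit the wording "taking the first prediction loss function and the second prediction loss function as the target loss function".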

The storage medium is further configured to store program code for performing the steps of: after obtaining the target information corresponding to the target medical record according to the second target vector sequence, the method further includes: and predicting case labels of a second target group and probability values of the case labels through target information corresponding to the target medical records to obtain a prediction result, wherein the second target group is a case group corresponding to cases in the target medical records.

The storage medium is further configured to store program code for performing the following steps: after obtaining the target information corresponding to the target medical record according to the second target vector sequence, the method further includes: randomly selecting among the first target case labels according to the probability value of each first target case label of the target medical record to obtain a second target case label; obtaining, from the first target vector sequence, a vector sequence corresponding to the second target case label, and inputting it into the target prediction model for time prediction to obtain time information of the second target case label; and updating the second target vector sequence according to the second target case label and the time information of the second target case label to obtain an updated second target vector sequence, and continuing to process the updated second target vector sequence through the target prediction model until the target prediction model outputs a second preset prompt or the time information output by the target prediction model is greater than a preset time limit, wherein the second preset prompt is used for indicating that the case label currently predicted by the target prediction model does not exist in the target medical record.
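As a non-limiting illustration, the iterative sampling procedure above can be sketched as a loop. The model is represented by two caller-supplied functions, `predict_labels` and `predict_time`, which are hypothetical stand-ins for the target prediction model; the stop prompt text, the `max_steps` safety bound, and the choice to discard the prediction that exceeds the time limit are likewise assumptions of this sketch.

```python
import random

STOP = "no further case label"  # hypothetical text for the second preset prompt

def rollout(predict_labels, predict_time, history, time_limit, rng=None, max_steps=100):
    """Repeatedly sample a label, predict its time, and extend the history.

    Stops when the model returns STOP, when a predicted time exceeds
    `time_limit`, or after `max_steps` iterations (a guard added for this sketch).
    """
    rng = rng or random.Random(0)
    history = list(history)
    trajectory = []
    for _ in range(max_steps):
        probs = predict_labels(history)
        if probs == STOP:
            break
        labels = list(probs)
        # Sample one label with probability proportional to its predicted value.
        label = rng.choices(labels, weights=[probs[l] for l in labels], k=1)[0]
        when = predict_time(history, label)
        if when > time_limit:
            break
        history.append((label, when))
        trajectory.append((label, when))
    return trajectory

# Toy stand-ins for the model, used only to exercise the loop.
def toy_labels(history):
    return STOP if len(history) >= 3 else {"I10": 0.7, "E11": 0.3}

def toy_time(history, label):
    return 30.0 * (len(history) + 1)
```

With the toy stand-ins, `rollout(toy_labels, toy_time, [], time_limit=365.0)` produces three steps before the stop prompt is returned, whereas a `time_limit` of 50.0 ends the loop after the first step.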

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in another form.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing program code.

The foregoing is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims

1. A method of analyzing data, comprising:

acquiring first data information from a target medical record of a target object, wherein the first data information at least comprises: first attribute information and first visit information;

mapping the first attribute information and the first visit information through a feature embedding module in a target prediction model to obtain a first target vector sequence;

performing time sequence coding on the first target vector sequence through a multi-head self-attention module in the target prediction model to obtain a second target vector sequence;

obtaining target information corresponding to the target medical record according to the second target vector sequence, wherein the target information at least comprises: a first target case label and a probability value for the first target case label.

2. The method of claim 1, wherein mapping the first attribute information and the first visit information by a feature embedding module in the target prediction model to obtain a first target vector sequence comprises:

randomly generating a first vector set of preset dimensions based on the diagnostic data information stored in the target storage area;

mapping the first visit information other than the visit date in the target medical record into a first initial vector sequence according to the first vector set;

encoding dates through absolute time to obtain a second vector set;

mapping the visit date into a second initial vector sequence according to the second vector set;

randomly generating a third vector set of the preset dimension based on the attribute information stored in the target storage area;

mapping the first attribute information into a third initial vector sequence according to the third vector set;

and splicing the first initial vector sequence, the second initial vector sequence and the third initial vector sequence to obtain the first target vector sequence.

3. The method of claim 2, wherein obtaining the target information corresponding to the target medical record according to the second target vector sequence comprises:

predicting through the second target vector sequence to obtain initial information, wherein the initial information consists of a plurality of initial case labels and probability values of the initial case labels;

taking an initial case label with a probability value meeting a preset requirement as the first target case label;

the first target case label and the probability value of the first target case label are taken as the target information.

4. The method of claim 3, wherein after obtaining the initial information, the method further comprises:

and if none of the probability values of the initial case labels meets the preset requirement, the target prediction model outputs a first preset prompt.

5. The method according to claim 1, wherein after obtaining the target information corresponding to the target medical record according to the second target vector sequence, the method further comprises:

selecting a vector sequence corresponding to each first target case label from the first target vector sequences to obtain a plurality of third target vector sequences;

and predicting the time information of each first target case label through the target prediction model according to the second target vector sequence and the third target vector sequences to obtain a plurality of prediction times.

6. The method of claim 1, wherein the target prediction model is trained by:

determining a sample medical record;

obtaining a fourth target vector sequence corresponding to second data information of the sample medical record, wherein the second data information at least comprises: second attribute information and second visit information;

acquiring historical medical data information, wherein the historical medical data information comprises at least one medical record set;

statistically obtaining, from the historical medical data information, the probability values of preset case labels of various types occurring in a medical record set of a first target population, wherein the first target population is a case population corresponding to the case in the sample medical record;

calculating to obtain a historical visit time interval in the sample medical record according to the fourth target vector sequence;

constructing a mask according to the number of the fourth target vector sequences, and masking the fourth target vector sequences according to the mask to obtain a processed fourth target vector sequence;

and taking the processed fourth target vector sequence, the probability values of the preset case labels of various types occurring in the medical record set of the first target population, and the historical visit time interval as a training set, and training an initial prediction model according to the training set to generate the target prediction model.

7. The method of claim 6, wherein training an initial prediction model from the training set, generating the target prediction model comprises:

inputting the training set into the initial prediction model to obtain predicted probability values of the preset case labels of various types and time information of the case labels of various types;

performing loss calculation according to each predicted probability value and probability values of various preset case labels appearing in the medical record set of the first target group to obtain a first predicted loss function;

performing loss calculation according to the time information of the various types of case labels and the historical visit time interval to obtain a second predicted loss function;

taking the first predictive loss function and the second predictive loss function as target loss functions;

and training the initial prediction model according to the target loss function to obtain the target prediction model.

8. The method according to claim 1, wherein after obtaining the target information corresponding to the target medical record according to the second target vector sequence, the method further comprises:

and predicting case labels of a second target population and probability values of the case labels according to target information corresponding to the target medical records to obtain a prediction result, wherein the second target population is a case population corresponding to cases in the target medical records.

9. The method according to claim 1, wherein after obtaining the target information corresponding to the target medical record according to the second target vector sequence, the method further comprises:

randomly selecting the first target case label according to the probability value of each first target case label of the target medical record to obtain a second target case label;

inputting a vector sequence corresponding to the second target case label obtained from the first target vector sequence into the target prediction model for time prediction to obtain time information of the second target case label;

and updating the second target vector sequence according to the second target case label and the time information of the second target case label to obtain an updated second target vector sequence, and continuing to process the updated second target vector sequence through the target prediction model until a second preset prompt is output by the target prediction model or the time information output by the target prediction model is greater than a preset time limit, wherein the second preset prompt is used for indicating that the case label currently predicted by the target prediction model does not exist in the target medical record.

10. A method of analyzing data, comprising:

acquiring first data information which is sent by a client and acquired from a target medical record of a target object, wherein the first data information at least comprises: first attribute information and first visit information;

mapping the first attribute information and the first visit information in a cloud server through a feature embedding module in a target prediction model to obtain a first target vector sequence; performing time sequence coding on the first target vector sequence through a multi-head self-attention module in the target prediction model to obtain a second target vector sequence; obtaining target information corresponding to the target medical record according to the second target vector sequence, wherein the target information at least comprises: a first target case label and a probability value for the first target case label;

and returning the target information to the client.

11. An apparatus for analyzing data, comprising:

a first obtaining unit, configured to obtain first data information from a target medical record of a target object, where the first data information at least includes: first attribute information and first visit information;

the mapping unit is used for mapping the first attribute information and the first visit information through a feature embedding module in a target prediction model to obtain a first target vector sequence;

the coding unit is used for carrying out time sequence coding on the first target vector sequence through a multi-head self-attention module in the target prediction model to obtain a second target vector sequence;

an analysis unit, configured to obtain target information corresponding to the target medical record according to the second target vector sequence, where the target information at least includes: a first target case label and a probability value for the first target case label.

12. A computer-readable storage medium, comprising a stored program, wherein the program, when executed, controls an apparatus in which the storage medium is located to perform the method for analyzing data according to any one of claims 1 to 9.

13. A processor, characterized in that the processor is configured to run a program, wherein the program is configured to perform the method of analyzing data according to any one of claims 1 to 9 when running.