CN115964471B

CN115964471B - Medical data approximate query method

Info

Publication number: CN115964471B
Application number: CN202310255574.2A
Authority: CN
Inventors: 刘瑞华; 张金涛; 李睿; 胡其桐; 郑名扬; 唐学文
Original assignee: Chengdu Angels Biomedical Technology Co ltd
Current assignee: Chengdu Angels Biomedical Technology Co ltd
Priority date: 2023-03-16
Filing date: 2023-03-16
Publication date: 2023-06-02
Anticipated expiration: 2043-03-16
Also published as: CN115964471A

Abstract

The invention belongs to the technical field of database query and discloses a medical data approximate query method comprising the following steps: converting approximate query records into a natural language representation, and performing data enhancement on the query questions in the natural-language approximate query records by synonym replacement; enriching the query results in the natural-language approximate query records; combining the query questions and the query results into a plurality of question-answer pairs that form a question-answer set; fine-tuning the GPT-3 model with the question-answer set; and inputting a natural language query into the GPT-3 model and outputting the answer. The method uses the GPT-3 model to achieve medical data query with ultra-low access latency; by performing data enhancement on the query questions and enriching the query results, the GPT-3 model is better suited to approximate access to the database and the accuracy of approximate access is improved.

Description

Medical data approximate query method

Technical Field

The invention belongs to the technical field of database query, and particularly relates to a medical data approximate query method based on a GPT-3 model.

Background

Approximate query processing is a key problem in databases. It refers to optimization techniques that accelerate queries so that user queries receive a quick response within an acceptable query error. Compared with traditional database queries, approximate queries can greatly improve the query speed of a database at the cost of slightly reduced query precision. They are generally applied to commercial databases with large data volumes and mainly target query statements containing aggregation operations such as count, sum, and avg.

In the field of approximate queries, the prior art focuses on improving query optimizers in databases to ensure that execution plans with higher execution efficiency can be compiled for approximate query statements, thereby accelerating the overall data query process. However, in the process of implementing the present invention, the inventors found that at least the following problems exist in the prior art:

1. Query efficiency still struggles to reach the millisecond level. Although existing approximate query techniques can reduce the range of data to be queried by means such as sampling, the process is still extremely time-consuming when the data volume is large. Moreover, the prior art focuses on improving the query optimizer inside the database to accelerate queries, but standard query statements must still be executed in the database, so time-consuming operations such as data scanning and transmission cannot be avoided;

2. Approximate queries in natural language form cannot be answered. Database providers often require users to have a certain SQL writing ability, and even if some users have mastered the writing of simple SQL, execution can still be too slow because the SQL they write is not sufficiently standard. As a result, database providers place excessively high demands on users' expertise.

Disclosure of Invention

The present invention aims to solve the above technical problems at least to some extent. To this end, the present invention aims to provide a medical data approximate query method.

The technical scheme adopted by the invention is as follows:

The medical data approximate query method comprises the following steps:

S1, converting approximate query records into a natural language representation through a Transformer model, and performing data enhancement on the query questions in the natural-language approximate query records by synonym replacement; enriching the query results in the natural-language approximate query records; combining each data-enhanced query question and the corresponding query result into a plurality of question-answer pairs, and forming a question-answer set from the plurality of question-answer pairs;

S2, processing the question-answer set into a data format comprising prompts and completions, and calling the fine-tuning API of the GPT-3 model with the processed question-answer set to fine-tune the GPT-3 model;

S3, inputting a natural language query into the fine-tuned GPT-3 model, and outputting the answer of the GPT-3 model.

Preferably, the approximate query record in step S1 includes a historical approximate query record and a randomly generated approximate query record, the randomly generated approximate query record being randomly generated by means of a fixed query template.

Preferably, the Transformer model in step S1 represents the query language as a two-dimensional matrix $X_1$, in which each vector is the embedded representation of one word (token) in the query language, and transforms it into three matrices Q, K, and V by three linear transformations, the transformation formula being:

$$Q = \mathrm{MatMul}(X_1, W_Q), \quad K = \mathrm{MatMul}(X_1, W_K), \quad V = \mathrm{MatMul}(X_1, W_V)$$

wherein MatMul represents a linear matrix multiplication, and $W_Q$, $W_K$, $W_V$ respectively represent three different two-dimensional matrices.

Preferably, the objective function of the GPT-3 model fine-tuning process is constructed by a maximum likelihood function:

$$L(\theta) = \sum_{i} \log P(y_i \mid x, y_{<i}; \theta)$$

wherein $x$ represents the question in a question-answer pair, $y_{<i}$ represents the first $i$ words of the query result, and $\theta$ represents the parameters of the GPT-3 model.

Preferably, the GPT-3 model in step S3 accepts the matrix representing the natural language query and outputs a first word $y_0$; the query together with $y_0$ is then taken as input to output the second word, and so on until the end identifier is output.

The beneficial effects of the invention are as follows:

According to the medical data approximate query method provided by the invention, the GPT-3 model is used to achieve medical data query with ultra-low access latency; data enhancement is applied to the query questions and the query results are enriched in natural language form, so that the GPT-3 model is better suited to approximate access to the database and the accuracy of approximate access is improved.

The medical data approximate query method also solves the problem of insufficient historical approximate query records by randomly generating approximate query records from fixed query templates.

Drawings

FIG. 1 is a flow chart of a medical data approximation query method of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and fully below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the invention.

It should also be appreciated that, in some embodiments, the functions/acts may occur in a different order than shown in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or may sometimes be executed in the reverse order, depending on the functionality/acts involved.

As shown in fig. 1, the medical data approximate query method of the present embodiment includes the following steps:

S1, converting the approximate query records into a natural language representation through a Transformer model. Compared with a sequence model, the Transformer model has the advantage that it can compute in parallel by taking the input representation as a matrix, which improves training and inference efficiency. The Transformer model represents the input as a two-dimensional matrix $X_1$ and converts it into three matrices Q, K, and V by three linear transformations so that the self-attention mechanism can operate on them. The conversion formula is as follows:

$$Q = \mathrm{MatMul}(X_1, W_Q), \quad K = \mathrm{MatMul}(X_1, W_K), \quad V = \mathrm{MatMul}(X_1, W_V)$$

wherein MatMul represents a linear matrix multiplication, and $W_Q$, $W_K$, $W_V$ respectively represent three different two-dimensional matrices.
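The three linear transformations can be illustrated with a few lines of plain Python. This is only a sketch: the helper `matmul`, the toy dimensions, and the numeric values of `X1`, `W_Q`, `W_K`, and `W_V` are assumptions chosen for readability and do not come from the invention.

```python
def matmul(a, b):
    """Multiply two 2-D matrices given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Toy input: 2 tokens, each with a 3-dimensional embedding (one row per token).
X1 = [[1.0, 0.0, 2.0],
      [0.0, 1.0, 1.0]]

# Hypothetical projection weights (3 x 2); a real model learns these.
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
W_V = [[1.0, 1.0], [0.0, 2.0], [2.0, 0.0]]

# One matrix product per projection; every token row is handled at once.
Q = matmul(X1, W_Q)
K = matmul(X1, W_K)
V = matmul(X1, W_V)
```

Because each projection is a single matrix product over all token rows, the whole input is transformed in one step rather than token by token, which is the parallelism the paragraph above refers to.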

The approximate query records include historical approximate query records and randomly generated approximate query records. If the target medical database does not have enough historical approximate query records, i.e., the historical records are insufficient to cover the various data ranges (attributes and attribute values) and query types (count, sum, avg), they need to be supplemented with approximate query records randomly generated from fixed query templates. For example, if a large number of historical approximate query records already exist for the data range "number of patients", then approximate query records need to be randomly generated for the data range "number of hospitalized patients" or other data ranges. Combining the randomly generated records with the historical records yields sufficient fine-tuning data for the subsequent steps.
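Template-based random generation might look like the following sketch. The template string, the table name `records`, the attribute lists, and the value range are all hypothetical; a real system would derive them from the schema of the target medical database.

```python
import random

# Hypothetical fixed template and slot values (assumptions of this sketch).
TEMPLATE = "SELECT {agg}({attr}) FROM {table} WHERE {cond_attr} {op} {value}"
AGGREGATES = ["COUNT", "SUM", "AVG"]
ATTRIBUTES = ["patient", "inpatient"]
OPERATORS = [">", "<", "="]

def generate_query(rng):
    """Fill the fixed template with randomly chosen slot values."""
    return TEMPLATE.format(
        agg=rng.choice(AGGREGATES),
        attr=rng.choice(ATTRIBUTES),
        table="records",
        cond_attr="age",
        op=rng.choice(OPERATORS),
        value=rng.randint(1, 100),
    )

def generate_queries(n, seed=0):
    """Generate n queries; a fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    return [generate_query(rng) for _ in range(n)]

queries = generate_queries(5)
```

Each generated statement would then be executed once against the database to obtain its result, so that the pair can join the historical records as fine-tuning data.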

Data enhancement is performed on the query questions in the natural-language approximate query records by synonym replacement, so as to strengthen the generalization ability of the fine-tuned GPT-3 model. The process looks up words or terms of the query question in an open-source synonym table (e.g., the Harbin Institute of Technology synonym table), replaces them, and stores the result as a new query. For example, for the query question "What is the number of patients older than 50?", the word "number" is looked up in the synonym table to obtain the synonym "total amount", and the new query after replacement is "What is the total amount of patients older than 50?".
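A minimal sketch of synonym-replacement augmentation is shown below. The in-memory `SYNONYMS` dictionary is a hypothetical stand-in for an open-source synonym table, and whitespace tokenization is an assumption that suits English better than Chinese text.

```python
# Hypothetical miniature synonym table; a real table would be loaded from a file.
SYNONYMS = {
    "number": ["total amount", "count"],
}

def augment(question, synonyms=SYNONYMS):
    """Return new questions, each with one word replaced by one synonym."""
    words = question.split()
    variants = []
    for i, word in enumerate(words):
        for alt in synonyms.get(word.lower(), []):
            variants.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return variants

new_questions = augment("What is the number of patients older than 50?")
```

Every variant keeps the original query result as its answer, so one historical record yields several question-answer pairs.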

The query results in the natural-language approximate query records are enriched in natural language form, so that they can be used together with the natural-language query questions to fine-tune the GPT-3 model. For example, if the query question is "What is the number of patients older than 50?" and the query result is "2003", the query result is enriched as "The total number of patients is 2003." The process automatically converts query results into natural language based on the type of aggregate operation in the query (SUM, COUNT, etc.) and the name of the attribute being queried (e.g., "patient"). Specifically, "SUM(patient)" is first captured in the query with a regular expression; "SUM" is then translated into "total number" and the attribute name "patient" is extracted. Finally, the pieces are combined in the order "attribute (patient)" + "aggregate operation (total number)" + "is" + "query result (2003)" + "." to form the natural-language sentence "The total number of patients is 2003."
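The regular-expression step described above could be sketched as follows. The phrase table `AGG_PHRASES`, the exact pattern, the English word order, and the naive pluralization are assumptions made for illustration.

```python
import re

# Hypothetical mapping from aggregate keywords to natural-language phrases.
AGG_PHRASES = {"SUM": "total number", "COUNT": "count", "AVG": "average"}

# Captures e.g. "SUM(patient)" or "sum ( patient )".
AGG_PATTERN = re.compile(r"\b(SUM|COUNT|AVG)\s*\(\s*(\w+)\s*\)", re.IGNORECASE)

def enrich_result(query, result):
    """Turn a bare query result into a sentence; return None if no aggregate is found."""
    match = AGG_PATTERN.search(query)
    if match is None:
        return None
    agg = match.group(1).upper()
    attribute = match.group(2)
    # Naive pluralization by appending "s" (an assumption of this sketch).
    return f"The {AGG_PHRASES[agg]} of {attribute}s is {result}."

sentence = enrich_result("SELECT SUM(patient) FROM records WHERE age > 50", 2003)
```

Returning `None` for queries without a recognized aggregate lets the caller skip or log records that the template cannot describe.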

Each data-enhanced query question and its corresponding query result are combined into a question-answer pair, and the plurality of question-answer pairs form a question-answer set.

S2, processing the question-answer set into a data format comprising prompts and completions, specifically {"prompt": "<prompt text>", "completion": "<ideal generated text>"}; and calling the fine-tuning API of the GPT-3 model with the processed question-answer set to fine-tune the GPT-3 model, so that the GPT-3 model can accurately answer approximate queries against the target medical database. The GPT-3 model is used because the pre-trained GPT-3 model learns well from a small number of samples.
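Preparing the question-answer set as one JSON object per line could look like the sketch below. The key names `prompt` and `completion` are this sketch's reading of the two-field format described above, and the sample pairs are illustrative. The upload and fine-tuning call itself is deliberately omitted, because it depends on the provider's API version.

```python
import json

def to_jsonl(qa_pairs):
    """Serialize (question, answer) pairs as one JSON object per line."""
    lines = []
    for question, answer in qa_pairs:
        record = {"prompt": question, "completion": answer}
        # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)

# Illustrative pairs: an original question and its synonym-replaced variant.
qa_pairs = [
    ("What is the number of patients older than 50?",
     "The total number of patients is 2003."),
    ("What is the total amount of patients older than 50?",
     "The total number of patients is 2003."),
]
jsonl_text = to_jsonl(qa_pairs)
```

Writing `jsonl_text` to a file gives a training file in which every line can be parsed independently.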

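The GPT-3 model has the following formulas:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$

where Q represents the input information, i.e., the information present in the input text; K represents the content information, i.e., the semantic information; the degree of matching between the Query and the Key is expressed by Attention(Q, K); V represents the information itself, which is weighted by the degree of matching; $d_k$ is the dimension of the key vectors; $\mathrm{MultiHead}(Q, K, V)$ is the result of the multi-head attention calculation over the $h$ attention heads; and Concat represents matrix concatenation. The objective function of the GPT-3 model fine-tuning process is constructed by a maximum likelihood function:

$$L(\theta) = \sum_{i} \log P(y_i \mid x, y_{<i}; \theta)$$

wherein $x$ represents the question in a question-answer pair, $y_{<i}$ represents the first $i$ words of the query result, and $\theta$ represents the parameters of the GPT-3 model.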
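The maximum-likelihood objective sums, over the words of the answer, the log-probability that the model assigns to each word given the question and the words before it. The sketch below evaluates that sum for a toy model. The function names, the `<end>` token, and the fixed probability table are hypothetical stand-ins for the actual GPT-3 model, which conditions on its inputs.

```python
import math

def sequence_log_likelihood(next_token_probs, question, answer_tokens):
    """Sum log P(y_i | question, y_0..y_{i-1}) over the answer tokens."""
    total = 0.0
    for i, token in enumerate(answer_tokens):
        # The model sees the question plus the first i answer tokens.
        probs = next_token_probs(question, answer_tokens[:i])
        total += math.log(probs[token])
    return total

def toy_model(question, prefix):
    """Hypothetical model: a fixed distribution that ignores its context."""
    return {"The": 0.5, "total": 0.25, "<end>": 0.25}

ll = sequence_log_likelihood(
    toy_model, "How many patients?", ["The", "total", "<end>"]
)
```

Fine-tuning adjusts the model parameters to raise this quantity across all question-answer pairs in the set.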

S3, inputting the user's natural language query into the fine-tuned GPT-3 model and returning the answer of the GPT-3 model to the user. Specifically, the GPT-3 model accepts the natural language query and outputs a first word $y_0$; the query together with $y_0$ is then taken as input to output the second word, and so on until the end identifier is output. For example, a user gives the natural language query "Number of people who catch a cold each month." The GPT-3 model accepts this statement and outputs the word most likely to appear first in the answer, "Number"; the GPT-3 model then accepts "Number of people who catch a cold each month. Number" and outputs the next word; and so on, until the complete output "Number is 500." is finally obtained.
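The word-by-word generation loop can be sketched as follows. The `<end>` marker, the `max_tokens` guard, and the canned `toy_next_token` function are assumptions that stand in for the fine-tuned model, which would predict each word from the query and the words generated so far.

```python
END = "<end>"  # hypothetical end identifier

def generate(next_token, query, max_tokens=50):
    """Append one predicted token at a time until END or max_tokens is reached."""
    output = []
    while len(output) < max_tokens:
        # The model receives the query plus everything generated so far.
        token = next_token(query, output)
        if token == END:
            break
        output.append(token)
    return output

# Hypothetical stand-in for the fine-tuned model: replays a canned answer.
CANNED = ["Number", "is", "500.", END]

def toy_next_token(query, generated):
    return CANNED[len(generated)]

answer = " ".join(
    generate(toy_next_token, "Number of people who catch a cold each month.")
)
```

The `max_tokens` guard is a design choice of this sketch: it bounds the loop if the end identifier is never produced.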

With the medical data approximate query method, a user can obtain the corresponding answer simply by describing an approximate query in natural language, without having to master efficient SQL writing, which lowers the expertise that a medical database provider requires of its users; medical data query with ultra-low access latency can also be achieved. When a user inputs a natural language query, no standard SQL query is actually executed in the database; instead, the natural language query is matched against the historical records, and the query result of a similar natural language query is quickly returned. This greatly improves the response efficiency of the query (to the millisecond level).

The invention is not limited to the above alternative embodiments, and anyone may derive products in various other forms in light of the present invention. However, regardless of any change in shape or structure, all technical solutions falling within the scope defined by the claims of the present invention fall within the scope of protection of the present invention.

Claims

1. A method for approximate query of medical data, comprising the steps of:

S3, inputting a natural language query into the fine-tuned GPT-3 model, and outputting the answer of the GPT-3 model;

the Transformer model in step S1 represents the query language as a two-dimensional matrix $X_1$, in which each vector is the embedded representation of one word in the query language, and transforms it into three matrices Q, K, and V by three linear transformations, the transformation formula being:

$$Q = \mathrm{MatMul}(X_1, W_Q), \quad K = \mathrm{MatMul}(X_1, W_K), \quad V = \mathrm{MatMul}(X_1, W_V)$$

wherein MatMul represents a linear matrix multiplication, and $W_Q$, $W_K$, $W_V$ respectively represent three different two-dimensional matrices.

2. The medical data approximate query method according to claim 1, wherein: the approximate query records in step S1 include historical approximate query records and randomly generated approximate query records, the latter being randomly generated by means of a fixed query template.

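3. The medical data approximate query method according to claim 1, wherein: the objective function of the GPT-3 model fine-tuning process is constructed by a maximum likelihood function:

$$L(\theta) = \sum_{i} \log P(y_i \mid x, y_{<i}; \theta)$$

wherein $x$ represents the question in a question-answer pair, $y_{<i}$ represents the first $i$ words of the query result, and $\theta$ represents the parameters of the GPT-3 model.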

4. The medical data approximate query method according to claim 1, wherein: in step S3, the GPT-3 model accepts the matrix representing the natural language query and outputs a first word $y_0$; the query together with $y_0$ is then taken as input to output the second word, and so on until the end identifier is output.