CN115964471B - Medical data approximate query method - Google Patents

Publication number
CN115964471B
Authority
CN
China
Prior art keywords
query
model
gpt
approximate
question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310255574.2A
Other languages
Chinese (zh)
Other versions
CN115964471A (en
Inventor
刘瑞华
张金涛
李睿
胡其桐
郑名扬
唐学文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Angels Biomedical Technology Co ltd
Original Assignee
Chengdu Angels Biomedical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Angels Biomedical Technology Co ltd
Priority to CN202310255574.2A
Publication of CN115964471A
Application granted
Publication of CN115964471B
Active
Anticipated expiration

Classifications

    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention belongs to the technical field of database querying and discloses a medical data approximate query method comprising the following steps: converting approximate query records into a natural-language representation, and performing data augmentation on the query questions in the natural-language approximate query records by synonym replacement; enriching the query results in the natural-language approximate query records; combining the query questions and the query results into a plurality of question-answer pairs, which form a question-answer set; fine-tuning the GPT-3 model with the question-answer set; and inputting a natural language query into the GPT-3 model and outputting the answer result. The method uses the GPT-3 model to achieve medical data queries with ultra-low access latency. By augmenting the query questions and enriching the query results in natural-language form, the GPT-3 model is better adapted to approximate access to the database, and the accuracy of approximate access is improved.

Description

Medical data approximate query method
Technical Field
The invention belongs to the technical field of database query, and particularly relates to a medical data approximate query method based on a GPT-3 model.
Background
Approximate query processing is a key problem in databases. It refers to optimization techniques that accelerate queries so that user queries are answered quickly within an acceptable query error. Compared with traditional database queries, approximate queries can greatly improve query speed at the cost of a slight loss of query precision. They are generally applied to commercial databases with large data volumes and mainly target query statements containing aggregation operations such as COUNT, SUM, and AVG.
In the field of approximate queries, the prior art focuses on improving query optimizers in databases to ensure that execution plans with higher execution efficiency can be compiled for approximate query statements, thereby accelerating the overall data query process. However, in the process of implementing the present invention, the inventors found that at least the following problems exist in the prior art:
1. Query efficiency still struggles to reach the millisecond level. Although existing approximate query techniques can reduce the range of data to be queried by means such as sampling, the process remains extremely time-consuming when the data volume is large. Moreover, the prior art focuses on improving the query optimizer inside the database to accelerate queries, but still requires standard query statements to be executed in the database, so time-consuming operations such as data scanning and transmission cannot be avoided.
2. Approximate queries in natural language form cannot be answered. Database providers often require users to have a certain level of SQL-writing ability, and even when users manage to write simple SQL, execution is often still too slow because the SQL they write is not well formed. As a result, database providers place excessive demands on users' expertise.
Disclosure of Invention
The present invention aims to solve the above technical problems at least to some extent. To this end, the present invention aims to provide a medical data approximate query method.
The technical scheme adopted by the invention is as follows:
The medical data approximate query method comprises the following steps:
S1, converting the approximate query records into a natural-language representation through a Transformer model, and performing data augmentation on the query questions in the natural-language approximate query records by synonym replacement; enriching the query results in the natural-language approximate query records; combining each augmented query question with its corresponding query result into question-answer pairs, the plurality of question-answer pairs forming a question-answer set;
S2, processing the question-answer set into a data format comprising prompts and completions, and calling the fine-tuning API of the GPT-3 model with the processed question-answer set to fine-tune the GPT-3 model;
S3, inputting a natural language query into the fine-tuned GPT-3 model, and outputting the answer result of the GPT-3 model.
Preferably, the approximate query record in step S1 includes a historical approximate query record and a randomly generated approximate query record, the randomly generated approximate query record being randomly generated by means of a fixed query template.
Preferably, the Transformer model in step S1 represents the query language as a two-dimensional matrix $X_1$, in which each vector is the embedded representation of one word (token) in the query language, and transforms it into three matrices Q, K, and V by three linear transformations. The transformation formula is:
$$Q = \mathrm{MatMul}(X_1, W_Q), \quad K = \mathrm{MatMul}(X_1, W_K), \quad V = \mathrm{MatMul}(X_1, W_V)$$
where MatMul denotes linear matrix multiplication, and $W_Q$, $W_K$, $W_V$ denote three different two-dimensional matrices.
Preferably, the objective function of the GPT-3 model fine-tuning process is constructed from the maximum likelihood function:
$$L(\theta) = \sum_{i} \log P(y_i \mid X, y_{<i}; \theta)$$
where $X$ represents the question in a question-answer pair, $y_{<i}$ represents the first $i$ words of the query result, and $\theta$ represents the parameters of the GPT-3 model.
Preferably, the GPT-3 model in step S3 accepts the matrix $X$ representing the natural language query and outputs a word $y_0$, after which it takes $[X, y_0]$ as input and outputs the second word, and so on until the end identifier is output.
The beneficial effects of the invention are as follows:
The medical data approximate query method provided by the invention uses the GPT-3 model to achieve medical data queries with ultra-low access latency. By performing data augmentation on the query questions and enriching the query results in natural-language form, the GPT-3 model is better adapted to approximate access to the database, and the accuracy of approximate access is improved.
The method also solves the problem of insufficient historical approximate query records by randomly generating approximate query records from fixed query templates.
Drawings
FIG. 1 is a flow chart of the medical data approximate query method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
It should also be appreciated that in some embodiments the functions/acts may occur in an order different from that shown in the figures. For example, two steps shown in succession may in fact be executed substantially concurrently, or may sometimes be executed in the reverse order, depending on the functionality/acts involved.
As shown in fig. 1, the medical data approximate query method of the present embodiment includes the following steps:
S1, converting the approximate query records into a natural-language representation through a Transformer model. The advantage of the Transformer model over a sequential model is that it represents the input as a matrix and can therefore compute in parallel, which improves training and inference efficiency. The Transformer model represents the input as a two-dimensional matrix $X_1$ and converts it into three matrices Q, K, and V by three linear transformations so that the self-attention mechanism can operate on them. The conversion formula is as follows:
$$Q = \mathrm{MatMul}(X_1, W_Q), \quad K = \mathrm{MatMul}(X_1, W_K), \quad V = \mathrm{MatMul}(X_1, W_V)$$
where MatMul denotes linear matrix multiplication, and $W_Q$, $W_K$, $W_V$ denote three different two-dimensional matrices.
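The three linear transformations can be illustrated with a minimal sketch in plain Python. The matrix values, the toy sizes (3 tokens, embedding size 2), and the names `W_Q`, `W_K`, `W_V` are assumptions chosen for illustration only; a real Transformer learns these weights and uses far larger dimensions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# X1: one row per token (3 tokens, embedding size 2). Toy values only.
X1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical projection matrices (2 x 2); in practice these are learned.
W_Q = [[1.0, 2.0], [3.0, 4.0]]
W_K = [[0.5, 0.0], [0.0, 0.5]]
W_V = [[0.0, 1.0], [1.0, 0.0]]

# The same input is projected three times, once per weight matrix.
Q = matmul(X1, W_Q)
K = matmul(X1, W_K)
V = matmul(X1, W_V)
```

Because every token row is projected by the same matrix multiplication, all tokens are processed at once, which is the parallelism the description attributes to representing the input as a matrix.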
The approximate query records include historical approximate query records and randomly generated approximate query records. When the target medical database does not have enough historical approximate query records, that is, when the historical records are insufficient to cover the various data ranges (attributes and attribute values) and query types (COUNT, SUM, AVG), they need to be supplemented with approximate query records generated randomly from fixed query templates. For example, if a large number of historical approximate query records already exist for the data range "number of patients", approximate query records need to be randomly generated for the data range "patient length of stay" or other data ranges. Combining the randomly generated records with the historical records yields enough fine-tuning data for the subsequent steps.
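Random generation from a fixed query template might look like the following sketch. The template string, the table name `patients`, and the attribute names and value ranges are all hypothetical; the patent does not specify them.

```python
import random

# Fixed query template with slots to be filled at random (hypothetical layout).
TEMPLATE = "SELECT {agg}({attr}) FROM {table} WHERE {cond_attr} {op} {value}"

AGGREGATES = ["COUNT", "SUM", "AVG"]
# Hypothetical attributes with the value range used for random conditions.
ATTRIBUTES = {"age": (0, 100), "length_of_stay": (1, 60)}
OPERATORS = [">", "<", ">=", "<="]

def random_query(rng, table="patients"):
    """Fill every slot of the template with a randomly chosen value."""
    cond_attr = rng.choice(sorted(ATTRIBUTES))
    low, high = ATTRIBUTES[cond_attr]
    return TEMPLATE.format(
        agg=rng.choice(AGGREGATES),
        attr=rng.choice(sorted(ATTRIBUTES)),
        table=table,
        cond_attr=cond_attr,
        op=rng.choice(OPERATORS),
        value=rng.randint(low, high),
    )

rng = random.Random(0)  # seeded so the generated set is reproducible
queries = [random_query(rng) for _ in range(5)]
```

Each generated statement would still have to be executed once against the database to obtain its result before it can serve as fine-tuning data.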
Data augmentation is performed on the query questions in the natural-language approximate query records by synonym replacement, in order to strengthen the generalization ability of the fine-tuned GPT-3 model. The process looks up words or terms of the query question in an open-source synonym table (for example, the Harbin Institute of Technology (HIT) synonym table), replaces them, and stores the result as a new query. For example, for the query question "What is the number of patients older than 50?", the word "number" is looked up in the synonym table, which yields the synonym "total amount", and the new query after replacement is "What is the total amount of patients older than 50?".
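A minimal sketch of synonym-replacement augmentation follows. The small `SYNONYMS` dictionary is a hypothetical stand-in for an open-source synonym table, and whitespace tokenization is a simplifying assumption suited to English examples rather than Chinese text.

```python
# Hypothetical synonym table; a real system would load an open-source resource.
SYNONYMS = {"number": ["total amount", "count"], "patients": ["inpatients"]}

def augment(question, synonyms):
    """Return the original question plus one variant per single-word substitution."""
    variants = [question]
    words = question.split()
    for i, word in enumerate(words):
        for alt in synonyms.get(word.lower(), []):
            variants.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return variants

augmented = augment("What is the number of patients older than 50?", SYNONYMS)
```

Every variant keeps the same query result as the original question, so one historical record yields several question-answer pairs.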
The query results in the natural-language approximate query records are enriched in natural-language form so that, together with the natural-language query questions, they can be used to fine-tune the GPT-3 model. For example, if the query question is "What is the number of patients older than 50?" and the query result is "2003", the query result is enriched as "The number is 2003.". The process automatically converts the query result into natural language based on the type of aggregate operation in the query (SUM, COUNT, etc.) and the name of the queried attribute (e.g. "patient"). Specifically, "SUM(patient)" is first captured in the query with a regular expression; "SUM" is then translated into "total number", and the attribute name "patient" is extracted. Finally, these parts are combined with the query result "2003" in the order "attribute (patient)" + "aggregate operation (total number)" + "is" + "query result (2003)" + "." to form the natural-language sentence "The total number of patients is 2003.".
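The result-enrichment step can be sketched with a regular expression as below. The phrase table `AGG_PHRASES`, the English sentence pattern, and the table name in the sample query are assumptions; the patent only states that the aggregate operation and attribute name are captured and recombined with the result.

```python
import re

# Hypothetical mapping from aggregate operation to a natural-language phrase.
AGG_PHRASES = {"SUM": "total number", "COUNT": "count", "AVG": "average"}

def enrich(sql, result):
    """Turn a bare query result into a sentence using the aggregate and attribute."""
    match = re.search(r"\b(SUM|COUNT|AVG)\s*\(\s*(\w+)\s*\)", sql, flags=re.IGNORECASE)
    if match is None:
        return str(result)  # no recognizable aggregate: leave the result as-is
    agg, attr = match.group(1).upper(), match.group(2)
    return f"The {AGG_PHRASES[agg]} of {attr} is {result}."

sentence = enrich("SELECT SUM(patient) FROM records WHERE age > 50", 2003)
```

The sketch inserts the attribute name verbatim, so pluralization or other linguistic polishing would need an extra step.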
Each augmented query question is combined with its corresponding query result into a question-answer pair, and the plurality of question-answer pairs form a question-answer set.
S2, processing the question-answer set into a data format comprising prompts and completions, specifically {"prompt": "<prompt text>", "completion": "<ideal generated text>"}; and calling the fine-tuning API of the GPT-3 model with the processed question-answer set to fine-tune the GPT-3 model, so that the GPT-3 model can accurately answer approximate queries against the target medical database. The GPT-3 model is used because a pre-trained GPT-3 model learns well from a small number of samples.
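Serializing the question-answer set into one JSON object per line, with a prompt field and a completion field, could be sketched as follows. The sample pairs and the helper name `to_jsonl` are illustrative assumptions, and the upload and fine-tuning API call itself is deliberately left out of the sketch.

```python
import json

# Illustrative question-answer pairs (augmented questions share one answer).
qa_pairs = [
    ("What is the number of patients older than 50?", "The number is 2003."),
    ("What is the total amount of patients older than 50?", "The number is 2003."),
]

def to_jsonl(pairs):
    """Serialize each (question, answer) pair as one JSON line."""
    return "\n".join(
        json.dumps({"prompt": question, "completion": answer}, ensure_ascii=False)
        for question, answer in pairs
    )

jsonl_text = to_jsonl(qa_pairs)
```

Passing `ensure_ascii=False` keeps Chinese questions and answers readable in the output file instead of escaping them.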
The GPT-3 model has the following formulas:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$
where Q (Query) represents the input information, that is, the information present in the input text; K (Key) represents the content information, that is, the semantic information; the degree of matching between Query and Key is expressed by Attention(Q, K); and V (Value) represents the information itself, whose main role is to be weighted by the degree of matching. Here $d_k$ is the dimension of the key vectors, $\mathrm{MultiHead}(Q, K, V)$ is the result of the multi-head attention computation over $h$ heads, and Concat denotes matrix concatenation.
The objective function of the GPT-3 model fine-tuning process is constructed from the maximum likelihood function:
$$L(\theta) = \sum_{i} \log P(y_i \mid X, y_{<i}; \theta)$$
where $X$ represents the question in a question-answer pair, $y_{<i}$ represents the first $i$ words of the query result, and $\theta$ represents the parameters of the GPT-3 model.
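The maximum-likelihood objective can be made concrete with a small numeric sketch. It assumes the objective is the sum of the log-probabilities that the model assigns to each word of the reference answer, given the question and the preceding answer words; the probability values are invented for illustration.

```python
import math

def sequence_log_likelihood(step_probs):
    """Sum the log-probabilities of the reference answer words, one per step."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical probabilities assigned to the correct word at each position.
before_tuning = [0.5, 0.25, 0.5]
after_tuning = [0.9, 0.9, 0.9]

ll_before = sequence_log_likelihood(before_tuning)
ll_after = sequence_log_likelihood(after_tuning)
```

Fine-tuning adjusts the parameters so that this quantity increases over the question-answer set, which is equivalent to minimizing the usual cross-entropy loss.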
S3, inputting the user's natural language query into the fine-tuned GPT-3 model, and returning the answer result of the GPT-3 model to the user. Specifically, the GPT-3 model accepts the natural language query $X$ and outputs a word $y_0$, after which it takes $[X, y_0]$ as input and outputs the second word, and so on until the end identifier is output. For example, suppose the user gives the natural-language query "number of people who catch a cold per month". The GPT-3 model accepts this statement and outputs the most likely first word of the answer, "number"; the model then accepts "number of people who catch a cold per month. number" and outputs "quantity"; and so on, until the complete output "The number is 500." is obtained.
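The word-by-word generation loop can be sketched as follows. The `make_toy_model` helper is a hypothetical stand-in that simply replays a fixed answer; a real fine-tuned model would instead score its vocabulary given the query and the words generated so far. The `END` marker and the `max_steps` cap are likewise assumptions.

```python
END = "<end>"  # hypothetical end identifier

def make_toy_model(answer_tokens):
    """Return a stand-in next-word function that replays a fixed answer."""
    def next_token(prompt_tokens, generated):
        # A real model would condition on prompt_tokens + generated here.
        if len(generated) < len(answer_tokens):
            return answer_tokens[len(generated)]
        return END
    return next_token

def generate(next_token, prompt_tokens, max_steps=20):
    """Repeatedly ask for the next word and feed it back until END appears."""
    generated = []
    for _ in range(max_steps):
        token = next_token(prompt_tokens, generated)
        if token == END:
            break
        generated.append(token)  # becomes part of the input for the next step
    return generated

model = make_toy_model(["The", "number", "is", "500", "."])
answer = generate(model, ["number", "of", "people", "with", "a", "cold", "per", "month"])
```

The step cap guards against a model that never emits the end identifier.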
With this medical data approximate query method, a user can obtain the corresponding answer simply by describing an approximate query in natural language, without mastering efficient SQL writing, which lowers the expertise that a medical database provider requires of its users. The method also achieves medical data queries with ultra-low access latency: when a user inputs a natural language query, no standard SQL query is actually executed in the database; instead, the natural language query is matched against the historical records, and the query result of a similar natural language query is returned quickly. This greatly improves query response efficiency, reaching the millisecond (ms) level.
The invention is not limited to the alternative embodiments described above. Anyone may derive other products in various forms in light of the present invention; however, regardless of any change in shape or structure, all technical solutions that fall within the scope defined by the claims of the present invention fall within the scope of protection of the present invention.

Claims (4)

1. A method for approximate query of medical data, comprising the steps of:
S1, converting the approximate query records into a natural-language representation through a Transformer model, and performing data augmentation on the query questions in the natural-language approximate query records by synonym replacement; enriching the query results in the natural-language approximate query records; combining each augmented query question with its corresponding query result into question-answer pairs, the plurality of question-answer pairs forming a question-answer set;
S2, processing the question-answer set into a data format comprising prompts and completions, and calling the fine-tuning API of the GPT-3 model with the processed question-answer set to fine-tune the GPT-3 model;
S3, inputting a natural language query into the fine-tuned GPT-3 model, and outputting the answer result of the GPT-3 model;
the Transformer model in step S1 represents the query language as a two-dimensional matrix $X_1$, in which each vector is the embedded representation of one word in the query language, and transforms it into three matrices Q, K, and V by three linear transformations, the transformation formula being:
$$Q = \mathrm{MatMul}(X_1, W_Q), \quad K = \mathrm{MatMul}(X_1, W_K), \quad V = \mathrm{MatMul}(X_1, W_V)$$
where MatMul denotes linear matrix multiplication, and $W_Q$, $W_K$, $W_V$ denote three different two-dimensional matrices.
2. The medical data approximate query method according to claim 1, wherein: the approximate query records in step S1 include historical approximate query records and randomly generated approximate query records, the latter being randomly generated from fixed query templates.
3. The medical data approximate query method according to claim 1, wherein: the objective function of the GPT-3 model fine-tuning process is constructed from the maximum likelihood function:
$$L(\theta) = \sum_{i} \log P(y_i \mid X, y_{<i}; \theta)$$
where $X$ represents the question in a question-answer pair, $y_{<i}$ represents the first $i$ words of the query result, and $\theta$ represents the parameters of the GPT-3 model.
4. The medical data approximate query method according to claim 1, wherein: in step S3, the GPT-3 model accepts the matrix $X$ representing the natural language query and outputs a word $y_0$, after which it takes $[X, y_0]$ as input and outputs the second word, and so on until the end identifier is output.
CN202310255574.2A 2023-03-16 2023-03-16 Medical data approximate query method Active CN115964471B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310255574.2A CN115964471B (en) 2023-03-16 2023-03-16 Medical data approximate query method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310255574.2A CN115964471B (en) 2023-03-16 2023-03-16 Medical data approximate query method

Publications (2)

Publication Number Publication Date
CN115964471A CN115964471A (en) 2023-04-14
CN115964471B true CN115964471B (en) 2023-06-02

Family

ID=85888171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310255574.2A Active CN115964471B (en) 2023-03-16 2023-03-16 Medical data approximate query method

Country Status (1)

Country Link
CN (1) CN115964471B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017010652A1 (en) * 2015-07-15 2017-01-19 포항공과대학교 산학협력단 Automatic question and answer method and device therefor
CN109766355A (en) * 2018-12-28 2019-05-17 上海汇付数据服务有限公司 A kind of data query method and system for supporting natural language
CN114897163A (en) * 2022-05-23 2022-08-12 阿里巴巴(中国)有限公司 Pre-training model data processing method, electronic device and computer storage medium

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7937402B2 (en) * 2006-07-10 2011-05-03 Nec (China) Co., Ltd. Natural language based location query system, keyword based location query system and a natural language and keyword based location query system
US8775154B2 (en) * 2008-09-18 2014-07-08 Xerox Corporation Query translation through dictionary adaptation
US9280535B2 (en) * 2011-03-31 2016-03-08 Infosys Limited Natural language querying with cascaded conditional random fields
CN103218436B (en) * 2013-04-17 2016-05-18 中国科学院自动化研究所 A kind of Similar Problems search method and device that merges class of subscriber label
US10489463B2 (en) * 2015-02-12 2019-11-26 Microsoft Technology Licensing, Llc Finding documents describing solutions to computing issues
US10127274B2 (en) * 2016-02-08 2018-11-13 Taiger Spain Sl System and method for querying questions and answers
CN108932349B (en) * 2018-08-17 2019-03-26 齐鲁工业大学 Medical automatic question-answering method and device, storage medium, electronic equipment
WO2021195146A1 (en) * 2020-03-23 2021-09-30 Sorcero, Inc. Ontology integration for document summarization
CN111737399A (en) * 2020-05-28 2020-10-02 北京百度网讯科技有限公司 Method and device for expanding question and answer set, electronic equipment and readable storage medium
CN112035620B (en) * 2020-08-31 2023-02-14 康键信息技术(深圳)有限公司 Question-answer management method, device, equipment and storage medium of medical query system
CN111813802B (en) * 2020-09-11 2021-06-29 杭州量之智能科技有限公司 Method for generating structured query statement based on natural language
CN112905868A (en) * 2021-03-22 2021-06-04 京东方科技集团股份有限公司 Event extraction method, device, equipment and storage medium
CN113297364B (en) * 2021-06-07 2023-06-09 吉林大学 Natural language understanding method and device in dialogue-oriented system
US20220414737A1 (en) * 2021-06-28 2022-12-29 Microsoft Technology Licensing, Llc Query-based product representations
CN113553412B (en) * 2021-06-30 2023-07-25 北京百度网讯科技有限公司 Question-answering processing method, question-answering processing device, electronic equipment and storage medium
CN115587583A (en) * 2022-11-07 2023-01-10 维沃移动通信有限公司 Noise detection method and device and electronic equipment

Also Published As

Publication number Publication date
CN115964471A (en) 2023-04-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Liu Ruihua

Inventor after: Zhang Jintao

Inventor after: Li Rui

Inventor after: Hu Qitong

Inventor after: Zheng Mingyang

Inventor before: Liu Ruihua

Inventor before: Zhang Jintao

Inventor before: Li Rui

Inventor before: Hu Qitong

Inventor before: Zheng Mingyang

Inventor before: Tang Xuewen