CN117371401A - Data standardization processing method based on large language model - Google Patents

Data standardization processing method based on large language model

Info

Publication number
CN117371401A
Authority
CN
China
Prior art keywords
data
language model
processed
large language
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311464984.4A
Other languages
Chinese (zh)
Inventor
康西龙
朱谦
谢文飞
孟得力
覃玲玲
陈华村
冯俊
夏婷
秦薇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan Baiyao Group Pharmaceutical E Commerce Co ltd
Yunnan Baiyao Group Co Ltd
Original Assignee
Yunnan Baiyao Group Pharmaceutical E Commerce Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan Baiyao Group Pharmaceutical E Commerce Co ltd
Priority to CN202311464984.4A
Publication of CN117371401A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/151Transformation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2237Vectors, bitmaps or matrices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/177Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0475Generative networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning

Abstract

The application relates to a data standardization processing method and device based on a large language model, and to an electronic device. The method acquires the data of a data table to be processed; the column names in the data table to be processed and the column data corresponding to those column names are processed with a pre-trained large language model and a constructed prompt to obtain column name standardized data, and the row content of the data table to be processed is processed with a pre-trained contrast learning model to obtain row content standardized data. Through the large language model and the contrast learning model, the standardization accuracy is improved and the efficiency of manual standardization is greatly increased; through management of the whole data extraction and standardization result auditing flow, the purchase, sale and inventory table data of different drug dealers are processed systematically, procedurally and automatically, which greatly improves data collection and processing efficiency.

Description

Data standardization processing method based on large language model
Technical Field
The present disclosure relates to the field of data processing standardization, and in particular, to a data standardization processing method and apparatus based on a large language model, and an electronic device.
Background
In the pharmaceutical distribution industry, pharmaceutical manufacturers often store data such as shipment information, sales data, and inventory conditions obtained from suppliers in the form of Excel tables. However, because different suppliers may use different column names and formats, drug specifications vary widely across vendors, and roughly 6000 documents are collected per month, manually extracting and standardizing such data is cumbersome and time consuming.
Disclosure of Invention
In view of this, the present application proposes a data standardization processing method based on a large language model, which addresses the problems described in the background above by extracting the data, standardizing it, and auditing the standardized results.
In one aspect of the present application, a method for data normalization processing based on a large language model is provided, including the following steps:
acquiring data of a data table to be processed;
carrying out semantic query on the data of the data table to be processed through a database to obtain a similar sample;
constructing a prompt by taking the similar sample as a return sample;
and inputting the prompt into a pre-trained large language model to obtain column name standardized data corresponding to the data table to be processed.
As an optional implementation manner of the present application, optionally, the semantic query is performed on the data of the data table to be processed through a database to obtain a similar sample, which includes the following steps:
selecting column names in the data table to be processed and column data corresponding to the column names, and inputting the column names and the column data into the large language model;
querying, in a database, the column name standardized alignment sample most similar to the input by using the large language model;
outputting the column name standardized alignment sample most similar to the input as the similar sample.
As an alternative embodiment of the present application, optionally, constructing a prompt using the similar sample as a return sample includes:
from the returned samples, a sample format for input of the large language model is defined.
As an optional implementation manner of the present application, optionally, inputting the prompt into a pre-trained large language model to obtain column name standardized data corresponding to the data table to be processed includes:
acquiring the prompt;
the large language model extracts the information in the sample format from the prompt;
and the large language model analyzes and processes the information in the sample format according to the historical data to obtain column name standardized data corresponding to the data table to be processed.
As an optional embodiment of the present application, optionally, further comprising:
processing the row content of the data table to be processed by adopting a pre-trained contrast learning model to obtain row content standardized data;
and the row content standardized data is manually verified and stored in a corresponding database as training data for the offline training stage of the contrast learning model.
As an optional implementation manner of the present application, optionally, processing the row content of the data table to be processed by using a pre-trained contrast learning model to obtain row content standardized data includes:
and the comparison learning model compares and learns samples similar to the row content of the data table to be processed according to the row content of the data table to be processed, pushes away dissimilar samples, and outputs standardized data of the row content.
As an optional embodiment of the present application, optionally, further comprising:
at least one of manual extraction or model automatic extraction is used for extracting data from a data table to be processed, and at least one of the large language model or the contrast learning model is used for obtaining standardized data corresponding to the extracted data.
As an optional implementation manner of the present application, optionally, extracting data from the data table to be processed by at least one of manual extraction or automatic model extraction, and applying at least one of the large language model or the contrast learning model to obtain the standardized data corresponding to the extracted data, further includes:
manually auditing the standardized data to obtain a verification result;
when the verification result is wrong, the standardized result is manually corrected and the corrected result is indexed and stored into the databases referenced by the large language model and the contrast learning model.
In a second aspect of the present application, a data standardization processing device based on a large language model is provided, including the following modules:
the data acquisition module is used for acquiring the data of the data table to be processed;
the semantic query module is used for carrying out semantic query on the data of the data table to be processed through a database to obtain a similar sample;
a prompt constructing module, configured to construct a prompt by using the similar sample as a return sample, and define a sample format used for input of the large language model according to the return sample;
the column name alignment standard name module is used for acquiring the prompt, extracting the information in the sample format from the prompt with the large language model, and analyzing and processing the information in the sample format according to the historical data with the large language model to obtain column name standardized data corresponding to the data table to be processed;
the row content standardization module is used for processing the row content of the data table to be processed by adopting a pre-trained comparison learning model to obtain row content standardization data;
and the data extraction and standardization result audit management module is used for extracting data from the data table to be processed by at least one of manual extraction or automatic model extraction, and applying at least one of the large language model or the contrast learning model to obtain the standardized data corresponding to the extracted data.
In a third aspect of the present application, an electronic device is provided, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement one of the above-described large language model-based data normalization processing methods when executing the executable instructions.
The beneficial effects of this application:
according to the method, through the large language model and the contrast learning model, corresponding standardized processing is carried out on the data of the data table to be processed, corresponding standardized data are obtained, and through data extraction standardized result auditing, standardized result auditing flow visualization is achieved, compared with manual complicated time-consuming data extraction work, the working efficiency is greatly improved, and meanwhile, a new solution idea is provided for similar work in other industries.
Other features and aspects of the present application will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present application and together with the description, serve to explain the principles of the present application.
FIG. 1 shows a flow chart of the column name alignment with standard names of the data standardization processing method based on a large language model according to the present invention;
FIG. 2 shows a flow chart of the row content standardization of the data standardization processing method based on a large language model according to the present invention;
FIG. 3 shows a detailed block diagram of the column name alignment with standard names of the data standardization processing method based on a large language model according to the present invention;
FIG. 4 shows a detailed block diagram of the row content standardization of the data standardization processing method based on a large language model according to the present invention;
FIG. 5 shows a block diagram of the data standardization processing device based on a large language model according to the present invention.
Detailed Description
Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
It should be understood, however, that the terms "center," "longitudinal," "transverse," "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counter-clockwise," "axial," "radial," "circumferential," and the like indicate or are based on the orientation or positional relationship shown in the drawings, and are merely for convenience of description or to simplify the description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be configured and operated in a particular orientation, and therefore should not be construed as limiting the present application.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits have not been described in detail as not to unnecessarily obscure the present application.
Example 1
In the medicine distribution industry, medicine data obtained from suppliers is stored in the form of tables. Since different suppliers adopt different column names and formats, manually extracting all column names and unifying them is complicated and time-consuming. A data standardization processing method based on a large language model is therefore provided, which converts all column names into unified, enterprise-defined standard names so that the medicine data is convenient to manage.
Fig. 1 shows a column name alignment standard name flowchart of a data normalization processing method based on a large language model of the present invention. As shown in fig. 1, the column name alignment standard name procedure includes the steps of:
s100, acquiring data of a data table to be processed;
the data table to be processed in this embodiment is the original Excel table data provided by the dealer.
S200, carrying out semantic query on the data of the data table to be processed through a database to obtain a similar sample;
The database is an embedding database; in this column name alignment application project, a Milvus vector database is adopted. Its purpose is to store, index and manage the embedding vectors of the original Excel column header names and row values produced by the large language model, and to store the corresponding standard column names in association with them, so as to provide reference samples for aligning new data to be processed.
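For illustration only, the following Python sketch shows one way such an embedding database could be set up with pymilvus; the collection name, field layout, vector dimension and the embed() helper are assumptions of this sketch and are not limited herein.

```python
# Illustrative sketch only: storing embeddings of original column headers together with
# their verified standard names in a Milvus collection (field names are assumptions).
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=768),        # embedded "column name: row value"
    FieldSchema("raw_header", DataType.VARCHAR, max_length=512),     # original Excel column header
    FieldSchema("standard_name", DataType.VARCHAR, max_length=512),  # associated standard column name
]
collection = Collection("column_name_alignment", CollectionSchema(fields))
collection.create_index("embedding", {"index_type": "IVF_FLAT",
                                      "metric_type": "IP",
                                      "params": {"nlist": 128}})

def index_sample(raw_header, row_value, standard_name, embed):
    """embed() is an assumed helper returning a 768-dim vector for a piece of text."""
    vec = embed(f"{raw_header}: {row_value}")
    # Column-wise insert; the auto_id primary key is omitted.
    collection.insert([[vec], [raw_header], [standard_name]])
```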
S300, constructing a prompt by taking the similar sample as a return sample;
Constructing a prompt limits the generation space of the model and makes it focus on the specified topic and task; adding the prompt to the input forces the model to attend to specific information, which improves its performance on the specific task. In this embodiment, a prompt is constructed so that the large language model completes the task of aligning column names with standard names and outputs the correspondence between column names and standard column names.
S400, inputting the prompt into a pre-trained large language model to obtain column name standardized data corresponding to the data table to be processed;
In this embodiment, by querying the embedding database and constructing the prompt, the large language model can output the specified standard names according to the prompt, thereby achieving the goal of aligning column names with standard names.
The large language model mainly refers to generative language models with more than one billion parameters, such as ChatGPT, ChatGLM, Tongyi Qianwen and Wenxin Yiyan, and is not limited herein.
As an optional implementation manner of the present application, optionally, in step S200, the semantic query is performed on the data of the to-be-processed data table through a database to obtain a similar sample, which includes the following steps:
S201, selecting the column names in the data table to be processed and the column data corresponding to the column names, and inputting the column names and the column data into the large language model;
S202, querying, in the database, the column name standardized alignment sample most similar to the input by using the large language model;
S203, outputting the column name standardized alignment sample most similar to the input as the similar sample.
The input of the large language model consists of the Excel column names and one piece of non-column-name table data corresponding to those column names. The database contains standard name alignment sample records for column names similar to the input; the query produces several records similar to the input, their similarities are compared, and the record most similar to the input is finally selected as the output result. It should be noted that the similarity measurement is not limited herein.
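Continuing the illustrative sketch above, the semantic query step could be approximated as follows; the nprobe value and the score-based selection are assumptions, and the similarity measurement is not limited herein.

```python
# Illustrative sketch only: retrieve the column-name alignment sample most similar to the input.
def query_similar_sample(column_name, row_value, embed, top_k=3):
    collection.load()                                   # collection from the sketch above
    query_vec = embed(f"{column_name}: {row_value}")    # same embedding used at indexing time
    hits = collection.search(
        data=[query_vec],
        anns_field="embedding",
        param={"metric_type": "IP", "params": {"nprobe": 16}},
        limit=top_k,
        output_fields=["raw_header", "standard_name"],
    )[0]
    best = max(hits, key=lambda h: h.score)             # compare similarities, keep the closest record
    return {"raw_header": best.entity.get("raw_header"),
            "standard_name": best.entity.get("standard_name"),
            "score": best.score}
```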
As an optional embodiment of the present application, optionally, in step S300, constructing a prompt using the similar sample as a return sample includes:
from the returned samples, a sample format for input of the large language model is defined.
Constructing the prompt means building the large model input for the column name standardization task. The sample format is "task role definition + standard name definition + similar sample + output format designation + content to be extracted", and the similar sample is the return sample obtained by querying the embedding database with the input content. It should be noted that the user may modify the order of these parts according to their requirements, which is not limited herein.
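As a minimal illustration of this sample format, the prompt could be assembled as in the following sketch; the exact wording and the STANDARD_NAMES list are assumptions and may be adjusted by the user.

```python
# Illustrative sketch only: prompt assembly in the order
# "task role definition + standard name definition + similar sample + output format + content to be extracted".
STANDARD_NAMES = ["customer name", "sales date", "drug variety", "drug specification",
                  "sales quantity", "drug unit", "lot number", "unit price", "amount", "manufacturer"]

def build_prompt(similar_sample_input, similar_sample_output, content_to_extract):
    return (
        "This is a column name alignment standard name task.\n"                      # task role definition
        f"The standard names to be aligned include: {', '.join(STANDARD_NAMES)}.\n"  # standard name definition
        f"Sample input: {similar_sample_input}\n"                                    # similar sample returned
        f"Sample alignment output: {similar_sample_output}\n"                        #   by the embedding database
        "Please align the following sentence with the sample "
        "and output the result in json format:\n"                                    # output format designation
        f"{content_to_extract}"                                                      # content to be extracted
    )
```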
As an optional embodiment of the present application, optionally, in step S400, inputting the prompt into a pre-trained large language model to obtain column name standardized data corresponding to the data table to be processed includes:
acquiring the prompt;
the large language model extracts the information in the sample format from the prompt;
and the large language model analyzes and processes the information in the sample format according to the historical data to obtain column name standardized data corresponding to the data table to be processed.
The role of the prompt is to tell the model which task to perform. Because the same input could correspond to many possible tasks, the prompt lets the large language model know which task is to be done this time and what result is to be output, and it specifies the output format. The construction of the prompt is exemplified as follows:
this is a column name alignment standard name task, and the standard names to be aligned include customer name, sales date, drug variety, drug specification, sales quantity, drug unit, lot number, unit price, amount, manufacturer.
The sample input is: "date: 2022-05-01, unit name: Chengshan Kao County Hospital, Chengde City, Hebei Province, lot number: ZLC2212, trade name: Yunnan Baiyao Aerosol, specification: 85g (60g of safety liquid per bottle), quantity: 20, unit: box".
The sample alignment output is: {"date": "sales date", "unit name": "customer name", "lot number": "lot number", "trade name": "medicine variety", "specification": "specification", "quantity": "sales quantity", "unit": "medicine unit"}
Please align the following sentence with the sample and output the result in json format: "business date: 2023-02-01 10:33:56, customer: Zigong Hospital of Traditional Chinese Medicine, lot number: ZLC2209, common name: Yunnan Baiyao Aerosol, specification: 85g (60g of safety liquid per bottle), basic unit quantity: 20, basic unit: box".
That is, the constructed prompt tells the model that the purpose of the input is to complete the task of aligning column names with standard names, where the standard names to be aligned include customer name, sales date, drug variety, drug specification and the like; after a similar example is provided, the output format is designated as json.
The output obtained is:
{"business date": "sales date", "customer": "customer name", "lot number": "lot number", "generic name": "drug variety", "specification": "specification", "basic unit quantity": "sales quantity", "basic unit": "pharmaceutical unit"}
the construction of the template is determined according to the needs of the user, and is not limited herein.
After the promtt is input into the large language model, the large language model outputs a task result according to the self reasoning capacity by the input promtt.
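A minimal sketch of this step is given below; call_llm is an assumed wrapper around whichever deployed large language model is used, and the JSON fallback handling is illustrative only.

```python
import json

def standardize_column_names(prompt, call_llm):
    """call_llm is an assumed callable wrapping the deployed large language model
    (e.g. ChatGPT or ChatGLM); it takes the prompt text and returns the reply text."""
    reply = call_llm(prompt)
    try:
        return json.loads(reply)                  # mapping: original column name -> standard column name
    except json.JSONDecodeError:
        # Fallback: if the model wraps the JSON in extra text, keep only the outermost braces.
        start, end = reply.find("{"), reply.rfind("}") + 1
        return json.loads(reply[start:end])
```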
It should be noted that the table data adopted in this embodiment is an Excel table, but the data format is not limited to Excel; csv, mysql and key:value type data can also be used according to the user's requirements.
When the model scale is large enough, a large language model exhibits reasoning capability and already performs very well on simple reasoning. The core idea of prompt-based reasoning is that a suitable prompt or prompt sample can better elicit the reasoning capability of the large language model.
In this embodiment, the standardized data obtained through the large language model is output and, at the same time, backed up and saved for manual verification. When the manual verification passes, the standardized data is treated as historically processed data and indexed into the database as reference data for the semantic query of subsequent inputs; reusing the verified standard data in this step makes the final output standard name mapping more accurate.
In this embodiment, the Excel column names are aligned with standard names by constructing a prompt and applying a large generative language model, which offers better extensibility than a BERT-type discriminative model.
Example 2
Fig. 2 shows a line content normalization flowchart of a data normalization processing method based on a large language model of the present invention. As shown in fig. 2, after acquiring the data of the data table to be processed, the line content normalization includes the following steps:
100. processing the row content of the data table to be processed by adopting a pre-trained contrast learning model to obtain row content standardized data;
In this embodiment, the contrast learning model is a SimCSE model, which introduces the contrastive learning idea into a sentence similarity judgment model.
200. And the row content standardized data is manually verified and stored in a corresponding database as training data for the offline training stage of the contrast learning model.
After the row content standardized data is manually verified to be correct, the mapping from the original specification to the historically manually labeled standard is obtained and used as the training data for the offline training stage of the contrast learning model.
The training process ensures that samples with different specifications of the same variety are all in the same batch, so that the model learns the fine differences between specifications; for example, "Yunnan Baiyao capsule 0.25g x 16 capsules" and "Yunnan Baiyao capsule 0.25g x 16 capsules/box" are placed in the same batch.
As an alternative embodiment of the present application, optionally, in step 100, it includes:
and the comparison learning model compares and learns samples similar to the row content of the data table to be processed according to the row content of the data table to be processed, pushes away dissimilar samples, and outputs standardized data of the row content.
The idea of contrast learning is to pull close samples and push open dissimilar samples, and the contrast loss is based on the cross entropy loss of negative samples in the batch.
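For illustration, the following PyTorch sketch shows an in-batch contrastive (SimCSE-style) loss of this kind; the temperature value and tensor shapes are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(raw_emb, std_emb, temperature=0.05):
    """raw_emb[i] and std_emb[i] are the embeddings of a raw specification and its labeled
    standard specification (a positive pair); every other row in the batch is a negative."""
    raw = F.normalize(raw_emb, dim=-1)
    std = F.normalize(std_emb, dim=-1)
    sim = raw @ std.t() / temperature                       # (batch, batch) cosine similarity matrix
    labels = torch.arange(sim.size(0), device=sim.device)   # positives sit on the diagonal
    return F.cross_entropy(sim, labels)                     # cross entropy over in-batch negatives
```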
For example, the input data is the variety specification and other information to be mapped by the dealer, such as "Yunnan Baiyao capsule 0.25g x 16 capsules" and "box". The output row content standardized data is the standard variety specification of the pharmaceutical factory, such as "Yunnan Baiyao capsule 0.25g x 16 capsules x 1 plate/box". It should be noted that the user may define the variety specification according to the requirements of the data record, and it is not limited herein.
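A minimal sketch of the corresponding inference step follows; encode is an assumed sentence encoder (for example, the trained SimCSE model), and the candidate list of standard specifications is supplied by the user.

```python
import torch.nn.functional as F

def standardize_row_content(raw_spec, standard_specs, encode):
    """encode is an assumed sentence encoder returning one embedding per input string;
    returns the standard specification closest to the dealer's raw specification."""
    query = F.normalize(encode([raw_spec]), dim=-1)            # shape (1, dim)
    candidates = F.normalize(encode(standard_specs), dim=-1)   # shape (n, dim)
    scores = (query @ candidates.t()).squeeze(0)               # cosine similarities to each candidate
    return standard_specs[int(scores.argmax())]
```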
In Embodiment 2, a SimCSE model based on the contrastive learning idea is applied to the standardization of drug variety specifications, which improves the accuracy of row content standardization and greatly improves efficiency compared with manual standardization.
Example 3
Example 1 and example 2 describe a column name alignment standard name flow and a row content standardization flow, respectively, and based on the above flows, data extraction standardization result audit management is performed to ensure accuracy and consistency. The data extraction standardized result auditing method comprises the following steps:
at least one of manual extraction or model automatic extraction is used for extracting data from a data table to be processed, and at least one of the large language model or the contrast learning model is used for obtaining standardized data corresponding to the extracted data.
Automatic model extraction is performed on the premise that the table file to be standardized has been imported to a designated location in advance. The model automatically extracts data from the table file at the designated location and applies the corresponding model to standardize it: the large language model module is used to align column names with standard names, and the contrast learning model is used to standardize the variety specifications in the row content.
As an optional implementation manner of the present application, optionally, extracting data from the data table to be processed by at least one of manual extraction or automatic model extraction, and applying at least one of the large language model or the contrast learning model to obtain the standardized data corresponding to the extracted data, further includes:
manually auditing the standardized data to obtain a verification result;
when the verification result is wrong, the standardized result is manually corrected and the corrected result is indexed and stored into the databases referenced by the large language model and the contrast learning model.
After the manually corrected standardized column name data or row content data is indexed into the database, part of the training data used when the large language model or the contrast learning model performs the next data standardization task is the historically manually labeled standardized data, so the resulting standardized data table is more accurate; the correct results are in turn indexed into the database again, completing the closed loop of correct data utilization.
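As an illustrative sketch of this closed loop, the audited result could be written back to the embedding database as follows; the collection reused here is the one assumed in the earlier sketch, and the function name is hypothetical.

```python
def write_back_audited_result(raw_value, model_output, corrected_output, embed):
    """Hypothetical audit write-back: if the reviewer corrected the model's result, the corrected
    mapping is indexed so that the next semantic query retrieves the verified sample."""
    final_output = corrected_output if corrected_output is not None else model_output
    vec = embed(raw_value)
    # Reuses the Milvus collection assumed in the earlier column-name sketch.
    collection.insert([[vec], [raw_value], [final_output]])
```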
Example 4
Fig. 3 is a detailed block diagram of the column name alignment standard name, and the detailed implementation of the column name alignment standard name is described in embodiment 1, and is not repeated here.
Each legend definition is described as follows:
The input is an original Excel table data sample provided by a dealer;
The project adopts a Milvus vector database, whose purpose is to store, index and manage the embedding vectors of the original Excel column header names and row values produced by the large language model, and to store the corresponding standard column names in association with them, so as to provide reference samples for aligning new data to be processed;
The large language model mainly refers to generative language models with more than one billion parameters, such as ChatGPT, ChatGLM, Tongyi Qianwen and Wenxin Yiyan;
Historically processed data refers to the data extracted by early-stage manual extraction and later-stage model extraction that has passed the manual audit; its content is the standardized data corresponding to the original Excel data;
The output is the standard name mapping, where the standard names defined in this embodiment include sales date, customer name, commodity specification, unit, quantity, lot number, and the like.
(1) Semantic query
According to the input composed of the column names of the original Excel table and one row of values corresponding to those column names, the large model embeds the input and queries the embedding database to obtain a similar sample.
(2) Sample return
That is, the column name standardized alignment sample most similar to the input, returned by querying the embedding database and used for building the prompt.
(3) Prompt construction, namely the large model input constructed for the task of mapping column names to standard names; the sample format is "task role definition + standard name definition + similar sample + output format designation + content to be extracted", and the similar sample is the result returned by querying the embedding database with the input content.
(4) Large language model output, namely the constructed prompt is input into the large model, which outputs the correspondence between the original column names and the standardized column names according to its own reasoning capability.
Example 5
Fig. 4 is a detailed block diagram of the line content standardization, and the detailed implementation principle of the line content standardization is described in embodiment 2, which is not repeated here.
Each legend definition is described as follows:
(1) Input: the variety specification and other information to be mapped by the dealer, such as "Yunnan Baiyao capsule 0.25g x 16 capsules", "box", and the like.
(2) SimCSE model: a sentence similarity judgment model into which the contrastive learning idea is introduced; the idea is to pull similar samples closer and push dissimilar samples apart, and the contrastive loss is based on the cross-entropy loss over the in-batch negative samples.
(3) Output: the standard variety specification of the pharmaceutical factory, such as "Yunnan Baiyao capsule 0.25g x 16 capsules x 1 plate/box".
(4) In the reasoning stage, the original variety specification is input into the SimCSE model, and the standard specification is output.
(5) Manual verification: after the model outputs the variety specification standardization result, it is checked manually to ensure that it is correct, and it is finally stored in the database as historical training data.
(6) In the training phase, the training data is the mapping from the original specification to the historically manually labeled standard. The training process ensures that samples with different specifications of the same variety are all in the same batch, so that the model learns the fine differences between specifications; for example, "Yunnan Baiyao capsule 0.25g x 16 capsules" and "Yunnan Baiyao capsule 0.25g x 16 capsules/box" are in the same batch.
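For illustration, batches that keep different specifications of the same variety together could be constructed as in the following sketch; the sample triple layout and batch size are assumptions of the sketch.

```python
import random
from collections import defaultdict

def build_training_batches(samples, batch_size=32):
    """samples: (raw_spec, standard_spec, variety) triples from manually verified history.
    Batches are filled variety by variety, so different specifications of the same variety
    land in the same batch and act as hard in-batch negatives for each other."""
    by_variety = defaultdict(list)
    for raw, std, variety in samples:
        by_variety[variety].append((raw, std))
    batches, current = [], []
    for variety in sorted(by_variety, key=lambda v: -len(by_variety[v])):
        for pair in by_variety[variety]:
            current.append(pair)
            if len(current) == batch_size:
                batches.append(current)
                current = []
    if current:
        batches.append(current)
    random.shuffle(batches)   # shuffle batch order between epochs, not batch contents
    return batches
```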
Example 6
Based on the same principle as the foregoing method, a data standardization processing device based on a large language model is also provided. Referring to fig. 5, a data standardization processing device 10 based on a large language model according to an embodiment of the disclosure includes:
an acquisition data table module 11, configured to acquire data of a data table to be processed;
the semantic query module 12 performs semantic query on the data of the data table to be processed through a database to obtain a similar sample;
a prompt constructing module 13, configured to construct a prompt by using the similar sample as a return sample, and define a sample format used for input of the large language model according to the return sample;
the column name alignment standard name module 14, configured to acquire the prompt, extract the information in the sample format from the prompt with the large language model, and analyze and process the information in the sample format according to historical data with the large language model to obtain column name standardized data corresponding to the data table to be processed;
the row content standardization module 15 is used for processing the row content of the data table to be processed by adopting a pre-trained comparison learning model to obtain row content standardization data;
and the data extraction and standardization result audit management module 16, configured to extract data from the data table to be processed by at least one of manual extraction or automatic model extraction, and apply at least one of the large language model or the contrast learning model to obtain the standardized data corresponding to the extracted data.
It should be apparent to those skilled in the art that all or part of the above-described method embodiments may be implemented by a computer program instructing relevant hardware; the program may be stored in a computer readable storage medium and, when executed, may include the steps of the control method embodiments described above. The modules or steps of the invention described above may be implemented in a general-purpose computing device; they may be centralized in a single computing device or distributed across a network of computing devices, or they may be implemented in program code executable by a computing device, so that they may be stored in a memory device and executed by a computing device, or separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD); the storage medium may also comprise a combination of the above memories.
Example 7
Still further, in a third aspect of the present application, an electronic device is provided, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement a data normalization processing method based on a large language model according to the foregoing embodiment when executing the executable instructions.
An electronic device of an embodiment of the present disclosure includes a processor and a memory for storing processor-executable instructions. Wherein the processor is configured to implement any of the foregoing methods of data normalization based on a large language model when executing the executable instructions.
It should be noted that the number of the processors may be one or more. Meanwhile, in the electronic device of the embodiment of the disclosure, an input device and an output device may be further included. The processor, the memory, the input device, and the output device may be connected by a bus, or may be connected by other means, which is not specifically limited herein.
The memory is used as a computer readable storage medium for a data standardization processing method based on a large language model, and can be used for storing software programs, computer executable programs and various modules, such as: a program or module corresponding to a data standardization processing method based on a large language model in an embodiment of the disclosure. The processor executes various functional applications and data processing of the electronic device by running software programs or modules stored in the memory.
The embodiments of the present application have been described above, the foregoing description is exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. The data standardization processing method based on the large language model is characterized by comprising the following steps:
acquiring data of a data table to be processed;
carrying out semantic query on the data of the data table to be processed through a database to obtain a similar sample;
constructing a prompt by taking the similar sample as a return sample;
and inputting the prompt into a pre-trained large language model to obtain column name standardized data corresponding to the data table to be processed.
2. The method for data standardization processing based on large language model as claimed in claim 1, wherein the semantic query is performed on the data of the data table to be processed through a database to obtain a similar sample, comprising the following steps:
selecting column names in the data table to be processed and column data corresponding to the column names, and inputting the column names and the column data into the large language model;
querying, in a database, the column name standardized alignment sample most similar to the input by using the large language model;
outputting the column name standardized alignment sample most similar to the input as the similar sample.
3. The data standardization processing method based on the large language model of claim 1, wherein constructing a prompt using the similar sample as a return sample comprises:
from the returned samples, a sample format for input of the large language model is defined.
4. The method for data standardization processing based on a large language model according to claim 1, wherein inputting the prompt into a pre-trained large language model to obtain column name standardized data corresponding to the data table to be processed comprises:
acquiring a prompt;
the large language model extracts the information in the sample format from the prompt;
and the large language model analyzes and processes the information in the sample format according to the historical data to obtain column name standardized data corresponding to the data table to be processed.
5. The method for data normalization based on a large language model as claimed in claim 1, further comprising:
processing the row content of the data table to be processed by adopting a pre-trained contrast learning model to obtain row content standardized data;
and the row content standardized data is manually verified and stored in a corresponding database as training data for the offline training stage of the contrast learning model.
6. The method for data standardization processing based on a large language model according to claim 5, wherein processing the row contents of the data table to be processed by using a pre-trained contrast learning model to obtain row content standardization data comprises:
and the comparison learning model compares and learns samples similar to the row content of the data table to be processed according to the row content of the data table to be processed, pushes away dissimilar samples, and outputs standardized data of the row content.
7. The large language model based data normalization processing method according to claim 1 or 5, further comprising:
at least one of manual extraction or model automatic extraction is used for extracting data from a data table to be processed, and at least one of the large language model or the contrast learning model is used for obtaining standardized data corresponding to the extracted data.
8. The method for data standardization processing based on a large language model according to claim 7, wherein extracting data from the data table to be processed by at least one of manual extraction or automatic model extraction, and applying at least one of the large language model or the contrast learning model to obtain standardized data corresponding to the extracted data, further comprises:
manually auditing the standardized data to obtain a verification result;
when the verification result is wrong, the standardized result is manually corrected and the corrected result is indexed and stored into the databases referenced by the large language model and the contrast learning model.
9. A data standardization processing device based on a large language model according to any one of claims 1 to 8, comprising the following modules:
the data acquisition module is used for acquiring the data of the data table to be processed;
the semantic query module is used for carrying out semantic query on the data of the data table to be processed through a database to obtain a similar sample;
a prompt constructing module, configured to construct a prompt by using the similar sample as a return sample, and define a sample format used for input of the large language model according to the return sample;
the column name alignment standard name module is used for acquiring the prompt, extracting the information in the sample format from the prompt with the large language model, and analyzing and processing the information in the sample format according to the historical data with the large language model to obtain column name standardized data corresponding to the data table to be processed;
the row content standardization module is used for processing the row content of the data table to be processed by adopting a pre-trained comparison learning model to obtain row content standardization data;
and the data extraction and standardization result audit management module is used for extracting data from the data table to be processed by at least one of manual extraction or automatic model extraction, and applying at least one of the large language model or the contrast learning model to obtain the standardized data corresponding to the extracted data.
10. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement a large language model based data normalization processing method according to any one of claims 1 to 8 when executing the executable instructions.
CN202311464984.4A 2023-11-06 2023-11-06 Data standardization processing method based on large language model Pending CN117371401A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311464984.4A CN117371401A (en) 2023-11-06 2023-11-06 Data standardization processing method based on large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311464984.4A CN117371401A (en) 2023-11-06 2023-11-06 Data standardization processing method based on large language model

Publications (1)

Publication Number Publication Date
CN117371401A true CN117371401A (en) 2024-01-09

Family

ID=89389128

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311464984.4A Pending CN117371401A (en) 2023-11-06 2023-11-06 Data standardization processing method based on large language model

Country Status (1)

Country Link
CN (1) CN117371401A (en)

Similar Documents

Publication Publication Date Title
US11301484B2 (en) Systems and methods for type coercion
CN108985912B (en) Data reconciliation
CN111126026B (en) Method and tool for generating visual report form by analyzing SQL statement
AU2018372633A1 (en) Systems and methods for enhanced mapping and classification of data
US10699112B1 (en) Identification of key segments in document images
CN112527970B (en) Data dictionary standardization processing method, device, equipment and storage medium
CN110795524A (en) Main data mapping processing method and device, computer equipment and storage medium
CN113220728B (en) Data query method, device, equipment and storage medium
CN113627168A (en) Method, device, medium and equipment for checking component packaging conflict
CN117371401A (en) Data standardization processing method based on large language model
CN116562247A (en) Electronic form content generation method, electronic form content generation device and computer equipment
CN111061733A (en) Data processing method and device, electronic equipment and computer readable storage medium
CN113626571B (en) Method, device, computer equipment and storage medium for generating answer sentence
CN115796572A (en) Risk enterprise identification method, apparatus, device and medium
CN115310772A (en) Method for monitoring quality supervision result data of medical instruments, medical instrument transaction platform and system
TW526428B (en) Financial cost forecasting system and method
EP3401799A1 (en) Data storage method and apparatus
CN114943219A (en) Method, device and equipment for generating bill of material test data and storage medium
CN111506776B (en) Data labeling method and related device
EP3460677A1 (en) Assessment program, assessment device, and assessment method
CN115826928B (en) Program generating method, system, electronic device and computer readable storage medium
JP2020166443A (en) Data processing method recommendation system, data processing method recommendation method, and data processing method recommendation program
CN110673888B (en) Verification method and device for configuration file
CN113792048B (en) Form verification rule generation method and system for non-relational database
US11372939B2 (en) Systems and methods for clustered inventory management

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20240221

Address after: 650000 national high tech Industrial Development Zone, Kunming, Yunnan (No. 222, 2nd Ring West Road, Kunming)

Applicant after: Yunnan Baiyao Group pharmaceutical e-commerce Co.,Ltd.

Country or region after: China

Applicant after: Yunnan Baiyao Group Co.,Ltd.

Address before: 650000 national high tech Industrial Development Zone, Kunming, Yunnan (No. 222, 2nd Ring West Road, Kunming)

Applicant before: Yunnan Baiyao Group pharmaceutical e-commerce Co.,Ltd.

Country or region before: China