CN115203206A - Data content searching method and device, computer equipment and readable storage medium - Google Patents

Info

Publication number
CN115203206A
Authority
CN
China
Prior art keywords
data
preset
model
target
data content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210823548.0A
Other languages
Chinese (zh)
Inventor
李开金
谭振海
刘伏桃
李建民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rootcloud Technology Co Ltd
Original Assignee
Rootcloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rootcloud Technology Co Ltd
Priority to CN202210823548.0A
Publication of CN115203206A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24573 Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models

Abstract

The embodiment of the invention discloses a data content searching method, a data content searching device, computer equipment and a readable storage medium, wherein the method comprises the following steps: acquiring user query corpus data; analyzing the user query corpus data to obtain a data entity corresponding to the target data content requested to be queried and metadata of candidate tables; processing the user query corpus data and the metadata of the candidate tables according to a preset table matching model to obtain a target table; processing the data entity and the target table according to a preset SQL statement generation model to obtain an SQL query statement corresponding to the target data content; and querying the target data content in a preset system based on the SQL query statement. Through the processing of the preset table matching model and the preset SQL statement generation model, the target data content can be acquired more quickly and accurately.

Description

Data content searching method and device, computer equipment and readable storage medium
Technical Field
The present invention relates to the field of natural language processing, and in particular, to a data content searching method and apparatus, a computer device, and a readable storage medium.
Background
In the manufacturing industry, data searching is subject to heavy interference: there are many data tables, the fields involved are complex, tables in the same domain contain similar data, several tables hold the same values, and table names, column names and data names closely resemble one another.
Existing data searching schemes cannot find data content quickly and accurately.
Disclosure of Invention
In order to solve the above technical problem, embodiments of the present application provide a data content search method, apparatus, computer device, and readable storage medium, and the specific scheme is as follows:
in a first aspect, an embodiment of the present application provides a data content search method, where the method includes:
acquiring user query corpus data;
analyzing the user query corpus data to obtain a data entity corresponding to the target data content requested to be queried and metadata of a preset number of candidate tables;
processing the user query corpus data and metadata of a preset number of candidate tables according to a preset table matching model to obtain a target table;
processing the data entity and the target table according to a preset SQL statement generation model to obtain an SQL query statement corresponding to target data content;
and querying the target data content in a preset system based on the SQL query statement.
According to a specific implementation manner of the embodiment of the present application, the step of analyzing the user query corpus data to obtain a data entity corresponding to the target data content requested to be queried and metadata of a preset number of candidate tables includes:
processing the user query corpus data by using a named entity recognition model to obtain a corresponding data entity;
and performing an Elasticsearch search in a code table database and a table metadata database according to the data entity to obtain the metadata of the preset number of candidate tables.
According to a specific implementation manner of the embodiment of the present application, the named entity recognition model includes a preset BERT model and a conditional random field model, and the step of processing the user query corpus data based on the named entity recognition model to obtain a corresponding data entity includes:
processing the user query corpus data based on the preset BERT model to obtain an initial coding sequence with semantic tags;
decoding the initial coding sequence based on the conditional random field model to obtain a corresponding data entity.
According to a specific implementation manner of the embodiment of the present application, the coding layer of the preset BERT model adopts a RoBERTa pre-training model.
According to a specific implementation manner of the embodiment of the present application, the step of processing the user query corpus data and the metadata of the preset number of candidate tables according to a preset table matching model to obtain the target table includes:
processing the user query corpus data based on a preset BERT model to obtain a first semantic vector;
processing the metadata of the preset number of candidate tables based on a preset BERT model to obtain a preset number of second semantic vectors;
processing a first semantic vector corresponding to the user query corpus data and a second semantic vector corresponding to a preset number of candidate tables based on a deep matching algorithm to calculate matching scores of the user query corpus data and the candidate tables;
and selecting the candidate table with the highest matching score as the target table.
According to a specific implementation manner of the embodiment of the application, the step of processing the data entity and the target table according to a preset SQL statement generation model to obtain an SQL query statement corresponding to target data content includes:
splicing the user query corpus data with the column names of the target table according to the column order of the target table to obtain a preset number of items of data to be encoded;
encoding each item of data to be encoded based on a preset BERT model to obtain a preset number of encoded vectors;
processing all the encoded vectors based on a preset classification model to obtain first-part combined statements and second-part combined statements, wherein the first-part combined statements are the combined statements corresponding to the target data content, and the second-part combined statements are the combined statements to be deleted;
and merging all the first part combined statements to obtain the SQL query statement.
In a second aspect, an embodiment of the present application provides a data content search apparatus, including:
the acquisition module is used for acquiring user query corpus data;
the analysis module is used for analyzing the user query corpus data to obtain a data entity corresponding to the target data content requested to be queried and metadata of a preset number of candidate tables;
the first processing module is used for processing the user query corpus data and the metadata of the preset number of candidate tables according to a preset table matching model so as to obtain a target table;
the second processing module is used for processing the data entity and the target table according to a preset SQL statement generating model so as to obtain an SQL query statement corresponding to the target data content;
and the query module is used for querying the target data content in a preset system based on the SQL query statement.
According to a specific implementation manner of the embodiment of the present application, the analysis module is specifically configured to process the user query corpus data by using a named entity recognition model to obtain a corresponding data entity, and to perform an Elasticsearch search in a code table database and a table metadata database according to the data entity to obtain the metadata of the preset number of candidate tables.
In a third aspect, an embodiment of the present application provides a computer device, where the computer device includes a processor and a memory, where the memory stores a computer program, and the computer program, when running on the processor, executes the data content searching method according to the first aspect or any implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and when the computer program runs on a processor, the computer program performs the data content searching method according to the first aspect or any implementation manner of the first aspect.
The embodiment of the application provides a data content searching method, a data content searching device, computer equipment and a readable storage medium, wherein the method comprises the following steps: acquiring user query corpus data; analyzing the user query corpus data to obtain a data entity corresponding to the target data content requested to be queried and metadata of candidate tables; processing the user query corpus data and the metadata of the candidate tables according to a preset table matching model to obtain a target table; processing the data entity and the target table according to a preset SQL statement generation model to obtain an SQL query statement corresponding to the target data content; and querying the target data content in a preset system based on the SQL query statement. Through the processing of the preset table matching model and the preset SQL statement generation model, the target data content can be acquired more quickly and accurately.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the embodiments will be briefly described below, and it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope of the present invention. Like components are numbered similarly in the various figures.
FIG. 1 shows a flowchart of a data content searching method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating an interaction of a named entity recognition model of a data content searching method according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating an interaction of a table matching model of a data content searching method according to an embodiment of the present application;
FIG. 4 shows a schematic diagram of the device modules of a data content searching device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Hereinafter, the terms "including", "having", and their derivatives, which may be used in various embodiments of the present invention, are intended only to indicate specific features, numbers, steps, operations, elements, components, or combinations of the foregoing, and should not be construed as excluding the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.
Furthermore, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present invention belong. The terms (such as terms defined in a commonly used dictionary) will be construed to have the same meaning as the contextual meaning in the related art and will not be construed to have an idealized or overly formal meaning unless expressly so defined in various embodiments of the present invention.
Referring to FIG. 1, a schematic flowchart of the data content searching method provided in an embodiment of the present application is shown. As shown in FIG. 1, the method includes:
Step S101, obtaining user query corpus data;
In a specific embodiment, the user query corpus data may be a query statement input by the user on a front-end interface.
The user query corpus data may also be query statements stored in a historical corpus database. The source of the user query corpus data may be changed to suit the actual application scenario and is not limited herein.
After a user inputs a query statement on the front-end interface, the query statement can be automatically stored in the historical corpus database so that a preset query system can use it for updating and learning.
Step S102, analyzing the user query corpus data to obtain data entities corresponding to the target data content requested to be queried and metadata of a preset number of candidate tables;
In a specific embodiment, after the query statement is obtained, the back-end interface parses the user's query statement, analyzes the semantics of each word in the statement, and classifies tags word by word; preliminary data entity content can then be obtained from the tag classification result.
Specifically, the data entity includes data information related to the target data content, such as a product type, a product model, a product region, and a search tag.
After the preliminary data entity content is obtained, an Elasticsearch (ES) search is performed in the code table library and the table metadata library to obtain the metadata of the candidate tables.
A candidate table is a data table holding target data content related to the data entity. The target table that best matches the target data content is determined from the multiple candidate tables by the subsequent matching model. The metadata of a candidate table includes the table name/table alias, code table data, and the like.
According to a specific implementation manner of the embodiment of the present application, the step of analyzing the user query corpus data to obtain a data entity corresponding to the target data content requested to be queried and metadata of a preset number of candidate tables includes:
processing the user query corpus data by using a named entity recognition model to obtain a corresponding data entity;
and performing an Elasticsearch search in a code table database and a table metadata database according to the data entity to obtain the metadata of the preset number of candidate tables.
In a specific embodiment, as shown in FIG. 2, the Named Entity Recognition (NER) model is used to obtain the data entities in the user query corpus data.
The NER model is usually implemented as sequence labeling, that is, each input Chinese character is assigned a BIO tag, where B marks the start position of an entity, I marks a position inside an entity, and O marks a character outside any entity, i.e. an irrelevant character.
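The BIO scheme described above can be illustrated with a minimal sketch that turns per-character tags back into entity spans. The function name `decode_bio` and the sample string and tags are made up for illustration; they are not taken from the patent, and a real NER model would supply the tags.

```python
from typing import List, Tuple


def decode_bio(chars: str, tags: List[str]) -> List[Tuple[int, str]]:
    """Collect (start_index, entity_text) spans from per-character BIO tags."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    entities: List[Tuple[int, str]] = []
    start = None  # index where the currently open entity began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new entity starts; close the previous one if it is still open.
            if start is not None:
                entities.append((start, chars[start:i]))
            start = i
        elif tag == "I":
            # Inside an entity: nothing to do. A stray I with no open
            # entity is ignored (an assumption of this sketch).
            continue
        else:
            # "O": outside any entity, so close the open entity if there is one.
            if start is not None:
                entities.append((start, chars[start:i]))
                start = None
    if start is not None:
        entities.append((start, chars[start:]))
    return entities


# Hypothetical sample: two irrelevant characters followed by three 2-character entities.
sample_entities = decode_bio("查询重机缺件物料", list("OOBIBIBI"))
```

In this sketch each `B` opens a span and the next `B` or `O` closes it, which is the usual way a tag sequence is read back into entities.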
In this embodiment, a BERT/BiLSTM + CRF structure is used as the basic structure of the NER model: the input sentence sequence is semantically encoded by a BERT model (a pre-trained language representation model) or a BiLSTM pre-trained model, and the optimal decoding result of the sequence is then computed by the CRF model.
According to a specific implementation manner of the embodiment of the application, the named entity recognition model comprises a preset BERT model and a conditional random field model, and the step of processing the user query corpus data based on the named entity recognition model to obtain the corresponding data entity comprises the following steps:
processing the user query corpus data based on the preset BERT model to obtain an initial coding sequence with semantic tags;
decoding the initial coding sequence based on the conditional random field model to obtain a corresponding data entity.
In a specific embodiment, as shown in FIG. 2, when the query sentence input by the user is processed by the pre-trained BERT model, an initial coding sequence with BIO semantic tags is obtained.
For example, when the query sentence input by the user is "Query the recent missing-part materials of the heavy machine?", the preset BERT model semantically encodes the query statement to obtain the initial coding sequence "OOBOBIOOBIOO IBIO".
The initial coding sequence is then sent to a Conditional Random Field (CRF) model, which computes the optimal decoding result. The decoding result is passed to a fully connected (Dense) layer of the neural network, which yields the data entities "heavy machine, missing part, material" corresponding to the query.
According to a specific implementation manner of the embodiment of the present application, the coding layer of the preset BERT model adopts a RoBERTa pre-training model.
In this embodiment, the BERT coding layer is replaced with a RoBERTa pre-trained model, so that the semantic recognition and semantic coding performance of the NER model can exceed that of the conventional BERT model without adding computational burden on the processor.
In a specific embodiment, after obtaining the data entities corresponding to the user query corpus data, the processor uses the data entities to perform multi-way candidate table recall.
Specifically, for example, when the acquired data entities are "heavy machine, missing part, material", candidate tables are recalled according to the tag type of each data entity. "Heavy machine" is a product type; during candidate table recall, the product type is restricted to the business division, and an Elasticsearch search is performed in the code table library. "Missing part" is a table name; during candidate table recall, an Elasticsearch search is performed on the storage areas of table names and table aliases in the table metadata library. "Material" is a column name; during candidate table recall, an Elasticsearch search is performed on the storage areas of column names and column aliases in the table metadata library.
Specifically, when the code table library is searched, the code value, the code type, and the corresponding code table and code column are recorded.
When the table metadata library is searched, all field definitions of the table, such as table name, column name, table alias, and column alias, are recorded. It should be noted that a search in the table metadata library only searches table metadata and does not record the data content stored in the tables.
By acquiring the data entities first and then performing multi-way candidate table recall against the code table library and the table metadata library, the index range for searching data content is greatly reduced, so data content can be located more accurately and quickly, a full search of the database is avoided, and processor consumption is reduced.
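The routing of each data entity to a different search target by tag type can be sketched as follows. This is only an illustration under stated assumptions: the two in-memory lists stand in for the code table library and the table metadata library, plain substring matching stands in for the ES search, and every table name, column name, and tag label here is hypothetical.

```python
from typing import Dict, Set

# Hypothetical stand-in for the code table library.
CODE_TABLE_LIBRARY = [
    {"code_value": "heavy machine", "code_type": "business_division",
     "code_table": "t_machine_info", "code_column": "division"},
]

# Hypothetical stand-in for the table metadata library (column name -> column alias).
TABLE_METADATA_LIBRARY = [
    {"table": "t_missing_part", "table_alias": "missing part detail",
     "columns": {"material_no": "material", "qty": "quantity"}},
    {"table": "t_machine_info", "table_alias": "machine information",
     "columns": {"division": "business division"}},
]


def recall_candidates(entities: Dict[str, str]) -> Set[str]:
    """Map entity text -> tag type and return the names of recalled candidate tables."""
    hits: Set[str] = set()
    for text, tag in entities.items():
        if tag == "product_type":
            # Product types are looked up among code values in the code table library.
            hits.update(r["code_table"] for r in CODE_TABLE_LIBRARY
                        if text in r["code_value"])
        elif tag == "table_name":
            # Table-name entities are matched against table names and aliases.
            hits.update(m["table"] for m in TABLE_METADATA_LIBRARY
                        if text in m["table"] or text in m["table_alias"])
        elif tag == "column_name":
            # Column-name entities are matched against column names and aliases.
            for m in TABLE_METADATA_LIBRARY:
                if any(text in name or text in alias
                       for name, alias in m["columns"].items()):
                    hits.add(m["table"])
    return hits


candidates = recall_candidates({
    "heavy machine": "product_type",
    "missing part": "table_name",
    "material": "column_name",
})
```

Only metadata is consulted in this sketch, never row contents, which mirrors the point that the recall step narrows the search range before any table data is queried.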
Step S103, processing the user query corpus data and metadata of a preset number of candidate tables according to a preset table matching model to obtain a target table;
In a specific embodiment, the preset table matching model adopts an encoder-decoder structure: the encoder converts the input query statement into a fixed-length vector, and the decoder then converts the vector into the corresponding target characters.
In actual operation, the input parameters of the table matching model comprise two parts: the first part is the data entity of the user query statement, and the second part is the metadata of a candidate table, which includes the table name/table alias, the column name/column alias, and the code table data.
The code table data refers to the data obtained by searching the code table library according to the data entities parsed by the NER model. If a candidate table has no code table library search result, the candidate table consists of one sentence, namely the table name/table alias. If there is a code table library search result, the candidate table consists of two sentences: the table name/table alias, such as "camera wide table" in FIG. 3, and the column name plus code table data, such as "company name: pumping".
After the data entity and the metadata of the preset number of candidate tables are input into the table matching model, the matching score between the data entity and the metadata of each candidate table is calculated, and the candidate table metadata most relevant to the data entity is determined according to the matching score.
According to a specific implementation manner of the embodiment of the present application, the step of processing the user query corpus data and the metadata of the preset number of candidate tables according to a preset table matching model to obtain the target table includes:
processing the user query corpus data based on a preset BERT model to obtain a first semantic vector;
processing the metadata of the preset number of candidate tables based on a preset BERT model to obtain a preset number of second semantic vectors;
processing a first semantic vector corresponding to the user query corpus data and a second semantic vector corresponding to a preset number of candidate tables based on a deep matching algorithm to calculate matching scores of the user query corpus data and the candidate tables;
and selecting the candidate table with the highest matching score as the target table.
In a specific embodiment, as shown in FIG. 3, after the table matching model obtains the data entity corresponding to the user query corpus data and the metadata of the candidate tables, each input parameter is semantically encoded by a pre-trained BERT model to obtain the corresponding first semantic vector and the preset number of second semantic vectors.
The table matching model performs fine ranking through a semantic matching algorithm. The principle is that the query statement and the table metadata are each encoded into semantic vectors, and a deep matching model then maps the two vectors into the same space, so that a query and a table with a semantic matching relationship lie closer together in the vector space.
The table metadata is first encoded by the BERT model into N basic semantic vectors. For example, when the table metadata includes only the table name/table alias, it is encoded into one basic semantic vector, which is the second semantic vector. When the table metadata includes both the table name/table alias and the column name plus code table data, it is encoded into two basic semantic vectors, which must be merged to obtain the corresponding second semantic vector.
To represent the table metadata by the second semantic vector, the algorithm calculates an importance score for each basic semantic vector through an Attention mechanism and then obtains the second semantic vector by weighting; the second semantic vector serves as the semantic vector representation of the table.
In this example, as shown in FIG. 3, "pump" in the user search sentence may match many tables, while "camera" may be the more critical information. Using the attention mechanism from deep learning, the algorithm automatically learns the critical information during matching, so that important fields receive a higher weight in the final semantic representation of the table and the matching accuracy is improved.
Specifically, the Attention weights can be assigned adaptively according to the data entities in the query statement. In FIG. 3, "camera information wide table" is assigned a weight of 0.8, and "company name: pump" is assigned a weight of 0.2.
After semantic encoding by the BERT model, a first semantic vector for the query statement and a second semantic vector for the metadata of each candidate table are obtained. To calculate the matching score output by the table matching model, the inner product of a first semantic vector and a second semantic vector is computed directly and then normalized through Softmax, which gives the matching score between the query statement and the metadata of that candidate table.
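The weighting and scoring just described can be sketched in a few lines. This is a toy illustration, not the patented model: the two-dimensional vectors are invented, the weights 0.8 and 0.2 are simply the example values quoted for FIG. 3, and applying Softmax across the candidate tables is an assumption about how the normalization is carried out.

```python
import math
from typing import List, Sequence


def weighted_merge(vectors: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> List[float]:
    """Merge basic semantic vectors into one table vector by importance weights."""
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax (the maximum is subtracted before exponentiation)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_scores(query_vec: Sequence[float],
                 table_vecs: Sequence[Sequence[float]]) -> List[float]:
    """Inner product of the query vector with each table vector, then Softmax."""
    return softmax([dot(query_vec, t) for t in table_vecs])


# Toy vectors: candidate A has two basic vectors merged with weights 0.8 / 0.2,
# candidate B has a single basic vector.
table_a = weighted_merge([[1.0, 0.0], [0.0, 1.0]], [0.8, 0.2])
table_b = [0.0, 1.0]
scores = match_scores([1.0, 0.0], [table_a, table_b])
best_index = max(range(len(scores)), key=scores.__getitem__)
```

The candidate with the highest score (`best_index`) would be taken as the target table.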
The specific algorithm of the Attention mechanism in the table matching model is as follows:
The algorithm model adopts an encoder-decoder structure: the encoder reads an input sentence and converts it into a fixed-length semantic vector, and the decoder converts the vector into the corresponding target characters.
(1) First, an RNN structure is used in the encoder to obtain the hidden states $(h_1, h_2, \dots, h_T)$;
(2) Assume that the current hidden state of the decoder is $s_{t-1}$. The relevance of each input position $j$ to the current output position is $e_{tj} = a(s_{t-1}, h_j)$, or in vector form $E_t = (a(s_{t-1}, h_1), \dots, a(s_{t-1}, h_T))$, where $a$ is a correlation operator, for example the common dot product
$$a(s_{t-1}, h_j) = s_{t-1}^{\top} h_j,$$
the weighted dot product
$$a(s_{t-1}, h_j) = s_{t-1}^{\top} W h_j,$$
where $W$ is a weight matrix, and so on.
(3) A softmax operation is applied to $E_t$ to normalize it into an importance distribution, $A_t = \mathrm{softmax}(E_t) = (\alpha_{t1}, \dots, \alpha_{tT})$, whose expanded form is
$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T} \exp(e_{tk})}.$$
(4) Using $A_t$, a weighted sum gives the corresponding semantic vector (context vector)
$$c_t = \sum_{j=1}^{T} \alpha_{tj} h_j.$$
(5) The next hidden state of the decoder can then be calculated as $s_t = f(s_{t-1}, y_{t-1}, c_t)$, and the output at that position as $p(y_t \mid y_1, y_2, \dots, y_{t-1}, x) = g(y_{t-1}, s_t, c_t)$.
By calculating the weight of the association between the encoder and the decoder state, the weight distribution of the importance attribute can be obtained, so that the part of the data entity occupying the more important position can be automatically determined.
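Steps (2) to (4) above can be sketched for a single decoder step. This is a minimal illustration that assumes the plain dot product as the correlation operator; the function name `attention_context` and the toy hidden states are hypothetical, and the RNN encoder and the decoder functions of steps (1) and (5) are not modeled.

```python
import math
from typing import List, Sequence, Tuple


def attention_context(s_prev: Sequence[float],
                      encoder_states: Sequence[Sequence[float]]
                      ) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) for one decoder step."""
    # Step (2): relevance of each encoder hidden state to the decoder state,
    # using the dot product as the correlation operator.
    e = [sum(a * b for a, b in zip(s_prev, h)) for h in encoder_states]
    # Step (3): softmax normalization into an importance distribution.
    m = max(e)
    exps = [math.exp(x - m) for x in e]
    z = sum(exps)
    weights = [x / z for x in exps]
    # Step (4): weighted sum of the encoder hidden states gives the context vector.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Toy example: the first encoder state is aligned with the decoder state,
# so it should receive the larger weight.
weights, context = attention_context([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
```

Swapping in the weighted dot product would only change how `e` is computed; the normalization and the weighted sum stay the same.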
Because there are multiple candidate tables, the table matching model processes only one query statement vector and the semantic vector of the metadata of one candidate table at a time. After the matching scores between the user query corpus data and the metadata of each candidate table are obtained, the metadata of the candidate table with the highest matching score is selected as the metadata of the target table.
The corresponding target table is then extracted from a preset table database according to the target table metadata, so that the subsequent step of acquiring the target data content can be performed.
Step S104, processing the data entity and the target table according to a preset SQL statement generating model to obtain an SQL query statement corresponding to target data content;
In a specific embodiment, the preset SQL statement generation model is a BERT-based Chinese NL2SQL model.
In this embodiment, the input of the BERT-based Chinese NL2SQL model is the data entity of the query statement and a data table, and the output of the model is the SQL structure corresponding to one SQL statement.
Specifically, in the NL2SQL model, the sel field is a list representing the columns selected by the select statement. The agg field is a list in one-to-one correspondence with the sel field, indicating which aggregation operation (e.g., count, sum) is performed on each selected column. The conds field is a list representing the individual conditions in the where clause, each in the form (condition column, condition operator, condition value). The cond_conn_op field is an integer representing the relationship between conditions (e.g., and, or).
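To make the four fields concrete, the sketch below renders such an SQL structure as a query string. The integer encodings in `AGG_OPS`, `COND_OPS`, and `CONN_OPS` are assumptions chosen for illustration, since the text does not give the actual mappings, and the table and column names are hypothetical.

```python
from typing import Dict, List

# Assumed integer encodings; the text does not specify them.
AGG_OPS = ["", "AVG", "MAX", "MIN", "COUNT", "SUM"]
COND_OPS = [">", "<", "=", "!="]
CONN_OPS = ["", "AND", "OR"]


def to_sql(table: str, columns: List[str], struct: Dict) -> str:
    """Render a {sel, agg, conds, cond_conn_op} structure as an SQL string."""
    select_parts = []
    for col_idx, agg_idx in zip(struct["sel"], struct["agg"]):
        col = columns[col_idx]
        agg = AGG_OPS[agg_idx]
        select_parts.append(f"{agg}({col})" if agg else col)
    sql = f"SELECT {', '.join(select_parts)} FROM {table}"
    # Values are inlined for readability only; real code should use
    # parameterized queries instead of string interpolation.
    conds = [f"{columns[c]} {COND_OPS[op]} '{val}'"
             for c, op, val in struct["conds"]]
    if conds:
        # With a single condition the connector is unused; default to AND.
        conn = CONN_OPS[struct["cond_conn_op"]] or "AND"
        sql += " WHERE " + f" {conn} ".join(conds)
    return sql


example_sql = to_sql(
    "t_missing_part",
    ["material_no", "division", "qty"],
    {"sel": [0, 2], "agg": [0, 5],
     "conds": [(1, 2, "heavy machine")], "cond_conn_op": 0},
)
```

Representing the query as a small fixed structure lets the model predict each field with a classifier rather than generate free-form SQL text.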
According to a specific implementation manner of the embodiment of the application, the step of processing the data entity and the target table according to a preset SQL statement generation model to obtain an SQL query statement corresponding to target data content includes:
splicing the user query corpus data and the column names of the target table according to the column sequence of the target table to obtain a preset number of data to be coded;
coding each piece of data to be coded based on a preset BERT model to obtain a preset number of coding vectors;
processing all the coding vectors based on a preset binary classification model to obtain first part combined statements and second part combined statements, wherein the first part combined statements are combined statements corresponding to the target data content, and the second part combined statements are combined statements to be deleted;
and merging all the first part combined statements to obtain the SQL query statements.
In a specific implementation manner, after the SQL statement generation model obtains the query statement and the target table, the query statement and the column names of the target table are spliced according to a preset sequence to obtain the input parameters of the SQL statement generation model.
Specifically, one of two tokens that are not included in the BERT vocabulary, namely TEXT or REAL, is added before each column name for distinguishing the columns. The TEXT and REAL tokens can be replaced by any two untrained tokens originally reserved by BERT.
When the SQL statement generation model predicts an SQL statement, it needs to predict which columns in the data table are selected. Because each column in the data table has a different meaning, the query statement and the header of each column in the data table need to be spliced together according to a preset sequence and then input into the BERT model together for real-time coding.
Specifically, the selection of the sequence may be to select each column of the data table from top to bottom, or to number each column of the data table in advance, and select each column of the data table according to the numbering sequence. The sequence is not specifically limited in this embodiment, and adaptive replacement may be performed according to specific situations in practical applications.
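The splicing described above might look like the following sketch. It assumes a single input sequence with `[CLS]`/`[SEP]` separators, that TEXT and REAL mark the column type, and that `[unused1]` and `[unused2]` are the reserved tokens chosen; all of these, along with the function name `build_model_input`, are illustrative assumptions rather than details confirmed by the description.

```python
# Assumed stand-ins for the TEXT and REAL markers, using reserved untrained tokens.
TEXT_TOKEN = "[unused1]"
REAL_TOKEN = "[unused2]"


def build_model_input(query, columns):
    """Splice the query with the column names in table order.

    columns: list of (column_name, column_type) pairs, where column_type is
    "text" or "real" (an assumed convention for this sketch).
    """
    pieces = ["[CLS]", query, "[SEP]"]
    for name, column_type in columns:
        marker = TEXT_TOKEN if column_type == "text" else REAL_TOKEN
        pieces += [marker, name, "[SEP]"]
    return " ".join(pieces)


columns = [("city", "text"), ("working_hours", "real")]  # hypothetical table header
model_input = build_model_input("total working hours in Changsha", columns)
```

Keeping the columns in a fixed order means the coding vector at each marker position can be tied back to a specific column.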
After encoding by the BERT model, a series of coding vectors is obtained, and the obtained coding vectors are then used to predict each clause of the SQL statement.
When predicting the cond_val (condition value) clause, the SQL statement generation model takes the coding vector selected in the foregoing embodiment, that is, the cond_col (condition column) clause result, enumerates combinations of cond_op (condition operator) and cond_val (condition value) to generate a series of candidate combinations, and converts the selection among these candidate combinations into multiple binary classification problems.
The generated groups of [cond_col, cond_op, cond_val] combinations are each spliced with the query as in the previous steps and input into the BERT model in turn. The vector encoded by BERT is input into a Dense layer for binary classification, which judges whether each combination corresponds to the question. Finally, all cond combinations with an output of 1 (namely the cond combinations corresponding to the target data content) are combined as the final output of the model.
After all the first part combined statements undergo the preset splicing and conversion steps, the SQL query statement corresponding to the target data content is obtained.
Specifically, all the second part combined statements, whose result after the binary classification is 0, are deleted.
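The enumerate-then-classify flow can be sketched as below. The classifier here is a hard-coded placeholder rule, not the BERT plus Dense layer of the embodiment; `toy_classifier`, the candidate lists, and the query text are all hypothetical.

```python
from itertools import product


def enumerate_candidates(cond_cols, cond_ops, cond_vals):
    """Enumerate every (cond_col, cond_op, cond_val) combination."""
    return list(product(cond_cols, cond_ops, cond_vals))


def toy_classifier(query, combo):
    """Placeholder for the binary classifier: 1 keeps a combination, 0 drops it.

    A real system would encode the spliced query and combination and apply a
    trained classification layer; this fixed rule only makes the sketch runnable.
    """
    _, op, val = combo
    return 1 if op == "=" and val == "Changsha" else 0


def filter_combinations(query, candidates, classify):
    """Split candidates into first part (output 1) and second part (output 0)."""
    kept, dropped = [], []
    for combo in candidates:
        (kept if classify(query, combo) == 1 else dropped).append(combo)
    return kept, dropped


query = "total working hours in Changsha"
candidates = enumerate_candidates(["city"], [">", "<", "=", "!="], ["Changsha", "hours"])
first_part, second_part = filter_combinations(query, candidates, toy_classifier)
```

Only `first_part` would go on to be merged into the where clause; `second_part` is discarded.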
Step S105, querying the target data content in a preset system based on the SQL query statement.
In a specific embodiment, after the corresponding SQL query statement is obtained, the target data content may be searched in any query system that performs data search through the SQL statement.
Specifically, the preset system is any data storage system capable of using an SQL statement to perform data query in the prior art, which is not described in detail in this embodiment.
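As one possible illustration of this step, the sketch below runs a generated statement against an in-memory SQLite database. SQLite is only an assumed stand-in for the preset system, and the table, rows, and query are hypothetical.

```python
import sqlite3

# In-memory database standing in for the preset system (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE device_status (device_id TEXT, city TEXT, working_hours REAL)"
)
conn.executemany(
    "INSERT INTO device_status VALUES (?, ?, ?)",
    [("d1", "Changsha", 120.0), ("d2", "Changsha", 80.0), ("d3", "Beijing", 200.0)],
)

# Hypothetical SQL query statement produced by the generation model.
sql_query = "SELECT SUM(working_hours) FROM device_status WHERE city = 'Changsha'"
result = conn.execute(sql_query).fetchall()
conn.close()
```

Any other store that accepts SQL could take the place of SQLite here without changing the earlier steps.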
In summary, the present embodiment provides a data content search method, which analyzes the query corpus data input by a user in real time through an improved NER model to obtain the corresponding data entities and the metadata of multiple candidate tables. The candidate tables and the data entities are then screened through a table matching model to determine the most relevant data table. The SQL statement is then determined through the SQL statement generation model, so that the SQL statement for querying the target data content is obtained quickly and accurately. This effectively solves the prior-art problem that a large amount of data content cannot be searched accurately and quickly.
Referring to FIG. 4, which is a schematic diagram of the device modules of a data content search device 400 according to an embodiment of the present application, the data content search device 400 according to the embodiment of the present application includes:
an obtaining module 401, configured to obtain user query corpus data;
an analyzing module 402, configured to analyze the user query corpus data to obtain a data entity corresponding to the target data content requested to be queried and metadata of a preset number of candidate tables;
a first processing module 403, configured to process, according to a preset table matching model, the user query corpus data and metadata of a preset number of candidate tables to obtain a target table;
a second processing module 404, configured to process the data entity and the target table according to a preset SQL statement generation model, so as to obtain an SQL query statement corresponding to target data content;
and the query module 405 is configured to query the target data content in a preset system based on the SQL query statement.
According to a specific implementation manner of the embodiment of the present application, the parsing module 402 is specifically configured to process the user query corpus data by using a named entity recognition model to obtain a corresponding data entity, and to perform elastic search in a code table database and a table metadata database according to the data entity to obtain the metadata of the preset number of candidate tables.
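The module arrangement can be sketched as a simple composition of five callables. This is only one way to wire modules 401 to 405 together; the class name, field names, and the stub lambdas are hypothetical and stand in for the real models and storage system.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class DataContentSearchDevice:
    """Five modules held as callables; names loosely mirror modules 401 to 405."""

    obtain: Callable[[], str]
    parse: Callable[[str], Tuple[List[str], List[dict]]]
    match_table: Callable[[str, List[dict]], dict]
    generate_sql: Callable[[List[str], dict], str]
    query: Callable[[str], Any]

    def run(self) -> Any:
        corpus = self.obtain()                             # obtaining module
        entities, candidates = self.parse(corpus)          # parsing module
        target = self.match_table(corpus, candidates)      # first processing module
        sql = self.generate_sql(entities, target)          # second processing module
        return self.query(sql)                             # query module


# Stub modules (hypothetical) so that the flow can be exercised end to end.
device = DataContentSearchDevice(
    obtain=lambda: "total working hours in Changsha",
    parse=lambda corpus: (["Changsha"], [{"name": "device_status"}, {"name": "order_records"}]),
    match_table=lambda corpus, candidates: candidates[0],
    generate_sql=lambda entities, table: f"SELECT * FROM {table['name']} WHERE city = '{entities[0]}'",
    query=lambda sql: {"executed": sql},
)
outcome = device.run()
```

Because each module is an independent callable, any one of them can be swapped out without touching the others.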
In addition, an embodiment of the present application further provides a computer device, where the computer device includes a processor and a memory, where the memory stores a computer program, and the computer program, when running on the processor, executes the data content search method in the foregoing embodiment.
An embodiment of the present application further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a processor, the computer program executes the data content searching method in the foregoing embodiment.
To sum up, the embodiments of the present application provide a data content search method, apparatus, computer device, and readable storage medium. After a preliminary entity analysis result is obtained by using an NER model, metadata of candidate tables is preliminarily obtained through filtering of an element code table. Then, in the second stage, the table matching model performs secondary filtering on the user query corpus data and the metadata of the candidate tables, and the metadata of the target table is obtained according to the matching score obtained by calculating the inner product. Finally, according to the query entity obtained in the first stage and the table metadata definition information obtained in the second stage, a corresponding query statement is generated through an NL2SQL model to obtain the final query result. According to this scheme, data can be recalled quickly and accurately in scenarios involving massive data and complex structures. In addition, for specific implementation processes of the data content searching apparatus, the computer device, and the computer-readable storage medium mentioned in the foregoing embodiments, reference may be made to the specific implementation processes of the foregoing method embodiments, which are not described in detail herein.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, each functional module or unit in each embodiment of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, or the part thereof which in essence contributes to the prior art, can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a smart phone, a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.

Claims (10)

1. A method for searching data content, the method comprising:
acquiring user query corpus data;
analyzing the user query corpus data to obtain a data entity corresponding to the target data content requested to be queried and metadata of a preset number of candidate tables;
processing the user query corpus data and metadata of a preset number of candidate tables according to a preset table matching model to obtain a target table;
processing the data entity and the target table according to a preset SQL statement generation model to obtain an SQL query statement corresponding to target data content;
and querying the target data content in a preset system based on the SQL query statement.
2. The method according to claim 1, wherein the step of parsing the corpus data of the user query to obtain the data entities corresponding to the target data content requested to be queried and the metadata of the candidate tables in a preset number comprises:
processing the user query corpus data by using a named entity recognition model to obtain a corresponding data entity;
and performing elastic search in a code table database and a table metadata database according to the data entity to obtain metadata of the candidate tables in the preset number.
3. The method according to claim 2, wherein the named entity recognition model comprises a pre-set BERT model and a conditional random field model, and the step of processing the user query corpus data based on the named entity recognition model to obtain the corresponding data entity comprises:
processing the user query corpus data based on the preset BERT model to obtain an initial coding sequence with semantic tags;
decoding the initial coding sequence based on the conditional random field model to obtain a corresponding data entity.
4. The data content searching method of claim 3, wherein the coding layer of the preset BERT model adopts a RoBERTa pre-training model.
5. The method according to claim 1, wherein the step of processing the user query corpus data and the metadata of the predetermined number of candidate tables according to a predetermined table matching model to obtain the target table comprises:
processing the user query corpus data based on a preset BERT model to obtain a first semantic vector;
processing the metadata of the candidate tables in the preset number based on a preset BERT model to obtain a second semantic vector in the preset number;
processing a first semantic vector corresponding to the user query corpus data and a second semantic vector corresponding to a preset number of candidate tables based on a deep matching algorithm to calculate matching scores of the user query corpus data and the candidate tables;
and selecting the candidate table with the highest matching score as the target table.
6. The data content search method according to claim 1, wherein the step of processing the data entity and the target table according to a preset SQL statement generation model to obtain an SQL query statement corresponding to target data content comprises:
splicing the user query corpus data and the column names of the target table according to the column sequence of the target table to obtain a preset number of data to be coded;
coding each piece of data to be coded based on a preset BERT model to obtain a preset number of coding vectors;
processing all the coding vectors based on a preset classification model to obtain a first part of combined statements and a second part of combined statements, wherein the first part of combined statements are combined statements corresponding to the target data content, and the second part of combined statements are combined statements to be deleted;
and merging all the first part combined statements to obtain the SQL query statements.
7. A data content search apparatus, characterized in that the data content search apparatus comprises:
the acquisition module is used for acquiring user query corpus data;
the analysis module is used for analyzing the user query corpus data to obtain a data entity corresponding to the target data content requested to be queried and metadata of a preset number of candidate tables;
the first processing module is used for processing the user query corpus data and the metadata of the candidate tables in the preset number according to a preset table matching model so as to obtain a target table;
the second processing module is used for processing the data entity and the target table according to a preset SQL statement generation model so as to obtain an SQL query statement corresponding to the target data content;
and the query module is used for querying the target data content in a preset system based on the SQL query statement.
8. The apparatus according to claim 7, wherein the parsing module is specifically configured to process the corpus data of the user query by using a named entity recognition model to obtain corresponding data entities; and performing elastic search in a code table database and a table metadata database according to the data entity to obtain metadata of the candidate tables in the preset number.
9. A computer device, characterized in that it comprises a processor and a memory, said memory storing a computer program which, when run on said processor, performs the data content search method of any one of claims 1 to 6.
10. A computer-readable storage medium, in which a computer program is stored which, when run on a processor, carries out the data content search method of any one of claims 1 to 6.
CN202210823548.0A 2022-07-13 2022-07-13 Data content searching method and device, computer equipment and readable storage medium Pending CN115203206A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210823548.0A CN115203206A (en) 2022-07-13 2022-07-13 Data content searching method and device, computer equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210823548.0A CN115203206A (en) 2022-07-13 2022-07-13 Data content searching method and device, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115203206A true CN115203206A (en) 2022-10-18

Family

ID=83579936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210823548.0A Pending CN115203206A (en) 2022-07-13 2022-07-13 Data content searching method and device, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115203206A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116955415A (en) * 2023-09-13 2023-10-27 成都融见软件科技有限公司 Design hierarchy based data search system
CN116955415B (en) * 2023-09-13 2024-01-23 成都融见软件科技有限公司 Design hierarchy based data search system

Similar Documents

Publication Publication Date Title
US10380236B1 (en) Machine learning system for annotating unstructured text
CN112800170A (en) Question matching method and device and question reply method and device
CN111563384B (en) Evaluation object identification method and device for E-commerce products and storage medium
CN111914062B (en) Long text question-answer pair generation system based on keywords
CN114580382A (en) Text error correction method and device
CN116719520B (en) Code generation method and device
US11645447B2 (en) Encoding textual information for text analysis
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN110390049B (en) Automatic answer generation method for software development questions
CN113821605A (en) Event extraction method
CN114997288A (en) Design resource association method
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN112784580A (en) Financial data analysis method and device based on event extraction
CN117349423A (en) Template matching type knowledge question-answering model in water conservancy field
CN112732862A (en) Neural network-based bidirectional multi-section reading zero sample entity linking method and device
CN109902162B (en) Text similarity identification method based on digital fingerprints, storage medium and device
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
CN115730058A (en) Reasoning question-answering method based on knowledge fusion
CN115238705A (en) Semantic analysis result reordering method and system
CN116090450A (en) Text processing method and computing device
CN113408287B (en) Entity identification method and device, electronic equipment and storage medium
CN114610744A (en) Data query method and device and computer readable storage medium
CN115017404A (en) Target news topic abstracting method based on compressed space sentence selection
CN114254622A (en) Intention identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination