CN116910086B - Database query method and system based on self-attention syntax sensing - Google Patents
- Publication number
- CN116910086B CN116910086B CN202311179624.XA CN202311179624A CN116910086B CN 116910086 B CN116910086 B CN 116910086B CN 202311179624 A CN202311179624 A CN 202311179624A CN 116910086 B CN116910086 B CN 116910086B
- Authority
- CN
- China
- Prior art keywords
- word
- query
- information
- self
- attention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/243—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2452—Query translation
- G06F16/24522—Translation of natural language queries to structured queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a database query method based on self-attention syntactic perception, and belongs to the technical field of computer databases. The invention performs a secondary development of an existing word segmentation system. In order to annotate each word in a query sentence with the semantic information of the relational data and achieve semantic coverage, an N-shortest-path word segmentation technique combining statistics and a dictionary is adopted, and the edges of the directed acyclic graph are weighted according to knowledge-base priority, so that the segmentation results are optimized and better converted into database query statements. A hidden Markov model is introduced to tag the segmentation results with parts of speech; word information and data semantic labels are fully utilized, associations between words and database objects are established, and a normalized word sequence carrying data semantics is output. The invention can effectively improve the ability of the conversion model to map natural language into high-dimensional semantics, improve the accuracy and generalization of natural language database queries, and provide users with a more accurate and intelligent query experience.
Description
Technical Field
The invention relates to a method and a system for realizing database queries based on natural language, belonging to the technical field of computer databases.
Background
With the explosive development of the internet, data is now used widely across industries and the amount of information has grown rapidly. Data has become one of the most valuable resources in today's world, and the database has become a key tool for storing and retrieving information. However, databases are not especially intuitive or easy to understand. In a conventional relational database, a formal database language, the Structured Query Language (SQL), is required to access and manage data. To perform database query operations, a user must be familiar with SQL and with the underlying database, including table structures and attribute association relationships, or must rely on a pre-built visual search program. This undoubtedly presents a considerable barrier to non-expert users.
Since long before the advent of computers, natural language has been the primary information medium in human society for describing things and relationships in the real world. Compared with the machine query languages used to express logic, relational models, or XML, natural language is admittedly not the optimal way to represent entities and relationships. Nevertheless, most knowledge and information is written and spread in natural language form, and with the explosive growth of the internet, almost any user can easily obtain a large amount of natural language information. Therefore, to improve the user's database query experience, it is highly desirable to overcome the natural language barrier and build a database query interface that supports natural language.
Natural language processing is currently one of the most active areas of research. More and more tools and models have emerged for part-of-speech tagging, syntactic analysis, semantic tagging, and related tasks, providing tremendous assistance in processing natural language. With the success of IBM's Watson in the Jeopardy! competition, and the rise of natural language dialogue systems such as Apple's Siri, Google Home, Amazon's Alexa, and Microsoft's Cortana, there is growing interest in using natural language interfaces for database queries.
Natural Language Interface to Database (NLIDB) technology involves interdisciplinary research across artificial intelligence, database systems, human-machine interfaces, and other fields, and aims to translate a Natural Language Query (NLQ) into a formal database language. The ideal NLIDB system would provide a complete natural-language-based human-machine interface covering database design, definition, and operation.
In general, a database natural language interface converts natural language into structured query commands that are used only for database query operations. An ideal natural language interface should allow users to issue arbitrary queries against the underlying database and obtain accurate information at minimal cost.
However, owing to the limitations of natural language processing technology, existing database natural language interface techniques extract semantic features of insufficient depth when associating natural language with the database's data structures. As a result, the query interface cannot understand deep structured semantic information, and query quality is reduced.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art, and innovatively provides a database query method and system based on self-attention syntactic perception. The invention performs a secondary development of an existing word segmentation system. In order to annotate each word in a query sentence with the semantic information of the relational data and achieve semantic coverage, an N-shortest-path word segmentation technique combining statistics and a dictionary is adopted, and the edges of the directed acyclic graph are weighted according to knowledge-base priority, so that the segmentation results are optimized and better converted into database query statements. A relational data semantic model serves as the conceptual model, a set and a linked list serve as the intermediate structures, and a self-attention mechanism realizes multi-scale deep high-dimensional feature extraction from natural language. Theoretical and experimental results show that the invention effectively improves the ability of the conversion model to map natural language into high-dimensional semantics and improves the accuracy and generalization of natural language database queries.
The technical scheme adopted by the invention is as follows.
A database query method based on self-attention syntactic perception, comprising the following steps:
Step 1: perform lexical analysis on the natural language query to obtain a normalized word sequence.
Perform word-level preprocessing, including word segmentation, part-of-speech tagging, and named entity recognition. Abstractly model the natural language to obtain a normalized word sequence carrying data semantics.
Step 2: perform syntactic analysis on the word sequence and encode its semantic information to obtain a representation in a high-dimensional semantic space.
Step 3: decode the high-dimensional semantic information to obtain the semantic information of the set and linked-list structures. The intermediate structure data is built bottom-up: according to relational-data rules, the data is divided into several subtrees, each of which can independently represent database information, so as to eliminate semantic ambiguity in the data.
Step 4: recombine the semantic information of the intermediate structures into query targets and query conditions, and finally generate a complete SQL statement.
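As an illustration only, the four steps above can be sketched as a toy pipeline. All function names and the miniature schema below are hypothetical stand-ins, not the patented implementation: the real system replaces each stand-in with knowledge-base-driven segmentation, a self-attention encoder-decoder, and rule-based SQL generation.

```python
# Toy stand-ins for the four steps; hypothetical names, illustrative only.

SCHEMA = {"student": {"name", "contact_phone"}}   # assumed miniature schema

def lexical_analysis(sentence: str) -> list[str]:
    # Step 1 stand-in: the real system runs N-shortest-path segmentation,
    # part-of-speech tagging and named entity recognition.
    return sentence.lower().split()

def semantic_encode(words: list[str]) -> list[tuple[str, str]]:
    # Step 2 stand-in: a schema lookup replaces the self-attention encoder.
    tagged = []
    for w in words:
        if w in SCHEMA:
            tagged.append((w, "ENTITY"))          # table name
        elif any(w in cols for cols in SCHEMA.values()):
            tagged.append((w, "PROPERTY"))        # column name
        else:
            tagged.append((w, "OPERATION"))       # e.g. "query"
    return tagged

def decode_intermediate(tagged):
    # Step 3 stand-in: a set of query targets and a list of query entities.
    targets = {w for w, t in tagged if t == "PROPERTY"}
    entities = [w for w, t in tagged if t == "ENTITY"]
    return targets, entities

def generate_sql(targets, entities) -> str:
    # Step 4: recombine the intermediate structures into a SQL statement.
    return f"SELECT {', '.join(sorted(targets))} FROM {entities[0]};"

sql = generate_sql(*decode_intermediate(
    semantic_encode(lexical_analysis("query student contact_phone"))))
print(sql)  # SELECT contact_phone FROM student;
```

The sketch shows only the data flow between the four stages; each stage is discussed in detail in the embodiments below.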
Preferably, a knowledge base is used to improve the accuracy and completeness of the lexical analysis. The general knowledge base is independent of the application field and consists of a word segmentation dictionary, a general database dictionary, and a synonym forest; the special knowledge base focuses on the database objects and consists of a special word segmentation base, a synonym base, an entity knowledge base, a domain-name knowledge base, a composite-concept knowledge base, and an enumeration-value knowledge base.
Preferably, the semantics of the relational data are taken into account during word segmentation, ensuring correct association relationships between words and database objects; the segmentation output is a normalized word sequence carrying relational-data semantics.
Preferably, during syntactic analysis, the invention adopts a dependency syntax perception module based on the self-attention mechanism and uses a decision-based analysis method to acquire the dependency relationships between words. The syntax perception module maps semantic information from natural language into a high-dimensional semantic space to realize a deep high-dimensional representation of the semantic features, and then maps the semantic information in the high-dimensional semantic space to the identification information required by the structured query statement.
Preferably, when generating the structured query statement, the invention classifies and analyzes the query targets and query conditions according to the grammar rules of SQL and applies a corresponding rationality check scheme, so as to better ensure the correctness of the conversion. The query generation module outputs a complete SQL query command.
On the other hand, the invention also provides a database query system based on self-attention syntactic perception, comprising a semantic information encoding module and a language parsing model.
The semantic information encoding module extracts multi-level high-dimensional semantic information from the word sequence after word embedding and position encoding, obtaining an abstract representation of the database query pattern information; it acquires the importance weight parameters within the query information and extracts data features according to the importance of the different contents.
The language parsing model includes an encoder that maps semantic information from natural language into a high-dimensional semantic space, and a decoder that maps the semantic information in the high-dimensional semantic space to the identification information required by the structured query statement.
A self-attention mechanism and syntax perception are introduced into the model, so that the natural language input by the user is converted more accurately into a structured language the database can understand, achieving a deeper understanding of the query intent. The self-attention mechanism allows the interface to focus on the relevance between the different parts of the input sentence and thus on the most important information; the interface can better grasp key words and phrases, understand the user's query intent, and express the user's requirements more accurately when converting them into the structured language.
The fusion of syntactically perceived content makes the interface more intelligent when processing sentence structure: it takes into account the syntactic structure and grammar rules of the statement, ensuring that the logic and semantics of the query remain consistent when converted into the structured query language. This helps to reduce misunderstanding and ambiguity, thereby providing more reliable query results. Semantic feature extraction can be performed with a residual structure and a self-attention syntactic perception model to enhance the query interface's understanding of deep semantic features.
By introducing the self-attention mechanism and the fusion of syntactically perceived content, the natural language query interface can understand the user's query requirements more comprehensively and accurately, interact with the underlying database in the most efficient manner, and provide users with a more intelligent and accurate database query experience.
Advantageous effects
Compared with the prior art, the invention has the following advantages:
1. The invention innovatively adopts a multi-module fused attention architecture: the outputs of multiple modules are fused together, and an attention mechanism performs a weighted fusion of the features from the different modules. For example, the outputs of modules such as syntactic analysis, semantic understanding, and pre-trained models are fused through the attention mechanism, so that the information from the different modules is used comprehensively and the performance and query accuracy of the overall system are improved.
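A minimal sketch of such attention-weighted feature fusion across modules. The fixed relevance logits and two-dimensional toy features below are illustrative assumptions; in the real system the weights would be learned:

```python
import numpy as np

def attention_fuse(features, logits):
    # features: one vector per module (syntactic analysis, semantic
    # understanding, pre-trained model, ...); logits: their relevance
    # scores, given directly here instead of being learned.
    w = np.exp(logits - np.max(logits))
    w = w / w.sum()                       # softmax over the modules
    fused = sum(wi * f for wi, f in zip(w, features))
    return fused, w

syntax_feat = np.array([1.0, 0.0])        # toy module outputs
semantic_feat = np.array([0.0, 1.0])
fused, w = attention_fuse([syntax_feat, semantic_feat], np.array([2.0, 0.0]))
print(w)      # the syntactic module receives the larger fusion weight
```

The softmax guarantees that the fusion weights form a probability distribution, so the fused feature is a convex combination of the module outputs.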
2. The invention applies the attention mechanism to the database query flow, so that the grammatical structure and relationships in the natural language query are understood more accurately and the user's intent is parsed better. Through the attention mechanism, the query interface can adjust feature weights and reinforce important information at each stage; through its feedback, the processes of syntactic analysis and semantic understanding can be adjusted dynamically, making each step more accurate and intelligent. Traditional syntactic analysis methods can be limited on long or complex sentences, but by computing attention weights between different words, the attention mechanism automatically focuses on the more important parts of the sentence, improving the depth and richness of feature extraction. In this way, the generated structured query language reflects the user's query intent more accurately.
3. The present invention utilizes a pre-trained language model (e.g., BERT) to optimize the processing of natural language queries. By introducing a pre-trained model, contextual information and semantic associations are captured better, improving the accuracy of syntactic analysis and semantic understanding. This optimization effectively improves the accuracy of database queries and provides users with a more accurate and intelligent query experience.
Drawings
FIG. 1 is a general flow chart of a database natural language query interface system according to the present invention;
FIG. 2 is a flow chart of lexical analysis according to the present invention;
FIG. 3 is a flowchart of a word segmentation algorithm in syntactic analysis according to the present invention;
FIG. 4 is a general framework diagram of the attention-mechanism-based syntax perception module according to the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely below with reference to the accompanying drawings and examples. The examples serve only to explain the present invention and are not intended to limit its scope; the described examples are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
Examples
As shown in FIG. 1, the database query method based on self-attention syntactic perception introduces a self-attention mechanism and syntax perception, performs multi-scale deep high-dimensional feature extraction on the input natural language, and generates the final database query language from the extracted high-dimensional semantic representation of the query information.
A database query method based on self-attention syntactic perception comprises:
Step 1: perform lexical analysis on the natural language query, including word segmentation, part-of-speech tagging, and named entity recognition. Abstractly model the natural language to obtain a normalized word sequence carrying data semantics.
The invention provides a relational data semantic model, which builds a close relationship between natural language vocabulary and database objects by extending the existing relational model, enriching the semantic information of the data.
The relational data semantic model is the four-tuple $M = (V, O, R, T)$, where $V$ is a given natural language vocabulary set of the application field, $O$ represents the database objects, $R$ represents the association relationships between these concepts, and $T$ describes the data types of the database objects. The lexical analysis flow is shown in FIG. 2.
Further, to optimize word segmentation and part-of-speech tagging, a set of semantic tags specific to relational data is adopted to express the different semantic types; complete data semantic information can be obtained through these semantic tags. Based on the database object types and the expert knowledge base, the data semantic types are divided into 9 unique categories, whose semantic tags are shown in Table 1:
TABLE 1
Among them, the operation words, function words, logic words, and degree words correspond to the general database dictionary; such words can appear in a query from any field. The other word types are related to the data model and correspond to the special knowledge base.
The relational data semantic model can markedly improve the mutual understanding between natural language and database objects and further enrich the semantic information of the data. The model has important practical value for applications that process large amounts of data in various fields.
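The tag inventory above can be sketched as an enumeration. Only the four general categories are named in the text; the ENTITY and PROPERTY names below are hypothetical stand-ins for data-model-specific entries of Table 1 (PROPERTY is the tag the segmentation example later assigns to "contact phone"):

```python
from enum import Enum

class SemanticTag(Enum):
    # General, domain-independent categories named in the text (backed by
    # the general database dictionary):
    OPERATION = "operation word"   # e.g. "query"
    FUNCTION = "function word"     # e.g. aggregate expressions
    LOGIC = "logic word"           # e.g. "and", "or"
    DEGREE = "degree word"         # comparison / degree expressions
    # Hypothetical stand-ins for the data-model-specific entries of Table 1:
    ENTITY = "entity word"         # table-level objects, e.g. "student"
    PROPERTY = "property word"     # attributes, e.g. "contact phone"

GENERAL_TAGS = {SemanticTag.OPERATION, SemanticTag.FUNCTION,
                SemanticTag.LOGIC, SemanticTag.DEGREE}

def is_domain_independent(tag: SemanticTag) -> bool:
    # General-dictionary tags may appear in a query from any field;
    # the remaining tags are bound to the special knowledge base.
    return tag in GENERAL_TAGS
```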
For word segmentation, the invention adopts an N-shortest-path segmentation algorithm combining statistics and a dictionary, implemented as a secondary development of the existing ICTCLAS system. Taking "query the student's contact phone" (查询学生的联系电话) as an example, the main steps of the segmentation operation are described below in conjunction with FIG. 3:
The first step: the input sentence "query the student's contact phone" is cut into single characters, giving "B/查/询/学/生/的/联/系/电/话/E", where B and E mark the beginning and end of the sentence.
The second step: according to the pre-constructed knowledge base, all possible word-forming schemes are searched for each character atom. For example, based on the knowledge base information, the substring "联系电话" ("contact phone") is labeled with the data semantic tag "PROPERTY". The word-forming result includes, besides the single characters, the multi-character candidates "查询" (query), "学生" (student), "联系" (contact), "电话" (phone), and "联系电话" (contact phone).
The third step: the length of each word edge in the word-forming schemes is calculated, so that the binary segmentation graph can be generated later.
For example, during this process it is found that "contact phone" would be split into two words along one word edge; since the knowledge base indicates that it is very likely the whole phrase "contact phone", the word-edge length is corrected and set to 0.
The fourth step: after the word-edge lengths are calculated, the paths are sorted, and the first two reachable paths are selected as the initial N-shortest-path segmentation result.
For example, the shortest-path calculation results are shown in Table 2:
TABLE 2
The fifth step: special-word merging and unregistered-word recognition are performed on the initial segmentation result; the binary segmentation graph is then regenerated and the shortest path is solved again.
For example, special words (such as "contact phone") are kept as a whole, and unregistered words are recognized, so that the information in the knowledge base is fully used.
The sixth step: the complete segmentation result is obtained, for example "query / student / contact phone". This segmentation process fully considers the semantic information in the knowledge base, reduces segmentation ambiguity, and improves segmentation accuracy and efficiency, making it well suited to processing natural language queries.
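The dictionary-plus-knowledge-base idea behind the steps above can be sketched as a shortest-path dynamic program over the word graph. The dictionary, tags, and zero-cost weighting below are illustrative assumptions; the full N-shortest-path algorithm additionally keeps the N best paths and re-runs after special-word merging, while this sketch keeps only the single best path:

```python
# Illustrative dictionary and knowledge base for "查询学生的联系电话".
DICTIONARY = {"查询", "学生", "的", "联系", "电话", "联系电话"}
KNOWLEDGE_BASE = {"查询": "OPERATION", "学生": "ENTITY",
                  "联系电话": "PROPERTY"}   # edge length corrected to 0

def segment(sentence: str) -> list[str]:
    n = len(sentence)
    best = [float("inf")] * (n + 1)  # best[i]: min cost to segment sentence[:i]
    prev = [0] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        if best[i] == float("inf"):
            continue
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            if word in DICTIONARY or j == i + 1:  # single chars always allowed
                cost = 0.0 if word in KNOWLEDGE_BASE else 1.0
                if best[i] + cost < best[j]:
                    best[j], prev[j] = best[i] + cost, i
    words, j = [], n                 # backtrack along the shortest path
    while j > 0:
        words.append(sentence[prev[j]:j])
        j = prev[j]
    return words[::-1]

print(segment("查询学生的联系电话"))   # ['查询', '学生', '的', '联系电话']
```

Because the knowledge-base phrase "联系电话" costs 0, the whole phrase wins over the split "联系" + "电话", mirroring the edge-length correction in the third step.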
Step 2: perform syntactic analysis on the word sequence and encode its semantic information to obtain a representation in a high-dimensional semantic space.
Syntactic analysis is completed using the encoder-decoder architecture based on the self-attention mechanism shown in FIG. 4, realizing multi-level deep high-dimensional feature extraction and obtaining the identification information required by the structured query statement.
The word-level text features are represented as $X \in \mathbb{R}^{n \times d}$, where $\mathbb{R}$ denotes the space of real numbers, and $n$ and $d$ denote the number of tokens and the feature dimension of a single token, respectively. A token is the basic unit of the word-level feature representation, converting a word or phrase into a real-valued vector that a computer can understand and process.
In the syntactic analysis model based on the self-attention mechanism, self-attention is implemented as a multi-head self-attention network (MSA), whose computation time complexity is $O(n^2 \cdot d)$, where $O(\cdot)$ denotes asymptotic computational complexity. The multi-head attention network is computed as follows:

$$Q_i = X W_i^Q, \quad K_i = X W_i^K, \quad V_i = X W_i^V,$$
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i,$$
$$\mathrm{MSA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,$$

where $Q_i$, $K_i$, $V_i$ respectively denote the Query, Key, and Value features of the network of the $i$-th head; $X$ is their homologous input feature; $W_i^Q$, $W_i^K$, $W_i^V$ are the linear transformation layers that obtain $Q_i$, $K_i$, $V_i$ from $X$, each with per-head dimension $d_k = d / h$; $\mathrm{softmax}(Q_i K_i^{\top} / \sqrt{d_k})$ is the attention weight matrix; $\top$ denotes the matrix transpose; $\mathrm{head}_i$ is the $i$-th self-attention branch of the multi-head self-attention; $h$ is the total number of self-attention heads; $\mathrm{Concat}$ denotes concatenation of tensors along the channel dimension; and $\mathrm{softmax}$ is the normalized exponential function, which converts any set of real numbers into a probability distribution over $[0, 1]$.
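A minimal NumPy sketch of this multi-head self-attention computation, with randomly initialized projection matrices standing in for the learned linear transformation layers:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, h):
    # X: (n, d) token features; h heads of per-head dimension d_k = d // h.
    n, d = X.shape
    d_k = d // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv   # homologous: all projected from X
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)
        A = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_k))  # (n, n) weights
        heads.append(A @ V[:, s])      # O(n^2 * d_k) work per head
    return np.concatenate(heads, axis=-1) @ Wo  # concat along channel dim

rng = np.random.default_rng(0)
n, d, h = 4, 8, 2
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
Y = multi_head_self_attention(X, Wq, Wk, Wv, Wo, h)
```

Each head attends over all $n$ positions, which is where the $O(n^2 \cdot d)$ cost arises; the output keeps the input shape $(n, d)$.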
Through the multi-head self-attention operation, the model attends more strongly to the more important features, thereby extracting richer and more effective multi-level high-dimensional syntactic features.
The first step: obtain the representation vector of each word, formed by summing its word feature and its position encoding.
The second step: assemble the word representation vectors into a matrix and feed it into the encoder; after passing through the stacked encoding modules, the encoded information matrix of all the words is obtained.
The third step: feed the encoded information matrix output by the encoder into the decoder, which sequentially generates the next piece of query information from the current database query information (the set and the linked list).
During training, the query information that has not yet been predicted is hidden by a masking operation.
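The masking operation can be sketched as adding $-\infty$ to the pre-softmax scores of not-yet-predicted positions, so that their attention weights become exactly zero. This standard causal mask is shown as an assumption consistent with the description, not as the patent's exact implementation:

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    # Position i may attend only to positions <= i; future positions get
    # -inf so that their post-softmax attention weight is exactly 0.
    m = np.zeros((n, n))
    m[np.triu_indices(n, k=1)] = -np.inf
    return m

scores = np.ones((3, 3))                       # toy pre-softmax scores
masked = scores + causal_mask(3)
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
print(weights[0])   # the first position attends only to itself
```

Because `exp(-inf)` is 0, each row of `weights` remains a valid probability distribution over only the already-predicted positions.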
Step 3: decode the high-dimensional semantic information to obtain the semantic information of the set and linked-list structures. The intermediate structure data is built bottom-up: according to relational-data rules, the data is divided into several subtrees, each of which can independently represent database information, so as to eliminate semantic ambiguity in the data.
Step 4: recombine the semantic information of the intermediate structures into query targets and query conditions, and finally generate a complete SQL statement.
For a database query, the most important task is to obtain the query targets and the query conditions from the natural language query, and to generate a legal query statement according to the SQL syntax. In a database natural language interface, all of the preprocessing work (including lexical analysis and syntactic analysis) serves to convert linguistic concepts into intermediate structures.
In the invention, a set and a linked list are used as the intermediate structures: the query targets and the query conditions are stored in the set and the linked list respectively, and finally a complete SQL statement is generated. The words or phrases representing the query targets are extracted from the complete restricted Chinese query sentence. The implementation steps are as follows:
The first step: dynamically obtain the maximum rule length MAX and the minimum rule length MIN in the rule set, and set Len equal to MAX. Len is a variable representing the maximum length of the intercepted word objects.
The second step: intercept MAX word objects in reverse order from the end of the sentence; if a word is scanned, move the pointer forward by one position (the rules do not consider the modifier relations of words), and form the reverse-order parts of speech into a character string tryWord.
The third step: match tryWord against the rules in the target rule set. tryWord denotes the string of reverse-order parts of speech extracted from the complete restricted Chinese query sentence, used for matching against the rules in the target rule set.
If the matching succeeds, a target phrase is considered to have been formed; the pointer is moved forward by Len objects, and the word object set is stored as one object into the target phrase set T'.
If the matching fails, Len is decremented by 1 and the matching is performed again, until Len equals MIN.
The fourth step: if the position at reverse order Len+1 in the word sequence of the whole sentence is a conjunction or a punctuation mark, this indicates that multiple query targets exist, and the matching process is repeated until the end of the sentence; if the Len+1 position is any other word, the recognition process ends.
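A minimal Python sketch of the four matching steps above; the part-of-speech tags "c" (conjunction) and "w" (punctuation), and the encoding of rules as reverse part-of-speech strings, are assumptions for illustration:

```python
def extract_targets(words, pos_tags, rules, max_len, min_len):
    """Reverse maximum matching of query-target phrases (steps one to four).

    words / pos_tags -- parallel lists for the segmented query sentence
    rules            -- set of reverse part-of-speech strings (the rule set)
    Returns the target phrase set T' as a list of word tuples.
    """
    targets = []
    end = len(words)                          # pointer: one past the current tail
    while end >= min_len:
        matched = False
        for ln in range(min(max_len, end), min_len - 1, -1):   # Len from MAX down to MIN
            # tryWord: parts of speech of the last `ln` words, in reverse order
            try_word = "".join(reversed(pos_tags[end - ln:end]))
            if try_word in rules:
                targets.append(tuple(words[end - ln:end]))     # store into T'
                end -= ln                     # move the pointer forward by Len objects
                matched = True
                break
        if not matched:
            break
        # Continue only if the word at reverse position Len+1 is a
        # conjunction ("c") or punctuation ("w"); otherwise recognition ends.
        if end > 0 and pos_tags[end - 1] in ("c", "w"):
            end -= 1                          # skip the conjunction / punctuation
        else:
            break
    return targets

# "query the name and age": two targets joined by a conjunction
print(extract_targets(["query", "name", "and", "age"],
                      ["v", "n", "c", "n"], {"n"}, 2, 1))
# [('age',), ('name',)]
```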
The invention also provides a database query system based on self-attention syntactic perception, which comprises a semantic information coding module and a language parsing model.
The semantic information coding module is used for extracting multi-layer high-dimensional semantic information from the word sequence after word embedding and position coding, so as to obtain an abstract representation of the database query pattern information; and for obtaining the importance weight parameters in the query information and extracting data features according to the importance of different contents.
The language parsing model comprises an encoder and a decoder: the encoder maps the semantic information of the natural language into a high-dimensional semantic space, and the decoder maps the semantic information in the high-dimensional semantic space into the identification information required for structuring the query statement.
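A minimal sketch of this encoder-decoder split, with toy stand-ins for the two mappings (all class, method, and output names are illustrative, not from the patent):

```python
from typing import Callable, List

class LanguageParsingModel:
    """Sketch of the two-stage parsing model; names here are illustrative."""

    def __init__(self, encoder: Callable, decoder: Callable):
        self.encoder = encoder    # natural language -> high-dimensional semantic space
        self.decoder = decoder    # semantic space   -> identification information

    def parse(self, tokens: List[str]) -> List[str]:
        semantic = self.encoder(tokens)
        return self.decoder(semantic)

# Toy stand-ins that only show the data flow, not real networks.
model = LanguageParsingModel(
    encoder=lambda toks: [len(t) for t in toks],       # fake "semantic vectors"
    decoder=lambda vecs: [f"id_{v}" for v in vecs],    # fake identification info
)
print(model.parse(["show", "names"]))   # ['id_4', 'id_5']
```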
The above description covers only preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto; any equivalent replacement or modification of the technical solution and its inventive concept that a person skilled in the art could make within the scope disclosed by the present invention shall fall within the scope of protection of the present invention.
Claims (2)
1. A database query method based on self-attention syntactic perception, characterized in that: a self-attention mechanism and syntactic perception are introduced, multi-scale deep high-dimensional feature extraction is performed on the input natural language, and the final database query language is generated based on the high-dimensional semantic representation of the extracted query information;
the method comprises the following steps:
step 1: performing lexical analysis on the natural query language, including word segmentation, part-of-speech tagging and named entity recognition; performing abstract modeling on the natural language to obtain a standardized word sequence carrying data semantics;
step 2: performing syntactic analysis on the word sequence, and obtaining information in a high-dimensional semantic space through semantic information coding;
completing the syntactic analysis with an encoder-decoder architecture based on a self-attention mechanism, and realizing multi-level deep high-dimensional feature extraction to obtain the identification information required by the structured query statement;
representing the word-level text features as $X \in \mathbb{R}^{N \times D}$, where $\mathbb{R}^{N \times D}$ denotes the $N \times D$-dimensional real space, and $N$ and $D$ respectively denote the number of tokens and the feature dimension of a single token; a token is the basic unit of the word-level feature representation, used to convert a word or phrase into a real-valued vector that a computer can understand and process;
in the syntactic analysis model based on the self-attention mechanism, the self-attention mechanism is implemented as a multi-head self-attention network MSA, and the calculation process is as follows:

$\mathrm{MSA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \qquad \mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\mathsf{T}}}{\sqrt{d}}\right) V_i$

wherein $Q_i$, $K_i$ and $V_i$ respectively denote the Query, Key and Value features of the network of the $i$-th head; $X$ is the homologous input feature of $Q_i$, $K_i$ and $V_i$; $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ respectively denote the linear transformation layers that obtain $Q_i = X W_i^{Q}$, $K_i = X W_i^{K}$ and $V_i = X W_i^{V}$ from $X$, and the dimension of $Q_i$, $K_i$ and $V_i$ is $d$; $\mathrm{head}_i$ denotes the $i$-th self-attention branch of the multi-head self-attention; $h$ denotes the total number of branches of the multi-head self-attention; $(\cdot)^{\mathsf{T}}$ denotes the matrix transpose operation; the computational complexity of the multi-head self-attention is $O(N^2 D)$, where $O(\cdot)$ denotes computational complexity in computer science, and $N$ and $D$ are respectively the number of tokens in the word-level text feature representation and the feature dimension of a single token;
step 3: performing semantic information decoding on the high-dimensional-space semantic information to obtain semantic information in an intermediate structure, namely a set and a linked-list structure; the intermediate structure data is built bottom-up, and the data is divided, according to relational data rules, into multiple subtrees that independently represent database information, so as to eliminate semantic ambiguity in the data;
step 4: recombining the semantic information of the intermediate structure into a query target and query conditions, and finally generating a complete SQL statement;
the first step: dynamically acquiring the maximum rule length MAX and the minimum rule length MIN in the rule set, and setting Len equal to MAX; Len is a variable representing the maximum length of the intercepted word objects;
the second step: intercepting MAX word objects in reverse order from the end of the sentence; if a word is scanned, moving the pointer forward by one position, and forming the reverse-order parts of speech into a character string tryWord;
the third step: matching tryWord against the rules in the target rule set; tryWord denotes the string of reverse-order parts of speech extracted from the complete restricted Chinese query sentence, used for matching against the rules in the target rule set;
if the matching succeeds, a target phrase is considered to have been formed; the pointer is moved forward by Len objects, and the word object set is stored as one object into the target phrase set T';
if the matching fails, Len is decremented by 1, and the matching is performed again until Len equals MIN;
the fourth step: if the position at reverse order Len+1 in the word sequence of the whole sentence is a conjunction or a punctuation mark, this indicates that multiple query targets exist, and the matching process is repeated until the end of the sentence; if the Len+1 position is any other word, the recognition process ends.
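As an illustrative aside, the multi-head self-attention computation in claim 1 can be sketched in NumPy; the even split of the D feature dimensions across the h heads and the softmax scaling by sqrt(d) follow the standard Transformer convention, which is assumed here:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, h):
    """MSA over word-level features X in R^{N x D}.

    Wq, Wk, Wv, Wo are D x D weight matrices; the D feature dimensions are
    split evenly across the h heads (a standard convention, assumed here).
    """
    N, D = X.shape
    d = D // h                                   # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # homologous input X
    heads = []
    for i in range(h):
        s = slice(i * d, (i + 1) * d)
        # head_i = softmax(Q_i K_i^T / sqrt(d)) V_i -- N x N scores: O(N^2 D) overall
        A = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d))
        heads.append(A @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo   # Concat(head_1..head_h) W^O

rng = np.random.default_rng(0)
N, D, h = 5, 8, 2
X = rng.standard_normal((N, D))
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) for _ in range(4))
Y = multi_head_self_attention(X, Wq, Wk, Wv, Wo, h)   # Y has shape (5, 8)
```

The N x N attention matrix per head is where the quadratic O(N^2 D) cost stated in the claim comes from.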
2. The database query method based on self-attention syntactic perception according to claim 1, characterized in that step 2 comprises the following steps:
the first step: obtaining a representation vector of each word, the representation vector being obtained by adding the word feature and the position code;
the second step: forming the obtained word representations into a vector matrix and feeding it into the encoder; after being encoded by the stacked encoding modules of the encoder, the encoding information matrix of all the words is obtained;
the third step: feeding the encoding information matrix output by the encoder into the decoder; the decoder sequentially generates the next piece of query information according to the current database query information;
during training, query information that has not yet been predicted is masked by a masking operation.
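The first step of claim 2 (representation vector = word feature + position code) can be sketched as follows; the sinusoidal position code is the standard Transformer scheme, assumed here because the claim does not specify a formula:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal position codes (standard Transformer scheme, assumed here
    since the claim does not specify the encoding formula)."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def word_representations(token_ids, embedding_table):
    """First step of claim 2: representation vector = word feature + position code."""
    emb = embedding_table[token_ids]                  # word features
    return emb + positional_encoding(len(token_ids), emb.shape[1])

rng = np.random.default_rng(1)
table = rng.standard_normal((100, 16))                   # toy vocabulary of 100 words
X = word_representations(np.array([3, 17, 42]), table)   # 3 words, 16-dim vectors
```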
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311179624.XA CN116910086B (en) | 2023-09-13 | 2023-09-13 | Database query method and system based on self-attention syntax sensing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116910086A CN116910086A (en) | 2023-10-20 |
CN116910086B true CN116910086B (en) | 2023-12-01 |
Family
ID=88351574
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311179624.XA Active CN116910086B (en) | 2023-09-13 | 2023-09-13 | Database query method and system based on self-attention syntax sensing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116910086B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117931898B (en) * | 2024-03-25 | 2024-06-07 | 成都同步新创科技股份有限公司 | Multidimensional database statistical analysis method based on large model |
CN117992068A (en) * | 2024-04-02 | 2024-05-07 | 天津南大通用数据技术股份有限公司 | LSTM and TRM combined intelligent database grammar analysis method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101510221A (en) * | 2009-02-17 | 2009-08-19 | 北京大学 | Enquiry statement analytical method and system for information retrieval |
RU2013132622A (en) * | 2013-07-15 | 2015-01-20 | Общество с ограниченной ответственностью "Аби ИнфоПоиск" | SYSTEM AND SEMANTIC SEARCH METHOD |
CN114020768A (en) * | 2021-10-13 | 2022-02-08 | 华中科技大学 | Construction method and application of SQL (structured query language) statement generation model of Chinese natural language |
CN114896275A (en) * | 2022-04-15 | 2022-08-12 | 中国航空工业集团公司沈阳飞机设计研究所 | Method and system for converting natural language text into SQL statement |
CN115048447A (en) * | 2022-06-27 | 2022-09-13 | 华中科技大学 | Database natural language interface system based on intelligent semantic completion |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW476895B (en) * | 2000-11-02 | 2002-02-21 | Semcity Technology Corp | Natural language inquiry system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||