CN116414872B

CN116414872B - Data searching method and system based on natural language identification and knowledge graph

Info

Publication number: CN116414872B
Application number: CN202310404618.3A
Authority: CN
Inventors: 黄玉锋; 冯杰; 刘士毅; 姜超; 王治强
Original assignee: Zheshang Securities Co ltd
Current assignee: Zheshang Securities Co ltd
Priority date: 2023-04-11
Filing date: 2023-04-11
Publication date: 2024-02-20
Anticipated expiration: 2043-04-11
Also published as: CN116414872A

Abstract

The invention discloses a data searching method and system based on natural language identification and knowledge graph. The method comprises the following steps: recognizing a search sentence with an improved natural language model to obtain at least two query subjects and the logical relations between the subjects, and constructing an operation tree; resolving each query subject into related information based on a preset knowledge graph; locating the best-matching data item for each data item to be queried based on a preset similarity model, and constructing a data item query model; filling the data item query model and operator information into the operation tree to obtain a new operation tree and form a new operation process; and inputting the new operation tree into a database query calculation engine to obtain a calculation result. The invention uses a natural language model to recognize the subjects and logical relations in natural language; by maintaining the knowledge graph to resolve abstract questions and store operator calculation formulas, it allows the questions and operators supported by the search system to be extended laterally; and it accelerates the identification of the location of data items across multiple databases.

Description

Data searching method and system based on natural language identification and knowledge graph

Technical Field

The invention belongs to the technical field of searching, and particularly relates to a data searching method and system based on natural language identification and knowledge graph.

Background

At present, data search in the financial industry is mainly implemented with regular-expression (pattern) matching or with NL2SQL, each of which has its own strengths. Regular-expression matching is fast and works well on sentences that closely match an existing pattern; NL2SQL tolerates a wider variety of sentences and works well on sentences with simple logic. Both also have drawbacks. Regular-expression matching has two: first, a huge pattern library must be maintained, and the cost of resolving conflicts between patterns rises as the library grows; second, recognition is very poor when a sentence does not match, or only loosely matches, the existing patterns. NL2SQL performs poorly on sentences with complex logic, including nested relations.

In addition, the two technologies share defects that are fatal for a search implementation: neither can recognize abstract questions, such as 'who is better' or 'how'; neither supports operator extension, so only data that has already been calculated can be queried; and accuracy drops significantly when querying data across tables.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a data searching method and system based on natural language identification and knowledge graph.

To solve the above technical problems, the invention adopts the following technical solution:

A data searching method based on natural language identification and knowledge graph comprises the following steps:

recognizing an acquired search sentence based on an improved natural language model to obtain at least two query subjects and the logical relations between the subjects, and constructing an operation tree according to the query subjects and the logical relations;

resolving each query subject in the operation tree into related information based on a preset knowledge graph, wherein the related information comprises at least a data item to be queried, an index and operator information;

querying, based on a preset similarity model, the databases and data tables for the data item that best matches the data item to be queried, finding the specific location of the best-matching data item, and constructing a data item query model;

filling the data item query model and the operator information into the operation tree to obtain a new operation tree, and forming a new operation process based on the new operation tree;

inputting the new operation tree into a database query calculation engine, and obtaining a calculation result based on the new operation process.

In one implementation, the operation tree comprises a root node, a plurality of child nodes and a plurality of leaf nodes. The leaf nodes represent data items and the child nodes represent operator information. The root node is connected to child nodes, and each child node is connected to leaf nodes. The root node is the end point of the calculation; the child nodes, as the next-level nodes of the root node, provide its inputs; and the leaf nodes provide the inputs of the child nodes;

the operation process of the operation tree is as follows:

the data of the leaf nodes is used as the input of the operator information to obtain a result, the result is used as the input of the child node one level up, and this is repeated level by level;

the data of each child node is in turn processed to obtain a result, and the result is used as the input of the root node;

when the calculation at the root node is complete and there is no higher-level node, the calculation result is returned and the operation process of the operation tree is complete.

In one implementation, the construction and training process of the improved natural language model comprises the following steps:

fine-tuning the natural language model to obtain an improved natural language pre-trained model;

annotating query subjects and logical relations, and using the annotated query subjects and logical relations as the fine-tuning training dataset for the improved natural language pre-trained model;

and training and testing the improved natural language pre-trained model on the fine-tuning training dataset to obtain the improved natural language model.
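The annotation step above can be pictured with a small sketch. The source does not say how the annotated subjects and logical relations are encoded; the BIO tagging scheme, the label names (`COMPANY`, `LOGIC`, `INDEX`, `TIME`) and the sample sentence below are all assumptions chosen for illustration, not the patented format.

```python
def to_bio(tokens, spans):
    """Convert span annotations into one BIO tag per token.

    spans: list of (start, end_exclusive, label) over token indices.
    The BIO scheme is an assumed encoding; the source does not specify one.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the span
    return tags

# Hypothetical annotated search sentence with placeholder company names.
tokens = ["Company", "A", "and", "Company", "B", "net", "profit", "2022"]
spans = [
    (0, 2, "COMPANY"),
    (2, 3, "LOGIC"),     # the logical relation between the two subjects
    (3, 5, "COMPANY"),
    (5, 7, "INDEX"),
    (7, 8, "TIME"),
]
tags = to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Token/tag pairs of this kind could serve as fine-tuning samples for a token-classification model, though any comparable labeling format would fit the description equally well.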

In one implementation, querying the databases and data tables for the best-matching data item based on the preset similarity model, finding the specific location of the best-matching data item, and constructing the data item query model comprises the following steps:

vectorizing the name of the data item to be queried based on a preset word vector model to obtain a vectorization result;

inputting the vectorization result and the data item names of the database, pair by pair, into the preset similarity model to obtain the similarity between the name of the data item to be queried and each data item name in the whole database;

and selecting the data item with the highest similarity as the query target data item, and returning the path of that data item in the database as the matching result, which completes the data item name query flow and forms the data item query model.

In one implementation, when the preset similarity model is trained, the training sample set and the test sample set include at least misspelled words and short names.

In one implementation, inputting the new operation tree into the database query calculation engine to obtain a calculation result comprises the following steps:

the database query calculation engine starts from all leaf nodes of the new operation tree, traverses and executes the calculation tasks of the operator information, and obtains a first calculation result;

the first calculation result is input into all child nodes of the new operation tree, the calculation tasks of the data items are traversed and executed, and a second calculation result is obtained;

and the second calculation result is input into the root node of the new operation tree; once execution is complete the calculation result is obtained, the task ends, and the calculation result of the new operation tree is returned.

As an embodiment, the method further comprises the steps of:

matching a corresponding chart template according to the calculation result, and presenting the data and the calculation result on a search result page, wherein the chart templates comprise one of, or a combination of, tables, bar charts, line charts, pie charts and search result pages.

A data search system based on natural language identification and knowledge graph comprises an identification construction module, an analysis module, a matching module, a supplementing module and a result output module;

the identification construction module is used for recognizing the acquired search sentence based on the improved natural language model to obtain at least two query subjects and the logical relations between the subjects, and constructing an operation tree according to the query subjects and the logical relations;

the analysis module is used for resolving each query subject in the operation tree into related information based on a preset knowledge graph, wherein the related information comprises at least a data item to be queried, an index and operator information;

the matching module queries, based on a preset similarity model, the databases and data tables for the data item that best matches the data item to be queried, finds the specific location of the best-matching data item, and constructs a data item query model;

the supplementing module is used for filling the data item query model and the operator information into the operation tree to obtain a new operation tree, and forming a new operation process based on the new operation tree;

and the result output module is used for inputting the new operation tree into a database query calculation engine and obtaining a calculation result based on the new operation process.

As an embodiment, the matching module is configured to:

A computer readable storage medium storing a computer program which when executed by a processor performs the method of:

A data search device based on natural language recognition and knowledge graph, comprising a memory, a processor and a computer program stored in the memory and running on the processor, which when executed by the processor, implements the method of:

By adopting the above technical solution, the invention achieves the following notable technical effects:

the invention better recognizes the subjects and logical relations in natural language by using the natural language model;

merely by maintaining the knowledge graph, which resolves abstract questions and stores operator calculation formulas, the questions and operators supported by the search system can be extended laterally;

the use of machine learning algorithms can speed up the identification of the location of data items in multiple databases.

Drawings

To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic overall flow diagram of the present invention;

FIG. 2 is a method for constructing an operation tree according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an example knowledge graph provided in an embodiment of the present invention;

FIG. 4 is a schematic flow chart of a preset similarity model process according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an operation flow of a database query computing engine according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of the overall structure of the system of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following examples, which are illustrative of the present invention and are not intended to limit the present invention thereto.

Example 1:

A data searching method based on natural language identification and knowledge graph, as shown in FIG. 1, comprises the following steps:

S100, recognizing an acquired search sentence based on an improved natural language model to obtain at least two query subjects and the logical relations between the subjects, and constructing an operation tree according to the query subjects and the logical relations;

S200, resolving each query subject in the operation tree into related information based on a preset knowledge graph, wherein the related information comprises at least a data item to be queried, an index and operator information;

S300, querying, based on a preset similarity model, the databases and data tables for the data item that best matches the data item to be queried, finding the specific location of the best-matching data item, and constructing a data item query model;

S400, filling the data item query model and the operator information into the operation tree to obtain a new operation tree, and forming a new operation process based on the new operation tree;

S500, inputting the new operation tree into a database query calculation engine, and obtaining a calculation result based on the new operation process.

To address the poor and inaccurate recognition of existing search techniques, the technical solution of the invention works as follows. The improved natural language model better recognizes the subjects and logical relations in natural language. The preset knowledge graph resolves the subjects and logical relations and stores operator calculation formulas, so that the questions and operators supported by the search system can be extended laterally. The preset similarity model accelerates locating data items across multiple databases: the specific location of the best-matching data item is obtained, a data item query model is constructed, and the operation tree is filled in to obtain a new operation tree, on which a new operation process is based. The new operation tree is then input into the database query calculation engine, which obtains the calculation result based on the new operation process.

Specifically, the operation tree can be described with reference to FIG. 2:

the constructed operation tree comprises a root node, a plurality of child nodes and a plurality of leaf nodes. The leaf nodes represent data items and the child nodes represent operator information. The root node is connected to child nodes, and the child nodes are connected to leaf nodes. The root node is the end point of the calculation; the child nodes, as the next-level nodes of the root node, provide its inputs; and the leaf nodes provide the inputs of the child nodes;

the operation process of the operation tree is as follows: the data of the leaf nodes is used as the input of the operator information to obtain a result, the result is used as the input of the child node one level up, and this is repeated level by level; the data of each child node is in turn processed to obtain a result, and the result is used as the input of the root node; when the calculation at the root node is complete and there is no higher-level node, the calculation result is returned and the operation process of the operation tree is complete.

The operation tree of FIG. 2 represents the query "which is higher, the net profit of China Petroleum or that of China Petrochemical". Leaf nodes in the operation tree are generally specific data (such as data item names, years, enterprise names and stock codes), and child nodes are generally operator information (data acquisition is also regarded as an operator). Calculation starts at the leaf nodes, whose values serve as the input of the node one level up. Each child node takes the data of all its next-level nodes as the input of its operator and passes the calculated result up as the input of its own parent, and so on. The root node is the end point of the calculation: once the calculation at the root node is complete there is no higher-level node, so the result is returned. This marks the completion of the calculation flow of the operation tree, that is, the whole operation process, and yields the final calculation result.
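The bottom-up evaluation described above can be sketched as a small recursive structure. This is a minimal illustration under stated assumptions, not the patented implementation: the class names `Leaf` and `Op`, the helper functions `fetch` and `higher`, the placeholder companies and the figures in `SAMPLE_DATA` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Union

# Made-up figures for two placeholder companies (not real data).
SAMPLE_DATA = {
    ("Company A", "net_profit", 2022): 120.0,
    ("Company B", "net_profit", 2022): 95.0,
}

@dataclass
class Leaf:
    """Leaf node: a concrete value such as a company name, data item name or year."""
    value: Any

@dataclass
class Op:
    """Operator node: applies `func` to the list of results of its children."""
    name: str
    func: Callable[[List[Any]], Any]
    children: List[Union["Op", Leaf]] = field(default_factory=list)

def evaluate(node: Union[Op, Leaf]) -> Any:
    """Evaluate bottom-up: leaves feed operators, whose results feed the level above."""
    if isinstance(node, Leaf):
        return node.value
    inputs = [evaluate(child) for child in node.children]
    return node.func(inputs)

def fetch(args: List[Any]) -> tuple:
    """Data acquisition treated as an operator: (company, item, year) -> (company, value)."""
    company, item, year = args
    return company, SAMPLE_DATA[(company, item, year)]

def higher(args: List[tuple]) -> str:
    """Root operator: return the company whose value is larger."""
    return max(args, key=lambda pair: pair[1])[0]

# "Which is higher, the net profit of Company A or that of Company B?"
tree = Op("higher", higher, [
    Op("fetch", fetch, [Leaf("Company A"), Leaf("net_profit"), Leaf(2022)]),
    Op("fetch", fetch, [Leaf("Company B"), Leaf("net_profit"), Leaf(2022)]),
])
result = evaluate(tree)
print(result)
```

Evaluation ends at the root because the recursion returns once the root's own operator has been applied, which mirrors the "no higher-level node" stopping condition in the text.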

In the invention, the improved natural language model is constructed and trained in much the same way as other neural network models. To obtain a model better suited to financial corpora, the natural language model only needs to be fine-tuned. Recognizing the acquired search sentences with the improved natural language model makes the recognition result more accurate. The construction and training process of the improved natural language model comprises the following steps:

FIG. 3 is an example diagram of a preset knowledge graph according to an embodiment of the present invention. The example uses the valuation indexes of an enterprise, which include the price-to-earnings (P/E) ratio, the price-to-book (P/B) ratio, the price-to-sales (P/S) ratio, the information ratio, and the like. The P/E ratio node is expanded into variants such as dynamic, TTM and forecast. P/E-TTM is calculated as the total market value divided by the net profit processed by the TTM operator, and the inputs of the calculation must be strictly aligned in time. The knowledge graph may of course cover other indexes, which are not listed here.
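How a knowledge graph might store an operator formula and resolve it can be sketched as follows. The node names (`pe_ttm`, `total_market_value`, `net_profit`), the dictionary layout and the input figures are assumptions for illustration. The `ttm` operator is taken here to mean the sum of the four most recent quarterly values, which is the usual reading of "trailing twelve months" but is not spelled out in the source.

```python
# Hypothetical knowledge graph: indexes carry a formula, data items are raw inputs.
KNOWLEDGE_GRAPH = {
    "valuation": {"type": "category", "children": ["pe"]},
    "pe": {"type": "index", "children": ["pe_dynamic", "pe_ttm", "pe_forecast"]},
    # Ratio from the text: total market value / (net profit processed by the TTM operator).
    "pe_ttm": {
        "type": "index",
        "formula": ("divide", ["total_market_value", ("ttm", ["net_profit"])]),
    },
    "total_market_value": {"type": "data_item"},
    "net_profit": {"type": "data_item"},
}

# Operator calculation formulas, stored separately so new ones can be added
# without touching the resolver.
OPERATORS = {
    "divide": lambda a, b: a / b,
    "ttm": lambda quarterly: sum(quarterly[-4:]),  # assumed: last four quarters
}

def compute(expr, data):
    """Resolve a graph node name or an (operator, args) expression into a value."""
    if isinstance(expr, str):
        node = KNOWLEDGE_GRAPH[expr]
        if node["type"] == "data_item":
            return data[expr]
        return compute(node["formula"], data)
    op, args = expr
    return OPERATORS[op](*[compute(arg, data) for arg in args])

# Made-up inputs: quarterly net profit, oldest first, and a total market value.
data = {"total_market_value": 1000.0, "net_profit": [5.0, 10.0, 20.0, 30.0, 40.0]}
pe_ttm = compute("pe_ttm", data)
print(pe_ttm)
```

Keeping formulas as data in the graph is what would let a new index or operator be supported by adding entries rather than changing code, which is the lateral extension the text describes. The strict time alignment the text requires is not modeled in this sketch.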

In one embodiment, step S300, in which the databases and data tables are queried for the best-matching data item based on the preset similarity model, the specific location of the best-matching data item is found, and the data item query model is constructed, comprises the following steps:

S310, vectorizing the name of the data item to be queried based on a preset word vector model to obtain a vectorization result;

S320, inputting the vectorization result and the data item names of the database, pair by pair, into the preset similarity model to obtain the similarity between the name of the data item to be queried and each data item name in the whole database;

S330, selecting the data item with the highest similarity as the query target data item, and returning the path of that data item in the database as the matching result, which completes the data item name query flow and forms the data item query model.

In addition, when the preset similarity model is trained, the training sample set and the test sample set include at least misspelled words and short names.

Step S300 is described in detail with reference to FIG. 4, which shows the process of matching a data item to be queried based on the preset similarity model according to an embodiment of the present invention. The name of the data item to be queried identifies the target data item to be located. The name supplied may be a misspelling, a short name or an English abbreviation of the target data item's name; because the similarity model is trained on such cases, its recognition rate is higher.

The name must first be vectorized before it can be processed by the other models. The similarity model is a machine learning model or a deep learning model whose main task is to calculate the similarity of two words; because samples such as misspelled words and short names are included in model training, its fuzzy matching ability is relatively strong. The similarity between the name of the data item to be queried and the name of every data item in the whole database is then calculated pair by pair to obtain a similarity table. The data item with the highest similarity is selected as the target data item, and its path in the database is returned as the matching result. This completes the data item name query flow.
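The matching flow of steps S310 to S330 can be sketched with simple stand-ins. The source uses a preset word vector model and a trained similarity model; the sketch below swaps in character-bigram counts and cosine similarity so that it runs with the standard library alone. The catalog entries and database paths are hypothetical.

```python
import math
from collections import Counter

# Hypothetical catalog: data item name -> path in the database.
CATALOG = {
    "net profit": "finance_db.income_statement.net_profit",
    "operating revenue": "finance_db.income_statement.operating_revenue",
    "total market value": "market_db.daily_quote.total_market_value",
}

def vectorize(name: str) -> Counter:
    """S310 stand-in: represent a name as character-bigram counts."""
    text = name.lower().replace(" ", "")
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a: Counter, b: Counter) -> float:
    """S320 stand-in: cosine similarity between two bigram count vectors."""
    dot = sum(a[key] * b[key] for key in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(query: str):
    """S330: score the query against every catalog name and return the best one."""
    query_vec = vectorize(query)
    scores = {name: cosine(query_vec, vectorize(name)) for name in CATALOG}
    best = max(scores, key=scores.get)
    return best, CATALOG[best], scores[best]

# A misspelled query still resolves to the intended data item.
name, path, score = best_match("net proft")
print(name, path, round(score, 3))
```

Bigram overlap tolerates small typos but would not recognize abbreviations or synonyms; handling those is the reason the text calls for a model trained on misspellings and short names.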

In one embodiment, step S400 fills the data item query model and the operator information into the operation tree to obtain a new operation tree, and forms a new operation process based on the new operation tree. Filling in the original operation tree yields a more complete new operation tree, which can again be explained with reference to FIG. 2. Leaf nodes are generally specific data items (such as data item names, years, enterprise names and stock codes, plus any additionally added data items), and child nodes are generally operator information (data acquisition is also regarded as an operator, and other updated operator information may be present). Calculation starts at the leaf nodes, whose values serve as the input of the node one level up. Each child node takes the data of all its next-level nodes as the input of its operator and passes the calculated result up as the input of its own parent, and so on. The root node is the end point of the calculation: once the calculation at the root node is complete there is no higher-level node, so the result is returned. This marks the completion of the calculation flow of the operation tree, that is, the whole operation process, and yields the final calculation result.
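The filling step can be sketched as a function that copies a draft tree while attaching database paths to data-item leaves and callables to operator nodes. The nested-dictionary layout, the `kind` and `path` keys and the sample paths are assumptions made for this illustration.

```python
def fill_tree(node, item_paths, operators):
    """Return a new tree with query paths and operator functions filled in.

    The draft tree is left unmodified so the original remains available.
    """
    if "children" in node:  # operator node
        return {
            "operator": node["operator"],
            "func": operators[node["operator"]],
            "children": [fill_tree(child, item_paths, operators) for child in node["children"]],
        }
    filled = dict(node)  # leaf node: shallow copy
    if node["kind"] == "data_item":
        filled["path"] = item_paths[node["value"]]  # result of the similarity match
    return filled

# Draft tree as it might look after recognition, before filling (hypothetical).
draft_tree = {"operator": "higher", "children": [
    {"operator": "fetch", "children": [
        {"kind": "company", "value": "Company A"},
        {"kind": "data_item", "value": "net profit"},
    ]},
    {"operator": "fetch", "children": [
        {"kind": "company", "value": "Company B"},
        {"kind": "data_item", "value": "net profit"},
    ]},
]}
item_paths = {"net profit": "finance_db.income_statement.net_profit"}
operators = {"fetch": lambda args: args, "higher": lambda args: max(args)}

new_tree = fill_tree(draft_tree, item_paths, operators)
print(new_tree["children"][0]["children"][1]["path"])
```

Returning a new tree instead of mutating the draft matches the text's distinction between the original operation tree and the new operation tree.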

In step S500, inputting the new operation tree into the database query calculation engine to obtain a calculation result comprises the following steps:

S510, the database query calculation engine starts from all leaf nodes of the new operation tree, traverses and executes the calculation tasks of the operator information, and obtains a first calculation result;

S520, the first calculation result is input into all child nodes of the new operation tree, the calculation tasks of the data items are traversed and executed, and a second calculation result is obtained;

S530, the second calculation result is input into the root node of the new operation tree; once execution is complete the calculation result is obtained, the task ends, and the calculation result of the new operation tree is returned.
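The three stages of S510 to S530 can be sketched against an in-memory SQLite database. Everything concrete here is an assumption: the table `income_statement`, its columns, the made-up figures, the tree layout and the two operators. The source does not describe the engine's internals, so this is one possible reading rather than the engine itself.

```python
import sqlite3

# In-memory database with made-up figures for two placeholder companies.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE income_statement (company TEXT, year INTEGER, net_profit REAL)")
conn.executemany(
    "INSERT INTO income_statement VALUES (?, ?, ?)",
    [("Company A", 2021, 100.0), ("Company A", 2022, 120.0),
     ("Company B", 2021, 50.0), ("Company B", 2022, 95.0)],
)

OPERATORS = {
    "yoy_growth": lambda current, previous: (current - previous) / previous,
    "higher": lambda results: max(results, key=lambda pair: pair[1])[0],
}

def leaf(company, year):
    """Build a leaf: a data item query model pointing at a table and column."""
    return {"table": "income_statement", "column": "net_profit",
            "company": company, "year": year}

def fetch(node):
    """Run one leaf query. Table and column names come from the trusted catalog,
    while the user-derived values are passed as bound parameters."""
    sql = f"SELECT {node['column']} FROM {node['table']} WHERE company = ? AND year = ?"
    return conn.execute(sql, (node["company"], node["year"])).fetchone()[0]

def run(tree):
    second = []
    for child in tree["children"]:
        first = [fetch(item) for item in child["leaves"]]    # stage 1: leaf queries
        value = OPERATORS[child["operator"]](*first)         # stage 2: child operators
        second.append((child["label"], value))
    return OPERATORS[tree["root"]](second)                   # stage 3: root node

# "Which company had the higher year-over-year net profit growth in 2022?"
tree = {"root": "higher", "children": [
    {"label": "Company A", "operator": "yoy_growth",
     "leaves": [leaf("Company A", 2022), leaf("Company A", 2021)]},
    {"label": "Company B", "operator": "yoy_growth",
     "leaves": [leaf("Company B", 2022), leaf("Company B", 2021)]},
]}
answer = run(tree)
print(answer)
```

With these figures Company A grows by 20% and Company B by 90%, so the root returns Company B. The sketch handles exactly one level of child operators; a deeper tree would need the recursive traversal described for the operation tree.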

Finally, the method further comprises the following steps:

The method of the present invention is described in detail below in connection with search examples:

First, the improved natural language model is used to recognize the subjects, and the logical relations between them, in the search sentence submitted by the user. The subjects generally include companies, indexes, times, calculation formulas and the like; the logical relations between subjects include basic relations such as AND and OR, as well as the scope over which a subject such as a time applies. A tree is then constructed from the recognized subjects and logical relations according to the rules; the tree is drawn as described above.

Second, the preset knowledge graph is used to resolve each query subject into data items, indexes or operators. Data items are items in the database, such as net profit and operating revenue; indexes are indicators of listed companies, such as the price-to-earnings ratio, the price-to-book ratio and the debt ratio; operators are fixed calculation formulas such as year-over-year growth and period-over-period growth. These data items, indexes and operators are all stored and maintained in the knowledge graph.

Third, the preset similarity model is used to find the database and data table to which each data item belongs, and a query function is constructed. The query function mainly supplies the raw data on which the operation tree operates, so it is generally executed first in the operation tree; the data may reside in a plurality of databases or data tables.

Fourth, the data item query model and the operator information are filled into the constructed operation tree to form a complete new operation process from data to operation.

Fifth, the new operation tree is input into the database query calculation engine, and a calculation result is obtained based on the new operation process.

Finally, a chart template is matched according to the calculation result, and the data and the calculation result are presented on a search result page. The chart templates comprise tables, bar charts, line charts, pie charts and search result pages, and the final page uses one of the templates or a combination of several.
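The final template-matching step can be sketched as a rule-based selector. The source names the available templates but not the matching rules, so every rule below (time series to line chart, shares summing to one to pie chart, and so on) is an assumption made purely for illustration.

```python
def choose_template(result):
    """Pick a chart template from the shape of a calculation result.

    The rules are illustrative assumptions; the source only lists the
    templates (table, bar chart, line chart, pie chart, search result page).
    """
    if not isinstance(result, dict) or not result:
        return "search result page"            # plain answer, e.g. a company name
    values = list(result.values())
    if all(isinstance(v, (list, tuple)) for v in values):
        return "line chart"                    # one series of values per subject
    if not all(isinstance(v, (int, float)) for v in values):
        return "table"                         # mixed content: fall back to a table
    if all(v >= 0 for v in values) and abs(sum(values) - 1.0) < 1e-9:
        return "pie chart"                     # non-negative shares of a whole
    return "bar chart" if len(values) <= 12 else "table"

print(choose_template("Company A"))
print(choose_template({"Company A": [100.0, 120.0], "Company B": [50.0, 95.0]}))
print(choose_template({"Company A": 0.25, "Company B": 0.75}))
print(choose_template({"Company A": 120.0, "Company B": 95.0}))
```

A combined page, which the text also allows, could be produced by calling such a selector once per part of the result and laying the chosen templates out together.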

Example 2:

A data search system based on natural language recognition and knowledge graph, as shown in FIG. 6, comprises a recognition construction module 100, an analysis module 200, a matching module 300, a supplementing module 400 and a result output module 500;

the recognition construction module 100 recognizes the acquired search sentence based on the improved natural language model to obtain at least two query subjects and the logical relations between the subjects, and constructs an operation tree according to the query subjects and the logical relations;

the analysis module 200 resolves each query subject in the operation tree into related information based on a preset knowledge graph, wherein the related information comprises at least a data item to be queried, an index and operator information;

the matching module 300 queries, based on a preset similarity model, the databases and data tables for the data item that best matches the data item to be queried, finds the specific location of the best-matching data item, and constructs a data item query model;

the supplementing module 400 is configured to fill the data item query model and the operator information into the operation tree to obtain a new operation tree, and to form a new operation process based on the new operation tree;

the result output module 500 is configured to input the new operation tree into the database query calculation engine, and to obtain a calculation result based on the new operation process.
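The way the five modules hand data to one another can be sketched as a thin pipeline class. The class name `SearchSystem`, the callable-per-module design and the stub behaviors are assumptions; each stub only stands in for the real module so that the data flow can be run.

```python
class SearchSystem:
    """Wires the five modules together; each module is supplied as a callable."""

    def __init__(self, recognize, analyze, match, supplement, output):
        self.recognize = recognize      # module 100: sentence -> operation tree
        self.analyze = analyze          # module 200: tree -> data items and operators
        self.match = match              # module 300: data items -> data item query model
        self.supplement = supplement    # module 400: fill the tree -> new operation tree
        self.output = output            # module 500: new tree -> calculation result

    def search(self, sentence):
        tree = self.recognize(sentence)
        info = self.analyze(tree)
        query_model = self.match(info["data_items"])
        new_tree = self.supplement(tree, query_model, info["operators"])
        return self.output(new_tree)

# Trivial stand-ins so the hand-offs can be exercised end to end.
system = SearchSystem(
    recognize=lambda s: {"sentence": s, "subjects": ["net profit"]},
    analyze=lambda tree: {"data_items": tree["subjects"], "operators": ["fetch"]},
    match=lambda items: {item: "db.table." + item.replace(" ", "_") for item in items},
    supplement=lambda tree, qm, ops: {**tree, "query_model": qm, "operators": ops},
    output=lambda new_tree: sorted(new_tree["query_model"].values()),
)
answer = system.search("net profit of Company A")
print(answer)
```

Passing the modules in as callables keeps each one replaceable, for example swapping the matching module for a different similarity model without touching the rest of the pipeline.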

Specifically, the matching module 300 is configured to:

inputting the vectorization result and the data item names of the database into a preset similarity model pair by pair to obtain the similarity between the data item names to be queried and the data item names of the whole database;

In this specification, the embodiments are described progressively: each embodiment focuses on its differences from the others, and for parts that are identical or similar the embodiments may be referred to one another.

It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It should be noted that:

reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, the appearances of the phrase "one embodiment" or "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.

In addition, the specific embodiments described in the present specification may differ in terms of parts, shapes of components, names, and the like. All equivalent or simple changes of the structure, characteristics and principle according to the inventive concept are included in the protection scope of the present invention. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions in a similar manner without departing from the scope of the invention as defined in the accompanying claims.

Claims

1. The data searching method based on natural language identification and knowledge graph is characterized by comprising the following steps:

inputting the new operation tree into a database query calculation engine, and obtaining a calculation result based on the new operation process;

the method comprises the steps of inquiring the best matching data item position of the data item to be inquired in a database and a data table based on a preset similarity model, finding the specific position of the best matching data item, and constructing a data item inquiry model, and comprises the following steps:

2. The data searching method based on natural language identification and knowledge graph according to claim 1, wherein the operation tree comprises a root node, a plurality of child nodes and a plurality of leaf nodes, the leaf nodes represent data items, the child nodes represent operator information, one root node is connected with the child nodes, one child node is connected with the leaf nodes, the root node is used as a calculation endpoint, the child nodes are used as inputs of all next-level points of the root node child nodes, and the leaf nodes are used as inputs of the child nodes;

the operation process of the operation tree is as follows:

3. The data searching method based on natural language identification and knowledge graph according to claim 1, wherein the construction and training process of the improved natural language model comprises the following steps:

4. The method for searching data based on natural language recognition and knowledge graph according to claim 1, wherein the training sample set and the test sample set at least comprise wrongly written words and abbreviations when training the predetermined similarity model.

5. The data searching method based on natural language identification and knowledge graph according to claim 1, wherein the step of inputting the new operation tree into a database query calculation engine to obtain a calculation result comprises the following steps:

6. The data searching method based on natural language identification and knowledge graph according to claim 1, further comprising the steps of:

7. The data searching system based on natural language identification and knowledge graph is characterized by comprising an identification construction module, an analysis module, a matching module, a supplementing module and a result output module;

the result output module is used for inputting the new operation tree into a database query calculation engine and obtaining a calculation result based on the new operation process;

wherein the matching module is configured to:

8. A computer readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1 to 6.

9. A data search device based on natural language recognition and knowledge graph comprising a memory, a processor and a computer program stored in the memory and running on the processor, characterized in that the processor implements the method according to any one of claims 1 to 6 when executing the computer program.