CN115329753B - Intelligent data analysis method and system based on natural language processing

Info

Publication number: CN115329753B
Application number: CN202211252819.8A
Authority: CN
Inventors: 刘沂鑫; 周丞
Original assignee: Beijing Yihui Information Technology Co ltd
Current assignee: Beijing Yihui Information Technology Co ltd
Priority date: 2022-10-13
Filing date: 2022-10-13
Publication date: 2023-03-24
Anticipated expiration: 2042-10-13
Also published as: CN115329753A

Abstract

The invention provides an intelligent data analysis method and system based on natural language processing, which relates to data analysis systems. The method comprises the following steps: providing a search bar for receiving user input text to receive natural language text; analyzing the received natural language text based on a natural language processing system to obtain a plurality of character string fragments corresponding to the semantics of the natural language text, and generating a TOKEN set of a database grammar based on the plurality of character string fragments; generating a query graph for the TOKEN set using a finite state machine representing query syntax; determining a sequence of each TOKEN in the TOKEN set based on the query graph to form a database query; and invoking a search behavior of the database using a query instruction based on the database query to obtain search results. The invention improves data analysis efficiency, is convenient for ordinary users, and greatly improves the working efficiency of non-IT professionals (ordinary users) who directly analyze a large amount of mixed data from multiple data sources (non-relational data sources).

Description

Intelligent data analysis method and system based on natural language processing

Technical Field

The invention relates to the technical field of data analysis, in particular to an intelligent data analysis method and system based on natural language processing.

Background

Conventional data analysis systems may collect, analyze, and take actions on data contained in data sources. The data source may be an internal, external, local, or remote computing device associated with the data analysis system. For example, the external remote data source may be a server connected to the data analysis system through a computer network.

Existing data analysis systems have a number of disadvantages. They are designed for use by information technology (IT) professionals, not by end users. These systems use an extract, transform, and load (ETL) pipeline to extract data from data sources and store the extracted data in a centralized data warehouse or data lake. As a result, they provide only partial and stale data for querying and analysis, and thus do not meet the data analysis needs of modern organizations. Analysts typically spend a significant amount of time collecting and preparing data rather than actually analyzing it with business intelligence (BI) tools. Examples of BI tools with analysis or visualization functionality include Tableau, Power BI, R, and Python. These tools operate primarily on data residing in a single small relational database. However, non-relational data sources are widely used in modern organizations, such as Hadoop, cloud storage (e.g., S3, Microsoft Azure Blob Storage), and NoSQL databases (e.g., MongoDB, Elasticsearch, Cassandra).

Furthermore, data is typically distributed among different data sources, so a user cannot simply connect a BI tool to any combination of data sources. The connection mechanism is typically too slow, queries often fail, the amount of raw data is too large or complex, and the data is typically of a mixed type.

In addition, users seeking flexible access to data analysis systems often circumvent security measures by downloading or extracting data into unsecured, unsupervised systems (e.g., spreadsheets, standalone databases, and BI servers) for subsequent analysis.

Users therefore need to be able to access, explore and analyze a large amount of mixed data from distributed data sources without bearing the burden of a rigid data analysis system intended mainly for IT professionals. How to design a data analysis system that improves data analysis efficiency and is convenient for ordinary users has become a technical problem to be solved.

Disclosure of Invention

The invention aims to solve at least one of the technical problems in the prior art or the related art, and discloses an intelligent data analysis method and system based on natural language processing.

In a first aspect, the invention discloses an intelligent data analysis method based on natural language processing, which comprises the following steps: providing a search bar for receiving user input text to receive natural language text; analyzing the received natural language text based on a natural language processing system to obtain a plurality of character string fragments corresponding to the semantics of the natural language text, and generating a TOKEN set of a database grammar based on the plurality of character string fragments; generating a query graph for the TOKEN set using a finite state machine representing query syntax; determining a sequence of each TOKEN in the TOKEN set based on the query graph to form a database query; and invoking a search behavior of the database using a query instruction based on the database query to obtain search results.
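The step of matching character string fragments to TOKENs can be illustrated with a minimal sketch. The text does not specify how fragments are matched, so the vocabulary `VOCAB`, the TOKEN type names and the greedy longest-match strategy below are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str      # hypothetical TOKEN type, e.g. "MEASURE"
    value: str     # hypothetical database-side name
    fragment: str  # the string fragment that produced this TOKEN


# Hypothetical vocabulary mapping natural-language fragments to TOKENs.
VOCAB = {
    "total": ("AGGREGATE", "SUM"),
    "sales": ("MEASURE", "sales_amount"),
    "by": ("KEYWORD", "GROUP_BY"),
    "region": ("ATTRIBUTE", "region"),
    "last year": ("FILTER", "year = last_year"),
}


def tokenize(text, vocab=VOCAB, max_words=3):
    """Greedy longest-match of word fragments against the vocabulary."""
    words = text.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest fragment first so "last year" wins over "last".
        for n in range(min(max_words, len(words) - i), 0, -1):
            fragment = " ".join(words[i:i + n])
            if fragment in vocab:
                kind, value = vocab[fragment]
                tokens.append(Token(kind, value, fragment))
                i += n
                break
        else:
            i += 1  # no TOKEN matches this word; skip it
    return tokens
```

Words that match no vocabulary entry (such as "show" or "me") are simply skipped in this sketch; a real natural language processing system would likely use semantic matching rather than exact lookup.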

According to the intelligent data analysis method based on natural language processing disclosed by the present invention, preferably, in the step of generating the query graph for the TOKEN set by using the finite state machine representing the query syntax: nodes of the finite state machine represent TOKEN types, directed edges of the finite state machine represent valid transitions between the TOKEN types in the query syntax, nodes of the query graph correspond to the respective TOKENs in the TOKEN set, and the directed edges of the query graph represent transitions between two TOKENs in the TOKEN ordering.
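As a rough illustration of this relationship, the sketch below derives a query graph from a finite state machine. The grammar `FSM`, its TOKEN type names and the function `build_query_graph` are hypothetical; the only idea taken from the text is that two TOKENs are connected when the finite state machine allows a transition between their types.

```python
# Hypothetical query grammar as a finite state machine:
# each TOKEN type maps to the TOKEN types that may validly follow it.
FSM = {
    "AGGREGATE": {"MEASURE"},
    "MEASURE": {"KEYWORD", "FILTER"},
    "KEYWORD": {"ATTRIBUTE"},
    "ATTRIBUTE": {"FILTER", "KEYWORD"},
    "FILTER": set(),
}


def build_query_graph(tokens, fsm):
    """Return the query graph as {token index: [indices that may follow]}.

    tokens is a list of (kind, value) pairs in arbitrary (user) order.
    """
    graph = {i: [] for i in range(len(tokens))}
    for i, (kind_i, _) in enumerate(tokens):
        for j, (kind_j, _) in enumerate(tokens):
            # Add an edge when the grammar allows kind_i -> kind_j.
            if i != j and kind_j in fsm.get(kind_i, ()):
                graph[i].append(j)
    return graph
```

Note that the resulting graph may contain cycles (here an ATTRIBUTE may follow a KEYWORD and vice versa), which is why a later step may need to remove edges.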

According to the intelligent data analysis method based on natural language processing disclosed by the present invention, preferably, the step of determining the sequence of each TOKEN in the TOKEN set based on the query graph to form the database query specifically includes: determining a weight of a directed edge from a source node to a destination node of the query graph, corresponding to a first TOKEN to a second TOKEN of the set of TOKENs, wherein the weight is determined based on a grammatical weight of the directed edge from a node of the finite state machine representing a TOKEN type of the first TOKEN to a node of the finite state machine representing a TOKEN type of the second TOKEN, the grammatical weight indicating a frequency of transitions from the TOKEN of the first TOKEN type to the TOKEN of the second TOKEN type in the query grammar.
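One way to read "frequency of transitions" is as a count of how often one TOKEN type follows another in example queries. The sketch below takes that reading; the example corpus and the function names `grammar_weights` and `edge_weights` are assumptions, not something the text defines.

```python
from collections import Counter


def grammar_weights(type_sequences):
    """Count how often each TOKEN type is directly followed by another."""
    counts = Counter()
    for seq in type_sequences:
        counts.update(zip(seq, seq[1:]))
    return counts


def edge_weights(tokens, grammar):
    """Weight each query-graph edge by the grammar weight of its type pair.

    tokens is a list of (kind, value) pairs; the result maps
    (source index, destination index) to a weight.
    """
    weights = {}
    for i, (kind_i, _) in enumerate(tokens):
        for j, (kind_j, _) in enumerate(tokens):
            if i != j and grammar[(kind_i, kind_j)] > 0:
                weights[(i, j)] = grammar[(kind_i, kind_j)]
    return weights
```

Raw counts are used here for simplicity; normalizing them to relative frequencies would be an equally plausible choice.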

According to the intelligent data analysis method based on natural language processing disclosed by the invention, preferably, the method further comprises the following steps: one or more directed edges are removed from the query graph to form an acyclic query graph, and an order of the TOKENs in the TOKEN set is determined based on the acyclic query graph.

According to the intelligent data analysis method based on natural language processing disclosed by the invention, preferably, the query graph is generated by applying the Eades algorithm.
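The text does not describe the named algorithm. Assuming it refers to the greedy feedback-arc-set heuristic of Eades, Lin and Smyth, the removal of directed edges to obtain an acyclic query graph might be sketched as follows. The function names are hypothetical, and this unweighted variant ignores edge weights.

```python
def eades_order(nodes, edges):
    """Greedy vertex ordering: sinks go to the back, sources to the front,
    otherwise pick the vertex with the largest (out-degree - in-degree)."""
    remaining = list(nodes)
    live = {(a, b) for a, b in edges if a != b}
    head, tail = [], []

    def degrees(v):
        out_deg = sum(1 for a, _ in live if a == v)
        in_deg = sum(1 for _, b in live if b == v)
        return out_deg, in_deg

    while remaining:
        sinks = [v for v in remaining if degrees(v)[0] == 0]
        sources = [v for v in remaining if degrees(v)[1] == 0]
        if sinks:
            v = sinks[0]
            tail.insert(0, v)
        elif sources:
            v = sources[0]
            head.append(v)
        else:
            v = max(remaining, key=lambda u: degrees(u)[0] - degrees(u)[1])
            head.append(v)
        remaining.remove(v)
        live = {(a, b) for a, b in live if v not in (a, b)}
    return head + tail


def make_acyclic(nodes, edges):
    """Keep only edges that point forward in the greedy ordering."""
    position = {v: i for i, v in enumerate(eades_order(nodes, edges))}
    return [(a, b) for a, b in edges if position[a] < position[b]]
```

Because every kept edge points forward in a single linear ordering, the result cannot contain a cycle. A weighted variant would be needed if the method should prefer to keep the heavier of two conflicting edges.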

According to the intelligent data analysis method based on natural language processing disclosed by the invention, preferably, the shortest path in the query graph is determined by using a modified Dijkstra algorithm.
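The text does not say how the Dijkstra algorithm is modified. The sketch below is therefore plain Dijkstra under one assumed cost model: each edge carries a relative transition frequency in (0, 1], and its cost is the negative logarithm of that frequency, so that more frequent transitions are cheaper. The function name and the cost model are assumptions.

```python
import heapq
import math


def shortest_path(edges, start, goal):
    """edges maps (u, v) to a relative frequency in (0, 1].

    Returns the node list of the cheapest path, or None if unreachable.
    """
    adj = {}
    for (u, v), p in edges.items():
        adj.setdefault(u, []).append((v, -math.log(p)))  # non-negative cost

    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, cost in adj.get(u, []):
            nd = d + cost
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))

    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]
```

With this cost model the shortest path is the TOKEN sequence whose transitions have the highest combined frequency, which is one way to reconcile a shortest-path search with frequency-based weights.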

According to the intelligent data analysis method based on natural language processing disclosed by the invention, preferably, the method further comprises the following steps: identifying a valid start TOKEN and a valid end TOKEN from the TOKEN set based on the query grammar, and determining a path from the vertex corresponding to the valid start TOKEN to the vertex corresponding to the valid end TOKEN in the query graph.
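A minimal sketch of this step is given below. Which TOKEN types may start or end a query is not stated in the text, so `START_TYPES` and `END_TYPES` are hypothetical, as is the depth-first enumeration used to find the paths.

```python
# Hypothetical: TOKEN types the query grammar accepts first and last.
START_TYPES = {"AGGREGATE", "MEASURE"}
END_TYPES = {"ATTRIBUTE", "FILTER"}


def start_end_paths(tokens, edges):
    """List every simple path from a valid start TOKEN to a valid end TOKEN.

    tokens is a list of (kind, value) pairs; edges is a list of
    (source index, destination index) pairs of the query graph.
    """
    starts = [i for i, (kind, _) in enumerate(tokens) if kind in START_TYPES]
    ends = {i for i, (kind, _) in enumerate(tokens) if kind in END_TYPES}
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)

    paths = []

    def walk(path):
        if path[-1] in ends:
            paths.append(list(path))
        for nxt in adj.get(path[-1], []):
            if nxt not in path:  # never revisit a TOKEN
                walk(path + [nxt])

    for s in starts:
        walk([s])
    return paths
```

The sketch returns all candidate paths; a complete implementation would presumably rank them, for example by preferring paths that cover more TOKENs.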

According to the intelligent data analysis method based on natural language processing disclosed by the invention, preferably, the method further comprises the following steps: the paths of the vertices in the query graph are determined and the path with the largest sum of weights is selected.
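When the query graph is acyclic, the path with the largest sum of weights can be found by memoized recursion over successors. The following is a sketch under that assumption (an acyclic graph with non-negative weights); the function name `heaviest_path` is hypothetical.

```python
from functools import lru_cache


def heaviest_path(weights):
    """weights maps (u, v) to the weight of a directed edge in an
    acyclic query graph. Returns (total weight, path as a tuple)."""
    adj = {}
    nodes = set()
    for (u, v), w in weights.items():
        adj.setdefault(u, []).append((v, w))
        nodes.update((u, v))

    @lru_cache(maxsize=None)
    def best_from(u):
        # Best path starting at u; the trivial path has weight 0.
        best = (0, (u,))
        for v, w in adj.get(u, []):
            total, path = best_from(v)
            if total + w > best[0]:
                best = (total + w, (u,) + path)
        return best

    return max((best_from(u) for u in sorted(nodes)), key=lambda r: r[0])
```

The recursion terminates only because the graph is assumed to be acyclic; on a graph with cycles the edges would first have to be pruned.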

According to the intelligent data analysis method based on natural language processing disclosed by the invention, preferably, the method further comprises the following steps: determining that the string and TOKEN set match a pattern, and in response to the matching, setting a weight of a directed edge of the query graph based on a pattern score associated with the pattern.
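The text does not define what a pattern looks like. As one possible reading, the sketch below treats a pattern as a regular expression over the input string combined with an ordered pair of TOKEN types, and overrides the weight of the matching edges with the pattern score. The pattern table `PATTERNS`, the TOKEN type `LIMIT` and the function `apply_patterns` are all hypothetical.

```python
import re

# Hypothetical patterns: (regex on the input string,
#                         ordered TOKEN-type pair, pattern score).
PATTERNS = [
    (re.compile(r"\btop\s+\d+\b"), ("LIMIT", "MEASURE"), 10.0),
]


def apply_patterns(text, tokens, weights, patterns=PATTERNS):
    """Return a copy of weights where edges covered by a matching
    pattern carry that pattern's score.

    tokens is a list of (kind, value) pairs; weights maps
    (source index, destination index) to a weight.
    """
    updated = dict(weights)
    for regex, (first, second), score in patterns:
        if not regex.search(text.lower()):
            continue  # the string does not match this pattern
        for i, (kind_i, _) in enumerate(tokens):
            for j, (kind_j, _) in enumerate(tokens):
                if i != j and (kind_i, kind_j) == (first, second):
                    updated[(i, j)] = score
    return updated
```

In this sketch a phrase such as "top 5 sales" strongly favors placing the LIMIT TOKEN directly before the MEASURE TOKEN, regardless of the grammar-derived weight.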

The second aspect of the present invention discloses an intelligent data analysis system based on natural language processing, comprising: a memory for storing program instructions; and the processor is used for calling the program instructions stored in the memory to realize the intelligent data analysis method based on natural language processing in any technical scheme.

The beneficial effects of the invention at least comprise: by using the query grammar, search behavior based on natural language is automatically converted into query behavior that follows professional data analysis practice, so that non-IT professionals (ordinary users) can directly extract, transform and load a large amount of mixed data from multiple data sources (non-relational data sources), which greatly improves working efficiency and meets the data analysis needs of modern organizations.

Drawings

FIG. 1 illustrates a user interface diagram of an intelligent data analysis system based on natural language processing, according to one embodiment of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and thus the present invention is not limited to the specific embodiments disclosed below.

According to one embodiment of the invention, a system for providing a search interface for a database is disclosed. The system comprises a memory, a processor and a network interface, wherein the network interface is used for connecting to the database, and the memory stores instructions executable by the processor to implement an intelligent data analysis method based on natural language processing, the method comprising:

receiving user-entered text as a character string, the character string being processed by a data analysis system to generate a set of TOKENs of a database grammar based on the character string, wherein each TOKEN matches a respective segment of the character string; generating a query graph for the TOKEN set by using a finite state machine representing query syntax, wherein nodes of the finite state machine represent TOKEN types, directed edges of the finite state machine represent valid transitions between the TOKEN types in the query syntax, nodes of the query graph correspond to the respective TOKENs in the TOKEN set, and the directed edges of the query graph represent transitions between two TOKENs in the TOKEN ordering; determining a TOKEN sequence in the TOKEN set based on the query graph to form a database query; and invoking a search of the database using a query based on the database query to obtain search results.
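The final step, invoking a search using a query based on the ordered TOKENs, might look like the sketch below. The text does not name a query language or database, so SQL, the in-memory SQLite database, the table `orders` and the function `to_sql` are assumptions chosen only to make the example runnable.

```python
import sqlite3


def to_sql(ordered_tokens, table="orders"):
    """Render an ordered TOKEN sequence as a SQL string.

    Assumes exactly one AGGREGATE, MEASURE and ATTRIBUTE TOKEN, and that
    the TOKEN values come from a trusted vocabulary (they are inserted
    into the SQL text directly).
    """
    parts = dict(ordered_tokens)  # kind -> value
    measure = f'{parts["AGGREGATE"]}({parts["MEASURE"]})'
    attribute = parts["ATTRIBUTE"]
    return (
        f"SELECT {attribute}, {measure} FROM {table} "
        f"GROUP BY {attribute} ORDER BY {attribute}"
    )


# Hypothetical data source standing in for the connected database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, sales_amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10.0), ("west", 5.0), ("east", 2.5)],
)

query = to_sql(
    [("AGGREGATE", "SUM"), ("MEASURE", "sales_amount"), ("ATTRIBUTE", "region")]
)
rows = conn.execute(query).fetchall()  # search results for the results pane
```

For non-relational data sources the same ordered TOKEN sequence would be rendered into that source's own query format instead of SQL.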

According to the above embodiment, preferably, a weight is determined for a directed edge from a source node to a destination node of the query graph, the nodes corresponding to a first TOKEN and a second TOKEN in the set of TOKENs, wherein the weight is determined based on the grammatical weight of the directed edge from the node of the finite state machine representing the TOKEN type of the first TOKEN to the node of the finite state machine representing the TOKEN type of the second TOKEN. The grammatical weight may indicate a frequency of transitions from a TOKEN of the first TOKEN type to a TOKEN of the second TOKEN type in the query grammar.

According to the above embodiment, preferably, one or more directed edges are removed from the query graph to form an acyclic query graph, and the order of TOKENs in the TOKEN set is determined based on the acyclic query graph.

According to the above embodiments, preferably, the Eades algorithm is applied to the query graph.

According to the above embodiments, preferably, the shortest path in the query graph is determined using a modified Dijkstra algorithm.

According to the above embodiment, preferably, valid start TOKEN and valid end TOKEN are identified from the TOKEN set based on the query syntax, and a path from a vertex corresponding to the valid start TOKEN to a vertex corresponding to the valid end TOKEN in the query graph is determined.

According to the above embodiment, preferably, paths of vertices in the query graph are determined, and the path with the largest sum of weights is selected.

According to the above embodiment, preferably, it is determined that the string and the TOKEN set match a pattern, and, in response to the match, the weights of the directed edges of the query graph are set based on the pattern score associated with the pattern.

The implementation of the intelligent data analysis method based on natural language processing disclosed by the invention can comprise any combination of the characteristics described in the above embodiments.

Another embodiment of the present invention further discloses a user interface for generating one or more database queries, shown schematically in FIG. 1. The specific working process of the present invention is further described below with reference to FIG. 1:

The display area 110 includes a search bar 120 that enables a user to enter a character string. The character string may include text in a natural language (e.g., Chinese or English). For example, the text of the string may represent a question or command for the data analysis system. The user selects the search bar 120 and types the text to enter the character string. Alternatively, the user may select a voice icon (not shown in FIG. 1) in the search bar 120 and input the text of the character string by voice. The string may be processed by a data analysis system to determine a database query based on the string.

Display area 110 includes a database query pane 130 that displays a representation of a database query, which includes a sequence of TOKENs represented by respective TOKEN icons (132, 134, 136, 138, and 140) that were initially generated based on the string. The database query pane 130 may enable users to select and edit TOKENs by interacting with their respective TOKEN icons (132, 134, 136, 138, and 140). Clicking or hovering over the TOKEN icon 140 with a cursor may cause a list of suggested alternative TOKENs of the database grammar to be displayed in the suggested TOKEN menu 160 and/or in a drop-down menu (not shown in FIG. 1) that appears near the TOKEN icon. The user may select an alternative TOKEN to edit the database query, and the TOKEN icon 140 may be removed or replaced with the TOKEN icon of the selected TOKEN.

The display area 110 includes a search results pane 150 that includes data based on search results obtained using the database query. For example, the search results pane 150 may include raw data (represented as text) that is retrieved from a database using a database query before and/or after the user modifies the database query. In addition, the search results pane 150 may also include processed data (represented as graphs and/or summary text) based on data retrieved from the database using the database query before and/or after the user modifies the database query.

Display area 110 includes a suggested TOKEN menu 160 in which suggested TOKENs for use in a database query are listed to facilitate editing of the database query. Suggested TOKENs may include TOKENs from highly ranked candidate queries that were generated by the process for determining database queries based on character strings but were not initially selected for presentation to the user. The suggested TOKEN menu 160 enables a user to select a TOKEN to be added to the database query or to replace a current TOKEN of the database query. The suggested TOKEN menu 160 includes a text entry option that enables a user to search the available TOKEN space of the database grammar.
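One simple way to populate such a menu is to collect TOKENs that appear in the runner-up candidate queries but not in the query that was presented. The sketch below assumes candidates are already ranked best first; the function name `suggested_tokens` and the cut-off `top_k` are hypothetical.

```python
def suggested_tokens(ranked_candidates, top_k=3):
    """ranked_candidates: candidate queries (lists of TOKEN names),
    best first. The first candidate is the one shown to the user."""
    if not ranked_candidates:
        return []
    selected = set(ranked_candidates[0])
    suggestions = []
    # Look only at the top_k runner-up candidates.
    for candidate in ranked_candidates[1:top_k + 1]:
        for token in candidate:
            if token not in selected and token not in suggestions:
                suggestions.append(token)
    return suggestions
```

Suggestions keep the ranking order of the candidates they came from, so TOKENs from the strongest alternative appear first in the menu.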

Display area 110 includes a like icon 170 that enables the user to express approval of the database query. When the user is satisfied that the presented database query accurately represents their intent, the user may click the like icon 170.

According to the above embodiments, the intelligent data analysis method and system based on natural language processing disclosed by the invention automatically generate a TOKEN set by analyzing natural language text, generate a query graph from the TOKEN set, and invoke a search of the database using a query based on the resulting database query to obtain search results. The same operation interface can therefore connect to a plurality of data sources, and the barrier to use for ordinary users is lowered, so that ordinary users can access, explore and analyze a large amount of mixed data from distributed data sources without bearing the burden of a rigid data analysis system intended mainly for IT professionals.

All or part of the steps in the methods of the above embodiments may be performed by controlling related hardware through a program, and the program may be stored in a readable storage medium, where the storage medium includes a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a compact disc Read-Only Memory (CD-ROM) or other optical disc storage, a magnetic disc storage, a tape storage, or any other medium capable of carrying or storing data.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An intelligent data analysis method based on natural language processing is characterized by comprising the following steps:

providing a search bar for receiving user input text to receive natural language text;

analyzing the received natural language text based on a natural language processing system to obtain a plurality of character string fragments corresponding to the semantics of the natural language text, and generating a group of TOKEN sets of database grammar based on the plurality of character string fragments;

generating a query graph for the TOKEN set using a finite state machine representing query syntax;

determining a sequence of each TOKEN in the set of TOKENs based on the query graph to form a database query;

a search behavior of the database is invoked using a query instruction based on the database query to obtain search results.

2. The intelligent data analysis method based on natural language processing as claimed in claim 1, wherein the step of generating a query graph for the TOKEN set by using a finite state machine representing query syntax includes:

nodes of the finite state machine represent TOKEN types, directed edges of the finite state machine represent valid transitions between TOKEN types in the query syntax, the query graph corresponds to each TOKEN in the TOKEN set, and directed edges of the query graph represent transitions between two TOKENs in the TOKEN ordering.

3. The intelligent data analysis method based on natural language processing as claimed in claim 1, wherein the step of determining the sequence of each TOKEN in the TOKEN set based on the query graph to form a database query specifically comprises:

determining a weight of a directed edge from a source node to a destination node of the query graph, corresponding to a first TOKEN to a second TOKEN of the set of TOKENs, wherein the weight is determined based on a grammatical weight of the directed edge from a node of the finite state machine representing a TOKEN type of the first TOKEN to a node of the finite state machine representing a TOKEN type of the second TOKEN, the grammatical weight indicating a frequency of transitions from the TOKEN of the first TOKEN type to the TOKEN of the second TOKEN type in the query grammar.

4. The intelligent data analysis method based on natural language processing according to claim 1, further comprising:

removing one or more directed edges from the query graph to form an acyclic query graph, and determining an order of TOKENs in a TOKEN set based on the acyclic query graph.

5. The intelligent data analysis method based on natural language processing as claimed in claim 1, wherein the query graph is generated by applying the Eades algorithm.

6. The intelligent natural language processing-based data analysis method of claim 1, wherein a modified Dijkstra algorithm is used to determine the shortest path in the query graph.

7. The intelligent data analysis method based on natural language processing according to claim 1, further comprising:

identifying a valid start TOKEN and a valid end TOKEN from the TOKEN set based on the query grammar, and determining a path from the vertex corresponding to the valid start TOKEN to the vertex corresponding to the valid end TOKEN in the query graph.

8. The intelligent data analysis method based on natural language processing according to claim 1, further comprising:

determining paths of vertices in the query graph and selecting the path with the largest weight sum.

9. The intelligent data analysis method based on natural language processing according to claim 1, further comprising:

determining that the string and TOKEN set match a pattern, and in response to the matching, setting a weight of a directed edge of the query graph based on a pattern score associated with the pattern.

10. An intelligent data analysis system based on natural language processing, comprising:

a memory for storing program instructions;

a processor for invoking the program instructions stored in the memory to implement the intelligent natural language processing based data analysis method of any one of claims 1 to 9.