CN115329753B - Intelligent data analysis method and system based on natural language processing - Google Patents

Intelligent data analysis method and system based on natural language processing

Info

Publication number
CN115329753B
Authority
CN
China
Prior art keywords
token
query
natural language
data analysis
language processing
Prior art date
Legal status
Active
Application number
CN202211252819.8A
Other languages
Chinese (zh)
Other versions
CN115329753A (en)
Inventor
刘沂鑫
周丞
Current Assignee
Beijing Yihui Information Technology Co ltd
Original Assignee
Beijing Yihui Information Technology Co ltd
Priority date
2022-10-13
Filing date
2022-10-13
Publication date
2023-03-24
Application filed by Beijing Yihui Information Technology Co ltd
Priority to CN202211252819.8A
Publication of CN115329753A
Application granted
Publication of CN115329753B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2452 Query translation
    • G06F16/24522 Translation of natural language queries to structured queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Abstract

The invention provides an intelligent data analysis method and system based on natural language processing, relating to data analysis systems. The method comprises the following steps: providing a search bar for receiving natural language text entered by a user; analyzing the received natural language text with a natural language processing system to obtain a plurality of character string fragments corresponding to the semantics of the natural language text, and generating a TOKEN set of a database grammar based on the plurality of character string fragments; generating a query graph for the TOKEN set using a finite state machine representing query syntax; determining an order of the TOKENs in the TOKEN set based on the query graph to form a database query; and invoking a search of the database using a query instruction based on the database query to obtain search results. The invention improves data analysis efficiency and is convenient for ordinary users: it allows non-IT professionals to directly analyze large amounts of mixed data from multiple data sources, including non-relational data sources.

Description

Intelligent data analysis method and system based on natural language processing
Technical Field
The invention relates to the technical field of data analysis, in particular to an intelligent data analysis method and system based on natural language processing.
Background
Conventional data analysis systems may collect, analyze, and take actions on data contained in data sources. The data source may be an internal, external, local, or remote computing device associated with the data analysis system. For example, the external remote data source may be a server connected to the data analysis system through a computer network.
Existing data analysis systems have a number of disadvantages. They are designed for use by Information Technology (IT) professionals, not by end users. These systems use an extract, transform, and load (ETL) pipeline to extract data from data sources and store the extracted data in a centralized data warehouse or data lake. They provide only partial and stale data for querying and analysis, and thus do not meet the data analysis needs of modern organizations. Analysts typically spend a significant amount of time collecting and preparing data rather than actually analyzing it with Business Intelligence (BI) tools. Examples of tools with analysis or visualization functionality include Tableau, Power BI, R, and Python. These tools operate primarily on data residing in a single small relational database. However, non-relational data sources are widely used in modern organizations, such as Hadoop, cloud storage (e.g., S3, Microsoft Azure Blob Storage), and NoSQL databases (e.g., MongoDB, Elasticsearch, Cassandra).
Furthermore, data is typically distributed among different data sources, so a user cannot simply connect a BI tool to any combination of data sources. The connection mechanism is typically too slow, queries often fail, the amount of raw data is too large or complex, and the data is typically of a mixed type.
In addition, users seeking flexible access to data analysis systems often circumvent security measures by downloading or extracting data into unsecured, unsupervised systems (e.g., spreadsheets, standalone databases, and BI servers) for subsequent analysis.
Users therefore need to be able to access, explore, and analyze large amounts of mixed data from distributed data sources without bearing the burden of a rigid data analysis system designed mainly for IT professionals. How to design a data analysis system that improves data analysis efficiency and is convenient for ordinary users is thus a technical problem to be solved.
Disclosure of Invention
The invention aims to solve at least one of the technical problems in the prior art or the related art, and discloses an intelligent data analysis method and system based on natural language processing.
A first aspect of the invention discloses an intelligent data analysis method based on natural language processing, which comprises the following steps: providing a search bar for receiving natural language text entered by a user; analyzing the received natural language text with a natural language processing system to obtain a plurality of character string fragments corresponding to the semantics of the natural language text, and generating a TOKEN set of a database grammar based on the plurality of character string fragments; generating a query graph for the TOKEN set using a finite state machine representing query syntax; determining an order of the TOKENs in the TOKEN set based on the query graph to form a database query; and invoking a search of the database using a query instruction based on the database query to obtain search results.
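The step of turning character string fragments into a TOKEN set can be illustrated with a minimal sketch. The patent does not specify how fragments are matched, so the lexicon `LEXICON`, the token type names, and the greedy longest-match strategy below are all assumptions made for illustration only.

```python
# Hypothetical lexicon mapping phrases to token types (not taken from the patent).
LEXICON = {
    "revenue": "MEASURE",
    "region": "ATTRIBUTE",
    "last year": "FILTER",
}

def tokenize(text):
    """Greedy longest-match of word sequences against LEXICON.

    Returns a list of (fragment, token_type) pairs; words with no
    lexicon entry are skipped.
    """
    words = text.lower().split()
    max_len = max(len(phrase.split()) for phrase in LEXICON)
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in LEXICON:
                tokens.append((phrase, LEXICON[phrase]))
                i += n
                break
        else:
            i += 1  # no phrase starts at this word
    return tokens

tokens = tokenize("Show revenue by region last year")
```

Here each resulting pair ties one fragment of the input string to one token type, which is the form the later graph-building steps would consume.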
According to the intelligent data analysis method based on natural language processing disclosed by the present invention, preferably, the step of generating the query graph for the TOKEN set by using the finite state machine representing the query syntax specifically includes: nodes of the finite state machine represent TOKEN types, directed edges of the finite state machine represent valid transitions between TOKEN types in the query syntax, nodes of the query graph correspond to the TOKENs in the TOKEN set, and directed edges of the query graph represent transitions between two TOKENs in the TOKEN ordering.
According to the intelligent data analysis method based on natural language processing disclosed by the present invention, preferably, the step of determining the sequence of each TOKEN in the TOKEN set based on the query graph to form the database query specifically includes: determining a weight of a directed edge from a source node to a destination node of the query graph, corresponding to a first TOKEN to a second TOKEN of the set of TOKENs, wherein the weight is determined based on a grammatical weight of the directed edge from a node of the finite state machine representing a TOKEN type of the first TOKEN to a node of the finite state machine representing a TOKEN type of the second TOKEN, the grammatical weight indicating a frequency of transitions from the TOKEN of the first TOKEN type to the TOKEN of the second TOKEN type in the query grammar.
According to the intelligent data analysis method based on natural language processing disclosed by the invention, preferably, the method further comprises the following steps: one or more directed edges are removed from the query graph to form an acyclic query graph, and an order of the TOKENs in the TOKEN set is determined based on the acyclic query graph.
According to the intelligent data analysis method based on natural language processing disclosed by the invention, preferably, the query graph is generated by using the Eades algorithm.
According to the intelligent data analysis method based on natural language processing disclosed by the invention, preferably, the shortest path in the query graph is determined by using a modified Dijkstra algorithm.
According to the intelligent data analysis method based on natural language processing disclosed by the invention, preferably, the method further comprises the following steps: identifying a valid start TOKEN and a valid end TOKEN from the TOKEN set based on the query grammar, and determining a path in the query graph from the vertex corresponding to the valid start TOKEN to the vertex corresponding to the valid end TOKEN.
According to the intelligent data analysis method based on natural language processing disclosed by the invention, preferably, the method further comprises the following steps: the paths of the vertices in the query graph are determined and the path with the largest sum of weights is selected.
According to the intelligent data analysis method based on natural language processing disclosed by the invention, preferably, the method further comprises the following steps: determining that the string and TOKEN set match a pattern, and in response to the matching, setting a weight of a directed edge of the query graph based on a pattern score associated with the pattern.
A second aspect of the present invention discloses an intelligent data analysis system based on natural language processing, comprising: a memory for storing program instructions; and a processor for invoking the program instructions stored in the memory to implement the intelligent data analysis method based on natural language processing according to any of the above technical solutions.
The beneficial effects of the invention at least include the following: by using the query grammar, a natural language search is automatically converted into a query of the kind a professional data analyst would write, so that non-IT professionals (ordinary users) can directly extract, transform, and load large amounts of mixed data from multiple data sources (including non-relational data sources). This greatly improves working efficiency and meets the data analysis needs of modern organizations.
Drawings
FIG. 1 illustrates a user interface diagram of an intelligent data analysis system based on natural language processing, according to one embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and thus the present invention is not limited to the specific embodiments disclosed below.
According to one embodiment of the invention, a system for providing a search interface for a database is disclosed. The system comprises a memory, a processor, and a network interface. The network interface is used for connecting to the database, and the memory stores instructions executable by the processor to implement the intelligent data analysis method based on natural language processing:
receiving a character string of text entered by a user, the character string being processed by the data analysis system to generate a set of TOKENs of a database grammar based on the character string, wherein each TOKEN matches a respective fragment of the character string; generating a query graph for the TOKEN set by using a finite state machine representing query syntax, wherein nodes of the finite state machine represent TOKEN types, directed edges of the finite state machine represent valid transitions between TOKEN types in the query syntax, nodes of the query graph correspond to the TOKENs in the TOKEN set, and directed edges of the query graph represent transitions between two TOKENs in the TOKEN ordering; determining an order of the TOKENs in the TOKEN set based on the query graph to form a database query; and invoking a search of the database using a query based on the database query to obtain search results.
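As a rough illustration of how a finite state machine over TOKEN types could induce a query graph over concrete TOKENs, consider the sketch below. The token types and the transition set `FSM_EDGES` are hypothetical; the patent does not list its grammar.

```python
from itertools import permutations

# Hypothetical finite state machine: each pair is a valid transition
# between token types in an assumed query syntax.
FSM_EDGES = {
    ("MEASURE", "ATTRIBUTE"),
    ("ATTRIBUTE", "FILTER"),
    ("MEASURE", "FILTER"),
    ("FILTER", "FILTER"),
}

def build_query_graph(tokens):
    """tokens: list of (text, token_type) pairs.

    Returns the set of directed edges (i, j), by token index, for which
    the type of token i may be followed by the type of token j.
    """
    edges = set()
    for i, j in permutations(range(len(tokens)), 2):
        if (tokens[i][1], tokens[j][1]) in FSM_EDGES:
            edges.add((i, j))
    return edges

# Tokens in the order they appeared in the user's text.
tokens = [("by region", "ATTRIBUTE"), ("revenue", "MEASURE"), ("year = 2022", "FILTER")]
graph = build_query_graph(tokens)
```

In this sketch the query graph has one node per TOKEN, and an edge exists only where the grammar allows one TOKEN type to follow another, so the input order of the text does not constrain the final ordering.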
According to the above embodiment, preferably, a weight is determined for a directed edge from a source node to a destination node of the query graph, the nodes corresponding to a first TOKEN and a second TOKEN in the TOKEN set, wherein the weight is determined based on the grammar weight of the directed edge from the node of the finite state machine representing the TOKEN type of the first TOKEN to the node of the finite state machine representing the TOKEN type of the second TOKEN. The grammar weight may indicate the frequency of transitions from a TOKEN of the first TOKEN type to a TOKEN of the second TOKEN type in the query grammar.
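One way to read "frequency of transitions" is as a relative frequency counted over example queries. The following sketch assumes such a corpus of token type sequences exists and normalizes counts over all observed transitions; both the corpus and the normalization are assumptions, not details from the patent.

```python
from collections import Counter

def grammar_weights(example_type_sequences):
    """Relative frequency of each type-to-type transition in the examples."""
    counts = Counter()
    for seq in example_type_sequences:
        counts.update(zip(seq, seq[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def weight_query_edges(tokens, edges, weights):
    """Give each query graph edge the grammar weight of its type transition."""
    return {
        (i, j): weights.get((tokens[i][1], tokens[j][1]), 0.0)
        for (i, j) in edges
    }

# Hypothetical example queries, reduced to their token type sequences.
examples = [
    ["MEASURE", "ATTRIBUTE", "FILTER"],
    ["MEASURE", "FILTER"],
    ["MEASURE", "ATTRIBUTE"],
]
weights = grammar_weights(examples)

tokens = [("revenue", "MEASURE"), ("region", "ATTRIBUTE"), ("year = 2022", "FILTER")]
edges = {(0, 1), (1, 2), (0, 2), (2, 1)}
edge_weights = weight_query_edges(tokens, edges, weights)
```

Transitions never seen in the examples receive a weight of 0.0 here; a real system might instead smooth the counts.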
According to the above embodiment, preferably, one or more directed edges are removed from the query graph to form an acyclic query graph, and the order of TOKENs in the TOKEN set is determined based on the acyclic query graph.
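The patent does not say which edges are removed. A simple heuristic, shown here purely as an assumption, is to find a directed cycle by depth-first search and delete its lowest-weight edge, repeating until no cycle remains.

```python
def find_cycle(nodes, edges):
    """Return the edges of one directed cycle, or None if the graph is acyclic."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = on stack, 2 = finished
    stack = []

    def visit(u):
        state[u] = 1
        stack.append(u)
        for v in adj[u]:
            if state[v] == 1:  # back edge closes a cycle
                cycle_nodes = stack[stack.index(v):] + [v]
                return list(zip(cycle_nodes, cycle_nodes[1:]))
            if state[v] == 0:
                found = visit(v)
                if found:
                    return found
        stack.pop()
        state[u] = 2
        return None

    for n in nodes:
        if state[n] == 0:
            found = visit(n)
            if found:
                return found
    return None

def make_acyclic(nodes, weighted_edges):
    """Repeatedly drop the weakest edge of a cycle until none is left."""
    edges = dict(weighted_edges)
    while True:
        cycle = find_cycle(nodes, edges)
        if cycle is None:
            return edges
        weakest = min(cycle, key=lambda e: edges[e])
        del edges[weakest]

nodes = [0, 1, 2]
weighted = {(0, 1): 0.5, (1, 2): 0.4, (2, 0): 0.1, (0, 2): 0.3}
dag = make_acyclic(nodes, weighted)
```

In the example, the cycle 0 → 1 → 2 → 0 is broken by removing the edge (2, 0), which carries the smallest weight.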
According to the above embodiments, preferably, the Eades algorithm is applied to the query graph.
According to the above embodiments, preferably, the shortest path in the query graph is determined using a modified Dijkstra algorithm.
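The nature of the modification to Dijkstra's algorithm is not described. The sketch below is therefore the standard algorithm, with one assumed convention: edge weights in (0, 1] are converted to costs with a negative logarithm so that frequent transitions become cheap.

```python
import heapq
import math

def shortest_path(edges, source, target):
    """Standard Dijkstra over {(u, v): cost} with non-negative costs.

    Returns (total_cost, path); (inf, []) if target is unreachable.
    """
    adj = {}
    for (u, v), cost in edges.items():
        adj.setdefault(u, []).append((v, cost))
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        if u == target:
            break
        for v, cost in adj.get(u, []):
            nd = d + cost
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return math.inf, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

# Assumed conversion: high transition frequency means low cost.
weights = {(0, 1): 0.5, (1, 2): 0.5, (0, 2): 0.1}
costs = {edge: -math.log(w) for edge, w in weights.items()}
cost, path = shortest_path(costs, 0, 2)
```

With these numbers the two-step route through node 1 is cheaper than the direct but rarely used edge from 0 to 2.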
According to the above embodiment, preferably, valid start TOKEN and valid end TOKEN are identified from the TOKEN set based on the query syntax, and a path from a vertex corresponding to the valid start TOKEN to a vertex corresponding to the valid end TOKEN in the query graph is determined.
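A minimal sketch of this step follows, assuming the query grammar marks certain TOKEN types as allowed at the start or end of a query. The sets `START_TYPES` and `END_TYPES` and the depth-first search are illustrative choices, not details from the patent.

```python
# Hypothetical grammar facts: which token types may open or close a query.
START_TYPES = {"MEASURE"}
END_TYPES = {"FILTER", "ATTRIBUTE"}

def endpoints(tokens):
    """Indices of tokens that may start and that may end a query."""
    starts = [i for i, (_, t) in enumerate(tokens) if t in START_TYPES]
    ends = [i for i, (_, t) in enumerate(tokens) if t in END_TYPES]
    return starts, ends

def find_path(edges, start, end):
    """Depth-first search for a simple path from start to end, or None."""
    adj = {}
    for u, v in sorted(edges):  # sorted for a deterministic result
        adj.setdefault(u, []).append(v)

    def dfs(u, seen):
        if u == end:
            return [u]
        for v in adj.get(u, []):
            if v not in seen:
                rest = dfs(v, seen | {v})
                if rest:
                    return [u] + rest
        return None

    return dfs(start, {start})

tokens = [("by region", "ATTRIBUTE"), ("revenue", "MEASURE"), ("year = 2022", "FILTER")]
edges = {(0, 2), (1, 0), (1, 2)}
starts, ends = endpoints(tokens)
path = find_path(edges, starts[0], 2)
ordered = [tokens[i][0] for i in path]
```

The resulting path visits the tokens in the order "revenue", "by region", "year = 2022", which differs from the order in which they were typed.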
According to the above embodiment, preferably, paths of vertices in the query graph are determined, and the path with the largest sum of weights is selected.
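For an acyclic query graph, the path with the largest weight sum can be found by memoized recursion over successors. This is one possible sketch under the assumption that the graph has already been made acyclic; the function name `best_path` is hypothetical.

```python
from functools import lru_cache

def best_path(weighted_edges, start, end):
    """Maximum-weight path from start to end in an acyclic graph.

    Returns (total_weight, path_tuple), or None if end is unreachable.
    """
    adj = {}
    for (u, v), w in weighted_edges.items():
        adj.setdefault(u, []).append((v, w))

    @lru_cache(maxsize=None)
    def best_from(u):
        if u == end:
            return 0.0, (u,)
        best = None
        for v, w in adj.get(u, []):
            sub = best_from(v)
            if sub is not None:
                candidate = (w + sub[0], (u,) + sub[1])
                if best is None or candidate[0] > best[0]:
                    best = candidate
        return best

    return best_from(start)

dag = {(0, 1): 0.5, (1, 2): 0.4, (0, 2): 0.3}
total, path = best_path(dag, 0, 2)
```

Here the route through node 1 wins because its weights sum to 0.9, compared with 0.3 for the direct edge.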
According to the above embodiment, it is preferably determined that the string and TOKEN set match a pattern, and the weights of the directed edges of the query graph are set based on the pattern score associated with the pattern corresponding to this match.
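The form of a pattern is not defined in the patent. The sketch below assumes a pattern consists of a regular expression over the input string, a pair of TOKEN types that must both be present, and a score that overrides the weight of matching edges; all of these are illustrative assumptions.

```python
import re

# Hypothetical pattern table: regex on the string, a type transition, and a score.
PATTERNS = [
    {
        "regex": re.compile(r"\btop \d+\b", re.IGNORECASE),
        "types": ("LIMIT", "MEASURE"),
        "score": 0.9,
    },
]

def apply_pattern_scores(text, tokens, weighted_edges):
    """Return a copy of the edge weights with pattern scores applied."""
    types = [t for _, t in tokens]
    updated = dict(weighted_edges)
    for pattern in PATTERNS:
        src_type, dst_type = pattern["types"]
        matches = (
            pattern["regex"].search(text)
            and src_type in types
            and dst_type in types
        )
        if not matches:
            continue
        for (i, j) in list(updated):
            if (types[i], types[j]) == (src_type, dst_type):
                updated[(i, j)] = pattern["score"]
    return updated

text = "top 5 products by revenue"
tokens = [("top 5", "LIMIT"), ("revenue", "MEASURE"), ("products", "ATTRIBUTE")]
edges = {(0, 1): 0.2, (1, 2): 0.5}
scored = apply_pattern_scores(text, tokens, edges)
```

Only the edge whose endpoints have the pattern's types is rescored; other edges keep their grammar-derived weights.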
The implementation of the intelligent data analysis method based on natural language processing disclosed by the invention can comprise any combination of the characteristics described in the above embodiments.
Another embodiment of the invention, shown in FIG. 1, discloses a user interface for generating one or more database queries. The working process of the invention is further described below with reference to FIG. 1:
The display area 110 includes a search bar 120 that enables a user to enter a character string. The character string may include text in a natural language (e.g., Chinese or English). For example, the text of the string may represent a question or command for the data analysis system. The user selects the search bar 120 and types the text of the character string. Alternatively, the user may select a voice icon portion (not shown in FIG. 1) of the search bar 120 and input the text of the character string by voice. The string may be processed by the data analysis system to determine a database query based on the string.
Display area 110 includes a database query pane 130 that displays a representation of a database query that includes a sequence of TOKENs represented by respective TOKEN icons (132, 134, 136, 138, and 140) that were initially generated based on a string. The database query pane 130 may enable users to select and edit TOKENs by interacting with their respective TOKEN icons (132, 134, 136, 138, and 140). Clicking or hovering over the TOKEN icon 140 with a cursor may cause a list of suggested alternative TOKENs of the database grammar to be displayed in the suggested TOKEN menu 160 and/or in a drop-down menu (not shown in FIG. 1) that appears near the TOKEN icon. The user may select an alternative TOKEN to edit the database query, and the TOKEN icon 140 may be removed or replaced with the TOKEN icon of the selected TOKEN.
The display area 110 includes a search results pane 150 that includes data based on search results obtained using the database query. For example, the search results pane 150 may include raw data (represented as text) that is retrieved from a database using a database query before and/or after the user modifies the database query. In addition, the search results pane 150 may also include processing data (represented as graphs and/or summary text) based on data retrieved from the database using the database query before and/or after the user modifies the database query.
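To make the distinction between raw and processed results concrete, the sketch below runs a query against an in-memory SQLite database and derives a one-line summary. The table `sales`, its columns, and the mapping from ordered TOKENs to SQL are all assumptions; the patent does not name a query language or schema.

```python
import sqlite3

# Hypothetical whitelist so that only known column names reach the SQL text.
COLUMNS = {"region", "amount", "year"}

def run_query(conn, measure, attribute, year):
    """Aggregate a measure per attribute value for one year."""
    if measure not in COLUMNS or attribute not in COLUMNS:
        raise ValueError("unknown column")
    sql = (
        f"SELECT {attribute}, SUM({measure}) FROM sales "
        f"WHERE year = ? GROUP BY {attribute} ORDER BY {attribute}"
    )
    return conn.execute(sql, (year,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL, year INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 10.0, 2022), ("east", 20.0, 2022), ("west", 5.0, 2022), ("west", 7.0, 2021)],
)

# Raw data as it might be listed in the results pane.
rows = run_query(conn, "amount", "region", 2022)
# Processed data: a short summary derived from the raw rows.
summary = f"{len(rows)} regions, total {sum(total for _, total in rows):.1f}"
```

If the user edits the query, for example by changing the year, the same function can be called again to refresh both the raw rows and the summary.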
Display area 110 includes a suggested TOKEN menu 160 in which suggested TOKENs for use in a database query are listed to facilitate editing of the database query. Suggested TOKENs may include TOKENs from highly ranked candidate queries that were generated by the process for determining database queries based on character strings but were not initially selected for presentation to the user. The suggested TOKEN menu 160 enables a user to select a TOKEN to be added to the database query or to replace a current TOKEN of the database query. The suggested TOKEN menu 160 includes text entry options that enable a user to search the available TOKEN space of the database grammar.
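One plausible way to populate such a menu is to rank the candidate queries by score and collect the TOKENs that appear in the runners-up but not in the presented query. The function below is a sketch under that assumption; the scores and the `limit` parameter are hypothetical.

```python
def suggest_tokens(candidates, limit=5):
    """candidates: list of (score, [token, ...]) pairs.

    Returns tokens from lower-ranked candidate queries that are absent
    from the top-ranked query, in rank order and without duplicates.
    """
    if not candidates:
        return []
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    chosen = set(ranked[0][1])
    suggestions = []
    for _, query_tokens in ranked[1:]:
        for token in query_tokens:
            if token not in chosen and token not in suggestions:
                suggestions.append(token)
    return suggestions[:limit]

candidates = [
    (0.7, ["revenue", "country"]),
    (0.9, ["revenue", "region"]),
    (0.4, ["sales", "region"]),
]
menu = suggest_tokens(candidates)
```

In this example the query scored 0.9 is presented to the user, and the menu offers "country" and "sales" as alternatives.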
Display area 110 includes a like icon 170 that enables the user to express approval of the database query. The user may click the like icon 170 when satisfied that the presented database query accurately represents their intent.
According to the above embodiments, the intelligent data analysis method and system based on natural language processing disclosed by the invention automatically generate a TOKEN set by analyzing natural language text, generate a query graph from the TOKEN set, and invoke a search of the database using a query based on the resulting database query to obtain search results. The same operation interface can therefore connect to a plurality of data sources, and the barrier to entry for ordinary users is lowered, so that they can access, explore, and analyze large amounts of mixed data from distributed data sources without bearing the burden of a rigid data analysis system designed mainly for IT professionals.
All or part of the steps in the methods of the above embodiments may be performed by controlling related hardware through a program, and the program may be stored in a readable storage medium, where the storage medium includes a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a compact disc Read-Only Memory (CD-ROM) or other optical disc storage, a magnetic disc storage, a tape storage, or any other medium capable of carrying or storing data.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An intelligent data analysis method based on natural language processing is characterized by comprising the following steps:
providing a search bar for receiving natural language text entered by a user;
analyzing the received natural language text based on a natural language processing system to obtain a plurality of character string fragments corresponding to the semantics of the natural language text, and generating a TOKEN set of a database grammar based on the plurality of character string fragments;
generating a query graph for the TOKEN set using a finite state machine representing query syntax;
determining an order of the TOKENs in the TOKEN set based on the query graph to form a database query;
invoking a search of the database using a query instruction based on the database query to obtain search results.
2. The intelligent data analysis method based on natural language processing as claimed in claim 1, wherein the step of generating a query graph for the TOKEN set by using a finite state machine representing query syntax includes:
nodes of the finite state machine represent TOKEN types, directed edges of the finite state machine represent valid transitions between TOKEN types in the query syntax, the query graph corresponds to each TOKEN in the TOKEN set, and directed edges of the query graph represent transitions between two TOKENs in the TOKEN ordering.
3. The intelligent data analysis method based on natural language processing as claimed in claim 1, wherein the step of determining the sequence of each TOKEN in the TOKEN set based on the query graph to form a database query specifically comprises:
determining a weight of a directed edge from a source node to a destination node of the query graph, corresponding to a first TOKEN to a second TOKEN of the set of TOKENs, wherein the weight is determined based on a grammatical weight of the directed edge from a node of the finite state machine representing a TOKEN type of the first TOKEN to a node of the finite state machine representing a TOKEN type of the second TOKEN, the grammatical weight indicating a frequency of transitions from the TOKEN of the first TOKEN type to the TOKEN of the second TOKEN type in the query grammar.
4. The intelligent data analysis method based on natural language processing according to claim 1, further comprising:
removing one or more directed edges from the query graph to form an acyclic query graph, and determining an order of TOKENs in a TOKEN set based on the acyclic query graph.
5. The intelligent data analysis method based on natural language processing as claimed in claim 1, wherein the query graph is generated by applying the Eades algorithm.
6. The intelligent data analysis method based on natural language processing as claimed in claim 1, wherein a modified Dijkstra algorithm is used to determine the shortest path in the query graph.
7. The intelligent data analysis method based on natural language processing according to claim 1, further comprising:
identifying a valid start TOKEN and a valid end TOKEN from the TOKEN set based on the query grammar, and determining a path in the query graph from the vertex corresponding to the valid start TOKEN to the vertex corresponding to the valid end TOKEN.
8. The intelligent data analysis method based on natural language processing according to claim 1, further comprising:
determining paths of vertices in the query graph and selecting the path with the largest weight sum.
9. The intelligent data analysis method based on natural language processing according to claim 1, further comprising:
determining that the string and TOKEN set match a pattern, and in response to the matching, setting a weight of a directed edge of the query graph based on a pattern score associated with the pattern.
10. An intelligent data analysis system based on natural language processing, comprising:
a memory for storing program instructions;
a processor for invoking the program instructions stored in the memory to implement the intelligent natural language processing based data analysis method of any one of claims 1 to 9.
CN202211252819.8A 2022-10-13 2022-10-13 Intelligent data analysis method and system based on natural language processing Active CN115329753B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211252819.8A CN115329753B (en) 2022-10-13 2022-10-13 Intelligent data analysis method and system based on natural language processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211252819.8A CN115329753B (en) 2022-10-13 2022-10-13 Intelligent data analysis method and system based on natural language processing

Publications (2)

Publication Number    Publication Date
CN115329753A (en)    2022-11-11
CN115329753B (en)    2023-03-24

Family

ID=83914735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211252819.8A Active CN115329753B (en) 2022-10-13 2022-10-13 Intelligent data analysis method and system based on natural language processing

Country Status (1)

Country Link
CN (1) CN115329753B (en)

Also Published As

Publication number Publication date
CN115329753A (en) 2022-11-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant