CN112270189A

CN112270189A - Question type analysis node generation method, question type analysis node generation system and storage medium

Info

Publication number: CN112270189A
Application number: CN202011259004.3A
Authority: CN
Inventors: 姜磊; 钟颖欣; 辛岩; 杨钊
Original assignee: Brilliant Data Analytics Inc
Current assignee: Brilliant Data Analytics Inc
Priority date: 2020-11-12
Filing date: 2020-11-12
Publication date: 2021-01-26
Anticipated expiration: 2040-11-12
Also published as: CN112270189B

Abstract

The invention relates to data analysis technology, and in particular to a method, a system and a storage medium for generating question-type analysis nodes. The method comprises the following steps: preprocessing the input natural language question and performing word segmentation; performing feature representation and feature extraction on the text data corresponding to the preprocessed natural language question, and converting the text data into numerical form; extracting key information from the natural language question and identifying the type of the key information; constructing an intention recognition model and judging the analysis intention of the input natural language question; and combining the results of feature extraction, type recognition and intention recognition to obtain the data source, analysis dimensions, analysis indexes, analysis tasks and other additional data analysis information required by the natural language question, and automatically generating analysis nodes. The invention allows a user to complete data analysis and exploration without understanding complex data structures or analysis methods, and thus to quickly explore the data and discover problems in the business.

Description

Question type analysis node generation method, question type analysis node generation system and storage medium

Technical Field

The present invention relates to data analysis technologies, and in particular, to a method, a system, and a storage medium for generating a question-asked analysis node.

Background

Existing question-type data analysis systems generally let users pose simple natural language questions; the system parses the question, automatically queries a database, and presents the result to the user as a visual answer. Such systems only handle specific, relatively simple queries. For example, when a user asks "what is the power consumption of a certain area this month", the existing system aggregates that month's power consumption data in the database into a summary value and returns a visual view or a specific number to the user.

When the user's question is more complicated, such as "what are the electricity usage trends of different user types in Guangzhou in the first half of the year?", the conventional question-type analysis system described above offers only a data query function, and the result corresponding to the user's question does not exist directly in the database. The complicated analysis requirement of the user therefore cannot be satisfied.

In addition, if the user's question does not relate to any analysis path in the shared library of the data analysis system, the user cannot obtain effective analysis path recommendations from the question-type data analysis system. It is therefore necessary to provide a question-type analysis node generation method and system that solve these problems of data analysis systems based on analysis path recommendation.

Disclosure of Invention

The invention provides a question-type analysis node generation method, a question-type analysis node generation system and a storage medium, which can parse natural language questions posed by users, automatically extract data, select an analysis function and generate analysis nodes, so that users can complete data analysis and exploration without understanding complex data structures or analysis methods, and can thus quickly explore the data and discover problems in the business.

The question-type analysis node generation method of the invention comprises the following steps:

S1, preprocessing the input natural language question and performing word segmentation to obtain the segmented words;

S2, performing feature representation and feature extraction on the text data corresponding to the preprocessed natural language question, and converting the text data into numerical form;

S3, extracting key information from the input natural language question, and performing type recognition on the key information to obtain entity category information;

S4, constructing an intention recognition model, judging the analysis intention of the input natural language question, and completing intention recognition;

S5, combining the results of feature extraction, type recognition and intention recognition in steps S2-S4 to obtain the data source, analysis dimensions, analysis indexes, analysis tasks and other additional data analysis information required by the natural language question, and automatically generating analysis nodes.

In a preferred embodiment, step S5 includes:

S51, establishing the task data interface of the analysis node: a standard data interface is established for each analysis node task;

S52, generating data interface information: based on the entity category information and in combination with metadata information, matching and indexing to obtain data source information, index information, dimension information and other additional data analysis information; determining the analysis node task based on the analysis intention; and processing the data source information, index information, dimension information and other additional data analysis information, passing the processed information to the corresponding analysis node task, and calling the analysis node task to generate and display the analysis result.

The question-type analysis node generation system according to the present invention includes:

a preprocessing module, used for preprocessing the input natural language question and performing word segmentation to obtain the segmented words;

a feature extraction module, used for performing feature representation and feature extraction on the text data corresponding to the preprocessed natural language question and converting the text data into numerical form;

an information extraction module, used for extracting key information from the input natural language question and identifying the type of the key information to obtain entity category information;

an intention recognition module, used for constructing an intention recognition model, judging the analysis intention of the input natural language question and completing intention recognition;

and an analysis node generation module, used for combining the processing results of the feature extraction module, the information extraction module and the intention recognition module to obtain the data source, analysis dimensions, analysis indexes, analysis tasks and other additional data analysis information required by the natural language question, and automatically generating analysis nodes.

The storage medium of the present invention has stored thereon computer instructions which, when executed by a processor, perform the steps of the analytical node generation method of the present invention.

Compared with the prior art, the invention has the following notable effects: based on the input natural language question, it automatically identifies the user's data analysis intention, matches and indexes the source data, generates filtering conditions, determines the analysis dimensions and indexes, generates analysis nodes and forms an analysis path, thereby lowering the threshold for users to perform data analysis.

Drawings

FIG. 1 is a flow chart of an implementation of a method for visualizing an analytic concept of the present invention;

FIG. 2 is a schematic structural diagram of an LSTM-CRF model;

FIG. 3 is a block flow diagram of analysis node generation.

Detailed Description

In order to make the objects, features and advantages of the present invention more obvious and understandable, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It is to be understood that the embodiments described below are only some embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, the method for generating an analysis node of a question includes the following steps:

S1, preprocessing the input natural language question and performing word segmentation to obtain the segmented words.

The natural language question input by the user undergoes unified normalization preprocessing: full-width/half-width conversion, case conversion, cleaning and removal of special symbols, and the like are applied to the corresponding text data. In addition, because Chinese has no explicit separator between words, and mixed Chinese-English text may lack separators as well, word segmentation is needed to split the sentence string into independent words.

Step S1 specifically includes: loading the text data corresponding to the input natural language question into memory for processing; uniformly converting the text data to lowercase, half-width, Simplified Chinese form, and segmenting it with the jieba word segmentation tool; and checking the segmented word list: if a stop-word list exists, the corresponding stop words are removed; otherwise all words are kept.
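The preprocessing in step S1 can be sketched as follows. This is a minimal illustration rather than the embodiment's actual code: the function names are hypothetical, Unicode NFKC normalization is assumed as the full-width to half-width conversion, the conversion to Simplified Chinese is omitted, and a simple regex-based stand-in replaces the jieba segmenter so that the sketch runs with the standard library only. In practice a real segmenter such as `jieba.lcut` could be passed in as the `segment` argument.

```python
import re
import unicodedata
from typing import Callable, List, Optional, Set


def normalize(text: str) -> str:
    """Convert full-width to half-width (NFKC), lowercase, and drop special symbols."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Keep word characters (including CJK) and whitespace; replace other symbols.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def fallback_segment(text: str) -> List[str]:
    """Stand-in segmenter (assumption): ASCII runs are words, any other character is a token."""
    return re.findall(r"[a-z0-9_]+|[^\sa-z0-9_]", text)


def preprocess(
    question: str,
    segment: Callable[[str], List[str]] = fallback_segment,
    stop_words: Optional[Set[str]] = None,
) -> List[str]:
    """Normalize, segment, and remove stop words only if a stop-word list is supplied."""
    tokens = segment(normalize(question))
    if stop_words:
        tokens = [t for t in tokens if t not in stop_words]
    return tokens
```

Passing the stop-word list as an optional argument mirrors the rule in the text: when no list exists, every segmented word is kept.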

S2, feature extraction

Feature representation and feature extraction are performed on the text data corresponding to the preprocessed natural language question. A machine learning model cannot use natural language directly; the text must first be expressed in numerical form, which is achieved through feature representation and extraction. In this embodiment, feature representation and feature extraction are performed on the preprocessed text using TF-IDF (Term Frequency-Inverse Document Frequency), the word-vector model Word2Vec, the text feature extraction function CountVectorizer, and the like, converting the preprocessed text into numerical form.

For text feature representation in this step, the words in the text data are converted into a term-frequency matrix, and the TF-IDF weight of each word is computed to obtain the weight of the word in the corresponding text data. This is a trade-off: selecting only the subset of features that can represent the text semantics both expresses the text better and reduces algorithmic complexity.

In this embodiment, TF-IDF combines TF and IDF as their product, and the calculation formula is as follows:

$$\mathrm{TFIDF}_{ij} = TF_{ij} \times IDF_{i}$$

where $TF_{ij}$ denotes the number of times the $i$-th feature item in the document set appears in document $j$. Note that TF is the term frequency, i.e. the number of times a word appears in a document; it is an important evaluation index because it considers not only whether a feature word appears but also how many times it appears.

IDF is the inverse document frequency. If a word appears in every document, it is a common word with no ability to distinguish between documents; if a word appears in only a few documents in the corpus, it has the ability to distinguish between documents. The expression is as follows:

$$IDF_{i} = \log \frac{N}{n_{i} + 0.01}$$

where $N$ represents the total number of documents in the document set and $n_{i}$ represents the number of documents containing feature word $i$; the purpose of adding 0.01 to $n_{i}$ is to prevent the IDF from going to infinity.
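As a worked illustration of the weighting described above, the following sketch computes TF-IDF weights for tokenized documents. It assumes the IDF takes the form log(N / (n + 0.01)) with a natural logarithm; the text does not fix the logarithm base, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per document, with weight = TF * IDF."""
    n_docs = len(docs)
    # n: number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)  # TF: raw count of the term in this document
        weights.append(
            {
                term: count * math.log(n_docs / (doc_freq[term] + 0.01))
                for term, count in term_freq.items()
            }
        )
    return weights


docs = [["a", "b", "a"], ["b", "c"]]
weights = tf_idf(docs)
```

Under this assumed form, a term that occurs in every document (here "b") receives a weight very close to zero, while terms confined to a single document receive larger weights, which matches the distinguishing role the text assigns to IDF.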

S3, information extraction

In this step, machine learning algorithms are used to extract key information, such as time, region name and index name, from the natural language question input by the user, and to identify the type of each piece of key information. For example, "Guangzhou city" is region information and "first half year" is time information.

In the present invention, the TF-IDF weight serves as the numerical representation of a word so that mathematical operations can be performed on it. The key elements are the time, region, index and similar terms in a sentence, which are recognized by the constructed entity recognition model. For example, the question "how much electricity was used in Guangzhou in the first half of the year?" is first segmented into words, each word is represented numerically with TF-IDF, and the entity recognition model is then run to recognize that "Guangzhou" is a region, "first half of the year" is a time, and "electricity usage" is an index.

Further, step S3 specifically includes:

S31, performing sequence labeling on the text data in the training data to obtain, for each token, the entity type of the segment to which it belongs and its position within that segment, thereby forming labeled data.

Sequence labeling marks which tokens in the training text belong to entity names and which do not. In this embodiment, the BIO (Begin, Inside, Outside) labeling scheme is adopted to label each token in the text data as "B-X", "I-X" or "O", where "B-X" indicates that the segment containing the token is of type X and the token is at the beginning of the segment, "I-X" indicates that the segment containing the token is of type X and the token is in the middle of the segment, and "O" indicates that the token does not belong to any type. "X" is the name of the entity type to be identified, such as the time entity "TIM", the regional entity "DIS", the dimension entity "DIM", and so on. Taking the regional entity as an example, "B-DIS" represents the beginning of a regional entity and "I-DIS" represents the middle of a regional entity. For example, the question "what is the electricity usage in Guangzhou in June?", labeled character by character in the original Chinese, yields the following result:

6 -> B-TIM

月 (month) -> I-TIM

份 (suffix) -> O

广 (Guang) -> B-DIS

州 (zhou) -> I-DIS

的 (of) -> O

用 (use) -> B-IDX

电 (electricity) -> I-IDX

量 (quantity) -> I-IDX

是 (is) -> O

多 (how) -> O

少 (much) -> O
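To show how a BIO-labeled sequence turns back into typed entities, here is a minimal decoding sketch. The function name `decode_bio` is hypothetical, and the Chinese characters are an assumed reconstruction of the example question ("6月份广州的用电量是多少") labeled above.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group consecutive B-X / I-X tokens into (entity_type, text) pairs."""
    entities: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the entity being built, if any.
        if current_type is not None:
            entities.append((current_type, "".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I-X tag that does not continue the open entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


tokens = list("6月份广州的用电量是多少")
tags = ["B-TIM", "I-TIM", "O", "B-DIS", "I-DIS", "O",
        "B-IDX", "I-IDX", "I-IDX", "O", "O", "O"]
entities = decode_bio(tokens, tags)
```

The decoded result contains one time entity, one regional entity and one index entity, which is the entity category information that later steps consume.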

S32, model training

Named entity recognition (NER) for natural language questions aims to extract the text segments of the required entities from text data; from a modeling perspective it is a sequence labeling problem: for each unit of the input sequence, a specific label is output. Among machine-learning-based methods, the conditional random field (CRF) is a mainstream model for NER; its objective function considers not only the state feature functions of the input but also the label transition feature functions. The advantage of a CRF is that it can use rich internal and contextual feature information when labeling a position. In a neural network model, by contrast, words are given a distributed representation: tokens are mapped from sparse one-hot representations to dense embeddings in a low-dimensional space, which enriches the word representation. The embedding sequence of the sentence is fed into a recurrent neural network (RNN), features are extracted automatically by the network without relying on elaborate feature engineering, and the label of each token is predicted by Softmax. The disadvantage is that each token is labeled independently, so the labels already predicted for preceding tokens cannot be used directly, and the predicted label sequence may be invalid.

The invention integrates the advantages of the two models by combining the neural network model with the conditional random field model to form the LSTM-CRF model shown in FIG. 2, which handles the NER problem well. The LSTM (Long Short-Term Memory network) is a special type of RNN that can learn long-distance dependencies. Unlike an ordinary RNN unit, which has only one tanh layer, the LSTM has three gates (an input gate, a forget gate and an output gate); it selectively forgets part of the history information, adds part of the current input information, and finally integrates them into the current state and generates an output state. The BiLSTM-CRF model applied to NER mainly consists of an embedding layer, a bidirectional LSTM layer and a final CRF layer, and is the most mainstream model among current deep-learning-based NER methods.

The sequence-labeled data is used as training data to train the BiLSTM-CRF model; after parameter optimization, the model is used to perform type recognition on newly input natural language questions to obtain entity category information.
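The benefit of the CRF layer over independent per-token prediction can be illustrated with a small Viterbi decoder. This is only a sketch of the decoding idea, not the trained BiLSTM-CRF model: the emission scores stand in for the outputs of the bidirectional LSTM, the transition scores stand in for the learned CRF parameters, and all numbers and names are made up for illustration.

```python
from typing import Dict, List, Tuple

NEG_INF = float("-inf")


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the label sequence with the highest total emission + transition score."""
    # best[t][y]: best score of any path that ends with label y at position t.
    best = [{y: emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = best[-1][prev] + transitions.get((prev, y), 0.0) + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-DIS", "I-DIS"]
# Hypothetical per-token scores for a two-token input.
emissions = [
    {"O": 1.0, "B-DIS": 0.8, "I-DIS": 0.1},
    {"O": 0.2, "B-DIS": 0.1, "I-DIS": 1.0},
]
# An entity cannot continue directly after "O", so that transition is forbidden.
transitions = {("O", "I-DIS"): NEG_INF}

greedy = [max(e, key=e.get) for e in emissions]
decoded = viterbi(emissions, transitions, labels)
```

In this toy input, choosing the best label for each token independently gives "O" followed by "I-DIS", which is not a valid BIO sequence, whereas decoding with the transition constraint yields "B-DIS" followed by "I-DIS".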

S4, intention recognition

An intention recognition model is constructed to judge the purpose of the data analysis requested by the user's natural language question, such as source data viewing, data filtering, multidimensional analysis, funnel analysis, comparative analysis, trend analysis, report analysis, correlation analysis and the like.

Further, step S4 specifically includes:

S41, labeling data

The purpose of intention recognition is to judge the data analysis intention of an input natural language question, that is, whether the question asks for a data query, a trend analysis or some other kind of analysis; in essence this is a text classification problem. Training an intention recognition model therefore means training a text classification model. First, the training data needs to be labeled with the intention type of each natural language question. For example, there are 8 intention types in total: source data viewing, data filtering, multidimensional analysis, funnel analysis, comparative analysis, trend analysis, report analysis and correlation analysis, which can simply be marked with the numbers 0, 1, 2, 3, 4, 5, 6 and 7.

S42, model training

The essence of intention recognition is text classification. After the input text is preprocessed and processed with TF-IDF to extract the numerical features of the words, a Support Vector Machine (SVM) is used to train a classification model, which becomes the intention recognition model. After training and optimization, the intention recognition model can perform intention recognition on the text data corresponding to a newly input natural language question: it predicts a probability for each intention type and selects the intention type with the highest probability as the analysis intention of the input natural language question.
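The final selection step, predicting a probability per intention type and choosing the most probable one, can be sketched as follows. The scores are hypothetical stand-ins for the outputs of the trained classifier, the softmax used to turn scores into probabilities is an assumption (the text does not say how the SVM's probabilities are obtained), and the eight listed intention types are assumed to be numbered 0 through 7 in the order given.

```python
import math
from typing import Dict, List, Tuple

# Assumed numbering of the intention types listed in the text.
INTENT_LABELS: Dict[int, str] = {
    0: "source data viewing",
    1: "data filtering",
    2: "multidimensional analysis",
    3: "funnel analysis",
    4: "comparative analysis",
    5: "trend analysis",
    6: "report analysis",
    7: "correlation analysis",
}


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1 (numerically stable)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_intent(scores: List[float]) -> Tuple[str, float]:
    """Return the most probable intention type and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return INTENT_LABELS[best], probs[best]


# Hypothetical classifier scores for one question, one score per intention type.
example_scores = [0.1, 0.2, 0.0, -1.0, 0.3, 2.5, 0.0, 0.4]
intent, probability = pick_intent(example_scores)
```

For these example scores the selected intention is trend analysis, since it has the largest score and therefore the largest probability.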

S5, analysis node generation

Combining the results of feature extraction, entity category identification and intention identification in the above steps S2-S4, the data source, analysis dimension, analysis index and analysis task which need to be analyzed in the natural language question input by the user can be obtained, and other additional data analysis information which may include time information, region information and the like. Combining the above information enables the system to automatically generate the analysis nodes.

Further, step S5 specifically includes:

S51, establishing the task data interface of the analysis node: a standard data interface is established for each analysis node task. For example, the input data of the trend analysis node task includes: data source name, analysis index, time range and screening condition; the input data of the distribution analysis node task includes: data source name, analysis index, analysis dimension and screening condition. By analogy, each analysis node task has its own input data according to its characteristics. Part of the input data is mandatory and part is optional: in the two tasks above, the screening condition is optional and the other input data is mandatory.
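One way to express such standard data interfaces is with typed records in which mandatory inputs have no default and optional inputs default to `None`. The class and field names below are hypothetical; only the listed inputs and their mandatory or optional status come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TrendAnalysisInput:
    """Input interface of the trend analysis node task."""
    data_source: str                         # mandatory: data source name
    index: str                               # mandatory: analysis index
    time_range: Tuple[str, str]              # mandatory: (start, end)
    screening_condition: Optional[str] = None  # optional


@dataclass
class DistributionAnalysisInput:
    """Input interface of the distribution analysis node task."""
    data_source: str                         # mandatory: data source name
    index: str                               # mandatory: analysis index
    dimension: str                           # mandatory: analysis dimension
    screening_condition: Optional[str] = None  # optional


# Example values are made up for illustration.
trend_input = TrendAnalysisInput(
    data_source="power_consumption",
    index="total_kwh",
    time_range=("2020-01-01", "2020-06-30"),
)
```

Omitting a mandatory field raises a `TypeError` at construction time, so an incomplete interface is rejected before any analysis node task is called.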

S52, data interface information generation: matching and indexing to obtain data source information, index information, dimension information, time information and region information based on entity category information obtained in an entity identification process and in combination with existing metadata information in a system; determining an analysis node task based on an analysis intention obtained in the intention identification process; the information is processed and then transmitted to the corresponding analysis node task, and the analysis node task is called to complete the generation and display of the analysis result, as shown in fig. 3. That is, the analysis intention information is used to determine which task node in the system is adopted (the task nodes are all built-in to the system), and each task node has a corresponding data interface; data source information, dimension information and index information in the sentence are obtained through the entity recognition process, and are matched with a data dictionary in the system to determine a data name, an index name and a dimension name; and the time information and the area information are subjected to regular standardization processing to serve as screening conditions of data. The above information is used as the input data of the task node, and the system can automatically generate the analysis node.
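The flow of step S52 can be sketched end to end under simplifying assumptions. The data dictionary contents, task functions and all names are hypothetical, and time and region entities are passed through as screening conditions without the standardization the text describes.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical data dictionary: entity type -> entity text -> (data source, standard name).
DATA_DICTIONARY: Dict[str, Dict[str, Tuple[str, str]]] = {
    "IDX": {"electricity usage": ("power_consumption", "total_kwh")},
    "DIM": {"user type": ("power_consumption", "user_type")},
}


def trend_task(**params: Any) -> Dict[str, Any]:
    """Stand-in for a built-in trend analysis task node; it only echoes its inputs."""
    return {"task": "trend analysis", **params}


def source_view_task(**params: Any) -> Dict[str, Any]:
    """Stand-in for a built-in source data viewing task node."""
    return {"task": "source data viewing", **params}


# The analysis intention decides which built-in task node is used.
TASKS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "trend analysis": trend_task,
    "source data viewing": source_view_task,
}


def generate_node(intent: str, entities: List[Tuple[str, str]]) -> Dict[str, Any]:
    """Match entities against the data dictionary and call the task chosen by the intent."""
    params: Dict[str, Any] = {"filters": {}}
    for entity_type, text in entities:
        if entity_type in DATA_DICTIONARY:
            source, name = DATA_DICTIONARY[entity_type][text]
            params["data_source"] = source
            params["index" if entity_type == "IDX" else "dimension"] = name
        else:
            # Time (TIM) and region (DIS) entities become screening conditions.
            params["filters"][entity_type] = text
    return TASKS[intent](**params)


node = generate_node(
    "trend analysis",
    [("DIS", "guangzhou"), ("TIM", "first half year"),
     ("IDX", "electricity usage"), ("DIM", "user type")],
)
```

Keeping the task nodes in a lookup table keyed by intention reflects the description that task nodes are built into the system and that the recognized intention only selects among them.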

Correspondingly, the invention also provides a questioning type analysis node generation system, which comprises:

the preprocessing module is used for realizing the step S1, and carrying out preprocessing and word segmentation on the input natural language question to obtain words after word segmentation;

the feature extraction module is used for implementing the step S2, performing feature representation and feature extraction on the text data corresponding to the preprocessed input natural language problem, and converting the text data into a numerical form;

an information extraction module, configured to implement step S3, extract key information from the input natural language question, perform type identification on the key information, and obtain entity category information;

an intention recognition module, configured to implement step S4, construct an intention recognition model, determine an analysis intention of the input natural language question, and complete intention recognition;

and an analysis node generation module for implementing the step S5, obtaining a data source, an analysis dimension, an analysis index, an analysis task and other additional data analysis information to be analyzed in the natural language problem by combining the processing results of the feature extraction module, the information extraction module and the intention identification module, and automatically generating an analysis node.

Based on the same inventive concept, the present invention also proposes a storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps S1-S5 of the inventive analysis node generation method.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A question type analysis node generation method is characterized by comprising the following steps:

2. The analysis node generation method according to claim 1, wherein step S5 includes:

3. The method according to claim 2, wherein the other additional data analysis information includes time information and area information.

4. The analysis node generation method of claim 2, wherein in step S51, the input data of the trend analysis node task includes a data source name, an analysis index, a time range, and a screening condition; and the input data of the distribution analysis node task includes a data source name, an analysis index, an analysis dimension and a screening condition.

5. The analysis node generation method according to claim 1, wherein step S4 includes:

S41, labeling the training data with the intention type of each natural language question;

S42, training a classification model to construct the intention recognition model, performing intention recognition on the text data corresponding to the input natural language question by using the intention recognition model, predicting a probability for each intention type, and selecting the intention type with the highest probability as the analysis intention of the input natural language question.

6. The analysis node generation method according to claim 1, wherein step S3 includes:

S31, performing sequence labeling on the text data in the training data to obtain, for each token, the entity type of the segment to which it belongs and its position within that segment, thereby forming labeled data;

S32, using the sequence-labeled data as training data to train a BiLSTM-CRF model, and using the model obtained after parameter optimization for type recognition of newly input natural language questions.

7. The method for generating an analysis node according to claim 6, wherein in step S31, each word element in the text data is labeled as "B-X", "I-X" or "O" in a BIO labeling manner, where "B-X" indicates that the segment where the word element is located belongs to the X type and the word element is at the beginning of the segment, "I-X" indicates that the segment where the word element is located belongs to the X type and the word element is at the middle position of the segment, and "O" indicates that the word element does not belong to any type; and "X" represents the name of the entity type to be identified.

8. A question-asked analysis node generation system, comprising:

9. The system according to claim 8, wherein the process of generating the analysis node by the analysis node generation module comprises:

making a data interface of an analysis node task, and making a standard data interface for each analysis node task;

generating data interface information, namely matching and indexing to obtain data source information, index information, dimension information and other additional data analysis information based on entity category information and in combination with metadata information; determining an analysis node task based on the analysis intent; and processing the data source information, the index information, the dimension information and other additional data analysis information, transmitting the processed data source information, the index information, the dimension information and the other additional data analysis information to corresponding analysis node tasks, and calling the analysis node tasks to complete the generation and display of analysis results.

10. Storage medium having stored thereon computer instructions, characterized in that said computer instructions, when executed by a processor, carry out the steps of the analysis node generation method according to any of claims 1-7.