CN112270189B - Question type analysis node generation method, system and storage medium - Google Patents

Question type analysis node generation method, system and storage medium

Info

Publication number
CN112270189B
CN112270189B (application CN202011259004.3A)
Authority
CN
China
Prior art keywords
analysis
information
data
natural language
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011259004.3A
Other languages
Chinese (zh)
Other versions
CN112270189A (en)
Inventor
姜磊
钟颖欣
辛岩
杨钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Brilliant Data Analytics Inc
Original Assignee
Brilliant Data Analytics Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Brilliant Data Analytics Inc filed Critical Brilliant Data Analytics Inc
Priority to CN202011259004.3A priority Critical patent/CN112270189B/en
Publication of CN112270189A publication Critical patent/CN112270189A/en
Application granted granted Critical
Publication of CN112270189B publication Critical patent/CN112270189B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to data analysis technology, and in particular to a method, system and storage medium for generating question-based analysis nodes, wherein the method comprises the following steps: preprocessing the input natural language question and performing word segmentation; performing feature representation and feature extraction on the text data corresponding to the preprocessed question and converting the text data into numerical form; extracting key information from the question and performing type recognition on the key information; constructing an intention recognition model and judging the analysis intention of the input question; and combining the results of feature extraction, type recognition and intention recognition to obtain the data source, analysis dimensions, analysis indexes, analysis task and other additional data analysis information required by the question, and automatically generating analysis nodes. With the invention, data analysis and exploration can be completed without knowing complex data structures and analysis methods, so that the data can be explored quickly to discover business problems.

Description

Question type analysis node generation method, system and storage medium
Technical Field
The present invention relates to data analysis technologies, and in particular to a method, a system and a storage medium for generating question-based analysis nodes.
Background
In existing question-based data analysis systems, the user generally poses a simple natural language question; the system parses it, automatically queries the database, and presents the result to the user as a visual answer. This only handles specific, relatively simple query-style questions. For example, when a user asks "what is the electricity consumption of a certain region this month", an existing question-based data analysis system aggregates that month's electricity consumption data in the database into a summary value and returns it to the user as a visual view or a specific number.
When the user's question is complex, such as "how has the electricity consumption of different user types in Guangzhou city trended over the last half year?", a conventional question-based data analysis system, which has only a data query function, cannot answer it, because the result corresponding to such a question is not stored directly in the database; the user's complex question-based analysis requirements therefore cannot be met.
In addition, if the user's question is not related to any analysis path in the shared library of the data analysis system, the user cannot obtain effective analysis-path recommendation feedback from the system. It is therefore necessary to provide a question-based analysis node generation method and system to address these problems of analysis-path-recommending data analysis systems.
Disclosure of Invention
The invention provides a question-based analysis node generation method, system and storage medium, which analyze the natural language question posed by the user, automatically extract data, select analysis functions and generate analysis nodes, so that the user can complete data analysis and exploration without knowing complex data structures and analysis methods, and thus quickly explore the data and discover business problems.
The question-based analysis node generation method according to the invention comprises the following steps:
S1, preprocessing the input natural language question and performing word segmentation to obtain the segmented words;
S2, performing feature representation and feature extraction on the text data corresponding to the preprocessed input question, converting the text data into numerical form;
S3, extracting key information from the input question and performing type recognition on the key information to obtain entity category information;
S4, constructing an intention recognition model, judging the analysis intention of the input question and completing intention recognition;
S5, combining the results of feature extraction, type recognition and intention recognition from steps S2-S4 to obtain the data source, analysis dimensions, analysis indexes, analysis task and other additional data analysis information required by the question, and automatically generating analysis nodes.
In a preferred embodiment, step S5 comprises:
S51, formulating an analysis node task data interface, defining a standard data interface for each analysis node task;
S52, generating data interface information: based on the entity category information, combining metadata information and matching indexes to obtain data source information, index information, dimension information and other additional data analysis information; determining the analysis node task based on the analysis intention; processing the data source information, index information, dimension information and other additional data analysis information, passing them to the corresponding analysis node task, and invoking the analysis node task to generate and display the analysis result.
The question-based analysis node generation system according to the present invention comprises:
a preprocessing module, used for preprocessing the input natural language question and performing word segmentation to obtain the segmented words;
a feature extraction module, used for performing feature representation and feature extraction on the text data corresponding to the preprocessed input question and converting the text data into numerical form;
an information extraction module, used for extracting key information from the input question and performing type recognition on the key information to obtain entity category information;
an intention recognition module, used for constructing an intention recognition model, judging the analysis intention of the input question and completing intention recognition;
an analysis node generation module, used for combining the processing results of the feature extraction module, the information extraction module and the intention recognition module to obtain the data source, analysis dimensions, analysis indexes, analysis task and other additional data analysis information required by the question, and automatically generating analysis nodes.
The storage medium of the present invention stores computer instructions which, when executed by a processor, implement the steps of the analysis node generation method of the invention.
Compared with the prior art, the invention has the notable effect that, from the input natural language question, it can automatically identify the user's data analysis intention, automatically match the source data and indexes, generate filtering conditions, determine the analysis dimensions and indexes, automatically generate analysis nodes and form an analysis path, thereby lowering the threshold for users to perform data analysis.
Drawings
FIG. 1 is a flowchart of an implementation of the question-based analysis node generation method of the present invention;
FIG. 2 is a schematic structural diagram of the LSTM-CRF model;
FIG. 3 is a flow diagram of analysis node generation.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solution of the present invention will be clearly and completely described below with reference to the embodiments and the accompanying drawings. It will be apparent that the embodiments described below are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, the question-based analysis node generation method of this embodiment specifically comprises the following steps:
s1, preprocessing and word segmentation processing are carried out on the input natural language problem, and words after the word segmentation processing are obtained.
Unified standardized preprocessing is carried out on the natural language problem input by a user, and full-half angle conversion, case-to-case conversion, special symbol cleaning removal and the like are carried out on text data corresponding to the input natural language problem; in addition, because of the specificity of Chinese, no obvious separator exists between words, and even the Chinese and English mixed text is not necessarily distinguished by separator, the word segmentation process is also needed to segment the whole sentence text string into individual words.
The step S1 specifically includes: loading text data corresponding to the input natural language problem into a memory for convenient processing; uniformly converting text data corresponding to an input natural language problem into lower case letters, half angles and simplified forms, and performing word segmentation by using a jieba word segmentation tool; judging the word list after word segmentation, if a stop word stock exists, eliminating the corresponding stop word, otherwise, keeping.
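A minimal sketch of this preprocessing step, assuming the jieba segmenter named above; the helper names and the stop-word set are illustrative, not taken from the patent, and the traditional-to-simplified conversion is assumed to be handled upstream:

```python
import jieba

def to_halfwidth(text):
    # Convert full-width characters (including the ideographic space) to half-width equivalents.
    return "".join(
        " " if ord(ch) == 0x3000
        else chr(ord(ch) - 0xFEE0) if 0xFF01 <= ord(ch) <= 0xFF5E
        else ch
        for ch in text
    )

def preprocess(question, stopwords=None):
    """Normalize an input question and return its segmented word list."""
    text = to_halfwidth(question).lower()           # full->half width, lower case
    tokens = [t.strip() for t in jieba.cut(text)]   # word segmentation with jieba
    tokens = [t for t in tokens if t]               # drop empty tokens
    if stopwords:                                   # remove stop words only if a lexicon exists
        tokens = [t for t in tokens if t not in stopwords]
    return tokens

print(preprocess("广州市近半年不同用户类型的用电趋势如何？", stopwords={"的"}))
```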
S2, feature extraction
Feature representation and feature extraction are performed on the text data corresponding to the preprocessed input question. A machine learning model cannot use natural language directly; the natural language must be expressed numerically, which is achieved through feature representation and feature extraction. In this embodiment, the preprocessed text is represented and its features extracted using TF-IDF (Term Frequency-Inverse Document Frequency), the Word2Vec conversion model, the CountVectorizer text feature extraction function, and the like, and is thereby converted into numerical form.
When text feature representation is performed in this step, the words in the text data are converted into a word-frequency matrix and the TF-IDF weight of each word is computed to obtain the weight of the word in the corresponding text; this is a trade-off process: only the features that best represent the text semantics are selected, so that the text is expressed well while the algorithmic complexity is reduced.
in this embodiment, TF-IDF is a combination of TF and IDF, and the calculation formula is as follows:
therein, T F ij Representing the number of times the ith feature term in the document set appears in document j. It should be noted that: TF is word frequency, which refers to the number of times a word appears in a document, and is importantThe index is evaluated because it considers not only whether the feature word appears but also the number of occurrences.
IDF is the inverse document frequency, which considers that if a word appears in every document, it is a common word, without the ability to distinguish between categories, and if a word appears in only a few documents in the corpus, it is an ability to distinguish between categories. The expression is:
where N represents the total number of documents in the document set, N j Representing the number of documents containing the feature word j, n j The meaning of +0.01 is to prevent IDF from going to infinity.
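The two formulas above can be illustrated with a short sketch; the function name is illustrative, and the natural logarithm is an assumption since the description does not fix the log base:

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: TF-IDF weight} dict per document."""
    N = len(documents)
    # n_i: number of documents containing feature term i
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)   # TF_ij: raw count of term i in document j
        weights.append({
            term: count * math.log(N / (doc_freq[term] + 0.01))   # IDF_i = log(N / (n_i + 0.01))
            for term, count in tf.items()
        })
    return weights

print(tf_idf([["广州", "用电量"], ["广州", "趋势"], ["销售", "趋势"]]))
```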
S3, information extraction
Machine learning algorithms are used to extract key information, such as time, region names and index names, from the natural language question entered by the user; at the same time, type recognition is performed on this key information. For example, "Guangzhou city" is region information and "the last half year" is time information.
In the present invention, the TF-IDF weights serve as numerical representations of the words so that they can be used in mathematical operations. The key elements are the time, region, index and similar terms in a sentence, which are recognized by the constructed entity recognition model. For example, for "What is the electricity consumption of Guangzhou city in the last half year?", the sentence is segmented into words, the words are represented numerically with TF-IDF, and the entity recognition model then recognizes "Guangzhou city" as a region, "the last half year" as a time, and "electricity consumption" as an index.
Further, the step S3 specifically includes:
S31, sequence labeling is performed on the text data in the training data to obtain, for each token, the entity type of the segment it belongs to and the token's position within that segment, forming the labeled data.
Text data in the training data is sequence-labeled to mark which tokens are entity names and which are not. In this embodiment the BIO (Begin, Inside, Outside) labeling scheme is adopted: each token in the text data is labeled "B-X", "I-X" or "O", where "B-X" indicates that the segment containing the token is of type X and the token is at the beginning of the segment, "I-X" indicates that the segment is of type X and the token is in the middle of the segment, and "O" indicates that the token belongs to no type; "X" is the name of the entity type to be recognized, such as the time entity "TIM", the region entity "DIS" and the dimension entity "DIM". Taking the region entity as an example, "B-DIS" marks the start of a region entity and "I-DIS" its interior. For example, for the question "6月份广州市用电量是多少?" ("What was the electricity consumption of Guangzhou in June?"), the result of sequence labeling is:
6 -> B-TIM
月 -> I-TIM
份 -> O
广 -> B-DIS
州 -> I-DIS
市 -> O
用 -> B-IDX
电 -> I-IDX
量 -> I-IDX
是 -> O
多 -> O
少 -> O
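To illustrate how such character-level BIO tags are turned back into the entity category information produced by step S3, a small sketch follows; the helper name is hypothetical, and the tag sequence is the example above:

```python
def bio_to_entities(chars, tags):
    """Collect (entity_type, text) pairs from character-level BIO tags."""
    entities, current_type, current_chars = [], None, []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):                       # a new entity fragment begins
            if current_type:
                entities.append((current_type, "".join(current_chars)))
            current_type, current_chars = tag[2:], [ch]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_chars.append(ch)                   # continue the current fragment
        else:                                          # "O" or an inconsistent tag ends the fragment
            if current_type:
                entities.append((current_type, "".join(current_chars)))
            current_type, current_chars = None, []
    if current_type:
        entities.append((current_type, "".join(current_chars)))
    return entities

chars = list("6月份广州市用电量是多少")
tags = ["B-TIM", "I-TIM", "O", "B-DIS", "I-DIS", "O",
        "B-IDX", "I-IDX", "I-IDX", "O", "O", "O"]
print(bio_to_entities(chars, tags))   # [('TIM', '6月'), ('DIS', '广州'), ('IDX', '用电量')]
```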
S32, model training
The goal of named entity recognition (NER) on a natural language question is to extract from the text the segments corresponding to the required entities; from the model's perspective this is a sequence labeling problem: for each element of the input sequence, a particular tag is output. Among machine learning approaches, the conditional random field (CRF, Conditional Random Field) is the dominant model for NER; its objective function considers not only the state features of the input but also label-transition features. The advantage of the CRF is that it can exploit rich internal and contextual feature information when labeling a position. In neural network models, the distributed representation of words maps each token from a sparse one-hot representation to a dense embedding in a low-dimensional space, enriching the word representation; the embedding sequence of the sentence is fed into a recurrent neural network (RNN), the network extracts features automatically, and a Softmax layer predicts the tag of each token, without relying on complex feature engineering. The drawback is that the labeling of each token is made independently and the previously predicted tags cannot be used directly, so the predicted tag sequence may be invalid.
The invention combines the advantages of the two models, joining a neural network model and a conditional random field model into the LSTM-CRF model, which solves the NER problem well, as shown in FIG. 2. LSTM (Long Short-Term Memory) networks are a special type of RNN that can learn long-range dependencies. Unlike an ordinary RNN unit, which has only a single tanh layer, an LSTM has three gate structures (input gate, forget gate and output gate); it selectively forgets part of the historical information, adds part of the current input information, and finally integrates them into the current state and produces the output state. The BiLSTM-CRF model applied to NER mainly consists of an embedding layer, a bidirectional LSTM layer and a final CRF layer, and is currently the most mainstream model among deep-learning-based NER methods.
The sequence-labeled data is used as training data to train the BiLSTM-CRF model; the model obtained after parameter optimization is used for type recognition of newly input natural language questions, yielding entity category information.
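A condensed BiLSTM-CRF sketch in the spirit of FIG. 2; it assumes the third-party pytorch-crf package for the CRF layer, and the vocabulary, tag set and layer sizes are placeholders rather than the parameters of this embodiment:

```python
import torch.nn as nn
from torchcrf import CRF   # third-party pytorch-crf package (assumed available)

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, tagset_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, bidirectional=True, batch_first=True)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)   # per-token emission scores
        self.crf = CRF(tagset_size, batch_first=True)          # learns tag-transition scores

    def loss(self, tokens, tags, mask):
        # tokens, tags: LongTensor [batch, seq_len]; mask: BoolTensor marking real positions
        emissions = self.hidden2tag(self.lstm(self.embedding(tokens))[0])
        return -self.crf(emissions, tags, mask=mask)           # negative log-likelihood

    def decode(self, tokens, mask):
        emissions = self.hidden2tag(self.lstm(self.embedding(tokens))[0])
        return self.crf.decode(emissions, mask=mask)           # best tag sequence per sentence
```

During training, loss() would be minimized over the labeled sequences from step S31; decode() then yields the BIO tag sequence for a newly input question.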
S4, intention recognition
An intention recognition model is constructed to judge the purpose for which the user poses the natural language question for data analysis, such as source data viewing, data filtering, multidimensional analysis, funnel analysis, comparative analysis, trend analysis, report analysis and correlation analysis.
Further, the step S4 specifically includes:
s41, data annotation
The purpose of intention recognition is to judge whether the intention behind the input natural language question is to query data, to perform trend analysis, or to carry out some other analysis; in essence this is a text classification problem, so the intention recognition model trained here is a text classification model. First the training data must be annotated with the intention type of each natural language question. For example, the intention types may include source data viewing, data filtering, multidimensional analysis, funnel analysis, comparative analysis, trend analysis, report analysis and correlation analysis, and can simply be labeled with consecutive integers 0, 1, 2, and so on.
S42, model training
Intention recognition is in essence text classification. After the input text is preprocessed, numerical features of the words are extracted with TF-IDF and a classification model is trained with a support vector machine (SVM, Support Vector Machine); this classification model constitutes the intention recognition model. After training and optimization, the intention recognition model performs intention recognition on the text data corresponding to a newly input natural language question, predicts a probability for each intention type, and selects the intention type with the highest probability as the intention type of the input question.
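A minimal sketch of such an SVM-based intention recognition model using scikit-learn; the toy training questions and the intention ids (following the example numbering above) are made up solely to keep the snippet self-contained:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Segmented questions joined by spaces, with illustrative intention ids:
# 0 = source data viewing, 5 = trend analysis.
train_texts = [
    "查看 用户 明细 数据", "查看 源 数据", "显示 数据 明细",
    "近半年 用电量 趋势", "用电 趋势 如何", "最近 一年 销售 趋势",
]
train_labels = [0, 0, 0, 5, 5, 5]

intent_model = make_pipeline(
    TfidfVectorizer(),                        # numerical TF-IDF features of the segmented text
    SVC(kernel="linear", probability=True),   # SVM classifier with probability estimates
)
intent_model.fit(train_texts, train_labels)

probs = intent_model.predict_proba(["近半年 各 地区 用电量 趋势"])[0]
best = intent_model.classes_[probs.argmax()]  # intention type with the highest probability
print(best, probs)
```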
S5, generating analysis nodes
By combining the results of feature extraction, entity category recognition and intention recognition from steps S2-S4, the data source, analysis dimensions, analysis indexes and analysis task required by the user's question, together with any other additional data analysis information it may contain (such as time information and region information), are obtained. With this information combined, the system can automatically generate analysis nodes.
Further, the step S5 specifically includes:
s51, formulating an analysis node task data interface: a data interface for formulating a standard for each analysis node task, for example, trend analysis node task input data includes: data source name, analysis index, time range and screening condition; the distribution analysis node task input data includes: data source name, analysis index, analysis dimension, screening condition. Similarly, each analysis node task has corresponding input data according to the characteristics of the analysis node task. Wherein part of the input data is mandatory and part is optional. As the filtering conditions are optional among the two tasks described above, other input data is necessary.
S52, generating the data interface information: based on the entity category information obtained by the entity recognition process and combined with the metadata information already in the system, the indexes are matched to obtain the data source information, index information, dimension information, time information and region information; the analysis node task is determined from the analysis intention obtained by the intention recognition process; this information is processed and passed to the corresponding analysis node task, which is invoked to generate and display the analysis result, as shown in FIG. 3. That is, the intention information determines which task nodes of the system are used (the task nodes are all built into the system), and each task node has a corresponding data interface; the data source, dimension and index information in the sentence is obtained by the entity recognition process and matched against the data dictionary in the system to determine the data name, index name and dimension name; the time information and region information undergo regularized normalization and serve as screening conditions on the data. This information is used as the input data of the task node, and the system automatically generates the analysis node.
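A rough sketch of this assembly-and-dispatch step; the task-node registry, data dictionary and field names are hypothetical, and the task nodes are stubbed out with print statements:

```python
# Hypothetical registry mapping recognized intentions to built-in task nodes.
TASK_NODES = {
    "trend_analysis": lambda cfg: print("run trend node with", cfg),
    "distribution_analysis": lambda cfg: print("run distribution node with", cfg),
}

def generate_analysis_node(intent, entities, data_dictionary):
    """Assemble the data-interface information and invoke the matching task node."""
    cfg = {"filters": {}}
    for ent_type, text in entities:
        if ent_type == "IDX":
            cfg["index"] = data_dictionary.get(text, text)      # match index name against the data dictionary
        elif ent_type == "DIM":
            cfg.setdefault("dimensions", []).append(data_dictionary.get(text, text))
        elif ent_type in ("TIM", "DIS"):
            cfg["filters"][ent_type] = text                     # time/region become screening conditions
    cfg["data_source"] = data_dictionary.get("default_source", "unknown")
    TASK_NODES[intent](cfg)                                     # call the analysis node task

generate_analysis_node(
    "trend_analysis",
    [("TIM", "近半年"), ("DIS", "广州市"), ("IDX", "用电量")],
    {"用电量": "electricity_consumption", "default_source": "power_usage_table"},
)
```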
Correspondingly, the invention also provides a question-based analysis node generation system, which comprises:
a preprocessing module, implementing step S1, used for preprocessing the input natural language question and performing word segmentation to obtain the segmented words;
a feature extraction module, implementing step S2, used for performing feature representation and feature extraction on the text data corresponding to the preprocessed input question and converting the text data into numerical form;
an information extraction module, implementing step S3, used for extracting key information from the input question and performing type recognition on the key information to obtain entity category information;
an intention recognition module, implementing step S4, used for constructing an intention recognition model, judging the analysis intention of the input question and completing intention recognition;
an analysis node generation module, implementing step S5, used for combining the processing results of the feature extraction module, the information extraction module and the intention recognition module to obtain the data source, analysis dimensions, analysis indexes, analysis task and other additional data analysis information required by the question, and automatically generating analysis nodes.
Based on the same inventive concept, the invention also proposes a storage medium storing computer instructions which, when executed by a processor, implement steps S1-S5 of the analysis node generation method of the invention.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (5)

1. A question-based analysis node generation method, characterized by comprising the following steps:
S1, preprocessing an input natural language question and performing word segmentation to obtain the segmented words;
S2, performing feature representation and feature extraction on the text data corresponding to the preprocessed input question, converting the text data into numerical form;
S3, extracting key information from the input question and performing type recognition on the key information to obtain entity category information;
S4, constructing an intention recognition model, judging the analysis intention of the input question and completing intention recognition;
S5, combining the results of feature extraction, type recognition and intention recognition from steps S2-S4 to obtain the data source, analysis dimensions, analysis indexes, analysis task and other additional data analysis information required by the question, and automatically generating analysis nodes;
the step S3 comprises the following steps:
S31, performing sequence labeling on the text data in the training data to obtain, for each token, the entity type of the segment it belongs to and its position within that segment, thereby forming the labeled data;
the text data in the training data is sequence-labeled with the BIO labeling scheme to mark which tokens are entity names and which are not: each token in the text data is labeled "B-X", "I-X" or "O", where "B-X" indicates that the segment containing the token is of type X and the token is at the beginning of the segment, "I-X" indicates that the segment is of type X and the token is in the middle of the segment, "O" indicates that the token belongs to no type, and "X" is the name of the entity type to be recognized, the time entity being "TIM", the region entity "DIS" and the dimension entity "DIM"; S32, training a BiLSTM-CRF model on the sequence-labeled data as training data, and using the model obtained after parameter optimization for type recognition of newly input natural language questions;
the step S4 includes:
S41, first annotating the training data, marking the intention type of each natural language question;
S42, training the classification model and constructing the intention recognition model, using the intention recognition model to perform intention recognition on the text data corresponding to the input natural language question, predicting a probability for each intention type, and selecting the intention type with the highest probability as the intention type of the input question;
the step S5 comprises the following steps:
S51, formulating an analysis node task data interface, defining a standard data interface for each analysis node task;
S52, generating data interface information: based on the entity category information, combining metadata information and matching indexes to obtain data source information, index information, dimension information and other additional data analysis information; determining the analysis node task based on the analysis intention; processing the data source information, index information, dimension information and other additional data analysis information, passing them to the corresponding analysis node task, and invoking the analysis node task to generate and display the analysis result.
2. The method of claim 1, wherein the additional data analysis information includes time information and region information.
3. The analysis node generation method according to claim 1, wherein in step S51, the trend analysis node task input data includes a data source name, an analysis index, a time range, and a screening condition; the distribution analysis node task input data comprises a data source name, an analysis index, an analysis dimension and screening conditions.
4. A question-based analysis node generation system, comprising:
a preprocessing module, used for preprocessing the input natural language question and performing word segmentation to obtain the segmented words;
a feature extraction module, used for performing feature representation and feature extraction on the text data corresponding to the preprocessed input question and converting the text data into numerical form;
an information extraction module, used for extracting key information from the input question and performing type recognition on the key information to obtain entity category information;
an intention recognition module, used for constructing an intention recognition model, judging the analysis intention of the input question and completing intention recognition; an analysis node generation module, used for combining the processing results of the feature extraction module, the information extraction module and the intention recognition module to obtain the data source, analysis dimensions, analysis indexes, analysis task and other additional data analysis information required by the question, and automatically generating analysis nodes;
wherein the type recognition of the key information by the information extraction module comprises the following steps:
performing sequence labeling on the text data in the training data to obtain, for each token, the entity type of the segment it belongs to and its position within that segment, thereby forming the labeled data; training a BiLSTM-CRF model on the sequence-labeled data as training data, and using the model obtained after parameter optimization for type recognition of newly input natural language questions; the text data in the training data is sequence-labeled with the BIO labeling scheme to mark which tokens are entity names and which are not: each token in the text data is labeled "B-X", "I-X" or "O", where "B-X" indicates that the segment containing the token is of type X and the token is at the beginning of the segment, "I-X" indicates that the segment is of type X and the token is in the middle of the segment, "O" indicates that the token belongs to no type, and "X" is the name of the entity type to be recognized, the time entity being "TIM", the region entity "DIS" and the dimension entity "DIM";
the intention recognition by the intention recognition module comprises the following steps: first annotating the training data, marking the intention type of each natural language question; training and constructing a classification model as the intention recognition model, using the intention recognition model to perform intention recognition on the text data corresponding to the input natural language question, predicting a probability for each intention type, and selecting the intention type with the highest probability as the intention type of the input question;
the generation of the analysis node by the analysis node generation module comprises the following steps: formulating an analysis node task data interface, defining a standard data interface for each analysis node task; generating data interface information, combining metadata information based on the entity category information and matching indexes to obtain data source information, index information, dimension information and other additional data analysis information; determining the analysis node task based on the analysis intention; processing the data source information, index information, dimension information and other additional data analysis information, passing them to the corresponding analysis node task, and invoking the analysis node task to generate and display the analysis result.
5. A storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the analysis node generation method of any of claims 1-3.
CN202011259004.3A 2020-11-12 2020-11-12 Question type analysis node generation method, system and storage medium Active CN112270189B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011259004.3A CN112270189B (en) 2020-11-12 2020-11-12 Question type analysis node generation method, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011259004.3A CN112270189B (en) 2020-11-12 2020-11-12 Question type analysis node generation method, system and storage medium

Publications (2)

Publication Number Publication Date
CN112270189A CN112270189A (en) 2021-01-26
CN112270189B true CN112270189B (en) 2023-07-18

Family

ID=74339857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011259004.3A Active CN112270189B (en) 2020-11-12 2020-11-12 Question type analysis node generation method, system and storage medium

Country Status (1)

Country Link
CN (1) CN112270189B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114528404A (en) * 2022-02-18 2022-05-24 浪潮卓数大数据产业发展有限公司 Method and device for identifying provincial and urban areas

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050032937A (en) * 2003-10-02 2005-04-08 한국전자통신연구원 Method for automatically creating a question and indexing the question-answer by language-analysis and the question-answering method and system
WO2019153522A1 (en) * 2018-02-09 2019-08-15 卫盈联信息技术(深圳)有限公司 Intelligent interaction method, electronic device, and storage medium
CN111709235A (en) * 2020-05-28 2020-09-25 上海发电设备成套设计研究院有限责任公司 Text data statistical analysis system and method based on natural language processing

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103226606B (en) * 2013-04-28 2016-08-10 浙江核新同花顺网络信息股份有限公司 Inquiry choosing method and system
CN107545349A (en) * 2016-06-28 2018-01-05 国网天津市电力公司 A kind of Data Quality Analysis evaluation model towards electric power big data
CN108108426B (en) * 2017-12-15 2021-05-07 杭州汇数智通科技有限公司 Understanding method and device for natural language question and electronic equipment
CN110309400A (en) * 2018-02-07 2019-10-08 鼎复数据科技(北京)有限公司 A kind of method and system that intelligent Understanding user query are intended to
CN110968663B (en) * 2018-09-30 2023-05-23 北京国双科技有限公司 Answer display method and device of question-answering system
CN110210036A (en) * 2019-06-05 2019-09-06 上海云绅智能科技有限公司 A kind of intension recognizing method and device
CN110413746B (en) * 2019-06-25 2024-02-09 创新先进技术有限公司 Method and device for identifying intention of user problem
CN110334347A (en) * 2019-06-27 2019-10-15 腾讯科技(深圳)有限公司 Information processing method, relevant device and storage medium based on natural language recognition
CN111026941A (en) * 2019-10-28 2020-04-17 江苏普旭软件信息技术有限公司 Intelligent query method for demonstration and evaluation of equipment system
CN111125145A (en) * 2019-11-26 2020-05-08 复旦大学 Automatic system for acquiring database information through natural language

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050032937A (en) * 2003-10-02 2005-04-08 한국전자통신연구원 Method for automatically creating a question and indexing the question-answer by language-analysis and the question-answering method and system
WO2019153522A1 (en) * 2018-02-09 2019-08-15 卫盈联信息技术(深圳)有限公司 Intelligent interaction method, electronic device, and storage medium
CN111709235A (en) * 2020-05-28 2020-09-25 上海发电设备成套设计研究院有限责任公司 Text data statistical analysis system and method based on natural language processing

Also Published As

Publication number Publication date
CN112270189A (en) 2021-01-26

Similar Documents

Publication Publication Date Title
Shelar et al. Named entity recognition approaches and their comparison for custom ner model
Jung Semantic vector learning for natural language understanding
CN113011533A (en) Text classification method and device, computer equipment and storage medium
Yi et al. Topic modeling for short texts via word embedding and document correlation
Tagarelli et al. Unsupervised law article mining based on deep pre-trained language representation models with application to the Italian civil code
CN112270188B (en) Questioning type analysis path recommendation method, system and storage medium
CN111737560B (en) Content search method, field prediction model training method, device and storage medium
CN111344695B (en) Facilitating domain and client specific application program interface recommendations
Ali et al. Named entity recognition using deep learning: A review
CN114997288A (en) Design resource association method
Sahnoun et al. Event detection based on open information extraction and ontology
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN114491079A (en) Knowledge graph construction and query method, device, equipment and medium
CN114239828A (en) Supply chain affair map construction method based on causal relationship
CN112270189B (en) Question type analysis node generation method, system and storage medium
Thielmann et al. Coherence based document clustering
CN116804998A (en) Medical term retrieval method and system based on medical semantic understanding
Zhang et al. Combining the attention network and semantic representation for Chinese verb metaphor identification
CN116975271A (en) Text relevance determining method, device, computer equipment and storage medium
Girija et al. A comparative review on approaches of aspect level sentiment analysis
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division
Chen et al. Multi-modal multi-layered topic classification model for social event analysis
Kuttiyapillai et al. Improved text analysis approach for predicting effects of nutrient on human health using machine learning techniques
CN111061939A (en) Scientific research academic news keyword matching recommendation method based on deep learning
Hao Naive Bayesian Prediction of Japanese Annotated Corpus for Textual Semantic Word Formation Classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant