CN112270189B - Question type analysis node generation method, system and storage medium - Google Patents

Question type analysis node generation method, system and storage medium

Info

Publication number
CN112270189B
CN112270189B (application CN202011259004.3A)
Authority
CN
China
Prior art keywords
analysis
information
data
natural language
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011259004.3A
Other languages
Chinese (zh)
Other versions
CN112270189A (en)
Inventor
姜磊
钟颖欣
辛岩
杨钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Brilliant Data Analytics Inc
Original Assignee
Brilliant Data Analytics Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Brilliant Data Analytics Inc filed Critical Brilliant Data Analytics Inc
Priority to CN202011259004.3A priority Critical patent/CN112270189B/en
Publication of CN112270189A publication Critical patent/CN112270189A/en
Application granted granted Critical
Publication of CN112270189B publication Critical patent/CN112270189B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to data analysis technology, and in particular to a method, system and storage medium for generating question-based analysis nodes, wherein the method comprises the following steps: preprocessing the input natural language question and performing word segmentation; performing feature representation and feature extraction on the text data corresponding to the preprocessed question and converting the text data into numerical form; extracting key information from the question and performing type recognition on the key information; constructing an intention recognition model and judging the analysis intention of the input question; and combining the results of feature extraction, type recognition and intention recognition to obtain the data source, analysis dimensions, analysis indexes, analysis task and other additional data analysis information required by the question, and automatically generating analysis nodes. With the invention, data analysis and exploration can be completed without knowing complex data structures and analysis methods, so that the data can be explored quickly to discover business problems.

Description

Question type analysis node generation method, system and storage medium
Technical Field
The present invention relates to data analysis technologies, and in particular to a method, a system and a storage medium for generating question-based analysis nodes.
Background
In existing question-based data analysis systems, the user generally poses a simple natural language question; the system parses it, automatically queries the database, and presents the result to the user as a visual answer. This only handles specific, relatively simple query-style questions. For example, when a user asks "what is the electricity consumption of a certain region this month", an existing question-based data analysis system aggregates that month's electricity consumption data in the database into a summary value and returns it to the user as a visual view or a specific number.
When the user's question is complex, such as "how has the electricity consumption of different user types in Guangzhou city trended over the last half year?", a conventional question-based data analysis system, which has only a data query function, cannot answer it, because the result corresponding to such a question is not stored directly in the database; the user's complex question-based analysis requirements therefore cannot be met.
In addition, if the user's question is not related to any analysis path in the shared library of the data analysis system, the user cannot obtain effective analysis-path recommendation feedback from the system. It is therefore necessary to provide a question-based analysis node generation method and system to address these problems of analysis-path-recommending data analysis systems.
Disclosure of Invention
The invention provides a question-based analysis node generation method, system and storage medium, which analyze the natural language question posed by the user, automatically extract data, select analysis functions and generate analysis nodes, so that the user can complete data analysis and exploration without knowing complex data structures and analysis methods, and thus quickly explore the data and discover business problems.
The question-based analysis node generation method according to the invention comprises the following steps:
S1, preprocessing the input natural language question and performing word segmentation to obtain the segmented words;
S2, performing feature representation and feature extraction on the text data corresponding to the preprocessed input question, converting the text data into numerical form;
S3, extracting key information from the input question and performing type recognition on the key information to obtain entity category information;
S4, constructing an intention recognition model, judging the analysis intention of the input question and completing intention recognition;
S5, combining the results of feature extraction, type recognition and intention recognition from steps S2-S4 to obtain the data source, analysis dimensions, analysis indexes, analysis task and other additional data analysis information required by the question, and automatically generating analysis nodes.
In a preferred embodiment, step S5 comprises:
S51, formulating an analysis node task data interface, defining a standard data interface for each analysis node task;
S52, generating data interface information: based on the entity category information, combining metadata information and matching indexes to obtain data source information, index information, dimension information and other additional data analysis information; determining the analysis node task based on the analysis intention; processing the data source information, index information, dimension information and other additional data analysis information, passing them to the corresponding analysis node task, and invoking the analysis node task to generate and display the analysis result.
The question-based analysis node generation system according to the present invention comprises:
a preprocessing module, used for preprocessing the input natural language question and performing word segmentation to obtain the segmented words;
a feature extraction module, used for performing feature representation and feature extraction on the text data corresponding to the preprocessed input question and converting the text data into numerical form;
an information extraction module, used for extracting key information from the input question and performing type recognition on the key information to obtain entity category information;
an intention recognition module, used for constructing an intention recognition model, judging the analysis intention of the input question and completing intention recognition;
an analysis node generation module, used for combining the processing results of the feature extraction module, the information extraction module and the intention recognition module to obtain the data source, analysis dimensions, analysis indexes, analysis task and other additional data analysis information required by the question, and automatically generating analysis nodes.
The storage medium of the present invention stores computer instructions which, when executed by a processor, implement the steps of the analysis node generation method of the invention.
Compared with the prior art, the invention has the notable effect that, from the input natural language question, it can automatically identify the user's data analysis intention, automatically match the source data and indexes, generate filtering conditions, determine the analysis dimensions and indexes, automatically generate analysis nodes and form an analysis path, thereby lowering the threshold for users to perform data analysis.
Drawings
FIG. 1 is a flowchart of an implementation of the question-based analysis node generation method of the present invention;
FIG. 2 is a schematic structural diagram of the LSTM-CRF model;
FIG. 3 is a flow diagram of analysis node generation.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solution of the present invention will be clearly and completely described below with reference to the embodiments and the accompanying drawings. It will be apparent that the embodiments described below are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, the question-based analysis node generation method of this embodiment specifically comprises the following steps:
s1, preprocessing and word segmentation processing are carried out on the input natural language problem, and words after the word segmentation processing are obtained.
Unified standardized preprocessing is carried out on the natural language problem input by a user, and full-half angle conversion, case-to-case conversion, special symbol cleaning removal and the like are carried out on text data corresponding to the input natural language problem; in addition, because of the specificity of Chinese, no obvious separator exists between words, and even the Chinese and English mixed text is not necessarily distinguished by separator, the word segmentation process is also needed to segment the whole sentence text string into individual words.
The step S1 specifically includes: loading text data corresponding to the input natural language problem into a memory for convenient processing; uniformly converting text data corresponding to an input natural language problem into lower case letters, half angles and simplified forms, and performing word segmentation by using a jieba word segmentation tool; judging the word list after word segmentation, if a stop word stock exists, eliminating the corresponding stop word, otherwise, keeping.
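A minimal sketch of this preprocessing step, assuming the jieba segmenter named above; the helper names and the stop-word set are illustrative, not taken from the patent, and the traditional-to-simplified conversion is assumed to be handled upstream:

```python
import jieba

def to_halfwidth(text):
    # Convert full-width characters (including the ideographic space) to half-width equivalents.
    return "".join(
        " " if ord(ch) == 0x3000
        else chr(ord(ch) - 0xFEE0) if 0xFF01 <= ord(ch) <= 0xFF5E
        else ch
        for ch in text
    )

def preprocess(question, stopwords=None):
    """Normalize an input question and return its segmented word list."""
    text = to_halfwidth(question).lower()           # full->half width, lower case
    tokens = [t.strip() for t in jieba.cut(text)]   # word segmentation with jieba
    tokens = [t for t in tokens if t]               # drop empty tokens
    if stopwords:                                   # remove stop words only if a lexicon exists
        tokens = [t for t in tokens if t not in stopwords]
    return tokens

print(preprocess("广州市近半年不同用户类型的用电趋势如何？", stopwords={"的"}))
```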
S2, feature extraction
Feature representation and feature extraction are performed on the text data corresponding to the preprocessed input question. A machine learning model cannot use natural language directly; the natural language must be expressed numerically, which is achieved through feature representation and feature extraction. In this embodiment, the preprocessed text is represented and its features extracted using TF-IDF (Term Frequency-Inverse Document Frequency), the Word2Vec conversion model, the CountVectorizer text feature extraction function, and the like, and is thereby converted into numerical form.
When text feature representation is performed in this step, the words in the text data are converted into a word-frequency matrix and the TF-IDF weight of each word is computed to obtain the weight of the word in the corresponding text; this is a trade-off process: only the features that best represent the text semantics are selected, so that the text is expressed well while the algorithmic complexity is reduced.
in this embodiment, TF-IDF is a combination of TF and IDF, and the calculation formula is as follows:
therein, T F ij Representing the number of times the ith feature term in the document set appears in document j. It should be noted that: TF is word frequency, which refers to the number of times a word appears in a document, and is importantThe index is evaluated because it considers not only whether the feature word appears but also the number of occurrences.
IDF is the inverse document frequency, which considers that if a word appears in every document, it is a common word, without the ability to distinguish between categories, and if a word appears in only a few documents in the corpus, it is an ability to distinguish between categories. The expression is:
where N represents the total number of documents in the document set, N j Representing the number of documents containing the feature word j, n j The meaning of +0.01 is to prevent IDF from going to infinity.
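The two formulas above can be illustrated with a short sketch; the function name is illustrative, and the natural logarithm is an assumption since the description does not fix the log base:

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: TF-IDF weight} dict per document."""
    N = len(documents)
    # n_i: number of documents containing feature term i
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)   # TF_ij: raw count of term i in document j
        weights.append({
            term: count * math.log(N / (doc_freq[term] + 0.01))   # IDF_i = log(N / (n_i + 0.01))
            for term, count in tf.items()
        })
    return weights

print(tf_idf([["广州", "用电量"], ["广州", "趋势"], ["销售", "趋势"]]))
```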
S3, information extraction
Machine learning algorithms are used to extract key information, such as time, region names and index names, from the natural language question entered by the user; at the same time, type recognition is performed on this key information. For example, "Guangzhou city" is region information and "the last half year" is time information.
In the present invention, the TF-IDF weights serve as numerical representations of the words so that they can be used in mathematical operations. The key elements are the time, region, index and similar terms in a sentence, which are recognized by the constructed entity recognition model. For example, for "What is the electricity consumption of Guangzhou city in the last half year?", the sentence is segmented into words, the words are represented numerically with TF-IDF, and the entity recognition model then recognizes "Guangzhou city" as a region, "the last half year" as a time, and "electricity consumption" as an index.
Further, the step S3 specifically includes:
S31, sequence labeling is performed on the text data in the training data to obtain, for each token, the entity type of the segment it belongs to and the token's position within that segment, forming the labeled data.
Text data in the training data is sequence-labeled to mark which tokens are entity names and which are not. In this embodiment the BIO (Begin, Inside, Outside) labeling scheme is adopted: each token in the text data is labeled "B-X", "I-X" or "O", where "B-X" indicates that the segment containing the token is of type X and the token is at the beginning of the segment, "I-X" indicates that the segment is of type X and the token is in the middle of the segment, and "O" indicates that the token belongs to no type; "X" is the name of the entity type to be recognized, such as the time entity "TIM", the region entity "DIS" and the dimension entity "DIM". Taking the region entity as an example, "B-DIS" marks the start of a region entity and "I-DIS" its interior. For example, for the question "6月份广州市用电量是多少?" ("What was the electricity consumption of Guangzhou in June?"), the result of sequence labeling is:
6 -> B-TIM
月 -> I-TIM
份 -> O
广 -> B-DIS
州 -> I-DIS
市 -> O
用 -> B-IDX
电 -> I-IDX
量 -> I-IDX
是 -> O
多 -> O
少 -> O
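To illustrate how such character-level BIO tags are turned back into the entity category information produced by step S3, a small sketch follows; the helper name is hypothetical, and the tag sequence is the example above:

```python
def bio_to_entities(chars, tags):
    """Collect (entity_type, text) pairs from character-level BIO tags."""
    entities, current_type, current_chars = [], None, []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):                       # a new entity fragment begins
            if current_type:
                entities.append((current_type, "".join(current_chars)))
            current_type, current_chars = tag[2:], [ch]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_chars.append(ch)                   # continue the current fragment
        else:                                          # "O" or an inconsistent tag ends the fragment
            if current_type:
                entities.append((current_type, "".join(current_chars)))
            current_type, current_chars = None, []
    if current_type:
        entities.append((current_type, "".join(current_chars)))
    return entities

chars = list("6月份广州市用电量是多少")
tags = ["B-TIM", "I-TIM", "O", "B-DIS", "I-DIS", "O",
        "B-IDX", "I-IDX", "I-IDX", "O", "O", "O"]
print(bio_to_entities(chars, tags))   # [('TIM', '6月'), ('DIS', '广州'), ('IDX', '用电量')]
```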
S32, model training
The goal of named entity recognition (NER) on a natural language question is to extract from the text the segments corresponding to the required entities; from the model's perspective this is a sequence labeling problem: for each element of the input sequence, a particular tag is output. Among machine learning approaches, the conditional random field (CRF, Conditional Random Field) is the dominant model for NER; its objective function considers not only the state features of the input but also label-transition features. The advantage of the CRF is that it can exploit rich internal and contextual feature information when labeling a position. In neural network models, the distributed representation of words maps each token from a sparse one-hot representation to a dense embedding in a low-dimensional space, enriching the word representation; the embedding sequence of the sentence is fed into a recurrent neural network (RNN), the network extracts features automatically, and a Softmax layer predicts the tag of each token, without relying on complex feature engineering. The drawback is that the labeling of each token is made independently and the previously predicted tags cannot be used directly, so the predicted tag sequence may be invalid.
The invention combines the advantages of the two models, joining a neural network model and a conditional random field model into the LSTM-CRF model, which solves the NER problem well, as shown in FIG. 2. LSTM (Long Short-Term Memory) networks are a special type of RNN that can learn long-range dependencies. Unlike an ordinary RNN unit, which has only a single tanh layer, an LSTM has three gate structures (input gate, forget gate and output gate); it selectively forgets part of the historical information, adds part of the current input information, and finally integrates them into the current state and produces the output state. The BiLSTM-CRF model applied to NER mainly consists of an embedding layer, a bidirectional LSTM layer and a final CRF layer, and is currently the most mainstream model among deep-learning-based NER methods.
The sequence-labeled data is used as training data to train the BiLSTM-CRF model; the model obtained after parameter optimization is used for type recognition of newly input natural language questions, yielding entity category information.
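A condensed BiLSTM-CRF sketch in the spirit of FIG. 2; it assumes the third-party pytorch-crf package for the CRF layer, and the vocabulary, tag set and layer sizes are placeholders rather than the parameters of this embodiment:

```python
import torch.nn as nn
from torchcrf import CRF   # third-party pytorch-crf package (assumed available)

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, tagset_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, bidirectional=True, batch_first=True)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)   # per-token emission scores
        self.crf = CRF(tagset_size, batch_first=True)          # learns tag-transition scores

    def loss(self, tokens, tags, mask):
        # tokens, tags: LongTensor [batch, seq_len]; mask: BoolTensor marking real positions
        emissions = self.hidden2tag(self.lstm(self.embedding(tokens))[0])
        return -self.crf(emissions, tags, mask=mask)           # negative log-likelihood

    def decode(self, tokens, mask):
        emissions = self.hidden2tag(self.lstm(self.embedding(tokens))[0])
        return self.crf.decode(emissions, mask=mask)           # best tag sequence per sentence
```

During training, loss() would be minimized over the labeled sequences from step S31; decode() then yields the BIO tag sequence for a newly input question.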
S4, intention recognition
An intention recognition model is constructed to judge the purpose for which the user poses the natural language question for data analysis, such as source data viewing, data filtering, multidimensional analysis, funnel analysis, comparative analysis, trend analysis, report analysis and correlation analysis.
Further, the step S4 specifically includes:
s41, data annotation
The purpose of intention recognition is to judge whether the intention behind the input natural language question is to query data, to perform trend analysis, or to carry out some other analysis; in essence this is a text classification problem, so the intention recognition model trained here is a text classification model. First the training data must be annotated with the intention type of each natural language question. For example, the intention types may include source data viewing, data filtering, multidimensional analysis, funnel analysis, comparative analysis, trend analysis, report analysis and correlation analysis, and can simply be labeled with consecutive integers 0, 1, 2, and so on.
S42, model training
Intention recognition is in essence text classification. After the input text is preprocessed, numerical features of the words are extracted with TF-IDF and a classification model is trained with a support vector machine (SVM, Support Vector Machine); this classification model constitutes the intention recognition model. After training and optimization, the intention recognition model performs intention recognition on the text data corresponding to a newly input natural language question, predicts a probability for each intention type, and selects the intention type with the highest probability as the intention type of the input question.
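A minimal sketch of such an SVM-based intention recognition model using scikit-learn; the toy training questions and the intention ids (following the example numbering above) are made up solely to keep the snippet self-contained:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Segmented questions joined by spaces, with illustrative intention ids:
# 0 = source data viewing, 5 = trend analysis.
train_texts = [
    "查看 用户 明细 数据", "查看 源 数据", "显示 数据 明细",
    "近半年 用电量 趋势", "用电 趋势 如何", "最近 一年 销售 趋势",
]
train_labels = [0, 0, 0, 5, 5, 5]

intent_model = make_pipeline(
    TfidfVectorizer(),                        # numerical TF-IDF features of the segmented text
    SVC(kernel="linear", probability=True),   # SVM classifier with probability estimates
)
intent_model.fit(train_texts, train_labels)

probs = intent_model.predict_proba(["近半年 各 地区 用电量 趋势"])[0]
best = intent_model.classes_[probs.argmax()]  # intention type with the highest probability
print(best, probs)
```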
S5, generating analysis nodes
By combining the results of feature extraction, entity category recognition and intention recognition from steps S2-S4, the data source, analysis dimensions, analysis indexes and analysis task required by the user's question, together with any other additional data analysis information it may contain (such as time information and region information), are obtained. With this information combined, the system can automatically generate analysis nodes.
Further, the step S5 specifically includes:
s51, formulating an analysis node task data interface: a data interface for formulating a standard for each analysis node task, for example, trend analysis node task input data includes: data source name, analysis index, time range and screening condition; the distribution analysis node task input data includes: data source name, analysis index, analysis dimension, screening condition. Similarly, each analysis node task has corresponding input data according to the characteristics of the analysis node task. Wherein part of the input data is mandatory and part is optional. As the filtering conditions are optional among the two tasks described above, other input data is necessary.
S52, generating the data interface information: based on the entity category information obtained by the entity recognition process and combined with the metadata information already in the system, the indexes are matched to obtain the data source information, index information, dimension information, time information and region information; the analysis node task is determined from the analysis intention obtained by the intention recognition process; this information is processed and passed to the corresponding analysis node task, which is invoked to generate and display the analysis result, as shown in FIG. 3. That is, the intention information determines which task nodes of the system are used (the task nodes are all built into the system), and each task node has a corresponding data interface; the data source, dimension and index information in the sentence is obtained by the entity recognition process and matched against the data dictionary in the system to determine the data name, index name and dimension name; the time information and region information undergo regularized normalization and serve as screening conditions on the data. This information is used as the input data of the task node, and the system automatically generates the analysis node.
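A rough sketch of this assembly-and-dispatch step; the task-node registry, data dictionary and field names are hypothetical, and the task nodes are stubbed out with print statements:

```python
# Hypothetical registry mapping recognized intentions to built-in task nodes.
TASK_NODES = {
    "trend_analysis": lambda cfg: print("run trend node with", cfg),
    "distribution_analysis": lambda cfg: print("run distribution node with", cfg),
}

def generate_analysis_node(intent, entities, data_dictionary):
    """Assemble the data-interface information and invoke the matching task node."""
    cfg = {"filters": {}}
    for ent_type, text in entities:
        if ent_type == "IDX":
            cfg["index"] = data_dictionary.get(text, text)      # match index name against the data dictionary
        elif ent_type == "DIM":
            cfg.setdefault("dimensions", []).append(data_dictionary.get(text, text))
        elif ent_type in ("TIM", "DIS"):
            cfg["filters"][ent_type] = text                     # time/region become screening conditions
    cfg["data_source"] = data_dictionary.get("default_source", "unknown")
    TASK_NODES[intent](cfg)                                     # call the analysis node task

generate_analysis_node(
    "trend_analysis",
    [("TIM", "近半年"), ("DIS", "广州市"), ("IDX", "用电量")],
    {"用电量": "electricity_consumption", "default_source": "power_usage_table"},
)
```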
Correspondingly, the invention also provides a question-based analysis node generation system, which comprises:
a preprocessing module, implementing step S1, used for preprocessing the input natural language question and performing word segmentation to obtain the segmented words;
a feature extraction module, implementing step S2, used for performing feature representation and feature extraction on the text data corresponding to the preprocessed input question and converting the text data into numerical form;
an information extraction module, implementing step S3, used for extracting key information from the input question and performing type recognition on the key information to obtain entity category information;
an intention recognition module, implementing step S4, used for constructing an intention recognition model, judging the analysis intention of the input question and completing intention recognition;
an analysis node generation module, implementing step S5, used for combining the processing results of the feature extraction module, the information extraction module and the intention recognition module to obtain the data source, analysis dimensions, analysis indexes, analysis task and other additional data analysis information required by the question, and automatically generating analysis nodes.
Based on the same inventive concept, the invention also proposes a storage medium storing computer instructions which, when executed by a processor, implement steps S1-S5 of the analysis node generation method of the invention.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (5)

1. A question-based analysis node generation method, characterized by comprising the following steps:
S1, preprocessing an input natural language question and performing word segmentation to obtain the segmented words;
S2, performing feature representation and feature extraction on the text data corresponding to the preprocessed input question, converting the text data into numerical form;
S3, extracting key information from the input question and performing type recognition on the key information to obtain entity category information;
S4, constructing an intention recognition model, judging the analysis intention of the input question and completing intention recognition;
S5, combining the results of feature extraction, type recognition and intention recognition from steps S2-S4 to obtain the data source, analysis dimensions, analysis indexes, analysis task and other additional data analysis information required by the question, and automatically generating analysis nodes;
the step S3 comprises the following steps:
S31, performing sequence labeling on the text data in the training data to obtain, for each token, the entity type of the segment it belongs to and its position within that segment, thereby forming the labeled data;
the text data in the training data is sequence-labeled with the BIO labeling scheme to mark which tokens are entity names and which are not: each token in the text data is labeled "B-X", "I-X" or "O", where "B-X" indicates that the segment containing the token is of type X and the token is at the beginning of the segment, "I-X" indicates that the segment is of type X and the token is in the middle of the segment, "O" indicates that the token belongs to no type, and "X" is the name of the entity type to be recognized, the time entity being "TIM", the region entity "DIS" and the dimension entity "DIM"; S32, training a BiLSTM-CRF model on the sequence-labeled data as training data, and using the model obtained after parameter optimization for type recognition of newly input natural language questions;
the step S4 includes:
S41, first annotating the training data, marking the intention type of each natural language question;
S42, training the classification model and constructing the intention recognition model, using the intention recognition model to perform intention recognition on the text data corresponding to the input natural language question, predicting a probability for each intention type, and selecting the intention type with the highest probability as the intention type of the input question;
the step S5 comprises the following steps:
S51, formulating an analysis node task data interface, defining a standard data interface for each analysis node task;
S52, generating data interface information: based on the entity category information, combining metadata information and matching indexes to obtain data source information, index information, dimension information and other additional data analysis information; determining the analysis node task based on the analysis intention; processing the data source information, index information, dimension information and other additional data analysis information, passing them to the corresponding analysis node task, and invoking the analysis node task to generate and display the analysis result.
2. The method of claim 1, wherein the additional data analysis information includes time information and region information.
3. The analysis node generation method according to claim 1, wherein in step S51, the trend analysis node task input data includes a data source name, an analysis index, a time range, and a screening condition; the distribution analysis node task input data comprises a data source name, an analysis index, an analysis dimension and screening conditions.
4. A question-based analysis node generation system, comprising:
a preprocessing module, used for preprocessing the input natural language question and performing word segmentation to obtain the segmented words;
a feature extraction module, used for performing feature representation and feature extraction on the text data corresponding to the preprocessed input question and converting the text data into numerical form;
an information extraction module, used for extracting key information from the input question and performing type recognition on the key information to obtain entity category information;
an intention recognition module, used for constructing an intention recognition model, judging the analysis intention of the input question and completing intention recognition; an analysis node generation module, used for combining the processing results of the feature extraction module, the information extraction module and the intention recognition module to obtain the data source, analysis dimensions, analysis indexes, analysis task and other additional data analysis information required by the question, and automatically generating analysis nodes;
wherein the type recognition of the key information by the information extraction module comprises the following steps:
performing sequence labeling on the text data in the training data to obtain, for each token, the entity type of the segment it belongs to and its position within that segment, thereby forming the labeled data; training a BiLSTM-CRF model on the sequence-labeled data as training data, and using the model obtained after parameter optimization for type recognition of newly input natural language questions; the text data in the training data is sequence-labeled with the BIO labeling scheme to mark which tokens are entity names and which are not: each token in the text data is labeled "B-X", "I-X" or "O", where "B-X" indicates that the segment containing the token is of type X and the token is at the beginning of the segment, "I-X" indicates that the segment is of type X and the token is in the middle of the segment, "O" indicates that the token belongs to no type, and "X" is the name of the entity type to be recognized, the time entity being "TIM", the region entity "DIS" and the dimension entity "DIM";
the intention recognition by the intention recognition module comprises the following steps: first annotating the training data, marking the intention type of each natural language question; training and constructing a classification model as the intention recognition model, using the intention recognition model to perform intention recognition on the text data corresponding to the input natural language question, predicting a probability for each intention type, and selecting the intention type with the highest probability as the intention type of the input question;
the generation of the analysis node by the analysis node generation module comprises the following steps: formulating an analysis node task data interface, defining a standard data interface for each analysis node task; generating data interface information, combining metadata information based on the entity category information and matching indexes to obtain data source information, index information, dimension information and other additional data analysis information; determining the analysis node task based on the analysis intention; processing the data source information, index information, dimension information and other additional data analysis information, passing them to the corresponding analysis node task, and invoking the analysis node task to generate and display the analysis result.
5. A storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the analysis node generation method of any of claims 1-3.
CN202011259004.3A 2020-11-12 2020-11-12 Question type analysis node generation method, system and storage medium Active CN112270189B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011259004.3A CN112270189B (en) 2020-11-12 2020-11-12 Question type analysis node generation method, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011259004.3A CN112270189B (en) 2020-11-12 2020-11-12 Question type analysis node generation method, system and storage medium

Publications (2)

Publication Number Publication Date
CN112270189A CN112270189A (en) 2021-01-26
CN112270189B true CN112270189B (en) 2023-07-18

Family

ID=74339857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011259004.3A Active CN112270189B (en) 2020-11-12 2020-11-12 Question type analysis node generation method, system and storage medium

Country Status (1)

Country Link
CN (1) CN112270189B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114528404A (en) * 2022-02-18 2022-05-24 浪潮卓数大数据产业发展有限公司 Method and device for identifying provincial and urban areas

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050032937A (en) * 2003-10-02 2005-04-08 한국전자통신연구원 Method for automatically creating a question and indexing the question-answer by language-analysis and the question-answering method and system
WO2019153522A1 (en) * 2018-02-09 2019-08-15 卫盈联信息技术(深圳)有限公司 Intelligent interaction method, electronic device, and storage medium
CN111709235A (en) * 2020-05-28 2020-09-25 上海发电设备成套设计研究院有限责任公司 Text data statistical analysis system and method based on natural language processing

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103226606B (en) * 2013-04-28 2016-08-10 浙江核新同花顺网络信息股份有限公司 Inquiry choosing method and system
CN107545349A (en) * 2016-06-28 2018-01-05 国网天津市电力公司 A kind of Data Quality Analysis evaluation model towards electric power big data
CN108108426B (en) * 2017-12-15 2021-05-07 杭州汇数智通科技有限公司 Understanding method and device for natural language question and electronic equipment
CN110309400A (en) * 2018-02-07 2019-10-08 鼎复数据科技(北京)有限公司 A kind of method and system that intelligent Understanding user query are intended to
CN110968663B (en) * 2018-09-30 2023-05-23 北京国双科技有限公司 Answer display method and device of question-answering system
CN110210036A (en) * 2019-06-05 2019-09-06 上海云绅智能科技有限公司 A kind of intension recognizing method and device
CN110413746B (en) * 2019-06-25 2024-02-09 创新先进技术有限公司 Method and device for identifying intention of user problem
CN110334347A (en) * 2019-06-27 2019-10-15 腾讯科技(深圳)有限公司 Information processing method, relevant device and storage medium based on natural language recognition
CN111026941A (en) * 2019-10-28 2020-04-17 江苏普旭软件信息技术有限公司 Intelligent query method for demonstration and evaluation of equipment system
CN111125145A (en) * 2019-11-26 2020-05-08 复旦大学 Automatic system for acquiring database information through natural language

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050032937A (en) * 2003-10-02 2005-04-08 한국전자통신연구원 Method for automatically creating a question and indexing the question-answer by language-analysis and the question-answering method and system
WO2019153522A1 (en) * 2018-02-09 2019-08-15 卫盈联信息技术(深圳)有限公司 Intelligent interaction method, electronic device, and storage medium
CN111709235A (en) * 2020-05-28 2020-09-25 上海发电设备成套设计研究院有限责任公司 Text data statistical analysis system and method based on natural language processing

Also Published As

Publication number Publication date
CN112270189A (en) 2021-01-26

Similar Documents

Publication Publication Date Title
Shelar et al. Named entity recognition approaches and their comparison for custom ner model
Jung Semantic vector learning for natural language understanding
CN113011533A (en) Text classification method and device, computer equipment and storage medium
Yi et al. Topic modeling for short texts via word embedding and document correlation
Tagarelli et al. Unsupervised law article mining based on deep pre-trained language representation models with application to the Italian civil code
CN112270188B (en) Questioning type analysis path recommendation method, system and storage medium
CN111737560B (en) Content search method, field prediction model training method, device and storage medium
CN111344695B (en) Facilitating domain and client specific application program interface recommendations
Ali et al. Named entity recognition using deep learning: A review
CN114997288A (en) Design resource association method
Sahnoun et al. Event detection based on open information extraction and ontology
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN114491079A (en) Knowledge graph construction and query method, device, equipment and medium
CN114239828A (en) Supply chain affair map construction method based on causal relationship
CN112270189B (en) Question type analysis node generation method, system and storage medium
Thielmann et al. Coherence based document clustering
CN116804998A (en) Medical term retrieval method and system based on medical semantic understanding
Zhang et al. Combining the attention network and semantic representation for Chinese verb metaphor identification
CN116975271A (en) Text relevance determining method, device, computer equipment and storage medium
Girija et al. A comparative review on approaches of aspect level sentiment analysis
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division
Chen et al. Multi-modal multi-layered topic classification model for social event analysis
Kuttiyapillai et al. Improved text analysis approach for predicting effects of nutrient on human health using machine learning techniques
CN111061939A (en) Scientific research academic news keyword matching recommendation method based on deep learning
Hao Naive Bayesian Prediction of Japanese Annotated Corpus for Textual Semantic Word Formation Classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant