CN116975738A - Polynomial naive Bayesian classification method for question intent recognition - Google Patents
- Publication number
- CN116975738A (application number CN202310969472.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention discloses a polynomial naive Bayesian classification method for question intent recognition, belonging to the technical field of intelligent question answering. The method comprises the following steps: (1) collecting relevant databases, defining the question types, and acquiring training questions; (2) storing each training question and extracting its feature information; (3) training a polynomial naive Bayes classifier on the feature information; (4) performing intent recognition with the polynomial naive Bayes classifier. The method can effectively improve the accuracy of the naive Bayes classification algorithm in question classification and improve the quality of intent recognition.
Description
Technical Field
The invention relates to the technical field of intelligent question answering, and in particular to a polynomial naive Bayesian classification method for question intent recognition.
Background
Intent recognition applies natural language understanding to a question in order to extract its specific intent. It is a key step between question analysis and intelligent question answering, and it is the basis of task-oriented question answering systems. Intent recognition is widely used in intelligent question answering systems in fields such as education, medicine, business, and management. Because the naive Bayes classification method classifies questions with different keywords well, it is widely applied to intent recognition; it is therefore important to develop a polynomial naive Bayesian classification method for question intent recognition.
Existing polynomial naive Bayes classification methods have low accuracy in question classification and poor intent recognition quality; we therefore propose a polynomial naive Bayesian classification method for question intent recognition.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a polynomial naive Bayesian classification method for question intent recognition.
To achieve the above purpose, the invention adopts the following technical scheme:
A polynomial naive Bayesian classification method for question intent recognition comprises the following steps:
(1) Collecting relevant databases, defining the question types, and acquiring training questions;
(2) Storing each training question and extracting its feature information;
(3) Training a polynomial naive Bayes classifier on the feature information;
(4) Performing intent recognition with the polynomial naive Bayes classifier.
As a further aspect of the invention, the training questions in step (1) are acquired as follows:
Step one: collecting databases containing questions related to a vertical domain, checking whether each piece of training data in each database carries a corresponding question type label, marking the training data that lacks a question type label, and balancing the number of samples in each category by oversampling or undersampling;
Step two: checking whether the database contains duplicate data records and deleting any duplicates found; for training data in which a data field is missing a value, deleting the record containing the missing value, filling the missing value with the mean or median, or handling the missing value by interpolation;
Step three: removing unnecessary special characters, punctuation marks, and HTML tags from each group of training data, normalizing the processed training data into a unified format to obtain the training questions, and consolidating the training questions into a training data set.
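Steps two and three can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `clean_question` and `build_training_set` are hypothetical, records are assumed to be `(question, label)` pairs, and unlabeled records are simply set aside for manual marking.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")        # HTML tags such as <p> or <br/>
NON_WORD_RE = re.compile(r"[^\w\s]")   # punctuation and special characters


def clean_question(text: str) -> str:
    """Strip HTML tags, punctuation and extra whitespace, then lowercase."""
    text = html.unescape(text)
    text = TAG_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    return " ".join(text.split()).lower()


def build_training_set(records):
    """records: iterable of (question, label) pairs; label may be None.

    Returns (dataset, unlabeled): cleaned, de-duplicated labeled questions,
    and the raw questions that still need a question type label.
    """
    seen = set()
    dataset, unlabeled = [], []
    for question, label in records:
        if not question:            # assumption: drop records with no text
            continue
        if label is None:           # assumption: set aside for manual marking
            unlabeled.append(question)
            continue
        cleaned = clean_question(question)
        if not cleaned or cleaned in seen:   # drop empty and duplicate records
            continue
        seen.add(cleaned)
        dataset.append((cleaned, label))
    return dataset, unlabeled
```

Because `\w` matches Unicode word characters in Python 3, the same cleaning also keeps Chinese characters intact while removing punctuation.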
As a further aspect of the invention, the feature information in step (2) is extracted as follows:
Step (1): replacing the entities in each training question of the training data set with the corresponding types in the database, segmenting each training question into words or phrases with the jieba word segmentation library, and performing stemming to reduce the words to their stem forms;
Step (2): initializing a two-dimensional array data, storing the word segment strings of the i-th training question in data[i][1] to data[i][j], and storing the question type label of the i-th training question in data[i][j+1], where j is the number of word segments in the i-th training question;
Step (3): calculating the term frequency-inverse document frequency (TF-IDF) of each word segment data[i][j] of each training question, initializing an array words of size n, sorting the TF-IDF values in descending order, and storing the word segments with the n largest values in words;
Step (4): generating the polynomial keyword feature vector of each training question from the array words, setting a position of the vector to 1 if the question contains the corresponding keyword in words and to 0 otherwise, and storing the polynomial keyword feature vector in data[i][k+2].
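Steps (3) and (4) amount to ranking word segments by score and then encoding each question as a 0/1 vector over the top n. The sketch below assumes the questions are already segmented into token lists and that one TF-IDF score per word segment is available as a dictionary; how per-question scores are merged into one score per word is not specified in the text, so that dictionary is an assumption, as are the function names.

```python
def top_n_keywords(scores, n):
    """Return the n word segments with the largest scores, in descending order.

    scores: dict mapping word segment -> TF-IDF value (assumed precomputed).
    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def keyword_vector(tokens, words):
    """Binary keyword feature vector: 1 if the keyword occurs in the question."""
    present = set(tokens)
    return [1 if w in present else 0 for w in words]


# Toy example with made-up scores.
scores = {"price": 0.9, "what": 0.1, "dosage": 0.7, "is": 0.05}
words = top_n_keywords(scores, 2)                          # ['price', 'dosage']
vector = keyword_vector(["what", "is", "dosage"], words)   # [0, 1]
```

The vector length equals n, so every question is mapped to the same feature space regardless of how many word segments it contains.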
As a further aspect of the invention, the term frequency-inverse document frequency (TF-IDF) in step (3) is calculated as follows:

$$\text{TF-IDF}_{ij} = tf_{ij} \cdot idf_{ij} \tag{3}$$

where $n_{ij}$ denotes the number of times the word segment data[i][j] appears in the i-th training question, and $D_i$ denotes the set of word segments of the i-th training question.
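As an illustration of formula (3), the following sketch computes a TF-IDF value for every word segment of every question. The definitions of the two factors are assumptions, since only their product is legible here: the sketch uses the common textbook forms, term frequency as the count of the segment divided by the number of segments in the question, and inverse document frequency as the logarithm of the number of questions divided by the number of questions containing the segment.

```python
import math
from collections import Counter


def tf_idf(questions):
    """questions: list of token lists (one list per segmented question).

    Returns one {token: tf * idf} dict per question, using the assumed
    textbook definitions tf = n_ij / len(question) and
    idf = log(N / number of questions containing the token).
    """
    n_docs = len(questions)
    doc_freq = Counter()
    for tokens in questions:
        doc_freq.update(set(tokens))      # count each token once per question

    result = []
    for tokens in questions:
        counts = Counter(tokens)          # n_ij for every token of question i
        total = len(tokens)
        result.append({
            tok: (n / total) * math.log(n_docs / doc_freq[tok])
            for tok, n in counts.items()
        })
    return result


tfidf_scores = tf_idf([["what", "is", "dosage"], ["what", "is", "price"]])
```

With these definitions, a segment that occurs in every question (such as "what" above) gets a score of zero, which is why high-scoring segments make useful keywords.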
As a further aspect of the invention, the polynomial naive Bayes classifier in step (3) is trained as follows:
Step I: inputting the polynomial keyword feature vectors and the corresponding question type labels into the polynomial naive Bayes classifier as training key-value pairs, the classifier dividing the received questions of each type into a training set and a test set;
Step II: calculating the prior probability of each question type in the training set and the conditional probability of each word under each type, then evaluating the trained polynomial naive Bayes classifier with the feature vectors of the test set and the corresponding question type labels, and calculating the classification accuracy, recall, and F1 score as performance indexes;
Step III: once every performance index meets its preset threshold, applying the classifier to new, unseen user questions; if any performance index fails to meet its preset threshold, adjusting the TF-IDF parameters by Laplace smoothing to optimize the performance of the classifier.
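The probability estimates of Step II can be sketched as below. Two assumptions are made: the "polynomial naive Bayes" of the text is taken to be the model usually called multinomial naive Bayes in English-language references, and Laplace (add-alpha) smoothing is applied to the conditional probabilities, which is the conventional place for it. The function name and the `alpha` parameter are illustrative.

```python
import math
from collections import defaultdict


def train_multinomial_nb(vectors, labels, alpha=1.0):
    """Estimate log priors and smoothed log conditional probabilities.

    vectors: list of equal-length keyword feature vectors (counts or 0/1).
    labels:  list of question type labels, one per vector.
    alpha:   Laplace smoothing constant (assumed hyperparameter).
    """
    n_features = len(vectors[0])
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0] * n_features)

    for vec, label in zip(vectors, labels):
        class_counts[label] += 1
        row = feature_counts[label]
        for k, value in enumerate(vec):
            row[k] += value

    total = len(labels)
    # Prior: share of training questions that belong to each type.
    log_prior = {c: math.log(cnt / total) for c, cnt in class_counts.items()}

    # Conditional: smoothed share of each keyword within a type.
    log_likelihood = {}
    for c, row in feature_counts.items():
        denom = sum(row) + alpha * n_features
        log_likelihood[c] = [math.log((cnt + alpha) / denom) for cnt in row]
    return log_prior, log_likelihood


nb_prior, nb_likelihood = train_multinomial_nb(
    [[1, 0], [1, 0], [0, 1]], ["price", "price", "dosage"]
)
```

Smoothing keeps a keyword that never occurs in some question type from receiving probability zero, which would otherwise veto that type for any question containing the keyword.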
As a further aspect of the invention, intent recognition by the polynomial naive Bayes classifier in step (4) proceeds as follows:
First step: converting the question to be classified into a numerical feature vector by the TF-IDF method, initializing an array words of size n, sorting the TF-IDF values in descending order, and storing the word segments with the n largest values in words;
Second step: calculating the probability of each question type label from the prior probabilities, calculating the posterior probability of the feature vector of the question to be classified under each question type label, and taking the question type label with the maximum posterior probability as the predicted label of the question and outputting it.
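The second step is a maximum a posteriori decision. The sketch below assumes the model is stored as log priors and log conditional probabilities per question type (the dictionary layout and the toy numbers are illustrative); working in log space avoids numerical underflow when many keywords are multiplied together.

```python
import math


def predict(vector, log_prior, log_likelihood):
    """Return the question type label with the largest (unnormalized) log posterior."""
    best_label, best_score = None, float("-inf")
    for label, prior in log_prior.items():
        # log P(c) + sum_k x_k * log P(w_k | c)
        score = prior + sum(
            x * ll for x, ll in zip(vector, log_likelihood[label])
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy model with two question types and two keywords.
toy_prior = {"price": math.log(2 / 3), "dosage": math.log(1 / 3)}
toy_likelihood = {
    "price": [math.log(0.75), math.log(0.25)],
    "dosage": [math.log(1 / 3), math.log(2 / 3)],
}
label_a = predict([1, 0], toy_prior, toy_likelihood)   # 'price'
label_b = predict([0, 1], toy_prior, toy_likelihood)   # 'dosage'
```

The normalizing constant of Bayes' rule is the same for every label, so it can be dropped without changing which label attains the maximum.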
Compared with the prior art, the invention has the beneficial effects that:
The polynomial naive Bayes classification method for question intent recognition replaces the entities in each training question of the training data set with the corresponding types in the database, segments each training question into words or phrases with the jieba word segmentation library, and performs stemming to reduce the words to their stem forms. It initializes a two-dimensional array data and stores the word segment strings and the question type label of the i-th training question in it, calculates the TF-IDF of each word segment of each training question, initializes an array words of size n to hold the word segments selected by TF-IDF, and generates the polynomial keyword feature vector of each training question from the array words. The polynomial keyword feature vectors and the corresponding question type labels are input as training key-value pairs into the polynomial naive Bayes classifier for training. The trained classifier then analyzes user questions and outputs the predicted question labels, which effectively improves the accuracy of the naive Bayes classification algorithm in question classification and the quality of intent recognition.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention.
Fig. 1 is a flow chart of the polynomial naive Bayes classification method for question intent recognition.
Detailed Description
Referring to Fig. 1, a polynomial naive Bayes classification method for question intent recognition comprises the following steps:
Relevant databases are collected, the question types are defined, and training questions are acquired.
Specifically, databases containing questions related to a vertical domain are collected, and each piece of training data in each database is checked for a corresponding question type label. Training data that lacks a question type label is marked, and the number of samples in each category is balanced by oversampling or undersampling. The database is then checked for duplicate data records, and any duplicates found are deleted. For training data in which a data field is missing a value, the record containing the missing value is deleted, the missing value is filled with the mean or median, or the missing value is handled by interpolation. Unnecessary special characters, punctuation marks, and HTML tags are removed from each group of training data, the processed training data is normalized into a unified format to obtain the training questions, and the training questions are consolidated into a training data set.
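Class balancing by oversampling can be sketched as follows. This is one simple variant (random oversampling with replacement up to the size of the largest class), chosen as an assumption because the text does not say which oversampling scheme is used; the function name and the `seed` parameter are illustrative.

```python
import random
from collections import defaultdict


def oversample(dataset, seed=0):
    """Randomly duplicate minority-class samples until all classes are equal in size.

    dataset: list of (question, label) pairs.
    """
    if not dataset:
        return []
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for question, label in dataset:
        by_label[label].append((question, label))

    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced
```

Exact copies created this way would be removed again by a later duplicate check, so in practice it is usually safer to balance the classes after deduplication.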
Each training question is stored and its feature information is extracted.
Specifically, the entities in each training question of the training data set are replaced with the corresponding types in the database, each training question is segmented into words or phrases with the jieba word segmentation library, and stemming is performed to reduce the words to their stem forms. A two-dimensional array data is initialized; the word segment strings of the i-th training question are stored in data[i][1] to data[i][j], and data[i][j+1] stores the question type label of the i-th training question, where j is the number of word segments in the i-th training question. The term frequency-inverse document frequency (TF-IDF) of each word segment data[i][j] of each training question is calculated, an array words of size n is initialized, the TF-IDF values are sorted in descending order, and the word segments with the n largest values are stored in words. The polynomial keyword feature vector of each training question is generated from the array words: a position of the vector is set to 1 if the question contains the corresponding keyword in words and to 0 otherwise, and the polynomial keyword feature vector is stored in data[i][k+2].
It should be further noted that the term frequency-inverse document frequency (TF-IDF) is calculated as follows:

$$\text{TF-IDF}_{ij} = tf_{ij} \cdot idf_{ij} \tag{3}$$

where $n_{ij}$ denotes the number of times the word segment data[i][j] appears in the i-th training question, and $D_i$ denotes the set of word segments of the i-th training question.
A polynomial naive Bayes classifier is trained on the feature information.
Specifically, the polynomial keyword feature vectors and the corresponding question type labels are input into the polynomial naive Bayes classifier as training key-value pairs, and the classifier divides the received questions of each type into a training set and a test set. The prior probability of each question type in the training set and the conditional probability of each word under each type are calculated. The trained polynomial naive Bayes classifier is then evaluated with the feature vectors of the test set and the corresponding question type labels, and the classification accuracy, recall, and F1 score are calculated as performance indexes. Once every performance index meets its preset threshold, the classifier is applied to new, unseen user questions; if any performance index fails to meet its preset threshold, the TF-IDF parameters are adjusted by Laplace smoothing to optimize the performance of the classifier.
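The performance indexes named above can be computed from the true and predicted labels of the test set. The sketch below is one possible reading: accuracy is the share of correct predictions, and recall and F1 are averaged over the question types (macro averaging), which is an assumption because the text does not state how per-type values are combined.

```python
def evaluate(y_true, y_pred):
    """Return accuracy plus macro-averaged recall and F1 over all labels."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    recalls, f1s = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)   # true positives
        fp = sum(t != c and p == c for t, p in pairs)   # false positives
        fn = sum(t == c and p != c for t, p in pairs)   # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        f1s.append(f1)

    return {
        "accuracy": accuracy,
        "recall": sum(recalls) / len(labels),
        "f1": sum(f1s) / len(labels),
    }


metrics = evaluate(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Each value in the returned dictionary can then be compared against its preset threshold to decide whether the classifier is ready for new user questions.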
Intent recognition is performed with the polynomial naive Bayes classifier.
Specifically, the question to be classified is converted into a numerical feature vector by the TF-IDF method: an array words of size n is initialized, the TF-IDF values are sorted in descending order, and the word segments with the n largest values are stored in words. The probability of each question type label is calculated from the prior probabilities, the posterior probability of the feature vector of the question to be classified is calculated under each question type label, and the question type label with the maximum posterior probability is taken as the predicted label of the question.
Claims (6)
1. A polynomial naive Bayesian classification method for question intent recognition, characterized by comprising the following steps:
(1) Collecting relevant databases, defining the question types, and acquiring training questions;
(2) Storing each training question and extracting its feature information;
(3) Training a polynomial naive Bayes classifier on the feature information;
(4) Performing intent recognition with the polynomial naive Bayes classifier.
2. The polynomial naive Bayes classification method for question intent recognition according to claim 1, wherein the training questions in step (1) are acquired as follows:
Step one: collecting databases containing questions related to a vertical domain, checking whether each piece of training data in each database carries a corresponding question type label, marking the training data that lacks a question type label, and balancing the number of samples in each category by oversampling or undersampling;
Step two: checking whether the database contains duplicate data records and deleting any duplicates found; for training data in which a data field is missing a value, deleting the record containing the missing value, filling the missing value with the mean or median, or handling the missing value by interpolation;
Step three: removing unnecessary special characters, punctuation marks, and HTML tags from each group of training data, normalizing the processed training data into a unified format to obtain the training questions, and consolidating the training questions into a training data set.
3. The polynomial naive Bayes classification method for question intent recognition according to claim 2, wherein the feature information in step (2) is extracted as follows:
Step (1): replacing the entities in each training question of the training data set with the corresponding types in the database, segmenting each training question into words or phrases with the jieba word segmentation library, and performing stemming to reduce the words to their stem forms;
Step (2): initializing a two-dimensional array data, storing the word segment strings of the i-th training question in data[i][1] to data[i][j], and storing the question type label of the i-th training question in data[i][j+1], where j is the number of word segments in the i-th training question;
Step (3): calculating the term frequency-inverse document frequency (TF-IDF) of each word segment data[i][j] of each training question, initializing an array words of size n, sorting the TF-IDF values in descending order, and storing the word segments with the n largest values in words;
Step (4): generating the polynomial keyword feature vector of each training question from the array words, setting a position of the vector to 1 if the question contains the corresponding keyword in words and to 0 otherwise, and storing the polynomial keyword feature vector in data[i][k+2].
4. The polynomial naive Bayes classification method for question intent recognition according to claim 3, wherein the term frequency-inverse document frequency (TF-IDF) in step (3) is calculated as follows:

$$\text{TF-IDF}_{ij} = tf_{ij} \cdot idf_{ij} \tag{3}$$

where $n_{ij}$ denotes the number of times the word segment data[i][j] appears in the i-th training question, and $D_i$ denotes the set of word segments of the i-th training question.
5. The polynomial naive Bayes classification method for question intent recognition according to claim 3, wherein the polynomial naive Bayes classifier in step (3) is trained as follows:
Step I: inputting the polynomial keyword feature vectors and the corresponding question type labels into the polynomial naive Bayes classifier as training key-value pairs, the classifier dividing the received questions of each type into a training set and a test set;
Step II: calculating the prior probability of each question type in the training set and the conditional probability of each word under each type, then evaluating the trained polynomial naive Bayes classifier with the feature vectors of the test set and the corresponding question type labels, and calculating the classification accuracy, recall, and F1 score as performance indexes;
Step III: once every performance index meets its preset threshold, applying the classifier to new, unseen user questions; if any performance index fails to meet its preset threshold, adjusting the TF-IDF parameters by Laplace smoothing to optimize the performance of the classifier.
6. The polynomial naive Bayes classification method for question intent recognition according to claim 5, wherein intent recognition by the polynomial naive Bayes classifier in step (4) proceeds as follows:
First step: converting the question to be classified into a numerical feature vector by the TF-IDF method, initializing an array words of size n, sorting the TF-IDF values in descending order, and storing the word segments with the n largest values in words;
Second step: calculating the probability of each question type label from the prior probabilities, calculating the posterior probability of the feature vector of the question to be classified under each question type label, and taking the question type label with the maximum posterior probability as the predicted label of the question and outputting it.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310969472.7A CN116975738A (en) | 2023-08-03 | 2023-08-03 | Polynomial naive Bayesian classification method for question intent recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310969472.7A CN116975738A (en) | 2023-08-03 | 2023-08-03 | Polynomial naive Bayesian classification method for question intent recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116975738A true CN116975738A (en) | 2023-10-31 |
Family
ID=88472891
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310969472.7A Pending CN116975738A (en) | 2023-08-03 | 2023-08-03 | Polynomial naive Bayesian classification method for question intent recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116975738A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117473511A (en) * | 2023-12-27 | 2024-01-30 | 中国联合网络通信集团有限公司 | Edge node vulnerability data processing method, device, equipment and storage medium |
CN117473511B (en) * | 2023-12-27 | 2024-04-02 | 中国联合网络通信集团有限公司 | Edge node vulnerability data processing method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111783394A (en) | Training method of event extraction model, event extraction method, system and equipment | |
CN109492230B (en) | Method for extracting insurance contract key information based on interested text field convolutional neural network | |
US20200334410A1 (en) | Encoding textual information for text analysis | |
CN111324742A (en) | Construction method of digital human knowledge map | |
CN110910991B (en) | Medical automatic image processing system | |
CN116975738A (en) | Polynomial naive Bayesian classification method for question intent recognition | |
CN112181490B (en) | Method, device, equipment and medium for identifying function category in function point evaluation method | |
CN114416979A (en) | Text query method, text query equipment and storage medium | |
CN112632993A (en) | Electric power measurement entity recognition model classification method based on convolution attention network | |
CN114090736A (en) | Enterprise industry identification system and method based on text similarity | |
CN115858785A (en) | Sensitive data identification method and system based on big data | |
CN113836896A (en) | Patent text abstract generation method and device based on deep learning | |
CN115146062A (en) | Intelligent event analysis method and system fusing expert recommendation and text clustering | |
CN115953123A (en) | Method, device and equipment for generating robot automation flow and storage medium | |
Jui et al. | A machine learning-based segmentation approach for measuring similarity between sign languages | |
CN112732863B (en) | Standardized segmentation method for electronic medical records | |
CN116629258B (en) | Structured analysis method and system for judicial document based on complex information item data | |
CN113536015A (en) | Cross-modal retrieval method based on depth identification migration | |
CN117609583A (en) | Customs import and export commodity classification method based on image text combination retrieval | |
CN103034657B (en) | Documentation summary generates method and apparatus | |
CN111767402B (en) | Limited domain event detection method based on counterstudy | |
CN111460160B (en) | Event clustering method of stream text data based on reinforcement learning | |
CN114756617A (en) | Method, system, equipment and storage medium for extracting structured data of engineering archives | |
CN111191448A (en) | Word processing method, device, storage medium and processor | |
CN113821571A (en) | Food safety relation extraction method based on BERT and improved PCNN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
Effective date of registration: 20231206 Address after: Room 303, 3rd Floor, Unit 1, Building 16, Luojia Yayuan (Phase I), No. 369 Shucheng Road, Hongshan District, Wuhan City, Hubei Province, 430070 Applicant after: Wuhan Tuoyun Technology Co.,Ltd. Address before: Room 403, Building B6, Innovation Port, No. 15 Jinyang Road, Huaqiao Town, Kunshan City, Suzhou City, Jiangsu Province, 215300 Applicant before: Suzhou Jiayang Technology Co.,Ltd. |