CN116975738A

CN116975738A - Polynomial naive Bayesian classification method for question intent recognition

Info

Publication number: CN116975738A
Application number: CN202310969472.7A
Authority: CN
Inventors: 王锐; 陈政宇
Original assignee: Suzhou Jiayang Technology Co ltd
Current assignee: Wuhan Tuoyun Technology Co ltd
Priority date: 2023-08-03
Filing date: 2023-08-03
Publication date: 2023-10-31

Abstract

The invention discloses a polynomial naive Bayes classification method for question intent recognition, belonging to the technical field of intelligent question answering. The method comprises the following steps: (1) collecting the relevant databases, defining the question types, and acquiring training questions; (2) storing each training question and extracting its feature information; (3) training a polynomial naive Bayes classifier on the feature information; (4) performing intent recognition with the polynomial naive Bayes classifier. The method can improve the accuracy of the naive Bayes classification algorithm when classifying questions and thereby improve the quality of intent recognition.

Description

Polynomial naive Bayesian classification method for question intent recognition

Technical Field

The invention relates to the technical field of intelligent question answering, and in particular to a polynomial naive Bayes classification method for question intent recognition.

Background

Intent recognition applies natural language understanding to a question in order to extract its specific intent. It is a key step between question analysis and intelligent question answering, and it is the basis of task-oriented question answering systems. Intent recognition is widely used in intelligent question answering systems in fields such as education, medicine, business and management. Because the naive Bayes classification method classifies questions with different keywords well, it is widely applied to intent recognition; it is therefore important to develop a polynomial naive Bayes classification method for question intent recognition.

Existing polynomial naive Bayes classification methods have low accuracy when classifying questions, and the resulting intent recognition quality is poor; therefore, a polynomial naive Bayes classification method for question intent recognition is proposed.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art and provides a polynomial naive Bayes classification method for question intent recognition.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

A polynomial naive Bayes classification method for question intent recognition comprises the following steps:

(1) Collecting the relevant databases, defining the question types, and acquiring training questions;

(2) Storing each training question and extracting its feature information;

(3) Training a polynomial naive Bayes classifier on the feature information;

(4) Performing intent recognition with the polynomial naive Bayes classifier.

As a further aspect of the present invention, the training questions in step (1) are acquired as follows:

Step one: collecting databases containing questions from a vertical domain, checking whether each training record in each database carries a question type label, labeling the records that lack one, and balancing the number of samples per category by oversampling or undersampling;

Step two: checking the database for duplicate records and deleting any that are found, then handling training records in which a data field has a missing value by deleting the record, filling the missing value with the mean or median, or interpolating it;

Step three: removing unnecessary special characters, punctuation marks and HTML tags from each training record, normalizing the processed records into a unified format to obtain the training questions, and consolidating the training questions into a training data set.
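The cleaning in steps two and three can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function names `clean_question` and `build_training_set` are hypothetical, records are assumed to be `(question, label)` pairs, lower-casing is one possible reading of "unified format", and records with a missing value are simply dropped (the text also allows filling or interpolation, which apply to numeric fields).

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")       # HTML tags such as <p> or <br/>
NON_WORD_RE = re.compile(r"[^\w\s]")  # punctuation and special characters
SPACE_RE = re.compile(r"\s+")


def clean_question(text: str) -> str:
    """Strip HTML tags, punctuation and special characters, then normalize spacing and case."""
    text = html.unescape(text)         # "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)  # \w keeps letters, digits and CJK characters
    text = SPACE_RE.sub(" ", text).strip()
    return text.lower()


def build_training_set(records):
    """records: iterable of (question, label) pairs; either value may be missing.

    Drops records with a missing question or label and removes duplicates.
    """
    seen = set()
    cleaned = []
    for question, label in records:
        if not question or not label:
            continue                   # assumption: delete records with missing values
        item = (clean_question(question), label)
        if item[0] and item not in seen:
            seen.add(item)
            cleaned.append(item)
    return cleaned
```

Duplicates are detected after cleaning, so two records that differ only in markup or punctuation count as the same question.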

As a further aspect of the present invention, the feature information in step (2) is extracted as follows:

Step (1): replacing the entities in each training question in the training data set with their corresponding types in the database, segmenting each training question into words or phrases with the jieba word segmentation library, and performing stemming to reduce each word to its stem;

Step (2): initializing a two-dimensional array data, storing the word segment strings of the i-th training question in data[i][1] to data[i][j], and storing the question type label of the i-th training question in data[i][j+1], where j is the number of word segments in the i-th training question;

Step (3): calculating the term frequency-inverse document frequency (TF-IDF) of each word segment data[i][j] of each training question, initializing an array words of size n, sorting the word segments by TF-IDF in descending order, and storing the n word segments with the highest TF-IDF in words;

Step (4): generating a polynomial keyword feature vector for each training question from the array words, setting a position of the vector to 1 if the question contains the corresponding keyword in words and to 0 otherwise, and storing the polynomial keyword feature vector in data[i][k+2].
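Steps (3) and (4) can be illustrated with the following Python sketch, which works on questions that are already segmented. Several details are assumptions: the text does not say how per-question TF-IDF values are merged into a single ranking, so the sketch ranks each word by its highest TF-IDF over all questions; the tf and idf forms are the common textbook ones; and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(questions):
    """questions: list of token lists. Returns one {token: tf-idf} dict per question."""
    n_docs = len(questions)
    doc_freq = Counter(tok for q in questions for tok in set(q))
    scores = []
    for q in questions:
        counts = Counter(q)
        scores.append({
            tok: (c / len(q)) * math.log(n_docs / doc_freq[tok])
            for tok, c in counts.items()
        })
    return scores


def top_keywords(questions, n):
    """Pick the n tokens with the highest TF-IDF (assumption: max over questions)."""
    best = {}
    for per_question in tf_idf_scores(questions):
        for tok, score in per_question.items():
            best[tok] = max(best.get(tok, 0.0), score)
    # Sort by score descending, then alphabetically so that ties are deterministic.
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return ranked[:n]


def keyword_vector(tokens, words):
    """1 where the question contains the keyword, 0 otherwise."""
    present = set(tokens)
    return [1 if w in present else 0 for w in words]


questions = [
    ["price", "of", "phone"],
    ["color", "of", "phone"],
    ["price", "of", "laptop"],
]
words = top_keywords(questions, 3)
vector = keyword_vector(questions[1], words)
```

A word that occurs in every question, such as "of" here, gets an idf of zero and therefore never displaces a more distinctive keyword.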

As a further aspect of the present invention, the term frequency-inverse document frequency in step (3) is calculated as follows:

$$\text{TF-IDF}_{ij} = tf_{ij} \cdot idf_{ij} \qquad (3)$$

where $n_{ij}$ denotes the number of times the word segment data[i][j] appears in the i-th training question, and $D_i$ denotes the set of word segments of the i-th training question.
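For orientation, the two factors of formula (3) are commonly defined as below. This is the usual textbook form, written with the symbols $n_{ij}$ and $D_i$ defined above; it is an assumption and may differ from the definitions used in the patent. $N$ (the number of training questions) and $w_{ij}$ (the word segment stored in data[i][j]) are symbols introduced here for illustration.

```latex
tf_{ij} = \frac{n_{ij}}{\sum_{k} n_{ik}}, \qquad
idf_{ij} = \log \frac{N}{\left|\{\, m : w_{ij} \in D_m \,\}\right|}, \qquad
\text{TF-IDF}_{ij} = tf_{ij} \cdot idf_{ij}
```

Under these definitions a word segment scores highly when it is frequent within its own question but occurs in few other questions.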

As a further aspect of the present invention, the polynomial naive Bayes classifier in step (3) is trained as follows:

Step I: inputting the polynomial keyword feature vectors and their question type labels into the polynomial naive Bayes classifier as training key-value pairs, the classifier dividing the received questions of each type into a training set and a test set;

Step II: calculating the prior probability of each question type in the training set and the conditional probability of each word given each type, then evaluating the trained polynomial naive Bayes classifier on the feature vectors and question type labels of the test set by calculating the classification accuracy, recall and F1 score;

Step III: if every performance indicator meets its preset threshold, applying the classifier to new, unseen user questions; otherwise, adjusting the TF-IDF parameters through Laplace smoothing to optimize the performance of the classifier.
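The probability estimates of Step II can be sketched as follows. The model is what English-language literature usually calls multinomial naive Bayes. The sketch applies Laplace smoothing to the conditional probabilities, which is the conventional place for it and an assumption about the intended use; the function name `train_nb` and the parameter `alpha` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(vectors, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log conditional probabilities.

    vectors: list of equal-length keyword count (or 0/1) vectors.
    labels:  question type label for each vector.
    alpha:   smoothing constant (alpha=1 is classic Laplace smoothing).
    """
    n_features = len(vectors[0])
    class_counts = Counter(labels)
    feature_counts = defaultdict(lambda: [0] * n_features)
    for vec, label in zip(vectors, labels):
        row = feature_counts[label]
        for k, value in enumerate(vec):
            row[k] += value

    # Prior: share of training questions that carry each label.
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    # Conditional: smoothed share of keyword k among all keyword hits in class c.
    log_cond = {}
    for c in class_counts:
        total = sum(feature_counts[c]) + alpha * n_features
        log_cond[c] = [math.log((cnt + alpha) / total) for cnt in feature_counts[c]]
    return log_prior, log_cond


log_prior, log_cond = train_nb([[1, 0], [1, 0], [0, 1]], ["a", "a", "b"])
```

Smoothing keeps every conditional probability above zero, so a keyword that never occurred with a question type in training cannot by itself rule that type out.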

As a further aspect of the present invention, the polynomial naive Bayes classifier in step (4) performs intent recognition as follows:

The first step: converting the question to be classified into a numerical feature vector by the term frequency-inverse document frequency method, that is, initializing an array words of size n, sorting the word segments by TF-IDF in descending order, and storing the n word segments with the highest TF-IDF in words;

The second step: calculating the probability of each question type label from the prior probabilities, calculating the posterior probability of each question type label given the feature vector of the question to be classified, and outputting the question type label with the largest posterior probability as the predicted label of the question.
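The second step amounts to an argmax over posterior scores. The sketch below works in log space to avoid underflow, which is a common implementation choice and not something the text specifies. The toy model, its labels and its probabilities are invented purely for illustration.

```python
import math


def predict(vector, log_prior, log_cond):
    """Return the label with the largest log posterior (up to a constant) and all scores."""
    scores = {}
    for label, prior in log_prior.items():
        scores[label] = prior + sum(
            x * lp for x, lp in zip(vector, log_cond[label])
        )
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy model with two question types and three keywords.
log_prior = {"price": math.log(0.5), "color": math.log(0.5)}
log_cond = {
    "price": [math.log(0.6), math.log(0.2), math.log(0.2)],
    "color": [math.log(0.2), math.log(0.6), math.log(0.2)],
}
label, _ = predict([1, 0, 1], log_prior, log_cond)
```

The evidence term P(x) is the same for every label, so it can be dropped without changing which label wins.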

Compared with the prior art, the invention has the following beneficial effects:

The polynomial naive Bayes classification method for question intent recognition replaces the entities in each training question of the training data set with their corresponding types in the database, segments each training question into words or phrases with the jieba word segmentation library, and performs stemming to reduce each word to its stem. It then initializes a two-dimensional array data and stores the word segment strings and the question type label of the i-th training question in it, calculates the TF-IDF of every word segment of every training question, initializes an array words of size n to hold the top-ranked word segments, and generates a polynomial keyword feature vector for each training question from words. The feature vectors and their question type labels are input into the polynomial naive Bayes classifier as training key-value pairs, and the trained classifier analyzes user questions and outputs the predicted question type labels. This improves the accuracy of the naive Bayes classification algorithm when classifying questions and improves the quality of intent recognition.

Drawings

The accompanying drawings provide a further understanding of the invention and form a part of this specification; together with the embodiments of the invention, they serve to explain the invention.

Fig. 1 is a flow chart of a polynomial naive Bayes classification method for question intent recognition.

Detailed Description

Referring to Fig. 1, a polynomial naive Bayes classification method for question intent recognition includes the following steps:

Collecting the relevant databases, defining the question types, and acquiring training questions.

Specifically, databases containing questions from a vertical domain are collected, and each training record in each database is checked for a question type label; records that lack a label are labeled, and the number of samples per category is balanced by oversampling or undersampling. The database is then checked for duplicate records, and any duplicates are deleted. Training records in which a data field has a missing value are handled by deleting the record, filling the missing value with the mean or median, or interpolating it. Unnecessary special characters, punctuation marks and HTML tags are removed from each training record, the processed records are normalized into a unified format to obtain the training questions, and the training questions are consolidated into a training data set.
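Balancing by oversampling can be as simple as repeating randomly chosen samples of the smaller categories. The following sketch shows random oversampling only; the function name `oversample` and the fixed seed are illustrative assumptions, and undersampling would trim the larger categories instead.

```python
import random
from collections import defaultdict


def oversample(samples, seed=0):
    """samples: list of (question, label) pairs.

    Repeats randomly chosen minority-class samples until every class has as
    many samples as the largest class.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for question, label in samples:
        by_label[label].append((question, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced


samples = [("a", "x"), ("b", "x"), ("c", "x"), ("d", "y")]
balanced = oversample(samples)
```

Because oversampling deliberately creates repeated records, it has to run after duplicate removal; otherwise the deduplication pass would undo it.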

Storing each training question and extracting its feature information.

Specifically, the entities in each training question in the training data set are replaced with their corresponding types in the database, each training question is segmented into words or phrases with the jieba word segmentation library, and stemming is performed to reduce each word to its stem. A two-dimensional array data is initialized; the word segment strings of the i-th training question are stored in data[i][1] to data[i][j], and data[i][j+1] stores the question type label of the i-th training question, where j is the number of word segments in the i-th training question. The TF-IDF of each word segment data[i][j] of each training question is calculated, an array words of size n is initialized, the word segments are sorted by TF-IDF in descending order, and the n word segments with the highest TF-IDF are stored in words. A polynomial keyword feature vector is generated for each training question from the array words: a position of the vector is set to 1 if the question contains the corresponding keyword in words and to 0 otherwise, and the vector is stored in data[i][k+2].
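Entity replacement and the row layout of the array data can be pictured with a small sketch. The entity dictionary, the type names and the label are hypothetical, and `str.split()` stands in for a real segmenter such as `jieba.lcut` so that the example runs without third-party packages. Python lists are zero-indexed, so the segments occupy positions 0 to j-1 and the label position j, whereas the text counts from 1.

```python
# Hypothetical entity dictionary: surface form -> type name stored in the database.
ENTITY_TYPES = {"iphone 15": "PRODUCT", "beijing": "CITY"}


def replace_entities(question, entity_types=ENTITY_TYPES):
    """Replace known entity mentions with their type, longest mention first."""
    for mention in sorted(entity_types, key=len, reverse=True):
        question = question.replace(mention, entity_types[mention])
    return question


def make_row(tokens, label):
    """One row of the two-dimensional array: the word segments, then the label."""
    return list(tokens) + [label]


question = replace_entities("how much is the iphone 15 in beijing")
# Assumption: whitespace splitting replaces jieba segmentation in this toy example.
row = make_row(question.split(), "price_query")
```

Replacing entities with their types lets questions about different products or cities share the same keywords, which is what allows a keyword-based classifier to generalize across entities.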

It should be further noted that, the specific calculation formula of the word frequency-inverse document frequency is as follows:

$$\text{TF-IDF}_{ij} = tf_{ij} \cdot idf_{ij} \qquad (3)$$

Training a polynomial naive Bayes classifier on the feature information.

Specifically, the polynomial keyword feature vectors and their question type labels are input into the polynomial naive Bayes classifier as training key-value pairs, and the classifier divides the received questions of each type into a training set and a test set. The prior probability of each question type in the training set and the conditional probability of each word given each type are calculated. The trained classifier is then evaluated on the feature vectors and question type labels of the test set by calculating the classification accuracy, recall and F1 score. If every performance indicator meets its preset threshold, the classifier is applied to new, unseen user questions; otherwise, the TF-IDF parameters are adjusted through Laplace smoothing to optimize the performance of the classifier.
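The evaluation on the test set can be sketched as below. The metrics follow their standard definitions; the acceptance rule `meets_thresholds` and its default threshold values are hypothetical, since the text does not state the preset thresholds.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    labels = sorted(set(y_true) | set(y_pred))
    metrics = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def accuracy(y_true, y_pred):
    """Share of test questions whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def meets_thresholds(y_true, y_pred, min_accuracy=0.9, min_f1=0.85):
    """Hypothetical acceptance rule: accuracy and every per-class F1 reach a threshold."""
    per_class = per_class_metrics(y_true, y_pred)
    return accuracy(y_true, y_pred) >= min_accuracy and all(
        f1 >= min_f1 for _, _, f1 in per_class.values()
    )


y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
metrics = per_class_metrics(y_true, y_pred)
```

Checking F1 per question type rather than only overall accuracy helps reveal a classifier that performs well on frequent types but poorly on rare ones.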

Performing intent recognition with the polynomial naive Bayes classifier.

Specifically, the question to be classified is converted into a numerical feature vector by the term frequency-inverse document frequency method: an array words of size n is initialized, the word segments are sorted by TF-IDF in descending order, and the n word segments with the highest TF-IDF are stored in words. The probability of each question type label is calculated from the prior probabilities, the posterior probability of each question type label given the feature vector of the question to be classified is calculated, and the question type label with the largest posterior probability is taken as the predicted label of the question.
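Written as a formula, the decision rule described above takes the following standard form. The symbols are introduced here for illustration and do not come from the patent: $x_k$ is the k-th entry of the feature vector, $w_k$ the k-th keyword in words, $N_{kc}$ the number of times $w_k$ occurs in training questions of type $c$, $N_c = \sum_k N_{kc}$, and $\alpha$ a smoothing constant ($\alpha = 1$ for Laplace smoothing).

```latex
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{k=1}^{n} x_k \log P(w_k \mid c) \right],
\qquad
P(w_k \mid c) = \frac{N_{kc} + \alpha}{N_c + \alpha n}
```

Taking logarithms turns the product of conditional probabilities into a sum, which leaves the maximizing label unchanged and avoids numerical underflow for long feature vectors.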

Claims

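1. A polynomial naive Bayes classification method for question intent recognition, characterized by comprising the following steps:

(1) Collecting the relevant databases, defining the question types, and acquiring training questions;

(2) Storing each training question and extracting its feature information;

(3) Training a polynomial naive Bayes classifier on the feature information;

(4) Performing intent recognition with the polynomial naive Bayes classifier.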

2. The polynomial naive Bayes classification method for question intent recognition according to claim 1, wherein the training questions in step (1) are acquired as follows:

3. The polynomial naive Bayes classification method for question intent recognition according to claim 2, wherein the feature information in step (2) is extracted as follows:

4. The polynomial naive Bayes classification method for question intent recognition according to claim 3, wherein the term frequency-inverse document frequency in step (3) is calculated as follows:

$$\text{TF-IDF}_{ij} = tf_{ij} \cdot idf_{ij} \qquad (3)$$

5. The polynomial naive Bayes classification method for question intent recognition according to claim 3, wherein the polynomial naive Bayes classifier in step (3) is trained as follows:

6. The polynomial naive Bayes classification method for question intent recognition according to claim 5, wherein the polynomial naive Bayes classifier in step (4) performs intent recognition as follows: