CN116975738A - Polynomial naive Bayesian classification method for question intent recognition - Google Patents

Info

Publication number
CN116975738A
CN116975738A CN202310969472.7A CN202310969472A CN116975738A CN 116975738 A CN116975738 A CN 116975738A CN 202310969472 A CN202310969472 A CN 202310969472A CN 116975738 A CN116975738 A CN 116975738A
Authority
CN
China
Prior art keywords
question
training
word
polynomial
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310969472.7A
Other languages
Chinese (zh)
Inventor
王锐
陈政宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Tuoyun Technology Co ltd
Original Assignee
Suzhou Jiayang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Jiayang Technology Co ltd
Priority to CN202310969472.7A
Publication of CN116975738A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a polynomial naive Bayesian classification method for question intent recognition, belonging to the technical field of intelligent question answering. The method comprises the following specific steps: (1) collecting relevant databases, defining question types, and obtaining training questions; (2) storing each training question and extracting its feature information; (3) training a polynomial naive Bayes classifier on the feature information; (4) performing intent recognition with the polynomial naive Bayes classifier. The method can effectively improve the accuracy of the naive Bayes classification algorithm in question classification and improve the quality of intent recognition.

Description

Polynomial naive Bayesian classification method for question intent recognition
Technical Field
The invention relates to the technical field of intelligent question answering, and in particular to a polynomial naive Bayesian classification method for question intent recognition.
Background
Intent recognition applies natural language understanding to a question in order to extract its specific intent. It is a key step between question analysis and intelligent question answering, and it is the basis of task-oriented question answering systems. Intent recognition is widely used in intelligent question answering systems in fields such as education, medicine, business, and management. Because the naive Bayes classification method classifies questions containing different keywords well, it is widely applied to intent recognition; it is therefore important to develop a polynomial naive Bayesian classification method for question intent recognition.
Existing polynomial naive Bayes classification methods have low accuracy in question classification and poor intent recognition quality; we therefore propose a polynomial naive Bayesian classification method for question intent recognition.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a polynomial naive Bayesian classification method for question intent recognition.
To achieve the above purpose, the invention adopts the following technical scheme:
A polynomial naive Bayesian classification method for question intent recognition comprises the following specific steps:
(1) Collecting relevant databases, defining question types, and obtaining training questions;
(2) Storing each training question and extracting its feature information;
(3) Training a polynomial naive Bayes classifier on the feature information;
(4) Performing intent recognition with the polynomial naive Bayes classifier.
As a further aspect of the invention, the training questions in step (1) are obtained as follows:
Step one: collect databases containing questions related to a vertical domain, check whether each training record in each database carries a corresponding question type label, label the training records that have none, and balance the number of samples in each category by oversampling or undersampling;
Step two: check whether the database contains duplicate records and delete any that exist; for training records in which some data field is missing a value, delete the record, fill the missing value with the mean or median, or handle the missing value by interpolation;
Step three: remove unnecessary special characters, punctuation marks, and HTML tags from each group of training data, normalize the processed training data into a unified format to obtain the training questions, and consolidate the training questions into a training data set.
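Steps two and three can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the procedure of the invention: the function names `clean_question` and `build_training_set` are hypothetical, records with a missing question or label are simply dropped (one of the options listed in step two), and lower-casing stands in for the "unified format".

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")        # crude matcher for HTML tags
SPECIAL_RE = re.compile(r"[^\w\s]")    # punctuation and other special characters


def clean_question(text):
    """Strip HTML tags, punctuation, and special characters; normalize whitespace."""
    text = TAG_RE.sub(" ", text)       # remove tags first
    text = html.unescape(text)         # then decode entities such as &amp;
    text = SPECIAL_RE.sub(" ", text)
    # Assumption: lower-casing and single spaces serve as the unified format.
    return " ".join(text.split()).lower()


def build_training_set(records):
    """records: iterable of (question, label) pairs; either field may be missing."""
    seen = set()
    cleaned = []
    for question, label in records:
        if not question or not label:          # drop records with missing values
            continue
        q = clean_question(question)
        if not q or (q, label) in seen:        # drop empty and duplicate records
            continue
        seen.add((q, label))
        cleaned.append((q, label))
    return cleaned
```

Because `\w` matches Unicode word characters in Python 3, the same pattern keeps Chinese characters while removing punctuation.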
As a further aspect of the invention, the feature information in step (2) is extracted as follows:
Step (1): replace the entities in each training question in the training data set with their corresponding types in the database, segment each training question into words or phrases with the jieba word segmentation library, and apply stemming to reduce each word to its stem;
Step (2): initialize a two-dimensional array data, store the word segment strings of the i-th training question in data[i][1] to data[i][j], and store the question type label of the i-th training question in data[i][j+1], where j is the number of word segments in the i-th training question;
Step (3): calculate the term frequency-inverse document frequency (TF-IDF) of each word segment data[i][j] of each training question, initialize an array words of size n, sort the word segments by TF-IDF in descending order, and store the n highest-ranked word segments in words;
Step (4): generate the polynomial keyword feature vector of each training question from the array words: if the question contains a keyword in words, set the corresponding position of the vector to 1, otherwise set it to 0; store the polynomial keyword feature vector in data[i][k+2].
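Steps (3) and (4) can be illustrated with a short Python sketch. It assumes the questions have already been segmented into token lists (for example by jieba), uses a common TF-IDF definition, and ranks each word segment by its highest TF-IDF score in any question; the text does not specify how scores are aggregated across questions, so that choice and the function names are assumptions.

```python
import math
from collections import Counter


def top_n_keywords(tokenized_questions, n):
    """Return the n word segments with the highest TF-IDF score in any question."""
    num_docs = len(tokenized_questions)
    # Document frequency: number of questions that contain each segment.
    doc_freq = Counter()
    for tokens in tokenized_questions:
        doc_freq.update(set(tokens))

    best = {}
    for tokens in tokenized_questions:
        for token, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(num_docs / doc_freq[token])
            best[token] = max(best.get(token, 0.0), tf * idf)
    # Descending score; ties are broken alphabetically for determinism.
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return ranked[:n]


def keyword_vector(tokens, words):
    """Binary feature vector: 1 where the question contains the keyword, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in words]


questions = [
    ["what", "is", "price"],
    ["what", "is", "address"],
    ["what", "is", "phone"],
]
words = top_n_keywords(questions, 3)
vector = keyword_vector(questions[0], words)
```

Segments that occur in every question ("what", "is") receive an IDF of zero and therefore fall out of the keyword array, which is the filtering effect the TF-IDF ranking is meant to provide.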
As a further aspect of the invention, the term frequency-inverse document frequency (TF-IDF) in step (3) is calculated as follows:
$\mathrm{TF\text{-}IDF}_{ij} = tf_{ij} \cdot idf_{ij}$ (3)
where $n_{ij}$ is the number of times the word segment data[i][j] appears in the i-th training question, and $D_i$ is the set of word segments of the i-th training question.
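Formula (3) relies on the quantities $tf_{ij}$ and $idf_{ij}$. One common way to define them, consistent with the symbols $n_{ij}$ and $D_i$, is sketched here as an assumption rather than as the formulas of the invention; $N$ (the total number of training questions) and the indices $k$ and $m$ are introduced only for illustration.

```latex
tf_{ij} = \frac{n_{ij}}{\sum_{k} n_{ik}},
\qquad
idf_{ij} = \log \frac{N}{\left|\{\, m : \mathrm{data}[i][j] \in D_m \,\}\right|}
```

Under this reading, $tf_{ij}$ is the share of the i-th question taken up by the word segment, and $idf_{ij}$ shrinks toward zero as the segment appears in more of the training questions.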
As a further aspect of the invention, the polynomial naive Bayes classifier in step (3) is trained as follows:
Step I: input the polynomial keyword feature vectors and the corresponding question type labels into the polynomial naive Bayes classifier as training key-value pairs; the classifier divides the received questions of each type into a training set and a test set;
Step II: calculate the prior probability of each question type in the training set and the conditional probability of each word given each type, then evaluate the trained polynomial naive Bayes classifier on the feature vectors and corresponding question type labels of the test set, calculating the classification accuracy, recall, and F1 score;
Step III: if every performance metric meets its preset threshold, apply the classifier to new, unseen user questions; otherwise, adjust the TF-IDF parameters through Laplace smoothing to optimize the performance of the classifier.
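The parameter estimation in Step II can be sketched as below. What the text calls a polynomial naive Bayes classifier is usually called multinomial naive Bayes in English, and the sketch follows that standard formulation. Applying Laplace (add-`alpha`) smoothing to the conditional probabilities is the conventional placement and an assumption here, as are the function name and the log-space representation.

```python
import math
from collections import defaultdict


def train_multinomial_nb(vectors, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log conditional probabilities."""
    num_features = len(vectors[0])
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0] * num_features)
    for vec, label in zip(vectors, labels):
        class_counts[label] += 1
        for k, value in enumerate(vec):
            feature_counts[label][k] += value

    total = len(labels)
    # Prior: fraction of training questions that belong to each type.
    log_prior = {c: math.log(n / total) for c, n in class_counts.items()}
    # Conditional: smoothed share of each keyword among all keyword hits of the type.
    log_cond = {}
    for c, counts in feature_counts.items():
        denom = sum(counts) + alpha * num_features
        log_cond[c] = [math.log((cnt + alpha) / denom) for cnt in counts]
    return log_prior, log_cond


vectors = [[1, 0], [1, 0], [0, 1]]
labels = ["a", "a", "b"]
log_prior, log_cond = train_multinomial_nb(vectors, labels)
```

With `alpha=1.0`, a keyword that never occurs with a question type still receives a small non-zero probability, so a single unseen keyword cannot force a posterior to zero.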
As a further aspect of the invention, intent recognition with the polynomial naive Bayes classifier in step (4) proceeds as follows:
First step: convert the question to be classified into a numerical feature vector by the TF-IDF method: initialize an array words of size n, sort the word segments by TF-IDF in descending order, and store the n highest-ranked word segments in words;
Second step: calculate the probability of each question type label from the prior probabilities, calculate the posterior probability of each question type label given the feature vector of the question to be classified, select the question type label with the maximum posterior probability as the predicted label of the question, and output it.
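The second step amounts to an argmax over unnormalized log posteriors. The sketch below assumes the priors and conditional probabilities are already available in log form; the parameter values and label names are hypothetical and chosen only to make the example self-contained.

```python
import math


def predict(vector, log_prior, log_cond):
    """Return the label with the highest log posterior, plus all scores."""
    scores = {}
    for label, prior in log_prior.items():
        # log P(label) + sum over keywords of x_k * log P(keyword_k | label)
        scores[label] = prior + sum(
            x * lp for x, lp in zip(vector, log_cond[label])
        )
    return max(scores, key=scores.get), scores


# Hypothetical parameters for two question types and two keywords.
log_prior = {"price": math.log(0.5), "address": math.log(0.5)}
log_cond = {
    "price": [math.log(0.7), math.log(0.3)],
    "address": [math.log(0.2), math.log(0.8)],
}
label, scores = predict([1, 0], log_prior, log_cond)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the normalizing constant can be skipped because it is the same for every label.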
Compared with the prior art, the invention has the beneficial effects that:
The polynomial naive Bayesian classification method for question intent recognition replaces the entities in each training question in the training data set with their corresponding types in the database, then segments each training question into words or phrases with the jieba word segmentation library and applies stemming to reduce each word to its stem. It initializes a two-dimensional array data and stores the word segment strings and the question type label of the i-th training question in that array. It then calculates the TF-IDF of each word segment of each training question, initializes an array words of size n to hold the word segments ranked by TF-IDF, and generates the polynomial keyword feature vector of each training question from words. The feature vectors and the corresponding question type labels are input into a polynomial naive Bayes classifier as training key-value pairs, and the trained classifier analyzes user questions and outputs the predicted question labels. This effectively improves the accuracy of the naive Bayes classification algorithm in question classification and improves the quality of intent recognition.
Drawings
The accompanying drawings provide a further understanding of the invention and constitute a part of this specification; together with the embodiments, they serve to explain the invention.
Fig. 1 is a flow chart of the polynomial naive Bayesian classification method for question intent recognition.
Detailed Description
Referring to Fig. 1, a polynomial naive Bayesian classification method for question intent recognition comprises the following specific steps:
Collect relevant databases, define question types, and obtain training questions.
Specifically, databases containing questions related to a vertical domain are collected, and each training record in each database is checked for a corresponding question type label; records without a label are labeled, and the number of samples in each category is balanced by oversampling or undersampling. The database is then checked for duplicate records, and any duplicates are deleted. For training records in which some data field is missing a value, the record is deleted, the missing value is filled with the mean or median, or the missing value is handled by interpolation. Unnecessary special characters, punctuation marks, and HTML tags are removed from each group of training data, the processed training data are normalized into a unified format to obtain the training questions, and the training questions are consolidated into a training data set.
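The class-balancing step mentioned above can be sketched with simple random oversampling. The text does not say which oversampling or undersampling scheme is used, so duplicating randomly chosen minority-class questions until every class matches the largest one is an assumption, as is the function name `oversample`.

```python
import random
from collections import defaultdict


def oversample(samples, seed=0):
    """Duplicate minority-class samples at random until all classes are equal in size."""
    rng = random.Random(seed)              # fixed seed keeps the sketch reproducible
    by_label = defaultdict(list)
    for question, label in samples:
        by_label[label].append(question)
    target = max(len(qs) for qs in by_label.values())

    balanced = []
    for label, qs in by_label.items():
        extra = [rng.choice(qs) for _ in range(target - len(qs))]
        balanced.extend((q, label) for q in qs + extra)
    return balanced


samples = [("a", "x"), ("b", "x"), ("c", "x"), ("d", "y")]
balanced = oversample(samples)
```

If deduplication is also applied, it would need to run before this step, since oversampling deliberately introduces repeated records.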
Store each training question and extract its feature information.
Specifically, the entities in each training question in the training data set are replaced with their corresponding types in the database, each training question is segmented into words or phrases with the jieba word segmentation library, and stemming is applied to reduce each word to its stem. A two-dimensional array data is initialized; the word segment strings of the i-th training question are stored in data[i][1] to data[i][j], and the question type label of the i-th training question is stored in data[i][j+1], where j is the number of word segments in the i-th training question. The TF-IDF of each word segment data[i][j] of each training question is calculated, an array words of size n is initialized, the word segments are sorted by TF-IDF in descending order, and the n highest-ranked word segments are stored in words. The polynomial keyword feature vector of each training question is generated from the array words: if the question contains a keyword in words, the corresponding position of the vector is set to 1, otherwise it is set to 0. The polynomial keyword feature vector is stored in data[i][k+2].
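The entity replacement and the layout of the data array can be illustrated as follows. The entity table, the label name, and the function names are hypothetical, and whitespace splitting stands in for the segmenter; Python lists are 0-based, so the segments occupy `row[0]` to `row[j-1]` and the label sits at `row[j]`, mirroring data[i][1] to data[i][j] and data[i][j+1].

```python
def replace_entities(question, entity_types):
    """Replace each known entity mention with the name of its type."""
    # Longest mentions first, so a short mention never splits a longer one.
    for mention in sorted(entity_types, key=len, reverse=True):
        question = question.replace(mention, entity_types[mention])
    return question


def build_row(question, label, entity_types, segment=str.split):
    """Return one row of the data array: word segments followed by the label."""
    tokens = segment(replace_entities(question, entity_types))
    return tokens + [label]


# Hypothetical entity table and training question.
entity_types = {"Inception": "MOVIE", "Nolan": "PERSON"}
row = build_row("who directed Inception", "director_of", entity_types)
```

For Chinese questions, a segmenter such as `jieba.lcut` could be passed as the `segment` argument in place of `str.split`. Replacing entities with their types lets questions about different movies or people share the same keyword features.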
It should be further noted that the TF-IDF is calculated as follows:
$\mathrm{TF\text{-}IDF}_{ij} = tf_{ij} \cdot idf_{ij}$ (3)
where $n_{ij}$ is the number of times the word segment data[i][j] appears in the i-th training question, and $D_i$ is the set of word segments of the i-th training question.
Train a polynomial naive Bayes classifier on the feature information.
Specifically, the polynomial keyword feature vectors and the corresponding question type labels are input into the polynomial naive Bayes classifier as training key-value pairs, and the classifier divides the received questions of each type into a training set and a test set. The prior probability of each question type in the training set and the conditional probability of each word given each type are calculated. The trained polynomial naive Bayes classifier is then evaluated on the feature vectors and corresponding question type labels of the test set, and the classification accuracy, recall, and F1 score are calculated. If every performance metric meets its preset threshold, the classifier is applied to new, unseen user questions; otherwise, the TF-IDF parameters are adjusted through Laplace smoothing to optimize the performance of the classifier.
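The evaluation and threshold check can be sketched as below. Macro-averaging recall and F1 over the question types is an assumption, since the text does not say how per-class metrics are combined; the function names and threshold values are likewise hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return accuracy plus macro-averaged recall and F1 over all labels."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    recalls, f1s = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        f1s.append(f1)
    return accuracy, sum(recalls) / len(labels), sum(f1s) / len(labels)


def meets_thresholds(metrics, thresholds):
    """True only if every metric reaches its preset threshold."""
    return all(m >= t for m, t in zip(metrics, thresholds))


metrics = evaluate(["a", "a", "b", "b"], ["a", "b", "b", "b"])
ready = meets_thresholds(metrics, (0.7, 0.7, 0.7))
```

When `meets_thresholds` returns `False`, the workflow described above would go back and tune the feature extraction or smoothing before the classifier is used on new questions.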
Perform intent recognition with the polynomial naive Bayes classifier.
Specifically, the question to be classified is converted into a numerical feature vector by the TF-IDF method: an array words of size n is initialized, the word segments are sorted by TF-IDF in descending order, and the n highest-ranked word segments are stored in words. The probability of each question type label is calculated from the prior probabilities, the posterior probability of each question type label given the feature vector of the question to be classified is calculated, and the question type label with the maximum posterior probability is selected as the predicted label of the question.

Claims (6)

1. A polynomial naive Bayesian classification method for question intent recognition, characterized by comprising the following specific steps:
(1) Collecting relevant databases, defining question types, and obtaining training questions;
(2) Storing each training question and extracting its feature information;
(3) Training a polynomial naive Bayes classifier on the feature information;
(4) Performing intent recognition with the polynomial naive Bayes classifier.
2. The polynomial naive Bayesian classification method for question intent recognition according to claim 1, wherein the training questions in step (1) are obtained as follows:
Step one: collect databases containing questions related to a vertical domain, check whether each training record in each database carries a corresponding question type label, label the training records that have none, and balance the number of samples in each category by oversampling or undersampling;
Step two: check whether the database contains duplicate records and delete any that exist; for training records in which some data field is missing a value, delete the record, fill the missing value with the mean or median, or handle the missing value by interpolation;
Step three: remove unnecessary special characters, punctuation marks, and HTML tags from each group of training data, normalize the processed training data into a unified format to obtain the training questions, and consolidate the training questions into a training data set.
3. The polynomial naive Bayesian classification method for question intent recognition according to claim 2, wherein the feature information in step (2) is extracted as follows:
Step (1): replace the entities in each training question in the training data set with their corresponding types in the database, segment each training question into words or phrases with the jieba word segmentation library, and apply stemming to reduce each word to its stem;
Step (2): initialize a two-dimensional array data, store the word segment strings of the i-th training question in data[i][1] to data[i][j], and store the question type label of the i-th training question in data[i][j+1], where j is the number of word segments in the i-th training question;
Step (3): calculate the term frequency-inverse document frequency (TF-IDF) of each word segment data[i][j] of each training question, initialize an array words of size n, sort the word segments by TF-IDF in descending order, and store the n highest-ranked word segments in words;
Step (4): generate the polynomial keyword feature vector of each training question from the array words: if the question contains a keyword in words, set the corresponding position of the vector to 1, otherwise set it to 0; store the polynomial keyword feature vector in data[i][k+2].
4. The polynomial naive Bayesian classification method for question intent recognition according to claim 3, wherein the TF-IDF in step (3) is calculated as follows:
$\mathrm{TF\text{-}IDF}_{ij} = tf_{ij} \cdot idf_{ij}$ (3)
where $n_{ij}$ is the number of times the word segment data[i][j] appears in the i-th training question, and $D_i$ is the set of word segments of the i-th training question.
5. The polynomial naive Bayesian classification method for question intent recognition according to claim 3, wherein the polynomial naive Bayes classifier in step (3) is trained as follows:
Step I: input the polynomial keyword feature vectors and the corresponding question type labels into the polynomial naive Bayes classifier as training key-value pairs; the classifier divides the received questions of each type into a training set and a test set;
Step II: calculate the prior probability of each question type in the training set and the conditional probability of each word given each type, then evaluate the trained polynomial naive Bayes classifier on the feature vectors and corresponding question type labels of the test set, calculating the classification accuracy, recall, and F1 score;
Step III: if every performance metric meets its preset threshold, apply the classifier to new, unseen user questions; otherwise, adjust the TF-IDF parameters through Laplace smoothing to optimize the performance of the classifier.
6. The polynomial naive Bayesian classification method for question intent recognition according to claim 5, wherein intent recognition with the polynomial naive Bayes classifier in step (4) proceeds as follows:
First step: convert the question to be classified into a numerical feature vector by the TF-IDF method: initialize an array words of size n, sort the word segments by TF-IDF in descending order, and store the n highest-ranked word segments in words;
Second step: calculate the probability of each question type label from the prior probabilities, calculate the posterior probability of each question type label given the feature vector of the question to be classified, select the question type label with the maximum posterior probability as the predicted label of the question, and output it.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310969472.7A, CN116975738A (en) | 2023-08-03 | 2023-08-03 | Polynomial naive Bayesian classification method for question intent recognition

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310969472.7A, CN116975738A (en) | 2023-08-03 | 2023-08-03 | Polynomial naive Bayesian classification method for question intent recognition

Publications (1)

Publication Number | Publication Date
CN116975738A (en) | 2023-10-31

Family

ID=88472891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310969472.7A Pending CN116975738A (en) 2023-08-03 2023-08-03 Polynomial naive Bayesian classification method for question intent recognition

Country Status (1)

Country Link
CN (1) CN116975738A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117473511A (en) * 2023-12-27 2024-01-30 中国联合网络通信集团有限公司 Edge node vulnerability data processing method, device, equipment and storage medium
CN117473511B (en) * 2023-12-27 2024-04-02 中国联合网络通信集团有限公司 Edge node vulnerability data processing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111783394A (en) Training method of event extraction model, event extraction method, system and equipment
CN109492230B (en) Method for extracting insurance contract key information based on interested text field convolutional neural network
US20200334410A1 (en) Encoding textual information for text analysis
CN111324742A (en) Construction method of digital human knowledge map
CN110910991B (en) Medical automatic image processing system
CN116975738A (en) Polynomial naive Bayesian classification method for question intent recognition
CN112181490B (en) Method, device, equipment and medium for identifying function category in function point evaluation method
CN114416979A (en) Text query method, text query equipment and storage medium
CN112632993A (en) Electric power measurement entity recognition model classification method based on convolution attention network
CN114090736A (en) Enterprise industry identification system and method based on text similarity
CN115858785A (en) Sensitive data identification method and system based on big data
CN113836896A (en) Patent text abstract generation method and device based on deep learning
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN115953123A (en) Method, device and equipment for generating robot automation flow and storage medium
Jui et al. A machine learning-based segmentation approach for measuring similarity between sign languages
CN112732863B (en) Standardized segmentation method for electronic medical records
CN116629258B (en) Structured analysis method and system for judicial document based on complex information item data
CN113536015A (en) Cross-modal retrieval method based on depth identification migration
CN117609583A (en) Customs import and export commodity classification method based on image text combination retrieval
CN103034657B (en) Documentation summary generates method and apparatus
CN111767402B (en) Limited domain event detection method based on counterstudy
CN111460160B (en) Event clustering method of stream text data based on reinforcement learning
CN114756617A (en) Method, system, equipment and storage medium for extracting structured data of engineering archives
CN111191448A (en) Word processing method, device, storage medium and processor
CN113821571A (en) Food safety relation extraction method based on BERT and improved PCNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231206

Address after: Room 303, 3rd Floor, Unit 1, Building 16, Luojia Yayuan (Phase I), No. 369 Shucheng Road, Hongshan District, Wuhan City, Hubei Province, 430070

Applicant after: Wuhan Tuoyun Technology Co.,Ltd.

Address before: Room 403, Building B6, Innovation Port, No. 15 Jinyang Road, Huaqiao Town, Kunshan City, Suzhou City, Jiangsu Province, 215300

Applicant before: Suzhou Jiayang Technology Co.,Ltd.