CN112256833A

CN112256833A - Intelligent question answering method for mobile phone based on big data and AI algorithm

Info

Publication number: CN112256833A
Application number: CN202011147321.6A
Authority: CN
Inventors: 胡亚军; 邵若梅; 孙树清
Original assignee: Shenzhen International Graduate School of Tsinghua University
Current assignee: Shenzhen International Graduate School of Tsinghua University
Priority date: 2020-10-23
Filing date: 2020-10-23
Publication date: 2021-01-22
Anticipated expiration: 2040-10-23
Also published as: CN112256833B

Abstract

The invention discloses an intelligent question answering method for mobile phone questions based on big data and an AI algorithm, comprising the following steps: S1, constructing a frequently asked questions library (FAQ library); S2, constructing a classification AI model based on a DNN neural network, and acquiring text information of a plurality of mobile-phone-related comments from the Internet to build a training set for training the classification AI model; S3, constructing a similarity calculation model based on the Bert algorithm; S4, collecting text information of a mobile-phone-related question, inputting it into the classification AI model to obtain the classification corresponding to the data to be processed, and obtaining the question-answer subset corresponding to that classification from the FAQ library; and S5, inputting the mobile-phone-related question and the question-answer subset of its classification into the similarity calculation model to obtain the matching question and its answer, and providing the answer to the user. The invention can intelligently provide a correct answer to the user based on the user's free-form question.

Description

Intelligent question answering method for mobile phone based on big data and AI algorithm

Technical Field

The invention relates to the field of text data artificial intelligence processing methods, in particular to a mobile phone question intelligent question and answer method based on big data and an AI algorithm.

Background

At present, the online customer service center of a mobile phone manufacturer generally provides a user help module for answering users' questions about a mobile phone, such as questions about the display screen, camera, battery and the like. The existing user help module works as follows: a user who encounters a problem goes to the online customer service center and browses all of the listed problems and their answers, hoping to find a case similar to his or her own and thereby obtain the desired answer. If the user cannot find the desired answer there, the user can submit a question in the system; the question is sent to the customer service mail system, and customer service staff answer the questions manually, in the order in which they were received, and send each answer to the user's mailbox. This process is time-consuming, the feedback is not timely, and the user has to wait a long time, so the user experience is poor.

Disclosure of Invention

The invention aims to provide an intelligent question answering method for mobile phone questions based on big data and an AI algorithm, in order to solve the prior-art problems that the user help module of a mobile phone online customer service center is slow to answer users' mobile-phone-related questions and gives untimely feedback.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

An intelligent question answering method for mobile phone questions based on big data and an AI algorithm, characterized in that the method comprises the following steps:

S1, constructing a frequently asked questions library (FAQ library), and classifying the questions in the FAQ library to obtain a plurality of classified question-answer subsets, wherein each classified question-answer subset comprises a plurality of questions of the same classification and their corresponding answers;

S2, constructing a classification AI model based on a DNN neural network, wherein the classification AI model takes vector data as input and outputs a classification; obtaining text information of a plurality of mobile-phone-related comments from the Internet as expected data; converting the expected data into vector data that a computer can recognize to serve as a training set; and inputting the training set into the classification AI model for training to obtain a classification AI model whose prediction accuracy meets the requirement;

S3, constructing a similarity calculation model based on the Bert algorithm;

S4, collecting text information of a mobile-phone-related question as data to be processed, converting the data to be processed into vector data, inputting the vector data into the classification AI model trained in step S2 to obtain the classification corresponding to the data to be processed, and then obtaining the question-answer subset corresponding to that classification from the FAQ library obtained in step S1;

S5, inputting the vector data corresponding to the data to be processed and the question-answer subset of the classification obtained in step S4 into the similarity calculation model obtained in step S3, performing distance calculation with the similarity calculation model to find, within that question-answer subset, the question closest to the question corresponding to the data to be processed together with its answer, and providing that answer to the user as the matching answer.

The intelligent question and answer method for the mobile phone based on the big data and AI algorithm is characterized in that: the FAQ library in step S1 is formed by converting text information of the mobile phone-related questions and answers to the questions acquired from a plurality of channels into data.

The intelligent question and answer method for the mobile phone based on the big data and AI algorithm is characterized in that: in step S2, text information of a plurality of comments related to the mobile phone is acquired from the internet as expected data by using a crawler capture method.

The intelligent question and answer method for the mobile phone based on the big data and AI algorithm is characterized in that: a crawler service program consisting of a crawler scheduler, a URL manager, a webpage downloader and a webpage parser is constructed based on the Scrapy framework to perform crawler capture, wherein the webpage downloader downloads webpage data from the Internet, the URL manager manages the downloaded webpage data, the webpage parser identifies the HTML tags of the webpage data and parses out the text information of a plurality of comments related to the mobile phone, the crawler scheduler schedules the URL manager, the webpage downloader and the webpage parser, and the text information finally obtained is stored in a MySQL database.

The intelligent question and answer method for the mobile phone based on the big data and AI algorithm is characterized in that: an IP pool for the crawler is built in the crawler service program based on dynamic IP proxy technology, so as to ensure that the crawler can continuously capture and download webpage data from the Internet.

The intelligent question and answer method for the mobile phone based on the big data and AI algorithm is characterized in that: before the training set is constructed in step S2, the text information of multiple comments related to the mobile phone in multiple languages is first translated and converted into the same language, and then the invalid text information is cleaned and removed, and the translated and converted text information is used as the expected data.

The intelligent question and answer method for the mobile phone based on the big data and AI algorithm is characterized in that: invalid text information is cleaned and removed by sentence validity judgment combined with a stop-word list.

The intelligent question and answer method for the mobile phone based on the big data and AI algorithm is characterized in that: in step S2, the expected data is converted into vector data by a word embedding method.

The intelligent question and answer method for the mobile phone based on the big data and AI algorithm is characterized in that: in step S2, the constructed classified AI model includes an input layer, at least two hidden layers and an output layer, each hidden layer includes a plurality of neurons, and the classified AI model whose prediction accuracy meets the requirement is finally obtained by adjusting the number of layers of the hidden layers and the number of neurons in each hidden layer in the training process.

According to the invention, a powerful crawler system is built to obtain massive user comment data from the Internet. The data are then cleaned using big data technology, part of the processed data is manually labeled, and the labeled data set is fed into a classification model based on artificial intelligence technology for training, so that an effective classification model is finally obtained. The user's question is classified by the AI classification model to determine accurately which category the user is asking about (for example, the classification model determines that the user is asking about a battery problem). After the user's question has been classified, the Bert model is used to compute the distance between the question and each question in the corresponding FAQ subset; the FAQ question with the smallest distance is taken as the best match for the user's question, and the answer corresponding to that question is sent to the user. The intelligent question answering service provided by the method aims to answer users' questions quickly, effectively and promptly, giving the user a sense of real-time interaction.

With this approach, the accuracy of the answers provided to users can exceed 90%. When a user's question cannot be answered, or the user's feedback indicates that the answer is irrelevant to the question, such questions can be collected, their answers found, and the collected questions and answers added to the FAQ library, further improving the capability of the intelligent question answering service.

The invention has the following beneficial effects: it can intelligently provide a correct answer based on the user's free-form question, takes little time and gives timely feedback, does not require the user to wait for a long time, and improves the user experience.

Drawings

FIG. 1 is a block diagram of a method flow of an embodiment of the present invention.

Fig. 2 is a schematic diagram of an embodiment of the present invention.

FIG. 3 is a flowchart of a crawler service technique according to an embodiment of the invention.

Fig. 4 is a structural diagram of a crawler service according to an embodiment of the present invention.

FIG. 5 is a flow diagram of a technique for language translation in accordance with an embodiment of the present invention.

Fig. 6a is a schematic diagram of the skip-gram model in word embedding.

Fig. 6b is a schematic diagram of a continuous bag-of-words model in word embedding.

FIG. 7 is a schematic diagram of a classified AI model architecture according to an embodiment of the invention.

Detailed Description

The invention is further illustrated with reference to the following figures and examples.

As shown in fig. 1 and 2, an intelligent question answering method for a mobile phone based on big data and AI algorithm includes the following steps:

S1, constructing a frequently asked questions library (FAQ library), and classifying the questions in the FAQ library to obtain a plurality of classified question-answer subsets, wherein each classified question-answer subset comprises a plurality of questions of the same classification and their corresponding answers.

The FAQ library is formed by digitizing the text information of mobile-phone-related questions and their answers acquired from a plurality of channels. It is extensible: questions that cannot be answered, or that are answered incorrectly, can be researched by experts to obtain answers and then added to the FAQ library.

To implement an intelligent question answering system, an FAQ library is needed first, that is, a library of frequently asked questions that records the questions users often encounter together with the related answers, for example: Q: Why can I not charge my phone? A: You should go to the shop where you bought the phone, then fill in a form and send the form to us so that we can check your phone and repair it. The FAQ entries therefore need to be obtained from multiple channels, and the FAQ library must be continuously expanded so that it covers the questions of most users and their corresponding answers.
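The classified, extensible FAQ library described in step S1 could be held in memory roughly as follows. This is only a sketch: the class name `FaqLibrary`, its methods and the sample entry are hypothetical, and the patent does not specify a storage format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QaPair:
    question: str
    answer: str


class FaqLibrary:
    """FAQ library split into per-category question-answer subsets."""

    def __init__(self):
        self._subsets = defaultdict(list)

    def add(self, category, question, answer):
        # Extending the library: newly researched answers are appended here.
        self._subsets[category].append(QaPair(question, answer))

    def subset(self, category):
        # Return a copy so callers cannot mutate the library by accident.
        return list(self._subsets.get(category, []))


# Hypothetical sample entry, for illustration only.
faq = FaqLibrary()
faq.add("battery", "Why can I not charge my phone?",
        "Please take the phone to the shop where you bought it.")
```

Keeping one list per category means that step S4 only has to look up a single subset instead of scanning the whole library.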

S2, constructing a classification AI model based on a DNN neural network, wherein the classification AI model takes vector data as input and outputs a classification; text information of a plurality of comments related to the mobile phone is obtained from the Internet as expected data, the expected data is converted into vector data that a computer can recognize to serve as a training set, and the training set is then input into the classification AI model for training to obtain a classification AI model whose prediction accuracy meets the requirement.

As shown in fig. 3 and 4, in step S2 a crawler service program is used to obtain from the Internet the mobile-phone-related text information contained in user comments on mobile phone e-commerce websites and review websites as expected data, and this text information is then used to construct the training set of the classification AI model.

Mobile phone terminals have tens of millions of users, and each user expresses the same question differently, so a user's question cannot be understood simply through rule matching; it must be understood through an artificial intelligence algorithm so that the most likely correct response can be given to the user. The sentence vectorization used by the algorithm requires a large corpus to build a word vector model, and the corpus should preferably be specific to the field concerned. The invention therefore uses a crawler framework to obtain user feedback from mobile phone e-commerce websites and review websites, and then builds the model from this corpus. At present, a web crawler service is built mainly to capture the comment content of mobile-phone-related websites and the user feedback collected by the mobile phone as the corpus.

The crawler service program is constructed based on the Scrapy framework and comprises a crawler scheduler, a URL (Uniform Resource Locator) manager, a webpage downloader and a webpage parser, wherein the webpage downloader downloads webpage data from the Internet, the URL manager manages the downloaded webpage data, the webpage parser identifies the HTML (Hypertext Markup Language) tags of the webpage data and parses out the text information of a plurality of comments related to the mobile phone, and the crawler scheduler schedules the URL manager, the webpage downloader and the webpage parser. The core of the crawler service program is the webpage parser, which must be custom-programmed for each web page and for the one or more pieces of information to be crawled from it. The required information is extracted by identifying the relevant HTML tags, and the parsed information is finally structured and stored in a MySQL database for use by the subsequent big data analysis service.
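The parsing step can be illustrated with a small sketch that uses only Python's standard `html.parser` module rather than a full crawler framework. The class name `CommentParser`, the `comment` CSS class and the sample page are assumptions made for illustration; a real page would need its own tag rules, as the text notes.

```python
from html.parser import HTMLParser


class CommentParser(HTMLParser):
    """Collects the text inside every tag whose class is exactly `comment`.

    Assumes well-nested markup without void elements (such as <br>)
    inside a comment element.
    """

    def __init__(self):
        super().__init__()
        self.comments = []
        self._depth = 0      # > 0 while inside a comment element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1
        elif ("class", "comment") in attrs:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.comments.append(text)

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


# Hypothetical page fragment used only for this demonstration.
page = """
<html><body>
  <div class="comment">Battery drains <b>too fast</b>.</div>
  <div class="ad">Buy now!</div>
  <p class="comment">Great camera.</p>
</body></html>
"""
parser = CommentParser()
parser.feed(page)
```

After `feed`, `parser.comments` holds the two comment texts and skips the advertisement block; each entry could then be written to a database row.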

Meanwhile, the invention constructs the IP pool of the crawler in the crawler service program based on the dynamic IP proxy technology so as to ensure that the crawler continuously captures and downloads webpage data from the Internet.
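One simple way to picture such an IP pool is a round-robin rotation that skips proxies known to have failed. The sketch below is an assumption about how a pool might behave, not the patent's implementation; the class name `ProxyPool` and the addresses are hypothetical.

```python
import itertools


class ProxyPool:
    """Round-robin pool of proxy addresses that skips ones marked as failed."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._failed = set()
        self._cycle = itertools.cycle(self._proxies)

    def mark_failed(self, proxy):
        self._failed.add(proxy)

    def next_proxy(self):
        # Try each proxy at most once per call before giving up.
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if proxy not in self._failed:
                return proxy
        raise RuntimeError("no usable proxy left in the pool")


# Hypothetical addresses, for illustration only.
pool = ProxyPool(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
first = pool.next_proxy()
pool.mark_failed("10.0.0.2:8080")
second = pool.next_proxy()
```

Each download request would ask the pool for the next address, so a blocked address does not stop the crawl.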

In step S2, before a training set is constructed, text information of a plurality of comments related to a mobile phone in a plurality of languages is translated and converted into the same language, invalid text information is cleaned and removed, and the translated and converted text information is used as expected data.

As shown in fig. 5, there are two schemes for translation conversion: manual translation and translation by calling the Google API. Manual translation has the following problems: many translators are needed, so the labor cost is very high; translation quality depends on the competence and literacy of the individual translator; and translation takes a long time, cannot be done in real time, and translators cannot work around the clock. Calling the Google API for translation avoids these disadvantages and has the following advantages: calls are free, and although the call frequency is limited, ordinary use is not affected; and the translation quality is high. The method uses Google API translation to translate the various languages uniformly into English. It can translate up to 100 languages and works in real time: the translation result is obtained immediately after the interface is called. If some sentences cannot be translated by Google translation, they are either discarded or handed to a manual translation group. In general, machine translation is the primary means and manual translation the auxiliary means, which solves the problems of high translation cost and lack of real-time performance while avoiding the shortcomings of machine translation as far as possible.
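The "machine translation first, manual translation as fallback" flow can be sketched as below. The translator is passed in as a plain function so that no particular translation API is assumed; `translate_corpus`, `demo_translate` and the lookup table are hypothetical stand-ins.

```python
def translate_corpus(sentences, machine_translate):
    """Translate with a machine translator first; queue failures for humans.

    `machine_translate` is a caller-supplied function (hypothetical here) that
    returns the English text, or raises / returns None when it cannot translate.
    """
    translated, manual_queue = [], []
    for sentence in sentences:
        try:
            result = machine_translate(sentence)
        except Exception:
            result = None
        if result:
            translated.append(result)
        else:
            manual_queue.append(sentence)   # discard or hand to translators
    return translated, manual_queue


# Stand-in translator used only for this demonstration.
_demo_table = {"La bateria es mala": "The battery is bad", "Bonjour": "Hello"}


def demo_translate(text):
    return _demo_table.get(text)


done, pending = translate_corpus(
    ["La bateria es mala", "???", "Bonjour"], demo_translate)
```

Sentences left in `pending` are the ones that would be dropped or passed to the manual translation group.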

In step S2, the invention cleans and removes invalid text information by sentence validity judgment combined with a stop-word list. The data cleaning process uses the following two means:

1. Judging the validity of a sentence by the proportion of valid words: if the number of valid words divided by the total number of words in a sentence is more than 50%, the sentence is considered effective feedback. There are other, more sophisticated ways to judge the validity of a sentence, but because the amount of collected comment data is relatively large, low-value content can be removed in this simple way.

2. Filtering useless words with a stop-word list: stop words generally refer to words that do not contribute to semantic analysis, such as some punctuation marks, modal particles and personal names. In everyday text processing, therefore, the step after word segmentation is stop-word removal. However, stop-word removal is not fixed; the stop-word dictionary must be chosen according to the specific scenario. In sentiment analysis (positive, neutral and negative evaluation), for example, modal particles and exclamation marks should be kept, because they contribute to and carry meaning for the strength of tone and the emotional coloring.
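The two cleaning rules above can be sketched as follows. The vocabulary and stop-word sets are tiny illustrative lists, and the function names are hypothetical; only the 50% threshold comes from the text.

```python
import re

# Illustrative lists only; a real system would load much larger ones.
VOCABULARY = {"the", "battery", "drains", "fast", "camera", "is", "great", "very"}
STOP_WORDS = {"the", "is", "very"}


def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())


def is_valid_sentence(sentence, threshold=0.5):
    """Keep a sentence when more than `threshold` of its words are known."""
    words = tokenize(sentence)
    if not words:
        return False
    valid = sum(1 for w in words if w in VOCABULARY)
    return valid / len(words) > threshold


def remove_stop_words(sentence):
    return [w for w in tokenize(sentence) if w not in STOP_WORDS]


comments = ["The battery drains very fast", "asdf qwer zxcv battery"]
cleaned = [remove_stop_words(c) for c in comments if is_valid_sentence(c)]
```

The second comment is dropped because only one of its four words is recognized, and the first is reduced to its content words. For a sentiment task the stop-word set would be chosen differently, as noted above.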

In step S2, the expected data is converted into vector data that can be recognized by a computer by word embedding, and then the vector data is used as a training set, as shown in fig. 6a and 6 b.

For a computer to process words and sentences, the words must be encoded and converted into vector data that the computer can recognize. The most basic method is one-hot encoding; for example, the three words "i", "love" and "you" can be encoded as i = 001, love = 010 and you = 100. It can be seen that each word needs as many bits as there are words in the corpus, which makes the code for each word too long. To solve this problem, the invention uses the word embedding method word2vec for the conversion. There are two word2vec implementations and core ideas:

1. Skip-gram model. This model assumes that the words surrounding a given word in a text sequence are generated based on that word. Assume that the text sequence is "the", "man", "loves", "his" and "son", that "loves" is the center word, and that the background window size is set to 2. As shown in FIG. 6a, the skip-gram model is concerned with the conditional probability of generating the background words "the", "man", "his" and "son", which lie no more than 2 words from the center word "loves", i.e., P("the", "man", "his", "son" | "loves"). Assuming that the background words are generated independently of each other given the center word, this can be rewritten as P("the" | "loves") · P("man" | "loves") · P("his" | "loves") · P("son" | "loves").

2. Continuous bag-of-words model (CBOW). The continuous bag-of-words model is similar to the skip-gram model. The biggest difference is that the continuous bag-of-words model assumes that a center word is generated based on the background words before and after it in the text sequence. For the same text sequence "the", "man", "loves", "his" and "son", with "loves" as the center word and a background window size of 2, the continuous bag-of-words model is concerned with the conditional probability of generating the center word "loves" given the background words "the", "man", "his" and "son" (as shown in fig. 6b), i.e., P("loves" | "the", "man", "his", "son").
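The one-hot codes and the training examples that the two word2vec variants would draw from the sample sentence can be reproduced with a short sketch. It only builds the codes and the (center, background) pairings; it does not train any embedding, and the function names are hypothetical.

```python
def one_hot_codes(words):
    """Map each distinct word to a one-hot bit string (code length = vocab size)."""
    vocab = list(dict.fromkeys(words))          # keep first-seen order
    size = len(vocab)
    # Word i gets a single 1 in position i counted from the right.
    return {w: format(1 << i, f"0{size}b") for i, w in enumerate(vocab)}


def context_window(tokens, index, window=2):
    """Background words within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def skip_gram_pairs(tokens, window=2):
    # Skip-gram: one (center, background) pair per background word.
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]


def cbow_examples(tokens, window=2):
    # CBOW: one (background words, center) example per position.
    return [(context_window(tokens, i, window), center)
            for i, center in enumerate(tokens)]


codes = one_hot_codes(["i", "love", "you"])
sentence = ["the", "man", "loves", "his", "son"]
loves_pairs = [p for p in skip_gram_pairs(sentence) if p[0] == "loves"]
loves_cbow = cbow_examples(sentence)[2]
```

For the center word "loves", the skip-gram pairs correspond one-to-one to the four factors of the product above, while the single CBOW example groups the same four background words as the condition.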

In step S2 of the present invention, the classification AI model is an artificial neural network model based on a DNN. During model training, under-fitting or over-fitting is often encountered. Under-fitting means that the data and the scenario are not understood deeply enough: the amount of training data is too small or an inappropriate model is used, so the model does not capture the actual data well and its predictions are biased. Over-fitting means that an excessively complex model is used to describe a class of phenomena, so the model does not generalize.

To avoid under-fitting and over-fitting when training the classification AI model, the number of neurons in the input layer, the number of hidden layers, the number of neurons in each hidden layer and the number of neurons in the output layer must be determined at model design time for each application scenario and prediction target. These numbers differ from model to model. In addition, during model training some neurons are randomly discarded (Dropout) so that the model learns certain features of the data selectively, which improves the generalization ability of the model and avoids over-fitting.
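The random discarding of neurons can be sketched as follows. The text does not say which Dropout variant is used; this sketch assumes the common "inverted dropout" form, in which the kept activations are rescaled during training, and the function name and drop rate are illustrative.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during training."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # Scaling kept units by 1/keep keeps the expected activation unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0] * 1000
dropped = dropout(hidden, rate=0.5, rng=rng)
unchanged = dropout(hidden, rate=0.5, rng=rng, training=False)
```

At prediction time (`training=False`) every neuron is kept, so the same network can be used unchanged after training.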

The classified AI model constructed by the present invention has four network layers in total, as shown in fig. 7, including one input layer, two hidden layers, and one output layer. The number of the neurons of the input layer is 1, and vector data of mobile phone related problems proposed by a user correspond to the neurons; the number of neurons in the first hidden layer is 1560, and the number of neurons in the second hidden layer is 195; the number of neurons in the output layer is 10, corresponding to the problem classification such as camera, screen, battery, connection, storage, appearance, performance, price, sound, hardware. If the problem classification is subdivided or other states exist, the number of the neurons of the output layer changes correspondingly, and the number of the neurons of the output layer corresponds to the total number of the problem classification. The number of hidden layers and the number of neurons therein may be adjusted according to the actual effect, and the adjusting process is as follows:

In one case, the accuracy of the classification AI model in predicting the question classification reaches 98% in the experimental stage but only 65% in actual use. This is called over-fitting. In this case, the number of hidden layers can be reduced from two to one and the number of neurons in that hidden layer reduced to 780, which lowers the complexity of the model and improves the ability of the classification AI model to generalize from one case to others.

In another case, the accuracy of the classification AI model in predicting the question classification is only 70% in the experimental stage. This is called under-fitting. In this case, the number of hidden layers can be increased from two to three, with 3000 neurons in the first layer, 1000 in the second and 195 in the third, which increases the complexity of the model and enhances its learning ability.
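A forward pass through the four-layer network described above (hidden layers of 1560 and 195 neurons, 10 output categories) might look like the following pure-Python sketch. The input vector length, the ReLU activation, the softmax output and the random initialization are all assumptions; the patent does not state them, and the weights here are untrained, so the predicted category is meaningless.

```python
import math
import random

CATEGORIES = ["camera", "screen", "battery", "connection", "storage",
              "appearance", "performance", "price", "sound", "hardware"]


def init_layers(sizes, rng):
    """One (weights, biases) pair per consecutive pair of layer sizes."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def softmax(values):
    peak = max(values)                     # subtract the max for stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(layers, x):
    """ReLU on hidden layers, softmax on the output layer (assumed choices)."""
    for depth, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if depth < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return softmax(x)


rng = random.Random(0)
EMBEDDING_DIM = 16                         # assumed sentence-vector length
layers = init_layers([EMBEDDING_DIM, 1560, 195, 10], rng)
probs = forward(layers, [rng.random() for _ in range(EMBEDDING_DIM)])
predicted = CATEGORIES[probs.index(max(probs))]
```

Changing the list passed to `init_layers`, for example to `[EMBEDDING_DIM, 780, 10]`, gives the reduced layout mentioned for the over-fitting case.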

After repeated rounds of training and prediction, a new classification AI model is finally obtained whose accuracy in both the experimental stage and the prediction stage exceeds 90%, which meets the requirements for the classification AI model.

The whole process is one of tuning the parameters of the algorithm model, and its sole purpose is to obtain the optimal model: one whose prediction accuracy in the experimental stage and in practical application both reach the expected standard.
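The two adjustment cases can be condensed into a small decision rule. This is only a sketch of the reasoning: the three layouts are taken from the text, while the function name and the 90% target used as the decision threshold are assumptions.

```python
def adjust_hidden_layers(experiment_acc, real_acc, target=0.90):
    """Suggest a hidden-layer layout from two accuracy measurements.

    The layouts come from the two cases in the text; using `target` as the
    decision threshold is an assumption, not a rule given by the patent.
    """
    if experiment_acc < target:
        return [3000, 1000, 195]     # under-fitting: add capacity
    if real_acc < target:
        return [780]                 # over-fitting: remove capacity
    return [1560, 195]               # both accuracies meet the target


overfit_layout = adjust_hidden_layers(0.98, 0.65)
underfit_layout = adjust_hidden_layers(0.70, 0.68)
```

In practice the adjustment would be repeated, retraining after each change, until both accuracies reach the expected standard.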

In step S2, after the classification AI model is constructed, corresponding training data must be created and provided to the model for training, and how to construct a high-quality data set is very important. In the NLP domain, different people understand the same sentence differently, which further increases the difficulty of constructing the training data. For example, for the sentence: In The photo I just look books dictionary, different people do not understand the sentence consistently: some will consider it a Camera problem, some a Screen problem, and others a System problem.

To construct a high-quality training set, the invention has several people label the data together. Taking 20 labels per sentence as an example, the training data are manually labeled and the labels are then aggregated. For a sentence S, twenty labels T1-T20 are obtained; the 20 labels are grouped by class and the largest group is selected. If that group contains 15 or more labels, its class is considered valid and is used as the final class of the sentence. If the largest group contains fewer than 15 labels, the disagreement over the classification of the sentence is considered too large, and the sentence is not included in the training set.

In practical scenarios, a single user question sometimes involves several problems at once, as in the following: The Camera focuses on The horizontal stroke of The Camera targets pictures, and i found The game targets to The fast.
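The agreement rule for the 20 labels can be written down directly. The function name `consensus_label` is hypothetical; the threshold of 15 agreeing labels comes from the text.

```python
from collections import Counter


def consensus_label(labels, min_votes=15):
    """Return the majority label if at least `min_votes` annotators agree."""
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else None


agreed = consensus_label(["Camera"] * 16 + ["Screen"] * 3 + ["System"])
disputed = consensus_label(["Camera"] * 9 + ["Screen"] * 6 + ["System"] * 5)
```

A sentence for which the function returns `None` would be left out of the training set.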

S3, constructing a similarity calculation model based on the Bert algorithm. BERT, on which the similarity calculation model is based, is an Encoder of a bidirectional Transformer and a commonly used text similarity model. It set new records on a number of NLP tasks, is more efficient than an RNN, and can capture longer-distance dependencies. It can obtain bidirectional context information in the true sense.

S4, collecting text information of a mobile-phone-related question as data to be processed, converting the data to be processed into vector data, inputting the vector data into the classification AI model trained in step S2 to obtain the classification corresponding to the data to be processed, and then obtaining the question-answer subset corresponding to that classification from the FAQ library obtained in step S1.

After the word vectors and the sentence vector of the user's question are obtained, the distance between this sentence vector and the sentence vector of each question in the FAQ can be calculated, and the question in the FAQ that best matches the user's question, together with its answer, is finally determined and recommended to the user. There are two ways to calculate the vector distance: the Euclidean distance and the cosine distance. Either method is used to calculate the distance between the vector of the question input by the user and the vectors of all the questions in the FAQ.

To calculate sentence similarity better, a new similarity algorithm is introduced, namely the well-known Bert algorithm. BERT is an Encoder of a bidirectional Transformer. The main innovation of the model lies in its pre-training method: through pre-training and fine-tuning, BERT set new records on a number of NLP tasks, and it is more efficient than an RNN and can capture longer-distance dependencies. It can obtain bidirectional context information in the true sense.

The model is fine-tuned on the translated and cleaned expected data to obtain a similarity calculation model suited to mobile phone comments. The similarity calculation model can accurately calculate the similarity of two texts after vectorization: the closer their semantics, the higher their similarity.
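The two distance measures named above can be written in a few lines. Cosine distance is taken here as one minus the cosine similarity, which is a common convention; the text does not define it explicitly, and the function names are illustrative.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm
```

Euclidean distance is sensitive to vector length, whereas cosine distance compares direction only, so two sentence vectors that differ mainly in magnitude are still treated as close.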

Therefore, after the user's question has been classified by the classification AI model, the vectorized question and the vectorized question-answer subset of the corresponding category in the FAQ library are input into the similarity calculation model for distance calculation. The closest question is the one that best matches the user's mobile-phone-related question, and the answer corresponding to that question is provided to the user.
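Steps S4 and S5 together can be sketched end to end as below. To keep the example self-contained, the trained DNN classifier is replaced by a keyword rule and the BERT sentence vector by a bag-of-words count vector; both are stand-ins chosen for illustration, and the FAQ entries are hypothetical.

```python
import math
from collections import Counter

# Hypothetical FAQ subsets keyed by category.
FAQ = {
    "battery": [("Why does my phone not charge?", "Try another charger and cable."),
                ("Why does the battery drain fast?", "Close background apps.")],
    "camera": [("Why are my photos blurry?", "Clean the lens and tap to focus.")],
}


def classify(question):
    # Stand-in for the trained DNN classifier: a keyword rule.
    return "camera" if "photo" in question.lower() else "battery"


def embed(text):
    # Stand-in for a BERT sentence vector: a bag-of-words count vector.
    return Counter(text.lower().replace("?", "").split())


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def answer(question):
    subset = FAQ[classify(question)]                       # step S4
    query = embed(question)
    best = min(subset, key=lambda qa: cosine_distance(query, embed(qa[0])))
    return best[1]                                         # step S5


reply = answer("my phone battery seems to drain really fast")
```

Because the classifier narrows the search to one subset first, the distance calculation only runs against the questions of that category rather than the whole library.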

The embodiments described above are only preferred embodiments of the present invention and do not limit its concept or scope. Various modifications and improvements made to the technical solution of the present invention by those skilled in the art without departing from its design concept shall fall within the protection scope of the present invention. The technical content for which protection is claimed is fully set forth in the claims.

Claims

1. A mobile phone question intelligent question and answer method based on big data and AI algorithm is characterized in that: the method comprises the following steps:

s3, constructing a similarity calculation model based on a Bert algorithm;

2. The intelligent question-answering method for the mobile phone based on the big data and AI algorithm as claimed in claim 1, wherein: the FAQ library in step S1 is formed by converting text information of the mobile phone-related questions and answers to the questions acquired from a plurality of channels into data.

3. The intelligent question-answering method for the mobile phone based on the big data and AI algorithm as claimed in claim 1, wherein: in step S2, text information of a plurality of comments related to the mobile phone is acquired from the internet as expected data by using a crawler capture method.

4. The intelligent question-answering method for the mobile phone based on the big data and AI algorithm as claimed in claim 3, wherein: a crawler service program consisting of a crawler scheduler, a URL manager, a webpage downloader and a webpage parser is constructed based on the Scrapy framework to perform crawler capture, wherein the webpage downloader downloads webpage data from the Internet, the URL manager manages the downloaded webpage data, the webpage parser identifies the HTML tags of the webpage data and parses out the text information of a plurality of comments related to the mobile phone, the crawler scheduler schedules the URL manager, the webpage downloader and the webpage parser, and the text information finally obtained is stored in a MySQL database.

5. The intelligent question-answering method for the mobile phone based on the big data and AI algorithm as claimed in claim 4, wherein: an IP pool for the crawler is built in the crawler service program based on dynamic IP proxy technology, so as to ensure that the crawler can continuously capture and download webpage data from the Internet.

6. The intelligent question-answering method for the mobile phone based on the big data and AI algorithm as claimed in claim 1, wherein: before the training set is constructed in step S2, the text information of multiple comments related to the mobile phone in multiple languages is first translated and converted into the same language, and then the invalid text information is cleaned and removed, and the translated and converted text information is used as the expected data.

7. The intelligent question-answering method for the mobile phone based on the big data and AI algorithm as claimed in claim 6, wherein: invalid text information is cleaned and removed by sentence validity judgment combined with a stop-word list.

8. The intelligent question-answering method for the mobile phone based on the big data and AI algorithm as claimed in claim 1, wherein: in step S2, the expected data is converted into vector data by a word embedding method.

9. The intelligent question-answering method for the mobile phone based on the big data and AI algorithm as claimed in claim 1, wherein: in step S2, the constructed classified AI model includes an input layer, at least two hidden layers and an output layer, each hidden layer includes a plurality of neurons, and the classified AI model whose prediction accuracy meets the requirement is finally obtained by adjusting the number of layers of the hidden layers and the number of neurons in each hidden layer in the training process.