CN111797214A - FAQ database-based problem screening method and device, computer equipment and medium - Google Patents

FAQ database-based problem screening method and device, computer equipment and medium Download PDF

Info

Publication number
CN111797214A
CN111797214A
Authority
CN
China
Prior art keywords
similarity
question
word
query result
faq
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010591151.4A
Other languages
Chinese (zh)
Inventor
张山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202010591151.4A
Publication of CN111797214A
Legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The embodiments of the application belong to the field of artificial intelligence and relate to a question screening method, apparatus, computer device, and medium based on an FAQ database. The application further relates to blockchain technology: the question sentence input by the user can be stored on a blockchain. The method and apparatus improve the accuracy of question screening and recommend high-quality questions to the user.

Description

FAQ database-based problem screening method and device, computer equipment and medium
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a question screening method and apparatus based on an FAQ database, a computer device, and a medium.
Background
In a Frequently Asked Questions (FAQ) dialogue system, a question-and-answer library containing a large number of question-answer pairs is constructed in advance. When a question posed by a user is received, the FAQ system searches the library for a question matching the user's question and returns the matched question, together with its answer, to the user.
At present, question-answering systems in the industry mostly retrieve standard FAQ questions through direct matching or direct word segmentation. The accuracy of retrieving standard, similar, and associated questions in this way is limited: questions that are close in semantics are hard to find, so the questions and answers returned match poorly with what the user actually wants. In addition, the limited configuration of an FAQ makes it difficult to match the unlimited ways users describe their questions, and difficult to recommend high-quality questions to customers. Traditional FAQ question answering therefore requires maintaining a huge knowledge base and still suffers from low precision in question screening.
Disclosure of Invention
The embodiments of the application aim to provide a question screening method and apparatus based on an FAQ database, a computer device, and a medium, with the main goal of quickly and accurately screening out questions that match the user's question.
To solve the above technical problem, an embodiment of the present application provides a question screening method based on an FAQ database, adopting the following technical scheme:
parsing a question sentence input by a user and performing word segmentation on the question sentence;
counting the frequency of each segmented word in the FAQ database, determining the weight of each word, and storing the weights and the segmentation result in the FAQ database;
querying the FAQ database for candidate questions corresponding to the question sentence, scoring the candidate questions according to the word weights, and selecting candidate questions whose score is greater than or equal to a preset score as query results;
calculating a similarity value between the question sentence and each query result according to a similarity algorithm model, and filtering out query results whose similarity value is not within a preset range; and
processing the filtered query results with a classification algorithm to determine the preset number of questions most similar to the input question sentence.
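Taken together, the claimed steps form a screen-then-refine pipeline. The following is a minimal Python sketch of that flow, assuming whitespace tokenization in place of the semantic segmenter, a Jaccard-only similarity stage, and a plain sort in place of the classification stage; all function names and thresholds are hypothetical stand-ins, not the patent's implementation.

```python
def segment(sentence):
    # Stand-in for the semantic word segmenter (step 1).
    return sentence.lower().split()

def word_weights(tokens, faq_questions):
    # Step 2: weight each word by its frequency across the FAQ database,
    # reading the description's rule as weight = 100 x word frequency.
    all_tokens = [t for q in faq_questions for t in segment(q)]
    return {t: 100 * all_tokens.count(t) for t in tokens}

def score(tokens, weights):
    # Step 3: score a candidate by summing the weights of shared words.
    return sum(weights.get(t, 0) for t in tokens)

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def screen_questions(user_question, faq_questions,
                     min_score=100, sim_floor=0.2, top_n=2):
    q_tokens = segment(user_question)
    weights = word_weights(q_tokens, faq_questions)
    # Primary screen: keep candidates at or above the score threshold.
    kept = [c for c in faq_questions
            if score(segment(c), weights) >= min_score]
    # Secondary screen: drop candidates outside the similarity range.
    sims = [(c, jaccard(q_tokens, segment(c))) for c in kept]
    sims = [(c, s) for c, s in sims if s >= sim_floor]
    # Final stage: return the top_n most similar candidates.
    sims.sort(key=lambda cs: cs[1], reverse=True)
    return [c for c, _ in sims[:top_n]]

faq = ["where is shenzhen busiest",
       "where is shenzhen most lively",
       "how do i reset my password"]
result = screen_questions("where is shenzhen most lively", faq)
print(result)
```

Here the final stage is a plain sort; the description instead uses a FastText/textCNN classification stage, which this sketch does not reproduce.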
To solve the above technical problem, an embodiment of the present application further provides a question screening apparatus based on an FAQ database, adopting the following technical scheme:
a word segmentation module, configured to parse a question sentence input by a user and perform word segmentation on the question sentence;
a processing module, configured to count the frequency of each segmented word in the FAQ database, determine the weight of each word, and store the weights and the segmentation result in the FAQ database;
a query scoring module, configured to query the FAQ database for candidate questions corresponding to the question sentence and score the candidate questions according to the word weights;
a screening module, configured to select questions whose score is greater than or equal to a preset score as query results;
a similarity calculation module, configured to calculate a similarity value between the question sentence and each query result according to a similarity algorithm model and filter out query results whose similarity value is not within a preset range;
a classification calculation module, configured to process the filtered query results with a classification algorithm; and
a determining module, configured to determine the preset number of questions most similar to the input question sentence.
To solve the above technical problem, an embodiment of the present application further provides a computer device, adopting the following technical scheme:
the computer device comprises a memory storing computer readable instructions and a processor that, when executing the computer readable instructions, implements the steps of the FAQ database-based question screening method described above.
To solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, adopting the following technical scheme:
the computer-readable storage medium stores computer readable instructions which, when executed by a processor, implement the steps of the FAQ database-based question screening method described above.
Compared with the prior art, the embodiment of the application mainly has the following beneficial effects:
the method comprises the steps of performing word segmentation processing on question sentences by analyzing the question sentences input by a user; counting the word frequency of each word in the FAQ database after word segmentation, determining the weight of each word, and storing the weight of each word and the word segmentation result into the FAQ database; candidate problems corresponding to the question sentences in the FAQ database are inquired, the candidate problems are scored according to the weight of each word, the candidate problems with the score being larger than or equal to a preset score are screened out to serve as inquiry results, and the inquired problems can be preliminarily screened to obtain accurate inquiry results; calculating the similarity value between the question statement and the query result according to the similarity algorithm model, filtering the query result of which the similarity value is not in a preset range, and further analyzing the similarity of the preliminarily screened result so as to find out the question which is practically similar to the question input by the user; and calculating the filtered query result by adopting a classification algorithm, determining the problems of the preset number with the highest similarity to the input question sentences, further improving the accuracy of the screened problems, and recommending high-quality problems for the user.
Drawings
To illustrate the solutions of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a FAQ database based question screening method according to the present application;
FIG. 3 is a flowchart of one embodiment of step S203 in FIG. 2;
FIG. 4 is a flowchart of one embodiment of step S204 of FIG. 2;
FIG. 5 is a flowchart of one embodiment of step S205 of FIG. 2;
FIG. 6 is a schematic block diagram of one embodiment of an FAQ database based question screening apparatus according to the present application;
FIG. 7 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
To address the low accuracy of FAQ question screening, the present application provides a question screening method based on an FAQ database. The method involves artificial-intelligence semantic analysis and can be applied to the system architecture 100 shown in FIG. 1, which may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as the medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the method for screening questions based on the FAQ database provided in the embodiment of the present application is generally executed by a server/terminal device, and accordingly, the apparatus for screening questions based on the FAQ database is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to FIG. 2, FIG. 2 is a flowchart of an embodiment of a question screening method based on an FAQ database according to an embodiment of the present application, described here as executed by the server in FIG. 1. The FAQ database-based question screening method comprises the following steps:
step S201, analyzing the question sentence input by the user, and performing word segmentation processing on the question sentence.
The question sentence input by the user may be entered as audio or as text, which is not limited here. If the information input by the user is an audio file, speech recognition is performed on the audio file, the recognition result is converted into text data, and word segmentation is performed on the text data to obtain the corresponding segmentation result.
In this embodiment, the question sentence input by the user is received and parsed. Specifically, semantic analysis may be performed on the question sentence, including word segmentation, part-of-speech analysis, named entity recognition, stop-word removal, and the like; the word segmentation is performed on the question sentence by a semantics-based segmenter.
It is emphasized that, to further ensure the privacy and security of the question sentence input by the user, it may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Step S202, counting word frequency of each word in the FAQ database after word segmentation, determining weight of each word, and storing the weight of each word and word segmentation results into the FAQ database.
It should be noted that the FAQ database is established in advance, with a search engine built on it. The question sentence input by the user is retrieved in the FAQ database using the query, analysis, and search functions of the search engine, and the query results corresponding to the question sentence are obtained through retrieval. Search engines in this application include, but are not limited to, Elasticsearch.
In this embodiment, the question and answer history data of each user may be periodically collected, and the FAQ database may be updated by using the collected question and answer history data of each user, where the question and answer history data of the user may include: questions answered by the user, questions posed, browsed questions, queried questions and the like, and question-answer pairs matched with the questions posed by the user can be crawled from search engines such as Google, Baidu and Yahoo by using a web crawler to update the FAQ database.
The weight of each word can be determined by examining the distribution of each word in the FAQ database based on its statistical characteristics, setting the weight according to distribution features such as word frequency. Specifically, each segmented word is obtained and its frequency in the FAQ database is counted, and the weight of each word is computed as: word weight = 100 × word frequency.
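A short sketch of this weighting rule, assuming whitespace tokenization stands in for the semantic segmenter and reading the rule as weight = 100 × frequency:

```python
from collections import Counter

def segment(sentence):
    # Stand-in for the semantic segmenter of step S201.
    return sentence.lower().split()

def compute_weights(faq_questions):
    # Count each word's frequency over the whole FAQ database and set
    # weight = 100 x word frequency, as we read the description; whether
    # frequent words should instead be down-weighted (IDF-style) is not
    # specified there.
    tf = Counter(t for q in faq_questions for t in segment(q))
    return {word: 100 * count for word, count in tf.items()}

weights = compute_weights(["where is shenzhen busiest",
                           "where is shenzhen most lively"])
print(weights["shenzhen"], weights["busiest"])
```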
The weights and segmentation results are stored in the database because the system trains and learns as users input questions: the FAQ database is continuously updated, and the word weights are updated along with it. In this embodiment, the segmentation result may be written back to the FAQ database through the training model.
Step S203, querying candidate questions corresponding to the question sentence in the FAQ database, scoring the candidate questions according to the weight of each word, and selecting candidate questions whose score is greater than or equal to a preset score as query results.
In this embodiment, the FAQ database is queried using the query function of the search engine built on it to find candidate questions corresponding to the question sentence input by the user; the candidate questions are scored through the analysis and related functions of the search engine, and candidate questions whose score is greater than or equal to the preset score are selected as query results.
In this embodiment, the selected query results may further be sorted by score.
In this embodiment, selecting candidate questions whose score is greater than or equal to the preset score performs a primary screening of the queried candidates, so that a relatively accurate query result can be obtained.
Step S204, calculating a similarity value between the question sentence and each query result according to a similarity algorithm model, and filtering out query results whose similarity value is not within a preset range.
In this embodiment, to calculate the similarity between the query results and the user's input question, several similarity algorithm models are used in combination: the Jaccard similarity algorithm, the BM25 algorithm, the cosine similarity algorithm, and the edit distance (Edit_distance). Specifically, the Jaccard similarity algorithm measures similarity between samples; the larger the Jaccard coefficient, the higher the similarity. The BM25 algorithm is based on a probabilistic retrieval model and evaluates the relevance between search terms and documents. Cosine similarity measures the similarity between two vectors by the cosine of the angle between them; in information retrieval, each term is assigned a dimension, a document is represented as a vector whose component values correspond to the frequency of each term in the document, and the cosine similarity then reflects how similar two documents are in topic. The edit distance is the minimum number of edit operations required to transform one string into another; it describes how close two strings are, and a larger distance means they are more different.
It should be noted that these four similarity models are combined with different weights for different scenes. For example, to calculate the similarity Sim(A, B) between a first question A and a second question B in scene a, assume that in this scene the Jaccard similarity weight is 0.3, the BM25 weight is 0.2, the cosine similarity weight is 0.2, and the edit distance weight is 0.3; then Sim(A, B) = Jaccard × 0.3 + BM25 × 0.2 + cosine × 0.2 + Edit_distance × 0.3.
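The scene-weighted blend can be sketched directly; the component scores below are made-up inputs, which in practice would come from the four similarity algorithms of step S204:

```python
def combined_similarity(scores, scene_weights):
    # Sim(A, B) = sum of each algorithm's score times its scene weight.
    return sum(scores[name] * w for name, w in scene_weights.items())

# Weights for "scene a" from the description.
scene_a = {"jaccard": 0.3, "bm25": 0.2, "cosine": 0.2, "edit": 0.3}
# Hypothetical per-algorithm scores for a question pair (A, B).
scores = {"jaccard": 0.5, "bm25": 0.8, "cosine": 0.9, "edit": 0.6}

sim = combined_similarity(scores, scene_a)
print(round(sim, 2))  # 0.15 + 0.16 + 0.18 + 0.18 = 0.67
```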
Further, in some optional implementations of this embodiment, after the candidate questions with scores greater than or equal to the preset score are screened out and ranked, a preset number of top-ranked query results may be selected for similarity calculation, which improves screening efficiency and ensures that user questions are answered efficiently.
In this embodiment, similarity calculation is performed on the query results obtained by the primary screening, and query results whose similarity value is not within the preset range are filtered out. That is, the query results undergo a second screening, to further find questions semantically similar to the question sentence input by the user.
Step S205, processing the filtered query results with a classification algorithm, and determining the preset number of questions with the highest similarity to the question sentence.
Specifically, from the filtered query results, the preset number of questions with the highest similarity to the user's question sentence are obtained from the FAQ database by a classification algorithm combining two models, FastText and textCNN. It should be noted that in text classification tasks, FastText can often achieve precision comparable to a deep network while training many orders of magnitude faster; textCNN applies a convolutional neural network to the text classification task, extracting key information from the sentence with kernels of multiple sizes.
In this embodiment, applying the classification algorithm to the filtered query results and determining the preset number of questions most similar to the question sentence constitutes the third screening, so that answers to the user's question can be returned effectively and accurately to the greatest extent.
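The combination itself can be sketched independently of the models. Below, two word-overlap heuristics stand in for FastText and textCNN; everything here is a hypothetical stand-in, not the patent's classifiers, and only the score-averaging and top-k selection are illustrated:

```python
def overlap_score(query, candidate):
    # Stand-in for the first classifier: raw word overlap.
    return len(set(query.split()) & set(candidate.split()))

def normalized_score(query, candidate):
    # Stand-in for the second classifier: overlap normalized by length.
    words = candidate.split()
    return overlap_score(query, candidate) / max(len(words), 1)

def top_k_questions(query, candidates, k=2):
    # Average the two scores and keep the k best candidates.
    scored = [(c, (overlap_score(query, c) + normalized_score(query, c)) / 2)
              for c in candidates]
    scored.sort(key=lambda cs: cs[1], reverse=True)
    return [c for c, _ in scored[:k]]

cands = ["where is shenzhen busiest",
         "how to reset password",
         "where is shenzhen most lively"]
top = top_k_questions("where is shenzhen most lively", cands)
print(top)
```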
In this embodiment, the question sentence input by the user is parsed and segmented into words, the weights of the segmented words are determined, candidate questions corresponding to the question sentence are queried, and the candidates found in the FAQ database are scored and screened according to the word weights. Then, the similarity value between the question sentence and each screened query result is calculated according to the similarity algorithm model, and query results whose similarity value is not within the preset range are filtered out. Finally, the filtered query results are processed with a classification algorithm to determine the preset number of questions most similar to the input question sentence. This realizes multiple rounds of screening of the queried questions, improves the accuracy of the screened questions, and recommends high-quality questions to the user.
In some optional implementations of this embodiment, referring to FIG. 3, step S203 of querying candidate questions corresponding to the question sentence in the FAQ database and scoring them according to the word weights specifically comprises the following steps:
step S301, extracting keywords in question sentences according to the weight of each word.
Keywords are important components of a question and help in understanding the user's intent. In this embodiment, the question sentence input by the user has already been segmented in step S202; based on the segmentation result and the weight of each word, words whose weight is greater than a preset weight value are determined as the keywords of the question sentence.
Step S302, determining the expansion words of the keywords.
In this embodiment, the expansion words may be obtained through a related-word list: words highly relevant to the keywords of the question sentence are looked up in the list and substituted for the keywords. The expansion words may also be obtained through a knowledge graph of the FAQ database: a knowledge graph can be constructed in advance, and expansion words can be generated automatically from the hypernyms/hyponyms or synonymous attributes of the keywords. It should be noted that the knowledge graph can keep collecting data from user queries for training, so that the set of expansion words grows larger and more accurate. In this embodiment, expanding the keywords broadens the search range and thereby improves the recall rate for user questions.
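A sketch of the related-word-list path; the table entries below are hypothetical, and a knowledge-graph lookup would replace the dictionary:

```python
# Hypothetical related-word table; real entries would come from the
# maintained word list or the FAQ knowledge graph.
RELATED_WORDS = {
    "hottest": ["busiest", "most-flourishing"],
}

def expand_keywords(keywords):
    # Keep each keyword and append its related words as expansion words.
    expanded = []
    for kw in keywords:
        expanded.append(kw)
        expanded.extend(RELATED_WORDS.get(kw, []))
    return expanded

expanded = expand_keywords(["shenzhen", "where", "hottest"])
print(expanded)
```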
Step S303, generating candidate questions corresponding to the question sentences according to the keywords and the expansion words.
In this embodiment, candidate questions corresponding to the question sentence input by the user are obtained in the FAQ database using the query, analysis, and search functions of the search engine; the expansion words increase the number of candidate questions.
For example, suppose the question sentence input by the user is "where is the hottest place in Shenzhen". The extracted keywords are <Shenzhen> <where> <hottest>, and the candidate questions generated from the keywords and expansion words are:
A. where Shenzhen is hottest;
B. where Shenzhen is most flourishing.
And step S304, scoring the candidate questions according to the weight of the words.
In this embodiment, scoring a candidate question may consist of segmenting the generated candidate question and summing the weights of its segmented words, so that the candidate is scored by its total weight; alternatively, only the weights of the keywords and/or expansion words in the candidate question are summed, and the score is taken from that result. It should be noted that the present application is not limited to scoring candidate questions by superposition of weights.
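A sketch of the superposition scoring, assuming the weight table below is already stored in the FAQ database (its values are illustrative only):

```python
def score_candidate(candidate_tokens, weights):
    # Superpose (sum) the weights of the candidate's segmented words;
    # restricting the sum to keywords/expansion words works the same way.
    return sum(weights.get(t, 0) for t in candidate_tokens)

weights = {"shenzhen": 200, "where": 200, "busiest": 100}
a = score_candidate(["shenzhen", "where", "busiest"], weights)
b = score_candidate(["shenzhen", "where"], weights)
print(a, b)  # 500 400
```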
In this embodiment, the keywords of the question sentence input by the user are extracted, their expansion words are determined, and the query is performed with both, generating the candidate questions corresponding to the question sentence; this broadens the search range and improves the recall rate for user questions.
In some optional implementations, referring to fig. 4, step S204 specifically includes the following steps:
step S401, respectively calculating a Jaccard similarity, a BM25 similarity, a cosine similarity, and an edit distance similarity between the question sentence and the query result.
For convenience in describing the basic principle, let T_1 and T_2 denote the question sentence input by the user and the query result, respectively. The Jaccard similarity algorithm computes the similarity value S_1 between the question sentence T_1 and the query result T_2:

S_1 = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}

where T_1 and T_2 are taken as the sets of their segmented words.
Before calculating the similarity between two texts by using the Jaccard similarity algorithm, word segmentation is first performed, and the word segmentation processing is the same as that in step S201.
For example, suppose the question sentence T1 and the query result T2 are:

T1: Where is Shenzhen most lively;
T2: Where is Shenzhen most prosperous;

The word segmentation results are respectively:

T1: [Shenzhen, where, most, lively];
T2: [Shenzhen, where, most, prosperous];

Then

$$S_1 = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|} = \frac{3}{5} = 0.6$$
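The Jaccard step described above can be sketched in a few lines of Python; this is a minimal illustration in which the English token lists stand in for the output of the Chinese word segmenter of step S201:

```python
# A minimal sketch of the Jaccard similarity step. The token lists are
# illustrative stand-ins for the output of the word segmenter (step S201).

def jaccard_similarity(tokens1, tokens2):
    """Jaccard similarity S1 = |T1 ∩ T2| / |T1 ∪ T2| over word sets."""
    s1, s2 = set(tokens1), set(tokens2)
    if not s1 and not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)

# The example from the text: the two sentences share 3 of 5 distinct words.
t1 = ["Shenzhen", "where", "most", "lively"]
t2 = ["Shenzhen", "where", "most", "prosperous"]
print(jaccard_similarity(t1, t2))  # 0.6
```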
The BM25 algorithm computes the similarity value S2 between the question sentence T1 and the query result T2 using the following formula:

$$S_2 = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f_i\,(k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}, \qquad IDF(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$

where qi denotes a morpheme obtained by parsing T1 (for Chinese, the word segments of T1 can be treated as morphemes, each word being one morpheme qi); dl is the length of the query result T2 and avgdl is the average length of all documents; k1 and b are tuning factors, here k1 = 2 and b = 0.75; fi is the frequency with which qi occurs in T2; N is the total number of documents and n(qi) is the number of documents containing qi.
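A rough sketch of the BM25 scoring described above follows. The tiny corpus, the token lists, and the `+ 1` inside the IDF logarithm (used here only to keep IDF non-negative on a corpus this small) are illustrative assumptions, not part of the patent:

```python
import math

# A minimal sketch of BM25 with k1 = 2 and b = 0.75 as stated in the text.
# The corpus below is an illustrative stand-in for the FAQ database.

def bm25_score(query_tokens, doc_tokens, corpus, k1=2.0, b=0.75):
    """Score one tokenized document against a tokenized query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N     # average document length
    dl = len(doc_tokens)                        # this document's length
    score = 0.0
    for q in query_tokens:
        n_q = sum(1 for d in corpus if q in d)  # documents containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)  # +1: stay non-negative
        f = doc_tokens.count(q)                 # term frequency in the document
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return score

corpus = [
    ["Shenzhen", "where", "most", "lively"],
    ["Shenzhen", "where", "most", "prosperous"],
    ["Beijing", "weather", "today"],
]
query = ["Shenzhen", "lively"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores.index(max(scores)))  # 0: the document matching both terms wins
```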
The cosine similarity algorithm computes the similarity value S3 between the question sentence T1 and the query result T2 using the following formula:

$$S_3 = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \cdot \sqrt{\sum_{i=1}^{n} y_i^2}}$$

where xi is the TF-IDF weight of the i-th word of the question sentence T1 and yi is the TF-IDF weight of the i-th word of the query result T2. TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique in information retrieval and data mining; TF stands for term frequency and IDF for inverse document frequency. After word segmentation, TF-IDF is used to weight the words of each sentence; the cosine of the angle between the resulting vectors is not affected by vector magnitude, and the cosine value falls in the interval [0, 1], with larger values indicating a smaller difference between the two texts.
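The TF-IDF weighting and cosine step can be sketched as follows. The smoothed IDF formula, the two-document corpus, and the shared-vocabulary construction are illustrative assumptions; any consistent TF-IDF scheme gives the same qualitative behavior:

```python
import math
from collections import Counter

# A minimal sketch of the TF-IDF + cosine similarity step. Vectors are
# built over the union vocabulary of the two sentences; IDF uses a small
# stand-in corpus with add-one smoothing (an assumption, not the patent's).

def tfidf_vector(tokens, corpus, vocab):
    N = len(corpus)
    tf = Counter(tokens)
    vec = []
    for w in vocab:
        df = sum(1 for d in corpus if w in d)       # document frequency
        idf = math.log((N + 1) / (df + 1)) + 1      # smoothed IDF
        vec.append(tf[w] / len(tokens) * idf)       # TF * IDF
    return vec

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

corpus = [["Shenzhen", "where", "most", "lively"],
          ["Shenzhen", "where", "most", "prosperous"]]
vocab = sorted(set(corpus[0]) | set(corpus[1]))
x = tfidf_vector(corpus[0], corpus, vocab)
y = tfidf_vector(corpus[1], corpus, vocab)
s3 = cosine_similarity(x, y)
print(round(s3, 3))  # high but below 1: the sentences differ in one word
```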
The edit distance similarity computes the similarity value S4 between the question sentence T1 and the query result T2 using the following formula:

$$S_4 = \frac{1}{n}\sum_{i=1}^{n} \max_{1 \le j \le m} \operatorname{editsim}(w_{1i}, w_{2j})$$

where editsim(w1i, w2j) is the similarity between the i-th word w1i of the question sentence T1 (of n words) and the j-th word w2j of the query result T2 (of m words); i, j, n, and m are positive integers with 1 ≤ i ≤ n and 1 ≤ j ≤ m. editsim(w1i, w2j) is calculated as:

$$\operatorname{editsim}(w_{1i}, w_{2j}) = 1 - \frac{\operatorname{editdis}(w_{1i}, w_{2j})}{\max(w_{1i}.length,\ w_{2j}.length)}$$

where editdis(w1i, w2j) is the edit distance between the i-th word w1i of T1 and the j-th word w2j of T2; w1i.length is the length of the i-th word of T1 and w2j.length is the length of the j-th word of T2.
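The two edit-distance quantities above can be sketched directly; editdis is the standard Levenshtein distance computed by dynamic programming, and editsim normalizes it by the longer word so identical words score 1.0:

```python
# A minimal sketch of editdis (Levenshtein distance) and the normalized
# editsim used in the edit distance similarity step.

def editdis(a, b):
    """Levenshtein distance via a single-row dynamic program."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

def editsim(w1, w2):
    """editsim = 1 - editdis / max length; identical words score 1.0."""
    if not w1 and not w2:
        return 1.0
    return 1 - editdis(w1, w2) / max(len(w1), len(w2))

print(editdis("kitten", "sitting"))  # 3
print(editsim("lively", "lively"))   # 1.0
```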
Step S402, weighting and summing the calculated Jaccard similarity, BM25 similarity, cosine similarity and edit distance similarity according to respective weight values to obtain a similarity value between the question sentence and the query result.
In this embodiment, the similarity value Sim(T1, T2) between the question sentence and the query result is calculated as:

$$\operatorname{Sim}(T_1, T_2) = \alpha S_1 + \beta S_2 + \gamma S_3 + \omega S_4$$

where α is the weight of the Jaccard similarity, β the weight of the BM25 similarity, γ the weight of the cosine similarity, and ω the weight of the edit distance similarity, with α + β + γ + ω = 1 and α ≥ 0, β ≥ 0, γ ≥ 0, ω ≥ 0.
It should be noted that the values of α, β, γ, and ω differ across scenarios and may be set according to the practical application.
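The weighted fusion of step S402 can be sketched as follows; the equal weights are illustrative defaults only, since the text notes the weights must be tuned per scenario subject to α + β + γ + ω = 1:

```python
# A minimal sketch of the weighted similarity fusion of step S402.
# The default weights are illustrative, not values from the patent.

def fused_similarity(s1, s2, s3, s4,
                     alpha=0.25, beta=0.25, gamma=0.25, omega=0.25):
    """Sim(T1, T2) = α·S1 + β·S2 + γ·S3 + ω·S4, weights sum to 1."""
    assert abs(alpha + beta + gamma + omega - 1.0) < 1e-9
    assert min(alpha, beta, gamma, omega) >= 0
    return alpha * s1 + beta * s2 + gamma * s3 + omega * s4

print(round(fused_similarity(0.6, 0.8, 0.9, 0.7), 2))  # 0.75
```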
Step S403, filtering the query result whose similarity value is not within the preset range.
Specifically, the calculated similarity value is compared with a preset similarity threshold value, and a query result with the similarity value larger than or equal to the preset similarity threshold value is reserved; and removing the query result with the similarity value smaller than the preset similarity threshold value.
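The filtering of step S403 amounts to a single threshold comparison; in this sketch the (question, similarity) pair representation and the threshold value are illustrative assumptions:

```python
# A minimal sketch of step S403: keep query results whose similarity
# value meets the preset threshold and remove the rest. The pair
# representation and threshold are illustrative.

def filter_results(results, threshold=0.5):
    """results: list of (question_text, similarity_value) pairs."""
    return [(q, s) for q, s in results if s >= threshold]

candidates = [("Where is Shenzhen most lively", 0.82),
              ("Shenzhen weather today", 0.31)]
print(filter_results(candidates))  # only the first pair survives
```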
In this embodiment, by calculating the similarity value between the question sentence input by the user and the query result and filtering out the query results whose similarity values are not within the preset range, questions with higher similarity can be reliably selected from a large number of similar questions.
In some optional implementations, as shown in fig. 5, step S205 specifically includes the following steps:
Step S501, obtaining the word vectors of the question sentence input by the user and of the query result, respectively, through FastText.
The FastText model is an existing open-source word vector and text classification model in the field of natural language processing. It takes each word, represented in vector form, together with the N-Gram features of that word as input, and outputs the label corresponding to the text. As a byproduct it also outputs an embedding vector for each word, which is the word vector used in this embodiment. The embedding vector is a dimension-reduced vector, and the N-Gram features are word features used to evaluate the degree of difference between words. In this embodiment, each word represented in vector form and its corresponding N-Gram features are used as input to the FastText model to obtain the word vector of each word.
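FastText's N-Gram features are character n-grams of each word, padded with boundary markers; the word vector is then derived from the vectors of these n-grams. Only the n-gram extraction is sketched here (the `<`/`>` padding convention follows FastText; the n-gram range is an illustrative choice):

```python
# A minimal sketch of FastText-style character N-Gram extraction.
# FastText pads each word with '<' and '>' boundary markers and collects
# its character n-grams; the word vector is built from the n-gram vectors.

def char_ngrams(word, n_min=3, n_max=4):
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```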
Step S502, inputting the word vectors into the textCNN model, constructing a similarity matrix of the question sentences and the query results after the operation of the convolution layer and the pooling layer, and outputting the similarity values of the question sentences and the query results through the full connection layer.
Specifically, the word vectors of the question sentence input by the user and of the query result are concatenated to obtain the corresponding sentence vectors. The sentence vectors are passed through convolution layers to extract the sequence information of the sentences, and through a pooling layer to compress the sentence vector dimensions and construct a similarity matrix between the user's question and the query result. A fully connected layer then converts the question sentence vector and the query result vector into a single vector, which is input into a logistic regression model to obtain the similarity value between the question sentence and the query result.
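The convolution, pooling, fully connected, and logistic stages of step S502 can be sketched with plain numpy. All weights below are random stand-ins (a trained textCNN would learn them), and the layer sizes are illustrative assumptions:

```python
import numpy as np

# A heavily simplified numpy sketch of the textCNN-style pipeline of
# step S502: 1-D convolution over concatenated word vectors, max pooling,
# a fully connected layer, and a sigmoid (logistic) output in (0, 1).

rng = np.random.default_rng(0)

def conv1d(x, kernel):
    """Valid 1-D convolution over the sequence axis. x: (seq_len, dim)."""
    k = kernel.shape[0]
    return np.array([np.sum(x[i:i + k] * kernel) for i in range(len(x) - k + 1)])

def textcnn_similarity(q_vecs, r_vecs, n_filters=4, kernel_size=2):
    x = np.concatenate([q_vecs, r_vecs], axis=0)          # concatenated sentences
    dim = x.shape[1]
    kernels = rng.normal(size=(n_filters, kernel_size, dim))   # random stand-ins
    pooled = np.array([conv1d(x, k).max() for k in kernels])   # max pooling
    w = rng.normal(size=n_filters)                        # fully connected layer
    z = pooled @ w
    return 1 / (1 + np.exp(-z))                           # logistic output

q = rng.normal(size=(4, 8))   # 4 words, 8-dim word vectors (illustrative)
r = rng.normal(size=(5, 8))
sim = textcnn_similarity(q, r)
print(0.0 < sim < 1.0)  # True: the similarity is squashed into (0, 1)
```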
Note that the logistic regression model compresses the input vector to the interval [0, 1] and outputs the result.
For example, assume that after passing through the fully connected layer the resulting vector is X = (x1, x2, x3, ..., xn). This vector is used as input to the logistic regression model:

$$y = \frac{1}{1 + e^{-z}}$$

further,

$$z = \omega_1 x_1 + \omega_2 x_2 + \cdots + \omega_n x_n$$

where ω1, ω2, ..., ωn are the weights of the input components x1, x2, ..., xn, respectively.
In this embodiment, combining the FastText model and the textCNN model to calculate the similarity value between the question sentence and the query result improves computational efficiency and further helps ensure, to the greatest extent, that accurate results are returned for the user's question.
In some optional implementations, after the candidate questions corresponding to the question sentence are queried in the FAQ database in step S203, the method further includes the following step: judging the number of questions among the candidate questions and executing the corresponding operation according to the judgment result.
Specifically, if the number of questions is zero, the classification algorithm is applied directly; if the number of questions is less than or equal to a preset threshold, the candidate questions are returned directly to the client; if the number of questions is greater than the preset threshold, the candidate questions are scored according to the weight of each word, the candidate questions whose scores are greater than or equal to the preset score are selected as query results and sorted, and steps S204 and S205 are then executed. In this method, when no candidate question corresponding to the question sentence is found, the classification algorithm can be applied directly so that questions of the same type can still be returned to the user, avoiding a recall rate of zero; when the number of questions is less than or equal to the preset threshold, returning the candidate questions directly to the client improves query efficiency; and when the number of questions is greater than the preset threshold, the queried questions undergo multiple rounds of calculation and screening, which improves the accuracy of the screening and allows high-quality questions to be recommended to the user.
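The three-way branching on the candidate count can be sketched as a small dispatcher; the returned labels and the threshold value are illustrative stand-ins for the operations described in the text:

```python
# A minimal sketch of the candidate-count branching described above.
# The string labels stand in for the three operations: fall back to the
# classification algorithm, return candidates directly, or score/filter/
# rank and then run steps S204 and S205.

def dispatch(candidates, threshold=5):
    if len(candidates) == 0:
        return "classification"       # avoid zero recall
    if len(candidates) <= threshold:
        return "return_directly"      # cheap path: hand candidates to client
    return "score_filter_rank"        # full pipeline: S204 and S205 follow

print(dispatch([]))                              # classification
print(dispatch(["q1", "q2"]))                    # return_directly
print(dispatch([f"q{i}" for i in range(10)]))    # score_filter_rank
```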
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order; unless explicitly stated herein, they may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may comprise multiple sub-steps or stages, which are not necessarily performed at the same time or in sequence, but may be performed at different times, in turn, or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In an embodiment, as shown in fig. 6, a problem screening apparatus based on an FAQ database is provided, and the problem screening apparatus based on the FAQ database corresponds to the problem screening method based on the FAQ database in the above embodiment one to one. This problem screening device based on FAQ database includes:
the word segmentation module 601 is configured to analyze a question sentence input by a user and perform word segmentation processing on the question sentence;
the processing module 602 is configured to count word frequencies of the words in the FAQ database after word segmentation, determine weights of the words, and store the weights of the words and word segmentation results in the FAQ database;
a query scoring module 603, configured to query candidate questions corresponding to the question statements in the FAQ database, and score the candidate questions according to weights of the terms;
a screening module 604, configured to screen out a question with a score greater than or equal to a preset score as a query result;
the similarity calculation module 605 is configured to calculate a similarity value between the question statement and the query result according to the similarity calculation model, and filter the query result whose similarity value is not within a preset range;
a classification calculation module 606, configured to calculate the filtered query result by using a classification algorithm; and
the determining module 607 is configured to determine the preset number of questions with the highest similarity to the input question sentences.
It is emphasized that, to further ensure the privacy and security of the user input question statement, the user input question statement may also be stored in a node of a blockchain.
In some optional implementations of this embodiment, the query scoring module 603 includes:
the extraction unit is used for extracting keywords in the question sentences according to the weight of each word;
an expansion unit for determining an expansion word of the keyword;
and the query scoring unit is used for generating candidate problems corresponding to the question sentences according to the keywords and the expansion words and scoring the candidate problems according to the weights of the words.
In some optional implementations of this embodiment, the similarity calculation module 605 includes:
the calculation unit is used for calculating the similarity of the Jaccard, the similarity of the BM25, the similarity of the cosine and the similarity of the editing distance between the question sentence and the query result respectively, and weighting and summing the calculated similarity of the Jaccard, the similarity of the BM25, the similarity of the cosine and the similarity of the editing distance according to respective weight values to obtain a similarity value between the question sentence and the query result;
and the filtering unit is used for filtering the query results of which the similarity values are not in the preset range.
Specifically, the filtering unit is configured to compare the similarity value with a preset similarity threshold, retain the query result with the similarity value greater than or equal to the preset similarity threshold according to the comparison result, and remove the query result with the similarity value less than the preset similarity threshold.
In some optional implementations of this embodiment, the classification calculation module 606 includes:
the acquisition unit is used for respectively acquiring word vectors of the question sentences and the query results through FastText;
and the processing unit is used for inputting the word vectors into the textCNN model, constructing a similarity matrix of the question sentences and the query results after the operation of the convolution layer and the pooling layer, and outputting the similarity values of the question sentences and the query results through the full connection layer.
The FAQ database-based question screening apparatus analyzes the question sentence input by the user, performs word segmentation on it, determines the weight of each word after segmentation, queries the candidate questions corresponding to the question sentence, and scores and screens the candidate questions queried in the FAQ database according to the weight of each word. It then calculates the similarity value between the question sentence and the screened query results according to the similarity algorithm model and filters out the query results whose similarity values are not within the preset range. Finally, it calculates the filtered query results with the classification algorithm, thereby determining a preset number of questions with the highest similarity to the input question sentence. In this way the queried questions are screened multiple times, the accuracy of the screening is improved, and high-quality questions are recommended to the user.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 7, fig. 7 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 7 comprises a memory 71, a processor 72, and a network interface 73, which are communicatively connected to each other via a system bus. Note that only a computer device 7 having components 71-73 is shown; not all of the illustrated components are required, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 71 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the storage 71 may be an internal storage unit of the computer device 7, such as a hard disk or a memory of the computer device 7. In other embodiments, the memory 71 may also be an external storage device of the computer device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a flash Card (FlashCard), and the like, which are provided on the computer device 7. Of course, the memory 71 may also comprise both an internal storage unit of the computer device 7 and an external storage device thereof. In this embodiment, the memory 71 is generally used for storing an operating system installed on the computer device 7 and various types of application software, such as computer readable instructions of the FAQ database-based problem screening method. Further, the memory 71 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 72 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 72 is typically used to control the overall operation of the computer device 7. In this embodiment, the processor 72 is configured to execute computer readable instructions or process data stored in the memory 71, for example, computer readable instructions for executing the FAQ database-based question screening method.
The network interface 73 may comprise a wireless network interface or a wired network interface, and the network interface 73 is generally used for establishing a communication connection between the computer device 7 and other electronic devices.
In the embodiment, when the processor executes the computer readable instructions stored in the memory, the steps of the method for screening the problems based on the FAQ database according to the embodiment are implemented, so that the inquired problems can be calculated and screened for many times, the accuracy of screening the problems is improved, and the high-quality problems are recommended to the user.
The present application further provides another embodiment, that is, a computer-readable storage medium is provided, where computer-readable instructions are stored, and the computer-readable instructions are executable by at least one processor, so that the at least one processor performs the steps of the method for screening questions based on the FAQ database, as described above, to implement multiple times of calculation screening on the queried questions, improve the accuracy of screening the questions, and recommend high-quality questions to the user.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely illustrative of some, but not restrictive, of the broad invention, and that the appended drawings illustrate preferred embodiments of the invention and do not limit the scope of the invention. This application is capable of embodiments in many different forms and is provided for the purpose of enabling a thorough understanding of the disclosure of the application. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to one skilled in the art that the present application may be practiced without modification or with equivalents of some of the features described in the foregoing embodiments. All equivalent structures made by using the contents of the specification and the drawings of the present application are directly or indirectly applied to other related technical fields and are within the protection scope of the present application.

Claims (10)

1. A problem screening method based on an FAQ database is characterized by comprising the following steps:
analyzing a question sentence input by a user, and performing word segmentation processing on the question sentence;
counting the word frequency of each word in the FAQ database after word segmentation, determining the weight of each word, and storing the weight of each word and the word segmentation result into the FAQ database;
inquiring candidate questions corresponding to the question sentences in the FAQ database, scoring the candidate questions according to the weight of each word, and screening out the candidate questions with the score larger than or equal to a preset score as inquiry results;
calculating a similarity value between the question statement and the query result according to a similarity algorithm model, and filtering the query result of which the similarity value is not in a preset range;
and calculating the filtered query result by adopting a classification algorithm, and determining a preset number of questions with the highest similarity to the input question sentence.
2. The method as claimed in claim 1, wherein the step of querying candidate questions corresponding to the question sentences in the FAQ database and scoring the queried questions according to the weight of each term comprises:
extracting keywords in the question sentences according to the weight of each word;
determining an expansion word of the keyword;
generating a candidate question corresponding to the question sentence according to the keyword and the expansion word;
the candidate questions are scored according to the weight of the word.
3. The method for screening questions based on the FAQ database according to claim 1, wherein the step of calculating the similarity value between the question statement and the query result according to a similarity algorithm model, and filtering the query result whose similarity value is not within a preset range specifically comprises:
respectively calculating the similarity of Jaccard, the similarity of BM25, the similarity of cosine and the similarity of edit distance between the question sentence and the query result;
weighting and summing the calculated Jaccard similarity, BM25 similarity, cosine similarity and edit distance similarity according to respective weight values to obtain a similarity value between the question statement and the query result;
and filtering the query results of which the similarity values are not in a preset range.
4. The FAQ-database-based question screening method according to claim 3, wherein the step of filtering the query results whose similarity values are not within a preset range specifically comprises:
comparing the similarity value with a preset similarity threshold value;
and reserving the query results with the similarity values larger than or equal to a preset similarity threshold according to the comparison results, and removing the query results with the similarity values smaller than the preset similarity threshold.
5. The FAQ-database-based question screening method according to claim 1, wherein the step of calculating the filtered query result by using a classification algorithm and determining the question with the highest similarity to the input question sentence comprises:
respectively acquiring word vectors of the question sentences and the query results through FastText;
and inputting the word vector into a textCNN model, constructing a similarity matrix of the question statement and the query result after the operation of a convolution layer and a pooling layer, and outputting the similarity of the question statement and the query result through a full connection layer.
6. The FAQ-database-based question screening method according to claim 1, further comprising, after the step of querying candidate questions in the FAQ database corresponding to the question sentences:
and judging the number of the problems in the candidate problems, and executing corresponding operation according to the judgment result.
7. The method as claimed in claim 6, wherein the step of determining the number of questions in the candidate questions and performing corresponding operations according to the determination result specifically comprises:
if the number of the problems is zero, directly adopting a classification algorithm to calculate;
if the number of the problems is less than or equal to a preset threshold value, directly returning the candidate problems to the client;
and if the number of the problems is larger than a preset threshold value, scoring the candidate problems according to the weight of each word.
8. An FAQ database-based question screening apparatus, comprising:
the word segmentation module is used for analyzing question sentences input by a user and carrying out word segmentation processing on the question sentences;
the processing module is used for counting the word frequency of each word in the FAQ database after word segmentation, determining the weight of each word and storing the weight of each word and the word segmentation result into the FAQ database;
the query scoring module is used for querying candidate questions corresponding to the question sentences in the FAQ database and scoring the candidate questions according to the weight of each word;
the screening module is used for screening out the problems with the scores larger than or equal to the preset scores as query results;
the similarity calculation module is used for calculating a similarity value between the question statement and the query result according to a similarity calculation model and filtering the query result of which the similarity value is not in a preset range;
the classification calculation module is used for calculating the filtered query result by adopting a classification algorithm; and
and the determining module is used for determining the preset number of questions with the highest similarity to the input question sentences.
9. A computer device comprising a memory having computer readable instructions stored therein and a processor that when executed implements the steps of the FAQ database-based question screening method of any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon computer readable instructions, which when executed by a processor, implement the steps of the FAQ database-based question screening method according to any one of claims 1 to 7.
CN202010591151.4A 2020-06-24 2020-06-24 FAQ database-based problem screening method and device, computer equipment and medium Pending CN111797214A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010591151.4A CN111797214A (en) 2020-06-24 2020-06-24 FAQ database-based problem screening method and device, computer equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010591151.4A CN111797214A (en) 2020-06-24 2020-06-24 FAQ database-based problem screening method and device, computer equipment and medium

Publications (1)

Publication Number Publication Date
CN111797214A true CN111797214A (en) 2020-10-20

Family

ID=72804208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010591151.4A Pending CN111797214A (en) 2020-06-24 2020-06-24 FAQ database-based problem screening method and device, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN111797214A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112527988A (en) * 2020-12-14 2021-03-19 深圳市优必选科技股份有限公司 Automatic reply generation method and device and intelligent equipment
CN112541076A (en) * 2020-11-09 2021-03-23 北京百度网讯科技有限公司 Method and device for generating extended corpus of target field and electronic equipment
CN112565663A (en) * 2020-11-26 2021-03-26 平安普惠企业管理有限公司 Demand question reply method and device, terminal equipment and storage medium
CN112632395A (en) * 2020-12-31 2021-04-09 深圳追一科技有限公司 Search recommendation method and device, server and computer-readable storage medium
CN112765960A (en) * 2021-02-07 2021-05-07 成都新潮传媒集团有限公司 Text matching method and device and computer equipment
CN112925889A (en) * 2021-02-26 2021-06-08 北京声智科技有限公司 Natural language processing method, device, electronic equipment and storage medium
CN112948553A (en) * 2021-02-26 2021-06-11 平安国际智慧城市科技股份有限公司 Legal intelligent question and answer method and device, electronic equipment and storage medium
CN113312525A (en) * 2021-06-07 2021-08-27 浙江工业大学 Method for reversely calibrating steel seal code through java
CN113313472A (en) * 2021-06-15 2021-08-27 海南君麟环境科技有限公司 Intelligent environmental control platform establishing method and system based on big data
CN113722465A (en) * 2021-11-02 2021-11-30 北京卓建智菡科技有限公司 Intention identification method and device
CN114090747A (en) * 2021-10-14 2022-02-25 特斯联科技集团有限公司 Automatic question answering method, device, equipment and medium based on multiple semantic matching
CN116340481A (en) * 2023-02-27 2023-06-27 华院计算技术(上海)股份有限公司 Method and device for automatically replying to question, computer readable storage medium and terminal
CN117473069A (en) * 2023-12-26 2024-01-30 深圳市明源云客电子商务有限公司 Business corpus generation method, device and equipment and computer readable storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989040A (en) * 2015-02-03 2016-10-05 阿里巴巴集团控股有限公司 Intelligent question-answering method, device and system
CN106557508A (en) * 2015-09-28 2017-04-05 北京神州泰岳软件股份有限公司 Text keyword extraction method and device
CN106611041A (en) * 2016-09-29 2017-05-03 四川用联信息技术有限公司 Novel text similarity calculation method
CN107153639A (en) * 2016-03-04 2017-09-12 北大方正集团有限公司 Intelligent question-answering method and system
CN108345672A (en) * 2018-02-09 2018-07-31 平安科技(深圳)有限公司 Intelligent response method, electronic device and storage medium
CN108415980A (en) * 2018-02-09 2018-08-17 平安科技(深圳)有限公司 Question-and-answer data processing method, electronic device and storage medium
CN108519970A (en) * 2018-02-06 2018-09-11 平安科技(深圳)有限公司 Method for identifying sensitive information in text, electronic device and readable storage medium
CN108536708A (en) * 2017-03-03 2018-09-14 腾讯科技(深圳)有限公司 Automatic question-answering processing method and automatic question-answering system
CN108595619A (en) * 2018-04-23 2018-09-28 海信集团有限公司 Question-answering method and device
CN109815492A (en) * 2019-01-04 2019-05-28 平安科技(深圳)有限公司 Intention recognition method based on a recognition model, recognition device and medium
CN110414004A (en) * 2019-07-31 2019-11-05 阿里巴巴集团控股有限公司 Method and system for core information extraction
CN111309878A (en) * 2020-01-19 2020-06-19 支付宝(杭州)信息技术有限公司 Retrieval-based question-answering method, model training method, server and storage medium

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541076A (en) * 2020-11-09 2021-03-23 北京百度网讯科技有限公司 Method and device for generating an expanded corpus for a target field, and electronic equipment
CN112541076B (en) * 2020-11-09 2024-03-29 北京百度网讯科技有限公司 Method and device for generating an expanded corpus for a target field, and electronic equipment
CN112565663B (en) * 2020-11-26 2022-11-18 平安普惠企业管理有限公司 Demand question reply method and device, terminal equipment and storage medium
CN112565663A (en) * 2020-11-26 2021-03-26 平安普惠企业管理有限公司 Demand question reply method and device, terminal equipment and storage medium
CN112527988A (en) * 2020-12-14 2021-03-19 深圳市优必选科技股份有限公司 Automatic reply generation method and device, and intelligent device
CN112632395A (en) * 2020-12-31 2021-04-09 深圳追一科技有限公司 Search recommendation method and device, server and computer-readable storage medium
CN112765960A (en) * 2021-02-07 2021-05-07 成都新潮传媒集团有限公司 Text matching method and device, and computer equipment
CN112765960B (en) * 2021-02-07 2022-11-25 成都新潮传媒集团有限公司 Text matching method and device, and computer equipment
CN112948553A (en) * 2021-02-26 2021-06-11 平安国际智慧城市科技股份有限公司 Intelligent legal question-answering method and device, electronic equipment and storage medium
CN112925889A (en) * 2021-02-26 2021-06-08 北京声智科技有限公司 Natural language processing method and device, electronic equipment and storage medium
CN113312525A (en) * 2021-06-07 2021-08-27 浙江工业大学 Method for reverse calibration of steel seal codes via Java
CN113312525B (en) * 2021-06-07 2024-02-09 浙江工业大学 Method for reverse calibration of steel seal codes via Java
CN113313472A (en) * 2021-06-15 2021-08-27 海南君麟环境科技有限公司 Intelligent environmental control platform establishment method and system based on big data
CN114090747A (en) * 2021-10-14 2022-02-25 特斯联科技集团有限公司 Automatic question-answering method, device, equipment and medium based on multiple semantic matching
CN113722465A (en) * 2021-11-02 2021-11-30 北京卓建智菡科技有限公司 Intention recognition method and device
CN113722465B (en) * 2021-11-02 2022-01-21 北京卓建智菡科技有限公司 Intention recognition method and device
CN116340481A (en) * 2023-02-27 2023-06-27 华院计算技术(上海)股份有限公司 Method and device for automatically replying to questions, computer-readable storage medium and terminal
CN117473069A (en) * 2023-12-26 2024-01-30 深圳市明源云客电子商务有限公司 Business corpus generation method, device and equipment, and computer-readable storage medium
CN117473069B (en) * 2023-12-26 2024-04-12 深圳市明源云客电子商务有限公司 Business corpus generation method, device and equipment, and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN111797214A (en) FAQ database-based question screening method and device, computer equipment and medium
WO2019091026A1 (en) Knowledge base document rapid search method, application server, and computer readable storage medium
JP5316158B2 (en) Information processing apparatus, full-text search method, full-text search program, and recording medium
CN109885773B (en) Personalized article recommendation method, system, medium and equipment
CN112818093A (en) Evidence document retrieval method, system and storage medium based on semantic matching
Wu et al. Extracting topics based on Word2Vec and improved Jaccard similarity coefficient
CN111737997A (en) Text similarity determination method, text similarity determination equipment and storage medium
CN114357117A (en) Transaction information query method and device, computer equipment and storage medium
CN111813905A (en) Corpus generation method and device, computer equipment and storage medium
CN110688405A (en) Expert recommendation method, device, terminal and medium based on artificial intelligence
CN113204953A (en) Text matching method and device based on semantic recognition and device readable storage medium
CN112686053A (en) Data enhancement method and device, computer equipment and storage medium
CN112632261A (en) Intelligent question and answer method, device, equipment and storage medium
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
Wei et al. Online education recommendation model based on user behavior data analysis
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN116402166B (en) Training method and device of prediction model, electronic equipment and storage medium
CN113569118A (en) Self-media pushing method and device, computer equipment and storage medium
CN113486664A (en) Text data visualization analysis method, device, equipment and storage medium
CN110263083B (en) Knowledge graph processing method, device, equipment and medium
CN111737607A (en) Data processing method, data processing device, electronic equipment and storage medium
CN111985217B (en) Keyword extraction method, computing device and readable storage medium
CN112541069A (en) Text matching method, system, terminal and storage medium combined with keywords
JP2010282403A (en) Document retrieval method
CN113792549B (en) User intention recognition method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination