CN114722176A - Intelligent question answering method, device, medium and electronic equipment - Google Patents


Info

Publication number: CN114722176A
Application number: CN202210396931.2A
Authority: CN (China)
Prior art keywords: similarity, question, candidate, answer, candidate answer
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 吕凡, 桑杉, 张振伟, 张帆
Current Assignee: China Distance Education Holdings Ltd
Original Assignee: China Distance Education Holdings Ltd
Application filed by China Distance Education Holdings Ltd
Priority to CN202210396931.2A
Publication of CN114722176A

Classifications

    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G06F 40/194: Calculation of difference between files
    • G06F 40/247: Thesauruses; synonyms
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an intelligent question answering method, apparatus, medium and electronic device. The method comprises the following steps: if a question description asked by a user is received, performing word segmentation on the question description to obtain at least one morpheme; searching a pre-constructed knowledge base based on a preset matching rule to obtain candidate answers; obtaining the similarity between each candidate answer and the question description through similarity calculation, and ranking the candidate answers based on the similarity; determining a similarity grade for each candidate answer according to the comparison of its similarity with a preset threshold; and determining a target answer according to the similarity grade and returning it. With this technical scheme, questions can be answered in real time, and for identical or similar questions the corresponding answers can be obtained directly through matching, which effectively improves the answering efficiency for repeated questions.

Description

Intelligent question answering method, device, medium and electronic equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a medium, and an electronic device for intelligent question answering.
Background
With the rapid development of technology, online answering has become a very important part of both children's and adults' learning. After a user raises a question, a teacher answers it online, which improves the user's learning efficiency.
However, for an educational institution, staffing teachers for online answering is costly, repetitive questions waste teachers' effort, and teachers cannot be online 24 hours a day, so responses to online questions are delayed.
Disclosure of Invention
The embodiment of the application provides an intelligent answering method, an intelligent answering device, a medium and electronic equipment. Questions asked by users are answered online through a pre-established database, which improves the real-time performance of answering; meanwhile, for identical or similar questions, the corresponding answers can be obtained directly through matching, effectively improving the answering efficiency for repeated questions.
The embodiment of the application provides an intelligent answering method, which comprises the following steps:
if the question description asked by the user is received, performing word segmentation processing on the question description to obtain at least one morpheme;
searching in a pre-established knowledge base based on a preset matching rule to obtain a candidate answer;
obtaining the similarity between each candidate answer and the question description through similarity calculation, and ranking the candidate answers based on the similarity;
determining the similarity grade of each candidate answer according to the comparison result of the similarity of each candidate answer and a preset threshold value;
and determining a target answer according to the similarity grade, and returning the target answer.
Further, after receiving the question description of the user question, the method further includes:
preprocessing the question description to obtain a preprocessing result; the preprocessing mode comprises at least one of full-width/half-width conversion, case conversion, traditional/simplified Chinese conversion and removal of meaningless symbols;
performing error correction processing and expansion processing based on the preprocessing result to obtain a normalization processing result;
correspondingly, performing word segmentation processing on the question description comprises:
performing word segmentation on the normalization processing result of the question description.
Further, before performing word segmentation processing on the question description to obtain at least one morpheme, the method further includes:
obtaining the context environment of the question description;
determining a type to which the question description belongs based on the context environment; wherein the type comprises at least one of a tutoring mode, a subject, a question mode and whether an original question is quoted;
correspondingly, searching in a pre-constructed knowledge base based on a preset matching rule to obtain a candidate answer comprises:
partitioning the pre-constructed knowledge base based on the type to which the question description belongs, searching based on the preset matching rule, and obtaining candidate answers within the resulting partition.
Further, performing word segmentation processing on the question description to obtain at least one morpheme includes:
performing word segmentation on the question description with the HanLP word segmentation tool, based on predetermined segmentation granularity information, segmentation basis information and a user-defined dictionary, to obtain at least one morpheme.
Further, the similarity between each candidate answer and the question description is obtained through similarity calculation, and the candidate answers are ranked based on the similarity, including:
calculating the similarity between each candidate answer's word vector and the word vector of the question description by adopting a pre-trained word vector model to obtain a first similarity score; calculating the similarity between each candidate answer and each morpheme of the question description by adopting a pre-constructed BM25 algorithm to obtain a second similarity score;
determining a relevance score of each candidate answer and the question description according to the first similarity score and the second similarity score;
ranking the candidate answers based on the relevance scores.
Further, calculating the similarity between each candidate answer and each morpheme of the question description by adopting the pre-constructed BM25 algorithm to obtain a second similarity score comprises:
for each candidate answer, calculating a sub-similarity score for each morpheme to the candidate answer,
finally, weighting and summing the sub-similarity scores of the morphemes relative to the candidate answers by adopting the following formula, thereby obtaining a second similarity score between the question description and the candidate answers;
Score(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d)
wherein Q is the question description, q_i is a morpheme obtained by segmenting Q, W_i is the weight of morpheme q_i, d is a candidate answer, R(q_i, d) is the sub-similarity score, and Score(Q, d) is the second similarity score between the question description and the candidate answer.
Further, the second similarity score is calculated using the following formula:
Score(Q, d) = \sum_{i=1}^{n} W_i \cdot \frac{f_i \cdot (k_1 + 1)}{f_i + k_1 \cdot (1 - b + b \cdot \frac{dl}{avgdl})}
wherein k_1 is a first adjustment factor, b is a second adjustment factor, f_i is the frequency of q_i occurring in d, dl is the length of d, and avgdl is the average length of all candidate answers.
The embodiment of the present application further provides a device for intelligent answering, the device includes:
the word segmentation module is used for performing word segmentation processing on the question description to obtain at least one morpheme if the question description of a question asked by a user is received;
the candidate answer searching module is used for searching in a pre-constructed knowledge base based on a preset matching rule to obtain a candidate answer;
the similarity calculation module is used for obtaining the similarity between each candidate answer and the question description through similarity calculation and ranking the candidate answers based on the similarity;
the similarity grade determining module is used for determining the similarity grade of each candidate answer according to the comparison result of the similarity of each candidate answer and a preset threshold;
and the target answer determining module is used for determining a target answer according to the similarity grade and returning the target answer.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for intelligent question answering according to the embodiments of the present application.
The embodiment of the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and capable of being executed by the processor, where the processor executes the computer program to implement the intelligent question answering method according to the embodiment of the present application.
The embodiment of the application adopts at least one technical scheme which can achieve the following beneficial effects:
the technical scheme who this application is favorable to providing, possess intelligent learning ability, based on powerful AI, big data, technologies such as cloud computing, can add user's question intelligence to the knowledge base, through continuously serving, can accumulate more questions and solve complicated problem, let the knowledge base constantly update optimization and perfect in continuous accumulation, accuracy and efficiency in the improvement reply, and along with the progress of technologies such as deep learning and NLP (Natural Language Processing), the advantage of intelligence answer can be more and more obvious. The intelligent answering and online answering platform forms a good data cycle and continuously promotes the development of two parties.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic flowchart of a method for intelligent question answering according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for intelligent question answering according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of an intelligent answering device provided in the third embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Example one
Fig. 1 is a schematic flowchart of a method for intelligent answering provided in an embodiment of the present application, where the present embodiment is applicable to an online answering situation, and the method may be executed by the apparatus for intelligent answering provided in the embodiment of the present application, and the apparatus may be implemented by software and/or hardware, and may be integrated in an electronic device for intelligent answering.
As shown in fig. 1, the method includes:
s110, if the question description of the user question is received, performing word segmentation processing on the question description to obtain at least one morpheme.
Specifically, the user may ask a question by copying it, by typing it in his or her own words, or by submitting a screenshot from an online course video. The question description may be a descriptive sentence of the question, such as "what is the largest freshwater lake on earth" or "what is the cube root of 4", and so on.
In the scheme, after the problem description is obtained, word segmentation can be performed on the problem description. The basis for word segmentation herein may be based on conventional language and industry terminology. The morphemes resulting from the word segmentation may be word segments of the problem description. It is understood that after the morpheme is obtained by dividing, a label may be set for the morpheme, and the label content may be the position of the morpheme in the question description, the sentence component of the morpheme in the question description, the part of speech of the morpheme, and the like. According to the scheme, the search and the matching of the answers to the questions can be more accurately carried out through the labels.
In this scheme, optionally, after receiving a question description of a question asked by a user, the method further includes:
preprocessing the question description to obtain a preprocessing result; the preprocessing mode comprises at least one of full-width/half-width conversion, case conversion, traditional/simplified Chinese conversion and removal of meaningless symbols;
performing error correction processing and expansion processing based on the preprocessing result to obtain a normalization processing result;
correspondingly, performing word segmentation processing on the question description comprises:
performing word segmentation on the normalization processing result of the question description.
The preprocessing may include full-width/half-width conversion, case conversion, traditional/simplified Chinese conversion, removal of meaningless symbols, and the like. Specifically, full-width/half-width conversion may convert all English characters to half-width and all punctuation marks to full-width. Case conversion may change all English words and abbreviations to lowercase; that is, if uppercase English letters appear in the question description, they may be converted to lowercase. Traditional/simplified conversion may convert all traditional Chinese characters in the question description into simplified characters. Meaningless-symbol removal may remove, for example, the question mark at the end of a question or the quotation marks around a special term. The preprocessing in this scheme includes but is not limited to the above; other preprocessing methods are also possible. The purpose of preprocessing is to align the question with the question-answer information in the pre-constructed knowledge base, so as to improve the accuracy of the matching result and the matching efficiency.
In this scheme, in addition to preprocessing, error correction and expansion may also be performed on the question description; for example, wrongly written characters may be corrected, or a missing sentence component may be supplemented.
It is understood that the above preprocessing, error correction, expansion and so on are applied according to the conditions actually present in the question description; if no such condition exists, no adjustment needs to be made.
With this arrangement, the question description is better normalized, which facilitates more accurate matching of answers afterwards.
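For illustration, a minimal Python sketch of such a normalization step is given below, under the assumption that full-width characters are simply mapped to half-width and only a few trailing symbols are stripped; the function names and the symbol list are hypothetical, and traditional-to-simplified conversion would additionally require a converter such as OpenCC, which is only hinted at in a comment.

    import re

    def to_halfwidth(text: str) -> str:
        # Map full-width ASCII characters (U+FF01..U+FF5E) to half-width,
        # and the ideographic space (U+3000) to an ordinary space.
        out = []
        for ch in text:
            code = ord(ch)
            if code == 0x3000:
                out.append(" ")
            elif 0xFF01 <= code <= 0xFF5E:
                out.append(chr(code - 0xFEE0))
            else:
                out.append(ch)
        return "".join(out)

    def preprocess_question(text: str) -> str:
        text = to_halfwidth(text)   # full-width/half-width conversion
        text = text.lower()         # case conversion for English words and abbreviations
        # Traditional-to-simplified conversion could be added here with a third-party
        # converter such as OpenCC (an assumption, not named by the application):
        # text = opencc.OpenCC("t2s").convert(text)
        text = re.sub(r'[?？!！"]+$', "", text)  # drop trailing meaningless symbols
        return text.strip()

    print(preprocess_question("ＣＰＡ考试大纲是什么？"))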
In this scheme, the relevant parameters of the user's question and the question-description query are acquired as follows.
When a user asks a question, the context environment of the current question can be collected, such as the tutoring mode, subject, question mode, whether an original question is quoted, and other information about the question; this information plays an important role in understanding the user's question. Information such as the tutoring mode and subject can also be used to narrow the search range.
Question description (Query) understanding:
The modules of this part mainly comprise query preprocessing, query error correction, query expansion, query normalization, synonym handling, query word segmentation, intent recognition and the like.
Preprocessing transforms the query to facilitate the subsequent steps, and mainly includes full-width/half-width conversion, case conversion, traditional/simplified Chinese conversion, removal of meaningless symbols and the like.
S120, searching a pre-constructed knowledge base based on a preset matching rule to obtain candidate answers.
The pre-constructed knowledge base can be built by screening high-quality questions and answers from long-accumulated user question data and existing manual answer data. An ElasticSearch search engine framework is used to build the FAQ knowledge base, and various labels are created according to the attributes of the answers to facilitate subsequent classified search. The established FAQ knowledge base is an important basis for intelligent answering. Newly added data produced after manual answering is captured in real time and the knowledge base is updated accordingly.
The preset matching rule can be a rule for matching based on word text: morphemes are obtained by the preceding word segmentation processing, and the word text of each morpheme is then matched against the information in the knowledge base to obtain a matching result. The result obtained by searching with the preset matching rule may be used as a preliminary result, which is then refined by subsequent processing such as screening and ranking before being returned to the client.
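As an illustration only, the following sketch shows how such a preliminary recall might be issued with the official Python ElasticSearch client (an 8.x-style call is assumed). The index name "faq", the field names "question" and "subject", and the filter value are hypothetical; the application does not disclose its actual mapping.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local cluster

    def recall_candidates(morphemes, subject=None, size=20):
        """Preliminary recall of FAQ entries matching the segmented question."""
        query = {"bool": {"must": [{"match": {"question": " ".join(morphemes)}}]}}
        if subject:
            # Narrow the search to one partition of the knowledge base via a label field.
            query["bool"]["filter"] = [{"term": {"subject": subject}}]
        resp = es.search(index="faq", query=query, size=size)
        return [hit["_source"] for hit in resp["hits"]["hits"]]

    candidates = recall_candidates(["注会", "考试", "大纲"], subject="accounting")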
On the basis of the foregoing technical solution, optionally, before performing word segmentation processing on the question description to obtain at least one morpheme, the method further includes:
obtaining the context environment of the question description;
determining a type to which the question description belongs based on the context environment; wherein the type comprises at least one of a tutoring mode, a subject, a question mode and whether an original question is quoted;
correspondingly, searching in a pre-constructed knowledge base based on a preset matching rule to obtain a candidate answer comprises:
partitioning the pre-constructed knowledge base based on the type to which the question description belongs, searching based on the preset matching rule, and obtaining candidate answers within the resulting partition.
The context environment may be the contextual information in which the question description is located, such as the place from which the question was taken, for example a textbook or a problem set. According to the context environment, the scheme can determine whether the tutoring mode of the question description is textbook, problem set or lecture, and can determine the subject of the question description, such as finance and accounting, data processing, or communication and exploration. In addition, the context environment may also include the user's question mode, such as asking by screenshot, by copying, by typing or by voice. Whether an original question is quoted refers to whether the question description cites a question from a lecture, textbook or problem set.
The type to which the question description belongs can then be determined in combination with the context environment, which makes answer matching more accurate and improves the efficiency and accuracy of answer matching.
In this scheme, optionally, performing word segmentation processing on the question description to obtain at least one morpheme includes:
performing word segmentation on the question description with the HanLP word segmentation tool, based on predetermined segmentation granularity information, segmentation basis information and a user-defined dictionary, to obtain at least one morpheme.
Specifically, word segmentation splits the query into multiple terms. As the most basic lexical analysis component, its accuracy greatly affects the subsequent processing; for example, the segments and their part-of-speech information can be used for subsequent term-importance analysis and intent recognition. Meanwhile, the segmentation and its granularity for the query need to be consistent with those used when the index was built, so that results can be recalled effectively. Word segmentation technology is relatively mature, and many segmentation tools or services have been released by academia and industry, such as Tencent QQSeg, Baidu LAC, Jieba, Tsinghua THULAC, Peking University pkuseg, ICTCLAS from the Chinese Academy of Sciences, HIT PyLTP, HanLP, Stanford CoreNLP and the like. These tools differ to some extent in functionality and performance. Considering segmentation accuracy, granularity control, segmentation speed, whether Named Entity Recognition (NER) is provided, NER speed, and whether a user-defined dictionary is supported, this scheme adopts HanLP and carries out specific tuning of HanLP's handling of proper nouns, synonyms, abbreviations and other attributes for the website's specialized education fields such as accounting, construction, medicine and self-study examination. For example, when handling a question about how to prepare for the CPA exam after the exam outline is published, the common Chinese abbreviation for the CPA exam is segmented as its own term and mapped to synonyms such as "certified public accountant" and "CPA", whereas conventional word segmentation would not link these terms together.
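For illustration, a minimal sketch using the pyhanlp wrapper of HanLP is shown below; the custom-dictionary entries are hypothetical examples in the spirit of the domain tuning described above, not the application's actual dictionary, and the Java class path is an assumption about the HanLP 1.x API.

    from pyhanlp import HanLP, JClass

    # Register domain terms in HanLP's user-defined dictionary (assumed class path).
    CustomDictionary = JClass("com.hankcs.hanlp.dictionary.CustomDictionary")
    CustomDictionary.add("注册会计师", "nz 1024")  # hypothetical term with part of speech and frequency
    CustomDictionary.add("CPA", "nz 1024")

    def segment_question(text):
        """Segment the question description and keep each word with its part-of-speech label."""
        return [(term.word, str(term.nature)) for term in HanLP.segment(text)]

    print(segment_question("注册会计师考试大纲公布后如何备考CPA"))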
Query error correction is the process of detecting and correcting errors in the Query entered by the user. When a user searches, noise may be introduced during input, for example through speech recognition errors or hasty typing, so the input query may contain mistakes. If an erroneous query is not corrected, it affects the accuracy of the other query-understanding modules as well as the relevance and reasonableness of the ranking of recalled results.
Analysis of historical question data in the accounting field shows that formulas commonly appear in questions, so any formula in a question is identified, extracted and used as an independent search parameter, which improves the accuracy and relevance of the search.
S130, obtaining the similarity between each candidate answer and the question description through similarity calculation, and ranking the candidate answers based on the similarity.
The similarity calculation may be performed by using a preset calculation algorithm for the similarity between the candidate answers and the question descriptions, and the obtained result is a similarity value, which may be, for example, a similarity score. The respective candidate answers may then be ranked based on the similarity scores.
One or more similarity calculation methods may be used here; if more than one is used, the obtained scores may be normalized according to a preset normalization principle and then summed, or weighted and summed, to obtain the final result.
In this scheme, optionally, the similarity between each candidate answer and the question description is obtained through similarity calculation, and the candidate answers are ranked based on the similarity, including:
calculating the similarity between each candidate answer's word vector and the word vector of the question description by adopting a pre-trained word vector model to obtain a first similarity score; calculating the similarity between each candidate answer and each morpheme of the question description by adopting a pre-constructed BM25 algorithm to obtain a second similarity score;
determining a relevance score of each candidate answer and the question description according to the first similarity score and the second similarity score;
ranking the candidate answers based on the relevance scores.
The Doc2Vec model used here is a word vector model trained on the website's own text data of various kinds, so that the model is better suited to matching in this answering scenario.
Although BM25 is an existing algorithm in the industry, this scheme implements its own customized BM25 instead of using the logic common in the industry. In the field of accounting education, for example, the accuracy of the customized algorithm is greatly improved compared with the commonly used version.
According to the scheme, a Doc2Vec model and a BM25 algorithm are adopted to calculate the first similarity score and the second similarity score respectively, and then summation or weighted summation processing is carried out, so that the relevance score of each candidate answer and the question description can be obtained. Through the arrangement, the accuracy of the target answer returned to the user can be improved, and the use experience of the user is improved.
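A minimal sketch of this weighted fusion is given below; the min-max normalization, the equal weights of 0.5, and the function names are illustrative assumptions rather than the parameters actually used by the application.

    def fuse_scores(candidates, doc2vec_scores, bm25_scores, w_vec=0.5, w_bm25=0.5):
        """Combine a Doc2Vec similarity and a BM25 score into one relevance score per candidate."""
        def normalize(scores):
            lo, hi = min(scores), max(scores)
            return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

        vec_n = normalize(doc2vec_scores)
        bm_n = normalize(bm25_scores)
        fused = [w_vec * v + w_bm25 * b for v, b in zip(vec_n, bm_n)]
        # Rank candidates by the fused relevance score, highest first.
        return sorted(zip(candidates, fused), key=lambda x: x[1], reverse=True)

    ranked = fuse_scores(["answer1", "answer2", "answer3"], [0.82, 0.40, 0.65], [12.1, 3.4, 8.8])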
It can be understood that the initial Doc2Vec model is trained on historical data, and learning iterations can be performed continuously during later use, so that the quality of the model keeps improving. In this way the AI intelligent answering, the FAQ knowledge base and the models form a data loop: on the one hand the loop feeds the FAQ knowledge base and the models, and on the other hand it continuously improves the accuracy of intelligent answering.
S140, determining the similarity grade of each candidate answer according to the comparison result of the similarity of each candidate answer and a preset threshold value.
Here, a judgment can be made against preset thresholds: if the confidence is high, the answer is returned directly; if the confidence is medium, the TOP 5 of the re-ranked results are returned as recommendations; if the confidence is low, the question is forwarded directly to manual answering.
Specifically, two preset thresholds may be set, for example scores of 90 and 70. If the similarity of a candidate answer exceeds 90, it is determined to be high-confidence and returned directly; if it is between 70 and 90, it is determined to be medium-confidence and the top 5 such answers are returned; if it is below 70, it is determined to be low-confidence, and if there are no high-confidence or medium-confidence answers the question is forwarded directly to manual answering.
And S150, determining a target answer according to the similarity grade, and returning the target answer.
It will be appreciated that if there are a total of 6 candidate answers, 2 of which are high confidence and 4 of which are medium confidence, then all of the high confidence may be returned, and the medium confidence ranked and returned, i.e., all of the 6 candidate answers determined to be the target answer. If there are 8 candidate answers in total, 2 of which are high confidence, 5 of which are medium confidence and 1 of which is low confidence, all of the high confidence can be returned, and the medium confidence can be ranked and returned, that is, the first 7 candidate answers of all 8 are determined as the target answer. If there are 8 candidate answers in total, 1 of which is high confidence, 2 of which is medium confidence and 5 of which is low confidence, all of the high confidence can be returned, and the medium confidence is ranked and returned, that is, the first 3 candidate answers of all 8 are determined as the target answer.
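The grading and selection logic of S140 and S150 can be sketched as follows; the thresholds 90 and 70 are the example values mentioned above, and the 0-100 score scale and function names are assumptions for illustration.

    HIGH_THRESHOLD = 90  # example threshold from the description
    MID_THRESHOLD = 70   # example threshold from the description

    def grade(score):
        if score > HIGH_THRESHOLD:
            return "high"
        if score >= MID_THRESHOLD:
            return "medium"
        return "low"

    def select_target_answers(ranked):
        """ranked: list of (answer, score) pairs sorted by score in descending order."""
        high = [a for a, s in ranked if grade(s) == "high"]
        medium = [a for a, s in ranked if grade(s) == "medium"]
        if not high and not medium:
            return None              # forward the question to manual answering
        return high + medium[:5]     # all high-confidence answers plus the TOP 5 medium ones

    targets = select_target_answers([("answer1", 95), ("answer2", 88), ("answer3", 60)])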
The technical scheme provided by this embodiment filters out a large number of repetitive questions through online intelligent answering and returns answers to users directly, which both reduces users' waiting time and improves user stickiness, while also relieving teachers' workload and letting them focus on higher-value questions. Meanwhile, intelligent answering reduces enterprise expenditure and dependence on personnel, genuinely lowering the cost of answering. Moreover, intelligent answering can construct a knowledge base from the huge amount of accumulated learning data, making full use of that data. In addition, multiple similarity comparison algorithms are fused, giving full play to the advantages of each and effectively improving the similarity comparison capability of intelligent question answering.
On the basis of the above technical solutions, optionally, calculating the similarity between each candidate answer and each morpheme of the question description by using the pre-constructed BM25 algorithm to obtain a second similarity score includes:
for each candidate answer, calculating a sub-similarity score for each morpheme to the candidate answer,
finally, weighting and summing the sub-similarity scores of the morphemes relative to the candidate answers by adopting the following formula, thereby obtaining a second similarity score between the question description and the candidate answers;
Score(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d)
wherein Q is the question description, q_i is a morpheme obtained by segmenting Q, W_i is the weight of morpheme q_i, d is a candidate answer, R(q_i, d) is the sub-similarity score, and Score(Q, d) is the second similarity score between the question description and the candidate answer.
The BM25 algorithm is used to calculate the relevance score between a recalled result and the question. The algorithm is described as follows:
performing morpheme analysis on the Query to generate morphemes;
then, for each search result D, calculating the relevance score of each morpheme with D;
and finally, performing weighted summation of the relevance scores of the morphemes q_i with respect to D, thereby obtaining the relevance score between the Query and D.
The general formula of the BM25 algorithm is as follows:
Score(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d)
wherein Q denotes the Query; q_i denotes a morpheme obtained by analyzing Q (for Chinese, the word segmentation of the Query can be taken as the morpheme analysis, with each word regarded as a morpheme); d denotes a search result document; W_i denotes the weight of morpheme q_i; R(q_i, d) denotes the relevance score between morpheme q_i and document d.
This algorithm can be broken down further.
Below we look at how W_i is defined.
There are various methods for determining the weight of a word's relevance to a document; IDF is the most commonly used.
We then look at the relevance score R(q_i, d) between morpheme q_i and document d.
First, the general form of the relevance score in BM25:
R(q_i, d) = \frac{f_i \cdot (k_1 + 1)}{f_i + K} \cdot \frac{qf_i \cdot (k_2 + 1)}{qf_i + k_2}
K = k_1 \cdot (1 - b + b \cdot \frac{dl}{avgdl})
where k_1, k_2 and b are adjustment factors, usually set empirically, generally k_1 = 2 and b = 0.75;
f_i is the frequency of q_i occurring in d;
qf_i is the frequency of q_i occurring in the Query;
dl is the length of document d;
avgdl is the average length of all documents.
Since in most cases q_i appears only once in the Query, i.e. qf_i = 1, the formula can be simplified to:
R(q_i, d) = \frac{f_i \cdot (k_1 + 1)}{f_i + K}
as can be seen from the definition of K, the function of parameter b is to adjust the size of the influence of the document length on the relevance. The larger b, the greater the impact of document length on relevance scores,the smaller the opposite. And the longer the relative length of the document, the greater the value of K will be, and the smaller the relevance score will be. This can be understood as when the document is long, containing qiThe greater the chance of (f), and therefore, the same fiIn the case of (1), a long document is associated with qiShould be more relevant than the short document and qiThe correlation of (2) is weak. In summary, the correlation score formula of the BM25 algorithm can be summarized as:
Figure BDA0003599389380000131
from the formula of BM25, different methods for calculating the search relevance score can be derived by using different methods for analyzing morphemes, determining the weight of morphemes, and determining the relevance between morphemes and documents, which provides greater flexibility for the algorithm design.
Example two
The present embodiment is a preferred solution provided on the basis of the above-described embodiments.
Fig. 2 is a schematic flowchart of an intelligent question answering method according to a second embodiment of the present application. As shown in fig. 2, the process mainly includes:
the student asks questions;
selecting the question category, filling in the question and describing it;
AI intelligent answering;
the student decides whether manual answering is still needed;
if yes, manually answering; if not, the process ends.
The specific process of AI intelligent answering comprises the following steps:
acquiring relevant parameters of a question and preprocessing the parameters;
retrieving a preliminary recall of the TOP N similar FAQ entries based on the FAQ knowledge base;
matching and sorting the recall results based on similarity calculation;
performing threshold comparison;
if the confidence is low, no data is taken; if the confidence is medium, the TOP 5 results are taken; if the confidence is high, the detailed high-confidence result is taken;
and organizing data to return, and finishing the AI intelligent question answering process.
The FAQ knowledge base can be built using the ElasticSearch search engine framework. ElasticSearch is a real-time distributed search and analysis engine built on Apache Lucene, and is an advanced, efficient, full-featured open-source search engine framework based on Lucene.
Comparing the commercially mainstream Solr with ElasticSearch, ElasticSearch has the following characteristics:
ElasticSearch uses Lucene as its internal index engine, but in practice only its uniformly developed API is used, without needing to understand the complex workings of Lucene behind it. ElasticSearch is simple to use, easy to develop with and easy to operate.
ElasticSearch is a distributed real-time document store in which every field can be indexed and searched.
It is a distributed search engine with real-time analysis, responding to massive data at near-real-time, second-level latency.
It scales easily and can handle PB-level structured or unstructured data.
When indexes are built in real time, Solr can suffer I/O blocking and poor query performance, whereas ElasticSearch has the advantage.
As the amount of data increases, the search efficiency of Solr decreases, while that of ElasticSearch does not change significantly.
Therefore, ElasticSearch is very suitable as the knowledge base for intelligent answering; questions solved by teachers are added to the knowledge base in real time, so that they can be conveniently recalled.
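A minimal sketch of writing a newly solved question into such an index in real time is shown below (again assuming an 8.x-style client); the index name and field names are hypothetical.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local cluster

    def add_solved_question(question, answer, subject):
        """Index a teacher-answered question so it can be recalled immediately."""
        doc = {"question": question, "answer": answer, "subject": subject}
        # refresh="wait_for" makes the document searchable as soon as the call returns.
        es.index(index="faq", document=doc, refresh="wait_for")

    add_solved_question("资产负债表怎么编制?", "按照资产=负债+所有者权益的勾稽关系逐项填列。", "accounting")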
In the similarity calculation, a Doc2Vec word vector model can be adopted. Quantitative representation of text documents is a challenging task in machine learning. Many applications require document vectorization, such as document retrieval, web search, spam filtering and topic modeling, yet there are few really good ways to do it. Many methods use the well-known but simplistic bag-of-words model (BOW), and the results are mostly mediocre, because BOW ignores many aspects of the text, for example the order of the words and their semantic information. Latent Dirichlet Allocation (LDA), a common technique for topic modeling (extracting topics/keywords from text), is difficult to tune and its results are difficult to evaluate.
Doc2Vec was proposed by Quoc Le and Tomas Mikolov, two leading researchers at Google. Doc2Vec, also known as Paragraph2Vec or sentence embeddings, is an unsupervised algorithm that obtains vector representations of sentences/paragraphs/documents and is an extension of Word2Vec. The learned vectors can measure the similarity between sentences/paragraphs/documents by computing distances, can be used for text clustering, and, for labeled data, can also be used for text classification with supervised learning, as in the classic sentiment analysis problem.
The Distributed Memory model:
The method of training sentence vectors is very similar to that of word vectors. The core idea of training word vectors is that a word can be predicted from its context, i.e. the surrounding words influence it, and doc2vec can be trained in the same way. For example, for a sentence S: "I want to drink water", if the word "water" in the sentence is to be predicted, features can be generated not only from the other words, but from the other words together with the sentence S, and used for the prediction.
Each paragraph/sentence is mapped into the vector space and can be represented by a column of the matrix D. Each word is also mapped into the vector space and can be represented by a column of the matrix W. The paragraph vector and the word vectors are then concatenated or averaged to obtain features, which are used to predict the next word in the sentence.
This paragraph vector/sentence vector can also be regarded as a word: it acts as a memory of the context, or as the topic of the paragraph, so this training method is generally called the Distributed Memory Model of Paragraph Vectors (PV-DM).
During training, the length of a context is fixed, and a training set is generated by a sliding window method. Paragraph/sentence vectors are shared in this context.
To summarize the doc2vec process, there are two main steps:
and training the model, and obtaining a word vector W, parameters U and b of softmax and a paragraph vector/sentence vector D in known training data.
The inference process (inference stage) gets its vector expression for the new paragraph. Specifically, more columns are added to the matrix D, and in the case of fixing W and U, training is performed by using the above method, and a gradient descent method is used to obtain a new D, so as to obtain a vector expression of a new paragraph.
Yet another training method is to ignore the context of the input and let the model predict a random word in the paragraph. In each iteration, a window is obtained by sampling from a text, a word is randomly sampled from the window to serve as a prediction task, a model is used for predicting, and the input is a paragraph vector.
This model is generally referred to as a Distributed Bag of Words version of Paragraph Vector (PV-DBOW).
Of the two methods, this scheme can obtain paragraph/sentence vectors using PV-DM or PV-DBOW. PV-DM performs well for most tasks, but combining the two methods is also strongly recommended.
The Doc2Vec model of the online answering platform is obtained by combining the two methods and training on the massive accumulated answering data.
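For illustration, a minimal gensim-based sketch of training PV-DM and PV-DBOW models and inferring a vector for a new question is given below; the hyperparameters and the tiny corpus are placeholders, not the platform's actual configuration.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Tiny placeholder corpus of already-segmented answer texts.
    corpus = [
        TaggedDocument(words=["资产负债表", "如何", "编制"], tags=["q1"]),
        TaggedDocument(words=["现金流量表", "编制", "方法"], tags=["q2"]),
    ]

    # dm=1 trains the Distributed Memory model (PV-DM); dm=0 trains PV-DBOW.
    pv_dm = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, dm=1, epochs=40)
    pv_dbow = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, dm=0, epochs=40)

    # Inference stage: obtain a vector for a new, unseen question description.
    vec = pv_dm.infer_vector(["资产负债表", "编制"])

    # Similarity against stored documents, e.g. for the first similarity score.
    print(pv_dm.dv.most_similar([vec], topn=2))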
The recalled results are re-ranked using similarity calculations along these two dimensions, which improves the similarity accuracy.
Example three
Fig. 3 is a schematic structural diagram of an intelligent answering device provided in the third embodiment of the present application. As shown in fig. 3, the apparatus includes:
a word segmentation module 310, configured to, if a question description of a question asked by a user is received, perform word segmentation processing on the question description to obtain at least one morpheme;
the candidate answer searching module 320 is used for searching in a pre-established knowledge base based on a preset matching rule to obtain a candidate answer;
the similarity calculation module 330 is configured to obtain similarities between the candidate answers and the question descriptions through similarity calculation, and rank the candidate answers based on the similarities;
a similarity level determination module 340, configured to determine a similarity level of each candidate answer according to a comparison result between the similarity of each candidate answer and a preset threshold;
and a target answer determining module 350, configured to determine a target answer according to the similarity level, and return the target answer.
The device can execute the intelligent question answering method provided by each embodiment and has corresponding functional modules and beneficial effects. And will not be described in detail herein.
Example four
Embodiments of the present application further provide a storage medium containing computer-executable instructions, which when executed by a computer processor, perform a method for intelligent question answering, the method including:
if a question description asked by a user is received, performing word segmentation processing on the question description to obtain at least one morpheme;
searching in a pre-established knowledge base based on a preset matching rule to obtain a candidate answer;
obtaining the similarity between each candidate answer and the question description through similarity calculation, and ranking the candidate answers based on the similarity;
determining the similarity grade of each candidate answer according to the comparison result of the similarity of each candidate answer and a preset threshold value;
and determining a target answer according to the similarity grade, and returning the target answer.
Storage medium: any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media such as CD-ROM, floppy disk, or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory, such as flash memory or magnetic media (e.g., a hard disk or optical storage); registers or other similar types of memory elements, etc. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in the computer system in which the program is executed, or may be located in a different, second computer system connected to the computer system through a network (such as the Internet). The second computer system may provide the program instructions to the computer for execution. The term "storage medium" may include two or more storage media that may reside in different locations, such as in different computer systems connected by a network. The storage medium may store program instructions (e.g., embodied as a computer program) that are executable by one or more processors.
Of course, the storage medium provided in the embodiments of the present application and containing computer-executable instructions is not limited to the above-described operations for intelligent answering, and may also perform related operations in the method for intelligent answering provided in any embodiments of the present application.
Example five
The embodiment of the application provides electronic equipment. Fig. 4 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present application. As shown in fig. 4, the present embodiment provides an electronic device 400, which includes: one or more processors 420; the storage device 410 is configured to store one or more programs, and when the one or more programs are executed by the one or more processors 420, the one or more processors 420 are enabled to implement the method for intelligently answering a question provided in the embodiment of the present application, where the method includes:
if the question description asked by the user is received, performing word segmentation processing on the question description to obtain at least one morpheme;
searching in a pre-established knowledge base based on a preset matching rule to obtain a candidate answer;
obtaining the similarity between each candidate answer and the question description through similarity calculation, and ranking the candidate answers based on the similarity;
determining the similarity grade of each candidate answer according to the comparison result of the similarity of each candidate answer and a preset threshold value;
and determining a target answer according to the similarity grade, and returning the target answer.
the electronic device 400 shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 4, the electronic device 400 includes a processor 420, a storage device 410, an input device 430, and an output device 440; the number of the processors 420 in the electronic device may be one or more, and one processor 420 is taken as an example in fig. 4; the processor 420, the storage device 410, the input device 430, and the output device 440 in the electronic apparatus may be connected by a bus or other means, and are exemplified by a bus 450 in fig. 4.
The storage device 410 is a computer-readable storage medium, and can be used to store software programs, computer-executable programs, and module units, such as program instructions corresponding to the method for intelligent question answering in the embodiment of the present application.
The storage device 410 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the storage 410 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, storage 410 may further include memory located remotely from processor 420, which may be connected via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input means 430 may be used to receive input numbers, character information, or voice information, and to generate key signal inputs related to user settings and function control of the electronic device. The output device 440 may include a display screen, speakers, or other electronic equipment.
The electronic equipment provided by the embodiment of the application can answer questions raised by users online through the pre-built database, which improves the real-time performance of answering; meanwhile, for identical or similar questions, the corresponding answers can be obtained directly through matching, effectively improving the answering efficiency for repeated questions.
The intelligent answering device, the medium and the electronic equipment provided by the embodiment can operate the intelligent answering method provided by any embodiment of the application, and have corresponding functional modules and beneficial effects for operating the method. For technical details not described in detail in the above embodiments, reference may be made to the intelligent question answering method provided in any embodiment of the present application.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement the information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for intelligent question answering, the method comprising:
if a question description asked by a user is received, performing word segmentation on the question description to obtain at least one morpheme;
searching a pre-constructed knowledge base based on a preset matching rule to obtain candidate answers;
calculating the similarity between each candidate answer and the question description, and ranking the candidate answers based on the similarity;
determining a similarity grade of each candidate answer according to a comparison between the similarity of the candidate answer and a preset threshold; and
determining a target answer according to the similarity grades, and returning the target answer.
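By way of illustration only (the sketch below is not part of the claims), the following Python outline walks through the five steps of claim 1. The whitespace segmentation, the any-morpheme matching rule, the token-overlap similarity and the 0.8/0.4 thresholds are hypothetical stand-ins chosen for brevity; the claimed method uses HanLP segmentation (claim 4) and word-vector/BM25 scoring (claim 5) instead.

```python
# Hypothetical end-to-end sketch of the claim 1 pipeline.

def segment(text):
    # Stand-in word segmentation: whitespace split (claim 4 uses HanLP instead).
    return text.split()

def similarity(question_morphemes, candidate):
    # Stand-in similarity: fraction of question morphemes found in the candidate.
    candidate_tokens = set(segment(candidate))
    hits = sum(1 for m in question_morphemes if m in candidate_tokens)
    return hits / max(len(question_morphemes), 1)

def answer_question(question, knowledge_base, high=0.8, low=0.4):
    morphemes = segment(question)                          # step 1: word segmentation
    candidates = [a for a in knowledge_base                # step 2: preset matching rule
                  if any(m in a for m in morphemes)]       #         (here: any-morpheme match)
    scored = sorted(((a, similarity(morphemes, a)) for a in candidates),
                    key=lambda pair: pair[1], reverse=True)  # step 3: similarity + ranking
    graded = [(a, s, "high" if s >= high else "medium" if s >= low else "low")
              for a, s in scored]                          # step 4: threshold -> similarity grade
    for a, s, grade in graded:                             # step 5: pick the target answer
        if grade != "low":
            return a
    return None

knowledge_base = ["BM25 ranks candidate answers by term weights",
                  "Word vectors measure semantic similarity between texts"]
print(answer_question("How does BM25 rank candidate answers", knowledge_base))
```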
2. The method of claim 1, wherein after receiving the question description asked by the user, the method further comprises:
preprocessing the question description to obtain a preprocessing result, wherein the preprocessing comprises at least one of full-width/half-width conversion, case conversion, simplified/traditional Chinese conversion, and removal of meaningless symbols;
performing error correction processing and expansion processing on the preprocessing result to obtain a normalization result;
correspondingly, performing word segmentation on the question description comprises:
performing word segmentation on the normalization result of the question description.
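For illustration, a minimal normalization sketch covering part of the preprocessing named in claim 2 (full-width/half-width conversion, case conversion, removal of meaningless symbols) is shown below; simplified/traditional conversion is only indicated as a comment (OpenCC is one possible, but not mandated, library), and error correction and expansion processing are omitted.

```python
import re

def to_halfwidth(text):
    # Map full-width characters (U+FF01..U+FF5E) and the ideographic space
    # (U+3000) to their half-width equivalents.
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)

def normalize(question):
    text = to_halfwidth(question)   # full-width / half-width conversion
    text = text.lower()             # case conversion
    # Simplified/traditional conversion could be added here, e.g. (assumption):
    #   text = opencc.OpenCC("t2s").convert(text)
    text = re.sub(r"[~@#$%^&*_+=|<>{}\[\]\\]", "", text)  # drop meaningless symbols
    return text.strip()

print(normalize("ＢＭ２５算法怎么用？？　＊＊"))  # -> "bm25算法怎么用??"
```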
3. The method of claim 1, wherein before performing word segmentation on the question description to obtain at least one morpheme, the method further comprises:
obtaining a context of the question description;
determining, based on the context, a type to which the question description belongs, wherein the type comprises at least one of a tutoring mode, a subject, a questioning mode, and whether a topic is cited;
correspondingly, searching a pre-constructed knowledge base based on a preset matching rule to obtain candidate answers comprises:
partitioning the pre-constructed knowledge base according to the type to which the question description belongs, searching the resulting current partition based on the preset matching rule, and obtaining candidate answers from the current partition.
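Purely as an illustration of the partitioned search in claim 3, the knowledge base could be organized as a mapping from the determined type to a partition of answers; the key fields used below (mode, subject) are hypothetical.

```python
# Hypothetical partition lookup: the knowledge base is keyed by (mode, subject).

def select_partition(knowledge_base, question_type):
    key = (question_type.get("mode"), question_type.get("subject"))
    return knowledge_base.get(key, [])

knowledge_base = {
    ("tutoring", "accounting"): ["Answer A", "Answer B"],
    ("exam_prep", "law"): ["Answer C"],
}
print(select_partition(knowledge_base, {"mode": "tutoring", "subject": "accounting"}))
# -> ['Answer A', 'Answer B']; the search of claim 1 then runs only on this partition.
```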
4. The method of claim 1, wherein performing word segmentation on the question description to obtain at least one morpheme comprises:
performing, based on predetermined segmentation granularity information, segmentation basis information and a user-defined dictionary, word segmentation on the question description by using the HanLP word segmentation tool to obtain at least one morpheme.
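A minimal segmentation sketch follows, assuming the pyhanlp wrapper around HanLP is installed; the user-defined dictionary and the granularity settings mentioned in claim 4 are HanLP configuration details and are not shown here.

```python
from pyhanlp import HanLP  # assumes: pip install pyhanlp

def segment_question(question):
    # HanLP.segment returns Term objects; keep only the surface words (morphemes).
    return [term.word for term in HanLP.segment(question)]

print(segment_question("增值税进项税额如何抵扣"))
# e.g. ['增值税', '进项', '税额', '如何', '抵扣'] (actual output depends on model and dictionary)
```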
5. The method of claim 1, wherein calculating the similarity between each candidate answer and the question description, and ranking the candidate answers based on the similarity comprises:
calculating the similarity between a word vector of each candidate answer and a word vector of the question description by using a pre-trained word vector model to obtain a first similarity score, and calculating the similarity between each candidate answer and each morpheme of the question description by using a pre-constructed BM25 algorithm to obtain a second similarity score;
determining a relevance score of each candidate answer with respect to the question description according to the first similarity score and the second similarity score;
ranking the candidate answers based on the relevance scores.
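The sketch below shows one way the first similarity score (cosine similarity of averaged word vectors) and the second similarity score (BM25, see the sketch after claim 7) could be fused into a relevance score; the vector-averaging strategy and the 0.5/0.5 weights are assumptions, since the claims do not fix how the two scores are combined.

```python
import numpy as np

def sentence_vector(tokens, word_vectors, dim=100):
    # Average the word vectors of the tokens; out-of-vocabulary tokens are skipped.
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def relevance(question_tokens, answer_tokens, word_vectors, bm25_score,
              w_vec=0.5, w_bm25=0.5):
    first = cosine(sentence_vector(question_tokens, word_vectors),
                   sentence_vector(answer_tokens, word_vectors))   # first similarity score
    return w_vec * first + w_bm25 * bm25_score                     # relevance score
```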
6. The method of claim 5, wherein calculating the similarity between each candidate answer and each morpheme of the question description by using a pre-constructed BM25 algorithm to obtain a second similarity score comprises:
for each candidate answer, calculating a sub-similarity score of each morpheme with respect to the candidate answer; and
weighting and summing the sub-similarity scores of the morphemes with respect to the candidate answer by using the following formula, to obtain the second similarity score between the question description and the candidate answer:
$$\mathrm{Score}(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d)$$
wherein $Q$ is the question description, $q_i$ is the $i$-th morpheme obtained by segmenting $Q$, $n$ is the number of morphemes, $d$ is a candidate answer, $W_i$ is the weight of the morpheme $q_i$, $R(q_i, d)$ is the sub-similarity score of $q_i$ with respect to $d$, and $\mathrm{Score}(Q, d)$ is the second similarity score of the question description and the candidate answer.
7. The method of claim 6, wherein the sub-similarity score $R(q_i, d)$ is calculated using the following formula:
$$R(q_i, d) = \frac{f_i \cdot (k_1 + 1)}{f_i + k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}$$
wherein $k_1$ is a first adjustment factor, $b$ is a second adjustment factor, $f_i$ is the frequency of occurrence of $q_i$ in $d$, $dl$ is the length of $d$, and $avgdl$ is the average length of all candidate answers.
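A self-contained sketch of the BM25 scoring defined in claims 6 and 7 follows. The IDF formula used for the weight $W_i$ is the common BM25 choice and is an assumption here, since the claims only state that the sub-scores are weighted and summed; $k_1 = 1.5$ and $b = 0.75$ are typical defaults for the two adjustment factors.

```python
import math
from collections import Counter

class BM25:
    def __init__(self, candidate_answers, k1=1.5, b=0.75):
        # candidate_answers: list of token lists, one per candidate answer d.
        self.k1, self.b = k1, b
        self.docs = [Counter(tokens) for tokens in candidate_answers]
        self.doc_lens = [sum(c.values()) for c in self.docs]
        self.avgdl = sum(self.doc_lens) / len(self.doc_lens)
        n = len(self.docs)
        df = Counter(t for c in self.docs for t in c)
        # Assumed weight W_i: the standard BM25 IDF of morpheme t.
        self.idf = {t: math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1) for t in df}

    def sub_score(self, q_i, idx):
        # R(q_i, d) = f_i * (k1 + 1) / (f_i + k1 * (1 - b + b * dl / avgdl))
        f_i = self.docs[idx][q_i]
        dl = self.doc_lens[idx]
        return f_i * (self.k1 + 1) / (f_i + self.k1 * (1 - self.b + self.b * dl / self.avgdl))

    def score(self, question_morphemes, idx):
        # Score(Q, d) = sum_i W_i * R(q_i, d)
        return sum(self.idf.get(q, 0.0) * self.sub_score(q, idx)
                   for q in question_morphemes)

bm25 = BM25([["增值税", "抵扣", "条件"], ["个人所得税", "申报", "流程"]])
print(bm25.score(["增值税", "抵扣"], 0))  # second similarity score for candidate 0
```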
8. An intelligent answering device, comprising:
the word segmentation module is used for, if a question description asked by a user is received, performing word segmentation on the question description to obtain at least one morpheme;
the candidate answer searching module is used for searching a pre-constructed knowledge base based on a preset matching rule to obtain candidate answers;
the similarity calculation module is used for calculating the similarity between each candidate answer and the question description and ranking the candidate answers based on the similarity;
the similarity grade determining module is used for determining the similarity grade of each candidate answer according to the comparison result of the similarity of each candidate answer and a preset threshold value;
and the target answer determining module is used for determining a target answer according to the similarity grade and returning the target answer.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of intelligent question answering according to any one of claims 1 to 7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of intelligent question answering according to any one of claims 1 to 7 when executing the computer program.
CN202210396931.2A 2022-04-15 2022-04-15 Intelligent question answering method, device, medium and electronic equipment Pending CN114722176A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210396931.2A CN114722176A (en) 2022-04-15 2022-04-15 Intelligent question answering method, device, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210396931.2A CN114722176A (en) 2022-04-15 2022-04-15 Intelligent question answering method, device, medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114722176A true CN114722176A (en) 2022-07-08

Family

ID=82244402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210396931.2A Pending CN114722176A (en) 2022-04-15 2022-04-15 Intelligent question answering method, device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114722176A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115129847A (en) * 2022-08-30 2022-09-30 北京云迹科技股份有限公司 Intelligent answering method and device
CN115129847B (en) * 2022-08-30 2023-01-06 北京云迹科技股份有限公司 Intelligent answering method and device
CN116628167A (en) * 2023-06-08 2023-08-22 四维创智(北京)科技发展有限公司 Response determination method and device, electronic equipment and storage medium
CN116628167B (en) * 2023-06-08 2024-04-05 四维创智(北京)科技发展有限公司 Response determination method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
CN111125334A (en) Search question-answering system based on pre-training
US10915707B2 (en) Word replaceability through word vectors
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN108073571B (en) Multi-language text quality evaluation method and system and intelligent text processing system
Mizumoto et al. Sentiment analysis of stock market news with semi-supervised learning
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
Kandhro et al. Sentiment analysis of students’ comment using long-short term model
CN111552773A (en) Method and system for searching key sentence of question or not in reading and understanding task
CN112434164A (en) Network public opinion analysis method and system considering topic discovery and emotion analysis
Sharma et al. BioAMA: towards an end to end biomedical question answering system
Kurniawan et al. Indonesian twitter sentiment analysis using Word2Vec
Liu et al. Extract Product Features in Chinese Web for Opinion Mining.
CN114840685A (en) Emergency plan knowledge graph construction method
CN108694176B (en) Document emotion analysis method and device, electronic equipment and readable storage medium
Jawad et al. Combination Of Convolution Neural Networks And Deep Neural Networks For Fake News Detection
CN107291686B (en) Method and system for identifying emotion identification
Ye et al. A sentiment based non-factoid question-answering framework
CN112667791A (en) Latent event prediction method, device, equipment and storage medium
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment
Mabrouk et al. Profile Categorization System based on Features Reduction.
CN114912446A (en) Keyword extraction method and device and storage medium
CN107729509A (en) The chapter similarity decision method represented based on recessive higher-dimension distributed nature
CN113761125A (en) Dynamic summary determination method and device, computing equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination