CN116361638A - Question and answer searching method, device and storage medium - Google Patents
Question and answer searching method, device and storage medium Download PDFInfo
- Publication number
- CN116361638A CN116361638A CN202211549500.1A CN202211549500A CN116361638A CN 116361638 A CN116361638 A CN 116361638A CN 202211549500 A CN202211549500 A CN 202211549500A CN 116361638 A CN116361638 A CN 116361638A
- Authority
- CN
- China
- Prior art keywords
- sample set
- target
- domain
- domain word
- question
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/3344 — Query execution using natural language analysis
- G06F16/3329 — Natural language query formulation or dialogue systems
- G06F16/353 — Clustering; classification into predefined classes
- G06F16/374 — Thesaurus (creation of semantic tools, e.g. ontology or thesauri)
- G06F40/242 — Dictionaries (lexical tools)
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30 — Semantic analysis
- Y02D10/00 — Energy efficient computing (climate change mitigation in ICT)
Abstract
The application relates to artificial intelligence technology and provides a question-answer searching method, device, and storage medium, the method comprising: acquiring an answer search request of a target object for a target question; determining text features of the target question; and inputting the text features into a question-answer search model to obtain a target answer. The question-answer search model is trained on a target sample set until preset conditions are met; the target sample set comprises a sample set obtained by fusing a first sample set and a second sample set, where the first sample set is obtained by processing a historical dialogue data set, and the second sample set is obtained by processing the historical dialogue data set together with a marked file. By adopting the method and device, sample diversity can be improved, the breadth of the model's learning over training samples can be increased, and the accuracy of answer searching can be improved.
Description
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a question-answer searching method, device, and storage medium.
Background
Text question-answer matching algorithms are mainly used in business scenarios such as customer service and outbound robots. With the rapid development of the Internet, many question-answer search systems that relied on manual participation have gradually shifted to a combined automatic-and-manual mode, in which automatic question-answer recommendation resolves some questions, thereby reducing manual effort and responding quickly to user needs.
At present, the training samples used by the question-answer search model in a question-answer search system are obtained by manually labeling historical dialogue data. However, some types of samples in real question-answer scenarios — for example, negative samples and difficult samples — are scarce, and this shortage of such samples makes it difficult to improve the accuracy of answer search.
Disclosure of Invention
Embodiments of the present application provide a question-answer searching method, device, and storage medium, which can improve sample diversity, broaden the model's learning over training samples, and help improve the accuracy of answer searching.
In a first aspect, an embodiment of the present application provides a question-answer searching method, including: acquiring an answer search request of a target object for a target question; determining text features of the target question; and inputting the text features into a question-answer search model to obtain a target answer. The question-answer search model is trained on a target sample set until preset conditions are met; the target sample set comprises a sample set obtained by fusing a first sample set and a second sample set, where the first sample set is obtained by processing a historical dialogue data set, and the second sample set is obtained by processing the historical dialogue data set together with a marked file.
In one possible example, the method further includes: analyzing the historical dialogue data set to obtain a domain word library; screening the domain word library to obtain a high-frequency domain word library; supplementing the domain word library to obtain an associated domain word library; and constructing a first sample set from the high-frequency domain word library and the associated domain word library.
In one possible example, screening the domain word library to obtain the high-frequency domain word library includes: obtaining a vector representation of each domain word in the domain word library; clustering the vector representations of the domain words to obtain at least two domain word clusters; acquiring the frequency of each domain word cluster; and forming the domain word clusters whose frequency exceeds a frequency threshold into the high-frequency domain word library.
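The clustering-and-screening step above can be sketched in a few lines. This is a minimal pure-Python illustration, not the patent's implementation: the function names, the naive k-means seeding, and the aggregation of cluster frequency from corpus counts are all assumptions.

```python
import math
from collections import defaultdict

def kmeans(vectors, k=2, iters=10):
    # Naive k-means on small vector lists; the first k points seed the centroids.
    centroids = [list(v) for v in vectors[:k]]
    assign = [0] * len(vectors)
    for _ in range(iters):
        for i, v in enumerate(vectors):
            assign[i] = min(range(k), key=lambda c: math.dist(v, centroids[c]))
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def high_frequency_lexicon(word_vectors, corpus_counts, freq_threshold):
    words = list(word_vectors)
    assign = kmeans([word_vectors[w] for w in words])
    cluster_freq = defaultdict(int)
    for w, c in zip(words, assign):
        cluster_freq[c] += corpus_counts.get(w, 0)
    # Keep every word whose cluster's aggregate frequency exceeds the threshold.
    return {w for w, c in zip(words, assign) if cluster_freq[c] > freq_threshold}
```

Here the "frequency" of a cluster is taken to be the summed corpus count of its member words; the patent does not specify how cluster frequency is computed, so this choice is illustrative only.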
In one possible example, supplementing the domain word library to obtain the associated domain word library includes: searching for replacement words of each domain word in the domain word library according to a preset rule corresponding to the domain type of the domain word library; obtaining similar words of each domain word in the domain word library; and adding the replacement words and the similar words to the domain word library to obtain the associated domain word library.
In one possible example, constructing the first sample set from the high-frequency domain word library and the associated domain word library includes: searching the historical dialogue data set for target historical dialogue data containing at least one domain word in the high-frequency domain word library; constructing first sub-samples containing at least one domain word in the high-frequency domain word library from the target historical dialogue data; replacing domain words in the target historical dialogue data with at least one domain word from the associated domain word library to obtain a plurality of second sub-samples; and fusing the first sub-samples and the plurality of second sub-samples to obtain the first sample set.
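The sample-construction step above can be sketched as follows. This is a simplified string-level illustration under stated assumptions: substring containment stands in for word matching, and the function name and the flat `associated_words` mapping are invented for the example.

```python
def build_first_sample_set(dialogues, high_freq_words, associated_words):
    """First sub-samples: dialogues containing a high-frequency domain word.
    Second sub-samples: copies with associated (replacement/similar) words
    substituted in. The fused, de-duplicated union is the first sample set."""
    first_sub = [d for d in dialogues if any(w in d for w in high_freq_words)]
    second_sub = []
    for d in first_sub:
        for word in high_freq_words:
            if word in d:
                for alt in associated_words.get(word, []):
                    second_sub.append(d.replace(word, alt))
    # Fuse and de-duplicate while preserving order.
    return list(dict.fromkeys(first_sub + second_sub))
```

In a real system the substitution would operate on segmented words rather than raw substrings, to avoid replacing inside longer words.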
In one possible example, the method further includes: selecting reference samples corresponding to a preset sample type from the marked file; obtaining a similarity value between each historical dialogue datum in the historical dialogue data set and the reference samples; and screening out the historical dialogue data whose similarity value exceeds a similarity threshold to obtain a second sample set.
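The similarity-screening step above can be sketched with cosine similarity over vector representations. The use of cosine similarity, the function names, and returning indices of kept dialogues are assumptions for illustration; the patent does not fix the similarity measure.

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two equal-length vectors.
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def build_second_sample_set(history_vectors, reference_vectors, threshold):
    # Keep any historical dialogue whose similarity to at least one
    # reference sample of the preset type exceeds the threshold.
    return [i for i, h in enumerate(history_vectors)
            if any(cosine(h, r) > threshold for r in reference_vectors)]
```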
In one possible example, determining the text features of the target question includes: determining keywords in the target question and the technical field of the target question; and determining the text features of the target question from the technical field and the keywords.
In a second aspect, an embodiment of the present application provides a question-answer search apparatus, wherein:
the communication unit is used for acquiring an answer search request of the target object for the target question;
the processing unit is used for determining the text features of the target question; inputting the text features into a question-answer search model to obtain a target answer; the question-answer search model is trained on a target sample set until preset conditions are met; the target sample set comprises a sample set obtained by fusing a first sample set and a second sample set, where the first sample set is obtained by processing historical dialogue data, and the second sample set is obtained by processing the historical dialogue data together with a marked file.
In one possible example, the processing unit is further configured to analyze the historical dialogue data set to obtain a domain word library; screen the domain word library to obtain a high-frequency domain word library; supplement the domain word library to obtain an associated domain word library; and construct a first sample set from the high-frequency domain word library and the associated domain word library.
In one possible example, the processing unit is specifically configured to obtain a vector representation of each domain word in the domain word library; cluster the vector representations of the domain words to obtain at least two domain word clusters; acquire the frequency of each domain word cluster; and form the domain word clusters whose frequency exceeds a frequency threshold into the high-frequency domain word library.
In one possible example, the processing unit is specifically configured to search for replacement words of each domain word in the domain word library according to a preset rule corresponding to the domain type of the domain word library; obtain similar words of each domain word in the domain word library; and add the replacement words and the similar words to the domain word library to obtain the associated domain word library.
In one possible example, the processing unit is specifically configured to search the historical dialogue data set for target historical dialogue data containing at least one domain word in the high-frequency domain word library; construct first sub-samples containing at least one domain word in the high-frequency domain word library from the target historical dialogue data; replace domain words in the target historical dialogue data with at least one domain word from the associated domain word library to obtain a plurality of second sub-samples; and fuse the first sub-samples and the plurality of second sub-samples to obtain the first sample set.
In one possible example, the processing unit is further configured to select reference samples corresponding to a preset sample type from the marked file; obtain a similarity value between each historical dialogue datum in the historical dialogue data set and the reference samples; and screen out the historical dialogue data whose similarity value exceeds a similarity threshold to obtain a second sample set.
In one possible example, the processing unit is specifically configured to determine keywords in the target question and the technical field of the target question; and determine the text features of the target question from the technical field and the keywords.
In a third aspect, embodiments of the present application provide a computer device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing some or all of the steps described in the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program causing a computer to perform some or all of the steps described in the first aspect.
By implementing the embodiments of the present application, after an answer search request of a target object for a target question is obtained, the text features of the target question can first be determined. The text features are then input into a question-answer search model to obtain a target answer. The question-answer search model is trained on a target sample set until preset conditions are met, where the target sample set comprises a sample set obtained by fusing a first sample set and a second sample set. Because the training samples of the question-answer search model thus include at least two different types of samples, sample diversity is improved, as is the breadth of the model's learning over the training samples. The first sample set is obtained by processing the historical dialogue data set, and the second sample set is obtained by processing the historical dialogue data set together with the marked file. The practicality and accuracy of the samples can thereby be improved, improving the accuracy of searched answers.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Wherein:
fig. 1 is a flow chart of a question-answer searching method provided in an embodiment of the present application;
fig. 2 is a schematic structural diagram of a question-answer searching device according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The network architecture applied by the embodiment of the application comprises a server and electronic equipment. The embodiments of the present application do not limit the number of electronic devices and servers, and a server may provide services for multiple electronic devices at the same time. The server may be an independent server, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (content delivery network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms. The server may alternatively be implemented by a server cluster composed of a plurality of servers.
The electronic device may be a personal computer (PC), a notebook computer, or a smart phone, or may be an all-in-one machine, a palm computer, a tablet computer (pad), a smart television playing terminal, a vehicle-mounted terminal, or a portable device. The operating system of a PC-side electronic device, such as an all-in-one machine, may include, but is not limited to, Linux, Unix, the Windows series (such as Windows XP, Windows 7, etc.), Mac OS X (the operating system of Apple computers), and the like. The operating system of a mobile electronic device, such as a smart phone, may include, but is not limited to, Android, iOS (the operating system of Apple phones), Windows, and other operating systems.
The electronic device can install and run an application program, and the server may be the server corresponding to that application, providing application services to it. The application may be standalone integrated application software, an applet embedded in another application, a web-page system, or the like, which is not limited herein. In embodiments of the present application, the application may have a question-answer search function, for example a community question-and-answer application, which can be used to search for answers to questions and also to accept input of answers. The question-answer search function may be included within a general search function; for example, a browser's search function can be used to search for answers to questions. The business data involved in the question search of the present application may include financial data, medical data, e-commerce data, etc., which is not limited herein.
In embodiments of the present application, a stored question and its answer may be associated in advance. The set of stored questions and their answers may be referred to as a preset question-answer text library. The preset question-answer text library may be stored in a block created on a blockchain network, so that data sharing of information among different platforms can be achieved while data security is ensured.
The embodiment of the application provides a question-answer searching method which can be executed by a question-answer searching device, wherein the device can be realized by software and/or hardware and can be generally integrated in electronic equipment or a server, so that the diversity of samples can be improved, the learning breadth of a model on training samples can be improved, and the accuracy of searching answers can be improved.
Referring to fig. 1, fig. 1 is a flow chart of a question-answer searching method provided in the present application. The method is applied to a server for illustration, and comprises the following steps S101 to S103, wherein:
s101: and obtaining an answer search request of the target object for the target question.
In the embodiment of the present application, the answer search request is used to search for the answer to the target question, and the answer obtained by the search may be referred to as the target answer. The answer search request may be obtained by converting text or voice input that the target object provides to the server via the electronic device, which is not limited herein. The target object is not limited in this application and may be any registered user of the question-answering application or a guest using the application. The target question is likewise not limited and may be any string composed of text and punctuation, a field corresponding to a retrieval query, or the like.
The target question may be composed of characters and/or words. The word segmentation method is not limited; for example, segmentation based on character-string matching, statistical segmentation, or understanding-based segmentation may be used, as may tools such as the jieba word segmentation tool or a word2vec-based segmentation model. The jieba tool can perform word segmentation, part-of-speech tagging, keyword extraction, and other functions on Chinese text, and supports custom dictionaries. word2vec is a family of related models used to generate word vectors. Along with each character and/or word, its part of speech may be obtained — for example, the broad classes of nouns and verbs, finer classes such as person names, place names, and institution names, or auxiliary verbs, nominal verbs, and so on. The word sense of each character and/or word may also be obtained to determine its meaning within a sentence such as the target question.
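Of the segmentation families mentioned above, the character-string-matching approach is the simplest to illustrate. The sketch below is a forward maximum-matching segmenter, an assumed minimal example rather than jieba's actual algorithm (jieba combines a prefix dictionary with dynamic programming and an HMM for unknown words).

```python
def forward_max_match(text, dictionary, max_len=4):
    """Forward maximum matching: greedily take the longest dictionary
    word starting at each position, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + j] in dictionary or j == 1:
                tokens.append(text[i:i + j])
                i += j
                break
    return tokens
```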
S102: text features of the target question are determined.
In embodiments of the present application, the text features of the target question may include features of characters and/or words, such as part of speech and word sense. The text features may alternatively include features of the question text as a whole, such as semantics and emotion, or the technical field of the question, and so on, without limitation. The technical field of a question may be categorized by business type — for example, finance, medical, or e-commerce — or by discipline, such as information technology, biotechnology, new materials, or energy technology, without limitation. The number of technical fields is also not limited; one or more may be used in the present application.
In some possible examples, step S102 may include the following steps: determining keywords in the target question and the technical field of the target question; and determining the text features of the target question from the technical field and the keywords.
Keywords may be words that affect the meaning of the target question. The method for determining the keywords is not limited; optionally, the target question is segmented to obtain a plurality of words together with the word sense and part of speech of each word, and keywords are then selected from the plurality of words according to those word senses and parts of speech.
For word segmentation, refer to the description above, which is not repeated here. Weights corresponding to the part of speech of each word may first be obtained, for example: the weight for a verb may be 1.3; for an auxiliary verb, 1.2; the preset weight for a place noun may be 1.4; for a demonstrative pronoun, 0.7; and for a stop word, 0.1, etc. The ratio between the meaning of each word and the meaning of the target question may also be obtained — for example, a similarity value between the word sense and the question's semantics, or the proportion of identical characters. A weighted calculation over each word's part-of-speech weight and its meaning ratio then yields a key value for each word, and keywords are selected according to the size of the key value, for example the N largest, where N may be greater than 1.
The number of selected keywords is not limited and may be a fixed preset number, for example 3. It may also be set dynamically according to the number of words obtained by segmentation, the number of parts of speech, etc. Because the keywords are selected according to the part of speech and word sense of each word in the target question — that is, selected from the scenario as it actually occurs — the accuracy of keyword selection can be improved.
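The weighted keyword selection described above can be sketched as follows, using the example part-of-speech weights from the text. The function name, the input tuple shape, and the `sense_overlap` placeholder are assumptions; the patent does not prescribe this exact scoring interface.

```python
# Example part-of-speech weights taken from the description above.
POS_WEIGHTS = {"verb": 1.3, "aux_verb": 1.2, "place_noun": 1.4,
               "pronoun": 0.7, "stopword": 0.1}

def select_keywords(words, n=3):
    """words: list of (token, part_of_speech, sense_overlap), where
    sense_overlap is the illustrative ratio between the word's meaning
    and the question's overall meaning. Returns the n highest-scoring tokens."""
    scored = [(tok, POS_WEIGHTS.get(pos, 1.0) * overlap)
              for tok, pos, overlap in words]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [tok for tok, _ in scored[:n]]
```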
The method for determining the technical field is not limited. It may be determined whether a node corresponding to a keyword exists in the knowledge graph of each technical field; if so, the technical field of the target question is the field corresponding to that knowledge graph. Alternatively, a question similar to the target question may be retrieved from the marked file or the preset question-answer text library, and the technical field of that similar question taken as the technical field of the target question.
It can be understood that searching according to the keywords of the target question improves search efficiency, and searching within the technical field improves search accuracy. In this example, the text features of the target question are determined from both the technical field of the target question and the keywords in it; since the features used for retrieval include the keywords as well as the technical field, search efficiency and accuracy are further improved.
S103: and inputting the text features into a question-answer search model to obtain target answers.
In the embodiment of the present application, the question-answer search model is a model trained on the target sample set until preset conditions are met. The preset conditions are not limited in this application: they may concern the number of training iterations exceeding a threshold (for example, 50) or the number of training samples, or they may concern training accuracy exceeding an accuracy threshold — for example, the loss value of the loss function being less than 0.1, the accuracy being greater than 90%, the recall being greater than 80%, etc.
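A stopping check over such preset conditions can be sketched as below; the function name and the particular combination of thresholds (epoch budget, or loss and accuracy targets met together, matching the example figures above) are assumptions for illustration.

```python
def training_done(epoch, loss, accuracy,
                  max_epochs=50, loss_target=0.1, acc_target=0.9):
    # Preset-condition sketch: stop when the iteration budget is spent,
    # or when both the loss and accuracy targets are reached.
    return epoch >= max_epochs or (loss < loss_target and accuracy > acc_target)
```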
The target sample set may include a sample set obtained by fusing the first sample set and the second sample set, and may also include an annotated file, etc., which is not limited herein. The annotated file may be text labeled manually, or text labeled by a trained annotation model, etc. The labels attached to the annotated text may include reference answers to reference questions, and may also include sample types, such as positive samples, negative samples, simple samples, and difficult samples.
Positive samples are samples of the target category corresponding to the ground truth, and negative samples are samples of all other categories. Illustratively, in face detection, faces are positive samples and non-faces, such as nearby trees, flowers, and other objects, are negative samples.
Simple samples are those whose predictions differ little from the truth label, and difficult samples are those whose predictions differ greatly. Illustratively, if the truth label is [1, 0, 0] and the predicted probability distribution is [0.3, 0.3, 0.4], the prediction differs significantly from the truth label, so the sample is a difficult sample. When [0.98, 0.01, 0.01] is predicted, the difference from the truth label is small, and the sample is a simple sample.
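The simple/difficult distinction above can be made concrete with a toy criterion (not from the patent): score each sample by the cross-entropy between its predicted distribution and the one-hot truth label, and call it simple when the loss is below a threshold. The 0.5 threshold is an arbitrary assumption for illustration.

```python
import math

def sample_difficulty(truth, predicted, threshold=0.5):
    """Return 'simple' when prediction error is small, else 'difficult'."""
    # Cross-entropy against the one-hot truth label: only the true
    # class's predicted probability contributes.
    loss = -sum(t * math.log(p) for t, p in zip(truth, predicted) if t > 0)
    return "simple" if loss < threshold else "difficult"
```

With the truth label [1, 0, 0], the prediction [0.98, 0.01, 0.01] scores a loss of about 0.02 (simple), while [0.3, 0.3, 0.4] scores about 1.2 (difficult), matching the example in the text.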
The annotated text may carry one or more labels; for example, a sample's type may be both negative and difficult. Below, samples that are both negative and difficult are referred to simply as difficult negative samples, and samples that are both negative and simple as simple negative samples. Illustratively, a simple negative sample may be "I forgot my password" and a difficult negative sample may be "What is my loan interest?".
The method for fusing the first sample set and the second sample set is not limited. Duplicate samples may be deleted, and the fused set may be further expanded, for example with additional samples retrieved on the basis of at least one of the first sample set and the second sample set.
In an embodiment of the present application, the first sample set may be a sample set obtained by processing a historical dialogue data set. The historical dialogue data set may include a plurality of historical dialogue records, each being the dialogue content of a historical question-answer search, in which the answers may have been given manually or obtained with a preset question-answer search model. That is, the historical dialogue data set contains questions posed by various types of users as well as answers obtained by various methods. Therefore, the sample set obtained by processing the historical dialogue data set includes not only questions but also answers, which improves the practicality of the samples and thus the search accuracy. The second sample set may be a sample set obtained by processing the historical dialogue data set together with the annotated file; processing both together improves annotation accuracy and thereby search accuracy.
Optionally, the first sample set and the second sample set both consist of difficult negative samples. This increases the number of difficult negative samples and hence the diversity of the samples, broadens what the model learns from the training samples, and improves the accuracy of answer search.
The present application does not limit the method of obtaining the first sample set. In some possible examples, the method further includes: analyzing the historical dialogue data set to obtain a domain word stock; screening the domain word stock to obtain a high-frequency domain word stock; supplementing the domain word stock to obtain an associated domain word stock; and constructing the first sample set from the high-frequency domain word stock and the associated domain word stock.
Each domain word stock comprises words from the same technical domain. The number of domain word stocks may equal the number of technical domains; when the question-answer search model involves multiple technical domains, multiple domain word stocks may be acquired. The analysis method for obtaining the domain word stock is not limited. For example, the words in the historical dialogue data set may be processed with the pointwise mutual information (PMI) algorithm to obtain an association value for each word, and the words whose association value is greater than an association threshold taken as the words of the domain word stock.
The pointwise mutual information algorithm computes the correlation between two words; its formula can be written as (1) below:

PMI(w1, w2) = log2( p(w1, w2) / ( p(w1) × p(w2) ) )   (1)

where w1 and w2 each denote a word, p(w1) is the probability that w1 occurs, and p(w2) is the probability that w2 occurs. p(w1) may be the ratio of the number of occurrences of w1 to the total number of words, and p(w2) the ratio of the number of occurrences of w2 to the total number of words. p(w1, w2) denotes the probability that the two words occur together, and may be the ratio of the number of times w1 and w2 occur together to the total number of words.
A larger PMI indicates a stronger correlation. If PMI(w1, w2) > 0, the two words are correlated, and the larger the value, the stronger the correlation. If PMI(w1, w2) = 0, the two words are statistically independent: neither correlated nor mutually exclusive. If PMI(w1, w2) < 0, the two words are uncorrelated and mutually exclusive.
The association threshold is not limited in the present application; it may be a fixed value, for example 0, or may be determined from the average PMI value, the number of words, etc. It can be understood that by computing the correlation between adjacent words and characters with the pointwise mutual information algorithm, higher-frequency word combinations can be obtained and used as domain words. Compared with a general-purpose word stock, this improves specificity and thus the downstream task effect.
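A minimal sketch of formula (1) over a token list follows; the co-occurrence definition (adjacent tokens) and the toy corpus are illustrative assumptions, since the patent does not fix a co-occurrence window.

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """PMI(w1, w2) = log2( p(w1, w2) / (p(w1) * p(w2)) ), formula (1).

    Co-occurrence here means w1 immediately followed by w2.
    """
    total = len(tokens)
    counts = Counter(tokens)
    pair = sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (w1, w2))
    if pair == 0:
        return float("-inf")  # never co-occur: treat as maximally unrelated
    p1, p2 = counts[w1] / total, counts[w2] / total
    p12 = pair / (total - 1)  # adjacent-pair probability
    return math.log2(p12 / (p1 * p2))
```

With an association threshold of 0 (the fixed-value example in the text), word pairs with PMI > 0 would be admitted into the domain word stock.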
In the embodiment of the application, the high-frequency domain word stock consists of the high-frequency words in the domain word stock. The method for obtaining it is not limited. In some possible examples, screening the domain word stock to obtain the high-frequency domain word stock may include the following steps: obtaining a vector representation of each domain word in the domain word stock; clustering the vector representations to obtain at least two domain word clusters; acquiring the frequency of each domain word cluster; and forming the domain word clusters whose frequency is greater than a frequency threshold into the high-frequency domain word stock.
Vector representation can be understood as the word encoding of a domain word: text is converted into a numerical matrix through a data transformation or mapping. The vector representation may be obtained by word encoding with the word2vec algorithm, by summary extraction, by a statistical language model, etc. The statistical language model may be an n-gram model, whose basic idea is to slide a window of size n over the text contents byte by byte, forming a sequence of fragments (grams) of length n. The occurrence frequency of every gram is counted and filtered against a preset threshold to form a list of key grams, which constitutes the vector feature space of the text.
Clustering partitions a data set into different classes or clusters according to a particular criterion, such as distance, so that the similarity of data objects within the same cluster is as large as possible while the difference between data objects in different clusters is as large as possible. That is, after clustering, data of the same class are aggregated together as much as possible and data of different classes are separated as much as possible. The types of vector representation and clustering algorithm are not limited; the clustering algorithm may be a hierarchical clustering algorithm, the k-means clustering algorithm, a density-based algorithm such as OPTICS (ordering points to identify the clustering structure) or DBSCAN (density-based spatial clustering of applications with noise), spectral clustering, etc.
The frequency of a domain word cluster may be computed from the frequencies of its member words, for example as an average or a weighted value. The frequency threshold is not limited in this application; it may be a fixed value, for example 0.6, or may be determined from the average frequency of the domain word clusters, the number of domain word clusters, etc.
It can be understood that selecting high-frequency domain word clusters according to the frequency of the clusters obtained by clustering the vector representations of the domain words improves the accuracy of screening the high-frequency domain word stock, further improves the specificity of the selected domain words, and improves the effect of sample training.
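The screening step above can be sketched in simplified form. Real vector representations (e.g. word2vec) and a full clustering algorithm are assumed to exist elsewhere; here each word carries a toy 2-d vector, clusters are formed by nearest seed centroid, cluster frequency is the mean of member frequencies, and 0.6 is the example threshold from the text.

```python
def nearest(vec, centroids):
    """Index of the centroid closest to vec (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(vec, centroids[i])))

def high_frequency_lexicon(words, centroids, freq_threshold=0.6):
    """words: {word: (vector, frequency)} -> set of high-frequency words."""
    clusters = {}
    for word, (vec, freq) in words.items():
        clusters.setdefault(nearest(vec, centroids), []).append((word, freq))
    keep = set()
    for members in clusters.values():
        # Cluster frequency = mean of member-word frequencies.
        avg = sum(f for _, f in members) / len(members)
        if avg > freq_threshold:
            keep.update(w for w, _ in members)
    return keep
```

In practice the centroids would come from k-means or a density-based method as listed in the text, rather than being fixed seeds.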
In the embodiment of the application, the associated domain word stock contains words that are related to, but outside, the domain word stock, and may include similar words, replacement words, etc. The method for obtaining it is not limited. In some possible examples, supplementing the domain word stock to obtain the associated domain word stock may include the following steps: searching for a replacement word of each domain word according to a preset rule corresponding to the domain type of the domain word stock; obtaining similar words of each domain word in the domain word stock; and supplementing the replacement words and the similar words into the domain word stock to obtain the associated domain word stock.
The domain type of the domain word stock may be the type of technical domain. The preset rule corresponding to the domain type is a preset search rule; search rules may include search granularity, search scope, etc. For example, the two texts [help me check my account] and [help me check my funds] differ in only two characters, yet their meanings are completely different, distinguished by the keywords <account> and <funds>. Therefore, the search granularity of the customer-service domain may be set to 2.
The search rules may also include a search threshold, etc. For example, the similar words may be the topK samples most similar to the domain word; K may be 5 in the customer-service domain, 10 in the medical domain, and so on. For another example, the recall rate of similar words may be 40% to 60% in the customer-service domain, 50% to 60% in the medical domain, etc.
The search method for similar words is not limited: the domain word may be sent to a BM25 coarse-ranking module, or similar words of the domain word may be searched for in the knowledge graph corresponding to the domain type.
It can be understood that replacement words of each domain word are found according to the preset rule corresponding to the domain type of the domain word stock, and similar words of each domain word can also be obtained. Supplementing the domain word stock with both means that replacement words with meanings different from the domain words, as well as similar words with meanings close to the domain words, are added. This improves sample diversity and therefore the sample-training effect.
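An illustrative sketch of the supplementing step follows. The replacement-rule table and the similar-word table below are stand-ins for the per-field preset rules and the BM25 / knowledge-graph retrieval the text describes; all entries are hypothetical.

```python
# Hypothetical per-field replacement rules (preset rules by domain type).
REPLACEMENTS = {
    "customer_service": {"account": ["funds"], "password": ["passcode"]},
}
# Hypothetical similar-word lookup (would come from BM25 or a knowledge graph).
SIMILAR = {"account": ["balance"], "password": ["pin"]}

def associated_lexicon(domain_words, field):
    """Collect replacement and similar words for every domain word."""
    supplement = set()
    for word in domain_words:
        supplement.update(REPLACEMENTS.get(field, {}).get(word, []))
        supplement.update(SIMILAR.get(word, []))
    # The associated word stock holds only the newly supplemented words.
    return supplement - set(domain_words)
```

The subtraction at the end keeps the associated stock disjoint from the original domain words, matching the description that it contains related words *except* those already in the domain word stock.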
The method for constructing the first sample set from the high-frequency domain word stock and the associated domain word stock is not limited. In some possible examples, the construction may include the following steps: searching the historical dialogue data set for the target historical dialogue data in which at least one domain word of the domain word stock occurs; constructing, from the target historical dialogue data, first sub-samples containing at least one domain word of the high-frequency domain word stock; replacing domain words in the target historical dialogue data with words from the associated domain word stock to obtain a plurality of second sub-samples; and fusing the first sub-samples and the second sub-samples to obtain the first sample set.
The target historical dialogue data is historical dialogue data containing at least one domain word. A first sub-sample contains at least one domain word of the high-frequency domain word stock and is obtained by processing the target historical dialogue data, for example by scrambling the target historical dialogue data through random sampling and then recombining it.
A second sub-sample contains at least one word of the associated domain word stock and is obtained by replacement on the basis of the target historical dialogue data: domain words in the target historical dialogue data are replaced with words from the associated domain word stock. The words of the associated domain word stock may be called associated domain words, and the associated domain word used for replacement may be determined from the domain word being replaced, for example a similar word or a replacement word of it. The method for fusing the first and second sub-samples may include deleting duplicate sub-samples, further expanding both sub-sample sets in a subsequent step, etc., which is not limited herein.
It can be understood that first sub-samples containing at least one word of the high-frequency domain word stock are constructed from the target historical dialogue data, second sub-samples are obtained by replacing domain words in the target historical dialogue data with associated domain words, and the two are fused into the first sample set. In this way, sample diversity is improved; and because the first sample set derives from the historical dialogue data set, the practicality of the annotation is improved.
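The three construction steps above can be sketched compactly. The dialogues, high-frequency words, and replacement map below are illustrative, and the fusion step is reduced to order-preserving deduplication, one of the options the text mentions.

```python
def build_first_sample_set(dialogues, hf_words, assoc_map):
    """dialogues: list of dialogue strings.
    hf_words:  high-frequency domain words.
    assoc_map: domain word -> list of associated replacement words.
    """
    # Target historical dialogue data: dialogues containing a domain word.
    target = [d for d in dialogues if any(w in d for w in hf_words)]
    first = list(target)  # first sub-samples
    second = []           # second sub-samples, built by word replacement
    for d in target:
        for word, repls in assoc_map.items():
            if word in d:
                second.extend(d.replace(word, r) for r in repls)
    # Fuse: drop duplicates while keeping insertion order.
    return list(dict.fromkeys(first + second))
```

A fuller implementation would also scramble and recombine the target dialogues for the first sub-samples, as the text suggests; that step is omitted here for brevity.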
The method for obtaining the second sample set is not limited in this application. In some possible examples, it may include the following steps: selecting, from the annotated file, reference samples corresponding to a preset sample type; obtaining the similarity value between each item of historical dialogue data in the historical dialogue data set and the reference samples; and screening out the historical dialogue data whose similarity value is greater than a similarity threshold to obtain the second sample set.
The reference samples corresponding to the preset sample type may be the positive samples, negative samples, simple samples, difficult samples, etc. described above. Optionally, the reference samples are negative samples and the second sample set consists of difficult samples. Reference samples may be selected by dividing the annotated file according to its labels; or by clustering the annotated file and taking clusters with high similarity values as positive samples and clusters with low similarity values as negative samples; or by recognizing the intent of the historical dialogue data and taking the same intent as positive samples and different intents as negative samples, etc. The selection method is not limited herein.
The method for obtaining the similarity value between the historical dialogue data and a reference sample is not limited; the Jaccard similarity coefficient, the edit distance, ROUGE (recall-oriented understudy for gisting evaluation), term frequency-inverse document frequency (TF-IDF), etc. may be used. The Jaccard similarity coefficient describes the degree of similarity between two samples: the larger the coefficient, the more similar the samples. The edit distance is an indicator of the similarity of two sequences. ROUGE is an evaluation index commonly used in machine translation, automatic summarization, question-and-answer generation, and similar fields; it scores a model-generated summary or answer by comparison with a reference answer (typically human-written). TF-IDF evaluates how important a word is to one document in a document collection or corpus.
Taking the Jaccard similarity coefficient as an example, its calculation formula J(A, B) is shown in (2) below:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / ( |A| + |B| − |A ∩ B| )   (2)

where A and B are the two samples being compared, |A ∩ B| denotes the number of words or characters shared by A and B, and |A| + |B| − |A ∩ B| is the size of the union of the words or characters of A and B.
It can be understood that reference samples of the preset sample type are first selected from the annotated file, the similarity value between each item of historical dialogue data and the reference samples is then obtained, and the historical dialogue data whose similarity value exceeds the similarity threshold are screened out as the second sample set. That is, similar samples are retrieved on the basis of samples of the preset type in the annotated file, which improves annotation accuracy and thus search accuracy.
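The screening just described can be sketched with formula (2) over word sets; the 0.5 threshold and the samples are illustrative, and whitespace tokenization stands in for proper word segmentation.

```python
def jaccard(a, b):
    """Jaccard coefficient of formula (2) over the samples' word sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def second_sample_set(dialogues, reference_samples, threshold=0.5):
    """Keep dialogues whose best similarity to a reference exceeds threshold."""
    return [d for d in dialogues
            if max(jaccard(d, r) for r in reference_samples) > threshold]
```

With negative reference samples, the dialogues retained this way are precisely the near-misses the text calls difficult samples.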
Optionally, after the target sample set is acquired, the method may further include: sending the target sample set to a rechecking object. The rechecking object may be a human annotator, an annotation model, etc. After receiving the target sample set, the rechecking object can recheck it, which improves the accuracy of sample annotation.
Optionally, the target sample set consists of difficult negative samples. In this way, simple negative samples are filtered out of the historical dialogue data set and only the difficult negative samples are annotated and trained on, which reduces the size of the target sample set, improves annotation efficiency and effectiveness, and improves answer-search accuracy.
In the question-answer search method shown in fig. 1, after an answer search request of a target object for a target question is acquired, the text features of the target question are first determined and then input into the question-answer search model to obtain the target answer. The question-answer search model is trained on a target sample set until a preset condition is met, and the target sample set includes a sample set obtained by fusing the first and second sample sets. Because the training samples thus come from at least two different sources, sample diversity is improved and the model learns more broadly from the training samples. The first sample set is obtained by processing the historical dialogue data set, and the second sample set by processing the historical dialogue data set together with the annotated file, which improves the practicality and accuracy of the samples and therefore the accuracy of the searched answers.
The foregoing details the method of embodiments of the present application, and the apparatus of embodiments of the present application is provided below.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a question-answer searching device according to the present application, consistent with the embodiment shown in fig. 1. As shown in fig. 2, the question-answer searching apparatus 200 includes:
a communication unit 201, configured to obtain an answer search request of a target object for a target question;
a processing unit 202 for determining text features of the target question; inputting the text features into a question-answer search model to obtain target answers; the question-answer search model is a model which is obtained by training according to a target sample set and meets preset conditions, the target sample set comprises a sample set obtained by fusing a first sample set and a second sample set, the first sample set is a sample set obtained by processing according to historical dialogue data, and the second sample set is a sample set obtained by processing according to the historical dialogue data and a marked file.
In one possible example, the processing unit 202 is further configured to analyze the historical dialogue dataset to obtain a domain word stock; screening the domain word stock to obtain a high-frequency domain word stock; supplementing a domain word stock to obtain an associated domain word stock; and constructing a first sample set according to the high-frequency domain word stock and the related domain word stock.
In one possible example, the processing unit 202 is specifically configured to obtain a vector representation of each domain word in the domain word stock; clustering the vector representations of the domain words to obtain at least two types of domain word clusters; acquiring the frequency of word clusters in various fields; and forming a domain word cluster with the frequency larger than the frequency threshold value into a high-frequency domain word library.
In one possible example, the processing unit 202 is specifically configured to search for a replacement word of each domain word in the domain word stock according to a preset rule corresponding to the domain type of the domain word stock; obtain similar words of each domain word in the domain word stock; and supplement the replacement words and the similar words into the domain word stock to obtain the associated domain word stock.
In one possible example, the processing unit 202 is specifically configured to search the historical dialogue data set for a target historical dialogue data set in which at least one domain word in the domain word library is located; constructing a first sub-sample containing at least one domain word in the high-frequency domain word stock according to the target historical dialogue data; replacing the domain words in the target historical dialogue data according to at least one domain word in the associated domain word library to obtain a plurality of second sub-samples; and fusing the first sub-sample and the plurality of second sub-samples to obtain a first sample set.
In a possible example, the processing unit 202 is further configured to select a reference sample corresponding to a preset sample type from the annotated file; obtaining a similarity value between each historical dialogue data and a reference sample in the historical dialogue data set; and screening the historical dialogue data set with the similarity value larger than the similarity threshold value from the historical dialogue data set to obtain a second sample set.
In one possible example, the processing unit 202 is specifically configured to determine keywords in the target question and the technical field of the target question; and determine the text features of the target question according to the technical field and the keywords.
The detailed process performed by each unit in the question-answer searching apparatus 200 may refer to the execution steps in the foregoing method embodiment, and will not be described herein.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 3, the computer device 300 includes a processor 310, a memory 320, a communication interface 330, and one or more programs 340. The processor 310, the memory 320 and the communication interface 330 are interconnected by a bus 350. The relevant functions performed by the communication unit 201 shown in fig. 2 may be implemented by the communication interface 330, and the relevant functions performed by the processing unit 202 shown in fig. 2 may be implemented by the processor 310.
The one or more programs 340 are stored in the memory 320 and configured to be executed by the processor 310, the program 340 including instructions for:
acquiring an answer search request of a target object for a target question;
determining text features of the target question;
inputting the text features into a question-answer search model to obtain target answers; the question-answer search model is a model which is obtained by training according to a target sample set and meets preset conditions, the target sample set comprises a sample set obtained by fusing a first sample set and a second sample set, the first sample set is a sample set obtained by processing according to a historical dialogue data set, and the second sample set is a sample set obtained by processing according to the historical dialogue data set and a marked file.
In one possible example, program 340 is further configured to execute instructions for:
analyzing the historical dialogue data set to obtain a domain word stock;
screening the domain word stock to obtain a high-frequency domain word stock;
supplementing a domain word stock to obtain an associated domain word stock;
and constructing a first sample set according to the high-frequency domain word stock and the related domain word stock.
In one possible example, in terms of screening the domain thesaurus to obtain a high frequency domain thesaurus, the program 340 is specifically configured to execute the following instructions:
Obtaining vector characterization of each domain word in the domain word library;
clustering the vector representations of the domain words to obtain at least two types of domain word clusters;
acquiring the frequency of word clusters in various fields;
and forming a domain word cluster with the frequency larger than the frequency threshold value into a high-frequency domain word library.
In one possible example, in supplementing the domain thesaurus to obtain an associated domain thesaurus, the program 340 is further configured to execute instructions for:
searching a target historical dialogue data set where at least one domain word in a domain word library is located from the historical dialogue data set;
searching the replacement word and/or the similar word of each domain word in the domain word stock according to the preset rule corresponding to the domain type of the domain word stock to obtain the associated domain word stock.
In one possible example, in constructing the first sample set from the high frequency domain thesaurus and the associated domain thesaurus, the program 340 is specifically configured to:
constructing a first sub-sample containing a word stock of the high-frequency domain according to the target historical dialogue data and the preset sample type;
replacing the domain words in the target historical dialogue data according to the associated domain word library to obtain a second sub-sample;
and fusing the first sub-sample and the second sub-sample to obtain a first sample set.
In one possible example, program 340 is further configured to execute instructions for:
selecting a negative sample from the marked file;
obtaining a similarity value between each historical dialogue data and the negative sample in the historical dialogue data set;
and screening the historical dialogue data set with the similarity value larger than the similarity threshold value from the historical dialogue data set to obtain a second sample set.
In one possible example, in determining the text characteristics of the target question, program 340 is specifically configured to execute instructions for:
determining keywords in the target question and the technical field of the target question;
and determining the text features of the target question according to the technical field and the keywords.
The embodiment of the application also provides a computer storage medium, wherein the computer storage medium is used for storing a computer program, the computer program makes a computer execute part or all of the steps of any one of the methods described in the embodiment of the method, and the computer comprises an electronic device and a server.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform some or all of the steps of any one of the methods recited in the method embodiments. The computer program product may be a software installation package, and the computer includes an electronic device and a server.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations. However, those skilled in the art should understand that the present application is not limited by the order of the actions described, as some steps may be performed in another order or simultaneously. Further, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements, merely a logical division of functionality, and there may be additional divisions of actual implementation, e.g., at least one element or component may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, or may be in electrical or other forms.
The elements illustrated as separate elements may or may not be physically separate, and elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over at least one network element. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated unit may be implemented in hardware or as a software program module.
If the integrated unit is implemented as a software program module and sold or used as a stand-alone product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a memory and comprises several instructions for causing a computer (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The embodiments of the present application have been described above in detail. Specific examples are used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is intended only to help in understanding the method of the present application and its core ideas. Meanwhile, those skilled in the art may make changes to the specific implementations and the scope of application in accordance with the ideas of the present application. In summary, the contents of this specification should not be construed as limiting the present application.
Claims (10)
1. A question-answer search method, comprising:
acquiring an answer search request of a target object for a target question;
determining text features of the target question;
inputting the text features into a question-answer search model to obtain a target answer; wherein the question-answer search model is a model that is trained on a target sample set and meets a preset condition, the target sample set comprises a sample set obtained by fusing a first sample set and a second sample set, the first sample set is a sample set obtained by processing a historical dialogue data set, and the second sample set is a sample set obtained by processing the historical dialogue data set and a marked file.
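Read as a pipeline, claim 1 trains a model on a target sample set obtained by fusing two sample sets, stopping when a preset condition holds. A minimal illustrative sketch follows; the deduplicating fusion, the round-based training loop, and all function names are assumptions for illustration, not the patent's definitions:

```python
def fuse_sample_sets(first, second):
    """Target sample set: fuse the first and second sample sets, dropping duplicates."""
    seen, target = set(), []
    for sample in first + second:
        if sample not in seen:
            seen.add(sample)
            target.append(sample)
    return target

def train_until(target_samples, train_step, meets_condition, max_rounds=10):
    """Run training rounds until the preset condition holds (or rounds run out)."""
    model = None
    for _ in range(max_rounds):
        model = train_step(model, target_samples)
        if meets_condition(model):
            break
    return model
```

Here `train_step` and `meets_condition` stand in for whatever model update and acceptance criterion (e.g. a validation-accuracy threshold) an implementation would actually use.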
2. The method as recited in claim 1, further comprising:
analyzing the historical dialogue data set to obtain a domain word stock;
screening the domain word stock to obtain a high-frequency domain word stock;
supplementing the domain word stock to obtain an associated domain word stock;
and constructing the first sample set according to the high-frequency domain word stock and the associated domain word stock.
3. The method of claim 2, wherein the screening the domain word stock to obtain a high-frequency domain word stock comprises:
obtaining a vector representation of each domain word in the domain word stock;
clustering the vector representations of the domain words to obtain at least two domain word clusters;
acquiring the frequency of each domain word cluster;
and forming the domain word clusters with the frequency larger than a frequency threshold into a high-frequency domain word stock.
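The screening of claim 3 (cluster the word vectors, then keep only high-frequency clusters) can be sketched as below. This is a toy illustration only: the tiny k-means, the Euclidean distance metric, and counting frequency as total corpus occurrences of a cluster's words are all assumptions, not the patented method:

```python
import math
import random
from collections import Counter

def kmeans(vectors, k, iters=20, seed=0):
    """Toy k-means over dense vectors; returns one cluster index per vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    assign = [0] * len(vectors)
    for _ in range(iters):
        # assignment step: nearest centroid by Euclidean distance
        for i, v in enumerate(vectors):
            assign[i] = min(range(k), key=lambda c: math.dist(v, centroids[c]))
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign

def high_frequency_clusters(words, vectors, corpus_tokens, k, freq_threshold):
    """Cluster domain words, then keep clusters whose corpus frequency exceeds the threshold."""
    assign = kmeans(vectors, k)
    counts = Counter(corpus_tokens)
    stock = {}
    for c in range(k):
        members = [w for w, a in zip(words, assign) if a == c]
        freq = sum(counts[w] for w in members)
        if freq > freq_threshold:
            stock[c] = members
    return stock
```

In practice the vector representations would come from a trained embedding model rather than the hand-made toy vectors used here.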
4. The method of claim 2, wherein the supplementing the domain word stock to obtain the associated domain word stock comprises:
searching for a replacement word for each domain word in the domain word stock according to a preset rule corresponding to the domain type of the domain word stock;
obtaining similar words for each domain word in the domain word stock;
and supplementing the domain word stock with the replacement words and the similar words to obtain the associated domain word stock.
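The supplementing step of claim 4 can be sketched as a union of the original stock with rule-based replacements and similarity-derived words. In this illustrative sketch the rule table and similarity lookup are passed in as precomputed dictionaries, which is an assumption; a real system would derive replacements from domain-type rules and similar words from, for example, embedding neighbours:

```python
def build_associated_stock(domain_words, replacement_rules, similar_lookup):
    """Supplement the domain word stock with replacement words and similar words."""
    associated = set(domain_words)
    for word in domain_words:
        # preset rules keyed per word, e.g. abbreviation expansion for the domain type
        associated.update(replacement_rules.get(word, []))
        # similar words, e.g. nearest neighbours in an embedding space
        associated.update(similar_lookup.get(word, []))
    return associated
```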
5. The method of claim 3, wherein the constructing the first sample set according to the high-frequency domain word stock and the associated domain word stock comprises:
searching the historical dialogue data set for target historical dialogue data containing at least one domain word in the domain word stock;
constructing a first sub-sample containing at least one domain word in the high-frequency domain word stock according to the target historical dialogue data;
replacing the domain words in the target historical dialogue data with at least one domain word in the associated domain word stock to obtain a plurality of second sub-samples;
and fusing the first sub-sample and the plurality of second sub-samples to obtain the first sample set.
6. The method of any one of claims 1-5, further comprising:
selecting a reference sample corresponding to a preset sample type from the marked file;
obtaining a similarity value between each historical dialogue data item in the historical dialogue data set and the reference sample;
and screening, from the historical dialogue data set, the historical dialogue data whose similarity value is greater than a similarity threshold to obtain the second sample set.
7. The method of any of claims 1-5, wherein the determining text features of the target question comprises:
determining keywords in the target question and the technical field to which the target question belongs;
and determining the text features of the target question according to the technical field and the keywords.
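One plausible reading of claim 7 is that the feature vector combines a domain indicator with keyword features. The sketch below is a guess at such a layout (one-hot technical field concatenated with a keyword bag-of-words); the naive substring-based domain detection and whitespace tokenisation are assumptions for illustration only:

```python
def text_features(question, domains, keyword_vocab):
    """Hypothetical feature layout: one-hot technical field + keyword bag-of-words."""
    # naive domain detection: first domain name that appears in the question
    domain = next((d for d in domains if d in question), None)
    domain_vec = [1.0 if d == domain else 0.0 for d in domains]
    tokens = question.lower().split()
    keyword_vec = [float(tokens.count(k)) for k in keyword_vocab]
    return domain_vec + keyword_vec
```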
8. A question-answering search device, comprising:
the communication unit is used for acquiring an answer search request of the target object for the target question;
a processing unit, configured to determine text features of the target question, and input the text features into a question-answer search model to obtain a target answer; wherein the question-answer search model is a model that is trained on a target sample set and meets a preset condition, the target sample set comprises a sample set obtained by fusing a first sample set and a second sample set, the first sample set is a sample set obtained by processing a historical dialogue data set, and the second sample set is a sample set obtained by processing the historical dialogue data set and a marked file.
9. A computer device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured for execution by the processor, the programs comprising instructions for performing the steps of the method of any of claims 1-7.
10. A computer readable storage medium storing a computer program that, when executed, causes a computer to implement the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211549500.1A CN116361638A (en) | 2022-12-05 | 2022-12-05 | Question and answer searching method, device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116361638A true CN116361638A (en) | 2023-06-30 |
Family
ID=86915266
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211549500.1A Pending CN116361638A (en) | 2022-12-05 | 2022-12-05 | Question and answer searching method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116361638A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11334635B2 (en) | Domain specific natural language understanding of customer intent in self-help | |
CN112131350A (en) | Text label determination method, text label determination device, terminal and readable storage medium | |
CN109086265B (en) | Semantic training method and multi-semantic word disambiguation method in short text | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
CN116775847B (en) | Question answering method and system based on knowledge graph and large language model | |
CN113569011B (en) | Training method, device and equipment of text matching model and storage medium | |
CN115795030A (en) | Text classification method and device, computer equipment and storage medium | |
CN111126067B (en) | Entity relationship extraction method and device | |
US10915756B2 (en) | Method and apparatus for determining (raw) video materials for news | |
CN111813993A (en) | Video content expanding method and device, terminal equipment and storage medium | |
CN115248839A (en) | Knowledge system-based long text retrieval method and device | |
CN111737607B (en) | Data processing method, device, electronic equipment and storage medium | |
CN111460808B (en) | Synonymous text recognition and content recommendation method and device and electronic equipment | |
CN115062135B (en) | Patent screening method and electronic equipment | |
CN116108181A (en) | Client information processing method and device and electronic equipment | |
CN116151258A (en) | Text disambiguation method, electronic device and storage medium | |
CN113569578B (en) | User intention recognition method and device and computer equipment | |
Li et al. | Confidence estimation and reputation analysis in aspect extraction | |
CN114328894A (en) | Document processing method, document processing device, electronic equipment and medium | |
CN114818727A (en) | Key sentence extraction method and device | |
CN114328820A (en) | Information searching method and related equipment | |
CN116361638A (en) | Question and answer searching method, device and storage medium | |
CN111368068A (en) | Short text topic modeling method based on part-of-speech feature and semantic enhancement | |
CN111061939A (en) | Scientific research academic news keyword matching recommendation method based on deep learning | |
CN115618968B (en) | New idea discovery method and device, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
CB02 | Change of applicant information |
Country or region after: China
Address after: 518000 18th floor, building A4, Kexing Science Park, Nanshan District, Shenzhen City, Guangdong Province
Applicant after: Zhaolian Consumer Finance Co.,Ltd.
Address before: 518000 18th floor, building A4, Kexing Science Park, Nanshan District, Shenzhen City, Guangdong Province
Applicant before: MERCHANTS UNION CONSUMER FINANCE Co.,Ltd.
Country or region before: China