CN114090757A - Data processing method of dialogue system, electronic device and readable storage medium - Google Patents


Info

Publication number
CN114090757A
Authority
CN
China
Prior art keywords
data
cluster
question
knowledge
labeling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210039580.XA
Other languages
Chinese (zh)
Other versions
CN114090757B (en
Inventor
罗雪峰
谢延
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202210039580.XA
Publication of CN114090757A
Application granted
Publication of CN114090757B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation

Abstract

The application provides a data processing method of a dialogue system, an electronic device and a readable storage medium. In the method, question and answer data of the dialogue system is acquired in real time, and candidate clusters matching the question are searched for according to the similarity between the sentence vector of the question and the center vectors of the existing semantic clusters. The annotation knowledge of the dialogue annotation data belonging to the candidate clusters is compared with the matching knowledge to determine whether the dialogue annotation data contains first annotation data whose annotation knowledge is consistent with the matching knowledge. If no first annotation data exists, the annotation knowledge of all the dialogue annotation data in the candidate clusters is inconsistent with the matching knowledge, which means that a recognition error may have occurred and that the question-answer knowledge base may contain wrong annotation data; new dialogue annotation data corresponding to the current question and answer data is then generated. If the question-answer knowledge base does contain wrong annotation data, the new dialogue annotation data will contain the wrong annotation data, which can then be found in time during annotation.

Description

Data processing method of dialogue system, electronic device and readable storage medium
Technical Field
The present application relates to artificial intelligence technologies, and in particular, to a data processing method of a dialog system, an electronic device, and a readable storage medium.
Background
A dialogue system is a man-machine dialogue product, such as a dialogue robot, based on natural language processing technology and dialogue management technology. Its main dialogue capabilities include task-based dialogue, Frequently Asked Questions (FAQ), knowledge-graph question answering, document question answering, table question answering, dialogue central control, and the like. For a dialogue system, the dialogue effect is the key indicator of core competitiveness and the most direct source of value, so improving the dialogue effect is the core problem a dialogue system has to solve.
Data annotation and algorithms are both necessary to guarantee the dialogue effect of a dialogue system; besides optimizing the natural language understanding algorithm, dialogue data annotation is therefore also critical. Dialogue data annotation labels users' questions with the corresponding knowledge, and the annotated data is used as training data for the algorithm model in order to improve the dialogue effect.
The first traditional dialogue data annotation scheme is based on detailed dialogue data: question and answer data generated online is stored directly in a database to be annotated, and annotators label the entries manually one by one. Its drawbacks are a large volume of data to annotate, a heavy workload, low efficiency, and no way to annotate hot-spot data in a targeted manner.
The second scheme is based on offline clustering data: question and answer data generated online is stored in a question and answer database, the data is clustered offline or at scheduled times when dialogue data is to be annotated, and annotators label according to the clustering results. This scheme enables automatic labeling, greatly reduces the volume of data to annotate and the workload, improves annotation efficiency to a certain extent, and can also reveal hot-spot data based on the number of questions contained in each cluster. However, its real-time performance is poor, the batch clustering tasks require considerable system resources, and it cannot support parallel processing of a large number of annotation tasks.
The third scheme is based on real-time semantic clustering data: question and answer data generated online is clustered in real time based on semantic similarity, and annotators label according to the clustering results, which improves the real-time performance of dialogue annotation to a certain extent. For a question that has been labeled with certain knowledge, the annotation knowledge of the question can be identified from the annotated data, and the correct answer to the question is obtained.
However, if a question is changed from its correct annotation knowledge to other knowledge for some reason (e.g., an operating error), the dialogue system will give a wrong answer based on the wrong annotation data when a user asks that question. Because the question itself does not change, the result of semantic clustering based on question similarity does not change either, so the inconsistency between the knowledge matched before and after the change cannot be detected, and the wrong annotation data cannot be found.
Disclosure of Invention
The application provides a data processing method of a dialogue system, an electronic device and a readable storage medium.
In one aspect, the present application provides a data processing method for a dialog system, including:
acquiring question and answer data of a dialogue system in real time, wherein the question and answer data comprises a question and matching knowledge;
searching for candidate clusters matching the question according to the similarity between the sentence vector of the question and the center vectors of the existing semantic clusters;
if candidate clusters are found, comparing the annotation knowledge of the dialogue annotation data belonging to the candidate clusters with the matching knowledge, and determining whether the dialogue annotation data contains first annotation data whose annotation knowledge is consistent with the matching knowledge, wherein each piece of dialogue annotation data comprises a question, annotation knowledge and a cluster identifier;
if the first annotation data does not exist, classifying the question and answer data into the candidate cluster whose center vector has the highest similarity to the sentence vector, and generating dialogue annotation data corresponding to the question and answer data.
In another aspect, the present application provides an electronic device comprising:
a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored in the memory to implement the data processing method of the dialogue system described above.
In another aspect, the present application provides a computer-readable storage medium, in which computer-executable instructions are stored, and the computer-executable instructions are executed by a processor to implement the data processing method of the dialog system described above.
According to the data processing method of the dialogue system, the electronic device and the readable storage medium provided by the present application, question and answer data of the dialogue system is acquired in real time, and candidate clusters matching the question are searched for according to the similarity between the sentence vector of the question and the center vectors of the existing semantic clusters. If candidate clusters are found, the annotation knowledge of the dialogue annotation data belonging to the candidate clusters is compared with the matching knowledge to determine whether the dialogue annotation data contains first annotation data whose annotation knowledge is consistent with the matching knowledge, so as to detect recognition errors of the dialogue system and wrong annotation data that may exist in the question-answer knowledge base. If no first annotation data exists, the annotation knowledge of all the dialogue annotation data in the candidate clusters is inconsistent with the matching knowledge; that is, the annotation knowledge of the historical questions that are semantically similar to the question posed by the current user is inconsistent with the matching knowledge of the current question. In that case a recognition error may have occurred and the question-answer knowledge base may contain wrong annotation data, so new dialogue annotation data corresponding to the current question and answer data is generated. If the question-answer knowledge base does contain wrong annotation data, the new dialogue annotation data will contain the wrong annotation data, which can then be found in time during annotation.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a general block diagram of data processing of the dialog system provided herein;
FIG. 2 is a flowchart of a data processing method of a dialogue system according to an embodiment of the present application;
FIG. 3 is a flowchart of a data processing method of a dialogue system according to another embodiment of the present application;
FIG. 4 is a schematic structural diagram of a data processing apparatus of a dialogue system according to an exemplary embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application.
The above drawings show specific embodiments of the present application, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terms referred to in this application are explained first:
Dialogue system: a man-machine dialogue system implemented on the basis of natural language processing technology and dialogue management technology, widely used in scenarios such as intelligent customer service, intelligent outbound calling and intelligent marketing.
Clustering: an unsupervised machine learning method. Given a set of data points, a clustering algorithm can assign each data point to a particular group (also referred to as a cluster or class), such that data points in the same cluster have similar characteristics and data points in different clusters have dissimilar characteristics.
Knowledge: the main dialogue definition data in a dialogue system, used to define the range of services the dialogue system can handle, including question-answer knowledge, flow-type knowledge, table-type knowledge, document-type knowledge, knowledge-graph-type knowledge and the like.
Dialogue data annotation: data and algorithms are both necessary to guarantee the effect of AI products; to improve the dialogue effect, besides optimizing the natural language understanding algorithm, proper data annotation is also critical. Dialogue data annotation labels users' questions with the corresponding knowledge, and the annotated data is used as training data for the algorithm model in order to improve the dialogue effect.
Furthermore, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. In the description of the following embodiments, "plurality" means two or more unless specifically limited otherwise.
After a dialogue system goes online, it generates a large amount of real online question and answer data that reflects the actual business situation of the current scenario. This data has the highest business value, and annotating it well has become one of the important means of improving the core competitiveness of a dialogue system.
The application provides a data processing method of a dialogue system in which the question and answer data generated online by the dialogue system is clustered in real time based on the semantic similarity of the questions and on knowledge matching. This improves the real-time performance and efficiency of dialogue data annotation, allows inconsistencies between the knowledge matched to the same question at different times to be found in time, helps annotators find wrong annotation data in time, and thus continuously improves the dialogue effect of the dialogue system.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a general framework diagram of data processing of a dialogue system provided by the present application. The execution body of the data processing method provided by the present application may be the server on which the dialogue system is deployed, or another electronic device that is independent of the device hosting the dialogue system and can acquire question and answer data from the dialogue system in real time. The dotted box in Fig. 1 only distinguishes the normal question-answering flow of the dialogue system from the data processing method provided by the present application; it does not limit whether the execution body of the method is the server on which the dialogue system is deployed.
As shown in Fig. 1, the normal question-answering flow of the dialogue system is as follows: the user sends a question-answer request to the dialogue system; the dialogue system receives the request, queries the question-answer knowledge base for matching knowledge that matches the question, and outputs the matching knowledge as the reply. The question-answer knowledge base contains data such as preconfigured knowledge and annotation information, and may include question-answer knowledge, flow-type knowledge, table-type knowledge, document-type knowledge, knowledge-graph-type knowledge and the like. The dialogue system determines the matching knowledge for the question through natural language understanding and dialogue management techniques and replies to the user.
The data processing method of the dialogue system can acquire the question and answer data generated by the dialogue system in real time, cluster the question and answer data in real time based on the semantic similarity of the questions and on knowledge matching, and update the dialogue annotation data in the annotation library. After the dialogue annotation data has been annotated, the question-answer knowledge base of the dialogue system is updated according to the annotation results. In addition, statistics on the annotation effect can be compiled from the annotation results of the dialogue annotation data, and the statistical results can be used to optimize the real-time clustering process based on semantic similarity and knowledge matching.
Fig. 2 is a flowchart of a data processing method of a dialog system according to an embodiment of the present application. The data processing method of the dialog system provided in this embodiment may be specifically applied to an electronic device in which the dialog system is located, and the electronic device may be a dialog robot, a terminal device, a server, and the like.
As shown in Fig. 2, the method comprises the following specific steps:
Step S201: acquire question and answer data of the dialogue system in real time, wherein the question and answer data comprises a question and matching knowledge.
In practical application, the dialogue system generates question and answer data during online use. The question and answer data comprises the question posed by the user and the matching knowledge that the dialogue system matched to that question.
In this embodiment, the electronic device may acquire the question and answer data generated by the dialogue system in real time. Optionally, after matching knowledge to the question, the dialogue system sends the question and answer data to a message queue asynchronously. The electronic device acts as a consumer and consumes the question and answer data from the message queue, so that the data is acquired in real time while the impact on the response time of the question-answering service is reduced.
In addition, the electronic device may also acquire the question and answer data of the dialog system in real time in other manners, and the specific implementation manner of acquiring the question and answer data in real time is not specifically limited in this embodiment.
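As a non-limiting illustration, the asynchronous acquisition described above might be sketched as follows. The in-process `queue.Queue`, the record fields and the function names are assumptions made for this sketch only; the application does not specify a particular message queue product or record layout.

```python
import queue
import threading

# Hypothetical in-process stand-in for the message queue between the
# dialogue system (producer) and the data processing device (consumer).
qa_queue = queue.Queue()
processed = []

def dialogue_system_reply(question, matching_knowledge):
    """Producer: enqueue the Q&A record, then return the reply immediately."""
    qa_queue.put({"question": question, "matching_knowledge": matching_knowledge})
    return matching_knowledge  # the reply path does not wait for clustering

def consume_forever():
    """Consumer: take Q&A records off the queue as they arrive."""
    while True:
        record = qa_queue.get()
        if record is None:        # sentinel used here to stop the worker
            qa_queue.task_done()
            break
        processed.append(record)  # real-time clustering would happen here
        qa_queue.task_done()

worker = threading.Thread(target=consume_forever, daemon=True)
worker.start()
dialogue_system_reply("How do I reset my password?", "K_reset_password")
dialogue_system_reply("Where is my invoice?", "K_invoice")
qa_queue.put(None)
qa_queue.join()  # block until every queued item has been handled
```

Because the producer only enqueues, the time spent on clustering does not add to the response time of the question-answering service.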
Step S202: search for candidate clusters matching the question according to the similarity between the sentence vector of the question and the center vectors of the existing semantic clusters.
In this embodiment, the existing semantic clusters are clusters (also called classes) generated by clustering the existing question and answer data of the dialogue system in real time, and each cluster contains one or more pieces of dialogue annotation data. Each piece of dialogue annotation data comprises a question, annotation knowledge and the cluster identifier of the semantic cluster to which it belongs. The questions of the dialogue annotation data within the same cluster are semantically similar.
For the question and answer data acquired in real time, this step searches for candidate clusters matching the question based on the similarity between the sentence vector of the question and the center vectors of the semantic clusters, thereby performing semantic clustering on the question and answer data.
In this step, if candidate clusters matching the question are found, there may be one or more of them. For each candidate cluster, the annotation knowledge in the dialogue annotation data belonging to it is further compared with the matching knowledge in step S203.
Step S203: if candidate clusters are found, compare the annotation knowledge of the dialogue annotation data belonging to the candidate clusters with the matching knowledge, and determine whether the dialogue annotation data contains first annotation data whose annotation knowledge is consistent with the matching knowledge, wherein each piece of dialogue annotation data comprises a question, annotation knowledge and a cluster identifier.
In this embodiment, "first annotation data" denotes the dialogue annotation data in the candidate clusters whose annotation knowledge is consistent with the matching knowledge.
In this step, if candidate clusters matching the question exist, the question of each piece of dialogue annotation data in the candidate clusters is semantically similar to the question posed by the current user. The annotation knowledge of the dialogue annotation data in each candidate cluster is then compared with the matching knowledge to determine whether they are consistent, and the first annotation data whose annotation knowledge is consistent with the matching knowledge is thereby determined.
If the annotation knowledge of all the dialogue annotation data in the candidate clusters is inconsistent with the matching knowledge, it is determined that no first annotation data exists.
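The consistency check of step S203 might look like the following minimal sketch. The `DialogueAnnotation` class mirrors the three fields named in the text (question, annotation knowledge, cluster identifier); the class name, the function name and the sample knowledge identifiers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DialogueAnnotation:
    question: str
    annotation_knowledge: str
    cluster_id: str

def find_first_annotation_data(annotations, candidate_cluster_ids, matching_knowledge):
    """Return the annotation data in the candidate clusters whose annotation
    knowledge equals the matching knowledge (the 'first annotation data')."""
    candidates = set(candidate_cluster_ids)
    return [
        a for a in annotations
        if a.cluster_id in candidates and a.annotation_knowledge == matching_knowledge
    ]

library = [
    DialogueAnnotation("how to reset password", "K_reset_password", "c1"),
    DialogueAnnotation("forgot my password", "K_account_locked", "c1"),
    DialogueAnnotation("where is my invoice", "K_invoice", "c2"),
]
consistent = find_first_annotation_data(library, ["c1"], "K_reset_password")
conflicting = find_first_annotation_data(library, ["c1"], "K_invoice")
```

An empty result, as for `conflicting`, corresponds to the case in which no first annotation data exists.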
Step S204: if the first annotation data does not exist, classify the question and answer data into the candidate cluster whose center vector has the highest similarity to the sentence vector, and generate dialogue annotation data corresponding to the question and answer data.
If no first annotation data exists, the annotation knowledge of all the dialogue annotation data in the candidate clusters is inconsistent with the matching knowledge; that is, the annotation knowledge of the historical questions that are semantically similar to the question posed by the current user is inconsistent with the matching knowledge of the current question. The matching knowledge for the current question may have been identified incorrectly, and the question-answer knowledge base may contain wrong annotation data, so new dialogue annotation data corresponding to the current question and answer data is generated.
If the question and answer was identified incorrectly and the question-answer knowledge base contains wrong annotation data, the new dialogue annotation data will contain the wrong annotation data, which can then be found in time during annotation.
In this embodiment, question and answer data of the dialogue system is acquired in real time, and candidate clusters matching the question are searched for according to the similarity between the sentence vector of the question and the center vectors of the existing semantic clusters. If candidate clusters are found, the annotation knowledge of the dialogue annotation data belonging to the candidate clusters is compared with the matching knowledge to determine whether the dialogue annotation data contains first annotation data whose annotation knowledge is consistent with the matching knowledge, so as to detect recognition errors of the dialogue system and wrong annotation data that may exist in the question-answer knowledge base. If no first annotation data exists, the annotation knowledge of all the dialogue annotation data in the candidate clusters is inconsistent with the matching knowledge; that is, the annotation knowledge of the historical questions that are semantically similar to the question posed by the current user is inconsistent with the matching knowledge of the current question. The matching knowledge for the current question may have been identified incorrectly, and the question-answer knowledge base may contain wrong annotation data, so new dialogue annotation data corresponding to the current question and answer data is generated. If the question and answer was identified incorrectly and the question-answer knowledge base contains wrong annotation data, the new dialogue annotation data will contain the wrong annotation data, which can then be found in time during annotation.
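Steps S202 to S204 of this embodiment might be combined as in the sketch below. Several points are assumptions made for illustration: cosine similarity is used as the similarity measure, the threshold value 0.8 is arbitrary, the new record stores the matching knowledge as its provisional annotation knowledge, and the two branches this embodiment does not describe simply return `None`.

```python
import math
from dataclasses import dataclass

@dataclass
class DialogueAnnotation:
    question: str
    annotation_knowledge: str
    cluster_id: str

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def process_qa(question, sentence_vec, matching_knowledge, centers, annotations,
               threshold=0.8):
    # S202: candidate clusters are those whose center is similar enough.
    sims = {cid: cosine(sentence_vec, center) for cid, center in centers.items()}
    candidates = [cid for cid, sim in sims.items() if sim >= threshold]
    if not candidates:
        return None  # no candidate cluster: not covered by this sketch
    # S203: look for first annotation data (annotation == matching knowledge).
    first = [a for a in annotations
             if a.cluster_id in candidates
             and a.annotation_knowledge == matching_knowledge]
    if first:
        return None  # consistent data exists: not covered by this sketch
    # S204: join the most similar candidate cluster and emit a new record.
    best = max(candidates, key=lambda cid: sims[cid])
    record = DialogueAnnotation(question, matching_knowledge, best)
    annotations.append(record)
    return record

centers = {"c1": [1.0, 0.0], "c2": [0.0, 1.0]}
library = [DialogueAnnotation("how to reset password", "K_reset_password", "c1")]
new_record = process_qa("how do I reset my password", [0.9, 0.1],
                        "K_account_locked", centers, library)
```

In this example the current question lands in cluster `c1` with a different knowledge than the historical record, so a new record is generated for the annotator to review.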
Fig. 3 is a flowchart of a data processing method of a dialogue system according to another embodiment of the present application. On the basis of the embodiment corresponding to Fig. 2, in this embodiment, if candidate clusters are found, the annotation knowledge of the dialogue annotation data belonging to the candidate clusters is compared with the matching knowledge to determine whether the dialogue annotation data contains first annotation data whose annotation knowledge is consistent with the matching knowledge. If first annotation data exists, the first annotation data whose semantic cluster has the center vector most similar to the sentence vector is taken as second annotation data, and the second cluster to which the second annotation data belongs is determined. If the similarity between the center vector of the second cluster and the sentence vector is smaller than the merging threshold, the question and answer data is classified into the second cluster and dialogue annotation data corresponding to the question and answer data is generated; if the similarity between the center vector of the second cluster and the sentence vector is greater than or equal to the merging threshold, the merged data amount corresponding to the second annotation data is incremented by 1.
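The merging decision might be sketched as follows, assuming that the merge branch applies when the similarity reaches the merging threshold and the new-record branch applies otherwise. The dictionary fields, the `merged_count` name and the threshold value 0.95 are hypothetical.

```python
def handle_consistent_match(qa, second_annotation, similarity, merge_threshold,
                            annotations):
    """Decide between merging into the second annotation data and adding a
    new record to the second cluster."""
    if similarity >= merge_threshold:
        # Near-duplicate of existing annotated data: only count it.
        second_annotation["merged_count"] += 1
        return "merged"
    # Similar enough for the cluster, but distinct: keep it as its own record.
    annotations.append({
        "question": qa["question"],
        "annotation_knowledge": qa["matching_knowledge"],
        "cluster_id": second_annotation["cluster_id"],
        "merged_count": 0,
    })
    return "added"

second = {"question": "how to reset password",
          "annotation_knowledge": "K_reset_password",
          "cluster_id": "c1", "merged_count": 0}
library = [second]
qa = {"question": "how do I reset my password",
      "matching_knowledge": "K_reset_password"}
outcome_high = handle_consistent_match(qa, second, 0.97, 0.95, library)
outcome_low = handle_consistent_match(qa, second, 0.90, 0.95, library)
```

Counting merged data instead of storing every near-duplicate keeps the volume of data to annotate small while still recording how often a question occurs.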
As shown in Fig. 3, the method comprises the following specific steps:
Step S301: acquire question and answer data of the dialogue system in real time, wherein the question and answer data comprises a question and matching knowledge.
In practical application, the dialogue system generates question and answer data during online use. The question and answer data comprises the question posed by the user and the matching knowledge that the dialogue system matched to that question.
In this embodiment, the electronic device may acquire the question and answer data generated by the dialogue system in real time. Optionally, after matching knowledge to the question, the dialogue system sends the question and answer data to a message queue asynchronously. The electronic device acts as a consumer and consumes the question and answer data from the message queue, so that the data is acquired in real time while the impact on the response time of the question-answering service is reduced.
In addition, the electronic device may also acquire the question and answer data of the dialog system in real time in other manners, and the specific implementation manner of acquiring the question and answer data in real time is not specifically limited in this embodiment.
Optionally, if the matching knowledge in the question and answer data is empty or is preset information, the question-answer knowledge base of the dialogue system does not cover the question posed by the current user. In that case the subsequent step S309 may be executed: a new semantic cluster is generated with the sentence vector as its center vector, the question and answer data is placed in the new semantic cluster, and dialogue annotation data corresponding to the question and answer data is generated, so that new questions not covered by the question-answer knowledge base can be found in time when the dialogue annotation data is annotated.
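The handling of uncovered questions might be sketched as follows. The preset value `"NO_MATCH"`, the cluster naming scheme and the record fields are assumptions made for illustration only.

```python
import itertools

_new_ids = itertools.count(1)
PRESET_NO_ANSWER = "NO_MATCH"  # hypothetical preset fallback value

def is_uncovered(matching_knowledge):
    """True when the knowledge base returned nothing or only the preset value."""
    return not matching_knowledge or matching_knowledge == PRESET_NO_ANSWER

def open_new_cluster(question, sentence_vec, matching_knowledge, centers, annotations):
    """Create a semantic cluster centered on the sentence vector and file the
    question and answer data under it."""
    cluster_id = f"new-{next(_new_ids)}"
    centers[cluster_id] = list(sentence_vec)
    record = {"question": question,
              "annotation_knowledge": matching_knowledge,
              "cluster_id": cluster_id}
    annotations.append(record)
    return record

centers, annotations = {}, []
qa = {"question": "can I pay with crypto", "matching_knowledge": ""}
if is_uncovered(qa["matching_knowledge"]):
    record = open_new_cluster(qa["question"], [0.2, 0.8],
                              qa["matching_knowledge"], centers, annotations)
```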
To distinguish the question posed by the current user from the questions in the existing dialogue annotation data and the historical question and answer data, this embodiment uses "current question" to refer to the question in the question and answer data acquired in real time.
Step S302: search for candidate clusters matching the current question according to the similarity between the sentence vector of the current question and the center vectors of the existing semantic clusters.
The existing semantic clusters are clusters (also called classes) generated by clustering the existing question and answer data of the dialogue system in real time, and each cluster contains one or more pieces of dialogue annotation data. Each piece of dialogue annotation data comprises a question, annotation knowledge and the cluster identifier of the semantic cluster to which it belongs. The questions of the dialogue annotation data within the same cluster are semantically similar.
For the question and answer data acquired in real time, this step calculates the sentence vector of the current question and searches for candidate clusters matching the current question based on the similarity between that sentence vector and the center vectors of the semantic clusters, thereby performing semantic clustering on the question and answer data.
Optionally, the sentence vector of the question may be generated by an algorithm such as word vectors combined with WMD (Word Mover's Distance) or weighted averaging of word vectors (SIF, Smooth Inverse Frequency), may be generated directly by a trained sentence vector model, or may be obtained by any conventional method for generating sentence vectors, which is not specifically limited in this embodiment.
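As one illustration of the weighted-averaging option, the sketch below computes a sentence vector with SIF-style weights a / (a + p(w)), where p(w) is the estimated frequency of word w. It is a simplification: the full SIF method additionally removes the projection onto the first principal component, which is omitted here. The word vectors and frequencies are toy values chosen for the example.

```python
def weighted_sentence_vector(tokens, word_vectors, word_prob, a=1e-3):
    """Average the word vectors of the known tokens, down-weighting
    frequent words with the weight a / (a + p(w))."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        if tok not in word_vectors:
            continue  # tokens without a vector are skipped
        weight = a / (a + word_prob.get(tok, 0.0))
        for i, x in enumerate(word_vectors[tok]):
            total[i] += weight * x
        count += 1
    return [x / count for x in total] if count else total

# Toy data: "the" is very frequent, "reset" is rare.
word_vectors = {"reset": [1.0, 0.0], "the": [0.0, 1.0]}
word_prob = {"reset": 0.0, "the": 0.999}
vec = weighted_sentence_vector(["reset", "the", "unknown"], word_vectors, word_prob)
```

The rare word dominates the resulting vector, which is the intended effect of the weighting.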
Optionally, in this step, searching for candidate clusters matching the current question according to the similarity between the sentence vector of the current question and the center vectors of the existing semantic clusters may be implemented as follows:
searching the existing dialogue annotation data according to the current question, and determining third annotation data whose question is similar to the current question and the third clusters to which the third annotation data belongs; calculating the similarity between the sentence vector of the current question and the center vector of each third cluster; and determining each third cluster whose center vector has a similarity to the sentence vector of the current question greater than or equal to the semantic clustering threshold as a candidate cluster matching the current question.
The semantic clustering threshold determines the clustering radius. The larger the threshold, the smaller the clustering radius and the more similar the data within a cluster; the clustering becomes finer and the number of clusters grows, so the labeling workload and the labeling precision both increase. The semantic clustering threshold can be set and adjusted according to the requirements of the practical application scenario. Adjusting it changes the number of semantic clusters and therefore the labeling workload, which allows the labeling workload to be balanced against the labeling effect.
Specifically, dialogue labeling data whose questions are similar to the current question is retrieved from the dialogue labeling data of the labeling library to obtain the third labeling data, and the semantic clusters to which the third labeling data belongs are determined to obtain the third clusters. The third clusters are then filtered by the similarity between their center vectors and the sentence vector of the current question, and those with high similarity are taken as the candidate clusters matching the current question.
Optionally, the similarity between the center vector of every existing semantic cluster and the sentence vector of the current question may be calculated and compared with the preset semantic clustering threshold, and each semantic cluster whose center vector has a similarity with the sentence vector of the current question greater than or equal to the semantic clustering threshold is directly determined as a candidate cluster matching the current question.
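Illustratively, comparing the current question against every existing cluster center can be sketched as follows. The sketch assumes cosine similarity as the similarity measure and an in-memory dictionary of center vectors; the names `cosine_similarity`, `find_candidate_clusters`, and `cluster_threshold`, as well as the toy vectors, are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_candidate_clusters(question_vec: List[float],
                            centers: Dict[str, List[float]],
                            cluster_threshold: float) -> List[Tuple[str, float]]:
    """Return (cluster_id, similarity) pairs at or above the threshold, best match first."""
    scored = [(cid, cosine_similarity(question_vec, c)) for cid, c in centers.items()]
    hits = [(cid, s) for cid, s in scored if s >= cluster_threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy center vectors (hypothetical values).
centers = {"c1": [1.0, 0.0], "c2": [0.8, 0.6], "c3": [0.0, 1.0]}
candidates = find_candidate_clusters([1.0, 0.1], centers, cluster_threshold=0.8)
```

An empty result corresponds to the case in which no candidate cluster is found and a new semantic cluster is generated.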
Optionally, the similarity between the sentence vector of the current question and the center vector of a semantic cluster may be calculated from any one or more of the cosine distance, Euclidean distance, Manhattan distance, Hamming distance, and the like between the vectors, or by another method for calculating text similarity, which is not specifically limited in this embodiment.
In this step, if a candidate cluster is found, step S303 is executed. If no candidate cluster is found, step S309 is executed to generate a new semantic cluster using the sentence vector of the current question as the central vector, put the question-answer data into the new semantic cluster, and generate the dialogue marking data corresponding to the question-answer data.
Step S303: compare the labeling knowledge of the dialogue labeling data belonging to the candidate clusters with the matching knowledge of the question and answer data, and determine whether the dialogue labeling data contains first labeling data whose labeling knowledge is consistent with the matching knowledge.
Wherein each dialogue labeling data comprises a question, labeling knowledge and a cluster identifier.
In this embodiment, the first labeling data refers to the dialogue labeling data in the candidate clusters whose labeling knowledge is consistent with the matching knowledge.
In this step, if candidate clusters matching the current question exist, the questions of the dialogue labeling data in the candidate clusters are semantically similar to the current question. The labeling knowledge of each piece of dialogue labeling data in each candidate cluster is then compared with the matching knowledge to determine whether the two are consistent, and the first labeling data, whose labeling knowledge is consistent with the matching knowledge, is identified.
In this step, if the labeling knowledge of all the dialogue labeling data in the candidate clusters is inconsistent with the matching knowledge, it is determined that the first labeling data does not exist, step S308 is executed, the question and answer data is classified into the candidate cluster with the highest similarity between the central vector and the sentence vector of the current question, and the dialogue labeling data corresponding to the question and answer data is generated.
In this step, if there is first annotation data with annotation knowledge matching the matching knowledge in the dialog annotation data, step S304 is executed.
Optionally, if comparing the labeling knowledge of the dialogue labeling data belonging to the candidate clusters with the matching knowledge shows that the dialogue labeling data of the candidate clusters contains fourth labeling data whose labeling knowledge is inconsistent with the matching knowledge, labeling reminder warning information is pushed, where the labeling reminder warning information is used to prompt timely labeling of the fourth labeling data.
The specific content and the pushing mode of the labeling reminding early warning information can be set and adjusted according to the needs of the actual application scene, and are not specifically limited here.
Optionally, if comparing the labeling knowledge of the dialogue labeling data belonging to the candidate clusters with the matching knowledge shows that there is both fourth labeling data whose labeling knowledge is inconsistent with the matching knowledge and first labeling data whose labeling knowledge is consistent with the matching knowledge, this indicates that semantically similar questions correspond to different labeling knowledge. Warning information can then be pushed to prompt the relevant personnel to label the first labeling data and the fourth labeling data in time and to correct wrongly labeled data promptly.
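Illustratively, the knowledge comparison of step S303 and the optional warning can be sketched as follows. The sketch assumes that knowledge is represented by an identifier string and that "consistent" means equal identifiers; the `LabelItem` structure, the function names, and the message wording are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LabelItem:
    question: str
    labeling_knowledge: str  # assumed to be a knowledge identifier
    cluster_id: str


def split_by_knowledge(items: List[LabelItem],
                       matching_knowledge: str) -> Tuple[List[LabelItem], List[LabelItem]]:
    """Split candidate-cluster items into consistent ("first") and inconsistent ("fourth")."""
    first = [it for it in items if it.labeling_knowledge == matching_knowledge]
    fourth = [it for it in items if it.labeling_knowledge != matching_knowledge]
    return first, fourth


def warning_message(fourth: List[LabelItem]) -> Optional[str]:
    """Build a reminder text when inconsistent items exist; None when there is nothing to report."""
    if not fourth:
        return None
    cluster_ids = sorted({it.cluster_id for it in fourth})
    return f"{len(fourth)} item(s) need labeling review in cluster(s): {', '.join(cluster_ids)}"


candidate_items = [
    LabelItem("how do I reset my password", "K1", "c1"),
    LabelItem("password reset not working", "K2", "c1"),
]
first, fourth = split_by_knowledge(candidate_items, matching_knowledge="K1")
```

When `first` is empty the flow continues with step S308, and when it is non-empty the flow continues with step S304.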
Step S304: if the first labeling data exists, take the first labeling data whose semantic cluster has the center vector with the maximum similarity to the sentence vector of the current question as second labeling data, and determine the second cluster to which the second labeling data belongs.
The question of the first labeling data is semantically similar to the current question, and the labeling knowledge of the first labeling data is consistent with the matching knowledge of the current question.
If first labeling data with a semantically similar question and consistent knowledge exists, then in this step, among the semantic clusters to which the first labeling data belongs, the semantic cluster whose center vector has the highest similarity with the sentence vector of the current question is determined as the second cluster. The second cluster is the semantic cluster corresponding to the question and answer data.
Step S305: determine whether the similarity between the central vector of the second cluster and the sentence vector of the current question is smaller than the merging threshold.
Wherein the merging threshold is greater than the semantic clustering threshold.
In this embodiment, pieces of dialogue labeling data whose semantic similarity is greater than or equal to the merging threshold and whose knowledge is consistent are merged into one piece of dialogue labeling data, and the merged data amount of that piece is recorded. The merged data amount of a piece of dialogue labeling data is the number of pieces of dialogue labeling data that have been merged into it; the higher the merged data amount, the more popular the question of that dialogue labeling data. The merged data amount of newly generated dialogue labeling data is 1.
In this step, if the similarity between the central vector of the second cluster and the sentence vector of the current question is smaller than the merge threshold, step S306 is executed to put the question and answer data into the second cluster and generate the dialogue marking data corresponding to the question and answer data.
If the similarity between the central vector of the second cluster and the sentence vector of the current question is greater than or equal to the merging threshold, step S307 is executed to add 1 to the merged data amount corresponding to the second labeling data. This is equivalent to merging the dialogue labeling data corresponding to the question and answer data into the second labeling data.
In addition, the merging threshold may be set and adjusted according to the needs of the actual application scenario, and is not specifically limited herein.
Step S306: if the similarity between the central vector of the second cluster and the sentence vector of the current question is smaller than the merging threshold, classify the question and answer data into the second cluster and generate the dialogue labeling data corresponding to the question and answer data.
If the similarity between the central vector of the second cluster and the sentence vector of the current question is smaller than the merging threshold, the dialogue marking data corresponding to the current question and answer data is not merged with the second marking data, but the question and answer data is classified into the second cluster, and a new dialogue marking data is generated for the question and answer data.
In this embodiment, a question in the dialog annotation data corresponding to the question-answer data is a question of the question-answer data, annotation knowledge in the dialog annotation data is matching knowledge of the question-answer data, and a cluster identifier in the dialog annotation data is an identifier of a semantic cluster to which the question-answer data belongs.
Step S307: if the similarity between the central vector of the second cluster and the sentence vector of the current question is greater than or equal to the merging threshold, add 1 to the merged data amount corresponding to the second labeling data.
If the similarity between the central vector of the second cluster and the sentence vector of the current question is greater than or equal to the merging threshold, the dialogue labeling data corresponding to the current question and answer data is merged into the second labeling data. In this case, 1 is added to the merged data amount corresponding to the second labeling data and no new dialogue labeling data is generated for the current question and answer data, which greatly reduces the amount of dialogue labeling data that needs to be labeled, reduces the labeling workload, and improves labeling efficiency.
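Illustratively, steps S304 to S307 can be sketched as follows. This is a minimal sketch that assumes the similarity between the current question and each cluster center has already been computed and is passed in as a dictionary; the `LabelItem` structure and the function name `place_question` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LabelItem:
    question: str
    labeling_knowledge: str
    cluster_id: str
    merged_count: int = 1  # newly generated labeling data starts at 1


def place_question(question: str,
                   matching_knowledge: str,
                   first_items: List[LabelItem],
                   center_similarity: Dict[str, float],
                   merge_threshold: float,
                   library: List[LabelItem]) -> LabelItem:
    # S304: the first labeling data whose cluster center is most similar becomes the second labeling data.
    second = max(first_items, key=lambda it: center_similarity[it.cluster_id])
    similarity = center_similarity[second.cluster_id]
    if similarity >= merge_threshold:
        # S307: merge by incrementing the merged data amount; no new item is created.
        second.merged_count += 1
        return second
    # S306: same cluster, but a new piece of labeling data is generated.
    new_item = LabelItem(question, matching_knowledge, second.cluster_id)
    library.append(new_item)
    return new_item


library = [LabelItem("reset password", "K1", "c1"), LabelItem("change password", "K1", "c2")]
merged = place_question("reset my password", "K1", library[:],
                        {"c1": 0.97, "c2": 0.85}, 0.95, library)
created = place_question("forgot password", "K1", [library[1]],
                         {"c2": 0.85}, 0.95, library)
```

In the first call the similarity 0.97 reaches the merging threshold, so only the counter changes; in the second call the similarity 0.85 is below the merging threshold, so a new item is added to cluster `c2`.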
Optionally, the center vector of the second cluster may be updated according to the sentence vector of the current question before steps S306 and S307.
After the semantic cluster corresponding to the question and answer data is determined, the central vector of that semantic cluster is updated according to the sentence vector of the question and answer data, and the cluster radius is controlled by the semantic clustering threshold. This keeps the similarity of the data within a semantic cluster controllable and prevents data with low similarity from being gathered into one cluster through the transitive propagation of similarity.
Optionally, updating the central vector of the target cluster according to the sentence vector of the current question may be specifically implemented as follows:
determining the merged data amount corresponding to the target cluster, where the merged data amount corresponding to the target cluster is equal to the sum of the merged data amounts of all the dialogue labeling data belonging to the target cluster; and taking the merged data amount corresponding to the target cluster as the weight of the current central vector of the target cluster, calculating the weighted mean of the current central vector of the target cluster and the sentence vector of the current question to obtain the new central vector of the target cluster, and updating the central vector of the target cluster accordingly.
The target cluster may be any one of the clusters, such as the second cluster described above.
Illustratively, let $V_0$ denote the sentence vector of the current question and let $n_i$ denote the merged data amount of the $i$-th piece of dialogue labeling data belonging to the target cluster, so that the merged data amount of all the dialogue labeling data belonging to the target cluster can be expressed as $N = \sum_i n_i$. Let $V_1$ denote the central vector of the target cluster before updating and $V_2$ the central vector of the target cluster after updating. Then $V_2$ can be calculated as:
$$V_2 = \frac{N \cdot V_1 + V_0}{N + 1}$$
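The weighted-mean update of the central vector can be sketched as follows. The sketch assumes that the current central vector is weighted by the merged data amount of the target cluster and that the sentence vector of the current question is given weight 1; the function name `update_center` and the toy values are hypothetical.

```python
from typing import List


def update_center(center: List[float],
                  merged_counts: List[int],
                  sentence_vec: List[float]) -> List[float]:
    """Weighted mean of the old center (weight = total merged data amount) and the new vector (weight 1)."""
    n = sum(merged_counts)  # merged data amount corresponding to the target cluster
    return [(n * c + s) / (n + 1) for c, s in zip(center, sentence_vec)]


# A cluster holding two pieces of labeling data with merged data amounts 2 and 1.
new_center = update_center([1.0, 0.0], [2, 1], [0.0, 1.0])
```

Because the old center carries the accumulated weight, a single new question moves the center of a large cluster only slightly, which is consistent with the stated goal of keeping the cluster radius controllable.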
Step S308, if the first labeling data does not exist, the question and answer data are classified into the candidate cluster with the highest similarity between the central vector and the sentence vector of the current question, and the dialogue labeling data corresponding to the question and answer data are generated.
If it is determined in step S303 that the first labeling data does not exist, the candidate clusters contain dialogue labeling data whose questions are semantically similar to the current question but whose labeling knowledge is inconsistent with the matching knowledge of the current question. In this case, the question and answer data is classified into the candidate cluster whose central vector has the highest similarity with the sentence vector of the current question.
Optionally, before the dialogue labeling data corresponding to the question and answer data is generated, the first cluster, whose center vector has the highest similarity with the sentence vector of the current question, is determined, and the central vector of the first cluster is updated according to the sentence vector of the current question. When new question and answer data is added to a semantic cluster, the central vector of the semantic cluster is updated according to the sentence vector of the question and answer data, and the cluster radius is controlled by the semantic clustering threshold. This keeps the similarity of the data within a semantic cluster controllable and prevents data with low similarity from being gathered into one cluster through the transitive propagation of similarity.
Wherein the first cluster refers to a candidate cluster with the center vector having the highest similarity with the sentence vector of the current problem.
Optionally, the candidate clusters may be ranked according to the similarity between the sentence vector of the current problem and the center vector of each candidate cluster, and the candidate cluster with the center vector having the highest similarity with the sentence vector of the current problem is determined according to the ranking result.
Optionally, updating the central vector of the target cluster according to the sentence vector of the current question may be specifically implemented as follows:
determining the merged data amount corresponding to the target cluster, where the merged data amount corresponding to the target cluster is equal to the sum of the merged data amounts of all the dialogue labeling data belonging to the target cluster; and taking the merged data amount corresponding to the target cluster as the weight of the current central vector of the target cluster, calculating the weighted mean of the current central vector of the target cluster and the sentence vector of the current question to obtain the new central vector of the target cluster, and updating the central vector of the target cluster accordingly.
The target cluster may be any one of the clusters, such as the first cluster described above.
Step S309, generating a new semantic cluster taking the sentence vector of the current question as a central vector, classifying the question and answer data into the new semantic cluster, and generating dialogue marking data corresponding to the question and answer data.
If no candidate cluster matching the current question is found in step S302, a new semantic cluster is generated for the current question in this step, and the sentence vector of the current question is used as the central vector of the new semantic cluster. After the semantic cluster corresponding to the current question is determined, the dialogue labeling data corresponding to the question and answer data is generated and added to that semantic cluster.
In this embodiment, if the matching knowledge in the question-answer data is null or preset information, it indicates that the question-answer knowledge base of the dialog system does not cover the current question, a new semantic cluster is generated for the current question, and a sentence vector of the current question is used as a central vector of the semantic cluster. After determining the semantic cluster corresponding to the current question, generating dialogue labeling data corresponding to the question and answer data, and adding the dialogue labeling data into the semantic cluster.
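Illustratively, the generation of a new semantic cluster in step S309 can be sketched as follows. The `Cluster` and `LabelItem` structures, the cluster identifier scheme, and the fallback marker `"SORRY_NO_ANSWER"` standing in for the "preset information" are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class LabelItem:
    question: str
    labeling_knowledge: Optional[str]
    cluster_id: str
    merged_count: int = 1  # newly generated labeling data starts at 1


@dataclass
class Cluster:
    center: List[float]
    items: List[LabelItem] = field(default_factory=list)


def is_uncovered(matching_knowledge: Optional[str],
                 fallback_answers: Sequence[str] = ("SORRY_NO_ANSWER",)) -> bool:
    """True when the matching knowledge is null or a preset fallback (hypothetical marker)."""
    return matching_knowledge is None or matching_knowledge in fallback_answers


def create_cluster(clusters: Dict[str, Cluster], question: str,
                   sentence_vec: List[float],
                   matching_knowledge: Optional[str]) -> str:
    """Create a cluster centered on the question's sentence vector and add its labeling data."""
    cluster_id = f"cluster-{len(clusters) + 1}"  # hypothetical identifier scheme
    item = LabelItem(question, matching_knowledge, cluster_id)
    clusters[cluster_id] = Cluster(center=list(sentence_vec), items=[item])
    return cluster_id


clusters: Dict[str, Cluster] = {}
new_id = create_cluster(clusters, "how do I export my invoices", [0.2, 0.9], None)
```

The identifier scheme based on the number of existing clusters only works while clusters are never deleted; a real system would likely use a generated unique identifier instead.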
In this embodiment, the dialog annotation data may be stored in the annotation library, and after the dialog annotation data corresponding to the question-answer data is generated in any of the above steps, the newly generated dialog annotation data is inserted into the annotation library. In step S307, the merged data amount corresponding to the second annotation data is added by 1, and the annotation library is updated.
In this embodiment, in order to balance the labeling workload against the richness of the labeling data, a merging threshold may be set that is greater than the semantic clustering threshold. Two pieces of dialogue labeling data are considered for merging only when they belong to the same semantic cluster and their matched knowledge is also consistent. Because the clustering of the labeling data takes both semantic similarity and knowledge matching into account, changes in knowledge matching that arise for various reasons, as well as identification errors and wrongly labeled data, can be found in time, and the relevant personnel can be reminded to intervene promptly, which improves the accuracy of the dialogue system.
Step S310, responding to the dialogue labeling data request, and displaying dialogue labeling data in a classified mode according to semantic clustering to which the dialogue labeling data belong; and sequentially displaying the dialogue marking data belonging to the same semantic cluster according to the sequence of the combined data volume from large to small.
In practical application, when labeling, an annotator can view the dialogue labeling data in the labeling library and label the dialogue labeling data that is to be labeled.
The dialogue annotation data request is triggered when the annotation personnel checks the dialogue annotation data in the annotation library, and the specific triggering mode is not specifically limited here.
In this step, the dialogue annotation data in the annotation library is displayed in response to the dialogue annotation data request.
When displaying the dialogue labeling data in the labeling library, the dialogue labeling data can be displayed in a classified manner according to the semantic cluster to which the dialogue labeling data belongs. Alternatively, the semantic clusters may be arranged in descending order according to the merged data amount corresponding to each semantic cluster.
For the dialogue labeling data belonging to the same semantic cluster, the dialogue labeling data can be displayed in sequence according to the descending order of the merged data volume of the dialogue labeling data.
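Illustratively, the display ordering described above can be sketched as follows. The sketch assumes that each piece of dialogue labeling data is a dictionary with the hypothetical keys `cluster_id` and `merged_count`, and it applies the optional descending ordering of clusters by their total merged data amount.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_for_display(items: List[Dict]) -> List[Tuple[str, List[Dict]]]:
    """Group items by semantic cluster; order clusters and items by merged data amount, descending."""
    groups: Dict[str, List[Dict]] = defaultdict(list)
    for item in items:
        groups[item["cluster_id"]].append(item)
    # Clusters with the largest total merged data amount come first.
    ordered = sorted(groups.items(),
                     key=lambda kv: sum(i["merged_count"] for i in kv[1]),
                     reverse=True)
    # Within each cluster, the most frequently merged items come first.
    return [(cid, sorted(group, key=lambda i: i["merged_count"], reverse=True))
            for cid, group in ordered]


items = [
    {"question": "q1", "cluster_id": "a", "merged_count": 2},
    {"question": "q2", "cluster_id": "b", "merged_count": 5},
    {"question": "q3", "cluster_id": "a", "merged_count": 7},
    {"question": "q4", "cluster_id": "b", "merged_count": 1},
]
view = group_for_display(items)
```

Here cluster `a` (total 9) is shown before cluster `b` (total 6), and inside each cluster the item with the larger merged data amount is listed first.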
When the dialogue labeling data in the labeling library is displayed, high-frequency hot data is displayed first, so that an annotator can label the high-frequency hot data first according to the merged data amount of each piece of dialogue labeling data and can discover identification errors in the hot data.
For the clustered dialogue labeling data, arranging the data within each cluster in descending order of merged data amount lets an annotator find high-frequency questions without answers, and thus uncovered new knowledge points, at the first opportunity, and also find high-frequency questions with wrong answers, and thus high-frequency errors, in time. Through intervention and quick repair, labeling becomes more efficient than non-clustered labeling.
In an optional implementation manner, after a piece of dialog marking data is generated, the generated dialog marking data is marked as a state to be marked; and preferentially displaying the dialogue marking data in the state to be marked for the dialogue marking data belonging to the same semantic cluster.
Optionally, in response to a confirmation operation on any dialog annotation data, marking the dialog annotation data as a marked state; and responding to the adjustment operation of any conversation marking data, updating the conversation marking data, and marking the conversation marking data into a marked state.
By marking the marked conversation marking data as the marked state and marking the newly generated conversation marking data as the state to be marked, the conversation marking data in the unmarked state can be preferentially displayed during display, so that the marking personnel can conveniently find new problems in time and find new hot knowledge. Along with the continuous progress of the labeling work, the coverage of the labeled data is gradually increased, the proportion of new question and answer data clustered to the labeled data is higher and higher, and the required labeling work load is gradually reduced.
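Illustratively, the two labeling states and the preferential display of data in the to-be-labeled state can be sketched as follows. The `Status` enumeration, the `LabelItem` structure, and the function names are hypothetical; ordering by merged data amount inside each state is an assumption carried over from the display step.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    TO_BE_LABELED = "to_be_labeled"
    LABELED = "labeled"


@dataclass
class LabelItem:
    question: str
    labeling_knowledge: Optional[str]
    merged_count: int = 1
    status: Status = Status.TO_BE_LABELED  # newly generated data awaits labeling


def confirm(item: LabelItem) -> None:
    """Confirmation operation: the existing labeling knowledge is accepted."""
    item.status = Status.LABELED


def adjust(item: LabelItem, corrected_knowledge: str) -> None:
    """Adjustment operation: the labeling knowledge is corrected, then marked as labeled."""
    item.labeling_knowledge = corrected_knowledge
    item.status = Status.LABELED


def display_order(cluster_items: List[LabelItem]) -> List[LabelItem]:
    """To-be-labeled items first; within each state, larger merged data amount first."""
    return sorted(cluster_items,
                  key=lambda it: (it.status is Status.LABELED, -it.merged_count))


a = LabelItem("q-a", "K1", merged_count=5)
b = LabelItem("q-b", "K1", merged_count=3)
c = LabelItem("q-c", "K2", merged_count=9)
confirm(c)
adjust(a, "K9")
ordered = display_order([a, b, c])
```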
Optionally, in response to a new-question export operation, the questions of the dialogue labeling data whose labeling knowledge is empty are exported, or the dialogue labeling data whose labeling knowledge is empty is exported.
Optionally, part of the dialogue labeling data in the to-be-labeled state may be recommended to the relevant personnel for manual labeling according to its merged data amount.
Optionally, whether the labeling knowledge in the dialogue labeling data is accurate can be determined from the labeling results, and the accuracy of the labeling knowledge can be calculated, which accurately reflects the effect of the dialogue labeling. The quantity of dialogue labeling data recommended for labeling can be adjusted according to this accuracy: the higher the accuracy of the labeling knowledge, the smaller the recommended quantity; the lower the accuracy, the larger the recommended quantity.
Illustratively, the recommended annotation data proportion can be adjusted according to the accuracy of annotation knowledge.
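One possible way to adjust the recommended proportion is sketched below. This embodiment does not specify the mapping from accuracy to proportion, so the linear mapping, the bounds `min_ratio` and `max_ratio`, and the function names are purely assumptions chosen to satisfy the stated monotonic relationship.

```python
import math
from typing import List, Tuple


def recommended_ratio(accuracy: float,
                      min_ratio: float = 0.1,
                      max_ratio: float = 1.0) -> float:
    """Assumed linear mapping: higher labeling accuracy gives a smaller recommended proportion."""
    accuracy = min(max(accuracy, 0.0), 1.0)  # clamp to [0, 1]
    return max_ratio - accuracy * (max_ratio - min_ratio)


def recommend(pending: List[Tuple[str, int]], accuracy: float) -> List[Tuple[str, int]]:
    """Pick the top share of to-be-labeled (question, merged_count) pairs by merged data amount."""
    if not pending:
        return []
    k = math.ceil(recommended_ratio(accuracy) * len(pending))
    ranked = sorted(pending, key=lambda p: p[1], reverse=True)
    return ranked[:k]


pending = [("q1", 4), ("q2", 12), ("q3", 1), ("q4", 7)]
low_accuracy_batch = recommend(pending, accuracy=0.0)
high_accuracy_batch = recommend(pending, accuracy=1.0)
```

With these assumed bounds, an accuracy of 0 recommends all four pending items, while an accuracy of 1 recommends only the single most frequently merged item.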
Optionally, for long-tail data, sampling and labeling can be performed according to the scale of labeling personnel and specific conditions.
In an optional embodiment, the dialogue system has a clarification function: when the dialogue system finds multiple pieces of knowledge matching the question input by the user, it returns the found knowledge to the user, and the user selects one piece as the correct matching knowledge for the input question. That is, the knowledge matching the question is determined through user clarification. After the question and answer data of the dialogue system is acquired, if the current question and answer data is clarification data clarified by the user, the number of times the current question and answer data has occurred as clarification data is counted over all the clarification data. If this number is greater than or equal to the clarification count threshold, the question of the current question and answer data has been clarified multiple times to the same matching knowledge, so the current question can be labeled with that matching knowledge and the question and answer knowledge base of the dialogue system is updated synchronously. The question and answer knowledge base is thus updated in real time, the dialogue system can subsequently match this knowledge automatically for the same question, and the dialogue effect of the dialogue system is improved.
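Illustratively, counting clarifications against a threshold can be sketched as follows. The sketch assumes that a clarification is identified by the exact pair of question text and selected knowledge identifier, and that the knowledge base is a dictionary mapping each knowledge identifier to its set of questions; the class name `ClarificationTracker` and these structures are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple


class ClarificationTracker:
    """Counts how often a question has been clarified to the same knowledge."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.counts: Counter = Counter()

    def record(self, question: str, knowledge: str,
               knowledge_base: Dict[str, Set[str]]) -> bool:
        """Record one clarification; update the knowledge base once the threshold is reached."""
        key: Tuple[str, str] = (question, knowledge)
        self.counts[key] += 1
        if self.counts[key] >= self.threshold:
            # Label the question with the clarified knowledge in the knowledge base.
            knowledge_base.setdefault(knowledge, set()).add(question)
            return True
        return False


knowledge_base: Dict[str, Set[str]] = {}
tracker = ClarificationTracker(threshold=3)
results = [tracker.record("how to unbind my phone", "K7", knowledge_base) for _ in range(3)]
```

Only the third clarification reaches the threshold, at which point the question is attached to knowledge `K7` and later occurrences can be matched directly.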
In this embodiment, the question and answer data generated online by the dialogue system is clustered in real time based on the semantic similarity of the questions and on knowledge matching, which improves the timeliness and efficiency of dialogue data labeling. If a question that was previously identified correctly is later identified incorrectly for any reason, new dialogue labeling data to be labeled is generated for the question and answer data, so that inconsistency between the knowledge matched to the same question before and after can be found in time. This helps annotators find wrongly labeled data promptly and ensures that the dialogue effect of the dialogue system keeps improving. By adjusting the semantic clustering threshold and the merging threshold, dialogue labeling data with high semantic similarity and consistent knowledge is merged, so that both the labeling amount and the richness of the labeling data are taken into account. The cluster center vector is adjusted dynamically during real-time clustering and the clustering radius is controlled, which keeps the similarity of the data within a cluster controllable, effectively reduces cluster divergence caused by the transitive propagation of similarity, and prevents data with low similarity from being gathered into one cluster.
Fig. 4 is a schematic structural diagram of a data processing apparatus of a dialogue system according to an exemplary embodiment of the present application. The data processing apparatus of the dialogue system provided by this embodiment can execute the processing flow provided by the embodiments of the data processing method of the dialogue system. As shown in Fig. 4, the data processing apparatus 40 of the dialogue system includes: a question-answer data acquisition module 401, a semantic clustering module 402, a knowledge matching module 403, and a cluster processing module 404.
Specifically, the question-answer data acquiring module 401 is configured to acquire question-answer data of the dialog system in real time, where the question-answer data includes questions and matching knowledge.
The semantic clustering module 402 is configured to search for candidate clusters matching the question according to the similarity between the sentence vector of the question and the center vectors of the existing semantic clusters.
The knowledge matching module 403 is configured to, if a candidate cluster is found, match the labeling knowledge and the matching knowledge of the dialog annotation data belonging to the candidate cluster, and determine whether there is first labeling data in which the labeling knowledge and the matching knowledge are consistent in the dialog annotation data, where each dialog annotation data includes a question, labeling knowledge, and a cluster identifier.
The cluster processing module 404 is configured to, if the first labeling data does not exist, classify the question and answer data into the candidate cluster whose center vector has the highest similarity with the sentence vector, and generate the dialogue labeling data corresponding to the question and answer data.
Optionally, the cluster processing module is further configured to:
if the candidate cluster is not found, generating a new semantic cluster taking the sentence vector as a central vector; and classifying the question and answer data into a new semantic cluster, and generating dialogue marking data corresponding to the question and answer data.
Optionally, the cluster processing module is further configured to:
if the first marking data exist, the first marking data with the maximum similarity between the central vector of the corresponding semantic cluster and the sentence vector is used as second marking data, and the second cluster to which the second marking data belong is determined; if the similarity between the central vector of the second cluster and the sentence vector is smaller than the merging threshold, classifying the question and answer data into the second cluster and generating dialogue marking data corresponding to the question and answer data; and if the similarity between the central vector of the second cluster and the sentence vector is greater than or equal to the merging threshold, adding 1 to the merged data amount corresponding to the second label data.
Optionally, the cluster processing module is further configured to:
before classifying the question and answer data into the candidate cluster whose center vector has the highest similarity with the sentence vector and generating the dialogue labeling data corresponding to the question and answer data, determining the first cluster whose center vector has the highest similarity with the sentence vector; and updating the central vector of the first cluster according to the sentence vector.
Optionally, the cluster processing module is further configured to:
if the first labeling data exists, after taking the first labeling data whose semantic cluster has the central vector with the maximum similarity to the sentence vector as the second labeling data and determining the second cluster to which the second labeling data belongs, updating the central vector of the second cluster according to the sentence vector.
Optionally, the cluster processing module is further configured to:
determining the merged data volume corresponding to the target cluster, wherein the merged data volume corresponding to the target cluster is equal to the sum of the merged data volumes corresponding to all the dialogue marking data belonging to the target cluster; taking the merged data volume corresponding to the target cluster as the weight of the current central vector of the target cluster, weighting the current central vector and the sentence vector of the target cluster to obtain a mean value, obtaining a new central vector of the target cluster, and updating the central vector of the target cluster; wherein the target cluster is a first cluster or a second cluster.
Optionally, the semantic clustering module is further configured to:
searching the existing dialogue labeling data according to the question, and determining third labeling data with a question similar to the question and a third cluster to which the third labeling data belong; calculating the similarity between the sentence vector of the problem and the central vector of the third cluster; and determining the third cluster of which the similarity between the center vector and the sentence vector is greater than or equal to the semantic clustering threshold as a candidate cluster matched with the problem.
Optionally, the data processing apparatus of the dialog system may further include:
an annotation processing module configured to:
in response to a request for dialogue labeling data, displaying the dialogue labeling data grouped by the semantic cluster to which each item belongs; and displaying the dialogue labeling data within the same semantic cluster in descending order of merged data volume.
Optionally, the annotation processing module is further configured to:
marking newly generated dialogue labeling data as being in a to-be-labeled state; and, among the dialogue labeling data belonging to the same semantic cluster, preferentially displaying the data in the to-be-labeled state.
Optionally, the annotation processing module is further configured to:
in response to a confirmation operation on any dialogue labeling data, marking that dialogue labeling data as labeled; and in response to an adjustment operation on any dialogue labeling data, updating that dialogue labeling data and marking it as labeled.
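The display and state-handling behavior of the annotation processing module can be sketched as a sort key plus two small state transitions. The record fields, the state names, and the choice to order clusters by identifier are assumptions for illustration; the text specifies only grouping by cluster, priority for to-be-labeled items, and descending merged data volume.

```python
from dataclasses import dataclass
from typing import Iterable, List

PENDING = "to_be_labeled"  # hypothetical state names
LABELED = "labeled"


@dataclass
class LabelingRecord:
    question: str
    labeling_knowledge: str
    cluster_id: str
    merged_volume: int = 1
    state: str = PENDING


def display_order(records: Iterable[LabelingRecord]) -> List[LabelingRecord]:
    """Group by cluster; inside a cluster show to-be-labeled records first,
    then order by merged data volume from large to small."""
    return sorted(
        records,
        key=lambda r: (r.cluster_id, r.state != PENDING, -r.merged_volume),
    )


def confirm(record: LabelingRecord) -> None:
    """Confirmation keeps the record unchanged and marks it as labeled."""
    record.state = LABELED


def adjust(record: LabelingRecord, new_knowledge: str) -> None:
    """Adjustment rewrites the labeling knowledge, then marks the record as labeled."""
    record.labeling_knowledge = new_knowledge
    record.state = LABELED


records = [
    LabelingRecord("q1", "k1", "c2", 5, LABELED),
    LabelingRecord("q2", "k2", "c1", 2),
    LabelingRecord("q3", "k3", "c1", 9, LABELED),
    LabelingRecord("q4", "k4", "c1", 7),
]
ordered_questions = [r.question for r in display_order(records)]
```

Sorting by merged data volume puts the questions that users ask most often at the top of each cluster, so annotators handle high-traffic items first.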
Optionally, the annotation processing module is further configured to:
if it is determined that the dialogue labeling data of the candidate clusters includes fourth labeling data whose labeling knowledge is inconsistent with the matching knowledge, pushing a labeling reminder alert, which prompts timely labeling of the fourth labeling data.
Optionally, the annotation processing module is further configured to:
if the question-and-answer data is clarification data, that is, data clarified by the user, and the historical clarification data shows that the number of times the question-and-answer data has appeared as clarification data is greater than or equal to a clarification count threshold, labeling the question to the matching knowledge and synchronously updating the question-and-answer knowledge base of the dialogue system.
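The clarification rule above amounts to a counter with a threshold. The sketch below is one possible reading, with hypothetical names: it keys the history on the (question, matching knowledge) pair, counts the current occurrence toward the threshold, and models the knowledge base as a plain dictionary. None of these details is specified in the text.

```python
from collections import Counter
from typing import Dict, Tuple


class ClarificationTracker:
    """Counts how often a question was resolved through user clarification
    and promotes it to the knowledge base once a threshold is reached."""

    def __init__(self, clarification_threshold: int) -> None:
        self.clarification_threshold = clarification_threshold
        self.history: Counter = Counter()        # (question, knowledge) -> occurrences
        self.knowledge_base: Dict[str, str] = {}  # question -> labeled knowledge

    def observe(self, question: str, matching_knowledge: str) -> bool:
        """Record one clarified question-and-answer pair.
        Returns True if the pair is (now) labeled into the knowledge base."""
        key: Tuple[str, str] = (question, matching_knowledge)
        self.history[key] += 1
        if self.history[key] >= self.clarification_threshold:
            self.knowledge_base[question] = matching_knowledge
            return True
        return False


tracker = ClarificationTracker(clarification_threshold=3)
results = [tracker.observe("how do I reset it", "password-reset") for _ in range(3)]
```

With a threshold of 3, the first two clarified occurrences are only counted, and the third one labels the question and updates the knowledge base.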
The apparatus provided in this embodiment of the present application may be specifically configured to execute the method provided in any one of the above method embodiments; the specific functions and effects are not described again here.
Fig. 5 is a schematic structural diagram of an electronic device according to an example embodiment of the present application. As shown in fig. 5, the electronic device 50 includes: a processor 501, and a memory 502 communicatively coupled to the processor 501, the memory 502 storing computer-executable instructions.
The processor executes the computer-executable instructions stored in the memory to implement the data processing method of the dialog system provided in any of the above method embodiments; the specific functions and technical effects that can be achieved are not described again here.
The embodiment of the present application further provides a question-and-answer robot, including: an output device, a processor, and a memory communicatively coupled to the processor, the memory storing computer-executable instructions. The processor executes the computer-executable instructions stored in the memory to implement the data processing method of the dialog system provided in any of the above method embodiments; the specific functions and technical effects that can be achieved are not described again here.
The embodiment of the present application further provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the computer-executable instructions are used to implement the data processing method of the dialog system provided in any of the above method embodiments.
An embodiment of the present application further provides a computer program product, where the program product includes: a computer program, stored in a readable storage medium, from which at least one processor of the electronic device can read the computer program, where execution of the computer program by the at least one processor causes the electronic device to perform the data processing method of the dialog system provided by any of the above-mentioned method embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a logical division of functions, and an actual implementation may divide them differently: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed across a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the device described above, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (14)

1. A data processing method for a dialog system, comprising:
acquiring question-and-answer data of a dialog system in real time, wherein the question-and-answer data comprises a question and matching knowledge;
searching for candidate clusters matching the question according to the similarity between the sentence vector of the question and the center vectors of existing semantic clusters;
if candidate clusters are found, matching the labeling knowledge of the dialogue labeling data belonging to the candidate clusters against the matching knowledge, and determining whether the dialogue labeling data includes first labeling data whose labeling knowledge is consistent with the matching knowledge, wherein each item of dialogue labeling data comprises a question, labeling knowledge and a cluster identifier;
if the first labeling data does not exist, assigning the question-and-answer data to the candidate cluster whose center vector has the highest similarity to the sentence vector, and generating dialogue labeling data corresponding to the question-and-answer data.
2. The method of claim 1, wherein after searching for candidate clusters matching the question according to the similarity between the sentence vector of the question and the center vectors of the existing semantic clusters, the method further comprises:
if no candidate cluster is found, generating a new semantic cluster that takes the sentence vector as its center vector;
and assigning the question-and-answer data to the new semantic cluster, and generating dialogue labeling data corresponding to the question-and-answer data.
3. The method of claim 1, wherein, after matching the labeling knowledge of the dialogue labeling data belonging to the candidate clusters against the matching knowledge if candidate clusters are found, and determining whether the dialogue labeling data includes first labeling data whose labeling knowledge is consistent with the matching knowledge, the method further comprises:
if the first labeling data exists, taking as second labeling data the first labeling data whose semantic cluster has the center vector most similar to the sentence vector, and determining the second cluster to which the second labeling data belongs;
if the similarity between the center vector of the second cluster and the sentence vector is smaller than a merging threshold, assigning the question-and-answer data to the second cluster and generating dialogue labeling data corresponding to the question-and-answer data;
and if the similarity between the center vector of the second cluster and the sentence vector is greater than or equal to the merging threshold, adding 1 to the merged data volume corresponding to the second labeling data.
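Read together, claims 1 to 3 describe a single decision flow for each incoming question-and-answer pair. The following sketch illustrates that flow under several assumptions that the claims do not fix: cosine similarity as the measure, threshold values of 0.8 and 0.95, an initial merged data volume of 1 for a new record, and every class and function name. The center vector update of claims 4 to 6 is omitted to keep the example short.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class Record:  # one item of dialogue labeling data
    question: str
    knowledge: str
    cluster_id: int
    merged_volume: int = 1  # assumed initial value


@dataclass
class Store:
    centers: Dict[int, List[float]] = field(default_factory=dict)
    records: List[Record] = field(default_factory=list)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def process(store: Store, question: str, knowledge: str, vec: Sequence[float],
            cluster_threshold: float = 0.8, merge_threshold: float = 0.95) -> str:
    # Claim 1: candidate clusters are those at or above the clustering threshold.
    candidates = {cid: sim for cid, center in store.centers.items()
                  if (sim := cosine(vec, center)) >= cluster_threshold}
    if not candidates:
        # Claim 2: no candidate, so open a new cluster centered on the sentence vector.
        cid = len(store.centers)
        store.centers[cid] = list(vec)
        store.records.append(Record(question, knowledge, cid))
        return "new_cluster"
    # "First labeling data": records in candidate clusters with the same knowledge.
    first = [r for r in store.records
             if r.cluster_id in candidates and r.knowledge == knowledge]
    if not first:
        # Claim 1: assign to the most similar candidate cluster as a new record.
        best = max(candidates, key=candidates.get)
        store.records.append(Record(question, knowledge, best))
        return "new_record"
    # Claim 3: "second labeling data" is the match in the most similar cluster.
    second = max(first, key=lambda r: candidates[r.cluster_id])
    if candidates[second.cluster_id] < merge_threshold:
        store.records.append(Record(question, knowledge, second.cluster_id))
        return "new_record"
    second.merged_volume += 1  # near-duplicate: merge instead of adding a record
    return "merged"


store = Store()
outcomes = [
    process(store, "how to reset password", "K1", [1.0, 0.0]),
    process(store, "reset my password", "K1", [1.0, 0.0]),
    process(store, "password reset steps", "K2", [1.0, 0.0]),
    process(store, "change password", "K1", [0.9, 0.436]),
    process(store, "what is the weather", "K9", [0.0, 1.0]),
]
```

The two thresholds play different roles in this reading: the clustering threshold decides whether a question belongs to an existing cluster at all, while the stricter merging threshold decides whether it is close enough to be counted as a repeat rather than stored as a separate record.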
4. The method according to claim 1, wherein before assigning the question-and-answer data to the candidate cluster whose center vector has the highest similarity to the sentence vector and generating the dialogue labeling data corresponding to the question-and-answer data, the method further comprises:
determining a first cluster whose center vector has the highest similarity to the sentence vector;
and updating the center vector of the first cluster according to the sentence vector.
5. The method of claim 3, wherein, after taking as the second labeling data the first labeling data whose semantic cluster has the center vector most similar to the sentence vector if the first labeling data exists, and determining the second cluster to which the second labeling data belongs, the method further comprises:
updating the center vector of the second cluster according to the sentence vector.
6. The method of claim 4 or 5, wherein updating the center vector of a target cluster according to the sentence vector comprises:
determining the merged data volume of the target cluster, wherein the merged data volume of the target cluster equals the sum of the merged data volumes of all dialogue labeling data belonging to the target cluster;
using the merged data volume of the target cluster as the weight of the current center vector of the target cluster, computing the weighted mean of the current center vector of the target cluster and the sentence vector to obtain a new center vector of the target cluster, and updating the center vector of the target cluster;
wherein the target cluster is the first cluster or the second cluster.
7. The method according to any one of claims 1-5, wherein searching for candidate clusters matching the question according to the similarity between the sentence vector of the question and the center vectors of the existing semantic clusters comprises:
searching the existing dialogue labeling data according to the question, and determining third labeling data whose question is similar to the question, together with the third cluster to which the third labeling data belongs;
calculating the similarity between the sentence vector of the question and the center vector of the third cluster;
and determining each third cluster whose center vector has a similarity to the sentence vector greater than or equal to a semantic clustering threshold as a candidate cluster matching the question.
8. The method according to any one of claims 1-5, further comprising:
in response to a request for dialogue labeling data, displaying the dialogue labeling data grouped by the semantic cluster to which each item belongs;
and displaying the dialogue labeling data belonging to the same semantic cluster in descending order of merged data volume.
9. The method of claim 8, further comprising, after an item of dialogue labeling data is generated:
marking the generated dialogue labeling data as being in a to-be-labeled state;
and, among the dialogue labeling data belonging to the same semantic cluster, preferentially displaying the dialogue labeling data in the to-be-labeled state.
10. The method of claim 9, further comprising:
in response to a confirmation operation on any dialogue labeling data, marking that dialogue labeling data as labeled;
and in response to an adjustment operation on any dialogue labeling data, updating that dialogue labeling data and marking it as labeled.
11. The method of claim 1, wherein, after matching the labeling knowledge of the dialogue labeling data belonging to the candidate clusters against the matching knowledge if candidate clusters are found, the method further comprises:
if it is determined that the dialogue labeling data of the candidate clusters includes fourth labeling data whose labeling knowledge is inconsistent with the matching knowledge, pushing a labeling reminder alert, wherein the labeling reminder alert prompts timely labeling of the fourth labeling data.
12. The method according to claim 1, wherein after acquiring the question-and-answer data of the dialog system in real time, the method further comprises:
if the question-and-answer data is clarification data clarified by the user, and it is determined from historical clarification data that the number of times the question-and-answer data has appeared as clarification data is greater than or equal to a clarification count threshold, labeling the question to the matching knowledge and synchronously updating the question-and-answer knowledge base of the dialogue system.
13. An electronic device, comprising:
a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored by the memory to implement the method of any of claims 1-12.
14. A computer-readable storage medium having computer-executable instructions stored therein, which when executed by a processor, are configured to implement the method of any one of claims 1-12.
CN202210039580.XA 2022-01-14 2022-01-14 Data processing method of dialogue system, electronic device and readable storage medium Active CN114090757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210039580.XA CN114090757B (en) 2022-01-14 2022-01-14 Data processing method of dialogue system, electronic device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210039580.XA CN114090757B (en) 2022-01-14 2022-01-14 Data processing method of dialogue system, electronic device and readable storage medium

Publications (2)

Publication Number Publication Date
CN114090757A true CN114090757A (en) 2022-02-25
CN114090757B CN114090757B (en) 2022-04-26

Family

ID=80308616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210039580.XA Active CN114090757B (en) 2022-01-14 2022-01-14 Data processing method of dialogue system, electronic device and readable storage medium

Country Status (1)

Country Link
CN (1) CN114090757B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008166A (en) * 2014-05-30 2014-08-27 华东师范大学 Dialogue short text clustering method based on form and semantic similarity
CN105955965A (en) * 2016-06-21 2016-09-21 上海智臻智能网络科技股份有限公司 Question information processing method and device
CN108846126A (en) * 2018-06-29 2018-11-20 北京百度网讯科技有限公司 Generation, question and answer mode polymerization, device and the equipment of related question polymerization model
US20190065576A1 (en) * 2017-08-23 2019-02-28 Rsvp Technologies Inc. Single-entity-single-relation question answering systems, and methods
US20200265218A1 (en) * 2019-02-20 2020-08-20 Peng Dai Semi-supervised hybrid clustering/classification system
CN111858869A (en) * 2020-01-03 2020-10-30 北京嘀嘀无限科技发展有限公司 Data matching method and device, electronic equipment and storage medium
CN112148859A (en) * 2020-09-27 2020-12-29 深圳壹账通智能科技有限公司 Question-answer knowledge base management method, device, terminal equipment and storage medium
CN112749563A (en) * 2021-01-21 2021-05-04 北京明略昭辉科技有限公司 Named entity identification data labeling quality evaluation and control method and system
CN112989040A (en) * 2021-03-10 2021-06-18 河南中原消费金融股份有限公司 Dialog text labeling method and device, electronic equipment and storage medium
CN113220856A (en) * 2021-05-28 2021-08-06 天津大学 Multi-round dialogue system based on Chinese pre-training model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cui Nana et al.: "Research on the aggregation of semantically annotated knowledge resources based on research hotspots", 《情报探索》 *

Also Published As

Publication number Publication date
CN114090757B (en) 2022-04-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240229

Address after: Room 553, 5th Floor, Building 3, No. 969 Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province, 311121

Patentee after: Hangzhou Alibaba Cloud Feitian Information Technology Co.,Ltd.

Country or region after: China

Address before: 310023 Room 516, floor 5, building 3, No. 969, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee before: Alibaba Dharma Institute (Hangzhou) Technology Co.,Ltd.

Country or region before: China
