CN111368060B

CN111368060B - Self-learning method, device and system for conversation robot, electronic equipment and medium

Info

Publication number: CN111368060B
Application number: CN202010462950.1A
Authority: CN
Inventors: 吴岳灏
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2021-01-26
Anticipated expiration: 2040-05-27
Also published as: CN111368060A

Abstract

The embodiments of this specification disclose a self-learning method for a conversation robot. The method acquires user question data in real time through a log service and performs robot self-learning calculation on that data in a distributed manner. The robot self-learning calculation comprises: vectorizing the user question data by using locality-sensitive hashing, and clustering the resulting question vectors of the user question data to obtain a question cluster set; and constructing an inverted index for the question cluster set by using the question vectors of the user question data.

Description

Self-learning method, device and system for conversation robot, electronic equipment and medium

Technical Field

The embodiments of this specification relate to the technical field of conversation robots, and in particular to a self-learning method, device, and system for a conversation robot, as well as an electronic device and a medium.

Background

With the rapid development of mobile electronic devices, the number of applications on them and the number of users of those applications keep growing. Users run into many kinds of problems while using these applications; resolving them manually is inefficient and costly, so conversation robots that answer users' questions have emerged.

A conversation robot in the prior art generally obtains historical conversation data between users and the robot, performs offline cleaning and calculation on that data, analyzes which questions and phrasings are not covered by the conversation knowledge base, and then adds response schemes for the uncovered questions to the conversation knowledge base, so that the conversation robot answers users' questions by using the updated conversation knowledge base.

Disclosure of Invention

The embodiment of the specification provides a self-learning method, a self-learning device, a self-learning system, electronic equipment and a self-learning medium of a conversation robot, real-time learning can be rapidly completed, and timeliness of self-learning of the robot is improved.

The first aspect of the embodiments of the present specification provides a self-learning method for a conversation robot, including:

acquiring user question data in real time through a log service;

performing robot self-learning calculation, in a distributed manner, on the user question data acquired in real time, wherein the robot self-learning calculation comprises the following steps:

vectorizing the user question data by using locality-sensitive hashing, and clustering the resulting question vectors of the user question data to obtain a question cluster set;

and constructing an inverted index for the question cluster set by using the question vectors of the user question data.

The second aspect of the embodiments of the present specification provides a self-learning method for a conversation robot, including:

acquiring current question data of a current user;

vectorizing the current question data by using locality-sensitive hashing to obtain a question vector of the current question data;

according to the question vector of the current question data and the inverted index provided in the first aspect, acquiring a similar question cluster similar to the current question data from the question cluster set provided in the first aspect;

similarity calculation is carried out on the current question data and each question cluster in the similar question clusters, and a target question cluster matched with the current question data is obtained, wherein the similarity between the target question cluster and the current question data is not less than preset similarity; and updating the target question cluster, and updating the inverted index according to the updated target question cluster.

A third aspect of the embodiments of the present specification provides a self-learning apparatus for a conversation robot, including:

the question data acquisition unit is used for acquiring user question data in real time through a log service;

the robot self-learning unit is used for performing robot self-learning calculation, in a distributed manner, on the user question data acquired in real time, wherein the robot self-learning calculation comprises the following steps: vectorizing the user question data by using locality-sensitive hashing, and clustering the resulting question vectors of the user question data to obtain a question cluster set; and constructing an inverted index for the question cluster set by using the question vectors of the user question data.

A fourth aspect of the embodiments of the present specification provides a self-learning apparatus for a conversation robot, including:

the current data acquisition unit is used for acquiring current question data of a current user;

the vectorization processing unit is used for vectorizing the current question data by using locality-sensitive hashing to obtain a question vector of the current question data;

a similar question cluster acquiring unit, configured to acquire a similar question cluster similar to the current question data from the question cluster set provided in the first aspect according to the question vector of the current question data and the inverted index provided in the first aspect;

the updating unit is used for carrying out similarity calculation on the current question data and each question cluster in the similar question clusters to obtain a target question cluster matched with the current question data, wherein the similarity between the target question cluster and the current question data is not less than the preset similarity; and updating the target question cluster, and updating the inverted index according to the updated target question cluster.

A fifth aspect of the embodiments of the present specification provides a self-learning system for a conversation robot, including a conversation server, a log server, and a distributed machine cluster, wherein:

the conversation server is used for carrying out online conversation with a user through a conversation robot and writing conversation data into the log server in real time while the conversation robot and the user carry out online conversation;

the log server is used for caching the dialogue data written by the dialogue server in real time;

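the distributed machine cluster is used for monitoring, in real time, the conversation data written into the log server and performing robot self-learning calculation on the conversation data, wherein the robot self-learning calculation comprises the following steps: vectorizing user question data by using locality-sensitive hashing, and clustering the resulting question vectors of the user question data to obtain a question cluster set; and constructing an inverted index for the question cluster set by using the question vectors of the user question data.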

The sixth aspect of the embodiments of the present specification further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the self-learning method of the dialog robot when executing the program.

The seventh aspect of the embodiments of the present specification further provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, performs the steps of the self-learning method for a dialogue robot.

The beneficial effects of the embodiment of the specification are as follows:

Based on the above technical solution, conversation data is acquired in real time through the log server, and robot self-learning calculation is performed in a distributed manner on the user question data acquired in real time. The user's streaming conversation data can therefore be processed in real time without being collected in advance, and users' questions can be learned and mined continuously, around the clock. Distributed execution also greatly improves service throughput, which in turn greatly improves the efficiency of robot self-learning, so that real-time learning completes quickly even in dynamically changing scenarios that demand high timeliness, and the self-learning of the robot is more timely.

Drawings

FIG. 1 is a schematic structural diagram of a self-learning system of a conversation robot in an embodiment of the present specification;

FIG. 2 is a system architecture diagram of a self-learning system of a conversation robot in an embodiment of the present disclosure;

FIG. 3 is an overall flow chart of the self-learning system of the dialogue robot in the embodiment of the present specification;

FIG. 4 is a flow chart of a first method of a self-learning method of a dialogue robot in an embodiment of the present description;

FIG. 5 is a flow chart of a second method of a self-learning method of a dialogue robot in an embodiment of the present description;

FIG. 6 is a schematic diagram of a first structure of a self-learning apparatus of a conversation robot in an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of a second structure of a self-learning apparatus of a conversation robot in an embodiment of the present disclosure;

FIG. 8 is a schematic structural diagram of an electronic device in an embodiment of this specification.

Detailed Description

To aid understanding, the technical solutions of the embodiments of this specification are described in detail below with reference to the drawings and specific embodiments. It should be understood that the embodiments and their specific features are detailed descriptions of the technical solutions of this specification, not limitations on them, and that the embodiments and their technical features may be combined with one another where no conflict arises.

In a first aspect, as shown in FIG. 1, the present specification provides a self-learning system for a conversation robot, including a conversation server 10, a log server 20, and a distributed machine cluster 30. The conversation server 10 is used for conducting online conversations with users through the conversation robot 101 and for writing conversation data into the log server 20 in real time while those conversations take place. The log server 20 is used for caching, in real time, the conversation data written by the conversation server 10. The distributed machine cluster 30 is used for monitoring, in real time, the conversation data written into the log server 20 and performing robot self-learning calculation on it, wherein the robot self-learning calculation comprises the following steps: vectorizing user question data by using locality-sensitive hashing, and clustering the resulting question vectors of the user question data to obtain a question cluster set; and constructing an inverted index for the question cluster set by using the question vectors of the user question data.

Specifically, the conversation robot 101 makes an online conversation with the user, and writes the conversation data in the log server 20 in real time while the conversation robot 101 makes an online conversation with the user; the distributed machine cluster 30 monitors the conversation data written in the log server 20 in real time, performs robot self-learning calculation on the conversation data, temporarily stores the intermediate result of the calculation in a cache, and pushes the intermediate result to a conversation knowledge base used by the conversation robot when detecting that a certain intermediate result reaches a service threshold.

In the embodiments of this specification, the conversation server 10 and the log server 20 may be electronic devices such as desktop computers, notebook computers, all-in-one machines, or smartphones. Further, the distributed machine cluster 30 may adopt a master-slave architecture, in which the master node reads the streaming conversation data from the log server 20 in real time and then distributes it to the slave nodes of the distributed machine cluster 30, which perform the self-learning calculation.

Specifically, referring to FIG. 1, when a user uses a local device to converse with the conversation robot 101 in the conversation server 10, the conversation robot 101 acquires the user's question data in real time and performs intention recognition on it to obtain the user's conversation intention; a response scheme matching the conversation intention is then looked up in the conversation knowledge base, and the conversation robot 101 converses with the user by using the matching response scheme.
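The master-slave distribution described above can be sketched as follows. This is only an illustration under stated assumptions: the specification does not prescribe how the master node partitions the stream, so the hash-based routing policy, the record fields, and the function names `pick_slave` and `dispatch` are all hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def pick_slave(session_id: str, num_slaves: int) -> int:
    """Map a session id to a slave index with a stable hash (assumed policy)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_slaves


def dispatch(records: Iterable[dict], num_slaves: int) -> Dict[int, List[dict]]:
    """Master-side step: route each streamed record to one slave's work queue."""
    queues: Dict[int, List[dict]] = defaultdict(list)
    for record in records:
        queues[pick_slave(record["session_id"], num_slaves)].append(record)
    return dict(queues)


# Hypothetical records as they might be read from the log server.
stream = [
    {"session_id": "s1", "question": "forgot account password"},
    {"session_id": "s2", "question": "how to recover a locked password"},
    {"session_id": "s1", "question": "password reset link expired"},
]
queues = dispatch(stream, num_slaves=3)
```

Routing by a stable hash keeps all records of one session on the same slave node, which is one possible design choice; a round-robin or load-aware policy would fit the same description.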

However, in practice, user questions involve many complex scenarios, and it is difficult to cover all the ways online users phrase their questions by manual effort alone. The conversation robot 101 must therefore be capable of self-learning: it must gain insight into customer needs, analyze and extract the questions and phrasings that it cannot match well, and update and improve the conversation knowledge base in a timely manner, thereby raising the answer rate of the conversation robot 101. To improve the recognition capability of the conversation robot 101, some representative questions are extracted from users' questions and defined as standard questions, and the conversation robot 101 is then trained with these standard questions.

In addition, to improve the answering efficiency of the conversation robot 101, the multiple phrasings that users employ for the same question are abstracted into one standard question. For example, if one question has ten phrasings, the ten phrasings are clustered to form one standard question. After each standard question is determined, a targeted response scheme is configured for it in the conversation knowledge base. In this way, a targeted response scheme can be looked up directly in the conversation knowledge base for each standard question, which effectively improves the answering efficiency of the conversation robot 101.
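The relationship between phrasings, standard questions, and response schemes can be pictured with a small lookup sketch. The dictionaries, the identifier `reset_password`, the response text, and the function `answer` are hypothetical; a real conversation robot would match phrasings through intention recognition rather than exact string lookup.

```python
from typing import Dict, Optional

# Hypothetical knowledge base: each phrasing points to a standard question,
# and each standard question has one configured response scheme.
PHRASING_TO_STANDARD: Dict[str, str] = {
    "i forgot my account password": "reset_password",
    "how do i get my password back": "reset_password",
    "cannot remember my login password": "reset_password",
}
STANDARD_TO_RESPONSE: Dict[str, str] = {
    "reset_password": "Open Settings, choose Security, then select Reset Password.",
}


def answer(user_question: str) -> Optional[str]:
    """Return the response scheme for a known phrasing, or None if uncovered."""
    standard = PHRASING_TO_STANDARD.get(user_question.strip().lower())
    if standard is None:
        return None  # An uncovered question: a candidate for self-learning.
    return STANDARD_TO_RESPONSE[standard]
```

A `None` result corresponds to a phrasing that the knowledge base does not yet cover, which is the kind of input the self-learning process is meant to mine.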

Specifically, referring to FIG. 1, when a user converses with the conversation robot 101 by using a local device, the conversation data between the user and the conversation robot 101 is written into the log server 20 in real time. The conversation data may be reported automatically by the local device, or the log server 20 may collect it in real time; this specification does not limit this.

In the embodiments of this specification, the log server 20 provides a real-time log service, that is, the log server 20 can obtain the conversation data between the user and the conversation robot 101 in real time. Because the distributed machine cluster 30 monitors, in real time, the streaming conversation data written into the log server 20, each piece of conversation data is read by the distributed machine cluster 30 once it has been written into the log server 20. The conversation data between users and the conversation robot 101 has no boundary: it keeps arriving, being processed, and being passed downstream. The conversation data that the distributed machine cluster 30 acquires in real time is therefore streaming data.

The embodiments of this specification are generally applied in a big data environment. In such an environment, streaming data is also high-velocity: data arrives quickly and with high concurrency, and business computing demands must be met quickly and continuously, so the data processing speed must at least match the data arrival speed. The knowledge value contained in the data also tends to decay over time; the data items in a stream are not equally important, and recently arrived data tends to be more valuable than data that arrived earlier. In the embodiments of this specification, the conversation data between users and the conversation robot 101 is typical streaming data: user questions enter the conversation robot scenario continuously, the value of a question changes as products and services change over time, and the value of the data keeps declining. It is therefore necessary to ensure the timeliness of the self-learning of the conversation robot 101.

In the embodiments of this specification, robot self-learning revolves around the data of the conversation knowledge base, which serves as the brain of the robot. When the conversation knowledge base holds a large amount of data and its questions and phrasings comprehensively cover users' real needs, the conversation robot 101 can exhibit a higher level of intelligence. Robot self-learning, also called question learning, means that the conversation robot 101 autonomously learns and mines user needs while serving users, analyzes and extracts the questions and phrasings that it cannot answer or match well, discovers what the conversation knowledge base is missing, and continuously improves the knowledge base data accordingly, thereby raising the answer rate of the conversation robot.

After the distributed machine cluster 30 acquires the conversation data, it filters the conversation data according to filtering rules to obtain filtered conversation data. The filtering rules include filtering out conversations that do not need to be learned or mined; in this case, small talk and conversations that the conversation robot 101 has already fully resolved are filtered out. The filtering rules may also include filtering out conversations that correspond to standard questions. Further, in the present embodiment, the conversation data includes user question data and response data of the conversation robot 101.
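A minimal sketch of such a filtering step is shown below. The small-talk list, the `resolved` flag, and the record layout are assumptions made for illustration; the specification names the filtering rules but not how they are represented.

```python
from typing import Iterable, Iterator

# Assumed examples of small talk; a real deployment would define its own rules.
SMALL_TALK = {"hello", "hi", "thanks", "thank you", "bye"}


def needs_learning(record: dict) -> bool:
    """Keep a record only if it is neither small talk nor already resolved."""
    question = record["question"].strip().lower()
    if question in SMALL_TALK:
        return False
    if record.get("resolved", False):
        return False
    return True


def filter_dialogues(records: Iterable[dict]) -> Iterator[dict]:
    """Lazily yield the records that should go on to self-learning."""
    return (r for r in records if needs_learning(r))


sample = [
    {"question": "Hello", "resolved": False},
    {"question": "How do I unlock my account?", "resolved": True},
    {"question": "Why was my transfer rejected?", "resolved": False},
]
kept = list(filter_dialogues(sample))
```

Returning a generator keeps the filter usable on an unbounded stream, since records are examined one at a time as they arrive.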

After the distributed machine cluster 30 acquires the dialogue data, the distributed machine cluster 30 performs robot self-learning on user question data in the dialogue data acquired in real time, specifically, the dialogue data monitored in real time is distributed to nodes in the distributed machine cluster 30 to perform robot autonomous learning calculation, and the step of performing robot self-learning includes:

Step A1, vectorizing the user question data by using locality-sensitive hashing, and clustering the vectorized user question data to obtain a question cluster set.

Specifically, locality-sensitive hashing (LsHash) is used to vectorize the user question data to obtain the question vectors of the user question data, and the question vectors are then clustered to obtain the question cluster set.
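The specification does not state which locality-sensitive hash family LsHash uses. The sketch below substitutes SimHash as one well-known LSH scheme and splits a 28-bit signature into four 7-bit strings, mirroring the four-string format of the question vectors in the examples of this specification. The constants, the whitespace tokenizer, and the function names are assumptions; Chinese text in particular would need a word segmenter instead of `split()`.

```python
import hashlib
from typing import List

NUM_BITS = 28      # assumed signature length
NUM_SEGMENTS = 4   # assumed number of strings per question vector


def token_hash(token: str) -> int:
    """Stable NUM_BITS-bit hash of a token."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") & ((1 << NUM_BITS) - 1)


def simhash_segments(text: str) -> List[str]:
    """SimHash the whitespace tokens of `text`, then split the bits into strings."""
    counts = [0] * NUM_BITS
    for token in text.lower().split():
        h = token_hash(token)
        for i in range(NUM_BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    bits = "".join("1" if c > 0 else "0" for c in counts)
    width = NUM_BITS // NUM_SEGMENTS
    return [bits[i:i + width] for i in range(0, NUM_BITS, width)]


vector = simhash_segments("forgot my account password")
```

Because every token votes on every bit, texts that share most of their tokens tend to produce signatures that differ in only a few bits, which is the property that later lets similar questions be found through shared hash strings.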

Specifically, as time passes, conversation data accumulates, so the amount of user question data grows; more question clusters are formed by clustering this growing data, and the question cluster set therefore contains more and more question clusters.

Specifically, for each question cluster in the question cluster set, a cluster center of the question cluster is obtained, a question vector closest to the cluster center is obtained from the question cluster, and the obtained question vector is used as a central point of the question cluster.

For example, the distributed machine cluster 30 acquires conversation data in real time, and the conversation data includes query1 and query2. The user question data in query1 is: "the account password of an application has been forgotten"; the user question data in query2 is: "how to recover a locked account password". LsHash is used to vectorize the user question data in query1, and the resulting vectorized representation, that is, the question vector, is: ["0010101", "0001001", "0101000", "0100010"]. LsHash is likewise used to vectorize the user question data in query2, and the resulting question vector is: ["0010111", "0001101", "1101000", "0100010"].

Next, after the question vectors of query1 and query2 are obtained, they are clustered to form a question cluster C1, so that the question cluster set includes C1. After C1 is determined, its cluster center is obtained; if the question vector of query1 is found to be closest to the cluster center of C1, the center point of C1 is represented by the vector: ["0010101", "0001001", "0101000", "0100010"].

In the embodiments of this specification, a clustering algorithm may be used to cluster the question vectors of the user question data. The clustering algorithm may be, for example, the K-Means algorithm, the K-Medoids algorithm, the CLARANS algorithm (based on randomized search), the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm (based on high-density connected regions), the OPTICS (Ordering Points To Identify the Clustering Structure) algorithm, or the DENCLUE (DENsity-based CLUstEring) algorithm (based on a set of density distribution functions).

Step A2, constructing an inverted index for the question cluster set by using the question vectors of the user question data.

Specifically, the inverted index is constructed from the question vectors contained in each question cluster in the question cluster set. For example, if the question cluster set includes question clusters C1, C2, C3, and C4, where C2 includes question vector 0000111, C3 includes question vector 0100010, and C4 includes question vectors 0001110 and 0101000, the constructed inverted index may be:

0000111: [C1, C2];

0001110: [C1, C4];

0101000: [C1, C4];

0100010: [C1, C3];

...

Specifically, the inverted index may be constructed by using only the center-point vector of each question cluster, or by using all the question vectors in each question cluster; this specification does not limit this.
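The index construction can be sketched as a mapping from each hash string to the clusters that contain it. The function name `build_inverted_index` is hypothetical, and the contents assumed for C1 (all four hash strings) are an inference from the fact that C1 appears in every entry of the listing above; the text itself only states the contents of C2, C3, and C4.

```python
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(clusters: Dict[str, List[List[str]]]) -> Dict[str, List[str]]:
    """Map each hash string to the sorted ids of the clusters that contain it.

    `clusters` maps a cluster id to the question vectors it contributes to the
    index: either only its center point or all of its members.
    """
    index = defaultdict(set)
    for cluster_id, vectors in clusters.items():
        for vector in vectors:
            for segment in vector:
                index[segment].add(cluster_id)
    return {segment: sorted(ids) for segment, ids in index.items()}


# Only the hash strings named in the example are listed for each cluster.
clusters = {
    "C1": [["0000111", "0001110", "0101000", "0100010"]],  # assumed contents
    "C2": [["0000111"]],
    "C3": [["0100010"]],
    "C4": [["0001110", "0101000"]],
}
inverted_index = build_inverted_index(clusters)
```

Passing one vector per cluster corresponds to indexing center points only, while passing every member vector corresponds to indexing all question vectors; the function handles both cases without change.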

When the distributed machine cluster 30 acquires the current question data of the current user in real time, it performs robot self-learning on the conversation data of the current user, specifically including the following steps:

Step B1, vectorizing the current question data by using locality-sensitive hashing to obtain the question vector of the current question data.

Specifically, the current question data is vectorized by using LsHash to obtain the question vector of the current question data. For example, suppose the current question data is query3: "how to raise the limit of a flower account". After LsHash is applied to query3, its vectorized representation, that is, the question vector of query3, is: ["0000111", "0001110", "1101010", "0100011"].

Step B2, acquiring, from the question cluster set, similar question clusters that are similar to the current question data according to the question vector of the current question data and the inverted index.

Specifically, the question vector of the current question data is used to search the inverted index for one or more question clusters that share an identical hash string with that question vector, and the question clusters found are taken as the similar question clusters.

For example, suppose the distributed machine cluster 30 acquires, in real time, the current question data query3 of the current user, and the question clusters in the distributed machine cluster 30 include C1, C2, C3, and C4. The question vector of query3 is ["0000111", "0001110", "1101010", "0100011"], and its LsHash values are compared against the inverted index. The inverted index contains entries for 0000111 and 0001110, both of which appear in the question vector of query3; according to 0000111: [C1, C2] and 0001110: [C1, C4], the question clusters that share a hash string with the question vector of query3 are C1, C2, and C4, so C1, C2, and C4 are taken as the similar question clusters.
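The lookup in this example can be reproduced with a short sketch. The index entries and the question vector of query3 are taken from the text; the function name `find_similar_clusters` is hypothetical.

```python
from typing import Dict, List

# Inverted index entries as given in the example for step A2.
inverted_index: Dict[str, List[str]] = {
    "0000111": ["C1", "C2"],
    "0001110": ["C1", "C4"],
    "0101000": ["C1", "C4"],
    "0100010": ["C1", "C3"],
}


def find_similar_clusters(question_vector: List[str],
                          index: Dict[str, List[str]]) -> List[str]:
    """Collect every cluster that shares at least one hash string with the query."""
    candidates = set()
    for segment in question_vector:
        candidates.update(index.get(segment, []))
    return sorted(candidates)


query3 = ["0000111", "0001110", "1101010", "0100011"]
similar = find_similar_clusters(query3, inverted_index)
print(similar)  # ['C1', 'C2', 'C4']
```

The hash strings 1101010 and 0100011 have no index entry and contribute nothing, so C3 is never examined; this is how the inverted index narrows the similarity calculation to a few candidate clusters.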

Step B3, carrying out similarity calculation on the current question data and each question cluster in the similar question clusters to obtain a target question cluster matched with the current question data, wherein the similarity between the target question cluster and the current question data is not less than the preset similarity; and updating the target question cluster, and updating the inverted index according to the updated target question cluster.

Specifically, the similarity between the current question data and each question class cluster can be calculated by using a SinglePass algorithm, so that the similarity between the current question data and each question class cluster is obtained; comparing the similarity of the current question data and each question cluster with a preset similarity, and searching a target question cluster from the similar question clusters according to a comparison result; and after the target question cluster is found, updating the target question cluster, and updating the inverted index according to the updated target question cluster.
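Single-pass clustering, in its usual form, compares each arriving item once against the existing clusters and assigns it to the most similar cluster if the similarity reaches a threshold. The sketch below follows that pattern. The similarity measure (one minus the normalized Hamming distance to the cluster center point) is an assumption, since the specification does not define it; the center point of C1 is the one given in the earlier example, while the center points of C2 and C4 are hypothetical values invented for illustration.

```python
from typing import Dict, List, Optional


def similarity(a: List[str], b: List[str]) -> float:
    """Fraction of matching bits over the concatenated hash strings."""
    bits_a, bits_b = "".join(a), "".join(b)
    if len(bits_a) != len(bits_b):
        raise ValueError("question vectors must have the same length")
    same = sum(x == y for x, y in zip(bits_a, bits_b))
    return same / len(bits_a)


def match_cluster(question: List[str], centers: Dict[str, List[str]],
                  preset_similarity: float) -> Optional[str]:
    """Return the most similar cluster id if it meets the preset similarity."""
    best_id, best_sim = None, -1.0
    for cluster_id, center in centers.items():
        sim = similarity(question, center)
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    return best_id if best_sim >= preset_similarity else None


query3 = ["0000111", "0001110", "1101010", "0100011"]
candidate_centers = {
    "C1": ["0010101", "0001001", "0101000", "0100010"],  # from the example
    "C2": ["0000111", "1110000", "0011100", "1011101"],  # hypothetical
    "C4": ["1110001", "0001110", "0101000", "1011100"],  # hypothetical
}
target = match_cluster(query3, candidate_centers, preset_similarity=0.65)
```

With these values, query3 agrees with the center point of C1 on 20 of 28 bits, so C1 is returned at a preset similarity of 65%; raising the preset similarity to 90% returns `None`, which corresponds to the case where no target question cluster is found.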

In the embodiment of the present specification, the preset similarity may be set by a person or an electronic device, or may be set according to actual requirements, and the preset similarity may be set to a value not less than 50% and less than 100%, for example, 50%, 65%, 75%, or the like; of course, the preset similarity may also be set to a value less than 50%, and the present specification is not particularly limited.

In an embodiment of this specification, after the distributed machine cluster 30 updates the target question cluster, the center point of the updated target question cluster is updated, and the updating step includes: acquiring a cluster center of the updated target question cluster; and acquiring a question vector nearest to the cluster center from the updated target question cluster, and taking the acquired question vector as the central point of the updated question cluster.

For example, suppose the distributed machine cluster 30 acquires, in real time, the current question data query3 of the current user and determines that the similar question clusters of query3 are C1, C2, and C4. The SinglePass algorithm is used to calculate the similarity between query3 and each of C1, C2, and C4, denoted C1-3, C2-3, and C4-3 respectively, and C1-3, C2-3, and C4-3 are then compared with the preset similarity. If C1-3 is found to be not less than the preset similarity, C1 is updated to obtain the updated C1, and the inverted index is updated according to the updated C1. For the process of updating the inverted index, refer to the description of constructing the inverted index in step A2; it is not repeated here for brevity.

After the updated C1 is obtained, the cluster center D1 of query1, query2, and query3 is acquired; the question vector closest to D1, which is that of query2, is acquired from the updated C1, and the center point of the updated C1 is determined to be the vector ["0010111", "0001101", "1101000", "0100010"].

In the embodiments of this specification, after the distributed machine cluster 30 updates the target question cluster, it determines whether the service parameter of the updated target question cluster is not less than the service threshold. If so, the conversation knowledge base used by the conversation robot 101 is updated so that it contains a response scheme corresponding to the updated target question cluster, and the conversation robot 101 is trained with the updated conversation knowledge base. After this training, the conversation robot 101 has learned the questions in the updated target question cluster; when it recognizes that user question data belongs to that cluster, it can retrieve the corresponding response scheme from the conversation knowledge base and reply with it. In this way, the conversation robot 101 autonomously learns and mines user needs from a massive data source in real time, continuously updates and improves the conversation knowledge base, and raises its answer rate.
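The center-point selection can be checked against the three question vectors from the examples. The sketch below assumes that the cluster center is the bitwise mean of the member vectors and that closeness is measured by squared Euclidean distance; the specification does not fix either choice, and the function names are hypothetical.

```python
from typing import List


def to_bits(vector: List[str]) -> List[int]:
    """Flatten a question vector into a list of 0/1 integers."""
    return [int(b) for b in "".join(vector)]


def center_point(members: List[List[str]]) -> List[str]:
    """Return the member vector closest to the bitwise mean of all members."""
    rows = [to_bits(m) for m in members]
    n = len(rows)
    centroid = [sum(col) / n for col in zip(*rows)]

    def sq_dist(row: List[int]) -> float:
        return sum((x - c) ** 2 for x, c in zip(row, centroid))

    best = min(range(n), key=lambda i: sq_dist(rows[i]))
    return members[best]


query1 = ["0010101", "0001001", "0101000", "0100010"]
query2 = ["0010111", "0001101", "1101000", "0100010"]
query3 = ["0000111", "0001110", "1101010", "0100011"]
center = center_point([query1, query2, query3])
```

Under these assumptions the selected center point is the question vector of query2, which agrees with the result stated in the example: query2 never holds the minority value at any bit position, so it lies closest to the mean.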

And if the service parameter of the updated target question cluster is judged to be smaller than the service threshold value, no operation is carried out, and the original state is kept unchanged.

In this specification, the service parameter includes the cluster size and the cluster heat, and correspondingly the service threshold includes a set cluster size and a set heat. When the number of user questions in a question cluster is not less than the set cluster size and the cluster heat is not less than the set heat, the service parameter of the question cluster is determined to be not less than the service threshold; otherwise, it is determined to be less than the service threshold.

In this embodiment, the cluster size refers to the number of user questions included in a question cluster, and the cluster heat refers to the degree of user attention that the question cluster receives.
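The threshold check combines both conditions with a logical AND. In the sketch below, the class names, field names, and the numeric scale used for heat are assumptions; the specification does not say how heat is measured.

```python
from dataclasses import dataclass


@dataclass
class QuestionCluster:
    size: int    # number of user questions in the cluster
    heat: float  # degree of user attention (assumed numeric scale)


@dataclass
class ServiceThreshold:
    min_size: int    # the set cluster size
    min_heat: float  # the set heat


def meets_service_threshold(cluster: QuestionCluster,
                            threshold: ServiceThreshold) -> bool:
    """True only when both the size and the heat reach their set values."""
    return cluster.size >= threshold.min_size and cluster.heat >= threshold.min_heat


threshold = ServiceThreshold(min_size=50, min_heat=0.6)
ready = meets_service_threshold(QuestionCluster(size=80, heat=0.7), threshold)
not_ready = meets_service_threshold(QuestionCluster(size=80, heat=0.2), threshold)
```

A cluster that passes the check would be pushed to the conversation knowledge base; one that fails is left unchanged until later updates raise its size or heat.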

In an embodiment of this specification, after similarity calculation is performed on current question data and each question cluster in similar question clusters, if it is compared that the similarity between the current question data and any one of the similar question clusters is smaller than a preset similarity, similar scatter points similar to the current question data are obtained, cluster calculation is performed on the current question data and the similar scatter points, a new question cluster composed of the current question data and the similar scatter points is obtained, and the new question cluster is added to a question cluster set.

Specifically, after similarity calculation is carried out on the current question data and each question cluster in the similar question clusters, the similarity of the current question data and each question cluster is compared with a preset similarity; if the similarity between the current question data and any one of the similar question clusters is smaller than the preset similarity, acquiring similar scatter points similar to the current question data according to the LsHash value; and clustering the current question data and the similar scattered points to obtain a new question cluster consisting of the current question data and the similar scattered points, and adding the new question cluster into the question cluster set.

Specifically, when similar scatter points similar to the current question data are obtained according to the LsHash value, scatter points that share the same bits as the question vector of the current question data can be searched from all scatter points cached in the distributed machine cluster 30, and the scatter points found are used as the similar scatter points. Clustering calculation is then performed on the current question data and the similar scatter points by using a clustering algorithm, and whether a new question cluster can be formed is judged according to a cluster forming condition. If a new question cluster can be formed, the new question cluster is added to the question cluster set, the central point of the new question cluster is determined, and the inverted index is updated. For the determination of the central point of the new question cluster, refer to the above description of obtaining the central point of a question cluster; for the updating of the inverted index, refer to the discussion of constructing the inverted index in step A2.

In this embodiment of the present specification, the cluster forming condition may be that the word vector distance is within a set threshold, and the set threshold may be set according to an actual situation, or may be set manually or by an apparatus. If the word vector distance between the current question data and the similar scatter points is within a set threshold value, determining that the current question data and the similar scatter points can form a new question cluster; otherwise, a new question class cluster cannot be formed.

After it is determined that a new question cluster can be formed, it is further determined whether the service parameter of the new question cluster is not less than the service threshold. If so, the dialogue knowledge base used by the dialogue robot 101 is updated so that it contains a reply scheme corresponding to the new question cluster, and the dialogue robot 101 is trained using the new question cluster. If the service parameter of the new question cluster is less than the service threshold, no operation is performed and the original state is kept unchanged.

And if the new question cluster cannot be formed, writing the current question data into the temporary cache as scatter points.
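The scatter-point handling described above can be sketched as follows. This is an illustration under stated assumptions: hash values are represented as Python integers, the Hamming distance between hash values stands in for the word vector distance of the cluster forming condition, and the names `hamming`, `form_cluster_or_cache`, and `max_distance` are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hash values."""
    return bin(a ^ b).count("1")


def form_cluster_or_cache(current, scatter_cache, max_distance):
    """Try to build a new cluster from `current` and nearby cached scatter points.

    Returns the new cluster (a list of hash values), or None when no scatter
    point is within `max_distance`, in which case `current` is cached as a
    scatter point instead.
    """
    near = [p for p in scatter_cache if hamming(current, p) <= max_distance]
    if not near:
        scatter_cache.append(current)  # cannot form a cluster: keep as scatter point
        return None
    for p in near:
        scatter_cache.remove(p)        # absorbed points leave the scatter cache
    return [current] + near
```

A cluster returned by this sketch would still have to pass the service threshold check before the knowledge base is touched.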

Referring to fig. 2, a system architecture diagram of a self-learning system of a conversation robot in an embodiment of the present specification is shown. The system architecture comprises an online conversation service 40, a real-time log service 41 and a robot self-learning service 42. The online conversation service 40 is supported by the conversation server 10 and provides a conversation service for a user 44 by using the conversation robot 101; while conversing with the user 44, the conversation robot 101 writes the conversation data into the real-time log service 41 in real time. The robot self-learning service 42 is configured to listen in real time to the streaming dialogue data written into the real-time log service 41, to read the dialogue data from the real-time log service 41, and to distribute the read dialogue data to the nodes of the distributed machine cluster 30 to perform the robot self-learning calculation.

The real-time log service 41 is a log service provided by the log server 20, and the robot self-learning service 42 is a service provided by the distributed machine cluster 30. The real-time log service 41 presets its number of shards according to the actual data volume. Because the conversation robot 101 converses with a large number of users at the same time, the volume of conversation data is large and the number of shards is correspondingly large. In actual use, therefore, the real-time log service 41 usually has many shards, which can be denoted shard_1, shard_2, shard_3 … shard_N, where N is an integer greater than 3.

Accordingly, when the robot self-learning service 42 provides the service, each master node in the distributed machine cluster 30 reads the streaming dialogue data from the shards in the log server 20 in real time and distributes the read streaming dialogue data to the slave nodes in the distributed machine cluster 30 to perform the self-learning calculation. Each question cluster in the resulting question cluster set is treated as a temporary cluster, and all the temporary clusters are stored in the temporary cluster cache 43.

The master nodes in the robot self-learning service 42 comprise Master1, Master2, Master3 … MasterM, and the slave nodes comprise Slave1, Slave2, Slave3 … SlaveK; the temporary cluster cache 43 stores temporary cluster 1, temporary cluster 2, temporary cluster 3 … temporary cluster Q; wherein M, K and Q are all integers greater than 3.
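The fan-out from shards to master nodes to slave nodes can be sketched as follows. The specification does not state an assignment policy, so the round-robin scheme here is purely an assumption, as are the function name `distribute` and the dictionary layout of `shards`.

```python
from itertools import cycle


def distribute(shards, num_masters, num_slaves):
    """Assign shards to masters round-robin, then fan records out to slaves.

    `shards` maps a shard name to its list of records. Returns a dict mapping
    a slave index to a list of (master index, record) pairs.
    """
    assignments = {s: [] for s in range(num_slaves)}
    slave_ids = cycle(range(num_slaves))
    for i, name in enumerate(sorted(shards)):
        master = i % num_masters  # the master that reads this shard
        for record in shards[name]:
            assignments[next(slave_ids)].append((master, record))
    return assignments
```

In a real deployment the masters would read continuously from the log service rather than from in-memory lists; the sketch only shows the shape of the distribution.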

Thus, the distributed architecture and the real-time log service adopted by the system in the embodiment of the present specification enable the distributed machine cluster 30 to process the streaming dialogue data of users in real time. The data does not need to be collected in advance, and the questions asked by users can be learned and mined continuously around the clock. Because the user dialogue data acquired in real time is distributed to the slave nodes in the distributed machine cluster 30 for processing, the service throughput can be greatly improved, and real-time learning can be completed rapidly in dynamically changing scenarios with high timeliness requirements.

The system in the embodiment of the specification can be applied to large promotional events such as Double 11 and the Spring Festival red envelope campaign. As the business of such an event changes, the characteristics of the customer group show a pronounced trend over time, and the characteristics of user questions change dynamically in real time. In this case the system can learn user questions in real time, mine the standard questions of users, and feed them back to the conversation knowledge base; after acquiring the standard questions, the conversation knowledge base sets a corresponding response scheme in time, which can improve the real-time performance and response efficiency of the conversation robot.

Referring to fig. 3, it is an overall flowchart of the self-learning system of the dialog robot in the embodiment of the present specification. Firstly, executing step 301, collecting the conversation between a user and a conversation robot in real time; then, step 302 is executed, the dialogue is filtered according to the dialogue filtering rule, and the dialogue which does not need to be learned and mined by the dialogue robot can be filtered; after step 302, step 303 is executed to perform vectorization processing on the user question in the dialogue data, specifically using LsHash to perform vectorization processing; next, step 304 is executed to compare the current question with the central points of the temporary clusters, specifically, the question vector of the current question is obtained by using the LsHash, and the question vector of the current question is compared with the central points of each temporary cluster of the temporary cache to determine whether the same bits exist.

When step 304 finds one or more temporary clusters having the same bits, step 305 is executed to obtain the possibly similar clusters. Step 306 is then executed to perform SinglePass clustering, specifically calculating the similarity between the current question and each question cluster by using the SinglePass algorithm. Step 307 is then executed to determine whether the current question can be classified into a temporary cluster. If the current question can be classified into a temporary cluster in step 307, step 308 is executed to update the central point of that temporary cluster; if it cannot, step 309 is executed to obtain similar scatter points similar to the current question.

After step 309 is executed, step 310 is executed to perform SinglePass clustering on the current question and the similar scatter points. After step 310 is executed, step 311 is executed to determine whether a new cluster can be created. If yes, step 312 is executed to create and cache a new temporary cluster, and then step 308 is executed to update the central point of the temporary cluster; if not, step 313 is executed to cache the current question as a scatter point.

After the central point of the temporary cluster is updated in step 308, step 314 is executed to determine whether the cluster parameter of the temporary cluster reaches the service threshold; if so, step 315 is executed to supplement and improve the conversation knowledge base.

In a second aspect, based on the same technical concept, embodiments of the present specification provide a self-learning method for a conversation robot, as shown in fig. 4, including:

step S402, obtaining user question data in real time through log service;

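step S404, executing robot self-learning calculation on the user question data acquired in real time in a distributed manner, wherein the step of executing the robot self-learning calculation comprises: vectorizing the user question data by using locality sensitive hashing, and clustering the obtained question vectors of the user question data to obtain a question cluster set; and constructing an inverted index for the question cluster set by using the question vectors of the user question data.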

In an optional implementation, the vectorizing the user question data using locality sensitive hashing includes:

vectorizing the user question data by using local sensitive hash to obtain a question vector of the user question data;

and clustering the question vectors of the user question data to obtain the question cluster set.
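A minimal sketch of locality sensitive hashing by random hyperplane projection is given below. It is not the LsHash implementation referred to in this specification; it assumes each user question has already been converted to a numeric feature vector by some earlier step, and the names `make_hyperplanes` and `lsh_bits` are hypothetical.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw `num_bits` random Gaussian hyperplanes in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_bits(features, hyperplanes):
    """Return one bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [
        1 if sum(f * w for f, w in zip(features, plane)) >= 0.0 else 0
        for plane in hyperplanes
    ]
```

Feature vectors pointing in similar directions tend to fall on the same side of most hyperplanes, so similar questions tend to share many bits of the resulting question vector.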

In an optional implementation manner, after clustering the obtained question vectors of the user question data to obtain a question class cluster, the method further includes:

and aiming at each question cluster in the question cluster set, acquiring a cluster center of the question cluster, acquiring a question vector closest to the cluster center from the question cluster, and taking the acquired question vector as a central point of the question cluster.
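The selection of the central point can be sketched as follows, assuming the cluster center is the arithmetic mean of the member vectors and closeness is measured by squared Euclidean distance. Both choices are assumptions, and the function name `center_point` is hypothetical.

```python
def center_point(vectors):
    """Return the member vector closest to the mean of all member vectors."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    def sq_dist(v):
        # Squared Euclidean distance from a member vector to the cluster mean.
        return sum((x - m) ** 2 for x, m in zip(v, mean))

    return min(vectors, key=sq_dist)
```

Because the returned central point is always an actual question vector of the cluster, it can be compared bit by bit with incoming question vectors.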

In a third aspect, based on the same technical concept, embodiments of the present specification provide a self-learning method for a conversation robot, as shown in fig. 5, including:

step S502, obtaining current question data of a current user;

step S504, vectorizing the current question data by using local sensitive Hash to obtain a question vector of the current question data;

step S506, according to the question vector and the inverted index of the current question data, acquiring a similar question cluster similar to the current question data from a question cluster set;

step S508, similarity calculation is carried out on the current question data and each question cluster in the similar question clusters, and a target question cluster matched with the current question data is obtained, wherein the similarity between the target question cluster and the current question data is not less than preset similarity; and updating the target question cluster, and updating the inverted index according to the updated target question cluster.
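The lookup in step S506 can be sketched with an inverted index keyed on bit positions of the cluster central points. The key layout (position, bit value), the `min_shared` cutoff, and the function names below are assumptions made for illustration; the specification only states that the index is built from the central points.

```python
from collections import Counter, defaultdict


def build_inverted_index(centers):
    """Map each (bit position, bit value) pair to the ids of clusters whose center has it."""
    index = defaultdict(set)
    for cluster_id, bits in centers.items():
        for pos, bit in enumerate(bits):
            index[(pos, bit)].add(cluster_id)
    return index


def similar_clusters(index, query_bits, min_shared):
    """Return ids of clusters whose center shares at least `min_shared` bits with the query."""
    shared = Counter()
    for pos, bit in enumerate(query_bits):
        for cluster_id in index.get((pos, bit), ()):
            shared[cluster_id] += 1
    return {cid for cid, n in shared.items() if n >= min_shared}
```

The index narrows the candidates cheaply; the more expensive similarity calculation of step S508 then runs only on the returned clusters.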

In an optional implementation manner, the performing, by the processor, a similarity calculation on the current question data and each question class cluster in the similar question class clusters includes:

and calculating the similarity between the current question data and each question cluster based on a SinglePass algorithm to obtain the similarity between the current question data and each question cluster.
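The assignment decision of a single-pass clustering step can be sketched as follows. The similarity measure used here, the fraction of agreeing bits between the question vector and a cluster central point, is an assumption, as are the names `bit_similarity` and `single_pass_assign`.

```python
def bit_similarity(a, b):
    """Fraction of positions where two equal-length bit vectors agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def single_pass_assign(query, centers, threshold):
    """Return the id of the best-matching cluster, or None if none reaches `threshold`."""
    best_id, best_sim = None, threshold
    for cluster_id, center in centers.items():
        sim = bit_similarity(query, center)
        if sim >= best_sim:
            best_id, best_sim = cluster_id, sim
    return best_id
```

A return value of None corresponds to the case in which the current question cannot be classified into any existing cluster and the scatter-point path is taken instead.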

In an alternative embodiment, after updating the target question cluster, the method further comprises:

acquiring the updated clustering center of the target question cluster;

and acquiring a question vector closest to the clustering center of the target question cluster from the updated target question cluster, and taking the acquired question vector as the center point of the updated target question cluster.

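In an optional implementation manner, after updating the target question cluster, the method further includes: if the updated service parameter of the target question cluster is not less than the service threshold, updating the conversation knowledge base corresponding to the conversation robot.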

In an optional implementation manner, after performing similarity calculation on the current question data and each question class cluster in the similar question class clusters, the method further includes:

if the similarity between the current question data and any one of the similar question clusters is calculated to be smaller than the preset similarity, obtaining similar scattered points similar to the current question data, performing cluster calculation on the current question data and the similar scattered points to obtain a new question cluster consisting of the current question data and the similar scattered points, and adding the new question cluster into the question cluster set.

In a fourth aspect, based on the same technical concept, embodiments of the present specification provide a self-learning apparatus for a conversation robot, as shown in fig. 6, including:

a question data acquiring unit 601 configured to acquire user question data in real time through a log service;

the robot self-learning unit 602 is configured to perform robot self-learning calculation on user question data acquired in real time in a distributed manner, where the performing robot self-learning calculation includes: vectorizing the user question data by using local sensitive hashing, and clustering the obtained question vectors of the user question data to obtain a question cluster set; and constructing a reverse index aiming at the question cluster set by utilizing the question vector of the user question data.

In an optional implementation manner, the robot self-learning unit 602 is configured to perform vectorization processing on the user question data by using a locality sensitive hash, so as to obtain a question vector of the user question data, and to cluster the question vectors of the user question data to obtain the question cluster set.

In an optional embodiment, the self-learning apparatus further comprises:

and the cluster center point updating unit is used for clustering the obtained question vectors of the user question data to obtain a question cluster set, then acquiring the cluster center of the question cluster for each question cluster in the question cluster set, acquiring a question vector closest to the cluster center from the question cluster, and taking the acquired question vector as the center point of the question cluster.

In a fifth aspect, based on the same technical concept, embodiments of the present specification provide a self-learning apparatus for a conversation robot, as shown in fig. 7, including:

a current data obtaining unit 701, configured to obtain current question data of a current user;

a vectorization processing unit 702, configured to perform vectorization processing on the current question data by using locality sensitive hash, so as to obtain a question vector of the current question data;

a similar question cluster acquiring unit 703, configured to acquire a similar question cluster similar to the current question data from the question cluster set provided in the second aspect according to the question vector of the current question data and the inverted index provided in the second aspect;

an updating unit 704, configured to perform similarity calculation on the current question data and each question cluster in the similar question cluster to obtain a target question cluster matched with the current question data, where a similarity between the target question cluster and the current question data is not less than a preset similarity; and updating the target question cluster, and updating the inverted index according to the updated target question cluster.

In an optional implementation manner, the updating unit 704 is configured to calculate, based on a SinglePass algorithm, a similarity between the current question data and each question class cluster, and obtain a similarity between the current question data and each question class cluster.

In an optional implementation manner, the updating unit 704 is configured to, after updating the target question cluster, obtain a cluster center of the updated target question cluster; and acquiring a question vector closest to the clustering center of the target question cluster from the updated target question cluster, and taking the acquired question vector as the center point of the updated target question cluster.

In an optional implementation manner, the updating unit 704 is configured to, after updating the target question cluster, update the dialog knowledge base corresponding to the dialog robot if the updated service parameter of the target question cluster is not less than the service threshold.

In an optional embodiment, the self-learning apparatus further comprises:

and the new cluster acquiring unit is used for acquiring similar scattered points similar to the current question data if the similarity between the current question data and any one of the similar question clusters is calculated to be smaller than the preset similarity after the similarity calculation is carried out on the current question data and each of the similar question clusters, carrying out cluster calculation on the current question data and the similar scattered points to obtain a new question cluster consisting of the current question data and the similar scattered points, and adding the new question cluster into the question cluster set.

In a sixth aspect, based on the same inventive concept as the self-learning method of the dialog robot in the foregoing embodiment, an embodiment of the present specification further provides an electronic device, as shown in fig. 8, including a memory 804, a processor 802, and a computer program stored on the memory 804 and operable on the processor 802, where the processor 802 executes the program to implement the steps of any one of the methods of the self-learning method of the dialog robot described above.

In fig. 8, a bus architecture is represented by bus 800. Bus 800 may include any number of interconnected buses and bridges, and links together various circuits including one or more processors, represented by processor 802, and memory, represented by memory 804. The bus 800 may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art and therefore will not be described any further herein. A bus interface 805 provides an interface between the bus 800 and the receiver 801 and transmitter 803. The receiver 801 and the transmitter 803 may be the same element, i.e., a transceiver, providing a means for communicating with various other apparatus over a transmission medium. The processor 802 is responsible for managing the bus 800 and general processing, and the memory 804 may be used for storing data used by the processor 802 in performing operations.

In a seventh aspect, based on the inventive concept of the self-learning method of the dialogue robot as in the previous embodiments, the present specification further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of any one of the methods of the self-learning method of the dialogue robot as described above.

The description has been presented with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the description. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present specification have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all changes and modifications that fall within the scope of the specification.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present specification without departing from the spirit and scope of the specification. Thus, if such modifications and variations of the present specification fall within the scope of the claims of the present specification and their equivalents, the specification is intended to include such modifications and variations.

Claims

1. A self-learning method of a dialogue robot, comprising:

acquiring user question data in real time through log service;

clustering the question vectors of the user question data to obtain a question cluster set;

constructing a reverse index for the question cluster set by using the question vector of the user question data;

wherein, the constructing the reverse index for the question cluster set by using the question vector of the user question data comprises:

clustering the obtained question vectors of the user question data to obtain a question cluster set, then aiming at each question cluster in the question cluster set, obtaining a cluster center of the question cluster, obtaining a question vector closest to the cluster center from the question cluster, and taking the obtained question vector as a central point of the question cluster;

and constructing an inverted index based on the central point of each question cluster in the question cluster set.

2. A self-learning method of a dialogue robot, comprising:

acquiring current question data of a current user;

acquiring similar question clusters similar to the current question data from the question cluster set established by the method according to the claim 1 according to the question vectors of the current question data and the inverted index established by the method according to the claim 1;

similarity calculation is carried out on the current question data and each question cluster in the similar question clusters, and a target question cluster matched with the current question data is obtained, wherein the similarity between the target question cluster and the current question data is not less than preset similarity;

and updating the target question cluster, and updating the inverted index according to the updated target question cluster.

3. The method of claim 2, wherein the calculating the similarity of the current question data and each of the similar question clusters comprises:

and calculating the similarity between the current question data and each question cluster in the similar question cluster based on a SinglePass algorithm to obtain the similarity between the current question data and each question cluster in the similar question cluster.

4. The method of claim 2, after updating the target question cluster, the method further comprising:

acquiring the updated clustering center of the target question cluster;

5. The method of claim 4, after updating the target question cluster, the method further comprising:

and if the updated service parameters of the target question cluster are not smaller than the service threshold, updating a conversation knowledge base corresponding to the conversation robot, wherein the service parameters comprise the size of the class cluster and the heat of the class cluster.

6. The method of claim 3, after performing similarity calculations on the current question data and each of the similar question class clusters, the method further comprising:

if the similarity between the current question data and any one of the similar question clusters is calculated to be smaller than the preset similarity, acquiring similar scatter points similar to the current question data;

and performing cluster calculation on the current question data and the similar scattered points to obtain a new question cluster consisting of the current question data and the similar scattered points, and adding the new question cluster into the question cluster set.

7. A self-learning apparatus of a dialogue robot, comprising:

the robot self-learning unit is used for executing robot self-learning calculation on user question data acquired in real time in a distributed mode, and the executing robot self-learning calculation step comprises the following steps: vectorizing the user question data by using local sensitive hash to obtain a question vector of the user question data; clustering the question vectors of the user question data to obtain a question cluster set; constructing a reverse index for the question cluster set by using the question vector of the user question data;

a cluster center point updating unit, configured to cluster the obtained question vectors of the user question data to obtain a question cluster set, obtain a cluster center of the question cluster for each question cluster in the question cluster set, obtain a question vector closest to the cluster center from the question cluster, and use the obtained question vector as a center point of the question cluster;

the robot self-learning unit is used for constructing an inverted index based on the central point of each question cluster in the question cluster set.

8. A self-learning apparatus of a dialogue robot, comprising:

a similar question cluster obtaining unit, configured to obtain a similar question cluster similar to the current question data from the question cluster set established by the method according to claim 1, according to the question vector of the current question data and the inverted index established by the method according to claim 1;

9. The apparatus according to claim 8, wherein the updating unit is configured to calculate a similarity between the current question data and each question class cluster in the similar question class clusters based on a SinglePass algorithm, and obtain the similarity between the current question data and each question class cluster in the similar question class clusters.

10. The apparatus according to claim 8, wherein the updating unit is configured to, after updating the target question cluster, obtain a cluster center of the updated target question cluster; and acquiring a question vector closest to the clustering center of the target question cluster from the updated target question cluster, and taking the acquired question vector as the center point of the updated target question cluster.

11. The apparatus according to claim 10, wherein the updating unit is configured to, after updating the target question cluster, update the dialog knowledge base corresponding to the dialog robot if the updated service parameter of the target question cluster is not smaller than a service threshold, where the service parameter includes a cluster size and a cluster heat.

12. The apparatus of claim 9, further comprising:

13. A self-learning system for a conversation robot including a conversation server, a log server and a distributed machine cluster, comprising:

vectorizing user question data by using local sensitive hash to obtain a question vector of the user question data;

14. The system of claim 13, the distributed cluster of machines to obtain current questioning data of a current user during execution of a robotic self-learning computation; vectorizing the current question data by using local sensitive hash to obtain a question vector of the current question data; according to the question vector of the current question data and the inverted index, acquiring a similar question cluster similar to the current question data from the question cluster set; similarity calculation is carried out on the current question data and each question cluster in the similar question clusters, and a target question cluster matched with the current question data is obtained, wherein the similarity between the target question cluster and the current question data is not less than preset similarity; and updating the target question cluster, and updating the inverted index according to the updated target question cluster.

15. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of claims 1-6 when executing the program.

16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.