CA3156172A1 - Text-clustering-based customer service log backflow method and apparatus thereof - Google Patents

Text-clustering-based customer service log backflow method and apparatus thereof

Info

Publication number
CA3156172A1
Authority
CA
Canada
Prior art keywords
question
logs
cluster
clustering
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA3156172A
Other languages
French (fr)
Inventor
Guodong Zhang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
10353744 Canada Ltd
Original Assignee
10353744 Canada Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 10353744 Canada Ltd filed Critical 10353744 Canada Ltd
Publication of CA3156172A1 publication Critical patent/CA3156172A1/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text-clustering-based customer service log backflow method and an apparatus thereof relate to the technical field of artificial intelligence, and address the inefficiency of manually flowing user questions back to a knowledge base. The method includes: acquiring, from a customer service system, plural question logs for which no matching answer was found; clustering the question logs by means of place recognition and intent recognition, so that every resulting cluster includes at least one said question log; and, based on a total number of the question logs in each said cluster and a number of the identical question logs in each said cluster, sieving the question logs from the cluster, tagging them, and making them flow back to a knowledge base. The apparatus applies the method of the scheme as described above.

Description

TEXT-CLUSTERING-BASED CUSTOMER SERVICE LOG BACKFLOW METHOD
AND APPARATUS THEREOF
BACKGROUND OF THE INVENTION
Technical Field The present invention relates to the technical field of artificial intelligence, and more particularly to a text-clustering-based customer service log backflow method and an apparatus thereof.
Description of Related Art In Internet-related sectors, many businesses have built their own customer service systems to complement their products. Manually operated customer service is a labor-intensive department in a company and requires huge resources for personnel training and service delivery. With the development of artificial intelligence technology, businesses hoping to mitigate such a heavy operational burden have adopted AI customer service to assist or replace manned customer service.
Currently, most AI customer service schemes are based on matching with knowledge bases, and are essentially about matching a user question with questions recorded in a knowledge base.
Therefore, it is important that a knowledge base contains abundant questions.
However, since there has not been an efficient way to flow online questions from users back to the relevant knowledge base, questions not solved online by AI customer service systems are afterward processed using algorithms so that significant ones can be manually sieved out and supplemented to existing knowledge bases, thereby forming a closed loop online and training the algorithm to perform better.
The traditional data backflow mechanism is inefficient because tens of thousands of questions are generated every day and at least 20% of them are left unsolved by the AI customer service system. That is to say, the company staff has to manually identify significant questions from several thousand entries of data, which means an overwhelming workload.
Date Recue/Date Received 2022-04-21
To improve the efficiency of identifying significant questions, the conception is to cluster similar questions together using a clustering algorithm. Hence, for better performance of an AI customer service system, it is important to enrich the content of the knowledge base by flowing back online questions.
SUMMARY OF THE INVENTION
The objective of the present invention is to provide a text-clustering-based customer service log backflow method and an apparatus thereof, which are designed to solve issues about inefficiency of manually flowing user questions back to a knowledge base.
To achieve the foregoing objective, in a first aspect, the present invention provides a text-clustering-based customer service log backflow method, which comprises:
from a customer service system, acquiring plural question logs to which no matching answer was found;
clustering the question logs by means of place recognition and intent recognition, so that every resulting cluster includes at least one said question log; and
based on a total number of the question logs in each said cluster and a number of the identical question logs in each said cluster, sieving the question logs from the cluster, tagging them, and making them flow back to a knowledge base.
Preferably, the step of clustering the question logs by means of place recognition and intent recognition comprises:
performing word segmentation on the question logs, performing cleaning, vectorizing words so as to obtain word vectors, and tagging each said word with a part of speech;
performing place clustering on the question logs according to a word vector of each said word in the question logs and the words having part-of-speech tagging results as nouns, so as to obtain at least one initial cluster; and
for the question logs receiving place clustering, performing intent clustering on the question logs according to the word vector of each said word in the question logs in each said initial cluster and the words having part-of-speech tagging results as verbs, so as to obtain at least one cluster.
More preferably, before the step of clustering the question logs by means of place recognition and intent recognition, the method further comprises:
presetting a weight distribution table for the place clustering and a weight distribution table for the intent clustering;
the weight distribution table for the place clustering includes plural place interval distributions of vector intervals of different question logs;
the weight distribution table for the intent clustering includes plural intent interval distributions of the vector intervals of different question logs.
Optionally, before the step of clustering the question logs by means of place recognition and intent recognition, the method further comprises:
constructing an interfering noun list and an interfering verb list, respectively;
assigning a reduced weight of a third proportion to each noun appearing in the interfering noun list; and
assigning a reduced weight of a fourth proportion to each verb appearing in the interfering verb list.
Preferably, the step of performing place clustering on the question logs according to a word vector of each said word in the question logs and the words having part-of-speech tagging results as nouns comprises:
based on all of the nouns in the question logs other than the nouns existing in the interfering noun list, setting a weighting weight of a first proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of the corresponding words; and
classifying the question logs whose vectors are in the same place interval distribution into an initial cluster of the same place.
Preferably, the step of performing intent clustering on the question logs according to the word vector of each said word in the question logs in each said initial cluster and the words having part-of-speech tagging results as verbs comprises:
based on all of the verbs in the question logs other than the verbs existing in the interfering verb list, setting a weighting weight of a second proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of the corresponding words; and
classifying the question logs whose vectors are in the same intent interval distribution into a cluster of the same intent.
More preferably, the step of based on a total number of the question logs in each said cluster and a number of the identical question logs in each said cluster, sieving the question logs from the cluster, tagging them, and making them flow back to a knowledge base comprises:
according to a total number of the question logs in each said cluster and a number of the identical question logs in each said cluster, computing a value-assignment result for each said question log in each said cluster using an assignment method; and
setting backflow priority of the question logs based on the value-assignment results, and preferentially tagging the question logs with high priority as backflow question logs.
Optionally, the method further comprises:
discarding each said question log that has been clustered in the current cycle but is not tagged as a backflow question log; and
clustering, in the next cycle, each said question log that has not been clustered in the current cycle, is not tagged as a backflow question log, and does not have a matching answer found.
As compared to the prior art, the text-clustering-based customer service log backflow method of the present invention provides the following beneficial effects:
In the text-clustering-based customer service log backflow method of the present invention, when a user raises a question in a customer service system, the customer service system will automatically give the user an answer matching the question, and will regularly collect question logs for which no matching answer is found. Then the plural question logs are clustered by means of place recognition and intent recognition, which means use of a secondary clustering algorithm.
During the first clustering, more attention is paid to nouns to differentiate places, and during the second clustering, more attention is paid to verbs to differentiate intents.
At last, based on a total number of the question logs in each said cluster and a number of the identical question logs in each said cluster, some question logs are sieved from the cluster and tagged before flowing back to a knowledge base.
It is thus clear that the present invention provides an effective scheme for flowing logs back to the knowledge base. Richness of a knowledge base has direct influence on the ability of the customer service system to answer questions from users. By continuously making supplement with questions that the customer service system once failed to answer or did not answer perfectly, the knowledge base of the customer service system can be dynamically updated and enriched, thereby solving issues about inefficiency of manually flowing user questions back to a knowledge base.
In a second aspect, the present invention provides a text-clustering-based customer service log backflow apparatus, which is used to implement the text-clustering-based customer service log backflow method of the foregoing technical scheme. The apparatus comprises:
a matching unit, for from a customer service system, acquiring plural question logs to which no matching answer was found;
a clustering unit, for clustering the question logs by means of place recognition and intent recognition, so that every resulting cluster includes at least one said question log; and
a sieving backflow unit, for based on a total number of the question logs in each said cluster and a number of the identical question logs in each said cluster, sieving the question logs from the cluster, tagging them, and making them flow back to a knowledge base.
As compared to the prior art, the disclosed text-clustering-based customer service log backflow apparatus provides beneficial effects that are similar to those provided by the disclosed method as enumerated above, and thus no repetitions are made herein.
In a third aspect, the present invention provides a computer-readable storage medium, in which a computer program is stored, wherein when run by a processor, the computer program executes the steps of the text-clustering-based customer service log backflow method as described above.
As compared to the prior art, the disclosed computer-readable storage medium provides beneficial effects that are similar to those provided by the disclosed text-clustering-based customer service log backflow method as enumerated above, and thus no repetitions are made herein.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings are provided herein for better understanding of the present invention and form a part of this disclosure. The illustrative embodiments and their descriptions are for explaining the present invention and by no means form any improper limitation to the present invention, wherein:
FIG. 1 is a flowchart of a text-clustering-based customer service log backflow method according to one embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating customer service logs flowing back to a knowledge base according to one embodiment of the present invention; and
FIG. 3 is a flowchart of a secondary clustering algorithm based on part-of-speech tags according to one embodiment of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
To make the foregoing objectives, features, and advantages of the present invention clearer and more understandable, the following description will be directed to some embodiments as depicted in the accompanying drawings to detail the technical schemes disclosed in these embodiments.
It is, however, to be understood that the embodiments referred to herein are only a part of all possible embodiments and thus not exhaustive. Based on the embodiments of the present invention, all other embodiments that can be conceived without creative labor by people of ordinary skill in the art shall be encompassed in the scope of the present invention.
Embodiment 1
The text-clustering-based customer service log backflow method that the present embodiment provides is mainly for addressing two issues about AI customer service business. First, what kind of question logs have to flow back; in other words, how to select question logs after clustering, and what to do with the remaining question logs. Second, how to ensure the accuracy of the clustering algorithm, so that it can accurately separate different places and different user intents and reasonably cluster question logs with the same place and the same intent into the same cluster. The text-clustering-based customer service log backflow method mainly comprises two parts:
1. Constructing a workflow for flowing back question logs
This part is mainly about an application module for flowing back question logs, including sources of backflow questions, a scheme for picking up backflow questions, and a manner of dealing with the remaining data after backflow, thereby providing a complete workflow framework for question log backflow.
2. Clustering scheme for question log backflow
This part is mainly about an algorithm module for question log backflow. Given that question logs are multi-place and multi-intent, the present invention uses a secondary clustering algorithm based on parts of speech to better identify the needs of questions regarding places and intents.
Referring to FIG. 1, the present embodiment provides a text-clustering-based customer service log backflow method, comprising:
from a customer service system, acquiring plural question logs to which no matching answer was found; clustering the question logs by means of place recognition and intent recognition, so that every resulting cluster includes at least one said question log; and based on a total number of the question logs in each said cluster and a number of the identical question logs in each said cluster, sieving the question logs from the cluster, tagging them, and making them flow back to a knowledge base.
Referring to FIG. 2, in the text-clustering-based customer service log backflow method of the present embodiment, in response to a question raised by a user at a customer service system, the customer service system uses a built-in algorithm to recognize the vector value of the question, and matches the question with standard questions or similar questions. If the vector of the question and the vector of a standard question are in a standard threshold interval, the system calls the answer of that standard question and sends it to the user. If the vector of the question and the vector of a similar question are in a similarity threshold interval, the system calls the answer of that similar question and sends it to the user. If the vector of the question is in neither the standard threshold interval nor the similarity threshold interval, the question is picked out as a question log without a matching answer and is to be processed with backflow.
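The matching flow above can be sketched as follows. The threshold values, field names, and function names here are hypothetical illustrations; the patent only speaks of a standard threshold interval and a similarity threshold interval without giving concrete values:

```python
import math

# Hypothetical thresholds; the patent does not specify concrete values.
STANDARD_THRESHOLD = 0.90
SIMILAR_THRESHOLD = 0.75

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def route_question(q_vec, standard_qs, similar_qs):
    """Return a matched answer, or None to mark the question for backflow."""
    # Try the standard questions first.
    best_std = max(standard_qs, key=lambda e: cosine(q_vec, e["vec"]), default=None)
    if best_std and cosine(q_vec, best_std["vec"]) >= STANDARD_THRESHOLD:
        return best_std["answer"]
    # Then fall back to the similar questions.
    best_sim = max(similar_qs, key=lambda e: cosine(q_vec, e["vec"]), default=None)
    if best_sim and cosine(q_vec, best_sim["vec"]) >= SIMILAR_THRESHOLD:
        return best_sim["answer"]
    return None  # no matching answer: collect as a backflow question log
```

A return value of `None` corresponds to the question being picked out for the backflow pipeline.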
Then the plural question logs are clustered by means of place recognition and intent recognition, which means use of a secondary clustering algorithm. During the first clustering, more attention is paid to nouns to differentiate places, and during the second clustering, more attention is paid to verbs to differentiate intents. At last, based on a total number of the question logs in each said cluster and a number of the identical question logs in each said cluster, some question logs are sieved from the cluster and tagged before flowing back to a knowledge base.
It is thus clear that the present invention provides an effective scheme for flowing logs back to the knowledge base. Richness of a knowledge base has direct influence on the ability of the customer service system to answer questions from users. By continuously making supplement with questions that the customer service system once failed to answer or did not answer perfectly, the knowledge base of the customer service system can be dynamically updated and enriched, thereby solving issues about inefficiency of manually flowing user questions back to a knowledge base.
In the embodiment, the step of clustering the question logs by means of place recognition and intent recognition comprises:
performing word segmentation on the question logs, performing cleaning, vectorizing words so as to obtain word vectors, and tagging each said word with a part of speech;
performing place clustering on the question logs according to a word vector of each said word in the question logs and the words having part-of-speech tagging results as nouns, so as to obtain at least one initial cluster; and for the question logs receiving place clustering, performing intent clustering on the question logs, so as to obtain at least one cluster according to the word vector of each said word in the question logs in each said initial cluster and the words having part-of-speech tagging results as verbs.
In the embodiment, the step of performing place clustering on the question logs according to a word vector of each said word in the question logs and the words having part-of-speech tagging results as nouns comprises:
based on all of the nouns in the question logs other than the nouns existing in the interfering noun list, setting a weighting weight of a first proportion; summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of the corresponding words; and classifying the question logs whose vectors are in the same place interval distribution into an initial cluster of the same place.
In the embodiment, the step of performing intent clustering on the question logs according to the word vector of each said word in the question logs in each said initial cluster and the words having part-of-speech tagging results as verbs comprises:
based on all of the verbs in the question logs other than the verbs existing in the interfering verb list, setting a weighting weight of a second proportion; summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of the corresponding words; and classifying the question logs whose vectors are in the same intent interval distribution into a cluster of the same intent.
In practical implementations, the data pre-processing module processes the question logs with word cleaning, word segmentation, vectorized representation and part-of-speech tagging.
Therein, the data pre-processing module comprises a full set of data pre-processing functions of the NLP (natural language processing) method for corpus data of the question logs, which include:
1. Word segmentation: An open-source word segmentation tool named jieba is used to segment a question log with the granularity of words, and word vectorization is realized based on FastText;
2. Character purification: The question log of the user receives string cleaning so as to remove special characters and punctuation marks, and typos are corrected; and
3. Part-of-speech tagging: An open-source tool, LTP, provided by the Harbin Institute of Technology is used to tag the words generated after segmentation with parts of speech, and some excluded items are set manually to prevent interference with the algorithm caused by meaningless verbs and nouns.
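The pre-processing steps above can be sketched roughly as follows. This is an illustrative, standard-library-only sketch: the whitespace tokenizer and toy dictionary tagger stand in for jieba and LTP (which the described system uses for Chinese text), and all names are hypothetical:

```python
import re

# Toy part-of-speech dictionary standing in for the LTP tagger;
# "n" marks nouns and "v" marks verbs, as in common Chinese tag sets.
POS_DICT = {"refund": "n", "order": "n", "cancel": "v", "request": "v"}

def preprocess(question):
    """Clean a question log and return (word, pos) pairs."""
    # Character purification: strip punctuation and special characters.
    cleaned = re.sub(r"[^\w\s]", "", question.lower())
    # Word segmentation (whitespace split here; jieba for Chinese text).
    words = cleaned.split()
    # Part-of-speech tagging; unknown words default to "x" (other).
    return [(w, POS_DICT.get(w, "x")) for w in words]
```

Word vectorization (realized with FastText in the described system) would then map each segmented word to its vector.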
Referring to FIG. 2, a secondary clustering module is then used to cluster the question logs based on place recognition and intent recognition. For all question logs to flow back, since the number of types is unknown, the SinglePass clustering algorithm, which does not require setting the number of types in advance, is used. The words are vectorized so as to obtain word vectors, and the weighted sum is taken as the sentence vector of the question log. The cosine similarity between vectors of question logs is used to measure the similarity between individual question logs.
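A minimal SinglePass implementation over sentence vectors might look like the sketch below. The running-mean centroid is an assumption; the patent does not specify how a cluster's representative vector is maintained:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(vectors, threshold=0.8):
    """Cluster sentence vectors without fixing the number of clusters in advance."""
    clusters = []  # each cluster: {"centroid": [...], "members": [indices]}
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # No cluster is similar enough: open a new one.
            clusters.append({"centroid": list(vec), "members": [i]})
        else:
            # Update the centroid as the running mean of member vectors.
            n = len(best["members"])
            best["centroid"] = [(c * n + v) / (n + 1)
                                for c, v in zip(best["centroid"], vec)]
            best["members"].append(i)
    return clusters
```

The `threshold` plays the role of the similarity interval boundary; in practice it would be tuned against the clustering results.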

1. First Clustering
Questions raised in customer service scenarios in the financial industry usually involve multiple places and multiple intents, and therefore direct clustering is not suitable for identifying the importance of places and intents. It is believed that places have to be differentiated first, to be consistent with the criterion for question classification in the knowledge base. Thus, the first clustering is focused on clustering questions of the same place into the same initial cluster. To this end, for a word tagged as a noun, a higher weight is set on its word vector, such as a weighting weight of a first proportion. For words tagged as other parts of speech, lower weights are set. The exact value of a weight may be adjusted depending on the clustering results. At last, based on the aggregate result of the sums of the word vectors of the words in the question logs, question logs in the same place interval distribution are preliminarily clustered. It is understandable that the place interval distribution refers to the interval distribution of the vectors of the question logs.
Every interval distribution corresponds to a cosine similarity interval of vectors of question logs.
For preventing meaningless nouns from causing interference, an interfering noun list is set in advance. Similarly, the nouns in the interfering noun list are assigned a lower weight, such as a reduced weight of a third proportion.
2. Second Clustering
The first clustering is primarily focused on the issue of place recognition, and the second clustering is primarily focused on the issue of intent recognition. Arguably, most intents are expressed by verbs, so for the second clustering, the words tagged as verbs are assigned a weight of a second proportion, and words of other parts of speech are weighted lower. On the basis of the initial clusters from the first clustering, the secondary clustering is performed on question logs in the same initial cluster. This means that, based on the aggregate results of the sums of the word vectors of the words in the question logs, question logs in the same intent interval distribution are clustered. It is understandable that the intent interval distribution refers to the interval distribution of the vectors of the question logs.
Every interval distribution corresponds to a cosine similarity interval of the vectors of the question logs. For preventing some meaningless verbs from causing interference, an interfering verb list is set in advance. Similarly, the verbs in the interfering verb list are subjected to weight reduction, such as a reduced weight of a fourth proportion.
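The part-of-speech weighting shared by both clustering passes can be sketched as below. The concrete weight values are illustrative assumptions; the patent refers to them only as the first through fourth proportions:

```python
def sentence_vector(tagged_words, word_vecs, focus_pos, interfering,
                    high=2.0, low=0.5, reduced=0.1):
    """Weighted sum of word vectors forming the sentence vector.

    focus_pos   -- "n" for the first (place) pass, "v" for the second (intent) pass
    interfering -- the interfering noun/verb list for the pass
    high/low/reduced are illustrative stand-ins for the proportions.
    """
    dim = len(next(iter(word_vecs.values())))
    total = [0.0] * dim
    for word, pos in tagged_words:
        vec = word_vecs.get(word)
        if vec is None:
            continue
        if word in interfering:
            w = reduced      # interfering word: reduced weight
        elif pos == focus_pos:
            w = high         # focus part of speech: weighting weight
        else:
            w = low          # other parts of speech: lower weight
        total = [t + w * v for t, v in zip(total, vec)]
    return total
```

Calling this with `focus_pos="n"` yields the vectors for the place clustering; calling it again with `focus_pos="v"` inside each initial cluster yields the vectors for the intent clustering.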
In the embodiment, before the step of clustering the question logs by means of place recognition and intent recognition, the method further comprises:
presetting a weight distribution table for the place clustering and a weight distribution table for the intent clustering; the weight distribution table for the place clustering includes plural place interval distributions of vector intervals of different question logs; the weight distribution table for the intent clustering includes plural intent interval distributions of the vector intervals of different question logs.
In the embodiment, before the step of clustering the question logs by means of place recognition and intent recognition, the method further comprises:
constructing an interfering noun list and an interfering verb list, respectively; assigning a reduced weight of a third proportion to each noun appearing in the interfering noun list; and assigning a reduced weight of a fourth proportion to each verb appearing in the interfering verb list.
It is understandable that the values of the first proportion, the second proportion, the third proportion, and the fourth proportion may be the same or may be different, and the present embodiment places no limitation thereto.
In the embodiment, the step of based on a total number of the question logs in each said cluster and a number of the identical question logs in each said cluster, sieving the question logs from the cluster, tagging them, and making them flow back to a knowledge base comprises:
according to a total number of the question logs in each said cluster and a number of the identical question logs in each said cluster, computing a value-assignment result for each said question log in each said cluster using an assignment method; and setting backflow priority of the question logs based on the value-assignment results, and preferentially tagging the question logs with high priority as backflow question logs.
In practical implementations, the clusters formed in the clustering process are sorted in descending order in terms of the number of question logs in every cluster. A cluster with more question logs may be regarded as a high-frequency class of questions that is difficult for the robot to solve, and shall be given higher priority for the purpose of backflow. A cluster with fewer question logs may be regarded as a low-frequency class of questions, and may not be representative. In addition, the number of times a question appears indicates the importance of that question. To pick backflow questions from the resulting clusters, for every question log the sum of the number of questions in the cluster containing the question log and the number of times the question in the question log appears is calculated, and the sums for all question logs are sorted in descending order. A certain number of question logs for backflow to the knowledge base is then selected from the top of the sorted list.
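The sieving step above, scoring each question log by cluster size plus the number of identical occurrences and taking the top of the descending-sorted list, might be sketched as (function name and data shapes are hypothetical):

```python
from collections import Counter

def pick_backflow(clusters, top_k):
    """Tag the top_k highest-scoring question logs as backflow questions.

    clusters -- list of lists of question strings (one inner list per cluster)
    Score = cluster size + number of times the identical question appears.
    """
    scored = []
    for cluster in clusters:
        freq = Counter(cluster)
        for question in cluster:
            scored.append((len(cluster) + freq[question], question))
    # Sort descending by score; ties keep insertion order.
    scored.sort(key=lambda s: -s[0])
    # Deduplicate while preserving the descending-score order.
    seen, picked = set(), []
    for _, q in scored:
        if q not in seen:
            seen.add(q)
            picked.append(q)
        if len(picked) == top_k:
            break
    return picked
```

With this scoring, a question that recurs inside a large cluster outranks both a unique question in the same cluster and any question in a small cluster.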
The embodiment further comprises: discarding each said question log that has been clustered in the current cycle but is not tagged as a backflow question log; and clustering, in the next cycle, each said question log that has not been clustered in the current cycle, is not tagged as a backflow question log, and does not have a matching answer found.
In practical implementations, a question log that is not to flow back to the knowledge base and is not put in a cluster may be handled in either of two ways according to the present invention.
The first way is to discard irrelevant data so as to prevent repetition. The second way is to re-cluster question logs that have not received any operation (such as questions in clusters having fewer questions) together with data generated the next day. This helps prevent low-frequency data from never coming into any cluster. The specific front-end operation may comprise: for a cluster that has operations, if there is not any operated question log in the cluster, the cluster is discarded directly, and for a cluster that has no operations, all the question logs in the cluster are retained to participate in clustering performed the next day.
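The two-way retention policy can be sketched as a simple filter; the function name and data shapes are hypothetical:

```python
def next_cycle_pool(clustered, unclustered, backflow_tagged):
    """Decide which question logs survive into the next day's clustering run.

    Clustered-but-untagged logs are discarded to prevent repetition;
    unclustered (low-frequency) logs are carried over so they can still
    form a cluster later.
    """
    carried = [q for q in unclustered if q not in backflow_tagged]
    discarded = [q for q in clustered if q not in backflow_tagged]
    return carried, discarded
```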
The core issue of the present embodiment is to provide a solution for backflow of question logs of a customer service system. The solution includes a specific flow design for sources of backflow questions, a scheme for picking up backflow questions, a manner of dealing with the remaining data after backflow, and a secondary clustering algorithm for multi-place, multi-intent differentiation. Its core innovations include:
1. The present embodiment provides a specific flow design for sources of backflow questions and a scheme for picking up backflow questions. The design gives consideration to both the importance of types and the importance of questions themselves, so as to reasonably decide whether data shall be retained or discarded, thereby ensuring the solution can operate stably and reliably.

2. The present embodiment provides a secondary clustering algorithm based on part-of-speech tagging. In view that a single clustering algorithm cannot adequately deal with multi-place, multi-intent cases in customer service logs, a secondary clustering algorithm based on part-of-speech tagging is additionally employed. The first clustering is focused on nouns to differentiate places, and the second clustering is focused on verbs to differentiate intents.
Corresponding to the way the knowledge base stores questions, questions are clustered successively in terms of place and intent, respectively, thereby ensuring effective differentiation provided by clustering.
To sum up, the present embodiment has the following beneficial effects:
1. According to the disclosed scheme, logs are fed back to allow effective enrichment of the knowledge base. Richness of a knowledge base has direct influence on the ability of the customer service system to answer questions from users. By continuously making supplement with questions that the customer service system once failed to answer or did not answer perfectly, the knowledge base of the customer service system can be dynamically updated and enriched;
2. The disclosed scheme sets the criterion for sieving clustered questions with consideration to both the importance of types of the question logs and the importance of questions themselves.
The questions remaining after sieving are retained or discarded in a manner that prevents repetition and properly takes care of low-frequency yet sustained question logs, thereby ensuring long-term stability of the scheme and allowing simple yet efficient service tagging; and
3. The secondary clustering algorithm based on part-of-speech tagging in the disclosed scheme can effectively differentiate places and intents, and is less affected by meaningless words.
The priority setting policy accords to the way the knowledge base classifies questions, so the clustering results are representative.
Embodiment 2 The present embodiment provides a text-clustering-based customer service log backflow apparatus, comprising:
a matching unit, for from a customer service system, acquiring plural question logs to which no matching answer was found;

Date Recue/Date Received 2022-04-21 a clustering unit, for clustering the question logs by means of place recognition and intent recognition, so that every resulting cluster includes at least one said question log; and a sieving backflow unit, for based on a total number of the question logs in each said cluster and a number of the identical question logs in each said cluster, sieving the question logs from the cluster, tagging them, and making them flow back to a knowledge base.
As compared to the prior art, the disclosed text-clustering-based customer service log backflow apparatus provides beneficial effects that are similar to those provided by the disclosed text-clustering-based customer service log backflow method as enumerated above, and thus no repetitions are made herein.
Embodiment 3 The present embodiment provides a computer-readable storage medium, which stores therein a computer program that when run be a processor executes the steps of the text-clustering-based customer service log backflow method as described above.
As compared to the prior art, the disclosed computer-readably storage medium provides beneficial effects that are similar to those provided by the disclosed text-clustering-based customer service log backflow method as enumerated above, and thus no repetitions are made herein.
As will be appreciated by people of ordinary skill in the art, implementation of all or a part of the steps of the method of the present invention as described previously may be realized by having a program instruct related hardware components. The program may be stored in a computer-readable storage medium, and the program is about performing the individual steps of the methods described in the foregoing embodiments. The storage medium may be a ROM/RAM, a hard drive, an optical disk, a memory card or the like.
The present invention has been described with reference to the preferred embodiments and it is understood that the embodiments are not intended to limit the scope of the present invention.
Moreover, as the contents disclosed herein should be readily understood and can be implemented by a person skilled in the art, all equivalent changes or modifications which do not depart from the concept of the present invention should be encompassed by the appended claims. Hence, the Date Recue/Date Received 2022-04-21 scope of the present invention shall only be defined by the appended claims.
Date Recue/Date Received 2022-04-21

Claims (252)

Claims:
1. An apparatus comprising:
a matching unit, configured to from a customer service system, acquire plural question logs to which no matching answer was found;
a clustering unit, configured to cluster the question logs by means of place recognition and intent recognition, wherein every resulting cluster includes at least one question log;
and a sieving backflow unit, configured to, based on a total number of the question logs in each cluster and a number of identical question logs in each cluster, sieve the question logs from the cluster, tag the question logs, and make the question logs flow back to a knowledge base.
2. The apparatus of claim 1, wherein clustering the question logs by means of the place recognition and the intent recognition comprises:
performing word segmentation on the question logs;
performing cleaning on the question logs;
vectorize words to obtain word vector;
tagging the words with a part-of-speech;
performing place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns, to obtain at least one initial cluster; and for the question logs receiving the place clustering, performing intent clustering on the question logs, to obtain at least one cluster according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs.

Date Recue/Date Received 2022-07-06
3. The apparatus of claim 2, further comprises:
presetting a weight distribution table for the place clustering and a weight distribution table for the intent clustering;
the weight distribution table for the place clustering includes plural place interval distributions of vector intervals of different question logs;
the weight distribution table for the intent clustering includes plural intent interval distributions of the vector intervals of the different question logs.
4. The apparatus of claim 3, further comprises:
constructing an interfering noun list and an interfering verb list;
assigning a reduced weight of a third proportion to each noun appearing in the interfering noun list; and assigning a reduced weight of a fourth proportion to each verb appearing in the interfering verb list.
5. The apparatus of claim 4, wherein performing place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns comprises:
based on all of the nouns in the question logs excluding the nouns existing in the interfering noun list, setting a weighting weight of a first proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of corresponding words; and classifying the question logs whose vectors are in same place interval distribution into an initial cluster of same place.

Date Recue/Date Received 2022-07-06
6. The apparatus of claim 5, wherein performing intent clustering on the question logs according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs comprises:
based on all of the verbs in the question logs excluding the verbs existing in the interfering verb list, setting a weighting weight of a second proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of the corresponding words; and classifying the question logs whose vectors are in a same intent interval distribution into a cluster of a same intent.
7. The apparatus of claim 6, wherein based on the total number of the question logs in each cluster and the number of the identical question logs in each cluster, sieving the question logs from the cluster, tagging the question logs, and making the question logs flow back to the knowledge base comprises:
according to the total number of the question logs in each cluster and the number of the identical question logs in each cluster, computing a value-assignment result for each question log in each cluster using an assignment apparatus;
setting backflow priority of the question logs based on the value-assignment results; and tagging the question logs with high priority as backflow question logs.
8. The apparatus of claim 1, further comprising:
discarding each question log clustered in current cycle but is not tagged as a backflow question log; and clustering each question log not clustered in the current cycle and is not tagged as the backflow question log and does not have the matching answer found in next cycle.

Date Recue/Date Received 2022-07-06
9. The apparatus of any one of claims 1 to 8, wherein a question is raised by a user at the customer service system, the customer service system uses a built-in algorithm to recognize a vector value of the question, and matches the question with standard questions or similar questions.
10. The apparatus of any one of claims 1 to 9, wherein the vector of the question and the vector of the standard question are in a standard threshold interval, the customer service system calls the answer of the standard question and sends the answer to the user.
11. The apparatus of any one of claims 1 to 10, wherein the vector of the question and the vector of the similar question are in a similarity threshold interval, the customer service system calls the answer of the similar question and sends the answer to the user.
12. The apparatus of any one of claims 1 to 11, wherein the vector of the question is neither the standard threshold interval nor the similarity threshold interval, the question is picked out as the question log without matching answer and to be processed.
13. The apparatus of any one of claims 1 to 12, wherein a data pre-processing module processes the question logs with word cleaning, word segmentation, vectorized representation and part-of-speech tagging.
14. The apparatus of any one of claims 1 to 13, wherein the data pre-processing module comprises a full set of data pre-processing functions of a natural language processing (NLP) method for corpus data of the question logs.
15. The apparatus of any one of claims 1 to 14, wherein word segmentation of the NLP method comprises an open-source word segmentation tool named jieba to segment the question log with granularity of words, and word vectorization is based on FastText;
16. The apparatus of any one of claims 1 to 15, wherein character purification of the NLP
method comprises the question log of the user receives string cleaning to remove special characters and punctuation marks, and typos are corrected.

Date Recue/Date Received 2022-07-06
17. The apparatus of any one of claims 1 to 16, wherein part-of-speech tagging of the NLP
method comprises an open-source tool LTP provided by Harbin Institute of Technology to tag the words generated after segmentation with parts of speech, and some excluded items are set manually to prevent interference with the algorithm caused by some meaningless verbs and nouns.
18. The apparatus of any one of claims 1 to 17, wherein all question logs to flow back, since the number of types is unknown, a SinglePass clustering algorithm which does not require setting the number of types in advance is used.
19. The apparatus of any one of claims 1 to 18, wherein cosine similarity between vectors of question logs measures a similarity between individual question logs.
20. The apparatus of any one of claims 1 to 19, wherein the words tagged as other parts of speech, lower weights are set.
21. The apparatus of any one of claims 1 to 20, wherein exact value of the weight is adjusted depending on clustering results.
22. The apparatus of any one of claims 1 to 21, wherein the place interval distribution refers to interval distribution of the vectors of the question logs.
23. The apparatus of any one of claims 1 to 22, wherein every interval distribution corresponds to a cosine similarity interval of vectors of question logs.
24. The apparatus of any one of claims 1 to 23, wherein the intent interval distribution refers to the interval distribution of the vectors of the question logs.
25. The apparatus of any one of claims 1 to 24, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are same.
26. The apparatus of any one of claims 1 to 24, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are different.
Date Recue/Date Received 2022-07-06
27. The apparatus of any one of claims 1 to 26, wherein the clusters formed in the clustering are sorted in a descending order in terms of number of question logs in every cluster.
28. The apparatus of any one of claims 1 to 27, wherein the cluster with more question logs is a high-frequency class of questions which is difficult to solve by a robot, and is given higher priority for backflow.
29. The apparatus of any one of claims 1 to 28, wherein the cluster with fewer question logs is a low-frequency class of questions, and not representative.
30. The apparatus of any one of claims 1 to 29, wherein number of appearing times of the question indicates an importance of the question.
31. The apparatus of any one of claims 1 to 30, wherein a number of question logs for backflow to the knowledge base is then selected from top of sorted list.
32. The apparatus of any one of claims 1 to 31, wherein for the cluster with operations, wherein there is not any operated question log in the cluster, the cluster is discarded directly, and for the cluster with no operations, all the question logs in the cluster are retained to participate in clustering performed next day.
33. A system comprising:
from a customer service system, acquiring plural question logs to which no matching answer was found;
clustering the question logs by means of place recognition and intent recognition, wherein every resulting cluster includes at least one question log; and based on a total number of the question logs in each cluster and a number of identical question logs in each cluster, from the cluster sieving the question logs, tagging the question logs, and making the question logs flow back to a knowledge base.
34. The system of claim 33, wherein clustering the question logs by means of the place recognition and the intent recognition comprises:

Date Recue/Date Received 2022-07-06 performing word segmentation on the question logs;
performing cleaning on the question logs;
vectorize words to obtain word vector;
tagging the words with a part-of-speech;
performing place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns, to obtain at least one initial cluster; and for the question logs receiving the place clustering, performing intent clustering on the question logs, to obtain at least one cluster according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs.
35. The system of claim 34, further comprises:
presetting a weight distribution table for the place clustering and a weight distribution table for the intent clustering;
the weight distribution table for the place clustering includes plural place interval distributions of vector intervals of different question logs;
the weight distribution table for the intent clustering includes plural intent interval distributions of the vector intervals of the different question logs.
36. The system of claim 35, further comprises:
constructing an interfering noun list and an interfering verb list;
assigning a reduced weight of a third proportion to each noun appearing in the interfering noun list; and assigning a reduced weight of a fourth proportion to each verb appearing in the interfering verb list.
37. The system of claim 36, wherein performing place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns comprises:
based on all of the nouns in the question logs excluding the nouns existing in the interfering noun list, setting a weighting weight of a first proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of corresponding words; and classifying the question logs whose vectors are in same place interval distribution into an initial cluster of same place.
38. The system of claim 37, wherein performing intent clustering on the question logs according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs comprises:
based on all of the verbs in the question logs excluding the verbs existing in the interfering verb list, setting a weighting weight of a second proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of the corresponding words; and classifying the question logs whose vectors are in a same intent interval distribution into a cluster of a same intent.
39. The system of claim 38, wherein based on the total number of the question logs in each cluster and the number of the identical question logs in each cluster, sieving the question logs from the cluster, tagging the question logs, and making the question logs flow back to the knowledge base comprises:

Date Recue/Date Received 2022-07-06 according to the total number of the question logs in each cluster and the number of the identical question logs in each cluster, computing a value-assignment result for each question log in each cluster using an assignment system;
setting backflow priority of the question logs based on the value-assignment results; and tagging the question logs with high priority as backflow question logs.
40. The system of claim 33, further comprising:
discarding each question log clustered in current cycle but is not tagged as a backflow question log; and clustering each question log not clustered in the current cycle and is not tagged as the backflow question log and does not have the matching answer found in next cycle.
41. The system of any one of claims 33 to 40, wherein a question is raised by a user at the customer service system, the customer service system uses a built-in algorithm to recognize a vector value of the question, and matches the question with standard questions or similar questions.
42. The system of any one of claims 33 to 41, wherein the vector of the question and the vector of the standard question are in a standard threshold interval, the customer service system calls the answer of the standard question and sends the answer to the user.
43. The system of any one of claims 33 to 42, wherein the vector of the question and the vector of the similar question are in a similarity threshold interval, the customer service system calls the answer of the similar question and sends the answer to the user.
44. The system of any one of claims 33 to 43, wherein the vector of the question is neither the standard threshold interval nor the similarity threshold interval, the question is picked out as the question log without matching answer and to be processed.

Date Recue/Date Received 2022-07-06
45. The system of any one of claims 33 to 44, wherein a data pre-processing module processes the question logs with word cleaning, word segmentation, vectorized representation and part-of-speech tagging.
46. The system of any one of claims 33 to 45, wherein the data pre-processing module comprises a full set of data pre-processing functions of a natural language processing (NLP) method for corpus data of the question logs.
47. The system of any one of claims 33 to 46, wherein word segmentation of the NLP method comprises an open-source word segmentation tool named jieba to segment the question log with granularity of words, and word vectorization is based on FastText;
48. The system of any one of claims 33 to 47, wherein character purification of the NLP method comprises the question log of the user receives string cleaning to remove special characters and punctuation marks, and typos are corrected.
49. The system of any one of claims 33 to 48, wherein part-of-speech tagging of the NLP
method comprises an open-source tool LTP provided by Harbin Institute of Technology to tag the words generated after segmentation with parts of speech, and some excluded items are set manually to prevent interference with the algorithm caused by some meaningless verbs and nouns.
50. The system of any one of claims 33 to 49, wherein all question logs to flow back, since the number of types is unknown, a SinglePass clustering algorithm which does not require setting the number of types in advance is used.
51. The system of any one of claims 33 to 50, wherein cosine similarity between vectors of question logs measures a similarity between individual question logs.
52. The system of any one of claims 33 to 51, wherein the words tagged as other parts of speech, lower weights are set.
53. The system of any one of claims 33 to 52, wherein exact value of the weight is adjusted depending on clustering results.
Date Recue/Date Received 2022-07-06
54. The system of any one of claims 33 to 53, wherein the place interval distribution refers to interval distribution of the vectors of the question logs.
55. The system of any one of claims 33 to 54, wherein every interval distribution corresponds to a cosine similarity interval of vectors of question logs.
56. The system of any one of claims 33 to 55, wherein the intent interval distribution refers to the interval distribution of the vectors of the question logs.
57. The system of any one of claims 33 to 56, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are same.
58. The system of any one of claims 33 to 56, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are different.
59. The system of any one of claims 33 to 58, wherein the clusters formed in the clustering are sorted in a descending order in terms of number of question logs in every cluster.
60. The system of any one of claims 33 to 59, wherein the cluster with more question logs is a high-frequency class of questions which is difficult to solve by a robot, and is given higher priority for backflow.
61. The system of any one of claims 33 to 60, wherein the cluster with fewer question logs is a low-frequency class of questions, and not representative.
62. The system of any one of claims 33 to 61, wherein number of appearing times of the question indicates an importance of the question.
63. The system of any one of claims 33 to 62, wherein a number of question logs for backflow to the knowledge base is then selected from top of sorted list.
64. The system of any one of claims 33 to 63, wherein for the cluster with operations, wherein there is not any operated question log in the cluster, the cluster is discarded directly, and for the cluster with no operations, all the question logs in the cluster are retained to participate in clustering performed next day.

Date Recue/Date Received 2022-07-06
65. A method comprising:
from a customer service system, acquiring plural question logs to which no matching answer was found;
clustering the question logs by means of place recognition and intent recognition, wherein every resulting cluster includes at least one question log; and based on a total number of the question logs in each cluster and a number of identical question logs in each cluster, from the cluster sieving the question logs, tagging the question logs, and making the question logs flow back to a knowledge base.
66. The method of claim 65, wherein clustering the question logs by means of the place recognition and the intent recognition comprises:
performing word segmentation on the question logs;
performing cleaning on the question logs;
vectorize words to obtain word vector;
tagging the words with a part-of-speech;
performing place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns, to obtain at least one initial cluster; and for the question logs receiving the place clustering, performing intent clustering on the question logs, to obtain at least one cluster according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs.
67. The method of claim 66, further comprises:

Date Recue/Date Received 2022-07-06 presetting a weight distribution table for the place clustering and a weight distribution table for the intent clustering;
the weight distribution table for the place clustering includes plural place interval distributions of vector intervals of different question logs;
the weight distribution table for the intent clustering includes plural intent interval distributions of the vector intervals of the different question logs.
68. The method of claim 67, further comprises:
constructing an interfering noun list and an interfering verb list;
assigning a reduced weight of a third proportion to each noun appearing in the interfering noun list; and assigning a reduced weight of a fourth proportion to each verb appearing in the interfering verb list.
69. The method of claim 68, wherein performing place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns comprises:
based on all of the nouns in the question logs excluding the nouns existing in the interfering noun list, setting a weighting weight of a first proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of corresponding words; and classifying the question logs whose vectors are in same place interval distribution into an initial cluster of same place.
70. The method of claim 69, wherein performing intent clustering on the question logs according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs comprises:

Date Recue/Date Received 2022-07-06 based on all of the verbs in the question logs excluding the verbs existing in the interfering verb list, setting a weighting weight of a second proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of the corresponding words; and classifying the question logs whose vectors are in a same intent interval distribution into a cluster of a same intent.
71. The method of claim 70, wherein based on the total number of the question logs in each cluster and the number of the identical question logs in each cluster, sieving the question logs from the cluster, tagging the question logs, and making the question logs flow back to the knowledge base comprises:
according to the total number of the question logs in each cluster and the number of the identical question logs in each cluster, computing a value-assignment result for each question log in each cluster using an assignment method;
setting backflow priority of the question logs based on the value-assignment results; and tagging the question logs with high priority as backflow question logs.
72. The method of claim 65, further comprising:
discarding each question log clustered in current cycle but is not tagged as a backflow question log; and clustering each question log not clustered in the current cycle and is not tagged as the backflow question log and does not have the matching answer found in next cycle.
73. The method of any one of claims 65 to 72, wherein a question is raised by a user at the customer service system, the customer service system uses a built-in algorithm to recognize a vector value of the question, and matches the question with standard questions or similar questions.
74. The method of any one of claims 65 to 73, wherein the vector of the question and the vector of the standard question are in a standard threshold interval, the customer service system calls the answer of the standard question and sends the answer to the user.
75. The method of any one of claims 65 to 74, wherein the vector of the question and the vector of the similar question are in a similarity threshold interval, the customer service system calls the answer of the similar question and sends the answer to the user.
76. The method of any one of claims 65 to 75, wherein the vector of the question is neither the standard threshold interval nor the similarity threshold interval, the question is picked out as the question log without matching answer and to be processed.
77. The method of any one of claims 65 to 76, wherein a data pre-processing module processes the question logs with word cleaning, word segmentation, vectorized representation and part-of-speech tagging.
78. The method of any one of claims 65 to 77, wherein the data pre-processing module comprises a full set of data pre-processing functions of a natural language processing (NLP) method for corpus data of the question logs.
79. The method of any one of claims 65 to 78, wherein word segmentation of the NLP method comprises an open-source word segmentation tool named jieba to segment the question log with granularity of words, and word vectorization is based on FastText;
80. The method of any one of claims 65 to 79, wherein character purification of the NLP
method comprises the question log of the user receives string cleaning to remove special characters and punctuation marks, and typos are corrected.
81. The method of any one of claims 65 to 80, wherein part-of-speech tagging of the NLP
method comprises an open-source tool LTP provided by Harbin Institute of Technology to tag the words generated after segmentation with parts of speech, and some excluded items are set manually to prevent interference with the algorithm caused by some meaningless verbs and nouns.
Date Recue/Date Received 2022-07-06
82. The method of any one of claims 65 to 81, wherein all question logs to flow back, since the number of types is unknown, a SinglePass clustering algorithm which does not require setting the number of types in advance is used.
83. The method of any one of claims 65 to 82, wherein cosine similarity between vectors of question logs measures a similarity between individual question logs.
84. The method of any one of claims 65 to 83, wherein the words tagged as other parts of speech, lower weights are set.
85. The method of any one of claims 65 to 84, wherein exact value of the weight is adjusted depending on clustering results.
86. The method of any one of claims 65 to 85, wherein the place interval distribution refers to interval distribution of the vectors of the question logs.
87. The method of any one of claims 65 to 86, wherein every interval distribution corresponds to a cosine similarity interval of vectors of question logs.
88. The method of any one of claims 65 to 87, wherein the intent interval distribution refers to the interval distribution of the vectors of the question logs.
89. The method of any one of claims 65 to 88, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are same.
90. The method of any one of claims 65 to 88, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are different.
91. The method of any one of claims 65 to 90, wherein the clusters formed in the clustering are sorted in a descending order in terms of number of question logs in every cluster.
92. The method of any one of claims 65 to 91, wherein the cluster with more question logs is a high-frequency class of questions which is difficult to solve by a robot, and is given higher priority for backflow.

Date Recue/Date Received 2022-07-06
93. The method of any one of claims 65 to 92, wherein the cluster with fewer question logs is a low-frequency class of questions, and not representative.
94. The method of any one of claims 65 to 93, wherein number of appearing times of the question indicates an importance of the question.
95. The method of any one of claims 65 to 94, wherein a number of question logs for backflow to the knowledge base is then selected from top of sorted list.
96. The method of any one of claims 65 to 95, wherein for the cluster with operations, wherein there is not any operated question log in the cluster, the cluster is discarded directly, and for the cluster with no operations, all the question logs in the cluster are retained to participate in clustering performed next day.
97. A computer readable physical memory having stored thereon a computer program executed by a computer configured to:
from a customer service system, acquire plural question logs to which no matching answer was found;
cluster the question logs by means of place recognition and intent recognition, wherein every resulting cluster includes at least one question log; and based on a total number of the question logs in each cluster and a number of identical question logs in each cluster, sieve the question logs from each cluster, tag the question logs, and make the question logs flow back to a knowledge base.
98. The memory of claim 97, wherein clustering the question logs by means of the place recognition and the intent recognition comprises:
performing word segmentation on the question logs;
performing cleaning on the question logs;

vectorizing the words to obtain word vectors;
tagging the words with a part-of-speech;
performing place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns, to obtain at least one initial cluster; and for the question logs on which the place clustering has been performed, performing intent clustering according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs, to obtain at least one cluster.
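A minimal sketch of this pre-processing pipeline (cleaning, segmentation, vectorization, part-of-speech tagging) is given below. The whitespace tokenizer, the punctuation set, the toy three-dimensional lexicon, and the tag codes are illustrative assumptions; the claims themselves name jieba, FastText, and LTP for these steps.

```python
# Illustrative pre-processing sketch (claim 98). The tokenizer, punctuation
# set, and toy lexicon below are assumptions standing in for jieba/FastText/LTP.

PUNCT = set("?!.,;:？！。，")

def clean(text):
    """Cleaning: strip punctuation marks and surrounding whitespace."""
    return "".join(ch for ch in text if ch not in PUNCT).strip()

def segment(text):
    """Toy word segmentation: whitespace split (jieba in the claims)."""
    return text.split()

# Toy lexicon mapping a word to (word vector, part-of-speech tag).
LEXICON = {
    "cancel": ([0.5, 0.5, 0.0], "v"),
    "order":  ([0.0, 1.0, 0.0], "n"),
    "refund": ([1.0, 0.0, 0.0], "v"),
}

def preprocess(log):
    """Return (word, vector, pos) triples for one question log."""
    words = segment(clean(log))
    return [(w,) + LEXICON.get(w, ([0.0, 0.0, 0.0], "x")) for w in words]

print(preprocess("cancel order?"))
```

Unknown words fall back to a zero vector with a placeholder tag, mirroring out-of-vocabulary handling in a real vectorizer.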
99. The memory of claim 98, further comprising:
presetting a weight distribution table for the place clustering and a weight distribution table for the intent clustering;
the weight distribution table for the place clustering includes plural place interval distributions of vector intervals of different question logs;
the weight distribution table for the intent clustering includes plural intent interval distributions of the vector intervals of the different question logs.
100. The memory of claim 99, further comprising:
constructing an interfering noun list and an interfering verb list;
assigning a reduced weight of a third proportion to each noun appearing in the interfering noun list; and assigning a reduced weight of a fourth proportion to each verb appearing in the interfering verb list.

101. The memory of claim 100, wherein performing place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns comprises:
based on all of the nouns in the question logs excluding the nouns existing in the interfering noun list, setting a weighting weight of a first proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of corresponding words; and classifying the question logs whose vectors are in a same place interval distribution into an initial cluster of a same place.
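As one way to read claim 101, the question-log vector can be sketched as a weighted sum of word vectors, where nouns outside the interfering list carry the first-proportion weight and other words a lower weight. The concrete weight values and the interfering-noun list are assumptions; the claims leave them unspecified.

```python
# Weighted place-clustering vector sketch (claim 101). Weight values and
# the interfering-noun list are assumed for illustration.

NOUN_WEIGHT = 2.0      # the "first proportion" (assumed value)
OTHER_WEIGHT = 0.5     # lower weight for other parts of speech
INTERFERING_NOUNS = {"thing"}

def log_vector(tagged_words):
    """tagged_words: list of (word, vector, pos) triples for one log."""
    total = [0.0, 0.0, 0.0]
    for word, vec, pos in tagged_words:
        if pos == "n" and word not in INTERFERING_NOUNS:
            weight = NOUN_WEIGHT
        else:
            weight = OTHER_WEIGHT
        total = [t + weight * v for t, v in zip(total, vec)]
    return total

tagged = [("cancel", [0.5, 0.5, 0.0], "v"), ("order", [0.0, 1.0, 0.0], "n")]
print(log_vector(tagged))  # [0.25, 2.25, 0.0]
```

The intent-clustering step of claim 102 is the same computation with verbs and the interfering-verb list swapped in.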
102. The memory of claim 101, wherein performing intent clustering on the question logs according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs comprises:
based on all of the verbs in the question logs excluding the verbs existing in the interfering verb list, setting a weighting weight of a second proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of the corresponding words; and classifying the question logs whose vectors are in a same intent interval distribution into a cluster of a same intent.
103. The memory of claim 102, wherein based on the total number of the question logs in each cluster and the number of the identical question logs in each cluster, sieving the question logs from the cluster, tagging the question logs, and making the question logs flow back to the knowledge base comprises:
according to the total number of the question logs in each cluster and the number of the identical question logs in each cluster, computing a value-assignment result for each question log in each cluster using an assignment algorithm;

setting backflow priority of the question logs based on the value-assignment results; and tagging the question logs with high priority as backflow question logs.
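The sieving of claim 103 can be sketched as scoring each cluster from its total size and its count of identical logs, sorting in descending order, and tagging the representatives of the top clusters for backflow. The scoring formula and the `top_k` cutoff are assumptions; the claims do not fix a concrete value-assignment rule.

```python
# Cluster sieving sketch (claim 103). The value-assignment formula
# (cluster size plus repeat count) and the top_k cutoff are assumptions.

from collections import Counter

def backflow_candidates(clusters, top_k=2):
    """clusters: list of clusters, each a list of question-log strings."""
    scored = []
    for logs in clusters:
        repeat_count = Counter(logs).most_common(1)[0][1]
        scored.append((len(logs) + repeat_count, logs))  # assumed scoring
    scored.sort(key=lambda item: item[0], reverse=True)
    # The most frequent log of each high-priority cluster flows back.
    return [Counter(logs).most_common(1)[0][0] for _, logs in scored[:top_k]]

clusters = [["refund?", "refund?", "refund pls"], ["where order"], ["broken", "broken"]]
print(backflow_candidates(clusters))  # ['refund?', 'broken']
```

Larger, more repetitive clusters score higher, matching the priority rule that high-frequency classes flow back first.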
104. The memory of claim 97, further comprising:
discarding each question log that is clustered in the current cycle but is not tagged as a backflow question log; and clustering, in the next cycle, each question log that is not clustered in the current cycle, is not tagged as a backflow question log, and does not have a matching answer found.
105. The memory of any one of claims 97 to 104, wherein, when a question is raised by a user at the customer service system, the customer service system uses a built-in algorithm to recognize a vector value of the question, and matches the question with standard questions or similar questions.
106. The memory of any one of claims 97 to 105, wherein, when the vector of the question and the vector of the standard question are in a standard threshold interval, the customer service system calls the answer of the standard question and sends the answer to the user.
107. The memory of any one of claims 97 to 106, wherein, when the vector of the question and the vector of the similar question are in a similarity threshold interval, the customer service system calls the answer of the similar question and sends the answer to the user.
108. The memory of any one of claims 97 to 107, wherein, when the vector of the question is in neither the standard threshold interval nor the similarity threshold interval, the question is picked out as a question log without a matching answer, to be processed.
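The matching flow of claims 105 to 108 can be sketched as comparing the question vector against stored questions and falling through to the unanswered log when no threshold interval matches. The vectors, the threshold value, and the toy knowledge base are assumptions.

```python
# Answer-matching sketch (claims 105-108). Vectors, threshold, and the
# toy knowledge base are illustrative assumptions.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

KNOWLEDGE_BASE = {  # stored question vector -> answer
    (1.0, 0.0): "Refunds take 3-5 days.",
}
STANDARD_THRESHOLD = 0.95  # assumed threshold interval boundary

def answer(question_vec):
    """Return a stored answer, or None to log the question for backflow."""
    for kb_vec, kb_answer in KNOWLEDGE_BASE.items():
        if cosine(question_vec, kb_vec) >= STANDARD_THRESHOLD:
            return kb_answer
    return None  # picked out as a question log without a matching answer

print(answer((0.99, 0.05)))  # matches the stored standard question
print(answer((0.0, 1.0)))    # None -> joins the unanswered question logs
```

A second, looser threshold for similar questions (claim 107) would add one more comparison pass before falling through to `None`.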
109. The memory of any one of claims 97 to 108, wherein a data pre-processing module processes the question logs with word cleaning, word segmentation, vectorized representation and part-of-speech tagging.
110. The memory of any one of claims 97 to 109, wherein the data pre-processing module comprises a full set of data pre-processing functions of a natural language processing (NLP) method for corpus data of the question logs.
111. The memory of any one of claims 97 to 110, wherein word segmentation of the NLP method comprises using the open-source word segmentation tool jieba to segment the question log with a granularity of words, and word vectorization is based on FastText.
112. The memory of any one of claims 97 to 111, wherein character purification of the NLP method comprises performing string cleaning on the question log of the user to remove special characters and punctuation marks, and correcting typos.
113. The memory of any one of claims 97 to 112, wherein part-of-speech tagging of the NLP method comprises using the open-source tool LTP provided by Harbin Institute of Technology to tag the words generated after segmentation with parts of speech, wherein some excluded items are set manually to prevent meaningless verbs and nouns from interfering with the algorithm.
114. The memory of any one of claims 97 to 113, wherein, for all the question logs to flow back, since the number of types is unknown, a SinglePass clustering algorithm which does not require setting the number of types in advance is used.
115. The memory of any one of claims 97 to 114, wherein cosine similarity between vectors of question logs measures a similarity between individual question logs.
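Claims 114 and 115 name SinglePass clustering driven by cosine similarity; a minimal sketch follows. Comparing each incoming vector to the first member of each cluster and the 0.9 threshold are simplifying assumptions (implementations often compare against a running centroid instead).

```python
# SinglePass clustering sketch (claims 114-115): a vector joins the first
# sufficiently similar cluster, else founds a new one, so the number of
# clusters need not be fixed in advance. The threshold is an assumed value.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(vectors, threshold=0.9):
    clusters = []  # each cluster is a list of member vectors
    for v in vectors:
        for cluster in clusters:
            if cosine(v, cluster[0]) >= threshold:  # compare to seed member
                cluster.append(v)
                break
        else:
            clusters.append([v])
    return clusters

vecs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(len(single_pass(vecs)))  # 2
```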
116. The memory of any one of claims 97 to 115, wherein, for the words tagged as other parts of speech, lower weights are set.
117. The memory of any one of claims 97 to 116, wherein the exact value of the weight is adjusted depending on the clustering results.
118. The memory of any one of claims 97 to 117, wherein the place interval distribution refers to the interval distribution of the vectors of the question logs.

119. The memory of any one of claims 97 to 118, wherein every interval distribution corresponds to a cosine similarity interval of vectors of question logs.
120. The memory of any one of claims 97 to 119, wherein the intent interval distribution refers to the interval distribution of the vectors of the question logs.
121. The memory of any one of claims 97 to 120, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are the same.
122. The memory of any one of claims 97 to 120, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are different.
123. The memory of any one of claims 97 to 122, wherein the clusters formed in the clustering are sorted in descending order of the number of question logs in every cluster.
124. The memory of any one of claims 97 to 123, wherein a cluster with more question logs represents a high-frequency class of questions that is difficult for a robot to solve, and is given higher priority for backflow.
125. The memory of any one of claims 97 to 124, wherein a cluster with fewer question logs represents a low-frequency class of questions, and is not representative.
126. The memory of any one of claims 97 to 125, wherein the number of times a question appears indicates the importance of the question.
127. The memory of any one of claims 97 to 126, wherein a number of question logs for backflow to the knowledge base is then selected from the top of the sorted list.
128. The memory of any one of claims 97 to 127, wherein, for a cluster with operations, when there is not any operated question log in the cluster, the cluster is discarded directly, and, for a cluster with no operations, all the question logs in the cluster are retained to participate in the clustering performed the next day.

129. An apparatus comprising:
a matching unit, configured to acquire, from a customer service system, plural question logs to which no matching answer was found;
a clustering unit, configured to:
preset a weight distribution table for place clustering and a weight distribution table for intent clustering, wherein the weight distribution table for the place clustering includes plural place interval distributions of vector intervals of different question logs, wherein the weight distribution table for the intent clustering includes plural intent interval distributions of the vector intervals of the different question logs;
cluster the question logs by means of place recognition and intent recognition, wherein every resulting cluster includes at least one question log; and a sieving backflow unit, configured to, based on a total number of the question logs in each cluster and a number of identical question logs in each cluster, sieve the question logs from the cluster, tag the question logs, and make the question logs flow back to a knowledge base.
130. The apparatus of claim 129, wherein clustering the question logs by means of the place recognition and the intent recognition comprises:
performing word segmentation on the question logs;
performing cleaning on the question logs;
vectorizing the words to obtain word vectors;
tagging the words with a part-of-speech;

performing the place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns, to obtain at least one initial cluster; and for the question logs on which the place clustering has been performed, performing the intent clustering according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs, to obtain at least one cluster.
131. The apparatus of claim 130, further comprising:
constructing an interfering noun list and an interfering verb list;
assigning a reduced weight of a third proportion to each noun appearing in the interfering noun list; and assigning a reduced weight of a fourth proportion to each verb appearing in the interfering verb list.
132. The apparatus of claim 131, wherein performing place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns comprises:
based on all of the nouns in the question logs excluding the nouns existing in the interfering noun list, setting a weighting weight of a first proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of corresponding words; and classifying the question logs whose vectors are in a same place interval distribution into an initial cluster of a same place.
133. The apparatus of claim 132, wherein performing intent clustering on the question logs according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs comprises:

based on all of the verbs in the question logs excluding the verbs existing in the interfering verb list, setting a weighting weight of a second proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of the corresponding words; and classifying the question logs whose vectors are in a same intent interval distribution into a cluster of a same intent.
134. The apparatus of claim 133, wherein based on the total number of the question logs in each cluster and the number of the identical question logs in each cluster, sieving the question logs from the cluster, tagging the question logs, and making the question logs flow back to the knowledge base comprises:
according to the total number of the question logs in each cluster and the number of the identical question logs in each cluster, computing a value-assignment result for each question log in each cluster using an assignment algorithm;
setting backflow priority of the question logs based on the value-assignment results; and tagging the question logs with high priority as backflow question logs.
135. The apparatus of claim 129, further comprising:
discarding each question log that is clustered in the current cycle but is not tagged as a backflow question log; and clustering, in the next cycle, each question log that is not clustered in the current cycle, is not tagged as a backflow question log, and does not have a matching answer found.
136. The apparatus of any one of claims 129 to 135, wherein, when a question is raised by a user at the customer service system, the customer service system uses a built-in algorithm to recognize a vector value of the question, and matches the question with standard questions or similar questions.
137. The apparatus of any one of claims 129 to 136, wherein, when the vector of the question and the vector of the standard question are in a standard threshold interval, the customer service system calls the answer of the standard question and sends the answer to the user.
138. The apparatus of any one of claims 129 to 137, wherein, when the vector of the question and the vector of the similar question are in a similarity threshold interval, the customer service system calls the answer of the similar question and sends the answer to the user.
139. The apparatus of any one of claims 129 to 138, wherein, when the vector of the question is in neither the standard threshold interval nor the similarity threshold interval, the question is picked out as a question log without a matching answer, to be processed.
140. The apparatus of any one of claims 129 to 139, wherein a data pre-processing module processes the question logs with word cleaning, word segmentation, vectorized representation and part-of-speech tagging.
141. The apparatus of any one of claims 129 to 140, wherein the data pre-processing module comprises a full set of data pre-processing functions of a natural language processing (NLP) method for corpus data of the question logs.
142. The apparatus of any one of claims 129 to 141, wherein word segmentation of the NLP method comprises using the open-source word segmentation tool jieba to segment the question log with a granularity of words, and word vectorization is based on FastText.
143. The apparatus of any one of claims 129 to 142, wherein character purification of the NLP method comprises performing string cleaning on the question log of the user to remove special characters and punctuation marks, and correcting typos.
144. The apparatus of any one of claims 129 to 143, wherein part-of-speech tagging of the NLP method comprises using the open-source tool LTP provided by Harbin Institute of Technology to tag the words generated after segmentation with parts of speech, wherein some excluded items are set manually to prevent meaningless verbs and nouns from interfering with the algorithm.

145. The apparatus of any one of claims 129 to 144, wherein, for all the question logs to flow back, since the number of types is unknown, a SinglePass clustering algorithm which does not require setting the number of types in advance is used.
146. The apparatus of any one of claims 129 to 145, wherein cosine similarity between vectors of question logs measures a similarity between individual question logs.
147. The apparatus of any one of claims 129 to 146, wherein, for the words tagged as other parts of speech, lower weights are set.
148. The apparatus of any one of claims 129 to 147, wherein the exact value of the weight is adjusted depending on the clustering results.
149. The apparatus of any one of claims 129 to 148, wherein the place interval distribution refers to the interval distribution of the vectors of the question logs.
150. The apparatus of any one of claims 129 to 149, wherein every interval distribution corresponds to a cosine similarity interval of vectors of question logs.
151. The apparatus of any one of claims 129 to 150, wherein the intent interval distribution refers to the interval distribution of the vectors of the question logs.
152. The apparatus of any one of claims 129 to 151, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are the same.
153. The apparatus of any one of claims 129 to 151, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are different.
154. The apparatus of any one of claims 129 to 153, wherein the clusters formed in the clustering are sorted in descending order of the number of question logs in every cluster.
155. The apparatus of any one of claims 129 to 154, wherein a cluster with more question logs represents a high-frequency class of questions that is difficult for a robot to solve, and is given higher priority for backflow.

156. The apparatus of any one of claims 129 to 155, wherein a cluster with fewer question logs represents a low-frequency class of questions, and is not representative.
157. The apparatus of any one of claims 129 to 156, wherein the number of times a question appears indicates the importance of the question.
158. The apparatus of any one of claims 129 to 157, wherein a number of question logs for backflow to the knowledge base is then selected from the top of the sorted list.
159. The apparatus of any one of claims 129 to 158, wherein, for a cluster with operations, when there is not any operated question log in the cluster, the cluster is discarded directly, and, for a cluster with no operations, all the question logs in the cluster are retained to participate in the clustering performed the next day.
160. A system configured to perform operations comprising:
from a customer service system, acquiring plural question logs to which no matching answer was found;
presetting a weight distribution table for place clustering and a weight distribution table for intent clustering, wherein the weight distribution table for the place clustering includes plural place interval distributions of vector intervals of different question logs, wherein the weight distribution table for the intent clustering includes plural intent interval distributions of the vector intervals of the different question logs;
clustering the question logs by means of place recognition and intent recognition, wherein every resulting cluster includes at least one question log; and based on a total number of the question logs in each cluster and a number of identical question logs in each cluster, sieving the question logs from each cluster, tagging the question logs, and making the question logs flow back to a knowledge base.
161. The system of claim 160, wherein clustering the question logs by means of the place recognition and the intent recognition comprises:

performing word segmentation on the question logs;
performing cleaning on the question logs;
vectorizing the words to obtain word vectors;
tagging the words with a part-of-speech;
performing the place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns, to obtain at least one initial cluster; and for the question logs on which the place clustering has been performed, performing the intent clustering according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs, to obtain at least one cluster.
162. The system of claim 161, further comprising:
constructing an interfering noun list and an interfering verb list;
assigning a reduced weight of a third proportion to each noun appearing in the interfering noun list; and assigning a reduced weight of a fourth proportion to each verb appearing in the interfering verb list.
163. The system of claim 162, wherein performing place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns comprises:
based on all of the nouns in the question logs excluding the nouns existing in the interfering noun list, setting a weighting weight of a first proportion;

summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of corresponding words; and classifying the question logs whose vectors are in a same place interval distribution into an initial cluster of a same place.
164. The system of claim 163, wherein performing intent clustering on the question logs according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs comprises:
based on all of the verbs in the question logs excluding the verbs existing in the interfering verb list, setting a weighting weight of a second proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of the corresponding words; and classifying the question logs whose vectors are in a same intent interval distribution into a cluster of a same intent.
165. The system of claim 164, wherein based on the total number of the question logs in each cluster and the number of the identical question logs in each cluster, sieving the question logs from the cluster, tagging the question logs, and making the question logs flow back to the knowledge base comprises:
according to the total number of the question logs in each cluster and the number of the identical question logs in each cluster, computing a value-assignment result for each question log in each cluster using an assignment algorithm;
setting backflow priority of the question logs based on the value-assignment results; and tagging the question logs with high priority as backflow question logs.
166. The system of claim 165, further comprising:
discarding each question log that is clustered in the current cycle but is not tagged as a backflow question log; and clustering, in the next cycle, each question log that is not clustered in the current cycle, is not tagged as a backflow question log, and does not have a matching answer found.
167. The system of any one of claims 160 to 166, wherein, when a question is raised by a user at the customer service system, the customer service system uses a built-in algorithm to recognize a vector value of the question, and matches the question with standard questions or similar questions.
168. The system of any one of claims 160 to 167, wherein, when the vector of the question and the vector of the standard question are in a standard threshold interval, the customer service system calls the answer of the standard question and sends the answer to the user.
169. The system of any one of claims 160 to 168, wherein, when the vector of the question and the vector of the similar question are in a similarity threshold interval, the customer service system calls the answer of the similar question and sends the answer to the user.
170. The system of any one of claims 160 to 169, wherein, when the vector of the question is in neither the standard threshold interval nor the similarity threshold interval, the question is picked out as a question log without a matching answer, to be processed.
171. The system of any one of claims 160 to 170, wherein a data pre-processing module processes the question logs with word cleaning, word segmentation, vectorized representation and part-of-speech tagging.
172. The system of any one of claims 160 to 171, wherein the data pre-processing module comprises a full set of data pre-processing functions of a natural language processing (NLP) method for corpus data of the question logs.
173. The system of any one of claims 160 to 172, wherein word segmentation of the NLP method comprises using the open-source word segmentation tool jieba to segment the question log with a granularity of words, and word vectorization is based on FastText.

174. The system of any one of claims 160 to 173, wherein character purification of the NLP method comprises performing string cleaning on the question log of the user to remove special characters and punctuation marks, and correcting typos.
175. The system of any one of claims 160 to 174, wherein part-of-speech tagging of the NLP method comprises using the open-source tool LTP provided by Harbin Institute of Technology to tag the words generated after segmentation with parts of speech, wherein some excluded items are set manually to prevent meaningless verbs and nouns from interfering with the algorithm.
176. The system of any one of claims 160 to 175, wherein, for all the question logs to flow back, since the number of types is unknown, a SinglePass clustering algorithm which does not require setting the number of types in advance is used.
177. The system of any one of claims 160 to 176, wherein cosine similarity between vectors of question logs measures a similarity between individual question logs.
178. The system of any one of claims 160 to 177, wherein, for the words tagged as other parts of speech, lower weights are set.
179. The system of any one of claims 160 to 178, wherein the exact value of the weight is adjusted depending on the clustering results.
180. The system of any one of claims 160 to 179, wherein the place interval distribution refers to the interval distribution of the vectors of the question logs.
181. The system of any one of claims 160 to 180, wherein every interval distribution corresponds to a cosine similarity interval of vectors of question logs.
182. The system of any one of claims 160 to 181, wherein the intent interval distribution refers to the interval distribution of the vectors of the question logs.
183. The system of any one of claims 160 to 182, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are the same.

184. The system of any one of claims 160 to 182, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are different.
185. The system of any one of claims 160 to 184, wherein the clusters formed in the clustering are sorted in descending order of the number of question logs in every cluster.
186. The system of any one of claims 160 to 185, wherein a cluster with more question logs represents a high-frequency class of questions that is difficult for a robot to solve, and is given higher priority for backflow.
187. The system of any one of claims 160 to 186, wherein a cluster with fewer question logs represents a low-frequency class of questions, and is not representative.
188. The system of any one of claims 160 to 187, wherein the number of times a question appears indicates the importance of the question.
189. The system of any one of claims 160 to 188, wherein a number of question logs for backflow to the knowledge base is then selected from the top of the sorted list.
190. The system of any one of claims 160 to 189, wherein, for a cluster with operations, when there is not any operated question log in the cluster, the cluster is discarded directly, and, for a cluster with no operations, all the question logs in the cluster are retained to participate in the clustering performed the next day.
191. A method comprising:
from a customer service system, acquiring plural question logs to which no matching answer was found;
presetting a weight distribution table for place clustering and a weight distribution table for intent clustering, wherein the weight distribution table for the place clustering includes plural place interval distributions of vector intervals of different question logs, wherein the weight distribution table for the intent clustering includes plural intent interval distributions of the vector intervals of the different question logs;

clustering the question logs by means of place recognition and intent recognition, wherein every resulting cluster includes at least one question log; and based on a total number of the question logs in each cluster and a number of identical question logs in each cluster, sieving the question logs from each cluster, tagging the question logs, and making the question logs flow back to a knowledge base.
192. The method of claim 191, wherein clustering the question logs by means of the place recognition and the intent recognition comprises:
performing word segmentation on the question logs;
performing cleaning on the question logs;
vectorizing the words to obtain word vectors;
tagging the words with a part-of-speech;
performing the place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns, to obtain at least one initial cluster; and for the question logs on which the place clustering has been performed, performing the intent clustering according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs, to obtain at least one cluster.
193. The method of claim 192, further comprising:
constructing an interfering noun list and an interfering verb list;
assigning a reduced weight of a third proportion to each noun appearing in the interfering noun list; and assigning a reduced weight of a fourth proportion to each verb appearing in the interfering verb list.
194. The method of claim 193, wherein performing place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns comprises:
based on all of the nouns in the question logs excluding the nouns existing in the interfering noun list, setting a weighting weight of a first proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of corresponding words; and classifying the question logs whose vectors are in same place interval distribution into an initial cluster of same place.
195. The method of claim 194, wherein performing intent clustering on the question logs according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs comprises:
based on all of the verbs in the question logs excluding the verbs existing in the interfering verb list, setting a weight of a second proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of the corresponding words; and classifying the question logs whose vectors are in a same intent interval distribution into a cluster of a same intent.
196. The method of claim 195, wherein based on the total number of the question logs in each cluster and the number of the identical question logs in each cluster, sieving the question logs from the cluster, tagging the question logs, and making the question logs flow back to the knowledge base comprises:
according to the total number of the question logs in each cluster and the number of the identical question logs in each cluster, computing a value-assignment result for each question log in each cluster using an assignment method;
setting backflow priority of the question logs based on the value-assignment results; and tagging the question logs with high priority as backflow question logs.
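A minimal sketch of the value assignment and priority tagging above, assuming one simple scoring formula (cluster size plus the count of identical logs); the claims leave the exact assignment method open:

```python
from collections import Counter

def tag_backflow(clusters, top_k=2):
    # Score each distinct question log by the size of its cluster plus
    # the number of identical logs in that cluster, then tag the
    # highest-scoring logs as backflow candidates.
    scored = []
    for cluster in clusters:
        counts = Counter(cluster)
        for log, n in counts.items():
            scored.append((len(cluster) + n, log))
    scored.sort(reverse=True)
    return [log for _, log in scored[:top_k]]
```

A log repeated many times inside a large cluster therefore outranks a one-off log in a small cluster.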
197. The method of claim 196, further comprising:
discarding each question log that is clustered in a current cycle but is not tagged as a backflow question log; and clustering, in a next cycle, each question log that is not clustered in the current cycle, is not tagged as a backflow question log, and does not have the matching answer found.
198. The method of any one of claims 191 to 197, wherein, when a question is raised by a user at the customer service system, the customer service system uses a built-in algorithm to recognize a vector value of the question, and matches the question with standard questions or similar questions.
199. The method of any one of claims 191 to 198, wherein, when the vector of the question and the vector of the standard question are in a standard threshold interval, the customer service system calls the answer of the standard question and sends the answer to the user.
200. The method of any one of claims 191 to 199, wherein, when the vector of the question and the vector of the similar question are in a similarity threshold interval, the customer service system calls the answer of the similar question and sends the answer to the user.
201. The method of any one of claims 191 to 200, wherein, when the vector of the question is in neither the standard threshold interval nor the similarity threshold interval, the question is picked out as a question log without a matching answer, to be processed.
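The routing recited in claims 198 to 201 can be illustrated roughly as follows; the threshold values and record layout are assumptions for the sketch:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

def route_question(q_vec, standard, similar, t_standard=0.95, t_similar=0.85):
    # Try the standard-question threshold first, then the looser
    # similar-question threshold; otherwise the question becomes an
    # unmatched question log for later clustering and backflow.
    for pool, threshold in ((standard, t_standard), (similar, t_similar)):
        best = max(pool, key=lambda e: cosine(q_vec, e["vec"]), default=None)
        if best is not None and cosine(q_vec, best["vec"]) >= threshold:
            return "answered", best["answer"]
    return "unmatched", None
```

Only questions that fall through both thresholds enter the unmatched question-log pool that the clustering stages consume.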

202. The method of any one of claims 191 to 201, wherein a data pre-processing module processes the question logs with word cleaning, word segmentation, vectorized representation and part-of-speech tagging.
203. The method of any one of claims 191 to 202, wherein the data pre-processing module comprises a full set of data pre-processing functions of a natural language processing (NLP) method for corpus data of the question logs.
204. The method of any one of claims 191 to 203, wherein word segmentation of the NLP method comprises using an open-source word segmentation tool named jieba to segment the question log with a granularity of words, and word vectorization is based on FastText.
205. The method of any one of claims 191 to 204, wherein character purification of the NLP method comprises performing string cleaning on the question log of the user to remove special characters and punctuation marks, and correcting typos.
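The string cleaning described above might look like the following minimal sketch; typo correction is omitted since it would need a dictionary, and the regular expressions are illustrative:

```python
import re

def clean_log(text):
    # Replace special characters and punctuation with spaces,
    # then collapse runs of whitespace into single spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

The cleaned string is what the segmentation and vectorization steps would then consume.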
206. The method of any one of claims 191 to 205, wherein part-of-speech tagging of the NLP method comprises using an open-source tool LTP, provided by Harbin Institute of Technology, to tag the words generated after segmentation with parts of speech, and some excluded items are set manually to prevent interference with the algorithm caused by some meaningless verbs and nouns.
207. The method of any one of claims 191 to 206, wherein, for all question logs to flow back, since a number of types is unknown, a SinglePass clustering algorithm, which does not require setting the number of types in advance, is used.
208. The method of any one of claims 191 to 207, wherein a cosine similarity between the vectors of the question logs measures a similarity between individual question logs.
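Claims 207 and 208 combine naturally: a SinglePass scan over question-log vectors using cosine similarity, with no cluster count fixed in advance. The sketch below uses each cluster's first vector as its representative, which is one common SinglePass variant rather than necessarily the one intended:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

def single_pass(vectors, threshold=0.9):
    # Scan the vectors once: join the first cluster whose representative
    # (the cluster's first vector) is within the cosine threshold,
    # otherwise start a new cluster.
    clusters = []  # list of lists of vector indices
    for i, v in enumerate(vectors):
        for cluster in clusters:
            if cosine(vectors[cluster[0]], v) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Because the number of clusters emerges from the threshold, the algorithm suits daily log batches where the number of question types is unknown.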
209. The method of any one of claims 191 to 208, wherein, for the words tagged as other parts of speech, lower weights are set.
210. The method of any one of claims 191 to 209, wherein an exact value of each weight is adjusted depending on clustering results.

211. The method of any one of claims 191 to 210, wherein the place interval distribution refers to interval distribution of the vectors of the question logs.
212. The method of any one of claims 191 to 211, wherein every interval distribution corresponds to a cosine similarity interval of vectors of question logs.
213. The method of any one of claims 191 to 212, wherein the intent interval distribution refers to the interval distribution of the vectors of the question logs.
214. The method of any one of claims 191 to 213, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are the same.
215. The method of any one of claims 191 to 213, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are different.
216. The method of any one of claims 191 to 215, wherein the clusters formed in the clustering are sorted in descending order of a number of question logs in every cluster.
217. The method of any one of claims 191 to 216, wherein the cluster with more question logs is a high-frequency class of questions which is difficult for a robot to solve, and is given a higher priority for backflow.
218. The method of any one of claims 191 to 217, wherein the cluster with fewer question logs is a low-frequency class of questions, and is not representative.
219. The method of any one of claims 191 to 218, wherein a number of times a question appears indicates an importance of the question.
220. The method of any one of claims 191 to 219, wherein a number of question logs for backflow to the knowledge base is then selected from a top of the sorted list.
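The descending sort and top-of-list selection of claims 216 to 220 reduce to a few lines; the quota parameter name is an assumption:

```python
def select_backflow_clusters(clusters, quota):
    # Sort clusters by size, descending: larger clusters are the
    # high-frequency question classes the robot failed to answer,
    # so they flow back to the knowledge base first.
    ranked = sorted(clusters, key=len, reverse=True)
    return ranked[:quota]
```

The quota caps how many clusters human annotators tag per cycle; low-frequency clusters fall below the cutoff.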
221. The method of any one of claims 191 to 220, wherein, for a cluster with operations in which there is not any operated question log, the cluster is discarded directly; and, for a cluster with no operations, all the question logs in the cluster are retained to participate in clustering performed a next day.

222. A computer-readable physical memory having stored thereon a computer program that, when executed by a computer, causes the computer to:
from a customer service system, acquire plural question logs to which no matching answer was found;
presetting a weight distribution table for place clustering and a weight distribution table for intent clustering, wherein the weight distribution table for the place clustering includes plural place interval distributions of vector intervals of different question logs, wherein the weight distribution table for the intent clustering includes plural intent interval distributions of the vector intervals of the different question logs;
cluster the question logs by means of place recognition and intent recognition, wherein every resulting cluster includes at least one question log; and based on a total number of the question logs in each cluster and a number of identical question logs in each cluster, sieve the question logs from the cluster, tag the question logs, and make the question logs flow back to a knowledge base.
223. The memory of claim 222, wherein clustering the question logs by means of the place recognition and the intent recognition comprises:
performing word segmentation on the question logs;
performing cleaning on the question logs;
vectorizing the words to obtain word vectors;
tagging the words with a part-of-speech;
performing the place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns, to obtain at least one initial cluster; and for the question logs receiving the place clustering, performing the intent clustering on the question logs according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs, to obtain at least one cluster.
224. The memory of claim 223, further comprising:
constructing an interfering noun list and an interfering verb list;
assigning a reduced weight of a third proportion to each noun appearing in the interfering noun list; and assigning a reduced weight of a fourth proportion to each verb appearing in the interfering verb list.
225. The memory of claim 224, wherein performing place clustering on the question logs according to the word vector of each word in the question logs and the words having part-of-speech tagging results as nouns comprises:
based on all of the nouns in the question logs excluding the nouns existing in the interfering noun list, setting a weight of a first proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of the corresponding words; and classifying the question logs whose vectors are in a same place interval distribution into an initial cluster of a same place.
226. The memory of claim 225, wherein performing intent clustering on the question logs according to the word vector of each word in the question logs in each initial cluster and the words having part-of-speech tagging results as verbs comprises:
based on all of the verbs in the question logs excluding the verbs existing in the interfering verb list, setting a weight of a second proportion;
summing up and computing a vector of the question logs according to the word vectors of the words in the question logs and the weights of the corresponding words; and classifying the question logs whose vectors are in a same intent interval distribution into a cluster of a same intent.
227. The memory of claim 226, wherein based on the total number of the question logs in each cluster and the number of the identical question logs in each cluster, sieving the question logs from the cluster, tagging the question logs, and making the question logs flow back to the knowledge base comprises:
according to the total number of the question logs in each cluster and the number of the identical question logs in each cluster, computing a value-assignment result for each question log in each cluster using an assignment method;
setting backflow priority of the question logs based on the value-assignment results; and tagging the question logs with high priority as backflow question logs.
228. The memory of claim 227, further comprising:
discarding each question log that is clustered in a current cycle but is not tagged as a backflow question log; and clustering, in a next cycle, each question log that is not clustered in the current cycle, is not tagged as a backflow question log, and does not have the matching answer found.
229. The memory of any one of claims 222 to 228, wherein, when a question is raised by a user at the customer service system, the customer service system uses a built-in algorithm to recognize a vector value of the question, and matches the question with standard questions or similar questions.
230. The memory of any one of claims 222 to 229, wherein, when the vector of the question and the vector of the standard question are in a standard threshold interval, the customer service system calls the answer of the standard question and sends the answer to the user.

231. The memory of any one of claims 222 to 230, wherein, when the vector of the question and the vector of the similar question are in a similarity threshold interval, the customer service system calls the answer of the similar question and sends the answer to the user.
232. The memory of any one of claims 222 to 231, wherein, when the vector of the question is in neither the standard threshold interval nor the similarity threshold interval, the question is picked out as a question log without a matching answer, to be processed.
233. The memory of any one of claims 222 to 232, wherein a data pre-processing module processes the question logs with word cleaning, word segmentation, vectorized representation and part-of-speech tagging.
234. The memory of any one of claims 222 to 233, wherein the data pre-processing module comprises a full set of data pre-processing functions of a natural language processing (NLP) method for corpus data of the question logs.
235. The memory of any one of claims 222 to 234, wherein word segmentation of the NLP method comprises using an open-source word segmentation tool named jieba to segment the question log with a granularity of words, and word vectorization is based on FastText.
236. The memory of any one of claims 222 to 235, wherein character purification of the NLP method comprises performing string cleaning on the question log of the user to remove special characters and punctuation marks, and correcting typos.
237. The memory of any one of claims 222 to 236, wherein part-of-speech tagging of the NLP method comprises using an open-source tool LTP, provided by Harbin Institute of Technology, to tag the words generated after segmentation with parts of speech, and some excluded items are set manually to prevent interference with the algorithm caused by some meaningless verbs and nouns.
238. The memory of any one of claims 222 to 237, wherein, for all question logs to flow back, since a number of types is unknown, a SinglePass clustering algorithm, which does not require setting the number of types in advance, is used.

239. The memory of any one of claims 222 to 238, wherein a cosine similarity between the vectors of the question logs measures a similarity between individual question logs.
240. The memory of any one of claims 222 to 239, wherein, for the words tagged as other parts of speech, lower weights are set.
241. The memory of any one of claims 222 to 240, wherein an exact value of each weight is adjusted depending on clustering results.
242. The memory of any one of claims 222 to 241, wherein the place interval distribution refers to interval distribution of the vectors of the question logs.
243. The memory of any one of claims 222 to 242, wherein every interval distribution corresponds to a cosine similarity interval of vectors of question logs.
244. The memory of any one of claims 222 to 243, wherein the intent interval distribution refers to the interval distribution of the vectors of the question logs.
245. The memory of any one of claims 222 to 244, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are the same.
246. The memory of any one of claims 222 to 244, wherein values of the first proportion, the second proportion, the third proportion, and the fourth proportion are different.
247. The memory of any one of claims 222 to 246, wherein the clusters formed in the clustering are sorted in descending order of a number of question logs in every cluster.
248. The memory of any one of claims 222 to 247, wherein the cluster with more question logs is a high-frequency class of questions which is difficult for a robot to solve, and is given a higher priority for backflow.
249. The memory of any one of claims 222 to 248, wherein the cluster with fewer question logs is a low-frequency class of questions, and is not representative.

250. The memory of any one of claims 222 to 249, wherein a number of times a question appears indicates an importance of the question.
251. The memory of any one of claims 222 to 250, wherein a number of question logs for backflow to the knowledge base is then selected from a top of the sorted list.
252. The memory of any one of claims 222 to 251, wherein, for a cluster with operations in which there is not any operated question log, the cluster is discarded directly; and, for a cluster with no operations, all the question logs in the cluster are retained to participate in clustering performed a next day.

CA3156172A 2021-04-25 2022-04-21 Text-clustering-based customer service log backflow method and apparatus thereof Pending CA3156172A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110448068.6 2021-04-25
CN202110448068.6A CN114328903A (en) 2021-04-25 2021-04-25 Text clustering-based customer service log backflow method and device

Publications (1)

Publication Number Publication Date
CA3156172A1 true CA3156172A1 (en) 2022-10-25

Family

ID=81044239

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3156172A Pending CA3156172A1 (en) 2021-04-25 2022-04-21 Text-clustering-based customer service log backflow method and apparatus thereof

Country Status (2)

Country Link
CN (1) CN114328903A (en)
CA (1) CA3156172A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117009193A (en) * 2022-04-29 2023-11-07 青岛海尔科技有限公司 Log processing method and device, storage medium and electronic device

Also Published As

Publication number Publication date
CN114328903A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
CN109146610B (en) Intelligent insurance recommendation method and device and intelligent insurance robot equipment
US20200143289A1 (en) Systems and method for performing contextual classification using supervised and unsupervised training
US11645517B2 (en) Information processing method and terminal, and computer storage medium
CN111767403B (en) Text classification method and device
AU2017355420B2 (en) Systems and methods for event detection and clustering
WO2020147395A1 (en) Emotion-based text classification method and device, and computer apparatus
CN109299480A (en) Terminology Translation method and device based on context of co-text
CN104834651B (en) Method and device for providing high-frequency question answers
KR20200127020A (en) Computer-readable storage medium storing method, apparatus and instructions for matching semantic text data with tags
CN108549723B (en) Text concept classification method and device and server
CN113780007A (en) Corpus screening method, intention recognition model optimization method, equipment and storage medium
CN111309916B (en) Digest extracting method and apparatus, storage medium, and electronic apparatus
CN111538828A (en) Text emotion analysis method and device, computer device and readable storage medium
CN111460806A (en) Loss function-based intention identification method, device, equipment and storage medium
CN110046648A (en) The method and device of business classification is carried out based at least one business disaggregated model
CN110597978A (en) Article abstract generation method and system, electronic equipment and readable storage medium
CN112527958A (en) User behavior tendency identification method, device, equipment and storage medium
Boag et al. Twitterhawk: A feature bucket based approach to sentiment analysis
CN108287848A (en) Method and system for semanteme parsing
CN113051380A (en) Information generation method and device, electronic equipment and storage medium
CA3156172A1 (en) Text-clustering-based customer service log backflow method and apparatus thereof
CN111008329A (en) Page content recommendation method and device based on content classification
Nguyen et al. Robust domain adaptation for relation extraction via clustering consistency
WO2023207566A1 (en) Voice room quality assessment method, apparatus, and device, medium, and product
CN115017271B (en) Method and system for intelligently generating RPA flow component block