CN116915916A

CN116915916A - Call processing method, device, electronic equipment and medium

Info

Publication number: CN116915916A
Application number: CN202310715408.6A
Authority: CN
Inventors: 郭梦霏; 黄毅; 冯俊兰; 邓超
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Communications Ltd Research Institute
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Communications Ltd Research Institute
Priority date: 2023-06-16
Filing date: 2023-06-16
Publication date: 2023-10-20

Abstract

The application discloses a call processing method, a call processing device, an electronic device, and a medium. It relates to the technical field of communication and aims to solve the problem that replies produced by the prior art do not represent the user's real intention well. The method comprises the following steps: when a call of a preset type is answered, acquiring a voice text sent by the calling party; constructing a knowledge graph corresponding to the voice text according to knowledge data of different knowledge types, wherein the knowledge graph represents the relations between the words in the voice text and the relations between those words and preset intention labels; carrying out knowledge reasoning according to the knowledge graph and determining the target intention corresponding to the voice text; and sending a target reply voice corresponding to the target intention to the calling party. In the embodiments of the application, the dialogue content is understood by introducing multiple kinds of knowledge data, so that the dialogue intention is understood more deeply and the reply voice generated from that intention represents the user's intention more accurately.

Description

Call processing method, device, electronic equipment and medium

Technical Field

The present application relates to the field of communications technologies, and in particular, to a call processing method, a call processing device, an electronic device, and a medium.

Background

At present, harassing calls are frequent and seriously disrupt people's lives, so intervention is needed. Existing techniques for substitute answering of harassing calls (answering on the user's behalf) mainly rely on automatic reply based on a pre-trained language model. However, such a model usually learns text representations from a large-scale unsupervised text corpus, and the knowledge it learns is insufficient. In knowledge-sensitive scenarios, particularly spoken telephone dialogue containing many everyday expressions, it cannot understand the intention contained in the text well, so the reply voice given on the user's behalf does not represent the user's real intention well.

Disclosure of Invention

The embodiments of the application provide a call processing method, a call processing device, an electronic device, and a medium, which are used to solve the problem that existing techniques for substitute answering of harassing calls cannot understand the intention in a dialogue text well, so that the reply content does not represent the user's real intention well.

In a first aspect, an embodiment of the present application provides a call processing method, including:

under the condition of answering a call of a preset type, acquiring a voice text sent by a calling party;

constructing a knowledge graph corresponding to the voice text according to knowledge data with different knowledge types, wherein the knowledge graph is used for representing the relation between each word in the voice text and a preset intention label;

carrying out knowledge reasoning according to the knowledge graph, and determining a target intention corresponding to the voice text;

and sending target reply voice corresponding to the target intention to the calling party.

Optionally, the method further comprises:

determining whether a preset intention keyword exists in the voice text by utilizing a multi-mode matching algorithm;

the step of constructing the knowledge graph corresponding to the voice text according to knowledge data with different knowledge types comprises the following steps:

under the condition that it is determined through the multi-mode matching algorithm that no preset intention keyword exists in the voice text, constructing a knowledge graph corresponding to the voice text according to knowledge data with different knowledge types.

Optionally, after the determining whether the preset intention keyword exists in the voice text by using the multi-mode matching algorithm, the method further includes:

and under the condition that it is determined through the multi-mode matching algorithm that a preset intention keyword exists in the voice text, determining that the intention corresponding to the intention keyword existing in the voice text is the target intention corresponding to the voice text.

Optionally, the constructing the knowledge graph corresponding to the voice text according to knowledge data with different knowledge types includes:

querying knowledge data in a plurality of knowledge bases, obtaining relations between each word in the voice text and a preset intention label, and constructing a knowledge graph corresponding to the voice text based on the relations;

the knowledge graph is composed of nodes and edges, the nodes comprise words in the voice text and the preset intention labels, the edges are used for indicating the relation between the nodes, and knowledge data of different knowledge types are stored in the knowledge bases respectively.

Optionally, the plurality of knowledge bases includes a scenario knowledge base and a common knowledge base;

the querying knowledge data in a plurality of knowledge bases to obtain a relationship between each word in the voice text and a preset intention label includes:

obtaining the relation between each word in the voice text and the preset intention label by inquiring knowledge data in the scene knowledge base, wherein the preset intention label is a preset scene label;

and obtaining the relation among the words in the voice text by querying knowledge data in the common sense knowledge base.
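The two lookups above can be pictured with a minimal sketch. It assumes both knowledge bases are small in-memory dictionaries; the names `SCENE_KB`, `COMMONSENSE_KB`, `KnowledgeGraph`, and `build_graph`, as well as every entry and relation name, are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field

# Hypothetical scene knowledge base: word -> (relation, preset scene label).
SCENE_KB = {
    "loan": ("indicates", "finance"),
    "apartment": ("indicates", "real_estate"),
    "discount": ("indicates", "promotion"),
}
# Hypothetical common-sense knowledge base: (word, word) -> relation.
COMMONSENSE_KB = {
    ("loan", "interest"): "related_to",
    ("apartment", "rent"): "related_to",
}


@dataclass
class KnowledgeGraph:
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)  # (head, relation, tail) triples

    def add_edge(self, head, relation, tail):
        self.nodes.update((head, tail))
        self.edges.append((head, relation, tail))


def build_graph(words, intent_labels):
    """Build a graph whose nodes are the words and the preset intention labels."""
    graph = KnowledgeGraph()
    graph.nodes.update(words)
    graph.nodes.update(intent_labels)
    # Scene knowledge: relations between a word and a preset scene label.
    for word in words:
        hit = SCENE_KB.get(word)
        if hit and hit[1] in intent_labels:
            graph.add_edge(word, hit[0], hit[1])
    # Common-sense knowledge: relations between pairs of words in the text.
    for i, head in enumerate(words):
        for tail in words[i + 1:]:
            relation = COMMONSENSE_KB.get((head, tail)) or COMMONSENSE_KB.get((tail, head))
            if relation:
                graph.add_edge(head, relation, tail)
    return graph


graph = build_graph(["loan", "interest", "today"],
                    ["finance", "real_estate", "promotion"])
```

Words with no entry in either knowledge base (here "today") remain as isolated nodes, so the graph still covers the whole voice text.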

Optionally, the performing knowledge reasoning according to the knowledge graph, determining the target intention corresponding to the voice text includes:

coding the voice text and the preset intention label to obtain vectorized representation of the voice text and the preset intention label;

determining a relationship vector in the knowledge graph;

and carrying out intention recognition on the vectorized representation by adopting an attention mechanism guided based on a relation vector in the knowledge graph, and outputting a target intention label, wherein the intention indicated by the target intention label is the target intention.

Optionally, the performing intention recognition on the vectorized representation by adopting an attention mechanism guided based on a relation vector in the knowledge graph, and outputting a target intention label comprises:

calculating an attention score from the vectorized representation and a relationship vector in the knowledge graph;

carrying out residual connection, layer normalization and full connection processing on the attention scores in sequence to generate knowledge representation vectors;

and splicing the knowledge representation vector and the vectorized representation, performing full connection processing to obtain the probability of each intention label in the preset intention labels, and determining the intention label with the highest probability as the target intention label.
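The three steps above can be sketched in plain Python as follows. This is only an illustration under strong assumptions: the voice text is reduced to a single pooled vector, the relation vectors and all weights are random and untrained, and the attention output (the score-weighted sum of relation vectors) is what passes through the residual connection. The names `recognize_intent`, `params`, `w_k`, and `w_out` are hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(xs, eps=1e-5):
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


def linear(weights, bias, xs):
    # Fully connected layer; `weights` holds one row per output unit.
    return [sum(w * x for w, x in zip(row, xs)) + b for row, b in zip(weights, bias)]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def recognize_intent(text_vec, relation_vecs, labels, params):
    d = len(text_vec)
    # 1) Attention scores between the text representation and each relation vector.
    scores = softmax([sum(q * r for q, r in zip(text_vec, rel)) / math.sqrt(d)
                      for rel in relation_vecs])
    attended = [sum(a * rel[i] for a, rel in zip(scores, relation_vecs))
                for i in range(d)]
    # 2) Residual connection, layer normalization, fully connected layer.
    residual = [h + c for h, c in zip(text_vec, attended)]
    knowledge_vec = linear(params["w_k"], params["b_k"], layer_norm(residual))
    # 3) Concatenate with the text representation, fully connected layer, softmax.
    probs = softmax(linear(params["w_out"], params["b_out"], knowledge_vec + text_vec))
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


rng = random.Random(0)
d = 4
labels = ["promotion", "finance", "real_estate"]
params = {
    "w_k": random_matrix(rng, d, d), "b_k": [0.0] * d,
    "w_out": random_matrix(rng, len(labels), 2 * d), "b_out": [0.0] * len(labels),
}
text_vec = [rng.uniform(-1, 1) for _ in range(d)]
relation_vecs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(3)]
target_label, probs = recognize_intent(text_vec, relation_vecs, labels, params)
```

Because the weights are untrained, the selected label is arbitrary; the sketch only shows how the data flows from attention scores to the label with the highest probability.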

Optionally, the sending, to the caller, the target reply voice corresponding to the target intention includes:

acquiring a target reply language from a speaking template corresponding to the target intention, and generating the target reply voice based on the target reply language;

and sending the target reply voice to the calling party.

Optionally, the obtaining the target reply language from the speaking template corresponding to the target intention includes:

determining whether the target intent is the same as a first intent, wherein the first intent is an intent that was last determined prior to determining the target intent;

under the condition that the target intention is the same as the first intention, a first reply word is obtained from a first speaking template corresponding to the first intention as the target reply word, wherein the first reply word is the next sentence of the reply word obtained from the first speaking template last time;

and under the condition that the target intention is different from the first intention, inquiring a second speaking template corresponding to the target intention, and acquiring a first sentence reply word from the second speaking template as the target reply word.
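The template lookup above amounts to keeping a cursor per conversation. The following sketch assumes each speaking template is an ordered list of sentences; the class name `ReplySelector`, the template contents, and the behavior once a template is exhausted (repeat the last sentence) are assumptions, since the text does not specify them.

```python
# Hypothetical speaking templates: intention -> ordered reply sentences.
TEMPLATES = {
    "promotion": [
        "Which product is this about?",
        "What does it cost?",
        "I am not interested, thank you.",
    ],
    "finance": [
        "Which institution are you calling from?",
        "I do not need this service.",
    ],
}


class ReplySelector:
    def __init__(self, templates):
        self.templates = templates
        self.last_intent = None  # the most recently determined ("first") intention
        self.position = 0        # index of the sentence returned last time

    def next_reply(self, target_intent):
        if target_intent == self.last_intent:
            # Same intention as last time: move to the next sentence.
            self.position += 1
        else:
            # Different intention: switch template and start from its first sentence.
            self.last_intent = target_intent
            self.position = 0
        sentences = self.templates[target_intent]
        # Assumption: once the template is exhausted, keep returning its last sentence.
        return sentences[min(self.position, len(sentences) - 1)]


selector = ReplySelector(TEMPLATES)
first_reply = selector.next_reply("promotion")
second_reply = selector.next_reply("promotion")
```

The returned text would then be passed to speech synthesis to produce the target reply voice.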

Optionally, before the knowledge graph corresponding to the voice text is constructed according to knowledge data with different knowledge types, the method further includes:

under the condition of receiving the call, acquiring user information;

inquiring a user state corresponding to the user information from a remote database;

and under the condition that the user state indicates to continue the call, constructing a knowledge graph corresponding to the voice text according to knowledge data with different knowledge types.
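As a rough illustration of this gate, the sketch below stands in for the remote database with a dictionary. The user identifiers, the `UserState` values, and the choice to treat unknown users as not continuing are all hypothetical; the text only states that graph construction proceeds when the user state indicates that the call should continue.

```python
from enum import Enum


class UserState(Enum):
    CONTINUE = "continue"   # hypothetical state: keep the call going
    HANG_UP = "hang_up"     # hypothetical state: do not continue


# Stand-in for the remote database, keyed by a hypothetical user identifier.
REMOTE_DB = {
    "user-001": UserState.CONTINUE,
    "user-002": UserState.HANG_UP,
}


def should_build_graph(user_id, db=REMOTE_DB):
    """Return True only when the stored user state indicates continuing the call."""
    # Assumption: users missing from the database are treated as not continuing.
    return db.get(user_id) is UserState.CONTINUE
```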

Optionally, the method further comprises:

acquiring a special identifier in the process of answering the call;

sending a common greeting voice to the caller if the special identifier is a start identifier;

and terminating the call when the special identifier is an end identifier.
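The handling of the special identifier can be sketched as a small dispatch function. The identifier values `<start>` and `<end>`, the greeting text, and the callback parameters `send_voice` and `terminate_call` are hypothetical placeholders for whatever the telephony layer provides.

```python
START, END = "<start>", "<end>"  # hypothetical identifier values


def handle_identifier(identifier, send_voice, terminate_call):
    """Dispatch on the special identifier obtained while answering the call."""
    if identifier == START:
        # Start identifier: send a common greeting voice to the calling party.
        send_voice("Hello, who is calling, please?")
        return "greeted"
    if identifier == END:
        # End identifier: terminate the call.
        terminate_call()
        return "terminated"
    return "ignored"
```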

In a second aspect, an embodiment of the present application further provides a call processing apparatus, including:

the first acquisition module is used for acquiring a voice text sent by a calling party under the condition that a call of a preset type is answered;

the construction module is used for constructing a knowledge graph corresponding to the voice text according to knowledge data with different knowledge types, wherein the knowledge graph is used for representing the relation between each word in the voice text and a preset intention label;

the first determining module is used for carrying out knowledge reasoning according to the knowledge graph and determining the target intention corresponding to the voice text;

and the first sending module is used for sending the target reply voice corresponding to the target intention to the calling party.

Optionally, the call processing device further includes:

the second determining module is used for determining whether a preset intention keyword exists in the voice text or not by utilizing a multi-mode matching algorithm;

the construction module is used for constructing a knowledge graph corresponding to the voice text according to knowledge data with different knowledge types under the condition that it is determined through the multi-mode matching algorithm that no preset intention keyword exists in the voice text.

Optionally, the call processing device further includes:

and the third determining module is used for determining that the intention corresponding to the intention keyword existing in the voice text is the target intention corresponding to the voice text under the condition that it is determined through the multi-mode matching algorithm that a preset intention keyword exists in the voice text.

Optionally, the building module includes:

the query unit is used for querying knowledge data in a plurality of knowledge bases to obtain the relation between each word in the voice text and a preset intention label;

the construction unit is used for constructing a knowledge graph corresponding to the voice text based on the relation;

the query unit includes:

the first query subunit is configured to obtain a relationship between each term in the voice text and the preset intention label by querying knowledge data in the scene knowledge base, where the preset intention label is a preset scene label;

and the second query subunit is used for obtaining the relation among the terms in the voice text by querying knowledge data in the common sense knowledge base.

Optionally, the first determining module includes:

the coding unit is used for coding the voice text and the preset intention label to obtain vectorized representation of the voice text and the preset intention label;

a determining unit, configured to determine a relationship vector in the knowledge graph;

and the intention recognition unit is used for carrying out intention recognition on the vectorized representation by adopting an attention mechanism guided based on the relation vector in the knowledge graph and outputting a target intention label, wherein the intention indicated by the target intention label is the target intention.

Optionally, the intention recognition unit includes:

a computing subunit for computing an attention score from the vectorized representation and a relationship vector in the knowledge graph;

the first processing subunit is used for sequentially carrying out residual connection, layer normalization and full connection processing on the attention scores to generate knowledge representation vectors;

and the second processing subunit is used for performing full connection processing on the knowledge representation vector and the vectorized representation after splicing to obtain the probability of each intention label in the preset intention labels, and determining the intention label with the highest probability as the target intention label.

Optionally, the first sending module includes:

the obtaining unit is used for obtaining a target reply word from a speaking template corresponding to the target intention;

the generating unit is used for generating the target reply voice based on the target reply language;

and the sending unit is used for sending the target reply voice to the calling party.

Optionally, the acquiring unit includes:

a determining subunit configured to determine whether the target intention is the same as a first intention, wherein the first intention is an intention that was last determined before the target intention was determined;

the first obtaining subunit is configured to obtain, when the target intention is the same as the first intention, a first reply word from a first speech template corresponding to the first intention as the target reply word, where the first reply word is a next sentence of a reply word obtained from the first speech template last time;

the second obtaining subunit is configured to query a second speech template corresponding to the target intention, and obtain a first sentence reply word from the second speech template as the target reply word when the target intention is different from the first intention.

Optionally, the call processing device further includes:

the second acquisition module is used for acquiring user information under the condition of receiving the call;

the query module is used for querying the user state corresponding to the user information from a remote database;

the construction module is used for constructing a knowledge graph corresponding to the voice text according to knowledge data with different knowledge types under the condition that the user state indicates to continue the call.

Optionally, the call processing device further includes:

the third acquisition module is used for acquiring a special identifier in the process of answering the call;

the second sending module is used for sending common greeting voice to the calling party under the condition that the special identifier is a starting identifier;

and the termination module is used for terminating the call under the condition that the special identifier is an end identifier.

In a third aspect, an embodiment of the present application further provides an electronic device, including: a transceiver, a memory, a processor and a computer program stored on the memory and executable on the processor, which processor implements the steps in the call processing method as described above when executing the computer program.

In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the call processing method as described above.

In the embodiments of the application, when a call of a preset type is answered, a voice text sent by the calling party is obtained; a knowledge graph corresponding to the voice text is constructed according to knowledge data of different knowledge types, wherein the knowledge graph represents the relations between the words in the voice text and the relations between those words and preset intention labels; knowledge reasoning is carried out according to the knowledge graph to determine the target intention corresponding to the voice text; and a target reply voice corresponding to the target intention is sent to the calling party. Knowledge data of multiple knowledge types is thus introduced to understand the dialogue content, so that the intention contained in the dialogue content is understood more deeply and the reply voice generated from the intention recognition result represents the user's real intention more accurately.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.

FIG. 1 is a flowchart of a call processing method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of knowledge graph construction provided by an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a multi-knowledge hybrid inference network provided by an embodiment of the present application;

FIG. 4 is a flowchart of user information interaction provided by an embodiment of the present application;

FIG. 5 is a flowchart of speaking information interaction provided by an embodiment of the present application;

FIG. 6 is a block diagram of a call processing system according to an embodiment of the present application;

FIG. 7 is a second flowchart of a call processing method according to an embodiment of the present application;

FIG. 8 is a block diagram of a call processing apparatus according to an embodiment of the present application;

FIG. 9 is a block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

To make the embodiments of the present application clearer, the technical background relevant to the embodiments is described first:

With the rapid development of communication technology and the growing demand for information exchange, the mobile phone has become a main means of communication and plays an important role in people's work and life. In recent years, however, the number of harassing calls such as sales promotion and advertising has increased greatly, seriously degrading the experience of mobile phone users and hindering normal communication. How to keep harassing calls from frequently disturbing users and improve the user experience is therefore a problem to be solved in mobile phone application scenarios.

Existing substitute-answering techniques for harassing calls generally use automatic reply based on a neural network generation model. The main idea is to first annotate large-scale dialogue data manually, then train a dialogue generation model, and finally use that model for online prediction and inference. The technique adopts the encoder-decoder architecture widely used in natural language processing. The encoder handles semantic representation and understanding of the dialogue input text: it converts the text into a vector representation, uses the representation and computation capability of the neural network to understand the content of the input text, and outputs a semantic representation in a high-dimensional space. The decoder generates the reply from that representation: based on the semantic representation output by the encoder, it performs autoregressive decoding through an attention mechanism, starting from a start token and generating the reply text word by word.

Existing substitute-answering techniques for harassing calls typically use a Transformer-based pre-trained language model as the base encoder. Such models usually learn text representations from a large-scale unsupervised text corpus, and the common-sense knowledge they learn is insufficient, so they cannot understand the intention contained in the text well in knowledge-sensitive scenarios, especially spoken telephone dialogue containing many everyday expressions.

Existing substitute-answering techniques for harassing calls have the following defects:

1) Overreliance on annotated data. Training a neural network requires a large amount of high-quality annotated data, but manual annotation is time-consuming and labor-intensive, its quality is hard to guarantee, and the annotations can be used only after their accuracy has been verified repeatedly. In the cold-start stage of a service there is very little annotated data, so directly training a neural network gives poor results, especially for neural network generation models, which are harder to train.

2) Insufficient common-sense knowledge. Achieving good dialogue results requires high-quality natural language understanding. The prior art generally uses a pre-trained model directly as the main natural language understanding technique. Although the pre-training process is large in scale, its learning strategy is simple, so the pre-trained language model does not mine common-sense knowledge sufficiently and has difficulty handling knowledge-sensitive data samples.

3) Insufficient scene knowledge. A human-machine conversation environment usually involves many scenes, such as meal delivery, express delivery, and promotion, and each scene carries scene knowledge such as related keywords, which is a highly condensed summary of human experience. However, pre-trained language models are typically trained end to end and can only learn implicit knowledge from the existing annotated data. Information such as keywords therefore has to be turned into training data, for example through data augmentation, before the model can learn it. This approach is time-consuming and labor-intensive and is hard to modify flexibly: whenever such information is added or removed, the model has to be retrained, which further increases the training cost.

4) Uncontrollable utterance generation. In a human-machine conversation, the utterances returned by the substitute-answering robot are vital because they stand for what the user actually wants to express. However, neural network generation models have poor interpretability. It is often difficult to predict whether the generated text matches what the user wishes to express, and the model may even generate inappropriate language with undesirable consequences.

Therefore, substitute answering of harassing calls based on a neural network generation model is limited in a real production environment and has difficulty meeting the conversation requirements of real scenarios.

The embodiments of the application realize a knowledge-guided intention understanding method by introducing multiple kinds of external knowledge, so that the model can understand the text more deeply on the basis of the knowledge retrieval results and achieve better recognition of the user's intention.

The call processing method provided by the embodiment of the application is described in detail below through specific embodiments and application scenarios thereof with reference to the accompanying drawings.

Referring to FIG. 1, which is a flowchart of a call processing method provided in an embodiment of the present application, the method includes the following steps:

Step 101, under the condition that a call of a preset type is answered, acquiring a voice text sent by a calling party.

The call of the preset type may be a call that needs to be answered on the user's behalf. For example, it may be a call marked as advertising promotion, insurance and wealth management, or real estate agency, or a call from an unknown number.

In the embodiments of the application, to reduce the disturbance that harassing calls cause to users, a substitute-answering mode can be enabled for such calls, in which the call is answered and replied to automatically without the user's participation. To reply to the calling party on the user's behalf when a call of the preset type is received, the voice information sent by the calling party must first be acquired: the voice information is received through the radio frequency module, converted into a voice text by speech recognition, and the voice text is then understood, so that a corresponding reply can be given to the calling party according to the understanding result and the dialogue proceeds normally.

Step 102, constructing a knowledge graph corresponding to the voice text according to knowledge data with different knowledge types, wherein the knowledge graph is used for representing the relation between each word in the voice text and a preset intention label.

In the embodiment of the application, in order to ensure the deep understanding of the voice content sent by the calling party, knowledge data with various knowledge types can be introduced to understand the dialogue content, the relation between each word in the dialogue content is determined, and the relation between each word in the dialogue content and the preset intention label is determined, so that a knowledge graph corresponding to the dialogue content is constructed, and the relation is represented by the knowledge graph.

Specifically, the knowledge data of different knowledge types may be, for example, scene knowledge and common-sense knowledge, and may record the relations between words, and between words and intention words, in various scenes. Each word in the voice text can therefore be looked up in this knowledge data to determine the relations between the words in the voice text and the relations between those words and the preset intention labels. A knowledge graph is then constructed from these relations, for example by taking the words in the voice text, the preset intention labels, the relations between the words, and the relations between the words and the preset intention labels as the elements of the graph. The preset intention labels may be preset labels that represent intentions, such as labels for different scenes including promotion, wealth management, and real estate.

Step 103, carrying out knowledge reasoning according to the knowledge graph to determine the target intention corresponding to the voice text.

In this step, knowledge reasoning may be performed according to the knowledge graph to determine the dialogue intention contained in the dialogue content, that is, the target intention corresponding to the voice text. Specifically, knowledge reasoning can be performed according to the relations between the words in the voice text and the preset intention labels represented in the knowledge graph, to determine which scene the dialogue content represented by the voice text describes, and thereby determine the target intention of the dialogue content, for example the target scene to which it belongs.

Optionally, the step 102 and the step 103 may include:

and carrying out intention understanding on the voice text by utilizing a pre-trained multi-knowledge hybrid reasoning network, and outputting a target intention label, wherein the multi-knowledge hybrid reasoning network is used for constructing the knowledge graph, and reasoning the target intention label based on the knowledge graph, and the intention indicated by the target intention label is the target intention.

In one embodiment, a hybrid inference network fused with multiple knowledge can be pre-trained to accurately understand the intent of the dialog content through the network.

Understanding dialogue text usually requires a great deal of background knowledge before the machine can really grasp the human intention. Although existing pre-trained models already capture background knowledge to some degree, they still perform poorly in knowledge-sensitive scenarios. It is therefore necessary to introduce rich external knowledge on top of the pre-trained language model so that the strengths of both are exploited.

The pre-trained language model is built on the Transformer model: pre-training tasks, typically masked language modeling together with other tasks, are designed on the Transformer to train word representations of the input text. The Transformer has risen to prominence in natural language processing in recent years and is a mainstream method for natural language understanding tasks. Its core idea is to capture global features of the text with an attention mechanism. For text, a global feature means that the representation of the current word is determined jointly by the current word and its context, that is, the representation of the current word is obtained as a weighted sum of the representations of multiple words. The advantage of the Transformer is that it extracts global features automatically and attends to the semantic information that is critical for the current task.
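For reference, the weighted addition described above is commonly written as scaled dot-product attention. The notation is an assumption introduced here and does not appear in the application: q_i, k_j, and v_j are the query, key, and value vectors derived from the word representations, d is their dimension, and n is the number of words in the text.

```latex
\alpha_{ij} = \frac{\exp\left(q_i^{\top} k_j / \sqrt{d}\right)}{\sum_{l=1}^{n} \exp\left(q_i^{\top} k_l / \sqrt{d}\right)},
\qquad
z_i = \sum_{j=1}^{n} \alpha_{ij}\, v_j
```

Here z_i is the context-aware representation of the i-th word, and the weights \alpha_{ij} sum to 1 over j.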

The multi-knowledge hybrid inference network in the embodiments of the application introduces multiple kinds of external knowledge on top of a pre-trained language model. It first obtains a semantic representation of the input text through the pre-trained language model, then retrieves the introduced knowledge to obtain the relations between the words in the input text and the relations between those words and the preset intention labels, and finally predicts the intention label from these relations and the semantic representation of the input text, thereby determining the intention expressed by the input text. The introduced knowledge may be the knowledge data of different knowledge types, which records the relations between words, and between words and intention words, in various scenes.

In this step, the voice text may be input into the multi-knowledge hybrid inference network, which performs intent understanding on it. Specifically, the network obtains the relationship between each word in the voice text and the preset intention labels by retrieving knowledge data of multiple knowledge types, constructs the corresponding knowledge graph from these relationships, and infers from the knowledge graph the intention label corresponding to the voice text, namely the target intention label, which is the preset intention label most relevant to the words in the voice text. The intention indicated by the target intention label is the result of intent understanding for the voice text, namely the target intention. The preset intention labels are labels defined in advance to represent intentions, for example labels representing different scenes such as promotion, financial management and real estate.

Thus, according to the embodiment, the target intention in the dialogue content can be rapidly and accurately inferred.

Optionally, the method further comprises:

the step 102 includes:

In other words, in one embodiment, intent understanding of the voice content sent by the caller may first be attempted with a multi-pattern matching algorithm. Only when the multi-pattern matching algorithm fails to recognize the intent is a knowledge graph corresponding to the voice text constructed from knowledge data of different knowledge types and the target intent inferred from that graph, for example by using the multi-knowledge hybrid inference network.

Specifically, the voice text may first be matched against the preset intention keywords using a multi-pattern matching algorithm to determine whether any keyword matching a preset intention keyword exists in the voice text; if none exists, the intention in the voice text is considered not recognized. The preset intention keywords are keywords defined in advance to represent different intentions, one keyword representing one intention; for example, they may include keywords representing different scenes such as promotion, financial management and real estate.

For example, intent recognition may be performed on the voice text using the Aho-Corasick multi-pattern matching algorithm (AC algorithm for short). Multi-pattern matching means that, given a plurality of pattern strings P_1, P_2, P_3, ..., P_m (corresponding to the preset intention keywords), all possible positions of all pattern strings are found in a continuous text T_1...n (corresponding to the input text). The Aho-Corasick algorithm is a classical algorithm for multi-pattern matching, and its data structure is the Aho-Corasick automaton, abbreviated as the AC automaton. Conventional automata cannot perform multi-pattern matching; the AC automaton adds failure transitions, which fall back to a suffix of the text that has already been matched successfully.

The Aho-Corasick algorithm is divided into three steps: first, a dictionary tree (Trie) of the patterns is established; then failure paths are added to the Trie; finally, the text to be processed is searched with the constructed AC automaton. The entire process is described in detail below.

A Trie of the patterns is first established. When inserting multiple patterns, each pattern string is traversed character by character from front to back. If the node for the character to be inserted already exists, the next character is considered directly; if the current character has no node under the node formed by its preceding character, a new node is created to represent it, and traversal continues with the remaining characters. This operation is repeated for every pattern.

Failure paths are then added to the Trie of the pattern set. In the Knuth-Morris-Pratt (KMP) algorithm, when a mismatched character is encountered, the next position from which to resume matching is found through the next array, and string matching then continues. In the AC automaton, when a character fails to match, matching jumps to the position pointed to by the fail pointer and resumes from there. Each node t has a fail pointer, and the node it points to represents the same character as t. After node t has been matched successfully, t->child must be matched next; if that match fails, matching restarts from the node t->fail.

The fail pointers are computed by breadth-first search (BFS). For nodes directly connected to the root node, the fail pointer points to the root. For any other node, the fail pointer is computed as follows: let the current node be the parent node (father) and its child node be child. To compute the fail pointer of child, first find the node pointed to by the fail pointer of father, say t, and check whether t has a child node representing the same character as child. If it does, that node is the fail pointer of child; if not, move on to the node father->fail->fail and repeat the process. If no such node is ever found, the fail pointer of child points to the root.
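The three steps above (Trie construction, failure pointers by breadth-first search, and text search) can be sketched as follows. This is a minimal illustration rather than the implementation used in the embodiment; the class name `ACAutomaton` and the English keywords in the usage lines are hypothetical.

```python
from collections import deque

class ACAutomaton:
    """Minimal Aho-Corasick automaton over a list of pattern strings."""

    def __init__(self, patterns):
        # Node 0 is the root. Each node has a child map, a fail link
        # and the list of patterns that end at (or are suffixes of) it.
        self.children = [{}]
        self.fail = [0]
        self.outputs = [[]]
        for pattern in patterns:
            self._insert(pattern)
        self._build_fail_links()

    def _insert(self, pattern):
        node = 0
        for ch in pattern:
            if ch not in self.children[node]:
                self.children.append({})
                self.fail.append(0)
                self.outputs.append([])
                self.children[node][ch] = len(self.children) - 1
            node = self.children[node][ch]
        self.outputs[node].append(pattern)

    def _build_fail_links(self):
        # Children of the root keep fail = root; deeper nodes are
        # processed level by level (BFS).
        queue = deque(self.children[0].values())
        while queue:
            father = queue.popleft()
            for ch, child in self.children[father].items():
                t = self.fail[father]
                while t and ch not in self.children[t]:
                    t = self.fail[t]
                self.fail[child] = self.children[t].get(ch, 0)
                # Inherit patterns that end at the fail target.
                self.outputs[child] += self.outputs[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return (start_index, pattern) for every match in text."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for pattern in self.outputs[node]:
                hits.append((i - len(pattern) + 1, pattern))
        return hits

# Hypothetical intention keywords and caller text.
automaton = ACAutomaton(["loan", "house", "insurance"])
hits = automaton.search("do you need a house loan")
```

Each hit could then be mapped to the intention that its keyword represents.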

In this embodiment, when the multi-pattern matching algorithm does not hit any intention keyword, a knowledge graph corresponding to the voice text is constructed from knowledge data of different knowledge types, and the target intent corresponding to the voice text is inferred from the knowledge graph. For example, the multi-knowledge hybrid inference network performs intent understanding on the voice text, and its prediction result is adopted, that is, the intention represented by the target intention label output by the network is determined as the target intent.

In this way, in this embodiment, the accuracy of intent understanding can be ensured by combining the multi-pattern matching algorithm with multi-knowledge hybrid reasoning to understand the dialogue content.

In other words, in one embodiment, when the multi-pattern matching algorithm is used to determine whether the preset intent keyword exists in the voice text, if it is determined that the preset intent keyword exists in the voice text, the intent corresponding to the keyword may be directly determined as the target intent, without performing the steps of constructing the knowledge graph and the knowledge reasoning corresponding to the voice text.

In this embodiment, although the multi-pattern matching algorithm can hit only a small number of intention keywords, its accuracy is high, so its recognition result can be adopted preferentially. When the multi-pattern matching algorithm finds an intention keyword that matches the voice text, the intention corresponding to that keyword can be determined as the target intent. In this case there is no need to construct a knowledge graph and perform knowledge reasoning; for example, the multi-knowledge hybrid inference network does not need to be used for intent understanding of the voice text.

In this way, in this embodiment, the accuracy of intent understanding can be further ensured by combining the multi-pattern matching algorithm with multi-knowledge hybrid reasoning to understand the dialogue content.
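The keyword-first strategy described above can be sketched as a two-stage function. This is only an illustration: a plain substring scan stands in for the AC automaton, and `fallback_network` is a hypothetical callable standing in for the multi-knowledge hybrid inference network.

```python
from typing import Callable, Dict, Optional

def match_keyword_intent(text: str,
                         keyword_to_intent: Dict[str, str]) -> Optional[str]:
    """Return the intent of the first preset keyword found in text."""
    # Substring scan used as a stand-in for multi-pattern matching.
    for keyword, intent in keyword_to_intent.items():
        if keyword in text:
            return intent
    return None

def understand_intent(text: str,
                      keyword_to_intent: Dict[str, str],
                      fallback_network: Callable[[str], str]) -> str:
    """Prefer the keyword hit; call the network only on a miss."""
    intent = match_keyword_intent(text, keyword_to_intent)
    if intent is not None:
        return intent
    return fallback_network(text)
```

With this ordering the more expensive knowledge reasoning runs only for texts that contain no preset keyword.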

Optionally, the step 102 includes:

In one embodiment, knowledge data of different knowledge types may be stored in different knowledge bases, so that a relationship between each word in the voice text and a preset intention label may be obtained by querying knowledge data in a plurality of knowledge bases, and a knowledge graph may be further constructed based on the obtained relationship.

Specifically, each word and a preset intention label in the voice text can be used as nodes in a knowledge graph, and edges between the nodes in the knowledge graph are constructed according to the relation between each word and the preset intention label, namely, the nodes of the knowledge graph consist of each word and the preset intention label in the voice text, and the edges consist of the relation between each word and the preset intention label.

The knowledge graph construction process may be as shown in FIG. 2. For example, given a scene knowledge base S and a common sense knowledge base C, for any words w_i and w_j in the dialog text sequence P and the intention label text sequence Q, if both appear together in a triple in any knowledge base, an edge r_ij, namely the relationship in that triple, is constructed between w_i and w_j in the knowledge graph. In the hybrid knowledge graph, the relationships between dialog text and intention label text are provided by the scene knowledge base, and the relationships within the dialog text are provided by the common sense knowledge base; the various kinds of knowledge jointly support model reasoning so as to achieve a better intent understanding effect.

Therefore, the knowledge graph corresponding to the voice text can be quickly and accurately constructed through the embodiment, so that a knowledge reasoning model can be better guided to conduct knowledge reasoning through the constructed knowledge graph, and the intention of a user can be accurately understood.

In one embodiment, the plurality of knowledge bases includes at least a scene knowledge base and a common sense knowledge base. The scene knowledge base stores knowledge data for different scenes, such as promotion and financial management, and the common sense knowledge base stores knowledge data from daily conversations. The scene knowledge may consist of manually written correspondences between scene categories and keywords; for example, (house purchasing requirement, scene category, property promotion) is a knowledge triple. The common sense knowledge comes from a common sense knowledge base; for example, (store staff, has the ability to achieve, promotion) is a knowledge triple.

In this embodiment, the relationship between each word in the dialogue text and each preset scene tag may be obtained by querying knowledge data in the scene knowledge base. This relationship may be a subordinate relationship; for example, the query finds that "house purchasing requirement" in the dialogue text belongs to the scene tag "property promotion". Relationships between words in the dialogue text can also be obtained by querying knowledge data in the common sense knowledge base. These may be association relationships or subordinate relationships; for example, the query finds that "store staff" in the dialogue text has the ability to achieve "promotion", and that "fund" and "investment" belong to "financial management".

In this way, through the embodiment, the relation between the words in the dialogue text and the relation between each word and the intention label in the dialogue text can be quickly and accurately obtained by means of the scene knowledge base and the common knowledge base, so that the user intention in the dialogue text can be accurately understood.
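The construction rule described above (add an edge whenever both ends of a knowledge triple occur among the dialogue words and intention label words) can be sketched as follows. The function name, the triple layout `(head, relation, tail)` and the sample data are assumptions made for illustration, based on the example triples mentioned in the text.

```python
def build_knowledge_graph(dialog_words, label_words, knowledge_bases):
    """Return (nodes, edges) for one dialogue text.

    nodes: dialogue words followed by intention label words.
    edges: {(head, tail): relation} for every triple whose head and
           tail are both nodes.
    """
    nodes = list(dict.fromkeys(dialog_words + label_words))  # keep order
    node_set = set(nodes)
    edges = {}
    for kb in knowledge_bases:
        for head, relation, tail in kb:
            if head in node_set and tail in node_set:
                edges[(head, tail)] = relation
    return nodes, edges

# Hypothetical knowledge bases.
scene_kb = [("house purchasing requirement", "scene category",
             "property promotion")]
common_kb = [("store staff", "has the ability to achieve", "promotion"),
             ("fund", "belongs to", "financial management")]

dialog_words = ["store staff", "promotion", "house purchasing requirement"]
label_words = ["property promotion", "financial management"]
nodes, edges = build_knowledge_graph(
    dialog_words, label_words, [scene_kb, common_kb])
```

The triple about "fund" contributes no edge here because "fund" does not occur in this dialogue text.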

Optionally, the step 103 includes:

determining a relationship vector in the knowledge graph;

In one embodiment, voice text, preset intention labels, relationships, and the like can be vectorized, an attention mechanism guided based on the knowledge graph is adopted to understand intention of dialogue content, and a multi-knowledge hybrid reasoning network can be used to complete a knowledge reasoning process.

The multi-knowledge hybrid inference network may be composed of a pre-trained language model and a knowledge inference model, and a variety of knowledge is provided to the network for supporting deep understanding of dialog text. The whole network reasoning process can be divided into three phases: a text encoding stage, a knowledge retrieval stage and a knowledge reasoning stage.

The text encoding stage realizes vectorization representation of the input dialogue text and preset intention label text. This stage is mainly completed by means of a pre-training language model, and the introduction of pre-training knowledge is realized. In the text encoding stage, the dialogue text, namely the voice text and the preset intention label text, can be input into a pre-training language model to encode the dialogue text and the preset intention label text, so that the text vectorization representation with the context information is obtained.

Specifically, let the input dialog text sequence be P = {p_1, p_2, ..., p_m} and the preset intention label text sequence be Q = {q_1, q_2, ..., q_n}. The two text sequences are concatenated into a combined sequence of length m+n, which is encoded by a Chinese pre-trained language model (RoBERTa):

H_p, H_q = RoBERTa(P, Q)

where H_p is the pre-trained language representation of the dialog text, of length m, and H_q is the pre-trained language representation of the intention label text, of length n. The width of both is the dimension k of the pre-trained language model. In the training stage, the RoBERTa model weights are continuously updated and adjusted during training. It should be noted that keyword information can be used in the model training process to select data with higher reliability from massive log data as training data for the neural network, which greatly reduces the data labeling cost and the dependence on labeled data.

The knowledge retrieval stage produces the relationships among the words in the dialogue text and the preset intention label text. In this stage, a plurality of knowledge bases can be retrieved, and the relationships among the words in the dialogue text and the preset intention labels are obtained from them. For example, the word "house purchase" in the dialogue text has a membership relationship with the intention label "property promotion", and the word "insurance" in the dialogue text has an association relationship with "recommendation". A knowledge graph is constructed based on these relationships, in which the node representations are produced by the text encoding stage and the relation representations between nodes are produced in the knowledge retrieval stage.

For example, for a knowledge graph G generated in the knowledge retrieval stage, let the node set be W and the relationship set be R. For a node w_i, its node representation n_i is given by the pre-trained language model of the text encoding stage. For a relationship r_ij, its relation representation e_ij is obtained by looking up a relation vector table. The relation vector table contains all the relationships in the knowledge graph; its vector representations can be randomly initialized and are continuously updated during training.
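The relation vector table described above can be sketched as a small lookup structure. This is a minimal illustration without any training loop; the class and function names, the initialization range, the zero vector for unrelated node pairs and the symmetric treatment of edges are all assumptions not stated in the text.

```python
import random

class RelationVectorTable:
    """Maps each relation name to a randomly initialized vector."""

    def __init__(self, relations, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.vectors = {
            relation: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for relation in sorted(set(relations))
        }

    def lookup(self, relation):
        return self.vectors[relation]

def relation_matrix(nodes, edges, table):
    """Build an n x n grid holding e_ij for every related node pair."""
    index = {word: i for i, word in enumerate(nodes)}
    # Assumption: unrelated pairs get a zero vector.
    matrix = [[[0.0] * table.dim for _ in nodes] for _ in nodes]
    for (head, tail), relation in edges.items():
        i, j = index[head], index[tail]
        # Assumption: edges are treated as undirected.
        matrix[i][j] = matrix[j][i] = table.lookup(relation)
    return matrix

# Hypothetical graph with two nodes and one relation.
graph_nodes = ["house purchase", "property promotion"]
graph_edges = {("house purchase", "property promotion"): "belongs to"}
table = RelationVectorTable(graph_edges.values(), dim=4)
e = relation_matrix(graph_nodes, graph_edges, table)
```

In a trainable model the vectors in the table would be parameters updated by gradient descent rather than fixed random values.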

After the knowledge graph is built, the knowledge reasoning model can perform knowledge-guided intent recognition according to the knowledge graph; that is, intent recognition is carried out on the node representations using an attention mechanism guided by the relation representations in the knowledge graph, and a target intention label is output. Specifically, in the knowledge reasoning stage, the vectorized representation of the dialogue text and the vectorized representation of the intention label text are concatenated and jointly encoded by the knowledge reasoning model, using an attention mechanism guided by the relation vectors in the knowledge graph, i.e. by the relationships extracted in the knowledge retrieval stage. Dialogue intent understanding based on multi-knowledge reasoning is thereby realized, and the knowledge reasoning model outputs the target intention label.

In this way, in the embodiment, the dialogue intention is understood by adopting the mode of fusing the pre-training language model and the multi-knowledge reasoning model, so that the problem of insufficient background knowledge in the existing pre-training language model can be effectively relieved, the deep fusion of multiple kinds of knowledge is realized, the model can realize deeper text understanding according to the knowledge retrieval result, and a better user intention recognition effect is achieved.

and splicing the knowledge representation vector and the vectorized representation, performing full connection processing to obtain the probability of each intention label in the preset intention labels, and determining the intention label with the highest probability as a target intention label.

In one embodiment, the knowledge reasoning model adopts a knowledge graph-guided attention mechanism to realize intention recognition, and the specific process is as follows:

The current dialog text representation and the intention label text representation are H_p and H_q, with lengths m and n respectively. The two are first concatenated to form a representation sequence H of length m+n; global features are then extracted using a knowledge-guided attention mechanism, and the attention score is calculated:

where Q, K and V are the results of three different linear transformations of H, f(R) represents the operation of taking out the relation vectors in the current knowledge graph, and d_k is the network dimension. Residual connection and layer normalization are then applied:

H′＝LayerNorm(Attn+H)

where LayerNorm represents the layer normalization operation, the residual connection may be implemented using a residual network. The full connection layer is then applied to derive the final representation:

H^f = LayerNorm(H′ + ReLU(FC(H′)))

where ReLU is an activation function and FC denotes a fully connected layer. The generated knowledge representation vector H^f is then concatenated with the original vector H, and classification is completed through a fully connected layer:

y = softmax(FC([H^f; H]))

where y represents the intent tag distribution, the intent tag with the highest probability may ultimately be taken as the final predicted target intent tag.
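The final classification step, y = softmax(FC([H^f; H])) followed by selection of the most probable label, can be sketched as follows. The sketch assumes that H^f and H have already been reduced to one vector each; the weights, bias and label names are hypothetical values chosen for illustration, not trained parameters.

```python
import math

def linear(x, weight, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_intent(h_f, h, weight, bias, labels):
    """Concatenate [H^f; H], classify, and return the top label."""
    features = h_f + h  # list concatenation
    probs = softmax(linear(features, weight, bias))
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs

labels = ["promotion", "financial management", "real estate"]
h_f = [1.0, 0.0]   # hypothetical knowledge representation
h = [0.0, 1.0]     # hypothetical original representation
weight = [[0.1, 0.0, 0.0, 0.1],
          [1.0, 0.0, 0.0, 1.0],
          [0.0, 1.0, 1.0, 0.0]]
bias = [0.0, 0.0, 0.0]
target_label, y = predict_intent(h_f, h, weight, bias, labels)
```

With these illustrative weights the second label receives the largest logit and is therefore selected.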

Thus, according to the embodiment, the target intention label can be accurately inferred according to the node representation and the relation representation in the knowledge graph.

The structure of the multi-knowledge hybrid inference network may be as shown in fig. 3 according to the description of the above embodiments.

Step 104: sending the target reply voice corresponding to the target intention to the calling party.

After the target intention in the dialogue text is identified, the corresponding target reply voice can be determined according to the target intention. Specifically, the reply voice can be generated from the intent understanding result by a trained model; alternatively, voice templates corresponding to different intentions can be configured in advance and the reply voice determined from the voice template corresponding to the target intention, so that the reply content is simpler and more controllable.

The determined target reply voice may then be sent to the caller to effect an automatic conversation with the caller. It should be noted that, when receiving the reply voice sent again by the caller, the following steps may be performed again according to the logic from step 101 to step 104, so as to perform a further dialogue with the caller.

Optionally, the step 104 includes:

and sending the target reply voice to the calling party.

In one embodiment, in order to ensure that the replies given when answering on the user's behalf are reasonable, safe and controllable, speaking templates for different intentions can be designed in advance, such as a promotion speaking template used in a promotion scene and a financial management speaking template used in a financial management scene.

Therefore, in this embodiment, after identifying the target intention in the dialogue, the speaking template corresponding to the target intention may be queried, and the corresponding reply language may be selected from the speaking template, and the reply voice to the caller may be generated based on the reply language.

This embodiment designs the dialogue reply logic for the harassing call scene. The scripts can be designed according to conversational habits in everyday calls, and replies can be tailored to different scenes, achieving scene-sensitive intelligent dialogue. First, script slots can be designed for the different scenes. Each slot consists of several concrete script texts of the same type or meaning, and during an actual call one of them is selected at random to fill the current slot, which gives the dialogue diversity. Each scene usually starts with a greeting script, which completes the mutual greeting after the call is answered. Then, for the current scene category and subcategory, the dialogue is developed according to the speech act in the scene (greeting, reply or inquiry), realizing functions such as asking questions and replying within the scene. Finally, a closing script politely declines the caller's request and ends the dialogue.
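The slot mechanism described above can be sketched as a mapping from slot names to interchangeable sentences, one of which is drawn at random. The slot names and the English sentences are hypothetical placeholders, not the scripts of the embodiment.

```python
import random

# Each slot holds several sentences of the same type or meaning.
SCRIPT_SLOTS = {
    "greeting": ["Hello.", "Hello, who is this?"],
    "inquiry": ["Are you recommending a property to me?",
                "Which property is this about?"],
    "closing": ["I do not need it, please do not call again.",
                "No, thank you. Goodbye."],
}

def fill_slot(slot, rng=random):
    """Pick one sentence at random for the given slot."""
    return rng.choice(SCRIPT_SLOTS[slot])

def build_call_script(order=("greeting", "inquiry", "closing"), rng=random):
    """Fill the slots in order: greeting, in-scene dialogue, closing."""
    return [fill_slot(slot, rng) for slot in order]
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which can be convenient for testing.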

Therefore, the reply content is controlled to be generated through the pre-designed speaking template, the implementation process is simple and controllable, the maintenance is easy, the safety and the reasonability of the reply content can be ensured, and the reply content can represent the current user intention.

In a specific embodiment, after intention recognition is performed on the dialogue text, current intention information, namely the target intention, can be compared with historical intention information, namely the intention determined last time, if the current intention information is the same as the historical intention information, a current speaking template, namely a speaking template corresponding to the historical intention information, can be directly queried, and a corresponding next sentence is selected from the current speaking template and is used as a current reply language; if the current intention information is different from the historical intention information, updating the current intention state, searching a conversation template corresponding to the current new intention, and selecting a first sentence reply word under the conversation as a current reply word.

Thus, through the embodiment, the conversation template corresponding to the current intention can be quickly positioned, so that the corresponding reply language can be quickly determined.
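The comparison between the current intention and the previously determined intention can be sketched as follows. The class and function names are hypothetical, and repeating the last sentence once a template is exhausted is an assumption, since the text does not say what happens in that case.

```python
class DialogState:
    """Keeps the last recognized intent and the position in its template."""

    def __init__(self):
        self.intent = None
        self.position = 0

def next_reply(state, current_intent, templates):
    """Select the reply for this turn and update the dialogue state."""
    if current_intent == state.intent:
        # Same intent as last time: move on to the next sentence.
        state.position += 1
    else:
        # New intent: switch template and start from its first sentence.
        state.intent = current_intent
        state.position = 0
    script = templates[current_intent]
    # Assumption: stay on the last sentence when the template runs out.
    return script[min(state.position, len(script) - 1)]

# Hypothetical speaking templates.
templates = {
    "promotion": ["Are you recommending a property to me?",
                  "I do not need it, please do not call again."],
    "financial management": ["Is this about an investment product?"],
}
state = DialogState()
first = next_reply(state, "promotion", templates)
second = next_reply(state, "promotion", templates)
switched = next_reply(state, "financial management", templates)
```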

Optionally, before the step 102, the method further includes:

under the condition of receiving the call, acquiring user information;

In the embodiment of the application, besides the user intention recognition algorithm module, the harassing call substitution system also involves a large amount of information interaction with the user, and the interaction generally needs to query personal information or preset speaking logic of the user so as to determine the current dialogue state and the reply text needing to be returned. Thus, a large number of data storage, modification and query operations are required during system information interaction. The above operations are typically implemented using a database. In one embodiment, a remote dictionary service Redis may be employed as a database for data services.

Redis, the remote dictionary service, is an open-source, log-type key-value database written in ANSI C; it supports network access, can be memory-based and also persistent, and provides application programming interfaces (APIs) in multiple languages. Redis is a high-performance key-value database with clients for Java, C/C++, C#, Python and other languages, and is convenient to use. Redis supports master-slave synchronization: data can be synchronized from a master server to any number of slave servers, and a slave server can itself be the master of other slave servers, which allows Redis to perform single-rooted tree replication. Slave servers may accept write operations, intentionally or unintentionally. Because the publish/subscribe mechanism is fully implemented, a client of a slave database anywhere in the replication tree can subscribe to a channel and receive the complete message publication record of the master server. Synchronization helps with the scalability of read operations and with data redundancy.

In this embodiment, when a call of the preset type, such as a nuisance call, is answered, user information may be obtained first. The user information contains the unique ID of the user, and a remote database corresponding to the user information stores the current state of the user, namely whether the conversation is suspended or continuing. After the network interface receives the user request, the user information is looked up first, the corresponding user-information Redis database is queried, and the current state of the user under that ID is returned. If the user state is suspended, the current call has ended and cannot be continued; an empty string is returned to the user, and every state field returns its default empty value. If the user state is continuing, the current conversation continues: intent recognition is performed on the dialogue text, and speaking-information interaction proceeds according to the intent recognition result. The user information interaction flow may be as shown in fig. 4.

Thus, according to the embodiment, the current call state can be judged based on the user information, and whether the intention recognition and the interaction of the speaking information are continued or not can be further determined.
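The state check described above can be sketched as follows. A plain dictionary stands in for the Redis database so that the example runs on its own; the state names, the function name and the treatment of unknown users as continuing are assumptions made for illustration.

```python
SUSPENDED = "suspended"
CONTINUING = "continuing"

def handle_request(user_id, dialog_text, state_store, recognize_intent):
    """Return "" for a suspended call, else the recognized intent."""
    # Assumption: a user with no stored state is treated as continuing.
    state = state_store.get(user_id) or CONTINUING
    if state == SUSPENDED:
        return ""  # the call has ended, nothing more to reply
    return recognize_intent(dialog_text)

# In-memory stand-in for the user-information database.
state_store = {"user-001": SUSPENDED, "user-002": CONTINUING}

def recognize(text):
    # Placeholder for the intent recognition step.
    return "promotion"

ended_reply = handle_request("user-001", "hello", state_store, recognize)
active_reply = handle_request("user-002", "hello", state_store, recognize)
```

In a deployment, the dictionary would be replaced by a client of the key-value database, keyed by the unique user ID.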

Optionally, the method further comprises:

acquiring a special identifier in the process of answering the call;

and terminating the call when the special identifier is an end identifier.

In the process of answering the call, the received dialogue input text can be identified first to judge whether the dialogue input text is a special identifier, wherein two types of special identifiers are respectively a start identifier "start" and an end identifier "end". If the input text is "start", representing a session start instruction, and needing to reply to a common greeting, at this time, randomly selecting a common greeting, and sending a common greeting voice to the caller; if the input text is "end", the session end instruction is represented, and the session needs to be terminated, and the session can be directly ended. Except for the two special identifiers, the rest input text is dialogue history information transmitted by the user, namely dialogue text returned by the calling party, and the intention recognition is needed for the dialogue text, and a reply voice is returned based on a speaking template. The flow of interaction of the session information may be as shown in fig. 5.

Thus, according to this embodiment, appropriate telephone information interaction can be performed by determining the input text in the substitute call.
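The handling of the two special identifiers can be sketched as a small dispatch function. The greeting sentences and the `reply_from_template` callable are hypothetical placeholders for the greeting script and for the intent recognition and template lookup steps.

```python
import random

GREETINGS = ["Hello.", "Hello, who is this?"]

def handle_input(text, reply_from_template, rng=random):
    """Return (reply, session_ended) for one piece of input text."""
    if text == "start":
        # Session start: reply with a randomly chosen common greeting.
        return rng.choice(GREETINGS), False
    if text == "end":
        # Session end: no reply, terminate the session.
        return None, True
    # Any other text is dialogue content returned by the caller.
    return reply_from_template(text), False
```

For example, `handle_input("end", some_reply_function)` returns `(None, True)`, which signals that the call should be terminated.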

The following describes the specific script design logic, taking the property-promotion scene as an example. First, at the beginning of the session, the caller is answered with a greeting such as "Hello, hello". Second, the dialogue is developed for the property-promotion scene, returning a reply such as "Are you recommending a property to me?". Finally, before the dialogue ends, a closing script is returned, such as "Please do not call again, I do not need it, and do not contact me any more". The scripts in the speaking template can be modified flexibly, so that the content of the reply text can be adjusted as required.

The call processing method in the embodiment of the application can be applied to a harassing call substitution system. The system framework is shown in fig. 6 and mainly comprises three modules: a network layer, an algorithm layer and a data layer. First, a user initiates a call request and transmits dialogue information and user information. After the request is routed to the specified IP via the network layer, specific algorithm functions (including the multi-pattern matching algorithm and the multi-knowledge hybrid inference network) are invoked to understand the user intent. The data layer then looks up the user information data and the preset scripts according to the user intent and returns a script reply, which the network layer interface provides to the user.

The whole process of the harassing call substitution system can be shown in fig. 7. For a single user, the dialogue usually unfolds over multiple rounds, so the whole flow needs to be built as a loop in which the user information is retained and the history state is stored until the dialogue ends. First, a calling party initiates an access request, and the system receives the reply text transmitted by the calling party and confirms the identity of the calling party. The user intent is then judged from the reply text, the current scene information is confirmed, and it is updated together with the user information and stored in a user database. The script database is searched based on the latest user intent, the current dialogue progress is confirmed according to the user information, and the corresponding preset reply text is returned. Finally, it is judged whether the dialogue has ended; if the current state has not reached the designated number of dialogue rounds, the system continues to receive the other party's replies until the dialogue ends.

Compared with a traditional harassing call substitution system architecture based on a neural network generation model, the system architecture of this embodiment extends the algorithm module and adds a network module and a data module, so that the system is more complete and its use efficiency is improved.

The embodiment of the application has the following advantages. First, the joint intent recognition method based on keyword matching and neural network recognition can still achieve a good effect in scenes with insufficient annotation data, which relieves the problem of excessive dependence on annotation data, and it can greatly improve the effect in scenes with sufficient annotation data. Second, various kinds of knowledge are introduced, unified into the form of a knowledge graph, and brought into a Transformer framework whose main body is pre-trained knowledge, realizing a knowledge-guided intent understanding method. This effectively relieves the problem of insufficient background knowledge in existing pre-trained language models and realizes deep fusion of the various kinds of knowledge, so that the model can understand text more deeply according to the knowledge retrieval results and achieve a better user intent recognition effect. Third, preset scripts are configured, and the corresponding reply texts for each scene are designed according to the different intentions, so that the reply content is simple and controllable and security risks for the user during use are avoided.

Through the above technical advantages, the embodiment of the application further develops the harassing call substitution capability. It introduces various kinds of external knowledge on the basis of the mainstream method and realizes knowledge-guided dialogue intention understanding through a multi-knowledge reasoning model. Under this technical framework, more information can be fused more conveniently to adapt to the various requirements of users. The scheme has good portability and expandability and is suitable for large-scale popularization and use.

According to the call processing method of the embodiment of the application, when a call of a preset type is answered, a voice text sent by the calling party is obtained; a knowledge graph corresponding to the voice text is constructed according to knowledge data with different knowledge types, wherein the knowledge graph is used for representing the relation between each word in the voice text and a preset intention label; knowledge reasoning is carried out according to the knowledge graph, and a target intention corresponding to the voice text is determined; and a target reply voice corresponding to the target intention is sent to the calling party. Because knowledge data of multiple knowledge types are introduced to understand the dialogue content, the intention contained in the dialogue content can be understood more deeply, and the reply voice generated according to the intention recognition result can more accurately represent the true intention of the user.
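The four steps above can be traced with a toy example. The sketch below is an assumption-laden illustration: the two small dictionaries stand in for a scenario knowledge base and a common-sense knowledge base, the relation names are invented, and the edge-counting in `infer_intent` is a deliberately simple stand-in for the knowledge reasoning model of the application, not a reproduction of it.

```python
from collections import Counter

# Hypothetical knowledge bases: each maps a word to (relation, target) pairs.
SCENARIO_KB = {
    "loan": [("indicates", "intent:loan_promotion")],
    "premium": [("indicates", "intent:insurance_sales")],
}
COMMONSENSE_KB = {
    "interest": [("related_to", "loan")],
    "policy": [("related_to", "premium")],
}

def build_graph(words):
    """Collect (head, relation, tail) triples reachable from the words."""
    triples = []
    frontier = list(dict.fromkeys(words))  # de-duplicate, keep order
    seen = set(frontier)
    while frontier:
        head = frontier.pop()
        for kb in (SCENARIO_KB, COMMONSENSE_KB):
            for relation, tail in kb.get(head, []):
                triples.append((head, relation, tail))
                if tail not in seen:  # follow each new node once
                    seen.add(tail)
                    frontier.append(tail)
    return triples

def infer_intent(triples, default="intent:unknown"):
    """Toy reasoning: the intent label with the most incoming edges wins."""
    votes = Counter(t for _, _, t in triples if t.startswith("intent:"))
    return votes.most_common(1)[0][0] if votes else default

def handle_utterance(voice_text, replies):
    """Voice text -> knowledge graph -> target intent -> preset reply."""
    words = voice_text.lower().split()
    intent = infer_intent(build_graph(words))
    return intent, replies.get(intent, replies["intent:unknown"])

# Example reply table (assumed content).
REPLIES = {
    "intent:loan_promotion": "I do not need a loan, thank you.",
    "intent:insurance_sales": "I already have insurance, thank you.",
    "intent:unknown": "Sorry, could you say that again?",
}
```

In this example the word "interest" carries no intent label of its own; it reaches the loan-promotion label only through the common-sense edge to "loan", which is the kind of background knowledge the graph is meant to supply.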

The embodiment of the application also provides a call processing apparatus. Referring to fig. 8, fig. 8 is a block diagram of a call processing apparatus according to an embodiment of the present application. Because the principle by which the call processing apparatus solves the problem is similar to that of the call processing method in the embodiment of the present application, the implementation of the call processing apparatus can refer to the implementation of the method, and repeated details are omitted.

As shown in fig. 8, the call processing apparatus 800 includes:

a first obtaining module 801, configured to obtain a voice text sent by a caller when a call of a preset type is answered;

a construction module 802, configured to construct a knowledge graph corresponding to the voice text according to knowledge data with different knowledge types, where the knowledge graph is used to characterize a relationship between each word in the voice text and a preset intention label;

a first determining module 803, configured to perform knowledge reasoning according to the knowledge graph, and determine a target intention corresponding to the voice text;

a first sending module 804, configured to send, to the caller, a target reply voice corresponding to the target intention.

Optionally, the call processing apparatus 800 further includes:

the construction module 802 is configured to construct a knowledge graph corresponding to the voice text according to knowledge data with different knowledge types when it is determined by the multi-pattern matching algorithm that the preset intent keyword does not exist in the voice text.

Optionally, the call processing apparatus 800 further includes:

Optionally, the construction module 802 includes:

the query unit includes:

Optionally, the first determining module 803 includes:

the coding unit is used for coding the voice text and a preset intention label to obtain vectorized representation of the voice text and the preset intention label;

Optionally, the intention recognition unit includes:

Optionally, the first sending module 804 includes:

Optionally, the acquiring unit includes:

Optionally, the call processing apparatus 800 further includes:

the construction module 802 is configured to construct a knowledge graph corresponding to the voice text according to knowledge data with different knowledge types when the user status indicates that the call is continued.

Optionally, the call processing apparatus 800 further includes:

The call processing apparatus 800 provided in the embodiment of the present application may execute the above method embodiment; its implementation principle and technical effects are similar and are not repeated here.

The call processing apparatus 800 of the embodiment of the present application obtains a voice text sent by a caller when a call of a preset type is answered; constructs a knowledge graph corresponding to the voice text according to knowledge data with different knowledge types, wherein the knowledge graph is used for representing the relation between each word in the voice text and a preset intention label; carries out knowledge reasoning according to the knowledge graph and determines a target intention corresponding to the voice text; and sends a target reply voice corresponding to the target intention to the caller. Because knowledge data of multiple knowledge types are introduced to understand the dialogue content, the intention contained in the dialogue content can be understood more deeply, and the reply voice generated according to the intention recognition result can more accurately represent the true intention of the user.
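The division into modules 801 to 804 can be pictured as four pluggable stages. The sketch below is only one possible reading: the class `CallProcessingApparatus`, its field names, and the type signatures are assumptions, and the stub lambdas in the usage lines are placeholders for real speech recognition, graph construction, reasoning and speech synthesis components.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

@dataclass
class CallProcessingApparatus:
    """Wires four stages in the order of modules 801 to 804."""
    obtain: Callable[[bytes], str]                 # first obtaining module 801
    construct: Callable[[str], List[Triple]]       # construction module 802
    determine: Callable[[str, List[Triple]], str]  # first determining module 803
    send: Callable[[str], str]                     # first sending module 804

    def process(self, audio: bytes) -> str:
        text = self.obtain(audio)             # voice text from the caller
        graph = self.construct(text)          # knowledge graph for the text
        intent = self.determine(text, graph)  # target intention
        return self.send(intent)              # target reply for the intention

# Usage with stub stages (all behaviour here is assumed for illustration).
apparatus = CallProcessingApparatus(
    obtain=lambda audio: audio.decode("utf-8"),
    construct=lambda text: [(w, "mentions", "intent:sales")
                            for w in text.split() if w == "offer"],
    determine=lambda text, graph: graph[0][2] if graph else "intent:unknown",
    send=lambda intent: f"reply for {intent}",
)
```

Keeping each stage behind a plain callable means that a logical division like the one in fig. 8 can be regrouped or replaced without touching the other stages.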

The embodiment of the application also provides an electronic device. Because the principle by which the electronic device solves the problem is similar to that of the call processing method in the embodiment of the application, the implementation of the electronic device can refer to the implementation of the method, and repeated details are omitted. As shown in fig. 9, an electronic device according to an embodiment of the present application includes:

a processor 900, configured to read the program in the memory 920 and perform the following process:

and sending target reply voice corresponding to the target intention to the calling party through the transceiver 910.

a transceiver 910, configured to receive and transmit data under the control of the processor 900.

In fig. 9, the bus architecture may comprise any number of interconnected buses and bridges, which link together various circuits, in particular one or more processors represented by the processor 900 and a memory represented by the memory 920. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators and power management circuits, which are well known in the art and therefore are not described further herein. The bus interface provides an interface. The transceiver 910 may be a plurality of elements, including a transmitter and a receiver, providing a means for communicating with various other apparatus over a transmission medium. The processor 900 is responsible for managing the bus architecture and general processing, and the memory 920 may store data used by the processor 900 in performing operations.

Optionally, the processor 900 is further configured to read the program in the memory 920, and perform the following steps:

Optionally, the processor 900 is further configured to read the program in the memory 920, and perform the following steps:

determining a relationship vector in the knowledge graph;

the target reply voice is sent to the caller via transceiver 910.

Optionally, the processor 900 is further configured to read the program in the memory 920, and perform the following steps:

under the condition of receiving the call, acquiring user information;

acquiring a special identifier in the process of answering the call;

when the special identifier is a start identifier, sending a common greeting voice to the caller through the transceiver 910;

and terminating the call when the special identifier is an end identifier.
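The handling of the special identifier can be sketched as a small dispatch function. The enum `SpecialId`, the function `on_special_identifier`, the callback parameters `send` and `hang_up`, and the greeting text are all hypothetical names and values chosen for the example.

```python
from enum import Enum

class SpecialId(Enum):
    """Assumed encoding of the special identifiers."""
    START = "start"
    END = "end"

def on_special_identifier(identifier, send, hang_up,
                          greeting="Hello, who is calling?"):
    """Dispatch on the special identifier acquired while answering the call."""
    if identifier is SpecialId.START:
        send(greeting)   # common greeting voice at the start of the call
        return "greeted"
    if identifier is SpecialId.END:
        hang_up()        # terminate the call
        return "terminated"
    return "ignored"     # not a special identifier: leave the call as it is
```

Passing `send` and `hang_up` as callbacks keeps the dispatch logic independent of the transceiver, so it can be exercised without a live call.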

The electronic device provided by the embodiment of the present application may execute the above method embodiment; its implementation principle and technical effects are similar and are not repeated here.

Furthermore, a computer readable storage medium of an embodiment of the present application is used for storing a computer program, where the computer program can be executed by a processor to implement the steps of the method embodiment shown in fig. 1.

In the several embodiments provided in the present application, it should be understood that the disclosed methods and apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a division by logical function, and other divisions are possible in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices or units, and may be in electrical, mechanical or other form.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may be physically included separately, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.

The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

While the foregoing is directed to the preferred embodiments of the present application, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations are intended to be comprehended within the scope of the present application.

Claims

1. A call processing method, comprising:

2. The method according to claim 1, wherein the method further comprises:

3. The method of claim 2, wherein after determining whether a preset intent keyword is present in the voice text using a multi-pattern matching algorithm, the method further comprises:

4. A method according to any one of claims 1 to 3, wherein said constructing a knowledge graph corresponding to said voice text from knowledge data having different knowledge types comprises:

5. The method of claim 4, wherein the plurality of knowledge bases comprises a scenario knowledge base and a common knowledge base;

6. A method according to any one of claims 1 to 3, wherein said performing knowledge reasoning from the knowledge graph to determine a target intent corresponding to the voice text comprises:

determining a relationship vector in the knowledge graph;

7. The method of claim 6, wherein the intent recognition of the vectorized representation using an attention mechanism directed based on a relationship vector in the knowledge graph, outputting a target intent tag, comprises:

8. A method according to any one of claims 1 to 3, wherein said sending a target reply voice to the caller, to which the target intention corresponds, comprises:

and sending the target reply voice to the calling party.

9. The method of claim 8, wherein the obtaining the target answer phrase from the speech template corresponding to the target intention comprises:

10. A method according to any one of claims 1 to 3, wherein before constructing the knowledge graph corresponding to the voice text from knowledge data having different knowledge types, the method further comprises:

under the condition of receiving the call, acquiring user information;

11. A method according to any one of claims 1 to 3, further comprising:

acquiring a special identifier in the process of answering the call;

and terminating the call when the special identifier is an end identifier.

12. A call processing apparatus, comprising:

13. An electronic device, comprising: a transceiver, a memory, a processor, and a computer program stored on the memory and executable on the processor; a processor for reading a program in a memory to implement the steps in the call processing method according to any one of claims 1 to 11.

14. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps in the call processing method according to any one of claims 1 to 11.