CN117834781A - Answering machine detection and interaction method, device, terminal and storage medium - Google Patents

Answering machine detection and interaction method, device, terminal and storage medium

Info

Publication number: CN117834781A
Application number: CN202310126797.9A
Authority: CN
Original language: Chinese (zh)
Inventors: 马雨枫, 许文强, 郑达, 任腾
Prior art keywords: voice data, detected, detection module, classifier, voice
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Current and original assignee: Shanghai Pharud Intelligent Technology Co., Ltd. (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Shanghai Pharud Intelligent Technology Co., Ltd.
Abstract

The present application provides an answering machine detection and interaction method, device, terminal, and storage medium. The method comprises the following steps: acquiring voice data to be detected; inputting the voice data to be detected into a first detection module, which detects whether the voice data to be detected is human voice data; when the first detection module detects that the voice data to be detected is human voice data, extracting the human voice data and inputting it into a second detection module, which detects whether the human voice data is real human voice data; and when the second detection module detects that the human voice data is real human voice data, connecting the call to a manual agent or playing the voice robot's initial fixed script. Based on this method and device, the node at which an automatic answering machine hands over to a real person can be accurately identified and the corresponding interaction performed according to the identification result, improving the efficiency of call interaction and reducing the occupation of communication resources.

Description

Answering machine detection and interaction method, device, terminal and storage medium
Technical Field
The present invention relates to the field of speech recognition technology, and in particular to an answering machine detection and interaction method, apparatus, terminal, and storage medium.
Background
At present, as voice robot technology matures and becomes mainstream, automatic response robots are widely used in many areas of social life. For example, in the hotel industry, when a customer calls a hotel to reserve a room, an automatic response robot can handle the interaction and complete the reservation based on the customer's voice input. In the financial industry, when a financial institution needs to collect a debt, the collection call can be handled by an automatic response robot, which obtains information about the customer's willingness to repay through its interaction with the customer. In e-commerce, when a merchant needs to notify customers of promotional offers, the outbound call can be handled by an automatic response robot that announces the corresponding offer. The development of automatic response robots has thus greatly reduced labor costs in business communication and improved the efficiency of business interactions.
However, existing voice robots either cannot recognize that the other party is an automatic response robot or do so with low accuracy. When both parties to a call are automatic response robots and neither recognizes the other as one, neither hangs up; the two robots interact with each other, producing a long, useless exchange that wastes call resources. In addition, a call may initially be answered by an automatic answering machine and later be picked up by a real person. The prior art usually handles these situations by hanging up directly, which ignores the case where the answering machine hands over to a real person. The prior art therefore cannot accurately identify the point in time at which an automatic answering machine is replaced by a real person, and lacks a corresponding interaction strategy.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present application is to provide an answering machine detection and interaction method, apparatus, terminal, and storage medium, to solve the problem that prior-art detection and interaction methods cannot accurately identify the node at which an automatic answering machine hands over to a real person and interact accordingly based on the identification result.
To achieve the above and other related objects, a first aspect of the present application provides an answering machine detection and interaction method, comprising: acquiring voice data to be detected; inputting the voice data to be detected into a first detection module, which detects whether the voice data to be detected is human voice data; when the first detection module detects that the voice data to be detected is human voice data, extracting the human voice data and inputting it into a second detection module, which detects whether the human voice data is real human voice data; and when the second detection module detects that the human voice data is real human voice data, connecting the call to a manual agent or playing the voice robot's initial fixed script.
In some embodiments of the first aspect of the present application, in order to improve the accuracy of distinguishing human voice data from non-human voice data, the first detection module is a trained classifier, and detecting whether the voice data to be detected is human voice data comprises: inputting the voice data to be detected into the trained classifier; having the trained classifier output each classification result and the prediction probability of each classification result; and, when the prediction probability of the human voice class is not smaller than a preset probability threshold, judging the voice data to be detected to be human voice data.
In some embodiments of the first aspect of the present application, to handle the case where the prediction probability of every classification result falls below the preset probability threshold, the judgment result for the voice data to be detected is output as uncertain in that case, and a state machine is set up to describe how the judgment result changes over time.
In some embodiments of the first aspect of the present application, to prevent non-human voice data interspersed in speech from interfering with detection, the state machine describing changes in the judgment result is set up as follows: the initial state of the state machine is uncertain; if the current judgment result is human voice data, the current state becomes speaking; if the current judgment result is non-human voice data, the current state becomes non-human voice; if the current judgment result is uncertain, the current state becomes uncertain; and if the current judgment result is non-human voice data but the previous judgment result was human voice data, the current state is kept as speaking.
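The transition rules above can be sketched as a small state machine. In the minimal Python sketch below, the state names, the 'human'/'non-human'/'uncertain' labels, and the update API are illustrative assumptions, not the patent's implementation:

```python
class SpeakingStateMachine:
    """State machine for the judgment-result transitions described above.
    Names and API are illustrative, not taken from the patent."""

    UNCERTAIN = "uncertain"
    SPEAKING = "speaking"
    NON_HUMAN = "non-human"

    def __init__(self):
        self.state = self.UNCERTAIN   # the initial state is uncertain
        self.last_result = None       # previous per-window judgment

    def update(self, result):
        """result is one of 'human', 'non-human', 'uncertain'."""
        prev, self.last_result = self.last_result, result
        if result == "human":
            self.state = self.SPEAKING
        elif result == "non-human":
            # Non-human voice immediately after human voice keeps the
            # 'speaking' state, so noise interspersed in speech does
            # not interrupt detection.
            self.state = self.SPEAKING if prev == "human" else self.NON_HUMAN
        else:
            self.state = self.UNCERTAIN
        return self.state
```

Note how a single non-human window sandwiched in speech leaves the state as speaking, while two consecutive non-human judgments switch it to non-human voice.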
In some embodiments of the first aspect of the present application, so that the classifier has good prediction performance after training, the trained classifier is obtained as follows: acquire a training set for the classifier, the training set comprising human voice data and non-human voice data, where the non-human voice data includes noise data, music data, and call prompt tones; then construct a classifier for audio classification and train it on the training set until it converges, yielding the trained classifier.
In some embodiments of the first aspect of the present application, to further improve training efficiency and speed up convergence, when the training set is fed into the audio classifier, fixed-length audio segments are randomly extracted from the training clips and input into the classifier.
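The random fixed-length extraction can be sketched as follows; the zero-padding of clips shorter than the crop length is an assumption the patent does not specify:

```python
import random

def random_fixed_crop(samples, crop_len):
    """Randomly extract a fixed-length segment from a training clip.
    Clips shorter than crop_len are zero-padded at the end
    (illustrative; the patent does not specify padding behaviour)."""
    if len(samples) <= crop_len:
        return samples + [0.0] * (crop_len - len(samples))
    # Pick a random start so every valid window can be chosen.
    start = random.randrange(len(samples) - crop_len + 1)
    return samples[start:start + crop_len]
```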
In some embodiments of the first aspect of the present application, to support detection in a streaming scenario, the voice data to be detected is acquired with a sliding window and input into the first detection module window by window.
In some embodiments of the first aspect of the present application, the classifier is a convolutional neural network, for better training efficiency and model performance.
In some embodiments of the first aspect of the present application, to determine whether the human voice data is real or non-real human voice data, the second detection module checks whether the human voice data satisfies one or more of the following conditions: the number of words in the human voice data exceeds a word-count threshold; the speech rate of the human voice data exceeds a speech-rate threshold; the human voice data contains tag words; the human voice data contains numbers with consistency. When the human voice data satisfies none of these conditions, the second detection module judges it to be real human voice data.
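The conditions above lend themselves to a simple rule-based check. In the sketch below, every threshold value and tag word is an illustrative assumption, and "numbers with consistency" is interpreted as a repeated identical number, which is only one plausible reading of the patent:

```python
import re

def looks_like_answering_machine(text, duration_s,
                                 word_threshold=30,
                                 rate_threshold=5.0,
                                 tag_words=("voicemail", "leave a message")):
    """Heuristic second-stage check sketched from the conditions above.
    Returns True when the transcript matches answering-machine traits,
    i.e. it is judged NOT to be real human speech. All thresholds and
    tag words are illustrative assumptions."""
    words = text.split()
    if len(words) > word_threshold:                 # too many words at once
        return True
    if duration_s > 0 and len(words) / duration_s > rate_threshold:
        return True                                 # speech rate too high
    lowered = text.lower()
    if any(tag in lowered for tag in tag_words):
        return True                                 # contains tag words
    digits = re.findall(r"\d+", text)
    if len(digits) >= 2 and len(set(digits)) == 1:
        return True                                 # repeated, consistent number
    return False
```

A short, unscripted reply such as "hello, who is this?" triggers none of the rules and is therefore treated as real human speech.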
In some embodiments of the first aspect of the present application, to accommodate users' individual settings, before acquiring the voice data to be detected the method further comprises: acquiring the current on/off state of answering machine detection, and executing the step of acquiring the voice data to be detected only when answering machine detection is on.
In some embodiments of the first aspect of the present application, likewise to accommodate users' individual settings, when answering machine detection is off, the step of inputting the voice data to be detected into the first detection module is skipped, and the call is connected directly to a manual agent or the voice robot's initial fixed script is played.
In some embodiments of the first aspect of the present application, to reduce the waste of human communication resources, when the first detection module detects that the voice data to be detected is non-human voice data, or the second detection module detects that the human voice data is non-real human voice data, the voice robot's fixed script is played.
In some embodiments of the first aspect of the present application, to reduce the waste of communication resources and improve interaction efficiency, if the second detection module has still not detected real human voice data after a preset detection time threshold, the call is hung up and the call record is marked.
In some embodiments of the first aspect of the present application, to keep the answering party from hanging up prematurely, a greeting is played repeatedly before the call is connected to a manual agent.
To achieve the above and other related objects, a second aspect of the present application provides an answering machine detection and interaction device, comprising: a voice data acquisition module, for acquiring voice data to be detected; a human voice detection module, for inputting the voice data to be detected into a first detection module that detects whether the voice data to be detected is human voice data; a real person detection module, for extracting the human voice data and inputting it into a second detection module when the first detection module detects human voice data, the second detection module detecting whether the human voice data is real human voice data; and an interaction module, for connecting the call to a manual agent or playing the voice robot's initial fixed script when the second detection module detects that the human voice data is real human voice data.
To achieve the above and other related objects, a third aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method.
To achieve the above and other related objects, a fourth aspect of the present application provides an electronic terminal, including: a processor and a memory; the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory so as to enable the terminal to execute the method.
As described above, the answering machine detection and interaction method, device, terminal, and storage medium of the present application have the following beneficial effects:
the method and the device can accurately and efficiently judge whether the call answering party is an automatic answering machine or a real person, accurately identify the node of the call answering party, which is converted into the real person by the automatic answering machine, answer, and execute corresponding interactive operation according to the judging result. For example, when the call answering party is judged to be an automatic answering machine, the fixed phone operation of the voice robot is continuously played, and answering machine detection is continuously performed until the call answering party is judged to be converted into a real person for answering, and at the moment, the call is connected into a manual seat or the initialized fixed phone operation of the voice robot is played, so that the interaction requirement of the real person user of the answering party is met. Therefore, the method, the device, the terminal and the storage medium for detecting and interacting the answering machine can accurately and efficiently detect the answering machine and conduct corresponding interaction according to the detection result, so that communication efficiency is improved and user experience is improved.
Drawings
Fig. 1 is a schematic flow chart of an answering machine detection and interaction method according to an embodiment of the present invention.
Fig. 2 is a flowchart of a method by which the first detection module detects whether the voice data to be detected is human voice data, according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a state machine describing changes in the judgment result, according to an embodiment of the present invention.
Fig. 4 is a flow chart of a method for obtaining the trained classifier according to an embodiment of the invention.
Fig. 5 is a schematic diagram of a sliding window for acquiring the voice data to be detected according to an embodiment of the present invention.
Fig. 6 is a flowchart of a method by which the second detection module detects whether the human voice data is real human voice data, according to an embodiment of the present invention.
Fig. 7 is a schematic structural diagram of an electronic terminal according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of an answering machine detection and interaction device according to an embodiment of the present application.
Detailed Description
Other advantages and effects of the present application will become apparent to those skilled in the art from the present disclosure, when the following description of the embodiments is taken in conjunction with the accompanying drawings. The present application may be embodied or carried out in other specific embodiments, and the details of the present application may be modified or changed from various points of view and applications without departing from the spirit of the present application. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict.
It is noted that in the following description, reference is made to the accompanying drawings, which describe several embodiments of the present application. It is to be understood that other embodiments may be utilized and that mechanical, structural, electrical, and operational changes may be made without departing from the spirit and scope of the present application. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present application is defined only by the claims of the issued patent.
Furthermore, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises", "comprising", "includes" and/or "including" specify the presence of stated features, operations, elements, components, items, categories, and/or groups, but do not preclude the presence or addition of one or more other features, operations, elements, components, items, categories, and/or groups. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C". An exception to this definition occurs only when a combination of elements, functions, or operations is in some way inherently mutually exclusive.
To solve the problems described in the background, the present invention provides an answering machine detection and interaction method, device, terminal, and storage medium, aimed at the problem that prior-art methods cannot accurately identify the node at which an automatic answering machine hands over to a real person and interact accordingly based on the identification result. To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described in further detail below through examples with reference to the accompanying drawings. It should be understood that the specific embodiments described here are for illustration only and are not intended to limit the scope of the invention.
Before explaining the present invention in further detail, the terms used in the embodiments are explained; the following explanations apply throughout the embodiments of the present invention:
<1> Answering machine (Answer Machine): used synonymously in this application with conversation robot, automatic response robot, and automatic interaction robot. It is essentially a program that simulates human conversational chat and provides services, replacing manual call answering with the automatic responses of a voice robot. In the broader trend of robots replacing manual labor, many enterprises already use or plan to use such robots. In the call center industry, for example, telephone robots stand out, taking over or assisting the work of manual agents, freeing up staff and improving efficiency through their high throughput.
<2> Interactive voice response (IVR, Interactive Voice Response): a generic term for telephone-based voice value-added services. A telephone user dials a designated number and, following voice prompts, can listen to or submit information and take part in interactive services such as chat or social introductions. Some commercial institutions (such as banks, credit card centers, and telecommunication companies) also provide automated telephone inquiry services through an interactive voice response system (Interactive Voice Response System, IVRS): customers dial a designated number, enter the system, and follow its instructions to choose options or enter personal data in order to hear prerecorded information, have specific data (such as an account balance or amount due) read aloud by a computer system according to a preset call flow (Call Flow) combined with their data, or enter transaction instructions through the system to perform preset transactions (such as transferring funds, changing a password, or changing a contact telephone number).
<3> Manual agent response: a communication media term. According to the customer's needs, a voice channel handled by automatic voice response (IVR) is transferred to a manual agent; the customer then talks one-to-one with the agent, who answers the customer's inquiries or records the customer's information. The agent may also report query results to the customer via automatic voice playback.
<4> Classifier (Classifier): classification is a very important data mining method. The idea of classification is to learn a classification function or construct a classification model (commonly called a classifier) from existing data. The function or model maps records in a database to one of a given set of classes and can therefore be used for data prediction. In short, "classifier" is a general term for methods that classify samples in data mining, and includes algorithms such as decision trees, logistic regression, naive Bayes, and neural networks.
<5> Model training: model training is the process of finding model parameters from known data; the mapping eventually found that meets a given accuracy requirement is called the trained model.
<6> Finite state machine (FSM), also called finite state automaton (FSA): a mathematical model of computation representing a finite set of states together with behaviors such as transitions between those states and actions.
<7> Model convergence: model convergence generally means that the training and validation loss curves no longer fluctuate strongly and, as the number of training epochs keeps growing, any fluctuation stays within a given tolerance. The gradient updates flatten out during training, and training can be considered complete once the model has converged.
<8> Sliding window: the sliding window is an idea based on two pointers, with a window formed between the elements the two pointers point to. In this application, a sliding window is used to divide audio data: the time span of the data contained in a window is called the window length, and the time offset between consecutive windows is called the window shift.
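Applied to audio, window length and window shift can be sketched like this (a minimal illustration, not the patent's implementation):

```python
def sliding_windows(samples, window_len, window_shift):
    """Split an audio sample sequence into overlapping windows.
    window_len is the window length and window_shift is the hop
    between consecutive windows, as defined above."""
    windows = []
    start = 0
    while start + window_len <= len(samples):
        windows.append(samples[start:start + window_len])
        start += window_shift
    return windows
```

With a shift smaller than the length, consecutive windows overlap, so no speech onset falls entirely between two windows.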
<9> Convolutional neural network (Convolutional Neural Network, CNN): a convolutional neural network is a feedforward neural network whose artificial neurons respond to units within a local receptive field, giving it excellent performance on large-scale image processing. A convolutional neural network consists of one or more convolutional layers and a fully connected layer at the top (corresponding to a classical neural network), together with associated weights and pooling layers. This structure lets the network exploit the two-dimensional structure of the input data. Compared with other deep learning architectures, convolutional neural networks give better results in image and speech recognition. The model can be trained with the backpropagation algorithm and, compared with other deep feedforward networks, needs fewer parameters, which makes it an attractive deep learning architecture.
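As a minimal illustration of the operation such networks are built from, here is a plain-Python 1-D "valid" convolution (illustrative only; real CNNs use optimized framework implementations and learn the kernel weights):

```python
def conv1d_valid(signal, kernel):
    """1-D 'valid' convolution (strictly, cross-correlation, as used
    in CNN layers): slide the kernel over the signal and take the dot
    product at each position."""
    n = len(signal) - len(kernel) + 1
    return [sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(n)]
```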
The embodiments of the invention provide an answering machine detection and interaction method, a device implementing the method, a storage medium storing an executable program that implements the method, and an electronic terminal. An exemplary implementation scenario of the answering machine detection and interaction method is described below.
Fig. 1 is a schematic flow chart of an answering machine detection and interaction method according to an embodiment of the present invention. The method in this embodiment mainly includes the following steps:
step S11: and acquiring voice data to be detected.
Specifically, in the most common application scenario the voice data to be detected in step S11 is the voice transmitted by the answering party after an outbound call connects, but the present application is certainly not limited to the traditional telephony scenario. For example, the voice data to be detected may be voice transmitted by instant messaging software such as WeChat or DingTalk, or voice transmitted over a two-way radio. Beyond instant communication, the application also suits non-real-time scenarios, for example prerecorded, downloaded, or cached voice data. In short, the source, form, and application scenario of the voice data to be detected are not limited: any voice data that can be captured and perceived can be a detection object in this application, and this embodiment places no restriction on the type of the voice data to be detected.
In some implementations of the embodiments of the present application, before acquiring the voice data to be detected, the method further includes: acquiring the current on/off state of answering machine detection, and executing the step of acquiring the voice data to be detected when answering machine detection is on.
Specifically, in some special application scenarios the user wants the option to turn answering machine detection on or off to suit different call targets. In this implementation, therefore, the on/off state of answering machine detection may be acquired before the voice data to be detected is acquired. The on/off state can be set manually to meet the needs of different users in different scenarios, or triggered automatically by preset conditions: for example, when the called party is a restaurant or a hotel, answering machine detection is switched on automatically; when the called party is an individual, it stays off. The reason is that answering machine detection targets the situation where a machine recording plays first and a person may pick up afterwards, which occurs mostly with business entities such as restaurants or company front desks. For an individual, by contrast, the machine answer is typically voicemail, and the chance of a person picking up afterwards is zero or very small; from a cost perspective the call should simply be hung up once an answering machine is detected, and answering machine detection need not be enabled. Thus, to cover both cases, the detection setting can be read before acquiring the voice data to be detected; if it is on, the following steps S11-S14 are executed, i.e., the voice data to be detected is acquired.
In some implementations of the embodiments of the present application, when answering machine detection is off, the step of inputting the voice data to be detected into the first detection module is skipped, and the call is connected directly to a manual agent or the voice robot's initial fixed script is played.
As described above, to cover scenarios where answering machine detection may or may not be needed, the detection setting can be read before acquiring the voice data to be detected. If it is off, the current scenario does not require answering machine detection: there is no need to acquire the voice data to be detected or to judge whether it is human or real human voice, and the call can be connected directly to a manual agent or the voice robot's initial fixed script can be played. In this implementation, if the answering party is known to be an individual user, answering machine detection can be left off; once the call connects, it is routed directly to a manual agent or the initial fixed script is played to interact with the answering party.
Step S12: input the voice data to be detected into a first detection module, which detects whether the voice data to be detected is human voice data.
Specifically, the main purpose of the present application is to examine voice data to determine when it switches from an answering machine to a real person, so the key technical step is distinguishing whether the data to be detected comes from an answering machine or a real person. During an everyday call, however, the audio contains not only human voice but also background noise, music (e.g., hold music played while waiting), and the beeps common on telephone lines, all of which interfere with distinguishing an answering machine from a real human voice. Step S12 therefore removes the interference of non-human voice data: the voice data to be detected is input into the first detection module, which determines whether it is human or non-human voice data, eliminating that interference and improving detection efficiency and accuracy.
Fig. 2 is a schematic flowchart of a method by which the first detection module detects whether the voice data to be detected is human voice data in an embodiment of the invention. In the implementation process shown in Fig. 2, the first detection module is a trained classifier, and the method for detecting whether the voice data to be detected is human voice data includes:
Step S21: and inputting the voice data to be detected into the trained classifier.
Specifically, to determine whether the voice data to be detected is human voice data, in some implementation processes of the embodiments of the present application, the first detection module may be a trained classifier. A classifier is a classification model constructed from existing data; the model maps a data record to one of a set of given classes and can therefore be applied to data prediction. In the implementation processes possibly included in the embodiments of the present application, the classifier used in steps S21-S23 may be logistic regression (Logistic Regression), and may also be Naive Bayes, K-nearest neighbors (K-Nearest Neighbors), a decision tree (Decision Tree), a support vector machine (Support Vector Machines), or the like; the choice of classifier is not limited in the implementation processes of the present application.
Step S22: and outputting each classification result and the prediction probability of each classification result by the trained classifier.
Specifically, in step S22, after the voice data to be detected is input into the trained classifier for detection, the classifier, having been trained to convergence and therefore performing well at audio classification, outputs each classification result together with its prediction probability. In this implementation process of the embodiment of the present application, the number of classification results is not limited, but the results necessarily include a prediction probability for human voice data; the other classification results can be set as needed, for example background sound, music data, line noise, telephone beeps, silence, and the like.
Step S23: when the prediction probability of human voice data among the classification results is not smaller than a preset probability threshold, judging that the voice data to be detected is human voice data.
Specifically, in step S23, to make the classifier's detection result for the voice data to be detected more intuitive and easier to process, a probability threshold may be preset to measure the reliability of each classification result. When the prediction probability of human voice data output by the classifier is greater than or equal to the preset probability threshold, the voice data to be detected is, under the current classifier configuration, more likely to be human voice data, and is judged as such. In a preferred embodiment obtained after testing, the probability threshold may be 0.6; that is, when the probability of human voice data output by the classifier is greater than or equal to 0.6, the voice data to be detected is judged to be human voice data.
In some implementation processes of an embodiment of the present application, when the prediction probability of every classification result is smaller than the preset probability threshold, the judgment result for the voice data to be detected is output as uncertain, and a state machine for describing the change of the judgment result is set.
Specifically, the classifier of step S23 and the preset probability-threshold mechanism can classify the voice data to be detected, in particular separate human voice data from non-human-voice data, in the more common scenarios. In some rarer scenarios, however, the prediction probability of every classification result output by the classifier may be smaller than the preset probability threshold. For example, in a binary classification scenario, the prediction probability of the first classification result may be 0.55 and that of the second 0.45; both are smaller than the 0.6 probability threshold, so the classifier cannot output a corresponding prediction result. To solve this problem, the classifier outputs the classification result as "uncertain" in this situation, and a state machine describing the change of the judgment result is provided to represent how the classification result evolves over time.
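The threshold decision of steps S22-S23 together with the "uncertain" fallback can be sketched as follows; the 0.6 threshold is the preferred value from the text, while the class labels and function name are illustrative assumptions:

```python
# Sketch of the probability-threshold decision from steps S22-S23.
# The 0.6 threshold is the preferred value given in the text; the class
# labels are examples, since the application does not fix them.
PROB_THRESHOLD = 0.6

def decide(class_probs):
    """Map per-class prediction probabilities to a label, or 'uncertain'
    when no class reaches the preset probability threshold."""
    label, prob = max(class_probs.items(), key=lambda kv: kv[1])
    return label if prob >= PROB_THRESHOLD else "uncertain"
```

In the 0.55/0.45 example above, `decide` returns "uncertain", which is exactly the case handed over to the state machine.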
Fig. 3 is a schematic diagram of a state machine for describing the change of the judgment result in an embodiment of the present invention. In some implementations of an embodiment of the present application, the method for setting a state machine characterizing the change of the judgment result includes:
Step S31: setting the initial state of the state machine to uncertain;
step S32: if the current judgment result is human voice data, setting the current state of the state machine to speaking; if the current judgment result is non-human-voice data, setting the current state of the state machine to non-human voice; if the current judgment result is uncertain, setting the current state of the state machine to uncertain;
step S33: if the current judgment result is non-human-voice data and the previous judgment result was human voice data, maintaining the current state of the state machine as speaking.
Specifically, the function of the state machine is to characterize how the output of the classifier changes over time and to indicate, based on several trigger conditions, the judgment result currently output by the classifier. First, as described in step S23, when the prediction probability of every classification result is smaller than the preset probability threshold, the initial state of the state machine is set to uncertain. From an uncertain previous state, the next judgment result of the classifier is obtained and the current state of the state machine is updated according to its type: if the next judgment result is human voice data, the current state becomes speaking; if it is non-human-voice data, the current state becomes non-human voice; if it is uncertain, the current state remains uncertain. It should be noted that in one more specific state-change scenario, if the previous judgment result of the classifier was human voice data and the next judgment result is non-human-voice data, the current state of the state machine is still maintained as speaking. The reason is that, whether an answering machine is playing a fixed script or a real person is interacting, some non-human-voice data is inevitably interspersed within the human voice data; therefore, even if the classifier outputs non-human-voice data while in the speaking state, the state machine remains in the speaking state so that the complete human voice data is captured.
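One possible reading of steps S31-S33 can be sketched as a small state machine; the state and result names are our own, and the handling of an "uncertain" result follows the literal wording of step S32:

```python
class AnswerStateMachine:
    """State machine of steps S31-S33: starts 'uncertain'; once 'speaking'
    is entered, a non-human-voice result does not leave it (step S33)."""

    def __init__(self):
        self.state = "uncertain"             # S31: initial state

    def update(self, result):
        if result == "human_voice":
            self.state = "speaking"          # S32: human voice -> speaking
        elif result == "non_human":
            if self.state != "speaking":     # S33: 'speaking' is sticky
                self.state = "non_human"     #      against non-human audio
        else:
            self.state = "uncertain"         # S32: uncertain -> uncertain
        return self.state
```

The sticky speaking state is what lets brief beeps or silence inside a reply be captured as part of one continuous stretch of human voice.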
Fig. 4 is a schematic flow chart of a method for obtaining the trained classifier according to an embodiment of the present invention.
In this implementation of the present application as shown in fig. 4, the trained classifier is obtained based on the following steps:
step S41: preparing a training set, where the training set used to train the classifier includes human voice data and non-human-voice data, and the non-human-voice data includes noise data, music data, and call prompt tone data.
Specifically, the purpose of step S41 is to prepare a training set, that is, the data set used to train a classifier in supervised learning; the selection, size, and labeling of the training set are often directly related to the training effect. Therefore, to give the classifier in the embodiment of the present application a relatively good training effect, the training set used here includes human voice data, noise data, music data, and call prompt tone data. Among these, the human voice data consists of voice segments extracted from real call audio using automatic speech recognition (ASR, Automatic Speech Recognition) models for multiple languages. The noise data is derived from public noise data sets and from segments of real call audio that the automatic speech recognition model cannot recognize; to further improve the robustness of the model, the noise data also includes silent audio. The music data originates from open-source light music, which must not contain human voice. The call prompt tone data comes from the beeps in real telephone call audio; however, because beeps account for only a small portion of real audio, a large number of autonomously synthesized beeps can be added to expand the machine sounds in the training set.
Step S42: constructing a classifier for audio classification, and inputting the training set into the classifier for audio classification to train until the classifier converges, so as to obtain a trained classifier.
Specifically, the purpose of step S42 is to train the original classifier to convergence on the data set. As mentioned above, the classifier used for audio classification in step S42 may be a random forest classifier or a naive Bayes classifier, or any classifier capable of multi-class tasks, such as a convolutional neural network. After the type of classifier is determined, the labeled training set is input into the classifier and a training objective is set to start training; when the classification performance of the trained classifier reaches the preset requirement, the classifier is considered to have been trained to convergence. At this point, the parameters of the classifier under this performance are saved as the trained classifier for performing subsequent prediction tasks.
In some implementations of step S42 of an embodiment of the present application, when the training set is input into the classifier for audio classification, fixed-length audio segments from the training set are randomly extracted and input into the classifier.
Specifically, to obtain a better training effect during model training, in this implementation process of the embodiment of the present application, when the training set is input into the audio classifier, fixed-length audio segments from the training set may be randomly extracted and fed to the classifier to complete its training. For example, the audio data in the training set may be divided in advance into pieces of fixed length, and these pieces may be input into the classifier in random order until it converges. Because all audio used for training then has the same length, training efficiency improves and the classifier reaches convergence more quickly. It should be noted that the embodiment of the present application does not limit the length of the input audio during training; in a preferred training scheme, however, the audio length may be 2 s, which achieves the best training effect. The reason is that 2 s typically covers one complete sound unit in daily speech, so in some tested embodiments the classifier's prediction performance is best when the training audio length is 2 s.
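The fixed-length extraction described above can be sketched as follows; the 16 kHz sample rate is an assumption (the text specifies only the 2 s segment length), and padding short clips with silence is likewise our choice:

```python
import random

SAMPLE_RATE = 16_000      # assumed sample rate; not specified in the text
SEGMENT_SECONDS = 2.0     # preferred segment length from the text

def random_fixed_segment(samples, seconds=SEGMENT_SECONDS, rate=SAMPLE_RATE):
    """Randomly cut one fixed-length training segment out of an audio clip,
    padding clips shorter than the segment length with silence."""
    seg_len = int(seconds * rate)
    if len(samples) <= seg_len:
        return list(samples) + [0] * (seg_len - len(samples))
    start = random.randrange(len(samples) - seg_len + 1)
    return list(samples[start:start + seg_len])
```

Because every returned segment has the same length, batches can be stacked directly without per-example padding logic in the training loop.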
Fig. 5 shows a schematic diagram of a sliding window for acquiring the voice data to be detected in an embodiment of the present invention. In some implementation processes of steps S11 to S12 in an embodiment of the present application, the voice data to be detected is obtained in a sliding-window manner and input into the first detection module.
Specifically, once training of the classifier is complete, a converged, trained classifier is obtained, and its application follows. To obtain a better prediction effect in practical application, a sliding-window approach can be adopted when the voice data to be detected is acquired and input into the classifier. A sliding window has a start node and an end node, and the data between them is the data processed by the current window. The parameters of the sliding window also include the window length and the window shift. The window length is the distance between the start and end nodes of one window, and is the same for every window; the window shift is the distance between one window and the next, and is likewise constant within one detection. The window shift determines the sampling frequency: the larger the shift, the lower the sampling frequency, and the smaller the shift, the higher the sampling frequency. It should be noted that the window length and window shift directly affect the prediction effect of the classifier; the embodiment of the present application does not limit these parameters, so users can apply different window lengths and shifts in different scenarios. In a preferred embodiment, however, the window length and window shift may be 0.5 s and 0.25 s, respectively, which optimizes the prediction effect of the classifier in the examples of the present application.
The reason is that a 0.5 s window can generally cover the shortest sounding unit (such as a "beep" during a call) in all application scenarios of the present application, while a 0.25 s shift raises the sampling frequency with little impact on prediction efficiency, so prediction accuracy and efficiency can both be served.
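The sliding-window sampling described above (0.5 s window, 0.25 s shift) can be sketched as follows; the 16 kHz sample rate is assumed, and only windows that fit entirely inside the audio are emitted:

```python
def sliding_windows(samples, rate=16_000, window_s=0.5, shift_s=0.25):
    """Yield successive fixed-length windows over the audio; window_s is the
    window length and shift_s the window shift (sampling interval)."""
    window, shift = int(window_s * rate), int(shift_s * rate)
    start = 0
    while start + window <= len(samples):
        yield samples[start:start + window]
        start += shift
```

Each yielded window would be passed to the trained classifier, and the per-window judgment results drive the state machine described earlier.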
In some implementations of an embodiment of the present application, the classifier is a convolutional neural network.
In some implementations of embodiments of the present application, the classifier used may be a convolutional neural network, for better training and prediction. The convolutional neural network is one of the most representative neural networks in the field of deep learning and has produced many breakthroughs in speech analysis and processing.
Step S13: when the first detection module detects that the voice data to be detected is human voice data, extracting the human voice data and inputting it into a second detection module, where the second detection module is used for detecting whether the human voice data is real human voice data.
In particular, the main purpose of the present application is to detect voice data in order to determine when the audio switches from an answering machine to a real person, so the key technical node is distinguishing whether the data to be detected comes from an answering machine or a real person. As mentioned above, step S12 eliminates the interference of non-human-voice data: the voice data to be detected is input into the first detection module, which judges whether it is human voice data or non-human-voice data, thereby removing that interference and improving detection efficiency and accuracy. After this judgment is complete, since human voice data includes not only real human voice data but also non-real human voice data, the purpose of step S13 is to judge whether the human voice data detected by the first detection module is real human voice data or non-real human voice data (in the present application, non-real human voice data is the robot voice played by an automatic answering machine). When the first detection module detects that the voice data to be detected is human voice data, the detected human voice data is extracted and input into the second detection module, which completes the judgment of whether it is real or non-real human voice data.
Fig. 6 is a schematic flowchart of a method by which the second detection module detects whether the human voice data is real human voice data in an embodiment of the invention. In some implementations of step S13 of an embodiment of the present application, the method for detecting whether the human voice data is real human voice data by the second detection module includes:
step S61: determining whether the human voice data satisfies one or more of the following conditions: the number of words in the voice data exceeds a word-count threshold; the speech rate of the voice data exceeds a speech-rate threshold; the voice data includes tag words; the voice data includes consecutive digits.
Specifically, on the basis that the first detection module has already determined that the voice data to be detected is human voice data, the purpose of step S61 is to detect whether the human voice data is real human voice data, so that further interaction can proceed according to the result. In some implementations of step S61 of the embodiments of the present application, this is judged by determining whether the voice data satisfies one or more of the following conditions. Condition one: the number of words in the voice data exceeds a word-count threshold. In general, an automatic answering machine packs more words into the same time period than a real-person dialogue does, so setting a word-count threshold and checking whether the word count in the voice data exceeds it can, to a certain extent, distinguish real from non-real human voice data. It should be noted that the word-count threshold mentioned in condition one may be set by the user to adapt to prediction requirements in different scenarios; for example, since the word counts of speech differ across languages, a larger word-count threshold may be set for English-speaking regions and a smaller one for Chinese-speaking regions. Condition two: the speech rate of the voice data exceeds a speech-rate threshold. In general, the speech rate of an automatic answering machine is higher than that of a real-person dialogue, so setting a speech-rate threshold and checking whether the speech rate of the voice data exceeds it can, to a certain extent, distinguish real from non-real human voice data.
Like the word-count threshold of condition one, the speech-rate threshold mentioned in condition two can be set by the user to adapt to prediction requirements in different scenarios. Condition three: the voice data includes tag words. A tag word is a word sufficient to identify whether the voice data is real human voice data, for example, "later", "sorry", or "dumb" in a Chinese context, and "voicemail", "press the button", or "sorry" in an English context; any word sufficient to identify whether the voice data is real human voice data falls within the scope of the tag words of condition three. Condition four: the voice data includes consecutive digits. In general, compared with a real-person dialogue, the speech played by an automatic answering machine often contains runs of consecutive digits, especially telephone numbers, so checking whether the voice data contains obviously consecutive digits can help judge whether it is real human voice data.
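The four cues of step S61 can be sketched as a rule check over an ASR transcript; all thresholds and tag words below are illustrative assumptions, since the text deliberately leaves them user-configurable:

```python
import re

def matches_machine_cues(transcript, duration_s,
                         word_threshold=30,        # assumed values; the
                         rate_threshold=4.0,       # application leaves both
                         tag_words=("voicemail", "press the button", "sorry")):
    """Return True if the transcript matches any of the four cues of step S61
    that suggest an answering machine rather than a real person."""
    text = transcript.lower()
    words = text.split()
    if len(words) > word_threshold:                     # condition 1: word count
        return True
    if duration_s > 0 and len(words) / duration_s > rate_threshold:
        return True                                     # condition 2: speech rate
    if any(tag in text for tag in tag_words):           # condition 3: tag words
        return True
    if re.search(r"(?:\d[\s-]?){7,}", text):            # condition 4: digit run
        return True
    return False
```

Per step S62, only when this returns False for all conditions would the second detection module judge the audio to be real human voice.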
Step S62: when the human voice data meets none of the above conditions, the second detection module judges the human voice data to be real human voice data.
Specifically, the purpose of step S62 is to combine the four condition checks of step S61 into an overall judgment of whether the human voice data is real human voice data. In this implementation process of the embodiment of the present application, when the human voice data input into the second detection module satisfies none of the four conditions, it is highly likely to be real human voice data, and the second detection module outputs that detection result.
Step S14: when the second detection module detects that the human voice data is real human voice data, connecting the call to a manual agent or playing the initialized fixed script of the voice robot.
Specifically, step S14 solves the technical problem of how to interact with the call answering party once the voice data has been judged to be real human voice data. In this implementation process of the embodiment of the present application, when the second detection module detects real human voice data, the call answering party has switched from the automatic answering machine to a real person. To interact with the call answering party in time and prevent them from hanging up prematurely, the call can therefore be connected to a manual agent, who carries out the interaction. Alternatively, the initialized fixed script of the voice robot can be played, so that the call answering party can perform the corresponding operations after hearing it, improving interaction efficiency.
In some implementation processes of step S14 in an embodiment of the present application, when the second detection module has still not detected real human voice data after a preset detection threshold time, the call is hung up and the call record is marked.
Specifically, to prevent the waste of communication resources and further improve the efficiency of answering-machine detection, in some implementation processes of this embodiment of the present application, a detection threshold time for measuring timeout may be preset for the second detection module. When the second detection module has still not detected real human voice data after this threshold time, its detection has timed out, so the call should be hung up promptly and the call record marked to facilitate follow-up calls or other statistical operations. Because the probability of a real person subsequently answering is low once detection has timed out, hanging up promptly avoids invalid interaction, reduces the waste of communication resources, and improves answering-machine detection efficiency.
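The timeout handling can be sketched as a simple decision function; the 30 s value is an illustrative assumption, since the text only says the detection threshold time is preset:

```python
DETECT_TIMEOUT_S = 30.0  # assumed preset detection threshold time

def call_action(elapsed_s, real_voice_detected):
    """Decide what to do with the call: connect once real human voice is
    detected, hang up and mark the record once detection times out."""
    if real_voice_detected:
        return "connect_agent_or_play_script"
    if elapsed_s >= DETECT_TIMEOUT_S:
        return "hang_up_and_mark_record"
    return "keep_detecting"
```

This function would be evaluated after each detection cycle of the second detection module, so a timed-out call is released as soon as the threshold is crossed.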
In some implementations of step S14 in an embodiment of the present application, a greeting is played repeatedly before the call is connected to the manual agent.
Specifically, when the second detection module detects that the human voice data is real human voice data and the call is connected to a manual agent, it cannot be guaranteed that the human customer service staff will begin interacting with the call answering party immediately; if the call answering party receives no effective interaction during this gap, they may hang up, causing the interaction to fail. Therefore, in some implementation processes of step S14 in an embodiment of the present application, before the call is connected to the manual agent, greetings commonly used in calls, such as "hello", can be played repeatedly, so that the call answering party perceives a natural conversation and does not hang up too soon, facilitating interaction with the human agent once the manual agent is connected.
In some implementation processes of the embodiments of the present application, when the first detection module detects that the voice data to be detected is non-human-voice data, or when the second detection module detects that the human voice data is non-real human voice data, the fixed script of the voice robot is played.
Specifically, when the first detection module detects that the voice data to be detected is non-human-voice data, or when the second detection module detects that the human voice data is non-real human voice data, the probability that the call answering party is a real person is very low. In this case, the fixed script of the voice robot can be played in a loop until real human voice data is detected, after which the call is connected to a manual agent or the initialized fixed script of the voice robot is played.
Referring to fig. 7, which shows an optional hardware structure diagram of a transponder detection and interaction terminal 700 according to an embodiment of the present invention, the terminal 700 may be a mobile phone, a computer device, a tablet device, a personal digital processing device, a factory background processing device, or the like. The transponder detection and interaction terminal 700 includes: at least one processor 701, a memory 702, at least one network interface 704, and a user interface 706. The various components in the device are coupled together by a bus system 705. It is to be appreciated that the bus system 705 is employed to facilitate connection and communication between these components. In addition to the data bus, the bus system 705 includes a power bus, a control bus, and a status signal bus; for clarity of illustration, however, the various buses are all labeled as the bus system in fig. 7.
The user interface 706 may include, among other things, a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad, or touch screen, etc.
It is to be appreciated that the memory 702 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM) or a programmable read-only memory (PROM, Programmable Read-Only Memory). The volatile memory may be a random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM, Static Random Access Memory) and synchronous static random access memory (SSRAM, Synchronous Static Random Access Memory). The memory described in embodiments of the present invention is intended to comprise, without being limited to, these and any other suitable types of memory.
The memory 702 in embodiments of the present invention is used to store various types of data to support operation of the transponder detection and interaction terminal 700. Examples of such data include: any executable programs for operating on the transponder detection and interaction terminal 700, such as the operating system 7021 and application programs 7022; the operating system 7021 contains various system programs, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks. The application programs 7022 may include various application programs such as a Media Player (Media Player), a Browser (Browser), and the like for implementing various application services. The transponder detection and interaction method provided by embodiments of the invention may be embodied in the application 7022.
The method disclosed in the above embodiment of the present invention may be applied to the processor 701 or implemented by the processor 701. The processor 701 may be an integrated circuit chip having signal processing capabilities. During implementation, the steps of the above method may be performed by integrated hardware logic circuits in the processor 701 or by instructions in the form of software. The processor 701 may be a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 701 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention. The general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the transponder detection and interaction method provided by the embodiment of the invention may be directly embodied as being completed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium; the processor reads information from the memory and completes the steps of the method in combination with its hardware.
In an exemplary embodiment, the transponder detection and interaction terminal 700 may be implemented by one or more application specific integrated circuits (ASIC, application Specific Integrated Circuit), DSPs, programmable logic devices (PLDs, programmable Logic Device), complex programmable logic devices (CPLDs, complex Programmable Logic Device) for performing the aforementioned methods.
Fig. 8 is a schematic structural diagram of a transponder detection and interaction device according to an embodiment of the present invention. In this embodiment, the transponder detection and interaction device 800 includes:
voice data acquisition module 801: the method is used for acquiring voice data to be detected.
The voice detection module 802: used for inputting the voice data to be detected into a first detection module, where the first detection module is used for detecting whether the voice data to be detected is human voice data.
The real person detection module 803: used for, when the first detection module detects that the voice data to be detected is human voice data, extracting the human voice data and inputting it into a second detection module, where the second detection module is used for detecting whether the human voice data is real human voice data.
Interaction module 804: used for, when the second detection module detects that the human voice data is real human voice data, connecting the call to a manual agent or playing the initialized fixed script of the voice robot.
In some embodiments, the voice detection module and the true human detection module may be combined into one transponder detection module.
It should be noted that: the transponder detection and interaction device provided in the above embodiment is only exemplified by the division of the above program modules when carrying out transponder detection and interaction, and in practical application, the above processing allocation may be completed by different program modules according to needs, i.e. the internal structure of the device is divided into different program modules to complete all or part of the above processing. In addition, the transponder detection and interaction device and the transponder detection and interaction method provided in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments, which are not repeated herein.
Those of ordinary skill in the art will appreciate that all or part of the steps implementing the above method embodiments may be completed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical discs.
In the embodiments provided herein, the computer-readable storage medium may include read-only memory, random-access memory, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, a USB flash drive, a removable hard disk, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. In addition, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber-optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber-optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
In summary, the present application provides an answering-machine detection and interaction method, device, terminal, and medium that improve the efficiency of answering-machine detection and interaction, solving the problem that prior-art methods cannot accurately identify the node at which an automatic answering machine hands over to a real person and interact accordingly based on the identification result. The method thus effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments merely illustrate the principles and effects of the present application and are not intended to limit it. Any person skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the present application. Accordingly, all equivalent modifications and variations accomplished by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of this application.

Claims (17)

1. An answering-machine detection and interaction method, comprising:
acquiring voice data to be detected;
inputting the voice data to be detected into a first detection module, wherein the first detection module is used for detecting whether the voice data to be detected is human voice data;
when the first detection module detects that the voice data to be detected is human voice data, extracting the human voice data and inputting it into a second detection module, wherein the second detection module is used for detecting whether the human voice data is real human voice data;
when the second detection module detects that the human voice data is real human voice data, connecting the call to a human agent or playing the voice robot's initialized script.
2. The answering-machine detection and interaction method according to claim 1, wherein the first detection module is a trained classifier, and detecting whether the voice data to be detected is human voice data comprises:
inputting the voice data to be detected into the trained classifier;
outputting, by the trained classifier, each classification result and the prediction probability of each classification result;
when the prediction probability of the human-voice class among the classification results is not smaller than a preset probability threshold, determining that the voice data to be detected is human voice data.
3. The answering-machine detection and interaction method according to claim 2, wherein, when the prediction probability of every classification result is smaller than the preset probability threshold, the determination result for the voice data to be detected is output as uncertain, and a state machine is set to characterize changes in the determination result.
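The thresholded decision of claims 2 and 3 can be sketched as a small function; the class names and the 0.5 default threshold are illustrative assumptions, not values taken from the patent:

```python
def judge(probs, threshold=0.5):
    """probs: dict mapping class name -> predicted probability.

    Returns 'human_voice' when the human-voice class clears the
    threshold, 'non_human_voice' when another class does, and
    'uncertain' when no class probability reaches the threshold
    (the case handled by claim 3)."""
    best_class = max(probs, key=probs.get)
    if probs[best_class] < threshold:
        return "uncertain"
    return "human_voice" if best_class == "human_voice" else "non_human_voice"
```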
4. The answering-machine detection and interaction method according to claim 3, wherein setting the state machine for characterizing changes in the determination result comprises:
setting the initial state of the state machine to uncertain;
if the current determination result is human voice data, setting the current state of the state machine to speaking; if the current determination result is non-human-voice data, setting the current state of the state machine to non-human-voice; if the current determination result is uncertain, setting the current state of the state machine to uncertain;
if the current determination result is non-human-voice data and the previous determination result was human voice data, keeping the current state of the state machine as speaking.
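The update rules of claim 4 amount to a small transition function. In this sketch the state and label names are paraphrased from the claim, and "the previous determination result was human voice" is approximated by the previous state being speaking:

```python
INITIAL_STATE = "uncertain"  # initial state per claim 4

def next_state(prev_state, result):
    """result is one of 'human_voice', 'non_human_voice', 'uncertain'.

    A non-human-voice result arriving right after speech keeps the
    state at 'speaking' (last rule of the claim); otherwise the state
    simply follows the current determination result."""
    if result == "human_voice":
        return "speaking"
    if result == "non_human_voice":
        return "speaking" if prev_state == "speaking" else "non_human_voice"
    return "uncertain"
```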
5. The answering-machine detection and interaction method according to claim 2, wherein the trained classifier is obtained through the following steps:
acquiring a training set for training the classifier, wherein the training set comprises human voice data and non-human-voice data, and the non-human-voice data comprises noise data, music data, and call-prompt voice data;
constructing a classifier for audio classification, and inputting the training set into the classifier for training until the classifier converges, to obtain the trained classifier.
6. The answering-machine detection and interaction method according to claim 5, wherein inputting the training set into the classifier for audio classification comprises randomly extracting audio segments of a fixed length from the training set and inputting them into the classifier.
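The fixed-length random crop of claim 6 might look like the following sketch; the 16000-sample default (one second at 16 kHz) and the zero-padding for short clips are illustrative assumptions:

```python
import random

def random_fixed_crop(samples, crop_len=16000):
    """Randomly extract a fixed-length segment from an audio clip
    (claim 6). Clips shorter than crop_len are zero-padded so every
    training example fed to the classifier has the same length."""
    if len(samples) <= crop_len:
        return samples + [0] * (crop_len - len(samples))
    start = random.randrange(len(samples) - crop_len + 1)
    return samples[start:start + crop_len]
```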
7. The answering-machine detection and interaction method according to any one of claims 1-6, wherein the voice data to be detected is acquired in a sliding-window manner and input into the first detection module.
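The sliding-window acquisition of claim 7 can be sketched as a generator; the window and hop sizes are illustrative assumptions, as the patent does not fix them at this level:

```python
def sliding_windows(samples, window=16000, hop=8000):
    """Yield successive fixed-size windows over the incoming audio
    (claim 7); each window would be passed to the first detection
    module in turn."""
    for start in range(0, max(len(samples) - window, 0) + 1, hop):
        yield samples[start:start + window]
```

With a hop of half the window size, consecutive windows overlap by 50%, which lets the detector react to speech that begins mid-window.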
8. The answering-machine detection and interaction method according to any one of claims 2-6, wherein the classifier is a convolutional neural network.
9. The answering-machine detection and interaction method according to claim 1, wherein detecting, by the second detection module, whether the human voice data is real human voice data comprises:
determining whether the human voice data satisfies one or more of the following conditions: the number of words in the human voice data exceeds a word-count threshold; the speech rate of the human voice data exceeds a speech-rate threshold; the human voice data contains tag words; the human voice data contains numbers with consistency;
when the human voice data satisfies none of the above conditions, determining, by the second detection module, that the human voice data is real human voice data.
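The rule-based check of claim 9 can be sketched as follows. The threshold values are illustrative, `tag_words` stands in for the patent's tag-word list (which is not enumerated here), and a simple digit test stands in for the "numbers with consistency" condition:

```python
def is_real_person(text, duration_s, tag_words,
                   word_threshold=30, wps_threshold=5.0):
    """Second detection module of claim 9: the utterance is judged
    to be a live person only when none of the answering-machine
    cues fire."""
    words = text.split()
    too_many_words = len(words) > word_threshold
    too_fast = duration_s > 0 and len(words) / duration_s > wps_threshold
    has_tag_word = any(t in text for t in tag_words)
    has_digits = any(w.isdigit() for w in words)  # stand-in for "numbers with consistency"
    return not (too_many_words or too_fast or has_tag_word or has_digits)
```

The intuition is that answering-machine greetings tend to be long, fast, formulaic (tag words such as a voicemail prompt), and to read out numbers, whereas a live person picking up typically produces a short, unscripted utterance.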
10. The answering-machine detection and interaction method according to claim 1, wherein, before acquiring the voice data to be detected, the method further comprises: acquiring the current on/off state of answering-machine detection, and performing the step of acquiring the voice data to be detected when answering-machine detection is on.
11. The answering-machine detection and interaction method according to claim 10, wherein, when answering-machine detection is off, the step of acquiring the voice data to be detected is skipped, and the call is directly connected to a human agent or the voice robot's initialized script is played.
12. The answering-machine detection and interaction method according to claim 1, wherein, when the first detection module detects that the voice data to be detected is non-human-voice data, or when the second detection module detects that the human voice data is not real human voice data, the voice robot's script is played.
13. The answering-machine detection and interaction method according to claim 1, wherein, when the second detection module has not detected real human voice data after a preset detection-time threshold, the call is hung up and the call record is marked.
14. The answering-machine detection and interaction method according to claim 1, wherein the greeting is played repeatedly before the call is connected to the human agent.
15. An answering-machine detection and interaction device, comprising:
a voice data acquisition module: configured to acquire voice data to be detected;
a human-voice detection module: configured to input the voice data to be detected into a first detection module, wherein the first detection module is used for detecting whether the voice data to be detected is human voice data;
a real-person detection module: configured to, when the first detection module detects that the voice data to be detected is human voice data, extract the human voice data and input it into a second detection module, wherein the second detection module is used for detecting whether the human voice data is real human voice data;
an interaction module: configured to, when the second detection module detects that the human voice data is real human voice data, connect the call to a human agent or play the voice robot's initialized script.
16. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 14.
17. An electronic terminal, comprising: a processor and a memory;
wherein the memory is configured to store a computer program;
and the processor is configured to execute the computer program stored in the memory, to cause the terminal to perform the method according to any one of claims 1 to 14.
CN202310126797.9A 2023-02-16 2023-02-16 Transponder detection and interaction method, device, terminal and storage medium Pending CN117834781A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310126797.9A CN117834781A (en) 2023-02-16 2023-02-16 Transponder detection and interaction method, device, terminal and storage medium


Publications (1)

Publication Number Publication Date
CN117834781A true CN117834781A (en) 2024-04-05

Family

ID=90516230


Country Status (1)

Country Link
CN (1) CN117834781A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination