CN112202974B - Method, device and system for automatically judging telephone answering state

Method, device and system for automatically judging telephone answering state

Info

Publication number
CN112202974B
CN112202974B (application CN202011391529.2A)
Authority
CN
China
Prior art keywords
audio
state
voice
byte stream
set length
Prior art date
Legal status
Active
Application number
CN202011391529.2A
Other languages
Chinese (zh)
Other versions
CN112202974A
Inventor
王刚
曾文佳
冯梦盈
Current Assignee
Anhui Xinchen Communication Technology Co.,Ltd.
Original Assignee
Anhui Xinchen Communication Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Anhui Xinchen Communication Technology Co., Ltd.
Priority to CN202011391529.2A
Publication of CN112202974A
Application granted
Publication of CN112202974B
Status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 3/00 Automatic or semi-automatic exchanges
    • H04M 3/22 Arrangements for supervision, monitoring or testing
    • H04M 3/2218 Call detail recording
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 3/00 Automatic or semi-automatic exchanges
    • H04M 3/22 Arrangements for supervision, monitoring or testing
    • H04M 3/2272 Subscriber line supervision circuits, e.g. call detection circuits

Abstract

The embodiments of the present application provide a method, a device, and a system for automatically judging a telephone answering state. The method includes: acquiring voice audio features corresponding to an audio byte stream of a set length, wherein the audio byte stream is acquired from an audio source, the audio source includes the sound signal fed back by a dialed telephone and received at the dialing end, and the voice audio features are represented as an array; obtaining and storing an audio state corresponding to the audio byte stream of the set length according to a deep learning algorithm and the voice audio features, wherein the audio state belongs to one of the audio categories that can be heard while waiting for the call to be connected; and determining a connected state of the dialed call based on at least one such audio state. In some embodiments of the present application, the connected-state judgment for each call can be handed over entirely to a machine, so that a large number of outbound-call services can be handled simultaneously.

Description

Method, device and system for automatically judging telephone answering state
Technical Field
The present application relates to the field of telephone communication, and in particular, to a method, an apparatus, and a system for automatically determining a telephone answering state.
Background
In current telemarketing and debt-collection services, the first step of operations that begin with outbound calling is to gather the telephone numbers of a large number of audience targets. These numbers serve as candidate service objects; each number is then dialed individually, and follow-up service is provided to the customers who answer. In this process, effective service objects are selected one by one from a large pool of telephone numbers.
Selecting true audience members from a large pool of candidate numbers requires service personnel to perform a great deal of tedious, simple, and repetitive dialing. Continuously dialing candidate numbers also wears on the personnel's state of mind, so that when a call is finally and effectively connected, the agent may no longer be in a condition to serve the customer, causing unrecoverable loss and degrading the whole service.
Therefore, how to automatically screen out dialed calls answered by a real person has become a technical problem urgently awaiting a solution.
Disclosure of Invention
Some embodiments of the present application obtain an online, real-time call state (i.e., an audio state), construct judgment logic modeled on how a human decides that a call has been answered, and finally identify, from a plurality of audio states and that judgment logic, whether the call has been picked up and whether the answering party is a real person, an intelligent voice assistant, or a voicemail box. For scenarios with a large number of telephone numbers as candidate service objects, the embodiments of the present application can quickly and autonomously judge the connected state of each call so that downstream business can begin.
In a first aspect, some embodiments of the present application provide a method for automatically determining a phone answering state, the method including: acquiring voice audio features corresponding to an audio byte stream of a set length, wherein the audio byte stream of the set length is acquired from an audio source, the audio source includes a sound signal fed back by a dialed telephone and received at the dialing end, and the voice audio features are represented as an array; obtaining and storing an audio state corresponding to the audio byte stream of the set length according to a deep learning model and the voice audio features, wherein the audio state belongs to one of the audio categories that can be heard while waiting for the call to be connected; and determining a connected state of the dialed call based on at least one of the audio states.
In some embodiments of the present application, the connected-state judgment for each call can be handed over entirely to a machine, so that a large number of outbound-call services can be handled simultaneously.
In some embodiments, said determining a connected state of said placed call based on at least one of said audio states comprises: judging that the dialed call is in a real person connection state according to at least one audio state; the method further comprises the following steps: and carrying out transfer processing on the outgoing call corresponding to the dialed call.
Some embodiments of the application further trigger corresponding subsequent services by judging the connection state, so that the subsequent services are conveniently developed.
In some embodiments, after said determining the on state of said dialed call based on at least one of said audio states, said method further comprises: providing the on state to a client.
Some embodiments of the present application further feed back the connection state determined by the server to the client, so that the client performs the next operation according to the received connection state.
In some embodiments, the audio byte stream of the set length is sampled from the audio source by setting a sliding window and traversing with a step size.
Some embodiments of the present application obtain an audio byte stream of a set length at least so that the subsequent deep learning network can process the data fast enough to meet real-time requirements.
In some embodiments, the obtaining of voice audio features corresponding to the audio byte stream of the set length includes: performing feature extraction on the audio byte stream of the set length through Mel-frequency cepstral coefficients (MFCC) or a log-mel spectrogram algorithm to obtain the voice audio features.
Some embodiments of the present application obtain voice audio features represented as arrays so that a deep learning network can then classify the audio state of the audio byte stream.
In some embodiments, the deep learning model is implemented by a convolutional recurrent neural network CGRU.
Some embodiments of the present application make full use of the CNN's high-dimensional feature-extraction capability within the CGRU to extract higher-dimensional features from the input voice audio features, and then apply the temporal processing capability of the RNN (GRU) to process along the time direction, so as to meet the real-time requirement.
In some embodiments, the audio states include a current audio state and a historical audio state, and the audio categories include: silence, a beep, a polyphonic ringtone, a system prompt tone, the voice of a person speaking after connection, or a hang-up tone; and the determining of the connected state of the dialed call based on at least one of the audio states comprises: obtaining the connected state according to the logical relation between the current audio state corresponding to the current audio segment and the historical audio states corresponding to historical audio segments, wherein the audio segments correspond one-to-one to the audio byte streams of the set length.
Some embodiments of the present application logically combine the deep-learning classification result of the current audio byte stream with the classification results of audio byte streams acquired at historical moments to finally determine the connected state, improving the accuracy of the judged state.
In some embodiments, the audio states include a current audio state and a historical audio state, and the method comprises, prior to determining the connected state of the dialed call based on at least one of the audio states: acquiring at least one historical audio state, wherein each historical audio state corresponds to an audio byte stream of the set length obtained at a past moment; and the determining of the connected state of the dialed call based on at least one of the audio states comprises: judging the state of the dialed call through the combination of at least one historical audio state and the current audio state, wherein the state of the dialed call comprises one of connected, hung up, and continuing to wait.
Some embodiments of the application judge the connected state comprehensively by combining earlier and later audio states, which can improve the accuracy of the judged state.
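As an illustration of this kind of combinational judgment over historical and current audio states, the following sketch encodes a few plausible rules. The state names, the rule set, and the function itself are illustrative assumptions for exposition, not the patent's exact logic.

```python
# Hypothetical sketch of combining per-segment audio states into a call state.
# State names and rules are illustrative assumptions.

RINGBACK = {"silence", "beep", "polyphonic_ringtone", "system_prompt"}

def judge_call(history, current):
    """Combine historical audio states with the current one.

    history: list of past per-segment audio states (oldest first)
    current: audio state of the most recent segment
    Returns one of: "connected", "hung_up", "keep_waiting".
    """
    if current == "human_speech":
        return "connected"        # a person (or assistant) is talking now
    if current == "hangup_tone":
        return "hung_up"
    # Only ringback-type tones so far: the call is still ringing.
    if current in RINGBACK and all(s in RINGBACK for s in history):
        return "keep_waiting"
    # Speech seen earlier, followed by quiet: treat as already connected.
    if "human_speech" in history:
        return "connected"
    return "keep_waiting"
```

A real implementation would likely use more states and tuned rules; the point is that the final call state is a logical function of the whole sequence of per-segment classifications, not of any single one.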
In a second aspect, some embodiments of the present application provide a method for automatically determining a phone answering state, applied to a client, the method including: acquiring an audio byte stream of a set length from an audio source, wherein the audio source is the sound signal fed back by a dialed call and received at the dialing end; performing feature extraction on the audio byte stream to obtain voice audio features represented as an array; transmitting the voice audio features; acquiring an audio state corresponding to the voice audio features, wherein the audio state is obtained by classifying the voice audio features through a deep learning model and belongs to one of the audio categories that can be heard while waiting for the call to be connected; and determining a connected state of the dialed call based on at least one of the audio states.
In some embodiments, said determining a connected state of said placed call based on at least one of said audio states comprises: judging that the dialed call is in a real person connection state according to at least one audio state; the method further comprises the following steps: and carrying out transfer processing on the outgoing call corresponding to the dialed call.
In some embodiments, the acquiring of an audio byte stream of a set length from an audio source further comprises: sampling the audio source by setting a sliding window and traversing with a step size to obtain the audio byte stream of the set length.
In some embodiments, the audio state comprises: silent state, beep, polyphonic ringtone, system alert tone, voice of a person speaking after switching on or hang-up alert tone.
In some embodiments, the audio states include a current audio state and a historical audio state, the audio categories include: silent state, beep, polyphonic ringtone, system prompt tone, voice of speaking after the person is connected or hang-up prompt tone; said determining a call-on state of said dialed call based on at least one of said audio states, comprising: and obtaining the connection state according to the logic relation between the current audio state corresponding to the current audio segment and the historical audio state corresponding to the historical audio segment, wherein the audio segments correspond to the audio byte streams with the set lengths one by one.
In some embodiments, the audio states include a current audio state and a historical audio state, and the method comprises, prior to determining the connected state of the dialed call based on at least one of the audio states: acquiring at least one historical audio state, wherein each historical audio state corresponds to an audio byte stream of the set length obtained at a past moment; and the determining of the connected state of the dialed call based on at least one of the audio states comprises: judging the state of the dialed call through the combination of at least one historical audio state and the current audio state, wherein the state of the dialed call comprises one of connected, hung up, and continuing to wait.
In a third aspect, some embodiments of the present application provide a method for automatically determining a phone answering state, where the method is applied to a server, and the method includes: receiving voice audio features; acquiring an audio state corresponding to the voice audio feature through a deep learning algorithm, wherein the voice audio feature is obtained by acquiring an audio byte stream with a set length from an audio source by a client and performing feature extraction on the audio byte stream with the set length, the voice audio feature is represented by an array, the audio source is a sound signal fed back by a dialed call received by a telephone dialing end, and the audio state belongs to one of audio categories which can be heard when waiting to be connected; and storing and sending the audio state to the client.
In a fourth aspect, some embodiments of the present application provide an apparatus for automatically determining a phone answering state, the apparatus including: a voice audio feature acquisition module configured to acquire voice audio features corresponding to an audio byte stream of a set length, wherein the audio byte stream of the set length is acquired from an audio source, the audio source includes a sound signal fed back by a dialed call and received at the dialing end, and the voice audio features are represented as an array; an audio state acquisition module configured to obtain and store an audio state corresponding to the audio byte stream of the set length according to a deep learning model and the voice audio features, wherein the audio state belongs to one of the audio categories that can be heard while waiting for the call to be connected; and a connected-state judging module configured to determine the connected state of the dialed call according to at least one audio state.
In a fifth aspect, some embodiments of the present application provide an apparatus for automatically determining a phone answering state, the apparatus including: the audio byte stream acquisition module is configured to acquire an audio byte stream with a set length according to an audio source, wherein the audio source is a sound signal fed back by a dialed call received by a telephone dialing end; the voice audio characteristic extraction module is configured to perform characteristic extraction on the audio byte stream to obtain voice audio characteristics represented by an array; a transmitting module configured to transmit the voice audio feature; the receiving module is configured to acquire an audio state corresponding to the voice audio features, wherein the audio state is obtained by classifying the voice audio features through a deep learning algorithm, and belongs to one of audio categories which can be heard when waiting to be switched on; and the connection state judging module is configured to determine the connection state of the dialed call according to at least one audio state.
In a sixth aspect, some embodiments of the present application provide a system for automatically determining a phone answering state, the system including: a client configured to acquire an audio byte stream of a set length from an audio source and transmit the audio byte stream; a server configured to: acquiring voice audio characteristics corresponding to an audio byte stream with a set length, wherein the audio byte stream with the set length is acquired from an audio source, the audio source comprises a voice signal fed back by a dialed telephone and received by a telephone dialing end, and the voice audio characteristics are represented by an array; obtaining and storing an audio state corresponding to the audio byte stream with the set length according to a deep learning model and the voice audio characteristics, wherein the audio state belongs to one of audio categories which can be heard when waiting to be switched on; determining a call-in status of the dialed call based on at least one of the audio states.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered limiting of the scope; those skilled in the art can obtain other related drawings from them without inventive effort.
Fig. 1 is a diagram of a network architecture on which a method for automatically determining a phone answering state according to an embodiment of the present application is based;
fig. 2 is a flowchart of a method for automatically determining a phone answering state according to an embodiment of the present disclosure;
fig. 3 is a second flowchart of a method for automatically determining a phone answering state according to an embodiment of the present application;
fig. 4 is a third flowchart of a method for automatically determining a phone answering state according to an embodiment of the present application;
fig. 5 is a fourth flowchart of a method for automatically determining a phone answering state according to an embodiment of the present application;
FIG. 6 is a block diagram of an apparatus for automatically determining a phone answering status according to an embodiment of the present disclosure;
fig. 7 is a second block diagram of the apparatus for automatically determining a phone answering state according to the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Some embodiments of the present application identify the call state online in real time through a deep learning model, and then determine, through combinational logic over a plurality of call states (also called audio states), whether the call was finally picked up and whether the answering party is a real person, an intelligent voice assistant, a voicemail box, and so on; for example, several logical judgment statements may be constructed by imitating how a human judges whether a call has been answered. These embodiments apply to scenarios with a large number of telephone numbers as candidate service objects, quickly and autonomously judging the connected state of each call so that downstream business can begin.
Referring to fig. 1, fig. 1 is a diagram illustrating a network architecture based on which a method for automatically determining a phone answering state according to an embodiment of the present application is based.
In some embodiments, client 100 and server 170 are interconnected via network 160. The network 160 includes, but is not limited to, a mobile communication access network (e.g., a 4G or 5G communication network) and a core network.
The client 100 may be a computing device including a processor and memory, through which the client 100 may retrieve at least a set length audio byte stream. For example, the client 100 may be a smart phone, a game device, a robot, or other interactive devices with a telephone dialing function. In some embodiments, the client 100 may further include a sound collector (not shown in the figure) at least for collecting a sound signal returned by the dialed phone, and an audio obtaining device (not shown in the figure) at least for obtaining an audio byte stream of a set length from the collected sound signal. In some embodiments, the client 100 may further include a feature extraction module, where the feature extraction module is configured to extract features from the received audio byte stream with a set length, and characterize the extracted features with an array, that is, obtain the voice audio features.
In some embodiments, the server 170 is configured to receive the byte stream with the set length from the client 100 through the network 160, and the server 170 will then automatically determine the phone answering status according to the byte stream with the set length.
In some embodiments, the server 170 is configured to receive the voice audio features from the client 100 via the network 160 (in this case, the client 100 extracts the voice audio features itself), and the server 170 then automatically determines the phone answering state according to the voice audio features.
The method for automatically determining the phone answering state performed by the server 170 at least includes performing audio state recognition on the obtained voice audio features with a deep learning model to obtain the audio category corresponding to each audio byte stream of the set length (for example, the audio categories include silence, a beep, a polyphonic ringtone, a system prompt tone, the voice of a person speaking after the call is connected, a hang-up tone, etc.). In some embodiments, the server 170 further combines the audio state recognition results to determine the answering state of the phone and feeds the finally determined answering state back to the client 100. In other embodiments, the server 170 feeds back the audio state recognition result for each audio byte stream of the set length to the client 100, and the client 100 then determines the answering state of the call from a combination of those audio state results.
The method for automatically determining the call answering state as performed on the server 170 is described below by way of example.
As shown in fig. 2, some embodiments of the present application provide a method for automatically determining a phone answering state, including: s101, acquiring voice audio characteristics corresponding to an audio byte stream with a set length, wherein the audio byte stream with the set length is acquired from an audio source, the audio source comprises a sound signal fed back by a dialed telephone received by a telephone dialing end, and the voice audio characteristics are represented by an array; s102, obtaining and storing an audio state corresponding to the audio byte stream with the set length according to a deep learning model (or called a deep learning algorithm) and the voice audio characteristics, wherein the audio state belongs to one of audio categories which can be heard when waiting to be switched on; s103, determining the connection state of the dialed call according to at least one audio state.
In some embodiments, the client 100 samples the audio source by setting a sliding window and traversing with a step size to obtain the audio byte stream of the set length, and then sends it to the server 170, where the method for automatically determining the telephone answering state continues. For example, the client 100 obtains an audio byte stream from a stable audio source and samples segments of the set length by setting a sliding window and a step size.
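The sliding-window sampling described above can be sketched in a few lines; the window and step lengths below are illustrative assumptions, not values from the patent.

```python
# Minimal sketch of sampling fixed-length audio byte-stream segments from a
# source with a sliding window and step size. Lengths are illustrative.

def sliding_windows(audio: bytes, window: int, step: int):
    """Yield fixed-length byte-stream segments from `audio`."""
    for start in range(0, len(audio) - window + 1, step):
        yield audio[start:start + window]

# e.g. for 8 kHz, 16-bit mono PCM, a 1-second window is 16000 bytes;
# a 0.5-second step would give 50% overlap between consecutive segments.
```

Choosing a step smaller than the window gives overlapping segments, so short sounds near a segment boundary are still seen whole by the classifier.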
As an example, S101 further includes: performing feature extraction on the obtained audio byte stream of the set length through Mel-frequency cepstral coefficients (MFCC) or a log-mel spectrogram algorithm to obtain the voice audio features. That is, some embodiments of the present application convert the voice data from a byte stream of a set length into an array of specific dimensions through voice feature extraction; feature-extraction techniques that may be employed include MFCC and log-mel. The obtained array characterizes a piece of voice feature information (i.e., an audio byte stream of the set length), so that a deep learning model can be used to fit it.
In other words, voice feature extraction converts the voice data from a byte stream of the set length into an array of specific dimensions and transforms the audio from the time domain into the frequency domain, so that the resulting array represents a piece of voice feature information and can then be fitted with a deep learning model.
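To make the byte-stream-to-array step concrete, the following simplified sketch computes a plain log power spectrogram; it stands in for the MFCC or log-mel extraction named above (which adds mel filtering and, for MFCC, a cepstral transform). The frame and hop sizes are assumptions.

```python
import numpy as np

def log_power_spectrogram(pcm_bytes, frame_len=256, hop=128):
    """Turn a 16-bit little-endian PCM byte stream into a 2-D feature array.

    Simplified stand-in for MFCC/log-mel: frame the signal, window it,
    take the FFT power spectrum, and apply a log.
    """
    # bytes -> float samples in [-1, 1]
    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        frames.append(np.log(power + 1e-10))   # log compresses dynamic range
    return np.stack(frames)  # shape: (num_frames, frame_len // 2 + 1)
```

The output is exactly the kind of fixed-shape array the text describes: time along one axis, frequency features along the other, ready to feed a CNN.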
S102 is exemplarily set forth below.
It should be noted that the deep learning model in the embodiments of the present application determines the real-time state of an input audio segment (corresponding to an audio byte stream of a set length), and the real-time requirement is strict; accordingly, the model's input size, number of layers, and parameter count are all kept as small as possible. To balance real-time performance against sufficiently informative input, some embodiments extract features from the byte stream corresponding to an audio segment of 1 to 2 seconds and use the resulting voice audio features as the model's input: a segment of this length retains useful information while keeping the model's processing time short. The model thus takes as input the feature array of each small audio segment and outputs the audio category judged from those features, i.e., one of the different sound types occurring in a telephone scenario.
As one example, the structure of the deep learning model adopts a convolutional recurrent neural network (CGRU). The CGRU structure makes full use of the CNN's high-dimensional feature extraction to extract higher-dimensional features from the input voice audio features, obtaining new, expressive features learned from the training data, and then applies the temporal processing capability of the RNN (GRU) to process along the time direction.
To meet real-time processing requirements, the deep learning model of the embodiments of the present application adopts a shallow neural network structure. For example, in some embodiments the model includes a CNN layer, an RNN layer, and a fully connected layer; the CNN's high-dimensional feature extraction is applied to the input voice audio features corresponding to the audio byte stream of the set length, yielding features with stronger expressive power learned from the training data.
In some embodiments of the present application, to fit the characteristic that the time dimension of the audio signal is longer than the feature dimension, the convolution kernel of the CNN network is chosen to be longer in the time dimension than in the feature dimension; for example, the kernel size is 9 × 5 or 11 × 5, where 9 and 11 are the time dimension and 5 is the feature dimension. The RNN layer has the longest processing time in the deep learning model, so to meet the real-time requirement, some embodiments keep the feature map output from the CNN as small as possible while ensuring that no useful data is lost when the CNN extracts high-dimensional features. To this end, the CNN network further uses a convolution stride of 1 and includes a pooling layer with a 2 × 2 kernel, so that the final feature map output from the CNN layer is roughly halved in both length and width relative to the CNN input.
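The size reduction described above can be checked with a short sketch. The convolution padding scheme ('valid') and the example input size are assumptions, since the text fixes only the kernel sizes, the stride of 1, and the 2 × 2 pooling:

```python
# Sketch of the CNN size bookkeeping described above. Assumptions not
# fixed by the text: 'valid' convolution padding and non-overlapping
# 2x2 pooling with floor division.

def conv_out(n, kernel, stride=1):
    """Output length of a 'valid' convolution along one axis."""
    return (n - kernel) // stride + 1

def pool_out(n, pool=2):
    """Output length after non-overlapping pooling along one axis."""
    return n // pool

# Example: 100 time steps x 40 feature bins into a 9x5 kernel with
# stride 1, followed by 2x2 pooling.
t_out = pool_out(conv_out(100, 9))   # time axis: (100-9+1)//2 = 46
f_out = pool_out(conv_out(40, 5))    # feature axis: (40-5+1)//2 = 18
print(t_out, f_out)  # roughly half the input in each direction
```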
Some embodiments of the present application use the time-sequence processing capability of the RNN (GRU) to process features along the time direction; by the nature of a recurrent neural network, this processing is sequential in the time dimension. In some embodiments, to meet the real-time requirement, the RNN layer uses a small parameter count (e.g., on the order of 2^5 to 2^7 units), and only the output of the last RNN node is taken as the final output, so that the final output length is just a single feature-dimension count.
As an example, the fully connected layer uses softmax as the classification normalization function, and the final output class is one of the different sound categories (speech categories) in the phone scenario. It can be understood that the category set depends on whether a prompt tone occurs when the telephone is connected; according to the actual use environment, some embodiments of the present application use the categories for the case where no prompt tone appears at connection. The sound categories are divided according to six different sounds a human can hear while waiting to be connected: silent state, beep, polyphonic ringtone, system alert tone, the sound of a person speaking after connection, and hang-up alert tone (whether the hang-up alert tone needs to be recognized is optional). That is, some embodiments use the deep learning network to classify the input voice audio features, thereby classifying the input audio segment corresponding to the byte stream of the set length into one of these six sounds.
S103 is exemplarily set forth below.
S103, comprehensively judging the connection state of the dialed call according to the current audio state of the audio segment corresponding to the audio byte stream with the set length at the current moment and the historical audio states of one or more historical audio segments.
As an example, the connection-state judgment of S103 is a logical judgment modeled on the human experience of making a call, i.e., it is implemented according to how a human judges the connection state in a real calling scenario. For example, the connection-judgment logic may include: if a polyphonic ringtone, beep, or system alert tone occurred earlier (i.e., the historical audio state obtained by the deep learning model from a historical audio segment is a polyphonic ringtone, beep, or system alert tone) and a human voice is recognized at this moment (i.e., the deep network judges that the current audio state corresponding to the audio byte stream of the set length at the current moment is the voice of a person speaking after connection), S103 judges that the call is in a real-person connected state at this moment; if a polyphonic ringtone or system alert tone occurred earlier and about one second of continuous silence is recognized at this moment (i.e., the current audio state is judged to be the silent state), the call is judged to be connected at this moment; alternatively, if a beep occurred earlier (i.e., the historical audio state obtained from the historical audio segment is a beep) and about 5 seconds of silence are recognized (i.e., the current audio state is judged to be silent), the call is judged to be connected.
The hang-up determination logic may include: a hang-up alert tone is recognized (i.e., the deep network judges the current audio state corresponding to the audio byte stream of the set length at the current moment to be a hang-up alert tone), which means the call has been hung up; or a beep or ring-back tone occurred earlier (i.e., the historical audio state obtained by the deep learning model from the historical audio segments is a beep or ring-back tone) and a system alert tone is recognized at this moment (the deep network judges the current audio state to be a system alert tone), which to a large extent means the called user has hung up, the call has been diverted to a message box, or the like.
That is, S103 includes: obtaining the connection state according to the logical relation between the current audio state corresponding to the current audio byte stream and the historical audio states corresponding to historical audio byte streams. It can be understood that before the connection-state judgment of S103 is performed, the method includes: acquiring at least one historical audio state, wherein each historical audio state corresponds to an audio byte stream of the set length obtained at a past moment. The determining of the connection state of the dialed call according to at least one of the audio states then includes: judging, from the combination of at least one historical audio state and the current audio state, whether the dialed call is connected, hung up, or still needs to be waited on.
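The experience-based rules above can be sketched as a small decision function. The category names and the mapping from durations (about 1 s of silence after a ring-back or system tone, about 5 s after a beep) to counts of consecutive one-second segment states are assumptions for illustration:

```python
def trailing_silence(states):
    """Count consecutive 'silence' labels at the end of the list."""
    n = 0
    for s in reversed(states):
        if s != "silence":
            break
        n += 1
    return n

def decide(states):
    """states: per-segment labels in time order, assumed ~1 s each."""
    if not states:
        return "waiting"
    current, history = states[-1], states[:-1]
    if current == "hangup_tone":
        return "hung_up"
    if current == "system_tone" and any(
            s in ("beep", "ringback") for s in history):
        return "hung_up"  # e.g. the call was diverted to a message box
    if current == "speech" and any(
            s in ("ringback", "beep", "system_tone") for s in history):
        return "connected_live"
    if current == "silence":
        sil = trailing_silence(states)
        if any(s in ("ringback", "system_tone") for s in history) and sil >= 1:
            return "connected"
        if "beep" in history and sil >= 5:  # ~5 s of silence after a beep
            return "connected"
    return "waiting"

print(decide(["ringback", "ringback", "speech"]))  # connected_live
```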
It should be noted that, when step S103 determines from at least one of the audio states that the dialed call is in a real-person connected state, the method further comprises: performing transfer processing on the outgoing call corresponding to the dialed call. In some embodiments, after performing S103, the method further comprises: providing the connection state to a client.
As shown in fig. 3, the overall technical implementation logic of some embodiments of the present application includes: S201, acquiring an audio byte stream of a set length, for example, acquiring the audio byte stream from a stable audio source and sampling byte streams of the set length (e.g., about 1-2 seconds each) by traversal with a set sliding window and step size; S202, extracting features from the audio byte stream, for example, performing voice feature extraction on the obtained audio byte stream to obtain voice audio features; S203, obtaining an audio state from the voice audio features via the deep learning model, for example, using the obtained voice audio features as the input of the deep learning model to obtain the audio state corresponding to each audio segment; S204, logically combining multiple audio states to obtain a connection state, for example, storing the audio state of each voice segment (or audio segment) from the beginning, and making a logical judgment over the stored audio states with reference to human calling experience, so as to obtain the final result of whether this call is connected, hung up, or still waiting.
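The sliding-window sampling of S201 can be sketched as follows. The sample rate, window, and step sizes are assumptions, since the text only fixes a window of roughly 1-2 seconds:

```python
BYTES_PER_SECOND = 8000 * 2  # assumed 8 kHz, 16-bit mono PCM

def sliding_windows(stream, window=BYTES_PER_SECOND,
                    step=BYTES_PER_SECOND // 2):
    """Yield fixed-length byte chunks sampled with a sliding window."""
    for start in range(0, len(stream) - window + 1, step):
        yield stream[start:start + window]

chunks = list(sliding_windows(bytes(3 * BYTES_PER_SECOND)))
print(len(chunks))  # 5 overlapping 1-second windows in 3 s of audio
```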
As one implementation manner, in the engineering architecture of some embodiments of the present application, the client sends requests carrying byte-stream information of the set length and continuously receives the connection-status information obtained from the server. The server completes receiving the features transmitted by the client, the judgment by the deep learning model, the logical judgment based on human call-connection experience, the storage of historical-moment states, and the return of the answering state. Regarding dependencies, because the call-judging process does not require a judgment to be produced at every single instant, the reaction time of one port does not affect the time node at which the model state is normally judged; therefore, in the engineering implementation, the Client has a many-to-many relationship to the Web layer, and the Web layer has a many-to-one relationship to TensorFlow Serving.
In one engineering implementation, an HTTP POST request may be sent over a web connection to a URL so as to use a deep learning service on a cloud server (as shown in fig. 4 below); alternatively, the deep learning service may be deployed directly on the local client (as shown in figs. 2 and 3) to reduce the delay caused by network transmission.
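The POST-based cloud deployment can be sketched as below. The endpoint URL and the model name are hypothetical; the JSON body follows TensorFlow Serving's REST convention, which is one plausible choice given the serving component named in the text:

```python
import json
import urllib.request

# Hypothetical endpoint; the model name "answer_state" is an assumption.
URL = "http://example.com/v1/models/answer_state:predict"

def build_predict_request(features, url=URL):
    """Wrap one feature array in a TensorFlow Serving REST request."""
    body = json.dumps({"instances": [features]}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})

req = build_predict_request([[0.1, 0.2], [0.3, 0.4]])
# urllib.request.urlopen(req) would return the classified audio state
print(req.get_method())  # POST
```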
As shown in fig. 4, some embodiments of the present application further provide a method for automatically determining a phone answering status applied to a client, which differs from the method of fig. 2 in that the deep learning model is deployed in a cloud server, while the remaining steps are performed by the client. The method for automatically judging the telephone answering state applied to the client comprises the following steps: S401, acquiring an audio byte stream with a set length according to an audio source, wherein the audio source is a sound signal fed back by a dialed call received by a telephone dialing end; S402, extracting the characteristics of the audio byte stream to obtain the voice audio features represented by an array; S403, sending the voice audio features; S404, obtaining an audio state corresponding to the voice audio features, wherein the audio state is obtained by classifying the voice audio features through a deep learning algorithm, and belongs to one of the audio categories which can be heard when waiting to be connected; S405, determining the connection state of the dialed call according to at least one audio state.
In some embodiments, said determining a connected state of said placed call based on at least one of said audio states comprises: judging that the dialed call is in a real person connection state according to at least one audio state; the method further comprises the following steps: and carrying out transfer processing on the outgoing call corresponding to the dialed call.
In some embodiments, the acquiring an audio byte stream of a set length from an audio source further comprises: sampling the audio source by traversal with a set sliding window and step size to obtain the audio byte stream of the set length.
In some embodiments, the audio state comprises: silent state, beep, polyphonic ringtone, system alert tone, voice of a person speaking after switching on or hang-up alert tone.
In some embodiments, the audio states include a current audio state and a historical audio state, and the audio categories include: silent state, beep, polyphonic ringtone, system prompt tone, voice of a person speaking after connection, or hang-up prompt tone; the determining of the connection state of the dialed call according to at least one of the audio states includes: obtaining the connection state according to the logical relation between the current audio state corresponding to the current audio segment and the historical audio state corresponding to the historical audio segment, wherein the audio segments correspond one-to-one to the audio byte streams of the set length.
In some embodiments, the audio states include a current audio state and a historical audio state, and before determining the connection state of the dialed call according to at least one of the audio states, the method comprises: acquiring at least one historical audio state, wherein each historical audio state corresponds to an audio byte stream of the set length obtained at a past moment; the determining of the connection state of the dialed call according to at least one of the audio states includes: judging the state of the dialed call from the combination of at least one historical audio state and the current audio state, wherein the state of the dialed call is one of connected, hung up, and continuing to wait.
Corresponding to the method of fig. 4, as shown in fig. 5, a method for automatically determining a call answering state, executed at a server side, includes: s501, receiving voice audio features; s502, acquiring an audio state corresponding to the voice audio feature through a deep learning algorithm, wherein the voice audio feature is obtained by collecting an audio byte stream with a set length from an audio source by a client and performing feature extraction on the audio byte stream with the set length, the voice audio feature is represented by an array, the audio source is a feedback sound signal of a dialed call received by a telephone dialing end, and the audio state belongs to one of audio categories which can be heard when waiting to be connected; s503, storing and sending the audio state to the client.
For specific implementation details of corresponding steps in the methods of fig. 4 and fig. 5, reference may be made to the detailed description of fig. 2, and redundant description is not repeated here to avoid repetition.
As shown in fig. 6, some embodiments of the present application provide an apparatus for automatically determining a phone answering state, the apparatus including: a voice audio feature obtaining module 601 configured to obtain a voice audio feature corresponding to an audio byte stream with a set length, where the audio byte stream with the set length is acquired from an audio source, the audio source includes a sound signal fed back by a dialed phone received by a phone dialing terminal, and the voice audio feature is characterized by an array; an audio state obtaining module 602, configured to obtain and store an audio state corresponding to the audio byte stream with the set length according to a deep learning algorithm and the voice audio features, where the audio state belongs to one of audio categories that can be heard when waiting to be turned on; a connection state determining module 603 configured to determine a connection state of the dialed call according to at least one of the audio states.
As shown in fig. 7, some embodiments of the present application provide an apparatus for automatically determining a phone answering state, the apparatus including: an audio byte stream acquiring module 701 configured to acquire an audio byte stream of a set length according to an audio source, wherein the audio source is a sound signal fed back by a dialed call received by a telephone dialing terminal; a voice audio feature extraction module 702, configured to perform feature extraction on the audio byte stream to obtain a voice audio feature represented by an array; a sending module 703 configured to send the voice audio feature; a receiving module 704 configured to obtain an audio state corresponding to the voice audio feature, wherein the audio state is obtained by classifying the voice audio feature through a deep learning algorithm, and the audio state belongs to one of audio categories that can be heard when waiting to be turned on; a connected state determining module 705 configured to determine a connected state of the dialed call according to at least one of the audio states.
For the specific implementation details of the corresponding modules in the apparatuses of fig. 6 and 7, reference may be made to the detailed description in the foregoing, and redundant description is not repeated here to avoid repetition.
Some embodiments of the present application further provide a system (as shown in fig. 1) for automatically determining a phone answering state, the system comprising: a client configured to acquire an audio byte stream of a set length from an audio source and transmit the audio byte stream; a server configured to: acquiring voice audio characteristics corresponding to an audio byte stream with a set length, wherein the audio byte stream with the set length is acquired from an audio source, the audio source comprises a voice signal fed back by a dialed telephone and received by a telephone dialing end, and the voice audio characteristics are represented by an array; obtaining and storing an audio state corresponding to the audio byte stream with the set length according to a deep learning algorithm and the voice audio characteristics, wherein the audio state belongs to one of audio categories which can be heard when waiting to be switched on; determining a call-in status of the dialed call based on at least one of the audio states.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (17)

1. A method for automatically determining a telephone answering state, the method comprising:
acquiring voice audio characteristics corresponding to an audio byte stream with a set length, wherein the audio byte stream with the set length is acquired from an audio source, the audio source comprises a voice signal fed back by a dialed telephone and received by a telephone dialing end, and the voice audio characteristics are represented by an array;
obtaining and storing an audio state corresponding to the audio byte stream with the set length according to a deep learning model and the voice audio features, wherein the audio state belongs to one of audio categories which can be heard when waiting to be switched on, the deep learning model comprises a layer of CNN, a layer of RNN and a layer of full connection layer, the length of a convolution kernel of the CNN in a time dimension is longer than the feature dimension, and the CNN further comprises: the step size of the convolution kernel is chosen to be 1 and includes pooling layers with kernel size of 2 x 2;
determining a call-in status of the dialed call based on at least one of the audio states.
2. The method of claim 1,
said determining a call-on state of said dialed call based on at least one of said audio states, comprising: judging that the dialed call is in a real person connection state according to at least one audio state;
the method further comprises the following steps: and carrying out transfer processing on the outgoing call corresponding to the dialed call.
3. The method of claim 1, wherein after determining the on state of the dialed call based on at least one of the audio states, the method further comprises: providing the on state to a client.
4. The method of claim 1, wherein the set length audio byte stream is obtained by sampling the audio source by setting a sliding window and a step traverse.
5. The method of claim 1, wherein the obtaining speech audio features corresponding to a set length of an audio byte stream comprises: and performing feature extraction on the audio byte stream with the set length through a Mel cepstrum coefficient (MFCC) or a Mel spectrum logmel algorithm to obtain the voice audio features.
6. The method of claim 1, in which the deep learning model is implemented by a convolutional recurrent neural network (CGRU).
7. The method of claim 1, wherein the audio states include a current audio state and a historical audio state, the audio categories including: silent state, beep, polyphonic ringtone, system prompt tone, voice of speaking after the person is connected or hang-up prompt tone;
said determining a call-on state of said dialed call based on at least one of said audio states, comprising: and obtaining the connection state according to the logic relation between the current audio state corresponding to the current audio segment and the historical audio state corresponding to the historical audio segment, wherein the audio segments correspond to the audio byte streams with the set lengths one by one.
8. The method of claim 1, wherein the audio states include a current audio state and a historical audio state, the method comprising, prior to determining the on state of the dialed call based on at least one of the audio states:
acquiring at least one historical audio state, wherein the historical audio state corresponds to the audio byte stream with the set length obtained at the past moment;
said determining a call-on state of said dialed call based on at least one of said audio states, comprising: and judging the state of the dialed call through the combination of at least one historical audio state and the current audio state, wherein the state of the dialed call comprises one of connection, hang-up and continuous waiting.
9. A method for automatically judging a telephone answering state is applied to a client, and is characterized by comprising the following steps:
acquiring an audio byte stream with a set length according to an audio source, wherein the audio source is a sound signal fed back by a dialed call received by a telephone dialing end;
carrying out feature extraction on the audio byte stream to obtain voice audio features represented by arrays;
transmitting the voice audio features;
acquiring an audio state corresponding to the voice audio features, wherein the audio state is obtained by classifying the voice audio features through a deep learning model, and belongs to one of audio categories which can be heard when waiting to be switched on;
determining a connection state of the dialed call according to at least one audio state;
wherein the audio states include a current audio state and a historical audio state, and the audio categories include: silent state, beep, polyphonic ringtone, system prompt tone, voice of speaking after the person is connected or hang-up prompt tone;
said determining a call-on state of said dialed call based on at least one of said audio states, comprising: and obtaining the connection state according to the logic relation between the current audio state corresponding to the current audio segment and the historical audio state corresponding to the historical audio segment, wherein the audio segments correspond to the audio byte streams with the set lengths one by one.
10. The method of claim 9,
said determining a call-on state of said dialed call based on at least one of said audio states, comprising: judging that the dialed call is in a real person connection state according to at least one audio state;
the method further comprises the following steps: and carrying out transfer processing on the outgoing call corresponding to the dialed call.
11. The method of claim 9, wherein the obtaining a set length of the audio byte stream based on an audio source, further comprises:
and sampling the audio source by traversal with a set sliding window and step size to obtain the audio byte stream of the set length.
12. The method of claim 9, wherein the audio state comprises: silent state, beep, polyphonic ringtone, system alert tone, voice of a person speaking after switching on or hang-up alert tone.
13. The method of claim 9, wherein the audio states include a current audio state and a historical audio state, the method comprising, prior to determining the on state of the dialed call based on at least one of the audio states:
acquiring at least one historical audio state, wherein the historical audio state corresponds to the audio byte stream with the set length obtained at the past moment;
said determining a call-on state of said dialed call based on at least one of said audio states, comprising: and judging the state of the dialed call by the combination of at least one historical audio state and the current audio state, wherein the state of the dialed call comprises one of connection, hang-up and continuous waiting.
14. A method for automatically judging a telephone answering state is applied to a server side, and is characterized by comprising the following steps:
receiving voice audio features;
acquiring an audio state corresponding to the voice audio feature through a deep learning model, wherein the voice audio feature is obtained by a client collecting an audio byte stream with a set length from an audio source and performing feature extraction on the audio byte stream with the set length, the voice audio feature is represented by an array, the audio source is a sound signal fed back by a dialed telephone received by a telephone dialing terminal, and the audio state belongs to one of audio categories which can be heard when waiting to be connected, wherein the deep learning model comprises a layer of CNN, a layer of RNN and a layer of full connection layer, the length of the convolution kernel of the CNN in the time dimension is longer than the feature dimension, and the CNN further comprises: the step size of the convolution kernel is chosen to be 1 and includes pooling layers with kernel size of 2 x 2;
and storing and sending the audio state to the client.
15. An apparatus for automatically determining a telephone answering state, the apparatus comprising:
the voice audio characteristic acquisition module is configured to acquire voice audio characteristics corresponding to a set length of audio byte stream, wherein the set length of audio byte stream is acquired from an audio source, the audio source comprises a voice signal fed back by a dialed call received by a telephone dialing terminal, and the voice audio characteristics are represented by an array;
an audio state obtaining module, configured to obtain and store an audio state corresponding to the audio byte stream with the set length according to a deep learning model and the voice audio features, where the audio state belongs to one of audio categories that can be heard when waiting to be turned on, the deep learning model includes a layer of CNN, a layer of RNN, and a layer of full-connected layer, the length of the convolution kernel of the CNN in the time dimension is longer than the feature dimension, and the CNN further includes: the step size of the convolution kernel is chosen to be 1 and includes pooling layers with kernel size of 2 x 2;
and the connection state judging module is configured to determine the connection state of the dialed call according to at least one audio state.
16. An apparatus for automatically determining a telephone answering state, the apparatus comprising:
the audio byte stream acquisition module is configured to acquire an audio byte stream with a set length according to an audio source, wherein the audio source is a sound signal fed back by a dialed call received by a telephone dialing end;
the audio characteristic extraction module is configured to perform characteristic extraction on the audio byte stream to obtain voice audio characteristics represented by an array;
a transmitting module configured to transmit the voice audio feature;
a receiving module, configured to obtain the audio state corresponding to the voice audio features, wherein the audio state is obtained by classifying the voice audio features through a deep learning model, the audio state belongs to one of the audio categories that can be heard while waiting for the call to be connected, the deep learning model comprises a CNN layer, an RNN layer, and a fully-connected layer, the length of the convolution kernel of the CNN in the time dimension is greater than its length in the feature dimension, the step size of the convolution kernel is 1, and the CNN further comprises a pooling layer with a kernel size of 2 x 2;
and the connection state judging module is configured to determine the connection state of the dialed call according to at least one audio state.
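The claims state only that feature extraction turns a set-length byte stream into an array; the specific acoustic feature is not named here. As a toy stand-in under that uncertainty, the sketch below frames 16-bit PCM telephony audio and computes per-frame log energy; the frame size (20 ms at 8 kHz) and the energy feature itself are assumptions.

```python
import struct
import math

FRAME = 160  # samples per frame: 20 ms at 8 kHz telephony audio (an assumption)

def extract_features(byte_stream: bytes):
    """Toy stand-in for the claimed feature extraction: convert a
    set-length byte stream of little-endian 16-bit PCM into an array
    of per-frame log energies. The patent does not specify the feature."""
    n = len(byte_stream) // 2
    samples = struct.unpack(f"<{n}h", byte_stream[: 2 * n])
    feats = []
    for i in range(0, n - FRAME + 1, FRAME):
        frame = samples[i : i + FRAME]
        energy = sum(s * s for s in frame) / FRAME
        feats.append(math.log(energy + 1.0))
    return feats

# one second of silence at 8 kHz -> 50 frames of zero log energy
print(len(extract_features(bytes(16000))))  # prints 50
```

In the claimed apparatus, an array like this would be what the transmitting module sends and what the deep learning model classifies into an audio state.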
17. A system for automatically determining a telephone answering state, the system comprising:
a client configured to acquire an audio byte stream of a set length from an audio source and transmit the audio byte stream;
a server configured to:
acquiring voice audio features corresponding to an audio byte stream of a set length, wherein the audio byte stream of the set length is collected from an audio source, the audio source comprises a sound signal, fed back by the dialed call, received by the telephone dialing terminal, and the voice audio features are represented by an array;
obtaining and storing the audio state corresponding to the audio byte stream of the set length according to a deep learning model and the voice audio features, wherein the audio state belongs to one of the audio categories that can be heard while waiting for the call to be connected, the deep learning model comprises a CNN layer, an RNN layer, and a fully-connected layer, the length of the convolution kernel of the CNN in the time dimension is greater than its length in the feature dimension, the step size of the convolution kernel is 1, and the CNN further comprises a pooling layer with a kernel size of 2 x 2;
determining the connection state of the dialed call according to at least one of the audio states.
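The claims require only that the connection state be derived from at least one audio state; they do not give the decision rule. A minimal sketch of one plausible rule is below — the four category labels and the priority ordering (busy over voicemail over live voice) are assumptions for illustration, not the patented method.

```python
# Hypothetical audio categories heard while waiting to be connected.
RINGBACK, VOICE, BUSY, VOICEMAIL = "ringback", "voice", "busy", "voicemail"

def connection_state(audio_states):
    """Map the per-segment audio states of one call to a connection state.
    Priority ordering is an assumption; the patent does not specify it."""
    if BUSY in audio_states:
        return "not_connected"
    if VOICEMAIL in audio_states:
        return "voicemail"
    if VOICE in audio_states:
        return "answered"
    return "ringing"   # only ringback (or nothing) heard so far

print(connection_state([RINGBACK, RINGBACK, VOICE]))  # prints answered
```

A rule of this shape would let the connection state judging module aggregate the stored audio states of successive set-length byte streams into a single verdict for the dialed call.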
CN202011391529.2A 2020-12-03 2020-12-03 Method, device and system for automatically judging telephone answering state Active CN112202974B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011391529.2A CN112202974B (en) 2020-12-03 2020-12-03 Method, device and system for automatically judging telephone answering state

Publications (2)

Publication Number Publication Date
CN112202974A CN112202974A (en) 2021-01-08
CN112202974B (en) 2021-04-02

Family

ID=74034360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011391529.2A Active CN112202974B (en) 2020-12-03 2020-12-03 Method, device and system for automatically judging telephone answering state

Country Status (1)

Country Link
CN (1) CN112202974B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113518153B (en) * 2021-04-25 2023-07-04 上海淇玥信息技术有限公司 Method and device for identifying call response state of user and electronic equipment
CN113312509A (en) * 2021-06-22 2021-08-27 中国平安财产保险股份有限公司 Telephone recall method, telephone recall device, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102404462A (en) * 2010-09-08 2012-04-04 北京商路通信息技术有限公司 Call progress analyzing method for phone dialing system and device
CN106504768A (en) * 2016-10-21 2017-03-15 百度在线网络技术(北京)有限公司 Phone testing audio frequency classification method and device based on artificial intelligence
US9843672B1 (en) * 2016-11-14 2017-12-12 Motorola Mobility Llc Managing calls
CN108156331A (en) * 2017-11-10 2018-06-12 大连金慧融智科技股份有限公司 A kind of intelligence auto dialing outgoing call system and method
CN110995938A (en) * 2019-12-13 2020-04-10 上海优扬新媒信息技术有限公司 Data processing method and device
CN111916109A (en) * 2020-08-12 2020-11-10 北京鸿联九五信息产业有限公司 Feature-based audio classification method and device and computing equipment

Similar Documents

Publication Publication Date Title
CN110096191B (en) Man-machine conversation method and device and electronic equipment
CN110891124B (en) System for artificial intelligence pick-up call
CN104714981B (en) Voice message searching method, device and system
CN112202974B (en) Method, device and system for automatically judging telephone answering state
US8290132B2 (en) Communications history log system
CN100481851C (en) Avatar control using a communication device
CN109065052B (en) Voice robot
CN107995101A (en) A kind of method and apparatus for being used to switching to speech message into text message
CN106649404B (en) Method and device for creating session scene database
CN111128241A (en) Intelligent quality inspection method and system for voice call
CN108874904A (en) Speech message searching method, device, computer equipment and storage medium
EP2891146B1 (en) Method and system for learning call analysis
CN108062316A (en) A kind of method and apparatus for aiding in customer service
CN103856602A (en) System and method for duplicating call
CN109271503A (en) Intelligent answer method, apparatus, equipment and storage medium
US5719996A (en) Speech recognition in selective call systems
CN102447786A (en) Personal life special-purpose assisting device and method thereof
CN104394258B (en) The method and apparatus that contact method change to communication object is handled
CN110502631B (en) Input information response method and device, computer equipment and storage medium
CN114297365B (en) Intelligent customer service system and method based on Internet
CN110740212A (en) Call answering method and device based on intelligent voice technology and electronic equipment
CN111028837B (en) Voice conversation method, voice recognition system and computer storage medium
JP2019139280A (en) Text analyzer, text analysis method and text analysis program
CN114328867A (en) Intelligent interruption method and device in man-machine conversation
CN106331389A (en) Short message addressee determining method, short message addressee determining device, and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210312

Address after: 230000 Room 201, building 11, 94 Wangjiang West Road, Shushan District, Hefei City, Anhui Province

Applicant after: Anhui Xinchen Communication Technology Co.,Ltd.

Address before: 19b-3, 19th floor, building 1, No.2, Shangdi Information Road, Haidian District, Beijing

Applicant before: MOXI (BEIJING) TECHNOLOGY Co.,Ltd.

GR01 Patent grant