CN112133324A - Call state detection method, device, computer system and medium

Info

Publication number
CN112133324A
CN112133324A (application CN201910491201.9A)
Authority
CN
China
Prior art keywords
call state
voice signal
call
filter
speech
Prior art date
Legal status
Pending
Application number
CN201910491201.9A
Other languages
Chinese (zh)
Inventor
童颖
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201910491201.9A
Publication of CN112133324A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L21/0208 Noise filtering (speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Extracted parameters being spectral information of each sub-band
    • G10L25/27 Characterised by the analysis technique
    • G10L25/30 Analysis technique using neural networks
    • G10L25/48 Specially adapted for particular use
    • G10L25/51 Specially adapted for comparison or discrimination
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L2021/02082 Noise filtering, the noise being echo or reverberation of the speech

Abstract

The present disclosure provides a call state detection method, apparatus, computer system and medium. The call state detection method comprises: obtaining a voice signal; obtaining speech features of the voice signal, the speech features being based on auditory properties and decoupled from the algorithm that cancels echo in the voice signal; and inputting the speech features into a call pattern recognition model to determine whether the call state of the voice signal is a single-ended call (single-talk) state or a double-ended call (double-talk) state.

Description

Call state detection method, device, computer system and medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a computer system, and a medium for detecting a call state.
Background
In the voice communication process, a voice communication product (such as a mobile phone) receives a far-end signal from the network side and plays it through a loudspeaker. An echo signal is generated along the acoustic path, and this echo, together with the near-end speech, is picked up by the microphone and transmitted to the other end of the call. To cancel the echo, the prior art adopts acoustic echo cancellation, whose principle is as follows: an adaptive filter simulates the echo path to obtain an estimated echo signal, and this estimate is subtracted from the near-end signal captured by the microphone.
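As a concrete illustration of this prior-art principle, the following is a minimal sketch of an adaptive echo canceller with a normalized LMS update; the filter length, step size and signal lengths are assumptions for illustration, not values taken from this disclosure.

import numpy as np

def nlms_aec(far_end, mic, filter_len=256, mu=0.5, eps=1e-8):
    """Estimate the echo path with an adaptive FIR filter and subtract the
    estimated echo from the microphone (near-end) signal. Assumes the
    far-end buffer is at least as long as the microphone buffer."""
    w = np.zeros(filter_len)                   # adaptive filter coefficients
    out = np.zeros_like(mic)                   # echo-cancelled output
    for n in range(filter_len, len(mic)):
        x = far_end[n - filter_len:n][::-1]    # most recent far-end samples
        echo_est = w @ x                       # estimated echo
        e = mic[n] - echo_est                  # error = near-end speech + residual echo
        out[n] = e
        w += mu * e * x / (x @ x + eps)        # NLMS coefficient update
    return out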
In echo cancellation, it is necessary to detect whether near-end speech is present in the near-end signal captured by the microphone. If near-end speech is present, the call is in a double-ended call state, which disturbs the update of the adaptive filter coefficients and can cause them to diverge. For call state detection, the prior art mainly uses energy-based detection methods and dual-filter-based detection methods.
In the course of implementing the disclosed concept, the inventors found at least the following problem in the prior art: existing call state detection techniques cannot easily achieve online detection while guaranteeing an accurate detection result at a small computational cost.
Disclosure of Invention
In view of the above, the present disclosure provides a method, an apparatus, a computer system, and a medium for detecting a call state, which can perform online detection and ensure accuracy of a detection result with a small amount of computation.
An aspect of the present disclosure provides a call state detection method, which may include the following operations: first obtaining a voice signal; then obtaining speech features of the voice signal, the speech features being based on auditory characteristics and decoupled from the algorithm that cancels echo in the voice signal; and then inputting the speech features into a call pattern recognition model to determine whether the call state of the voice signal is a single-ended call state or a double-ended call state.
According to the embodiment of the present disclosure, the speech features are based on auditory characteristics and are decoupled from the algorithm that cancels echo in the voice signal, so there is no need to obtain feature estimates by means of other algorithms such as an acoustic echo cancellation (AEC) algorithm; the input parameters can be obtained online, and the detection result is accurate.
According to an embodiment of the present disclosure, the speech features may include at least one of: a Mel-frequency cepstral coefficient feature, an amplitude modulation spectrum feature, a relative spectral transform perceptual linear prediction coefficient feature, and a filter bank power spectrum feature. This facilitates determining, based on the speech features, whether the current voice signal contains two highly correlated voice components, and determining the call state of the voice signal based on that correlation.
According to an embodiment of the present disclosure, the speech features may further include an auxiliary feature, wherein the auxiliary feature is coupled with the algorithm that cancels echo in the voice signal. This helps to improve the accuracy of the call state determination.
According to an embodiment of the present disclosure, the topology of the call pattern recognition model is a neural network with a first specified number of layers, wherein the input of the model is the current frame of the voice signal together with at least one of: a second specified number of frames before the current frame, and a third specified number of frames after the current frame. Adding a specified number of context frames as input enhancement lets the neural network better distinguish the call state using the correlation between preceding and following frames. Moreover, because neural networks have very good nonlinear modeling capability, features based on auditory properties can be fed in so that this modeling capability learns more discriminative features.
According to the embodiment of the disclosure, the output of the call pattern recognition model comprises a binary mask, and the binary mask comprises the call states of a plurality of frequency points, the frequency points being those included in the spectrogram of one frame of the voice signal. In this way the output is decided at a finer granularity, and whether a frame belongs to the single-ended or double-ended call state is then judged from the values of all frequency points in that frame.
According to an embodiment of the present disclosure, the method may further include performing acoustic echo cancellation on the voice signal using a filter after determining whether the call state of the voice signal is a single-ended or double-ended call state, wherein the parameters of the filter are updated only if the voice signal is determined to be in the single-ended call state. This avoids divergence of the filter parameters and can improve the echo cancellation effect.
According to the embodiment of the disclosure, updating the parameters of the filter may include the following. On the one hand, if the filter employs a frequency-domain echo cancellation algorithm, whether to update the filter parameters is determined based on the call state of each frequency point; since frequency points are frequency-domain quantities, the decision can be made directly per frequency point, improving the fineness of control. On the other hand, if the filter employs a time-domain echo cancellation algorithm, the call state of the current frame is determined from the call states of the frequency points included in the binary mask, and whether to update the filter parameters is then determined from the call state of the current frame, where the frequency points may have the same or different weights.
According to the embodiment of the disclosure, the call pattern recognition model may further comprise a classification layer whose output is the call state of the voice signal, so that the recognition result is given directly by the model without determining the call state of a frame from the binary mask by other means.
According to an embodiment of the present disclosure, training the call pattern recognition model may include inputting training data, the training data being voice signals with call state labeling information, and then adjusting the parameters of the model so that its output approaches the call state labels of the training data, thereby obtaining the parameters of the call pattern recognition model.
Another aspect of the present disclosure provides a call state detection apparatus, which may include a signal obtaining module, a feature obtaining module, and a state determining module. The signal obtaining module is configured to obtain a voice signal; the feature obtaining module is configured to obtain speech features of the voice signal, the speech features being based on auditory characteristics and decoupled from the algorithm that cancels echo in the voice signal; and the state determining module is configured to input the speech features into a call pattern recognition model and determine whether the voice signal is in a single-ended or double-ended call state.
According to an embodiment of the present disclosure, the speech features include at least one of: a Mel-frequency cepstral coefficient feature, an amplitude modulation spectrum feature, a relative spectral transform perceptual linear prediction coefficient feature, and a filter bank power spectrum feature.
According to an embodiment of the present disclosure, the speech features further comprise an auxiliary feature, wherein the auxiliary feature is coupled with the algorithm that cancels echo in the voice signal.
According to an embodiment of the present disclosure, the input of the state determination module comprises the current frame of the voice signal and at least one of: a second specified number of frames before the current frame, and a third specified number of frames after the current frame.
According to the embodiment of the disclosure, the output of the call pattern recognition model comprises a binary mask, and the binary mask comprises the call states of a plurality of frequency points, the frequency points being those included in the spectrogram of one frame of the voice signal.
According to an embodiment of the present disclosure, the apparatus may further include an echo cancellation module and a parameter update module, where the echo cancellation module is configured to perform acoustic echo cancellation on the voice signal by using a filter after determining that a call state of the voice signal is a single-ended call state or a double-ended call state, and the parameter update module is configured to update a parameter of the filter if determining that the voice signal is the single-ended call state.
According to an embodiment of the present disclosure, the parameter updating module includes a first updating unit and a second updating unit, wherein the first updating unit is configured to determine whether to update the parameter of the filter based on the call state of each frequency point if the filter employs a frequency domain-based echo cancellation algorithm, and the second updating unit is configured to determine the call state of the current frame voice signal based on the call states of a plurality of frequency points included in the binary mask if the filter employs a time domain-based echo cancellation algorithm, and then determine whether to update the parameter of the filter based on the call states of the current frame voice signal, where the plurality of frequency points have the same or different weights.
According to an embodiment of the present disclosure, the call pattern recognition model further comprises a classification layer, an output of the classification layer comprising the call state of the voice signal.
According to the embodiment of the present disclosure, the apparatus may further include a training module configured to train the call pattern recognition model. The training module may include an input unit and a parameter obtaining unit: the input unit is configured to input training data, the training data being voice signals with call state labeling information, and the parameter obtaining unit is configured to adjust the parameters of the call pattern recognition model so that its output approaches the call state labels of the training data, thereby obtaining the parameters of the model.
Another aspect of the disclosure provides a computer system comprising one or more processors and a storage device, wherein the storage device is configured to store executable instructions that, when executed by the processors, implement the method as described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of the disclosure provides a computer program comprising computer executable instructions for implementing the method as described above when executed.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
fig. 1A schematically illustrates an application scenario of a call state detection method, apparatus, computer system and medium according to an embodiment of the present disclosure;
fig. 1B schematically shows a system architecture diagram suitable for a call state detection method according to an embodiment of the present disclosure;
Fig. 2 schematically illustrates a flow chart of a call state detection method according to an embodiment of the present disclosure;
FIG. 3A schematically illustrates a schematic diagram of feature extraction according to an embodiment of the disclosure;
FIG. 3B schematically illustrates a schematic diagram of binary masking according to an embodiment of the disclosure;
fig. 3C schematically illustrates a flow chart of a call state detection method according to another embodiment of the present disclosure;
fig. 4 schematically illustrates a call state determination process according to an embodiment of the disclosure;
fig. 5 schematically shows a block diagram of a call state detection apparatus according to an embodiment of the present disclosure; and
FIG. 6 schematically shows a block diagram of a computer system according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.). Where a convention analogous to "at least one of A, B or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.). The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as "first" or "second" may explicitly or implicitly include one or more of the described features.
Fig. 1A schematically illustrates an application scenario of a call state detection method, device, computer system, and medium according to an embodiment of the present disclosure.
As shown in fig. 1A, during a call between a subscriber 10 and a subscriber 20, the call state may be a single-ended call state or a double-ended call state: in the single-ended call state one party speaks while the other listens, and in the double-ended call state both parties speak simultaneously. As shown in fig. 1A, user 10 says "Shall we go out after work?" and then "Where do you want to go?" After hearing the first question, user 20 replies "Do you have any good recommendations?", so that user 10's utterance "Where do you want to go?" and user 20's utterance "Do you have any good recommendations?" overlap at least partially in time, making the call state a double-ended call state. When acoustic echo cancellation is performed on the voice signals of subscriber 10 and subscriber 20 using a filter, updating the filter parameters in the double-ended call state may cause them to diverge and reduce the acoustic echo cancellation effect.
It should be noted that, besides the two-user call scenario above, the method can also be applied to multi-user call scenarios; acoustic echo cancellation is likewise involved in teleconferencing. In general, the method is applicable to any scenario requiring acoustic echo cancellation.
The existence of acoustic echo can seriously affect a call and may even prevent it from proceeding normally, so acoustic echo must be cancelled during the call. Call state detection is a key factor in acoustic echo cancellation technology, and its accuracy determines to a large extent the quality of the cancellation. Acoustic echo cancellation can be performed with least mean square (LMS) algorithms based on the Wiener filtering principle; in the double-ended call state, the coefficient update of such an algorithm is disturbed, causing the filter coefficients to diverge. The solution is to detect whether the current state is a double-ended call state: if so, stop updating the filter coefficients; if not, update them, thereby ensuring fast convergence and real-time updating of the echo cancellation algorithm.
The conventional call state detection algorithms can be roughly classified into the following categories.
The first category is the Geigel algorithm based on energy detection, which has a small computational cost but an insufficiently accurate detection result.
The second category is call state detection algorithms based on the orthogonality principle. Such algorithms do not distinguish the double-ended call state from echo path changes, and use only whether the adaptive filter has converged as the basis for updating the filter coefficients, so the timing of coefficient updates is not grasped accurately.
The third category is cross-correlation-based call state detection algorithms. These have a high decision accuracy once the filter has converged, but their detection performance cannot be guaranteed in changing acoustic environments.
The fourth category is call state detection algorithms based on frequency-domain cross-correlation with Gaussian mixture models (GMMs). Compared with time-domain double-talk detection based on cross-correlation, such algorithms can significantly reduce the error probability. However, they need to obtain feature estimates by means of, for example, an AEC algorithm, and cannot obtain the input parameters online.
From the point of view of pattern recognition, call state detection is a binary classification problem, though the effects of environmental variability and of non-stationary signal variability must be considered. In essence, call state detection must determine whether the current signal is in a single-ended or double-ended call state, so as to instruct the echo cancellation algorithm whether to update the filter parameters.
The embodiments of the present disclosure provide a call state detection method comprising a feature extraction process and a pattern recognition process. In the feature extraction process, speech features of the voice signal are obtained, the speech features being based on auditory properties and decoupled from the algorithm that cancels echo in the voice signal. After feature extraction, the pattern recognition process inputs the speech features into a call pattern recognition model and determines whether the call state of the voice signal is a single-ended or double-ended call state. By adopting features based on psychoacoustic characteristics as input, the need to estimate features by means of other auxiliary algorithms is avoided while maintaining high accuracy, so the call state can be detected accurately online.
Fig. 1B schematically illustrates a system architecture diagram suitable for a call state detection method according to an embodiment of the present disclosure. It should be noted that fig. 1B is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1B, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 for voice interaction over the network 104. In addition, the terminal devices 101, 102, 103 may have various messaging client applications installed thereon, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices with a telephone dialing function including, but not limited to, smart phones, tablets, laptop and desktop computers, smart watches, smart glasses, smart speakers, and the like.
The server 105 may be a server providing various services, for example one that establishes a link between the terminal device 101 used by the user and the dialed terminal device 102, or a background management server (by way of example only) that supports websites browsed by users of the terminal devices 101, 102, 103. The background management server may analyze and otherwise process received data such as user requests, and feed the processing result (e.g., a webpage, information, or data obtained or generated according to the user request) back to the terminal device.
It should be noted that the call state detection method provided by the embodiment of the present disclosure may be generally executed by a terminal device. Accordingly, the call state detection apparatus provided by the embodiments of the present disclosure may be generally disposed in a terminal device.
It should be understood that the number of terminal devices, networks, and servers are merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically shows a flow chart of a call state detection method according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes operations S201 to S205.
In operation S201, a voice signal is obtained.
In this embodiment, a voice signal sent by the user may be captured by the microphone. The captured signal may include an echo signal and a noise signal: the echo signal may be far-end speech, sent by the far-end electronic device, played by the loudspeaker of the near-end electronic device and picked up again by the near-end microphone. The echo signal may also be an electrical echo of the speech signal reflected due to a circuit mismatch in the near-end electronics.
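The following toy sketch, using assumed signal lengths and a made-up room impulse response rather than anything specified in this disclosure, illustrates this signal model of near-end speech plus loudspeaker echo plus noise.

import numpy as np

rng = np.random.default_rng(0)
near = rng.standard_normal(16000)                     # near-end speech (1 s at 16 kHz)
far = rng.standard_normal(16000)                      # far-end signal played by the loudspeaker
rir = np.exp(-np.arange(256) / 40.0) * rng.standard_normal(256)  # toy echo path (room response)
echo = np.convolve(far, rir)[:16000]                  # acoustic echo reaching the microphone
mic = near + echo + 0.01 * rng.standard_normal(16000) # signal actually captured by the microphone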
In operation S203, speech features of the voice signal are acquired, the speech features being based on auditory properties and decoupled from the algorithm that cancels echo in the voice signal.
In this embodiment, the speech features may include at least one of: Mel-frequency cepstral coefficient (MFCC) features, amplitude modulation spectrum (AMS) features, relative spectral transform perceptual linear prediction (RASTA-PLP) features, and filter bank power spectrum (GF) features.
MFCC features are cepstral parameters extracted on the Mel frequency scale. They do not depend on the nature of the signal, make no assumptions or restrictions on the input, and draw on the results of auditory-model research, so they are robust, match the auditory characteristics of the human ear well, and retain good recognition performance when the signal-to-noise ratio drops. The MFCC, AMS, RASTA-PLP and GF features are all based on psychoacoustic characteristics, can closely simulate the human ear's perception of sound, and can be used to characterize whether a voice signal contains two components with different sounding characteristics. For example, the correlation of the energy distribution in the spectrogram (a value between 0 and 1) differs between the single-talk and double-talk states. The MFCC features may, for example, be 40-dimensional: 13 MFCC coefficients with their first- and second-order delta features, plus a one-dimensional energy feature. Features based on psychoacoustic characteristics exploit the masking property of the human ear, whereby a high-energy signal masks a lower-energy one to some extent; this masking effect can be used to obtain masking information for the frequency points in different speaking states, and that information can in turn be used to determine the call state of the current voice signal.
In the call state detection process, one related technique analyzes a discriminative characteristic that is effective for state decision, selects a corresponding feature quantity as input, and decides by methods such as energy or correlation analysis. In that approach, the usual feature extraction method obtains the values of the corresponding time-frequency parameters through the adaptive process of an echo cancellation algorithm, i.e. by means of the echo cancellation algorithm itself. Since the accuracy of the call state detection algorithm affects the performance of the echo cancellation algorithm to a certain extent, the two algorithms influence and constrain each other; that is, the features extracted in the related art are coupled with the echo cancellation algorithm. Any of the four acoustic features adopted by the embodiments of the present disclosure eliminates this mutual influence and constraint caused by other auxiliary algorithms, i.e. the speech features of the present disclosure are decoupled from the echo cancellation algorithm.
In another embodiment, the speech features may also include auxiliary features, where an auxiliary feature is coupled with the algorithm that cancels echo in the voice signal; that is, the auxiliary feature is a cross-correlation function estimated with the help of the echo cancellation algorithm. Such auxiliary features, which depend on cross-correlation quantities from the echo cancellation algorithm, can be added on top of the four features above to enhance the modeling capability of the neural network.
Fig. 3A schematically illustrates a schematic diagram of feature extraction according to an embodiment of the present disclosure.
As shown in fig. 3A, feature extraction is performed on the received voice signal to obtain its MFCC, AMS, RASTA-PLP and GF features. Taking MFCC as an example, the extraction process may include: pre-emphasis, framing, windowing, time-frequency transformation, filtering by a Mel filter bank, taking the logarithm (log) of the energy values, discrete cosine transformation, and differencing.
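As a hedged illustration of the 40-dimensional feature mentioned earlier (13 MFCCs plus first- and second-order deltas plus one log-energy dimension), the following sketch uses the librosa library; the sampling rate and librosa's default frame parameters are assumptions, not values from this disclosure.

import numpy as np
import librosa

def extract_mfcc_features(y, sr=16000):
    """Assemble a 40 x T feature matrix: 13 MFCCs, their deltas,
    delta-deltas, and a log-energy row."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, T)
    d1 = librosa.feature.delta(mfcc)                     # first-order deltas
    d2 = librosa.feature.delta(mfcc, order=2)            # second-order deltas
    log_e = np.log(librosa.feature.rms(y=y) + 1e-10)     # (1, T) log energy
    return np.vstack([mfcc, d1, d2, log_e])              # (40, T)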
Moreover, the speech features may also include voiceprint features and the like. Each person has a unique voiceprint, so identity authentication can be based on it. Although the voiceprint of an echo may differ somewhat from the voiceprint of speech uttered directly by the user, the difference is generally tolerable. For example, if the voice signal contains only two voiceprint features whose similarity exceeds a first preset threshold and which are highly stable, the current state may be considered a single-ended call state, with one of the two signals being an echo. Conversely, if the voice signal contains two voiceprint features whose similarity is below a second preset threshold, the two signals may belong to different speakers and the current state may be a double-ended call state, where the second preset threshold is smaller than the first. The first threshold may be, for example, 90%, 95%, 98% or 99%.
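A minimal sketch of this two-threshold voiceprint heuristic follows; the embedding extractor and the threshold values are hypothetical, not specified by this disclosure.

import numpy as np

def voiceprint_state(emb_a, emb_b, high=0.95, low=0.60):
    """Classify the call state from two voiceprint embeddings found in one signal."""
    sim = emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    if sim >= high:
        return "single-talk"   # near-identical voiceprints: one copy is likely an echo
    if sim <= low:
        return "double-talk"   # clearly distinct voiceprints: two speakers
    return "undecided"         # fall back to the other speech features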
Additionally, the speech features may further include acoustic features comprising at least one of: the signal-to-noise ratio of a speech segment (e.g. a specified number of frames, or the frames between two pauses), the average volume of the segment, and the bearing angle between the segment and the main microphone, i.e. the angle between the horizontal and the line joining the segment's sound source to the main microphone. A user's speech differs noticeably from an echo in average volume and signal-to-noise ratio, and the bearing angles of different users' sound sources usually differ as well. As shown in fig. 1A, the two users' bearing angles differ clearly, so these angles can be used to determine whether two sound sources are currently present; if so, this supports determining a double-ended call state. Through this choice of acoustic features, the accuracy of the call state recognition result can be improved effectively while preserving online detection and keeping the computation as small as possible.
In other embodiments, after the speech features of the voice signal are obtained, semantic features including semantic completeness may also be derived from them. For example, speech features are extracted from the current speech segment (e.g. the speech frames between pauses, or the frames within a specified duration calibrated as the time a user needs to express one complete meaning), and whether the segment carries a complete meaning is judged from its semantic-understanding result. The semantic features may be obtained through grammar-rule-based semantic understanding, ontology-knowledge-base-based semantic understanding, model-based semantic understanding and the like, without limitation here. In a double-ended call state, at least two users are speaking and different meanings may be mixed in the voice signal, so a complete meaning cannot be parsed. Note that the semantic feature only assists in determining the current call state and must be combined with the speech features; for example, its weight may be set smaller than that of the speech features.
In operation S205, the voice feature is input into a call mode recognition model, and a call state of the voice signal is determined to be a single-ended call state or a double-ended call state.
In this embodiment, the call pattern recognition model includes, but is not limited to, at least one of: regression analysis, decision trees, artificial neural networks, Bayesian networks, support vector machines, and the like. The artificial neural network may be a deep neural network (DNN).
The following description will be given taking a neural network as an example. The topological structure of the call pattern recognition model is a neural network with a first specified number of layers.
Optionally, the input of the call pattern recognition model is the speech features of the current frame of the voice signal together with at least one of: the speech features of a second specified number of frames before the current frame, and of a third specified number of frames after it. For example, the first specified number may be 3, 4, 5, 7, 9, 15, 20, etc. The second specified number may be the same as or different from the third, and may be 0, 1, 2, 3, 5, 7, 10, 15, 20, etc. The second specified number may be set empirically or according to some rule, such as the number of frames needed to capture the correlation between adjacent frames, or the number of frames needed to express one complete meaning.
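A sketch of this context-frame stacking is given below; the 5-before/5-after window matches the specific embodiment described later, while the per-frame feature dimensionality is an assumption.

import numpy as np

def stack_context(feats, left=5, right=5):
    """feats: (T, D) per-frame features -> (T, (left + 1 + right) * D) model inputs."""
    T, D = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")  # repeat edge frames
    return np.stack([padded[t:t + left + 1 + right].ravel() for t in range(T)])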
The output of the call pattern recognition model comprises a binary mask, and the binary mask comprises the call states of a plurality of frequency points, the frequency points being those included in the spectrogram of one frame of the voice signal.
In a specific embodiment, a 5-layer neural network structure can be adopted, with input from the 5 frames before and the 5 frames after the current frame added for enhancement, so that the neural network can determine the call state more accurately using the correlation between preceding and following frames. Because neural networks have very good nonlinear modeling capability, features based on auditory properties can be fed in so that this capability learns more discriminative features.
Fig. 3B schematically illustrates a schematic diagram of binary masking according to an embodiment of the present disclosure.
As shown in fig. 3B, the output of the call pattern recognition model of the present disclosure is an ideal binary mask. A frame of the signal is transformed to the frequency domain by a short-time Fourier transform, and for each frequency bin it is determined whether that bin belongs to the single-ended call state (S = 0) or the double-ended call state (D = 1). In this way the output is decided at a finer granularity, and whether the frame belongs to the single-ended or double-ended call state is then judged from the values of all frequency points in the frame. In a spectrogram obtained by a fast Fourier transform (FFT) of the raw voice data, the frequency spacing or resolution of the frequency axis generally depends on the sampling rate and the number of sampling points: one frequency bin is the sampling rate divided by the number of sampling points. As shown in fig. 3B, more frequency points are in the double-talk state than in the single-talk state, so the current speech frame is judged to be in the double-talk state.
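The following sketch shows this per-bin-to-per-frame decision as a simple majority vote, together with the bin-width computation just described; the 0.5 threshold, sampling rate and FFT size are assumptions.

import numpy as np

def frame_state(mask_frame):
    """mask_frame: (n_bins,) array of 0/1 values (S = 0, D = 1) for one frame."""
    return "double-talk" if mask_frame.mean() > 0.5 else "single-talk"

# Frequency resolution of one bin: sampling rate / number of FFT points,
# e.g. 16000 Hz / 512 points = 31.25 Hz per bin.
bin_width_hz = 16000 / 512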
In another embodiment, the call pattern recognition model further comprises a classification layer whose output is the call state of the voice signal. Referring to fig. 3B, the classification layer may be a fully connected layer that determines the call state by directly applying weighted summation, normalization and similar operations to the binary-mask output, based on the weight of each frequency point.
Training the call pattern recognition model may include inputting the speech features of training data, the training data being voice signals with call state labeling information, and then adjusting the parameters of the model so that its output approaches the call state labels of the training data, thereby obtaining the parameters of the call pattern recognition model.
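A minimal training sketch consistent with this description is given below in PyTorch: a 5-layer network maps the stacked features to a per-bin binary mask and is fit against the labelled masks. The layer widths, optimizer settings and input/output sizes are assumptions for illustration, not values from this disclosure.

import torch
import torch.nn as nn

n_in, n_bins = 11 * 40, 257     # 11 context frames x 40 features; bins of a 512-point FFT
model = nn.Sequential(
    nn.Linear(n_in, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, n_bins), nn.Sigmoid(),   # per-bin double-talk probability
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

def train_step(x, mask):            # x: (B, n_in); mask: (B, n_bins) with 0/1 labels
    opt.zero_grad()
    loss = loss_fn(model(x), mask)  # push the output toward the labelled mask
    loss.backward()
    opt.step()
    return loss.item()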
It should be noted that the input of the call pattern recognition model may further include the speech features of a reference signal, where the reference signal is obtained by tapping the sound output to the loudspeaker and may include a music signal and the like.
Of course, embodiments of the present disclosure may also perform acoustic echo cancellation on the voice signal using a filter; in that process, after the call state of the voice signal is determined, whether to update the filter parameters can be decided based on that state.
Fig. 3C schematically shows a flow chart of a call state detection method according to another embodiment of the present disclosure.
As shown in fig. 3C, the method may further include operation S301.
In operation S301, after determining whether the call state of the voice signal is a single-ended or double-ended call state, acoustic echo cancellation is performed on the voice signal using a filter, and the parameters of the filter are updated if the voice signal is determined to be in the single-ended call state.
In this embodiment, the method of performing acoustic echo cancellation with the filter may be the same as in the prior art and is not detailed here. To ensure that the filter coefficients do not diverge after updating, and thus that the filter converges quickly and updates in real time, the voice signal must be in the single-ended call state; if it is in the double-ended call state, updating of the filter parameters is stopped.
In a particular embodiment, updating the parameters of the filter may include the following operations.
On the one hand, if the filter employs a frequency-domain echo cancellation algorithm, whether to update the filter parameters is determined from the call state of each frequency point. Since frequency points are frequency-domain information, a filter using a frequency-domain echo cancellation algorithm can decide per frequency point whether to update its parameters.
On the other hand, if the filter employs a time-domain echo cancellation algorithm, the call state of the current frame is determined from the call states of the frequency points included in the binary mask, and whether to update the filter parameters is then determined from the call state of the current frame, where the frequency points may have the same or different weights. A time-domain echo cancellation algorithm requires the frequency-domain signal to be converted back to the time domain before echo cancellation is applied. With a call pattern recognition model of DNN topology, the call state of the current frame may be determined by counting, among the frequency points of the frame, how many correspond to the single-ended state and how many to the double-ended state; if more correspond to the single-ended state, the frame is in the single-ended call state and the filter parameters may be updated. In addition, different frequency bands may be given different weights: for example, the lower bands carry more information useful for determining the call state, so they may be weighted more heavily. One band may contain several frequency points, and points within the same band share the same weight. In this way a band-weighted call state for the frame is obtained, improving the accuracy of the determination.
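A sketch of this band-weighted decision follows, gating the time-domain filter update on the result; the number of bands and the weight values are assumptions, with lower bands weighted more heavily as suggested above.

import numpy as np

def weighted_frame_state(mask_frame, band_weights=(4, 3, 2, 1)):
    """mask_frame: (n_bins,) 0/1 mask; bins are grouped into equal bands,
    lower bands weighted higher."""
    n_bands = len(band_weights)
    per_band = -(-len(mask_frame) // n_bands)               # ceil division
    w = np.repeat(band_weights, per_band)[:len(mask_frame)].astype(float)
    score = (w * mask_frame).sum() / w.sum()                # weighted share of double-talk bins
    return "double-talk" if score > 0.5 else "single-talk"

def should_update_filter(mask_frame):
    # Time-domain AEC: update the coefficients only in the single-talk state.
    return weighted_frame_state(mask_frame) == "single-talk"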
Fig. 4 schematically shows a call state determination process according to an embodiment of the disclosure.
As shown in fig. 4, the call state detection method provided by the present disclosure includes a model training phase and a detection phase. In the model training phase, a neural network that discriminates the call state well is trained on the selected speech features: voice signals with call state labeling information are obtained from a training database, feature extraction yields their speech features, and the speech features of each voice signal together with those of its corresponding reference signal are input into the call pattern recognition model, whose output is driven toward the call state labels so that model training yields the model parameters. In the detection phase, the trained neural network judges the call state of each input frame: feature extraction is performed on a received frame to obtain its speech features, these are spliced with the speech features of the corresponding reference signal and input into the trained model, and the call state of the frame, or of each frequency point in it, is obtained. This makes it convenient to decide, based on the call state, whether to update the parameters of the echo-cancelling filter. The reference signal supplies additional information for judging the current call state. For example, when a smart speaker is playing music and a user speaks to it, the signal received by the microphone contains the music signal convolved with the room response plus the user's speech; the reference signal may then be the music currently being played, picked up from the signal the speaker is about to play. The music may contain a singer's voice in addition to instruments, and when the singer's voice mixes with the user's voice, judging the current call state becomes harder and misjudgments may occur. In that case, feeding the speech features of the loopback-sampled reference signal together with those of the microphone signal into the call pattern recognition model improves robustness and thus the accuracy of the call state determination.
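A sketch of this detection step is shown below, reusing the hypothetical extract_mfcc_features() and model from the earlier sketches; splicing the reference-signal features doubles the input width, so the model's input layer would be sized accordingly.

import numpy as np
import torch

def detect_frame(mic_feats, ref_feats, model):
    """mic_feats / ref_feats: stacked per-frame feature vectors; returns per-bin states."""
    x = np.concatenate([mic_feats, ref_feats])                  # spliced model input
    with torch.no_grad():
        mask = model(torch.as_tensor(x, dtype=torch.float32))   # per-bin probabilities
    return mask.numpy() > 0.5                                   # True = double-talk bin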
Fig. 5 schematically shows a block diagram of a call state detection apparatus according to an embodiment of the present disclosure.
As shown in fig. 5, the call state detection apparatus 500 includes a signal obtaining module 510, a feature obtaining module 530, and a state determining module 550.
Wherein the signal obtaining module 510 is configured to obtain a speech signal.
The feature obtaining module 530 is configured to obtain speech features of the voice signal, the speech features being based on auditory characteristics and decoupled from the algorithm that cancels echo in the voice signal.
The state determination module 550 is configured to input the speech features into the call pattern recognition model and determine whether the voice signal is in a single-ended or double-ended call state.
In one embodiment, the speech features include at least one of: a Mel-frequency cepstral coefficient feature, an amplitude modulation spectrum feature, a relative spectral transform perceptual linear prediction coefficient feature, and a filter bank power spectrum feature.
In another embodiment, the speech features may further comprise an auxiliary feature, wherein the auxiliary feature is coupled with the algorithm that cancels echo in the voice signal.
In addition, the speech features may further include the acoustic features, voiceprint features and the like described above, which can help determine the number of current sound sources and the number of speakers. Reference may be made to the description of the relevant parts of the method.
The input of the state determination module comprises the current frame of the voice signal and at least one of: a second specified number of frames before the current frame, and a third specified number of frames after the current frame.
The output of the call pattern recognition model may include a binary mask comprising the call states of a plurality of frequency points, the frequency points being those included in the spectrogram of one frame of the voice signal.
The apparatus 500 may further include an echo cancellation module 570 and a parameter update module 590.
The echo cancellation module 570 is configured to perform acoustic echo cancellation on the voice signal by using a filter after determining that the call state of the voice signal is a single-ended call state or a double-ended call state.
The parameter updating module 590 is configured to update the parameter of the filter if it is determined that the voice signal is in the single-ended call state.
In a specific embodiment, the parameter update module 590 includes a first update unit and a second update unit.
The first updating unit is used for determining whether to update the parameters of the filter based on the call state of each frequency point if the filter adopts the echo cancellation algorithm based on the frequency domain.
The second updating unit is configured to determine a call state of a current frame voice signal based on the call states of a plurality of frequency points included in the binary mask if the filter employs a time domain-based echo cancellation algorithm, and then determine whether to update a parameter of the filter based on the call state of the current frame voice signal, where the plurality of frequency points have the same or different weights.
In addition, the call pattern recognition model may further include a classification layer, an output of which includes a call state of the voice signal.
Further, the apparatus 500 may further include a training module.
The training module is configured to train the call pattern recognition model and may include an input unit and a parameter obtaining unit.
The input unit is configured to input training data, where the training data are speech signals with call state labeling information.
The parameter obtaining unit is configured to adjust the parameters of the call pattern recognition model so that the output of the model approaches the call state labels of the training data, thereby obtaining the parameters of the call pattern recognition model.
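A minimal training loop consistent with this description is sketched below, reusing the CallPatternModel from the earlier sketch; binary cross-entropy and the Adam optimizer are illustrative choices, as the disclosure names neither a loss function nor an optimizer.

```python
# Sketch of training: drive the model output toward the call-state labels.
import torch

def train(model, loader, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCELoss()
    for _ in range(epochs):
        for context_frames, mask_labels in loader:
            opt.zero_grad()
            # Use the raw sigmoid probabilities here; thresholding into a
            # binary mask is only applied at inference time.
            probs = model.net(context_frames.flatten(1))
            loss = loss_fn(probs, mask_labels)
            loss.backward()
            opt.step()
    return model
```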
Any of the modules or units according to the embodiments of the present disclosure, or at least part of the functionality of any of them, may be implemented in one module. Any one or more of the modules and units according to the embodiments of the present disclosure may be split into a plurality of modules for implementation. Any one or more of the modules and units according to the embodiments of the present disclosure may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system in a package, or an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented in any one of, or a suitable combination of, the three implementation forms of software, hardware, and firmware. Alternatively, one or more of the modules and units according to the embodiments of the present disclosure may be implemented at least partially as computer program modules which, when executed, may perform the corresponding functions.
For example, any of the signal obtaining module 510, the feature obtaining module 530, and the state determining module 550 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the signal obtaining module 510, the feature obtaining module 530 and the state determining module 550 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware and firmware. Alternatively, at least one of the signal obtaining module 510, the feature obtaining module 530 and the state determining module 550 may be at least partially implemented as a computer program module, which when executed may perform a corresponding function.
FIG. 6 schematically shows a block diagram of a computer system according to an embodiment of the disclosure. The computer system illustrated in FIG. 6 is only one example and should not impose any limitations on the scope of use or functionality of embodiments of the disclosure.
As shown in fig. 6, a computer system 600 according to an embodiment of the present disclosure includes a processor 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. Processor 601 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 601 may also include onboard memory for caching purposes. Processor 601 may include a single processing unit or multiple processing units for performing different actions of a method flow according to embodiments of the disclosure.
In the RAM 603, various programs and data necessary for the operation of the system 600 are stored. The processor 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. The processor 601 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 602 and/or RAM 603. It is to be noted that the programs may also be stored in one or more memories other than the ROM 602 and RAM 603. The processor 601 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the system 600 may also include an input/output (I/O) interface 605, which is also connected to the bus 604. The system 600 may also include one or more of the following components connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program, when executed by the processor 601, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 602 and/or RAM 603 described above and/or one or more memories other than the ROM 602 and RAM 603.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or sub-combinations of the features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or sub-combinations are not expressly recited in the present disclosure. In particular, various combinations and/or sub-combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or sub-combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (14)

1. A call state detection method comprises the following steps:
obtaining a voice signal;
obtaining speech features of the voice signal, the speech features including features based on auditory characteristics and being decoupled from an algorithm that cancels echo in the voice signal; and
inputting the speech features into a call pattern recognition model, and determining whether the call state of the voice signal is a single-ended call state or a double-ended call state.
2. The method of claim 1, wherein the speech features comprise at least one of: a Mel-frequency cepstral coefficient (MFCC) feature, an amplitude modulation spectrum (AMS) feature, a relative spectral transform perceptual linear prediction (RASTA-PLP) coefficient feature, and a filter bank power spectrum feature.
3. The method of claim 1 or 2,
the speech features further comprise an auxiliary feature;
wherein the auxiliary feature is coupled to the algorithm that cancels echo in the voice signal.
4. The method of claim 1, wherein the topology of the call pattern recognition model is a neural network having a first specified number of layers;
wherein the input of the call pattern recognition model is a current frame of the voice signal and at least one of the following: a second specified number of frames preceding the current frame and a third specified number of frames following the current frame.
5. The method of claim 4, wherein,
the output of the call pattern recognition model comprises a binary mask, wherein the binary mask comprises call states of a plurality of frequency points, and the plurality of frequency points are frequency points included in a spectrogram of one frame of the voice signal.
6. The method of claim 5, further comprising:
after determining that the call state of the voice signal is a single-ended call state or a double-ended call state, performing acoustic echo cancellation on the voice signal by using a filter;
and if the voice signal is determined to be in a single-ended call state, updating the parameters of the filter.
7. The method of claim 6, wherein the updating parameters of the filter comprises:
if the filter adopts an echo cancellation algorithm based on a frequency domain, determining whether to update the parameters of the filter based on the call state of each frequency point; and
and if the filter adopts a time domain-based echo cancellation algorithm, determining the call state of the current frame of the voice signal based on the call states of the plurality of frequency points included in the binary mask, and then determining whether to update the parameters of the filter based on the call state of the current frame, wherein the plurality of frequency points have the same or different weights.
8. The method of claim 4, wherein,
the call pattern recognition model further comprises a classification layer, and an output of the classification layer comprises the call state of the voice signal.
9. A call state detection apparatus comprising:
a signal obtaining module for obtaining a voice signal;
the characteristic acquisition module is used for acquiring the voice characteristics of the voice signal, wherein the voice characteristics are characteristics based on auditory characteristics and are decoupled from an algorithm for eliminating echo in the voice signal; and
and the state determining module is used for inputting the voice characteristics into a call mode recognition model and determining that the voice signal is in a single-end call state or a double-end call state.
10. The apparatus of claim 9, wherein the input to the state determination module comprises a current frame of the voice signal and at least one of: a second specified number of frames preceding the current frame and a third specified number of frames following the current frame.
11. The apparatus of claim 9, further comprising:
the echo cancellation module is used for performing acoustic echo cancellation on the voice signal by using a filter after the call state of the voice signal is determined to be a single-ended call state or a double-ended call state; and
and the parameter updating module is used for updating the parameters of the filter if the voice signal is determined to be in the single-ended call state.
12. The apparatus of claim 11, wherein the parameter update module comprises:
a first updating unit, configured to determine whether to update a parameter of the filter based on a call state of each frequency point if the filter employs a frequency domain-based echo cancellation algorithm; and
and a second updating unit, configured to determine a call state of the current frame voice signal based on the call states of the multiple frequency points included in the binary mask if the filter employs a time domain-based echo cancellation algorithm, and then determine whether to update parameters of the filter based on the call state of the current frame voice signal, where the multiple frequency points have the same or different weights.
13. A computer system, comprising:
one or more processors;
a storage device for storing executable instructions which, when executed by the one or more processors, implement the method of any one of claims 1 to 8.
14. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, implement a method according to any one of claims 1 to 8.
CN201910491201.9A 2019-06-06 2019-06-06 Call state detection method, device, computer system and medium Pending CN112133324A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910491201.9A CN112133324A (en) 2019-06-06 2019-06-06 Call state detection method, device, computer system and medium

Publications (1)

Publication Number Publication Date
CN112133324A true CN112133324A (en) 2020-12-25

Family

ID=73849009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910491201.9A Pending CN112133324A (en) 2019-06-06 2019-06-06 Call state detection method, device, computer system and medium

Country Status (1)

Country Link
CN (1) CN112133324A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004080647A (en) * 2002-08-22 2004-03-11 Alpine Electronics Inc Echo canceller and telephone conversation processor
CN1491018A (en) * 2002-10-14 2004-04-21 中国科学院声学研究所 Echo cancellating and phonetic testing method and apparatus for dialogue interactive front end
CN101179294A (en) * 2006-11-09 2008-05-14 爱普拉斯通信技术(北京)有限公司 Self-adaptive echo eliminator and echo eliminating method thereof
CN105825864A (en) * 2016-05-19 2016-08-03 南京奇音石信息技术有限公司 Double-talk detection and echo cancellation method based on zero-crossing rate
CN105957520A (en) * 2016-07-04 2016-09-21 北京邮电大学 Voice state detection method suitable for echo cancellation system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CAO Jianlin; FANG Jian; ZOU Cairong: "A Novel Double-Talk Detector Based on Pattern Recognition (in English)", Technical Acoustics, no. 06, pages 2-3 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571082A (en) * 2021-01-21 2021-10-29 腾讯科技(深圳)有限公司 Voice call control method and device, computer readable medium and electronic equipment
CN113453124A (en) * 2021-06-30 2021-09-28 苏州科达科技股份有限公司 Audio processing method, device and system

Similar Documents

Publication Publication Date Title
US10504539B2 (en) Voice activity detection systems and methods
WO2021042870A1 (en) Speech processing method and apparatus, electronic device, and computer-readable storage medium
US9640194B1 (en) Noise suppression for speech processing based on machine-learning mask estimation
US9704478B1 (en) Audio output masking for improved automatic speech recognition
KR100636317B1 (en) Distributed Speech Recognition System and method
CN105957520A (en) Voice state detection method suitable for echo cancellation system
CN112735439A (en) Environmentally regulated speaker identification
CN110570853A (en) Intention recognition method and device based on voice data
CN111883182B (en) Human voice detection method, device, equipment and storage medium
CN113241085B (en) Echo cancellation method, device, equipment and readable storage medium
CN112053702B (en) Voice processing method and device and electronic equipment
Garg et al. A comparative study of noise reduction techniques for automatic speech recognition systems
CN112614504A (en) Single sound channel voice noise reduction method, system, equipment and readable storage medium
CN106161820B (en) A kind of interchannel decorrelation method for stereo acoustic echo canceler
Hou et al. Domain adversarial training for speech enhancement
CN112133324A (en) Call state detection method, device, computer system and medium
Martín-Doñas et al. Dual-channel DNN-based speech enhancement for smartphones
CN114338623B (en) Audio processing method, device, equipment and medium
US20150325252A1 (en) Method and device for eliminating noise, and mobile terminal
CN113299306B (en) Echo cancellation method, echo cancellation device, electronic equipment and computer-readable storage medium
JPH02298998A (en) Voice recognition equipment and method thereof
CN110176243A (en) Sound enhancement method, model training method, device and computer equipment
CN115223584B (en) Audio data processing method, device, equipment and storage medium
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN112687284B (en) Reverberation suppression method and device for reverberation voice

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination