CN116074440A - Call state detection method and device and computer readable storage medium

Info

Publication number: CN116074440A
Application number: CN202111291918.2A
Authority: CN
Inventors: 梁俊斌
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-11-02
Filing date: 2021-11-02
Publication date: 2023-05-05

Abstract

An embodiment of the invention discloses a call state detection method, a call state detection device, and a computer-readable storage medium. After current call information sent by a first terminal is received, a voice frame signal in the current call information is analyzed according to attribute information in the current call information to obtain a to-be-played voice frame signal and a current voice frame number of the voice frame signal. The to-be-played voice frame signal is then played, and the played signal is collected to obtain a recording frame signal corresponding to the to-be-played voice frame signal. The current voice frame number is verified based on the attribute information, a cross-correlation value of the recording frame signal and the to-be-played voice frame signal is calculated, and a current call state with the first terminal is determined according to the verification result of the current voice frame number and the cross-correlation value. This scheme can improve the accuracy of call state detection.

Description

Call state detection method and device and computer readable storage medium

Technical Field

The present invention relates to the field of communications technologies, and in particular, to a method and apparatus for detecting a call state, and a computer readable storage medium.

Background

In recent years, with the rapid development of internet technology, instant messaging applications have become increasingly popular. During an audio/video call in an instant messaging application, the parties are in different places, so the speaking party cannot confirm whether the other party can hear them; the current call state therefore needs to be detected. Existing call state detection methods mainly calculate the root mean square (RMS) of the collected signal to detect its volume, and then determine the current call state.

In researching and practicing the prior art, the inventor found that a call state determined by calculating the RMS of the collected signal only indicates whether the local sound collection function is working normally; it cannot indicate whether the listener actually hears the sound. In instant messaging, collected sound passes through many stages such as processing, transmission, and playback, and a failure at any stage can prevent the listener from hearing the sound, so the accuracy of call state detection is insufficient.

Disclosure of Invention

The embodiment of the invention provides a call state detection method, a call state detection device and a computer readable storage medium, which can improve the accuracy of call state detection.

A call state detection method includes:

receiving current call information sent by a first terminal, wherein the current call information comprises a voice frame signal and attribute information of the voice frame signal;

analyzing the voice frame signal according to the attribute information to obtain a voice frame signal to be played and the current voice frame number of the voice frame signal;

playing the to-be-played voice frame signal, and collecting the played to-be-played voice frame signal to obtain a recording frame signal corresponding to the to-be-played voice frame signal;

based on the attribute information, verifying the current voice frame number, and calculating a cross-correlation value of the recording frame signal and the voice frame signal to be played, wherein the cross-correlation value is used for indicating the similarity degree of the recording frame signal and the voice frame signal to be played;

and determining the current call state with the first terminal according to the verification result and the cross-correlation value of the current voice frame number.

Correspondingly, an embodiment of the present invention provides a call state detection device, including:

the receiving unit is used for receiving current call information sent by the first terminal, wherein the current call information comprises a voice frame signal and attribute information of the voice frame signal;

the analyzing unit is used for analyzing the voice frame signal according to the attribute information to obtain a voice frame signal to be played and the current voice frame number of the voice frame signal;

the acquisition unit is used for playing the voice frame signal to be played and collecting the played signal to obtain a recording frame signal corresponding to the voice frame signal to be played;

the verification unit is used for verifying the current voice frame number based on the attribute information, and calculating a cross-correlation value of the recording frame signal and the voice frame signal to be played, wherein the cross-correlation value is used for indicating the similarity degree of the recording frame signal and the voice frame signal to be played;

and the determining unit is used for determining the current call state with the first terminal according to the verification result and the cross-correlation value of the current voice frame number.

Optionally, in some embodiments, the verification unit may be specifically configured to extract a basic voice frame number of the voice frame signal from the attribute information; when both the basic voice frame number and the current voice frame number exceed a preset frame number threshold, compare the basic voice frame number with the current voice frame number; and determine a verification result of the current voice frame number based on the comparison result.

Optionally, in some embodiments, the verification unit may be specifically configured to calculate a ratio of the received voice frame number to the basic voice frame number to obtain a first frame number ratio, and calculate a ratio of the received voice frame number to the played voice frame number to obtain a second frame number ratio; determining the verification result of the current voice frame number based on the comparison result then comprises: when neither the first frame number ratio nor the second frame number ratio exceeds a preset frame number ratio threshold, determining that the current voice frame number passes the verification.

Optionally, in some embodiments, the verification unit may be specifically configured to obtain a preset time delay parameter set between the to-be-played voice frame signal and the recording frame signal; screen out, from the to-be-played voice frame signals, the to-be-played voice frame signal corresponding to each preset time delay parameter in the preset time delay parameter set, to obtain a candidate to-be-played voice frame signal set corresponding to each recording frame signal; and calculate candidate cross-correlation values between each recording frame signal and each to-be-played voice frame signal in the corresponding candidate to-be-played voice frame signal set; determining the current call state with the first terminal according to the verification result of the current voice frame number and the cross-correlation value then comprises: determining the current call state with the first terminal according to the verification result of the current voice frame number and the candidate cross-correlation values.

Optionally, in some embodiments, the verification unit may be specifically configured to calculate the power spectrum value of each frequency point in each recording frame signal to obtain a first power spectrum value; calculate the power spectrum value of each frequency point in each to-be-played voice frame signal in the candidate to-be-played voice frame signal set to obtain a second power spectrum value; and fuse the first power spectrum value and the second power spectrum value to obtain a candidate cross-correlation value between each recording frame signal and each to-be-played voice frame signal in the corresponding candidate to-be-played voice frame signal set.

Optionally, in some embodiments, the verification unit may be specifically configured to calculate the mean of the first power spectrum values to obtain a first power spectrum mean value, and calculate the mean of the second power spectrum values to obtain a second power spectrum mean value; calculate the difference between the first power spectrum value and the first power spectrum mean value to obtain a first power spectrum difference value, and calculate the difference between the second power spectrum value and the second power spectrum mean value to obtain a second power spectrum difference value; and fuse the first power spectrum difference value and the second power spectrum difference value to obtain a candidate cross-correlation value between each recording frame signal and each to-be-played voice frame signal in the corresponding candidate to-be-played voice frame signal set.
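As a rough illustration of the power-spectrum-based computation described above, the following Python sketch computes the power spectrum of a recording frame and of a candidate to-be-played frame, removes each spectrum's mean, and fuses the two difference sequences into a single value. The text does not fix the fusion formula, so the normalized inner product used here is an assumption, as are the function names `power_spectrum` and `candidate_cross_correlation` and the use of a naive DFT in place of an FFT.

```python
import cmath
import math


def power_spectrum(frame):
    """Power at each frequency point (bin 0..n/2) of one frame, via a naive DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def candidate_cross_correlation(recording_frame, playback_frame):
    """Mean-removed, normalized correlation of two power spectra.

    Both frames are assumed to have the same length. The normalization is an
    assumed form of the "fusion" step, chosen so the result lies in [-1, 1].
    """
    p1 = power_spectrum(recording_frame)   # first power spectrum values
    p2 = power_spectrum(playback_frame)    # second power spectrum values
    m1 = sum(p1) / len(p1)                 # first power spectrum mean value
    m2 = sum(p2) / len(p2)                 # second power spectrum mean value
    d1 = [v - m1 for v in p1]              # first power spectrum difference values
    d2 = [v - m2 for v in p2]              # second power spectrum difference values
    num = sum(a * b for a, b in zip(d1, d2))
    den = math.sqrt(sum(a * a for a in d1) * sum(b * b for b in d2))
    return num / den if den > 0 else 0.0
```

With this normalization, an attenuated copy of the played frame still scores close to 1, while a frame with energy at different frequency points scores near or below 0.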

Optionally, in some embodiments, the determining unit may be specifically configured to determine a playing state of the to-be-played voice frame signal based on the candidate cross-correlation value; and when the playing state is normal playing and the verification of the current voice frame number is passed, determining that the current conversation state with the first terminal is normal conversation.

Optionally, in some embodiments, the determining unit may be specifically configured to compare the candidate cross-correlation value with a preset cross-correlation threshold; counting the number of candidate cross-correlation values exceeding the preset cross-correlation threshold under each preset time delay parameter in the comparison result to obtain a basic statistics number set; and determining the playing state of the voice frame signal to be played based on the basic statistical quantity set.

Optionally, in some embodiments, the determining unit may be specifically configured to calculate a mean value of the base statistics number in the base statistics number set, to obtain a base statistics number mean value; screening out the basic statistical quantity with the largest value from the basic statistical quantity set to obtain a target statistical quantity; when the target statistical quantity is the same as a preset quantity threshold value, calculating a quantity ratio between the target statistical quantity and the basic statistical quantity average value; and when the number ratio exceeds a preset number ratio threshold, determining that the playing state of the voice frame signal to be played is normal playing.
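The decision logic of the two preceding paragraphs can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `playback_state`, the dictionary layout, and the string return values are hypothetical, and the condition on the target statistical quantity is read here as "reaches the preset quantity threshold", which is an interpretation of the text rather than a confirmed rule.

```python
def playback_state(corr_by_delay, corr_threshold, count_threshold, ratio_threshold):
    """Decide the playing state from candidate cross-correlation values.

    corr_by_delay maps each preset time delay parameter to the candidate
    cross-correlation values computed under that delay (one per recording frame).
    """
    if not corr_by_delay:
        return "abnormal"
    # Basic statistics set: per delay, how many values exceed the threshold.
    counts = {delay: sum(1 for c in values if c > corr_threshold)
              for delay, values in corr_by_delay.items()}
    mean_count = sum(counts.values()) / len(counts)   # basic statistics mean
    target = max(counts.values())                     # target statistical quantity
    # Assumed reading: the largest count must reach the quantity threshold.
    if target < count_threshold or mean_count == 0:
        return "abnormal"
    # A count that stands out from the mean suggests one dominant echo delay.
    return "normal" if target / mean_count > ratio_threshold else "abnormal"
```

The ratio test rewards a single delay whose count stands well above the average, which fits a playback echo that arrives with a roughly constant delay.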

Optionally, in some embodiments, the determining unit may be specifically configured to generate, according to the current call state, a prompt message of a normal call; and sending the prompt information to the first terminal, so that the first terminal prompts the current call state according to the prompt information.

Optionally, in some embodiments, the parsing unit may be specifically configured to decode the speech frame signal to obtain a decoded speech frame signal; post-processing the decoded voice frame signal to obtain a voice frame signal to be played; and based on the attribute information, carrying out voice frame number statistics on the decoded voice frame signal and the voice frame signal to be played to obtain the current voice frame number of the voice frame signal.

Optionally, in some embodiments, the parsing unit may be specifically configured to extract a voice frame identifier of the voice frame signal from the attribute information; counting the voice frame identifiers in the decoded voice frame signals to obtain the received voice frame numbers of the decoded voice frame signals; performing voice detection on the to-be-played voice frame signal to obtain the played voice frame number of the to-be-played voice frame signal; and taking the received voice frame number and the played voice frame number as the current voice frame number of the voice frame signal.

Optionally, in some embodiments, the call state detection device may further include a sending unit, where the sending unit may specifically be configured to collect a currently generated sound signal when a sound signal is detected, and perform frame processing on the sound signal to obtain a plurality of sound frame signals; encoding the sound frame signal to generate target call information; and sending the target call information to a second terminal, so that the second terminal detects the call state based on the target call information.

Optionally, in some embodiments, the sending unit may specifically be configured to perform voice detection on the sound frame signal, and generate a target voice frame identifier based on the voice detection result; preprocess the sound frame signal, and perform voice coding on the preprocessed sound frame signal to obtain a target voice frame signal; count the target voice frame identifiers in the target voice frame signals to obtain a target voice frame number, and take the target voice frame identifier and the target voice frame number as target attribute information of the target voice frame signal; and fuse the target voice frame signal and the target attribute information of the target voice frame signal to obtain target call information.

In addition, the embodiment of the invention also provides electronic equipment, which comprises a processor and a memory, wherein the memory stores an application program, and the processor is used for running the application program in the memory to realize the call state detection method provided by the embodiment of the invention.

In addition, the embodiment of the invention also provides a computer readable storage medium, which stores a plurality of instructions, wherein the instructions are suitable for being loaded by a processor to execute the steps in any call state detection method provided by the embodiment of the invention.

In the embodiments of the invention, after current call information sent by a first terminal is received, the voice frame signal in the current call information is analyzed according to the attribute information in the current call information to obtain a to-be-played voice frame signal and a current voice frame number of the voice frame signal; the to-be-played voice frame signal is then played, and the played signal is collected to obtain a recording frame signal corresponding to the to-be-played voice frame signal; the current voice frame number is then verified based on the attribute information, a cross-correlation value of the recording frame signal and the to-be-played voice frame signal is calculated, and the current call state with the first terminal is determined according to the verification result of the current voice frame number and the cross-correlation value. Because the scheme identifies voice in the collected signal, verifies at the listening end the number of voice frames across the whole instant messaging process, and determines at the listening end whether the voice was actually played by calculating the cross-correlation between the to-be-played voice frame signal and the recording frame signal, voice can be monitored over the entire call, which improves the accuracy of call state detection.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic view of a call state detection method according to an embodiment of the present invention;

Fig. 2 is a flow chart of a call state detection method according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of determining whether a recording function is disabled according to an embodiment of the present invention;

Fig. 4 is a schematic page diagram of a call page displayed at a first terminal according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of a detection process of call state detection according to an embodiment of the present invention;

Fig. 6 is another flow chart of a call state detection method according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a call state detection device according to an embodiment of the present invention;

Fig. 8 is another schematic structural diagram of a call state detection device according to an embodiment of the present invention;

Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.

The embodiment of the invention provides a call state detection method, a call state detection device and a computer readable storage medium. The call state detection device may be integrated in an electronic device, and the electronic device may be a server or a device such as a terminal.

The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN) services, and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.

For example, referring to Fig. 1, take the case in which the call state detection device is integrated in an electronic device. After the electronic device receives current call information sent by a first terminal, it analyzes the voice frame signal in the current call information according to the attribute information in the current call information to obtain a to-be-played voice frame signal and a current voice frame number of the voice frame signal. It then plays the to-be-played voice frame signal and collects the played signal to obtain a recording frame signal corresponding to the to-be-played voice frame signal. Next, it verifies the current voice frame number based on the attribute information, calculates a cross-correlation value of the recording frame signal and the to-be-played voice frame signal, and determines the current call state with the first terminal according to the verification result of the current voice frame number and the cross-correlation value, thereby improving the accuracy of call state detection.

The call state indicates whether, when the parties to a call conduct an audio/video call through their respective communication terminals in an instant messaging application, each party can receive and play the voice information sent by the other communication terminals. The call state may include a normal call and an abnormal call. In a normal call, the sender can send call information normally and the listener can receive and play the call information normally, so the current call can continue. Correspondingly, in an abnormal call, the sender cannot send call information normally, or the listener cannot receive or cannot play the call information normally. In instant messaging applications, the call state has a great impact on call quality.

Detailed descriptions are given below. The order in which the following embodiments are described is not intended to limit the preferred order of the embodiments.

The present embodiment will be described from the perspective of a call state detection device, which may be specifically integrated in an electronic device, where the electronic device may be a server or a device such as a terminal; the terminal may include a tablet computer, a notebook computer, a personal computer (PC), a wearable device, a virtual reality device, or another intelligent device capable of detecting a call state.

A call state detection method includes:

receiving current call information sent by a first terminal, wherein the current call information comprises a voice frame signal and attribute information of the voice frame signal; analyzing the voice frame signal according to the attribute information to obtain a to-be-played voice frame signal and the current voice frame number of the voice frame signal; playing the to-be-played voice frame signal, and collecting the played signal to obtain a recording frame signal corresponding to the to-be-played voice frame signal; verifying the current voice frame number based on the attribute information, and calculating a cross-correlation value of the recording frame signal and the to-be-played voice frame signal, wherein the cross-correlation value indicates the degree of similarity between the recording frame signal and the to-be-played voice frame signal; and determining the current call state with the first terminal according to the verification result of the current voice frame number and the cross-correlation value.

As shown in fig. 2, the specific flow of the call state detection method is as follows:

101. Receive the current call information sent by the first terminal.

The current call information may include a voice frame signal and attribute information of the voice frame signal. The voice frame signal may be a frame identified as voice by performing voice activity detection on recording frame signals, and a recording frame signal may be a frame obtained by collecting the sound made by a user through the terminal's microphone and dividing it into frames of equal length according to a certain window size.

The attribute information of the voice frame signal may be information describing attributes of the voice frame signal; for example, it may include a voice frame identifier of the voice frame signal and a basic voice frame number of the voice frame signal, where the basic voice frame number may be the number of voice frame signals among the recording frame signals.
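One possible in-memory layout for the current call information is sketched below in Python. The class and field names (`VoiceFrame`, `CallInfo`, `has_voice_id`, `base_voice_frame_count`) are hypothetical, and modeling the voice frame identifier as a per-frame boolean flag is an assumption; the text only states that the attribute information carries a voice frame identifier and a basic voice frame number.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VoiceFrame:
    payload: bytes       # encoded voice frame signal
    has_voice_id: bool   # voice frame identifier set by the sender (assumed flag)


@dataclass
class CallInfo:
    frames: List[VoiceFrame]
    base_voice_frame_count: int  # basic voice frame number counted by the sender


def build_call_info(frames: List[VoiceFrame]) -> CallInfo:
    """Sender-side helper: attach the count of frames marked as voice."""
    base = sum(1 for f in frames if f.has_voice_id)
    return CallInfo(frames=frames, base_voice_frame_count=base)
```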

The manner of receiving the current call information sent by the first terminal may be various, and specifically may be as follows:

For example, an information transmission channel is established with the first terminal over the network, and the current call information sent by the first terminal is received through this channel. Alternatively, when the current call information is sensitive, encrypted information of the current call information sent by the first terminal may be received and decrypted based on a preset encryption protocol to obtain the current call information. Alternatively, when the current call information occupies a large amount of memory or is large in quantity, it may be obtained from the first terminal indirectly.

The method for indirectly obtaining the current call information from the first terminal may be various, for example, when the first terminal sends the current call information to the server or the third party terminal, the current call information corresponding to the first terminal sent by the server or the third party terminal may be received, or a call request sent by the first terminal is received, where the call request carries a storage address of the current call information, and based on the storage address, the current call information is obtained from the first terminal, the third party terminal or a call information database.

102. Analyze the voice frame signal according to the attribute information to obtain the voice frame signal to be played and the current voice frame number of the voice frame signal.

The to-be-played voice frame signal can be understood as a voice frame signal which is obtained after decoding and post-processing the voice frame signal and needs to be played.

The current speech frame number may include a received speech frame number and a play speech frame number, where the received speech frame number may be a frame number of the received speech frame signal, and the play speech frame number may be a frame number of the speech frame signal to be played.

The method for analyzing the voice frame signal may be various, and specifically may be as follows:

For example, the voice frame signal is decoded to obtain a decoded voice frame signal, the decoded voice frame signal is post-processed to obtain a voice frame signal to be played, and based on the attribute information, the decoded voice frame signal and the voice frame signal to be played are subjected to voice frame number statistics to obtain the current voice frame number of the voice frame signal.

The method for decoding the voice frame signal may be various, for example, decoding may be performed by adopting a corresponding decoding strategy according to a coding format of the voice frame signal, or a terminal identifier of the first terminal may be obtained, a target decoding protocol corresponding to the terminal identifier is selected from preset decoding protocols, and the voice frame signal is decoded based on the target decoding protocol, so as to obtain a decoded voice frame signal.

After the voice frame signal is decoded, the decoded voice frame signal may be post-processed. Post-processing can be understood as the counterpart of the pre-processing that the first terminal applies to the recording signal; its main purpose is to convert the decoded voice frame signal into a voice frame signal suitable for playback by the call state detection device. Post-processing may take various forms, for example, adapting the decoded voice frame signal to the playback device currently in use, compressing or amplifying the decoded voice frame signal, or setting playback parameters for it, so that the decoded voice frame signal is converted into a to-be-played voice frame signal that can be played directly.

After the decoded voice frame signal is post-processed, voice frame numbers may be counted for the decoded voice frame signal and the to-be-played voice frame signal based on the attribute information to obtain the current voice frame number of the voice frame signal. Various counting methods may be used. For example, the voice frame identifier of the voice frame signal is extracted from the attribute information; the voice frame identifiers in the decoded voice frame signal are counted to obtain the received voice frame number of the decoded voice frame signal; voice detection is performed on the to-be-played voice frame signal to obtain the played voice frame number of the to-be-played voice frame signal; and the received voice frame number and the played voice frame number are taken as the current voice frame number of the voice frame signal.

The statistics of the voice frame identifier in the decoded voice frame signal may be performed in various manners, for example, the decoded voice frame signal containing the voice frame identifier may be detected in the decoded voice frame signal to obtain a target decoded voice frame signal, and the number of target decoded voice frame signals is counted, so that the number of received voice frames may be obtained, or the number of decoded voice frame signals containing the voice frame identifier may be identified in the decoded voice frame signal, so that the number of received voice frames may be counted.

Voice detection on the to-be-played voice frame signal may be performed in various ways. For example, a voice activity detection (VAD) algorithm may be used to detect whether each to-be-played voice frame signal is a voice signal, giving a detection result for each to-be-played voice frame signal; based on the detection results, the to-be-played voice frame signals detected as voice are counted to obtain the played voice frame number. The played voice frame number therefore indicates the number of to-be-played voice frame signals whose signal type is voice.
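The two counts that make up the current voice frame number can be sketched as follows. The function names and the `(samples, has_voice_id)` pair layout are hypothetical, and the energy-threshold check is only a stand-in for a real VAD algorithm, which the text does not specify.

```python
def count_received_voice_frames(decoded_frames):
    """Received voice frame number: decoded frames carrying the voice frame identifier.

    decoded_frames is a list of (samples, has_voice_id) pairs (assumed layout).
    """
    return sum(1 for _, has_voice_id in decoded_frames if has_voice_id)


def is_voice(samples, energy_threshold=1e-3):
    """Stand-in VAD: mean energy above a threshold (not a production detector)."""
    if not samples:
        return False
    return sum(s * s for s in samples) / len(samples) > energy_threshold


def count_played_voice_frames(playback_frames, energy_threshold=1e-3):
    """Played voice frame number: to-be-played frames that the VAD marks as voice."""
    return sum(1 for frame in playback_frames if is_voice(frame, energy_threshold))
```

A mismatch between the two counts is what the later verification step is meant to catch, for example when decoded frames are lost or muted before playback.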

103. Play the voice frame signal to be played, and collect the played signal to obtain a recording frame signal corresponding to the voice frame signal to be played.

The recording frame signal may be a frame signal obtained by collecting sound with the microphone of the call state detection device while the to-be-played voice frame signal is played, and then windowing the collected signal according to a certain window size.

The manner of collecting the recording frame signal corresponding to the voice frame signal to be played can be various, and specifically can be as follows:

For example, the to-be-played voice frame signal is played through a loudspeaker of the call state detection device; while it is being played, the played signal is recorded through a microphone to obtain a recording signal, and the recording signal is divided into frames according to the window length of each frame in the to-be-played voice frame signal, so as to obtain recording frame signals with the same window length as the to-be-played voice frame signal.
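The framing step can be illustrated with a short Python sketch. The function name `split_into_frames` is hypothetical, and the use of consecutive, non-overlapping frames with the trailing partial frame dropped is an assumption; the text only requires that recording frames have the same window length as the to-be-played frames.

```python
def split_into_frames(samples, frame_len):
    """Split a recording into consecutive, non-overlapping frames of frame_len samples.

    A trailing partial frame is dropped so that every frame has the same length.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```

For instance, with a window length of 4, a 10-sample recording yields two full frames and the last two samples are discarded.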

It should be noted that, after the voice frame signal to be played is converted into sound through electroacoustic conversion, the sound is collected by the microphone after being conducted by air, so as to become an echo signal of the voice frame signal to be played, and finally the echo signal is mixed into the recording signal, so that the recording frame signal is obtained, and the echo time delay is relatively stable.

104. Verify the current voice frame number based on the attribute information, and calculate the cross-correlation value of the recording frame signal and the voice frame signal to be played.

The cross-correlation value indicates the degree of similarity between the recording frame signal and the to-be-played voice frame signal: the greater the cross-correlation value, the more similar the to-be-played voice frame signal is to the recording frame signal; conversely, the smaller the cross-correlation value, the less the to-be-played voice frame signal is related to the recording frame signal.

There are various ways to verify the current voice frame number and to calculate the cross-correlation value of the recording frame signal and the to-be-played voice frame signal; specifically, they may be as follows:

s1, checking the current voice frame number based on attribute information.

The method for checking the current voice frame number can be various, and specifically can be as follows:

for example, the basic speech frame number of the speech frame signal is extracted from the attribute information, when the basic speech frame number and the current speech frame number exceed a preset frame number threshold, the basic speech frame number is compared with the current speech frame number, and a verification result of the current speech frame number is determined based on the comparison result.

The number of basic voice frames may be the number of voice frame signals included in the current call information transmitted by the sender. In addition, since the current call information needs to be normally played, both the basic speech frame number and the current speech frame number need to be greater than 0, and thus the preset frame number threshold may be 0 or other positive integers.

The current speech frame number may include a received speech frame number and a played speech frame number, so the basic speech frame number may be compared with the current speech frame number in various manners. For example, the ratio of the basic speech frame number to the received speech frame number may be calculated to obtain a first frame number ratio, and the ratio of the received speech frame number to the played speech frame number may be calculated to obtain a second frame number ratio; when neither the first frame number ratio nor the second frame number ratio exceeds a preset frame number ratio threshold, the current speech frame number is determined to pass the verification.

The purpose of verifying the current voice frame number is to determine whether packet loss or another abnormal condition occurred while the voice frame signal was received and played, which would prevent the voice frame signal from being played normally. Taking a basic voice frame number Vcnt_0, a received voice frame number Vcnt_1 and a played voice frame number Vcnt_2 as an example, with a preset frame number threshold of 0 and a preset frame number ratio threshold of 1.25, the verification of the current voice frame number can be summarized as requiring the following three verification conditions to be met:

(1) Vcnt_0, Vcnt_1 and Vcnt_2 are all greater than 0;

(2) Check of the basic voice frame number against the received voice frame number: Vcnt_0/Vcnt_1 is less than 1.25; otherwise, the condition is not satisfied;

(3) Check of the received voice frame number against the played voice frame number of the listener: Vcnt_1/Vcnt_2 is less than 1.25; otherwise, the condition is not satisfied.

When all three verification conditions are met, the verification result of the current voice frame number is determined to be a pass: the number of voice frame signals lost during transmission and playing is within a normal range, and the standard for a normal call is met. In addition, the preset frame number ratio thresholds corresponding to the first frame number ratio and the second frame number ratio may be the same or different.
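
The three verification conditions can be sketched as a single function. The function and parameter names are illustrative assumptions; the default thresholds 0 and 1.25 are the example values given above, and a ratio exactly equal to the threshold is treated as a failure, following the "less than 1.25" wording of conditions (2) and (3).

```python
def verify_frame_counts(vcnt_0, vcnt_1, vcnt_2,
                        count_threshold=0, ratio_threshold=1.25):
    """Return True when the frame counts satisfy conditions (1) to (3).

    vcnt_0: basic voice frame number reported by the sender
    vcnt_1: received voice frame number
    vcnt_2: played voice frame number
    """
    # Condition (1): every count must exceed the frame number threshold.
    # Checking this first also guards the divisions below against zero.
    if min(vcnt_0, vcnt_1, vcnt_2) <= count_threshold:
        return False
    # Condition (2): frames lost between sending and receiving.
    if vcnt_0 / vcnt_1 >= ratio_threshold:
        return False
    # Condition (3): frames lost between receiving and playing.
    if vcnt_1 / vcnt_2 >= ratio_threshold:
        return False
    return True


ok = verify_frame_counts(100, 95, 90)         # small losses on both hops
too_lossy = verify_frame_counts(100, 70, 70)  # 100/70 is about 1.43
```

Separate thresholds for conditions (2) and (3) could be passed in the same way if the two ratios are to be judged differently.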

S2, calculating a cross-correlation value of the recording frame signal and the voice frame signal to be played.

For example, a set of preset time delay parameters between the to-be-played voice frame signal and the recording frame signal may be obtained, the to-be-played voice frame signal corresponding to each preset time delay parameter is selected from the to-be-played voice frame signals, a set of candidate to-be-played voice frame signals corresponding to each recording frame signal is obtained, and candidate cross-correlation values between each recording frame signal and each to-be-played voice frame signal in the corresponding set of candidate to-be-played voice frame signals are calculated respectively.

The preset delay parameter set may be a preset set of delay parameters between the to-be-played voice frame signal and the recording frame signal. The to-be-played voice frame signal is converted into sound and played, the sound is conducted through the air and collected by a sound collecting device such as a microphone as an echo signal, and the echo signal is finally mixed into the recording signal from which the recording frame signal is obtained. Because the echo delay is relatively fixed, the recording frame signal of the current frame has a relatively high cross-correlation value P(i, k0) with the to-be-played voice frame signal at a certain position k0 in the playing buffer (the position corresponding to the echo delay), while its cross-correlation values P(i, k) with the to-be-played voice frame signals at other positions in the playing buffer are relatively small.

However, besides the echo signal of the to-be-played voice frame signal, the recording signal also contains other sounds (for example, near-end human voice and other environmental sounds), which interfere with the cross-correlation value between the to-be-played voice frame signal and the recording frame signal, so some statistical means is needed to filter out this interference. The delay parameters are introduced for this purpose. They can be understood as a statistical variable of size M that covers a plurality of possible delay positions. A delay position k means that the echo signal contained in the recording frame signal of the i-th frame corresponds to the to-be-played voice frame signal of the (i-k)-th frame; k is then a preset delay parameter, with k ∈ M.

The to-be-played voice frame signal corresponding to each preset delay parameter may be screened from the to-be-played voice frame signals in various manners. For example, taking preset delay parameters k of 2, 3 and 4 in the preset delay parameter set, the candidate to-be-played voice frame signals corresponding to the recording frame signal of the 5th frame are the frames located 2, 3 and 4 frames before the 5th frame in the playing buffer, that is, the 3rd, 2nd and 1st frames. These three candidate to-be-played voice frame signals form the candidate to-be-played voice frame signal set corresponding to the recording frame signal of the 5th frame. By analogy, the candidate to-be-played voice frame signal set corresponding to each recording frame signal can be obtained.
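
The candidate selection in this example can be sketched as below. The function name is an assumption, frames are numbered from 1 as in the example, and delays that would point before the first frame are simply skipped, which the text does not specify.

```python
def candidate_frame_indices(record_index, delay_params):
    """Return the play-frame indices i - k for each preset delay parameter k.

    record_index is the 1-based index i of the recording frame. Delays that
    would reach before frame 1 are skipped (an assumption).
    """
    return [record_index - k for k in delay_params if record_index - k >= 1]


candidates_for_frame_5 = candidate_frame_indices(5, [2, 3, 4])
```

For the recording frame of the 5th frame this yields the 3rd, 2nd and 1st play frames, in the order of the delay parameters.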

After the candidate to-be-played voice frame signal set corresponding to each recording frame signal is screened out, the candidate cross-correlation value between each recording frame signal and each to-be-played voice frame signal in its candidate set can be calculated in various manners. For example, the power spectrum value of each frequency point in each to-be-played voice frame signal in the candidate to-be-played voice frame signal set may be calculated to obtain a first power spectrum value, the power spectrum value of each frequency point in each recording frame signal may be calculated to obtain a second power spectrum value, and the first power spectrum value and the second power spectrum value are fused to obtain the candidate cross-correlation value between each recording frame signal and the corresponding candidate to-be-played voice frame signal.

The power spectrum may be calculated in various manners. For example, the to-be-played voice frame signal and the recording frame signal of the current frame may each be subjected to Fourier transformation, and the power spectra X(i, j) and D(i, j) of the j-th frequency point in the i-th frame are then calculated, so as to obtain the first power spectrum value and the second power spectrum value. The first power spectrum value X(i, j) of the to-be-played voice frame signal is cached in a ring buffer (which holds the play signal power spectra of at most M frames) for the cross-correlation calculation.
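
A minimal sketch of the power spectrum and the ring buffer follows. It assumes a plain discrete Fourier transform written out by hand so that the example has no dependencies; the names `power_spectrum` and `play_buffer`, the buffer depth of 4 and the toy frame are all illustrative.

```python
import cmath
from collections import deque


def power_spectrum(frame):
    """Return |DFT|^2 for frequency points 0..N/2 of one real-valued frame.

    A naive O(N^2) DFT keeps the sketch dependency-free; a real system would
    normally use an FFT.
    """
    n = len(frame)
    spectrum = []
    for point in range(n // 2 + 1):
        acc = 0j
        for t, sample in enumerate(frame):
            acc += sample * cmath.exp(-2j * cmath.pi * point * t / n)
        spectrum.append(abs(acc) ** 2)
    return spectrum


M = 4  # hypothetical maximum number of buffered play frames
play_buffer = deque(maxlen=M)  # ring buffer: the oldest spectrum is dropped first

tone = [1, 0, -1, 0, 1, 0, -1, 0]  # toy frame whose energy sits at frequency point 2
for _ in range(6):
    play_buffer.append(power_spectrum(tone))
```

With `maxlen=M`, appending a new spectrum once the buffer is full discards the oldest one, so the buffer always holds the most recent play frames that a delay parameter can refer to.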

After the first power spectrum value and the second power spectrum value are calculated, they can be fused in various manners. For example, the average of the first power spectrum value may be calculated to obtain a first power spectrum mean, and the average of the second power spectrum value may be calculated to obtain a second power spectrum mean. The difference between the first power spectrum value and the first power spectrum mean is calculated to obtain a first power spectrum difference, and the difference between the second power spectrum value and the second power spectrum mean is calculated to obtain a second power spectrum difference. The first power spectrum difference and the second power spectrum difference are fused to obtain the candidate cross-correlation value between each recording frame signal and each to-be-played voice frame signal in the corresponding candidate to-be-played voice frame signal set, as specifically shown in formula (1):
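
Formula (1) is not reproduced in this text. One form consistent with the mean-removed power spectra described above is the normalized cross-correlation below. It is given as an assumption, not as the formula of the original publication; the symbols for the first and second power spectrum means, \bar{X}(i-k) and \bar{D}(i), are notation assumed here.

```latex
P(i,k) = \frac{\sum_{j=N_1}^{N_2} \bigl(X(i-k,j) - \bar{X}(i-k)\bigr)\,\bigl(D(i,j) - \bar{D}(i)\bigr)}
              {\sqrt{\sum_{j=N_1}^{N_2} \bigl(X(i-k,j) - \bar{X}(i-k)\bigr)^{2}\;\sum_{j=N_1}^{N_2} \bigl(D(i,j) - \bar{D}(i)\bigr)^{2}}},
\qquad
\bar{X}(i-k) = \frac{1}{N_2 - N_1 + 1}\sum_{j=N_1}^{N_2} X(i-k,j),
\quad
\bar{D}(i) = \frac{1}{N_2 - N_1 + 1}\sum_{j=N_1}^{N_2} D(i,j)
```

Under this assumed form P(i, k) lies between -1 and 1, which fits the example cross-correlation threshold of 0.5 used in the playing-state decision.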

Wherein P(i, k) is the cross-correlation value between the recording frame signal of the i-th frame and the to-be-played voice frame signal located k frames before the i-th frame in the playing buffer; k is a preset delay parameter; X(i-k, j) is the first power spectrum value of the j-th frequency point in the to-be-played voice frame signal of the (i-k)-th frame; the first power spectrum mean is the mean of the first power spectrum values from frequency point N1 to frequency point N2 in the frequency domain of the to-be-played voice frame signal of the (i-k)-th frame; D(i, j) is the second power spectrum value of the j-th frequency point in the recording frame signal of the i-th frame; and the second power spectrum mean is the mean of the second power spectrum values from frequency point N1 to frequency point N2 in the frequency domain of the recording frame signal of the i-th frame.
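
The fusion of the two mean-removed power spectra can be sketched as follows. The normalization by the product of the two spectral energies is an assumption (a Pearson-style correlation), since the text only states that the two difference values are fused; the function name and the zero result for a flat spectrum are likewise illustrative.

```python
import math


def cross_correlation(play_spectrum, rec_spectrum, n1, n2):
    """Correlate two power spectra over frequency points n1..n2 (inclusive).

    play_spectrum: first power spectrum values X(i-k, j)
    rec_spectrum:  second power spectrum values D(i, j)
    Returns 0.0 when either spectrum is flat over the band (an assumption).
    """
    x = play_spectrum[n1:n2 + 1]
    d = rec_spectrum[n1:n2 + 1]
    x_mean = sum(x) / len(x)  # first power spectrum mean
    d_mean = sum(d) / len(d)  # second power spectrum mean
    x_diff = [v - x_mean for v in x]
    d_diff = [v - d_mean for v in d]
    numerator = sum(a * b for a, b in zip(x_diff, d_diff))
    denominator = math.sqrt(sum(a * a for a in x_diff) * sum(b * b for b in d_diff))
    return numerator / denominator if denominator > 0 else 0.0


same_shape = cross_correlation([1, 2, 3, 4], [2, 4, 6, 8], 0, 3)
opposite = cross_correlation([1, 2, 3, 4], [4, 3, 2, 1], 0, 3)
```

Because the means are removed and the result is normalized, a recording whose spectrum has the same shape as the play spectrum scores close to 1 regardless of its overall level.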

105. Determining the current call state with the first terminal according to the verification result of the current voice frame number and the cross-correlation value.

For example, the playing state of the to-be-played voice frame signal may be determined based on the candidate cross-correlation value, and when the playing state is normal playing and the current voice frame number passes the verification, the current call state with the first terminal is determined to be normal call.

The playing state indicates whether the to-be-played voice frame signal is normally converted into sound and played by a device such as a loudspeaker, and may therefore be normal playing or abnormal playing. Normal playing can be understood as the to-be-played voice frame signal being normally converted into sound through electroacoustic conversion and played; in this case, the corresponding recording signal can be collected by a sound collection device such as a microphone. Abnormal playing means that an abnormality occurs during electroacoustic conversion or sound playing; in this case, the corresponding recording signal cannot be accurately collected by the sound collection device.

The playing state may be determined based on the candidate cross-correlation values in various manners. For example, the candidate cross-correlation values may be compared with a preset cross-correlation threshold; based on the comparison result, the number of candidate cross-correlation values exceeding the preset cross-correlation threshold under each preset delay parameter is counted to obtain a basic statistical number set; and the playing state of the to-be-played voice frame signal is determined based on the basic statistical number set.

The number of candidate cross-correlation values exceeding the preset cross-correlation threshold under each preset delay parameter may be counted in various manners. For example, taking the preset delay parameter k=2, among the candidate cross-correlation values between each recording frame signal and its candidate to-be-played voice frame signal under this delay parameter, the number of values exceeding the preset cross-correlation threshold is counted to obtain the basic statistical number under this delay parameter. By analogy, the basic statistical number corresponding to each delay parameter can be obtained, and these basic statistical numbers are combined into the basic statistical number set.

After the basic statistical number set is obtained, the playing state of the to-be-played voice frame signal can be determined based on it in various manners. For example, the average of the basic statistical numbers in the set is calculated to obtain a basic statistical number mean, and the basic statistical number with the largest value is screened out from the set to obtain a target number. When the target number equals a preset number threshold, the number ratio between the target number and the basic statistical number mean is calculated; when the number ratio exceeds a preset number ratio threshold, the playing state of the to-be-played voice frame signal is determined to be normal playing.

Because the recording signal contains other sounds (for example, near-end human voice and other environmental sounds) besides the echo signal of the to-be-played voice frame signal, and these sounds interfere with the cross-correlation value between the to-be-played voice frame signal and the recording frame signal, the purpose of determining the playing state based on the basic statistical number set is to filter out this interference, so that the preset delay parameter between the to-be-played voice frame signal and the recording frame signal can be determined more accurately and the playing state of the to-be-played voice frame signal can then be judged. Taking statistical variables C(k) for a preset delay parameter set of size M as an example, the filtering mainly works as follows. Candidate cross-correlation values are calculated for consecutive recording frame signals, and each recording frame signal yields M values of P(i, k) (candidate cross-correlation values). For each recording frame signal, when the value of P(i, k1) is greater than the preset cross-correlation threshold THRD_P (for example, 0.5), 1 is added to the statistical variable C(k1), and the remaining C(k), k ≠ k1, are unchanged. When the maximum of the M statistical variables C(k) equals the preset number threshold THRD_L, for example C(k2) = THRD_L with THRD_L = 200, and the ratio of C(k2) to the average value of C(k) is greater than the preset number ratio threshold THRD_C, it is determined that the to-be-played voice frame signal is played normally, which means that the recording function is normal and effective; otherwise, it is determined that the to-be-played voice frame signal is not played normally, which means that the recording function has failed.

The process of judging failure of the recording function by calculating the cross-correlation value may therefore be as shown in fig. 3: the power spectra of the to-be-played voice frame signal and the recording frame signal are calculated respectively to obtain the first power spectrum value and the second power spectrum value; the first power spectrum value of the to-be-played voice frame signal is cached; the cross-correlation value is calculated based on the first power spectrum value and the second power spectrum value; the cross-correlation values are then counted; and the playing state of the to-be-played voice frame signal and the failure condition of the recording function are judged based on the counting result.
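
The counting and decision logic can be sketched as follows. The function names, the use of a dictionary keyed by delay parameter, the value 2.0 for THRD_C and the simulated correlation values are assumptions; THRD_P = 0.5 and THRD_L = 200 are the example values from the text. Returning `None` before any counter reaches THRD_L is also an assumption, since the text only describes the decision made at that point.

```python
def update_counts(counts, correlations, thrd_p=0.5):
    """Add 1 to C(k) for every delay k whose P(i, k) exceeds THRD_P."""
    for k, p in correlations.items():
        if p > thrd_p:
            counts[k] += 1


def playback_state(counts, thrd_l, thrd_c):
    """Return True (normal), False (abnormal) or None (not enough frames yet)."""
    peak = max(counts.values())
    if peak < thrd_l:
        return None  # no C(k) has reached THRD_L; keep accumulating
    mean = sum(counts.values()) / len(counts)
    return peak / mean > thrd_c


delays = [2, 3, 4]
counts = {k: 0 for k in delays}
for i in range(200):
    # Simulated P(i, k): a stable echo at k = 3, occasional interference at k = 2.
    correlations = {2: 0.9 if i % 10 == 0 else 0.1, 3: 0.9, 4: 0.1}
    update_counts(counts, correlations)

state = playback_state(counts, thrd_l=200, thrd_c=2.0)
```

Comparing the peak with the mean rather than with a fixed count is what rejects broadband interference: if every delay exceeded THRD_P on every frame, the ratio would be 1 and the check would fail.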

Optionally, when determining that the current call state with the first terminal is normal call, the method may further send a prompt message of the current call state to the first terminal, for example, the prompt message of the normal call may be generated according to the current call state, and the prompt message is sent to the first terminal, so that the first terminal prompts the current call state according to the prompt message.

The method for prompting the call state includes various modes, for example, when the first terminal is a hardware device product, a light prompt can be performed on a prompt lamp on the first terminal, or a vibration prompt can be performed on the first terminal through a vibration motor, or call state display can be performed on a call page.

The call state may be displayed on the call page in various manners. For example, the prompt information of the normal call may be displayed directly on the call page, or a target call icon corresponding to the normal call may be selected from the call state icons according to the prompt information and displayed. The call icon is mainly used to prompt the speaker corresponding to the first terminal that the current call state is a normal call, and may take various forms. Taking a small loudspeaker as the call icon, the call page displayed on the first terminal may be as shown in fig. 4.

Optionally, when the call state detection device detects that the user makes a sound, the current sound signal may also be collected and processed. For example, the current sound signal is collected and framed to obtain a plurality of sound frame signals, the sound frame signals are encoded to generate target call information, and the target call information is sent to a second terminal, so that the second terminal detects the call state based on the target call information.

The sound frame signal may be encoded to generate the target call information in various manners. For example, voice detection may be performed on the sound frame signal, and a target voice frame identifier may be generated based on the voice detection result. The sound frame signal is preprocessed, and voice encoding is performed on the preprocessed sound frame signal to obtain a target voice frame signal. The target voice frame identifiers in the target voice frame signal are counted to obtain a target voice frame number; the target voice frame identifier and the target voice frame number are taken as target attribute information of the target voice frame signal; and the target voice frame signal and its target attribute information are fused to obtain the target call information.

Voice detection may be performed on the sound frame signal in various manners. For example, each sound frame signal may be checked with a voice activity detection (VAD) algorithm, and the sound frame signals with vad equal to 1 are selected as the target voice frame signals corresponding to human voice, which serve as the voice detection result.

After the voice detection, the target voice frame identifier may be generated based on the voice detection result in various manners, for example, the target voice frame signals are respectively numbered, so as to obtain the target voice frame identifier corresponding to each target voice frame signal, or the target voice frame signals are respectively sequenced, and the sequencing result is converted into the identifier of each target voice frame signal, so as to obtain the target voice frame identifier.

After the target voice frame identifier is generated, voice encoding may be performed on the preprocessed sound frame signal in various manners. For example, the preprocessed sound frame signal is encoded with a preset voice encoding protocol or encoding strategy to obtain an initial voice frame signal, and the initial voice frame signal is marked with the target voice frame identifier to obtain the target voice frame signal.

The initial voice frame signal may be marked with the target voice frame identifier in various manners. For example, the basic voice frame signal corresponding to the target voice frame identifier is screened out from the initial voice frame signal, and the target voice frame identifier is added to that basic voice frame signal to obtain the target voice frame signal.

After voice encoding is performed on the preprocessed sound frame signal, the target voice frame identifiers in the target voice frame signal can be counted in various manners. For example, the target voice frame signals carrying a target voice frame identifier are screened out, and their number is counted to obtain the target voice frame number. The target voice frame number indicates the number of target voice frame signals contained in the target call information, in the same way as the basic voice frame number in the current call information sent by the first terminal to the call state detection device, and is used to judge whether the second terminal has completely received and played the target voice frame signals.
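
The sender-side marking and counting can be sketched as below. The energy-threshold detector is only a stand-in for the unspecified VAD algorithm, the dictionary layout of the call information is hypothetical, and the raw frame is used as a placeholder for the encoded voice data.

```python
def is_voice(frame, energy_threshold=0.01):
    """Stand-in VAD: return 1 when the mean squared amplitude exceeds a threshold."""
    energy = sum(s * s for s in frame) / len(frame)
    return 1 if energy > energy_threshold else 0


def build_call_info(sound_frames):
    """Mark voiced frames with a running identifier and count them."""
    tagged = []
    voice_count = 0
    for frame in sound_frames:
        voice_id = None
        if is_voice(frame) == 1:
            voice_count += 1
            voice_id = voice_count  # identifier obtained by numbering voiced frames
        # A real sender would store encoded audio here; the raw frame is a placeholder.
        tagged.append({"payload": list(frame), "voice_frame_id": voice_id})
    return {"frames": tagged, "voice_frame_count": voice_count}


sound_frames = [[0.0] * 4, [0.5] * 4, [0.0] * 4, [0.3] * 4]
info = build_call_info(sound_frames)
```

The receiving side can then count the frames that carry a non-empty `voice_frame_id` and compare that count with `voice_frame_count` in the frame number verification.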

After the target call information is sent to the second terminal, target prompt information of the call state corresponding to the target call information may also be received from the second terminal. For example, the target call information is sent to the second terminal, and the second terminal detects the call state according to the target call information; when the second terminal detects that the call state is a normal call, the target prompt information corresponding to the normal call returned by the second terminal is received, and the call state is prompted based on the target prompt information. The call state may be prompted in the same manner as on the first terminal, which is not described again here. The second terminal may be the same as or different from the first terminal.

In this scheme, the detection of the call state is completed jointly by the two parties participating in the instant messaging service: the speaker and the listener together complete the transmission and playing of the current call information, and the current call state is detected from the transmission and playing results. The whole call state detection process may be as shown in fig. 5.

At the speaker, a recording signal is obtained through the microphone of the first terminal corresponding to the speaker and divided into frame signals of equal length according to a certain window size (for example, 20 ms). Whether each frame of the recording signal is voice is judged by the VAD algorithm; when vad=1, the frame is a voice signal and is marked with a voice frame identifier, which is bound to the voice encoding data and sent to the listener. Meanwhile, the speaker counts the voice frames marked within a certain time period to obtain the basic voice frame number. The encoded voice frame signal, the basic voice frame number and the voice frame identifier are transmitted as target call information to the terminal of the listener through the channel.

At the listener, the voice frame signal is decoded, the voice frame identifier is detected, and the received voice frame number is counted. The decoded voice frame signal is post-processed to obtain the to-be-played voice frame signal, VAD detection is then performed on the to-be-played voice frame signal, and the frames with vad=1 are counted to obtain the played voice frame number. Voice frame number verification is performed on the played voice frame number, the received voice frame number and the basic voice frame number. The to-be-played voice frame signal is played through a loudspeaker, the corresponding sound is collected to obtain the recording frame signal, the cross-correlation value of the recording frame signal and the to-be-played voice frame signal is calculated, and the cross-correlation value is checked. After both the voice frame number verification and the cross-correlation value check pass, prompt information is sent to the first terminal, so that the first terminal displays the small loudspeaker status identifier on the call page.

As can be seen from the foregoing, after receiving the current call information sent by the first terminal, the embodiment of the present application parses the voice frame signal in the current call information according to the attribute information in the current call information to obtain a to-be-played voice frame signal and the current voice frame number of the voice frame signal; plays the to-be-played voice frame signal and collects the played sound to obtain the recording frame signal corresponding to the to-be-played voice frame signal; verifies the current voice frame number based on the attribute information and calculates the cross-correlation value between the recording frame signal and the to-be-played voice frame signal; and then determines the current call state with the first terminal according to the verification result of the current voice frame number and the cross-correlation value. In this scheme, the collected voice signal is identified, the number of voice frames is verified at both the sending end and the listening end throughout the instant messaging process, and whether the voice is successfully played is determined through the cross-correlation calculation between the to-be-played voice frame signal and the recording frame signal. Voice monitoring of the whole call process can thus be achieved, and the accuracy of call state detection can be improved.

According to the method described in the above embodiments, examples are described in further detail below.

In this embodiment, the call state detection device is specifically integrated in an electronic device, and the electronic device is a terminal; to distinguish it from the first terminal of the speaker, this terminal is referred to as the target terminal.

As shown in fig. 6, a call state detection method specifically includes the following steps:

201. The target terminal receives the current call information sent by the first terminal.

For example, the target terminal establishes an information transmission channel with the first terminal over the network and receives, through this channel, the current call information sent by the first terminal. Alternatively, when the current call information is sensitive, the target terminal may receive encrypted information of the current call information sent by the first terminal and decrypt it based on a preset encryption protocol to obtain the current call information. Alternatively, when the first terminal sends the current call information to a server or a third-party terminal, the target terminal may receive the current call information corresponding to the first terminal from the server or the third-party terminal. Alternatively, the target terminal receives a call request sent by the first terminal, where the call request carries a storage address of the current call information, and obtains the current call information from the first terminal, the third-party terminal or a call information database based on the storage address.

202. The target terminal decodes the voice frame signal to obtain a decoded voice frame signal.

For example, the target terminal may decode the voice frame signal according to the encoding format of the voice frame signal by adopting a corresponding decoding policy, or may also obtain a terminal identifier of the first terminal, screen a target decoding protocol corresponding to the terminal identifier from preset decoding protocols, and decode the voice frame signal based on the target decoding protocol, thereby obtaining a decoded voice frame signal.

203. The target terminal performs post-processing on the decoded voice frame signal to obtain a voice frame signal to be played.

For example, the target terminal may adapt the decoded voice frame signal to the playback device currently used for playing, compress or amplify the decoded voice frame signal, or set playback parameters for it, so that the decoded voice frame signal is converted into a to-be-played voice frame signal that can be played directly.

204. The target terminal performs voice frame number statistics on the decoded voice frame signal and the voice frame signal to be played based on the attribute information to obtain the current voice frame number of the voice frame signal.

For example, the target terminal extracts the voice frame identifier of the voice frame signal from the attribute information, detects the decoded voice frame signals that contain the voice frame identifier to obtain target decoded voice frame signals, and counts the target decoded voice frame signals to obtain the received voice frame number; alternatively, it directly identifies the number of decoded voice frame signals containing the voice frame identifier, thereby counting the received voice frame number. A VAD algorithm is used to detect whether each to-be-played voice frame signal is a voice signal, giving a detection result for each to-be-played voice frame signal; based on the detection results, the to-be-played voice frame signals that are voice signals are counted to obtain the played voice frame number. The received voice frame number and the played voice frame number are taken as the current voice frame number of the voice frame signal.

205. The target terminal plays the voice frame signal to be played, and collects the played voice frame signal to be played to obtain a recording frame signal corresponding to the voice frame signal to be played.

For example, the target terminal converts the to-be-played voice frame signal into sound through electroacoustic conversion by a loudspeaker. After being conducted through the air, the sound is collected by a microphone and becomes an echo signal of the to-be-played voice frame signal; the echo signal is finally mixed into the recording signal, from which the recording frame signal is obtained.

206. The target terminal verifies the current voice frame number based on the attribute information.

For example, the target terminal extracts the basic speech frame number of the speech frame signal from the attribute information. When the basic speech frame number and the current speech frame number exceed a preset frame number threshold, the target terminal calculates the ratio of the basic speech frame number to the received speech frame number to obtain a first frame number ratio, and calculates the ratio of the received speech frame number to the played speech frame number to obtain a second frame number ratio. When neither the first frame number ratio nor the second frame number ratio exceeds the preset frame number ratio threshold, the current speech frame number is determined to pass the verification.

207. The target terminal calculates the cross-correlation value of the recording frame signal and the voice frame signal to be played.

For example, the target terminal may obtain a preset delay parameter set between the to-be-played voice frame signal and the recording frame signal, and screen out, from the to-be-played voice frame signals, the to-be-played voice frame signal corresponding to each preset delay parameter, so as to obtain a candidate to-be-played voice frame signal set corresponding to each recording frame signal. A Fourier transform is performed on the to-be-played voice frame signal and the recording frame signal of the current frame, and the power spectra X(i, j) and D(i, j) at the j-th frequency point of the i-th frame are then calculated, giving a first power spectrum value and a second power spectrum value. The first power spectrum value X(i, j) of the to-be-played voice frame signal is buffered in a ring buffer (holding the power spectra of at most M frames of the played signal) for the cross-correlation calculation. The mean of the first power spectrum values is calculated to obtain a first power spectrum mean, and the mean of the second power spectrum values is calculated to obtain a second power spectrum mean. The difference between the first power spectrum value and the first power spectrum mean is calculated to obtain a first power spectrum difference, and the difference between the second power spectrum value and the second power spectrum mean is calculated to obtain a second power spectrum difference. The first power spectrum difference and the second power spectrum difference are fused to obtain a candidate cross-correlation value between each recording frame signal and each to-be-played voice frame signal in the corresponding candidate to-be-played voice frame signal set, as shown in formula (1).
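The following sketch illustrates the idea under stated assumptions. Formula (1) is not reproduced in this passage, so the fusion step here is assumed to be a normalized correlation of the mean-removed power spectra across frequency points. All function names are hypothetical, and a naive DFT is used only to keep the example free of dependencies.

```python
import cmath
import math
from collections import deque
from typing import Deque, Dict, List


def power_spectrum(frame: List[float]) -> List[float]:
    """Naive DFT power spectrum for bins 0..N/2 (adequate for short demo frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_correlation(x: List[float], d: List[float]) -> float:
    """Assumed fusion: normalized correlation of two mean-removed power spectra."""
    mean_x = sum(x) / len(x)
    mean_d = sum(d) / len(d)
    dx = [v - mean_x for v in x]  # first power spectrum difference
    dd = [v - mean_d for v in d]  # second power spectrum difference
    num = sum(a * b for a, b in zip(dx, dd))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dd))
    return num / den if den > 0 else 0.0


def candidate_correlations(recorded_spec: List[float],
                           played_history: Deque[List[float]],
                           delays: List[int]) -> Dict[int, float]:
    """Correlate one recorded frame with the played frame `delay` frames back."""
    scores = {}
    for delay in delays:
        if delay < len(played_history):
            scores[delay] = spectral_correlation(played_history[-1 - delay], recorded_spec)
    return scores


def tone(bin_index: int, n: int = 32) -> List[float]:
    """Test signal: a sine wave centred on one DFT bin."""
    return [math.sin(2 * math.pi * bin_index * t / n) for t in range(n)]


M = 4  # ring buffer keeps the power spectra of at most M played frames
played_history: Deque[List[float]] = deque(maxlen=M)
for bin_index in (2, 3, 5):
    played_history.append(power_spectrum(tone(bin_index)))

# The microphone picks up an attenuated echo of the frame played one frame earlier.
recorded = [0.5 * s for s in tone(3)]
scores = candidate_correlations(power_spectrum(recorded), played_history, delays=[0, 1, 2])
best_delay = max(scores, key=scores.get)
```

Because the correlation is normalized, the attenuation of the echo does not lower the score; only the shape of the spectrum matters, which is why the candidate at delay 1 stands out.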

208. The target terminal determines the current call state with the first terminal according to the verification result of the current voice frame number and the cross-correlation value.

For example, the target terminal may compare the candidate cross-correlation values with a preset cross-correlation threshold and, from the comparison result, count the candidate cross-correlation values that exceed the preset cross-correlation threshold under each preset delay parameter to obtain a basic statistical number set. It calculates the mean of the basic statistical numbers in the set to obtain a basic statistical number mean, and screens out the largest basic statistical number in the set as the target statistical number. When the target statistical number is the same as the preset number threshold, it calculates the number ratio between the target statistical number and the basic statistical number mean, and when the number ratio exceeds the preset number ratio, it determines that the playing state of the to-be-played voice frame signal is normal playing. When the playing state is normal playing and the current voice frame number passes the verification, the target terminal determines that the current call state with the first terminal is a normal call.
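One way to picture this decision is the sketch below. The function names and all threshold values are hypothetical. As a labelled assumption, the count condition is implemented as "the target statistical number reaches the preset number threshold" (`>=`), since the text does not say how larger counts are handled.

```python
from typing import Dict, List


def is_normal_playback(corr_by_delay: Dict[int, List[float]],
                       corr_threshold: float = 0.8,
                       min_count: int = 3,
                       min_ratio: float = 2.0) -> bool:
    """Decide the playing state from candidate cross-correlation values per delay."""
    # Basic statistical number set: count of values above the threshold per delay.
    counts = [sum(1 for c in values if c > corr_threshold)
              for values in corr_by_delay.values()]
    if not counts:
        return False
    mean_count = sum(counts) / len(counts)  # basic statistical number mean
    target = max(counts)                    # target statistical number
    # Assumption: the count condition is "reaches the threshold" (>=).
    if target < min_count or mean_count == 0:
        return False
    return target / mean_count > min_ratio


def is_normal_call(playback_normal: bool, frame_check_passed: bool) -> bool:
    """The call is normal only if playback is normal and the frame check passed."""
    return playback_normal and frame_check_passed


echo_present = {
    0: [0.1, 0.2, 0.1, 0.3],
    1: [0.9, 0.95, 0.85, 0.9],  # one delay clearly dominates
    2: [0.2, 0.85, 0.1, 0.1],
    3: [0.0, 0.1, 0.2, 0.1],
}
no_echo = {d: [0.1, 0.2, 0.1, 0.0] for d in range(4)}

playing = is_normal_playback(echo_present)
silent = is_normal_playback(no_echo)
call_ok = is_normal_call(playing, frame_check_passed=True)
```

The ratio test rewards a single dominant delay: a real echo arrives at one consistent delay, whereas unrelated sound tends to spread any accidental matches across delays.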

209. The target terminal sends prompt information of the current call state to the first terminal.

For example, the target terminal may generate a prompt message of a normal call according to the current call state, and send the prompt message to the first terminal.

210. The first terminal prompts the current call state according to the prompt information.

For example, when the first terminal is a hardware device product, it may give a light prompt through an indicator lamp or a vibration prompt through a vibration motor. It may also display the prompt information of the normal call directly on the call page, or screen out, according to the prompt information, the target call icon corresponding to the normal call from the call state icons and display it. The call icon mainly serves to prompt that the current call state of the speaker corresponding to the first terminal is a normal call, and it may take various forms. Taking a small horn as the call icon, the call page displayed on the first terminal may be as shown in fig. 4.

211. When the target terminal detects the sound signal, the target terminal collects the current sound signal and generates target call information according to the current sound signal.

For example, when the target terminal detects a sound signal, it collects the currently generated sound signal and performs framing processing on it to obtain a plurality of sound frame signals. Each sound frame signal is detected by a VAD algorithm, and the sound frame signals whose VAD result is 1 are screened out as the target sound frame signals corresponding to voice; the target sound frame signals serve as the voice detection result. The target sound frame signals are numbered to obtain the target voice frame identifier of each target sound frame signal; alternatively, the target sound frame signals are sorted and the sorting result is converted into an identifier for each target sound frame signal, giving the target voice frame identifiers. The sound frame signals are preprocessed and then encoded with a preset voice coding protocol or coding strategy to obtain initial voice frame signals. The basic voice frame signals corresponding to the target voice frame identifiers are screened out from the initial voice frame signals, and the target voice frame identifier is added to each basic voice frame signal, giving the target voice frame signals.

The target terminal screens out the target voice frame signals that carry target voice frame identifiers, counts the screened target voice frame signals to obtain the target voice frame number, takes the target voice frame identifiers and the target voice frame number as the target attribute information of the target voice frame signals, and fuses the target voice frame signals with their target attribute information to obtain the target call information.
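The sender-side flow of step 211 might be sketched as follows. The `CallInfo` container, the attribute keys, and the energy-based `is_voice` check are hypothetical simplifications, and the encoding step is omitted, so the frames are stored as raw samples.

```python
from dataclasses import dataclass
from typing import Dict, List


def split_frames(samples: List[float], frame_len: int) -> List[List[float]]:
    """Cut a sound signal into consecutive full frames of `frame_len` samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def is_voice(frame: List[float], energy_threshold: float = 1e-3) -> bool:
    """Stand-in VAD (assumption): mean energy above a threshold means VAD = 1."""
    return sum(s * s for s in frame) / len(frame) > energy_threshold


@dataclass
class CallInfo:
    """Hypothetical container for target call information."""
    frames: Dict[int, List[float]]  # target voice frame identifier -> frame payload
    attributes: Dict[str, object]   # target attribute information


def build_call_info(samples: List[float], frame_len: int) -> CallInfo:
    """Frame the signal, keep voiced frames, number them, and attach attributes."""
    voiced = [f for f in split_frames(samples, frame_len) if is_voice(f)]
    # Number only the frames that the VAD marked as voice.
    frames = {frame_id: f for frame_id, f in enumerate(voiced)}
    attributes = {"frame_ids": list(frames), "frame_count": len(frames)}
    return CallInfo(frames=frames, attributes=attributes)


# 8 voiced samples, 8 silent samples, 8 voiced samples.
signal = [0.5, -0.5] * 4 + [0.0] * 8 + [0.3, -0.3] * 4
info = build_call_info(signal, frame_len=8)
```

The receiver can then compare `frame_count` in the attribute information (the basic voice frame number) with what it actually received and played.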

212. The target terminal sends the target call information to the second terminal and receives the target prompt information of the call state corresponding to the target call information returned by the second terminal.

For example, the target terminal sends the target call information to the second terminal so that the second terminal detects the call state according to the target call information. When the second terminal detects that the call state is a normal call, the target terminal can receive the target prompt information corresponding to the normal call returned by the second terminal and prompt the call state based on it. The call state can be prompted in the same way as on the first terminal, which is not described in detail here.

The first terminal and the second terminal may be the same or different.

As can be seen from the above, after receiving the current call information sent by the first terminal, the target terminal in this embodiment parses a voice frame signal in the current call information according to attribute information in the current call information to obtain a to-be-played voice frame signal and a current voice frame number of the voice frame signal, then plays the to-be-played voice frame signal, collects the played to-be-played voice frame signal to obtain a recording frame signal corresponding to the to-be-played voice frame signal, then verifies the current voice frame number based on the attribute information, calculates a cross-correlation value between the recording frame signal and the to-be-played voice frame signal, and then determines a current call state with the first terminal according to a verification result and the cross-correlation value of the current voice frame number; the method and the device can identify the collected voice signals, further verify the number of voice frames of the voice signals in the whole instant messaging process by listening and sending, and determine whether the voice is successfully played or not by means of cross-correlation calculation of the voice frame signals to be played and the recording frame signals by listening and sending, so that voice monitoring of the whole conversation process can be achieved, and the accuracy of conversation state detection can be improved.

In order to better implement the above method, the embodiment of the present invention further provides a call state detection device, where the call state detection device may be integrated in an electronic device, such as a server or a terminal, where the terminal may include a tablet computer, a notebook computer, and/or a personal computer.

For example, as shown in fig. 7, the call state detection device may include a receiving unit 301, a parsing unit 302, an acquisition unit 303, a verification unit 304, and a determining unit 305, as follows:

(1) A receiving unit 301;

a receiving unit 301, configured to receive current call information sent by the first terminal, where the current call information includes a voice frame signal and attribute information of the voice frame signal.

For example, the receiving unit 301 may be specifically configured to construct an information transmission channel with the first terminal through a network and receive, through that channel, the current call information sent by the first terminal. Alternatively, when the current call information is sensitive information, it may receive encrypted information of the current call information sent by the first terminal and decrypt it based on a preset encryption protocol to obtain the current call information. Alternatively, when the current call information occupies a large amount of memory or is large in quantity, it may obtain the current call information from the first terminal indirectly.

(2) A parsing unit 302;

the parsing unit 302 is configured to parse the voice frame signal according to the attribute information, so as to obtain a voice frame signal to be played and a current voice frame number of the voice frame signal.

For example, the parsing unit 302 may be specifically configured to decode a voice frame signal to obtain a decoded voice frame signal, perform post-processing on the decoded voice frame signal to obtain a to-be-played voice frame signal, and perform voice frame count on the decoded voice frame signal and the to-be-played voice frame signal based on attribute information to obtain a current voice frame number of the voice frame signal.

(3) An acquisition unit 303;

the acquisition unit 303 is configured to play the to-be-played voice frame signal, and acquire the played to-be-played voice frame signal, so as to obtain a recording frame signal corresponding to the to-be-played voice frame signal.

For example, the acquisition unit 303 may be specifically configured to play the to-be-played voice frame signal through a speaker, record the played to-be-played voice frame signal through a microphone while playing the to-be-played voice frame signal, obtain a recording signal, and frame the recording signal according to the window length of each frame in the to-be-played voice frame signal, thereby obtaining a recording frame signal with the same window length as that of the to-be-played voice frame signal.
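The framing step can be illustrated with the short sketch below, which cuts the recording with the same window length as the played frames so the two can be compared frame by frame. The function name and sample values are made up for illustration, and trailing samples that do not fill a whole window are simply dropped here, which is an assumption.

```python
from typing import List


def frame_recording(recording: List[float], window_len: int) -> List[List[float]]:
    """Cut the recording signal into frames of exactly `window_len` samples."""
    if window_len <= 0:
        raise ValueError("window_len must be positive")
    return [recording[start:start + window_len]
            for start in range(0, len(recording) - window_len + 1, window_len)]


played_frames = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.0, 0.1, 0.0, 0.1],
]
window_len = len(played_frames[0])          # window length of the played frames
recording = [0.05 * i for i in range(14)]   # 14 samples: 3 full frames plus 2 leftovers
recording_frames = frame_recording(recording, window_len)

# Equal window lengths allow the played and recorded frames to be paired directly.
pairs = list(zip(played_frames, recording_frames))
```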

(4) A verification unit 304;

the verification unit 304 is configured to verify the current speech frame number based on the attribute information, and calculate a cross-correlation value of the recording frame signal and the speech frame signal to be played, where the cross-correlation value is used to indicate a similarity degree between the recording frame signal and the speech frame signal to be played.

For example, the verification unit 304 may be specifically configured to extract a base speech frame number of the speech frame signal from the attribute information, compare the base speech frame number with the current speech frame number when the base speech frame number and the current speech frame number exceed a preset frame number threshold, and determine a verification result of the current speech frame number based on the comparison result. And acquiring a preset time delay parameter set between the to-be-played voice frame signal and the recording frame signal, screening out to-be-played voice frame signals corresponding to each preset time delay parameter from the to-be-played voice frame signals, obtaining candidate to-be-played voice frame signal sets corresponding to each recording frame signal, and respectively calculating candidate cross-correlation values between each recording frame signal and each to-be-played voice frame signal in the corresponding candidate to-be-played voice frame signal sets.

(5) A determination unit 305;

a determining unit 305, configured to determine a current call state with the first terminal according to the verification result and the cross-correlation value of the current voice frame number.

For example, the determining unit 305 may be specifically configured to compare the candidate cross-correlation value with a preset cross-correlation threshold, count the number of candidate cross-correlation values exceeding the preset cross-correlation threshold under each preset delay parameter in the comparison result, obtain a basic statistical number set, and determine the playing state of the speech frame signal to be played based on the basic statistical number set. When the playing state is normal playing and the verification of the current voice frame number is passed, determining that the current conversation state with the first terminal is normal conversation.

Optionally, the call state detection device may further include a sending unit 306, as shown in fig. 8, specifically may be as follows:

and a sending unit 306, configured to, when detecting the sound signal, collect the current sound signal to generate target call information, and send the target call information to the second terminal, so that the second terminal detects the call state based on the target call information.

For example, the sending unit 306 may specifically be configured to, when detecting a sound signal, collect a currently generated sound signal, perform frame processing on the sound signal to obtain a plurality of sound frame signals, perform encoding processing on the sound frame signals to generate target call information, and send the target call information to the second terminal, so that the second terminal detects a call state based on the target call information.

In the implementation, each unit may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit may be referred to the foregoing method embodiment, which is not described herein again.

As can be seen from the foregoing, in this embodiment, after the receiving unit 301 receives the current call information sent by the first terminal, the parsing unit 302 parses the voice frame signal in the current call information according to the attribute information in the current call information to obtain a to-be-played voice frame signal and a current voice frame number of the voice frame signal, then the acquisition unit 303 plays the to-be-played voice frame signal and collects the played to-be-played voice frame signal to obtain a recording frame signal corresponding to the to-be-played voice frame signal, then the verification unit 304 verifies the current voice frame number based on the attribute information and calculates a cross-correlation value between the recording frame signal and the to-be-played voice frame signal, and then the determining unit 305 determines a current call state with the first terminal according to the verification result of the current voice frame number and the cross-correlation value; the method and the device can identify the collected voice signals, further verify the number of voice frames of the voice signals in the whole instant messaging process by listening and sending, and determine whether the voice is successfully played or not by means of cross-correlation calculation of the voice frame signals to be played and the recording frame signals by listening and sending, so that voice monitoring of the whole conversation process can be achieved, and the accuracy of conversation state detection can be improved.

The embodiment of the invention also provides an electronic device, as shown in fig. 9, which shows a schematic structural diagram of the electronic device according to the embodiment of the invention, specifically:

The electronic device may include components such as a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, and an input unit 404. It will be appreciated by those skilled in the art that the electronic device structure shown in fig. 9 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Wherein:

the processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.

The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by executing the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data created according to the use of the electronic device, etc. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.

The electronic device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of managing charging, discharging, and power consumption are performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.

The electronic device may further comprise an input unit 404, which may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.

Although not shown, the electronic device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the electronic device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:

For example, an information transmission channel is constructed with the first terminal through a network, and the current call information sent by the first terminal is received through the information transmission channel; alternatively, when the current call information is sensitive information, encrypted information of the current call information sent by the first terminal may be received and decrypted based on a preset encryption protocol to obtain the current call information; alternatively, when the current call information occupies a large amount of memory or is large in quantity, the current call information may be obtained from the first terminal indirectly. The voice frame signal is decoded to obtain a decoded voice frame signal, the decoded voice frame signal is post-processed to obtain a to-be-played voice frame signal, and voice frame number statistics are performed on the decoded voice frame signal and the to-be-played voice frame signal based on the attribute information to obtain the current voice frame number of the voice frame signal. The to-be-played voice frame signal is played through a loudspeaker and, while it is playing, recorded through a microphone to obtain a recording signal; the recording signal is framed according to the window length of each frame in the to-be-played voice frame signal, giving recording frame signals with the same window length as the to-be-played voice frame signal. The basic voice frame number of the voice frame signal is extracted from the attribute information; when the basic voice frame number and the current voice frame number exceed a preset frame number threshold, the basic voice frame number is compared with the current voice frame number, and a verification result of the current voice frame number is determined based on the comparison result. A preset delay parameter set between the to-be-played voice frame signal and the recording frame signal is acquired, the to-be-played voice frame signals corresponding to each preset delay parameter are screened out from the to-be-played voice frame signals to obtain a candidate to-be-played voice frame signal set corresponding to each recording frame signal, and candidate cross-correlation values between each recording frame signal and each to-be-played voice frame signal in the corresponding candidate to-be-played voice frame signal set are calculated respectively. The candidate cross-correlation values are compared with a preset cross-correlation threshold, the candidate cross-correlation values exceeding the preset cross-correlation threshold under each preset delay parameter are counted from the comparison result to obtain a basic statistical number set, and the playing state of the to-be-played voice frame signal is determined based on the basic statistical number set. When the playing state is normal playing and the current voice frame number passes the verification, the current call state with the first terminal is determined to be a normal call. When a sound signal is detected, the currently generated sound signal is collected and framed to obtain a plurality of sound frame signals, the sound frame signals are encoded to generate target call information, and the target call information is sent to the second terminal so that the second terminal detects the call state based on the target call information.

The specific implementation of each operation may be referred to the previous embodiments, and will not be described herein.

As can be seen from the above, after receiving the current call information sent by the first terminal, the embodiment of the invention analyzes the voice frame signal in the current call information according to the attribute information in the current call information to obtain the to-be-played voice frame signal and the current voice frame number of the voice frame signal, then plays the to-be-played voice frame signal, collects the played to-be-played voice frame signal to obtain the recording frame signal corresponding to the to-be-played voice frame signal, then checks the current voice frame number based on the attribute information, calculates the cross-correlation value of the recording frame signal and the to-be-played voice frame signal, and then determines the current call state with the first terminal according to the check result and the cross-correlation value of the current voice frame number; the method and the device can identify the collected voice signals, further verify the number of voice frames of the voice signals in the whole instant messaging process by listening and sending, and determine whether the voice is successfully played or not by means of cross-correlation calculation of the voice frame signals to be played and the recording frame signals by listening and sending, so that voice monitoring of the whole conversation process can be achieved, and the accuracy of conversation state detection can be improved.

Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.

To this end, an embodiment of the present invention provides a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform any of the steps in the call state detection method provided by the embodiment of the present invention. For example, the instructions may perform the steps of:

The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.

The computer-readable storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.

Because the instructions stored in the computer-readable storage medium can execute the steps of any call state detection method provided by the embodiments of the present invention, they can achieve the beneficial effects achievable by any such method; see the foregoing embodiments for details, which are not repeated here.

According to one aspect of the present application, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the various optional implementations of the call state detection aspect described above.

The call state detection method, device, and computer-readable storage medium provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method and core idea of the present invention. Meanwhile, those skilled in the art may, in light of the ideas of the present invention, vary the specific implementations and the scope of application. The content of this description should therefore not be construed as limiting the present invention.

Claims

1. A call state detection method, comprising:

2. The call state detection method according to claim 1, wherein the verifying the current number of voice frames based on the attribute information includes:

extracting a basic voice frame number of the voice frame signal from the attribute information;

when the basic voice frame number and the current voice frame number exceed a preset frame number threshold, comparing the basic voice frame number with the current voice frame number;

and determining a verification result of the current voice frame number based on the comparison result.

3. The call state detection method according to claim 2, wherein the current number of voice frames includes a received voice frame number and a played voice frame number, and the comparing the base number of voice frames with the current number of voice frames includes:

calculating the ratio of the number of received voice frames to the number of basic voice frames to obtain a first frame number ratio;

calculating the ratio of the number of received voice frames to the number of played voice frames to obtain a second frame number ratio;

the determining the verification result of the current voice frame number based on the comparison result comprises: when the first frame number ratio and the second frame number ratio do not exceed a preset frame number ratio threshold, determining that the current voice frame number passes the verification.

4. The call state detection method according to claim 1, wherein the calculating a cross-correlation value between the recording frame signal and the speech frame signal to be played comprises:

acquiring a preset time delay parameter set between the voice frame signal to be played and the recording frame signal;

screening the to-be-played voice frame signals corresponding to each preset time delay parameter in the preset time delay parameter set from the to-be-played voice frame signals to obtain candidate to-be-played voice frame signal sets corresponding to each recording frame signal;

calculating candidate cross-correlation values between each recording frame signal and each to-be-played voice frame signal in the corresponding candidate to-be-played voice frame signal set respectively;

the determining the current call state with the first terminal according to the verification result and the cross-correlation value of the current voice frame number comprises the following steps: and determining the current call state of the first terminal according to the checking result of the current voice frame number and the candidate cross-correlation value.

5. The method for detecting a call state according to claim 4, wherein the calculating the candidate cross-correlation value between each recording frame signal and each to-be-played speech frame signal in the corresponding candidate to-be-played speech frame signal set includes:

calculating the power spectrum value of the frequency point in each recording frame signal to obtain a first power spectrum value;

respectively calculating the power spectrum value of the frequency point in each voice frame signal to be played in the candidate voice frame signal set to obtain a second power spectrum value;

and fusing the first power spectrum value and the second power spectrum value to obtain a candidate cross-correlation value between each recording frame signal and each to-be-played voice frame signal in the corresponding candidate to-be-played voice frame signal set.

6. The method for detecting a call state according to claim 5, wherein the fusing the first power spectrum value and the second power spectrum value to obtain a candidate cross-correlation value between each recording frame signal and each to-be-played voice frame signal in the corresponding candidate to-be-played voice frame signal set includes:

calculating the average value of the first power spectrum value to obtain a first power spectrum average value, and calculating the average value of the second power spectrum value to obtain a second power spectrum average value;

calculating the difference between the first power spectrum value and the first power spectrum mean value to obtain a first power spectrum difference value, and calculating the difference between the second power spectrum value and the second power spectrum mean value to obtain a second power spectrum difference value;

and fusing the first power spectrum difference value and the second power spectrum difference value to obtain a candidate cross-correlation value between each recording frame signal and each to-be-played voice frame signal in the corresponding candidate to-be-played voice frame signal set.
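The steps of claims 5 and 6 can be illustrated with a short sketch. The claims do not state in this section how the two power spectrum difference values are "fused"; the code below assumes a normalized (Pearson-style) correlation over frequency points, which is one common choice and not necessarily the formula intended by the claims. The function names `power_spectrum` and `spectral_correlation` are hypothetical, and both frames are assumed to have the same length.

```python
import cmath
import math
from typing import List, Sequence


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Power |X[k]|^2 at each frequency point k = 0..n/2 (naive DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(frame)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_correlation(recorded: Sequence[float], played: Sequence[float]) -> float:
    """Assumed fusion: normalized correlation of mean-removed power spectra."""
    p1 = power_spectrum(recorded)   # first power spectrum value
    p2 = power_spectrum(played)     # second power spectrum value
    m1 = sum(p1) / len(p1)          # first power spectrum mean
    m2 = sum(p2) / len(p2)          # second power spectrum mean
    d1 = [v - m1 for v in p1]       # first power spectrum difference
    d2 = [v - m2 for v in p2]       # second power spectrum difference
    num = sum(a * b for a, b in zip(d1, d2))
    den = math.sqrt(sum(a * a for a in d1) * sum(b * b for b in d2))
    return num / den if den > 0 else 0.0


tone = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
quieter = [0.5 * x for x in tone]
print(round(spectral_correlation(quieter, tone), 6))
```

Removing the mean before correlating makes the value insensitive to overall playback volume, so an attenuated recording of the same frame still scores close to 1 under this assumed formula, while a silent recording scores 0.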

7. The method for detecting a call state according to claim 4, wherein the determining the current call state with the first terminal according to the verification result of the current voice frame number and the candidate cross-correlation value comprises:

determining the playing state of the voice frame signal to be played based on the candidate cross-correlation value;

and when the playing state is normal playing and the verification of the current voice frame number is passed, determining that the current conversation state with the first terminal is normal conversation.

8. The call state detection method according to claim 7, wherein the determining the play state of the to-be-played voice frame signal based on the candidate cross-correlation value includes:

comparing the candidate cross-correlation value with a preset cross-correlation threshold value;

counting, from the comparison result, the number of candidate cross-correlation values exceeding the preset cross-correlation threshold under each preset time delay parameter to obtain a basic statistical quantity set;

and determining the playing state of the voice frame signal to be played based on the basic statistical quantity set.

9. The call state detection method according to claim 8, wherein the determining the playing state of the to-be-played voice frame signal based on the basic statistical quantity set comprises:

calculating the average of the basic statistical quantities in the basic statistical quantity set to obtain a basic statistical quantity average value;

screening out the basic statistical quantity with the largest value from the basic statistical quantity set to obtain a target statistical quantity;

when the target statistical quantity is the same as a preset quantity threshold value, calculating a quantity ratio between the target statistical quantity and the basic statistical quantity average value;

and when the number ratio exceeds a preset number ratio threshold, determining that the playing state of the voice frame signal to be played is normal playing.
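The decision logic of claims 8 and 9 can be sketched as below. This is an illustrative reading rather than the claimed implementation: the names `count_hits` and `is_normal_playback` are hypothetical, and the sketch treats the condition on the preset quantity threshold as "the target statistical quantity reaches or exceeds the threshold", which is an assumption about the intended comparison.

```python
from typing import Dict, List


def count_hits(corr_by_delay: Dict[int, List[float]], corr_threshold: float) -> Dict[int, int]:
    """Per preset delay, count candidate values above the correlation threshold."""
    return {
        delay: sum(1 for value in values if value > corr_threshold)
        for delay, values in corr_by_delay.items()
    }


def is_normal_playback(counts: Dict[int, int], count_threshold: int, ratio_threshold: float) -> bool:
    """Normal playback if one delay clearly dominates the hit counts."""
    if not counts:
        return False
    values = list(counts.values())
    mean_count = sum(values) / len(values)   # basic statistical quantity average
    target = max(values)                     # target statistical quantity
    # Assumption: ">=" stands in for the claim's threshold condition.
    if target < count_threshold or mean_count == 0:
        return False
    return target / mean_count > ratio_threshold


corr = {0: [0.1, 0.2, 0.1], 1: [0.9, 0.8, 0.95], 2: [0.2, 0.1, 0.7]}
counts = count_hits(corr, corr_threshold=0.6)
print(counts, is_normal_playback(counts, count_threshold=3, ratio_threshold=2.0))
```

The ratio test guards against cases where every delay produces many hits, for example stationary noise, since a genuine echo of the played signal should peak at one delay and stay low at the others.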

10. The call state detection method according to claim 7, wherein after the determining that the current call state with the first terminal is a normal call, further comprising:

generating prompt information of normal call according to the current call state;

and sending the prompt information to the first terminal, so that the first terminal prompts the current call state according to the prompt information.

11. The method for detecting a call state according to any one of claims 1 to 10, wherein the parsing the voice frame signal according to the attribute information to obtain a voice frame signal to be played and a current voice frame number of the voice frame signal includes:

decoding the voice frame signal to obtain a decoded voice frame signal;

post-processing the decoded voice frame signal to obtain a voice frame signal to be played;

and based on the attribute information, carrying out voice frame number statistics on the decoded voice frame signal and the voice frame signal to be played to obtain the current voice frame number of the voice frame signal.

12. The method for detecting a call state according to claim 11, wherein the carrying out voice frame number statistics on the decoded voice frame signal and the voice frame signal to be played based on the attribute information to obtain the current voice frame number of the voice frame signal comprises:

extracting a voice frame identifier of the voice frame signal from the attribute information;

counting the voice frame identifiers in the decoded voice frame signals to obtain the received voice frame numbers of the decoded voice frame signals;

performing voice detection on the to-be-played voice frame signal to obtain the played voice frame number of the to-be-played voice frame signal;

and taking the received voice frame number and the played voice frame number as the current voice frame number of the voice frame signal.
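The counting in claim 12 can be illustrated as follows. The sketch assumes that the voice frame identifier is a boolean flag per decoded frame and substitutes a simple mean-energy test for the voice detection step; the names `is_voiced` and `count_voice_frames` and the default `energy_threshold` are hypothetical.

```python
from typing import List, Sequence, Tuple


def is_voiced(frame: Sequence[float], energy_threshold: float = 1e-3) -> bool:
    """Stand-in voice detector: mean energy above a threshold (assumption)."""
    if not frame:
        return False
    return sum(x * x for x in frame) / len(frame) > energy_threshold


def count_voice_frames(
    voice_flags: List[bool],
    playback_frames: List[Sequence[float]],
    energy_threshold: float = 1e-3,
) -> Tuple[int, int]:
    """Return (received voice frame number, played voice frame number)."""
    # Received count: identifiers carried in the attribute information.
    received = sum(1 for flag in voice_flags if flag)
    # Played count: voice detection on the to-be-played frames.
    played = sum(1 for frame in playback_frames if is_voiced(frame, energy_threshold))
    return received, played


flags = [True, True, False]
frames = [[0.5, -0.5], [0.0, 0.0], [0.0, 0.0]]
print(count_voice_frames(flags, frames))
```

A gap between the two numbers, as in this example where two voice frames were received but only one remains voiced after post-processing, is the kind of discrepancy that the later verification of the current voice frame number is meant to expose.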

13. The call state detection method according to any one of claims 1 to 10, characterized by further comprising:

when a sound signal is detected, collecting the currently generated sound signal, and carrying out framing processing on the sound signal to obtain a plurality of sound frame signals;

encoding the sound frame signal to generate target call information;

and sending the target call information to a second terminal, so that the second terminal detects the call state based on the target call information.

14. The call state detection method according to claim 13, wherein the encoding the sound frame signal to generate target call information includes:

performing voice detection on the sound frame signal, and generating a target voice frame identifier based on a voice detection result;

preprocessing the sound frame signal, and performing voice coding processing on the sound frame signal after preprocessing to obtain a target voice frame signal;

counting the target voice frame identifiers in the target voice frame signals to obtain target voice frame numbers, and taking the target voice frame identifiers and the target voice frame numbers as target attribute information of the target voice frame signals;

and fusing the target voice frame signal and the target attribute information of the target voice frame signal to obtain target call information.
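The sending-side steps of claims 13 and 14 can be sketched as follows. This is a simplified illustration under several assumptions: voice detection is a mean-energy test, "encoding" is replaced by packing samples as 16-bit little-endian PCM instead of a real speech codec, preprocessing is omitted, and the target voice frame number is kept as a running count. The `CallPacket` type and the function names are hypothetical.

```python
import struct
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CallPacket:
    payload: bytes          # stand-in for the target voice frame signal
    is_voice: bool          # target voice frame identifier
    voice_frame_count: int  # running target voice frame number


def split_frames(samples: Sequence[float], frame_len: int) -> List[List[float]]:
    """Framing: fixed-length frames; an incomplete trailing frame is dropped."""
    full = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(full)]


def build_packets(
    samples: Sequence[float],
    frame_len: int,
    energy_threshold: float = 1e-3,
) -> List[CallPacket]:
    packets: List[CallPacket] = []
    voiced_so_far = 0
    for frame in split_frames(samples, frame_len):
        # Voice detection (assumed energy test) yields the frame identifier.
        flag = sum(x * x for x in frame) / len(frame) > energy_threshold
        voiced_so_far += int(flag)
        # Placeholder "encoding": clip to [-1, 1] and pack as 16-bit PCM.
        pcm = [int(max(-1.0, min(1.0, x)) * 32767) for x in frame]
        payload = struct.pack(f"<{len(pcm)}h", *pcm)
        packets.append(CallPacket(payload, flag, voiced_so_far))
    return packets


signal = [0.5] * 4 + [0.0] * 4 + [0.5] * 4
for packet in build_packets(signal, frame_len=4):
    print(packet.is_voice, packet.voice_frame_count, len(packet.payload))
```

Carrying the identifier and the running count alongside each encoded frame lets the receiving terminal compare how many voice frames were sent against how many it received and played.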

15. A call state detection device, comprising:

the acquisition unit is used for playing the to-be-played voice frame signals and acquiring the played to-be-played voice frame signals to obtain recording frame signals corresponding to the to-be-played voice frame signals;

16. An electronic device comprising a processor and a memory, the memory storing an application, the processor being configured to run the application in the memory to perform the steps in the call state detection method of any one of claims 1 to 14.

17. A computer program product comprising computer programs/instructions which when executed by a processor implement the steps in the call state detection method of any one of claims 1 to 14.

18. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps in the call state detection method of any one of claims 1 to 14.