CN110782901B - Method, storage medium and device for identifying voice of network telephone - Google Patents

Method, storage medium and device for identifying voice of network telephone

Info

Publication number
CN110782901B
CN110782901B (application CN201911071415.7A)
Authority
CN
China
Prior art keywords
neural network
voice
voice signal
network structure
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911071415.7A
Other languages
Chinese (zh)
Other versions
CN110782901A (en)
Inventor
黄远坤
李斌
黄继武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN201911071415.7A priority Critical patent/CN110782901B/en
Publication of CN110782901A publication Critical patent/CN110782901A/en
Application granted granted Critical
Publication of CN110782901B publication Critical patent/CN110782901B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification

Abstract

The invention provides a method, a storage medium and a device for identifying network telephone (VoIP) voice, wherein the method comprises the following steps: respectively converting the filtered voice signal into a standardized Mel scale spectrogram, a standardized inverse Mel scale spectrogram and a stacked waveform frame signal; taking the stacked waveform frame signal as network input and extracting time domain information of the voice signal through a first convolutional neural network structure; respectively taking the standardized Mel scale spectrogram and the standardized inverse Mel scale spectrogram as network inputs and extracting frequency domain information of the voice signal through a second convolutional neural network structure; and inputting the time domain information and the frequency domain information of the voice signal into a trained classification module, which outputs the classification result. The invention can effectively identify network telephone voice from a fixed source to a fixed terminal, and can also quickly and efficiently identify network telephone voice generated by an unknown source and an unknown terminal.

Description

Method, storage medium and device for identifying voice of network telephone
Technical Field
The present invention relates to the field of voice-over-internet-protocol (VoIP) speech recognition, and in particular to a method, a storage medium and an apparatus for recognizing network telephone voice.
Background
With the development of the internet and the increasing maturity of audio compression technology, people's modes of communication have diversified. The advent of voice-over-IP (VoIP) technology lets people communicate more conveniently and at lower cost. VoIP therefore attracts a large number of users and is gradually replacing traditional fixed-line and mobile telephony as one of the main modes of communication. Unlike a user of a traditional fixed or mobile telephone, a VoIP user has no fixed telephone number and needs no SIM card to place a call; the user simply dials the other party's number through specific VoIP software. Moreover, some VoIP software allows users to set their caller number at will, and the modified number is never verified. Criminals can exploit this loophole to disguise their identity by setting their own number to a specific one (such as the number of a public security bureau, a bank or a government department) through number-changing VoIP software, and then commit fraud. Identifying whether an incoming call is a network telephone call can therefore help the called party judge the caller's identity and, to some extent, avoid such fraud.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, an object of the present invention is to provide a method, a storage medium and a device for recognizing network telephone voice, intended to solve the problem that the prior art cannot efficiently recognize network telephone voice from an unknown source.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a method for recognizing voice of a network phone, comprising the steps of:
decompressing, resampling and high-pass filtering the received voice signal to obtain a filtered voice signal;
respectively converting the filtered voice signals into a standardized Mel scale spectrogram, a standardized inverse Mel scale spectrogram and a stacked waveform frame signal;
taking the stacked waveform frame signal as network input, and extracting time domain information of the voice signal through a first convolutional neural network structure;
respectively taking the standardized Mel scale spectrogram and the standardized inverse Mel scale spectrogram as network inputs, and extracting frequency domain information of the voice signal through a second convolutional neural network structure;
and inputting the time domain information and the frequency domain information of the voice signal into a trained classification module, and outputting a classification result.
The method for recognizing the voice of the network telephone is characterized in that the step of decompressing, resampling and carrying out high-pass filtering processing on the received voice signal to obtain a filtered voice signal comprises the following steps:
decompressing the received voice signal, and resampling to 8kHz and 16 bit waveform signals;
and carrying out high-pass filtering on the resampled waveform signal by adopting a second-order difference filter to obtain a filtered voice signal: x[n] = −0.5·s[n−1] + s[n] − 0.5·s[n+1], where n indexes the samples of the time-domain signal.
The method for recognizing the voice of the network telephone is characterized in that the step of respectively converting the filtered voice signal into a standardized Mel scale spectrogram, a standardized inverse Mel scale spectrogram and a stacked waveform frame signal comprises the following steps:

performing framing processing on the filtered voice signal to obtain the nth sample of the ith frame: y_i[n] = w[n] · x[(i−1)·S + n], wherein w[n] is the Hamming window function

w[n] = 0.54 − 0.46 · cos(2πn/(N−1)), 0 ≤ n ≤ N−1;

converting the framed, filtered voice signal into a standardized Mel scale spectrogram

S̄_mel[i, m] = (S_mel[i, m] − μ) / σ, with S_mel[i, m] = log( Σ_{k=0}^{N/2} E_i[k] · K_m[k] ),

wherein E_i[k] = |Y_i[k]|² is the spectral energy and Y_i[k] the N-point discrete Fourier transform of the ith frame, Mel(f) = 2595 · log10(1 + f/700) is the Mel scale frequency, f is the Hertz scale frequency, K_b is a boundary point, f_L = 0 is the lowest frequency, f_H = F_s/2 is the highest frequency, F_s is the sampling frequency, ΔF is the frequency resolution, f_k = k · ΔF is the frequency of the kth discrete Fourier transform data point, K_m (m ∈ {1, 2, …, M}) is the mth Mel-band filter, and μ and σ are the mean and standard deviation of the spectrogram;

converting the framed, filtered voice signal into a standardized inverse Mel scale spectrogram

S̄_imel[i, m] = (S_imel[i, m] − μ) / σ,

computed in the same way but with the filter bank constructed on the inverse Mel scale Mel⁻¹(m) = 700 · (10^(m/2595) − 1);

converting the framed, filtered voice signal into a stacked waveform frame signal, i.e. stacking the I frames y_i row by row into an I × N stacked waveform frame matrix Y = [y_1; y_2; …; y_I].
the method for recognizing the voice of the network telephone, wherein the step of extracting the time domain information of the voice signal through a first convolutional neural network structure by taking the heap waveform frame signal as a network input, comprises the following steps:
the method comprises the steps that a first convolution neural network structure is constructed in advance, the first convolution neural network structure comprises 6 groups of convolution modules which are sequentially connected in series from an input end to an output end, the first 5 groups of convolution modules from the input end comprise a convolution layer, a maximum pooling layer, a linear rectification function and two batch normalization layers, and the last group of convolution modules comprise a convolution layer, a batch normalization layer, a linear rectification function and a global average pooling layer;
performing network parameter training on the first convolutional neural network structure through a main classifier connected to the output end of the first convolutional neural network structure and an auxiliary classifier connected to the 4 th group of convolutional modules, wherein the total cost functions of the two classifiers are: LossA ═ α loss0+β*loss1Wherein, loss0And Loss1Respectively representing cross entropy loss functions of a main classifier and an auxiliary classifier, wherein alpha and beta are weights and satisfy that alpha + beta is 1;
and after the training of the first convolutional neural network structure is finished, removing the main classifier and the auxiliary classifier, inputting the stacked waveform frame signal into the first convolutional neural network structure, and extracting time domain information of the voice signal.
The method for recognizing the voice of the network telephone is characterized in that the step of respectively taking the standardized Mel scale spectrogram and the standardized inverse Mel scale spectrogram as network inputs and extracting the frequency domain information of the voice signal through a second convolutional neural network structure comprises the following steps:

constructing the second convolutional neural network structure in advance, the structure comprising 6 groups of convolution modules connected in series from the input end to the output end, wherein each of the first 5 groups of convolution modules from the input end comprises a convolutional layer with a two-dimensional convolution kernel, a max pooling layer, a linear rectification function and two batch normalization layers, and the last group comprises a convolutional layer, a batch normalization layer, a linear rectification function and a global average pooling layer;

performing network parameter training on the second convolutional neural network structure through a main classifier connected to its output end and an auxiliary classifier connected to the 3rd group of convolution modules, the total cost function of the two classifiers being Loss_B = γ · loss_0 + δ · loss_1, wherein loss_0 and loss_1 respectively denote the cross-entropy loss functions of the main classifier and the auxiliary classifier, and γ and δ are weights satisfying γ + δ = 1;

after the second convolutional neural network structure is trained, removing the main classifier and the auxiliary classifier, respectively taking the standardized Mel scale spectrogram and the standardized inverse Mel scale spectrogram as network inputs, and extracting the frequency domain information of the voice signal through the second convolutional neural network structure.
The method for recognizing voice over internet phone, wherein the main classifier comprises a full connection layer and a softmax function.
The method for recognizing the voice of the network telephone is characterized in that the classification module is a fully-connected neural network classifier, and the fully-connected neural network classifier comprises a first fully-connected layer, a second fully-connected layer and a softmax function.
The method for recognizing the voice of the network telephone is characterized in that the time domain information and the frequency domain information of the voice signal are input into a trained classification module, and the step of outputting the classification result comprises the following steps:
inputting the time domain information and the frequency domain information of the voice signal into the fully-connected neural network classifier, the numbers of nodes of the first and second fully-connected layers being set to 768 and 2 respectively;
if the full-connection neural network classifier outputs [0,1], judging that the voice signal is the voice of the network telephone;
and if the full-connection neural network classifier outputs [1,0], judging that the voice signal is the voice of the mobile phone.
A storage medium comprising a plurality of instructions stored thereon, the instructions adapted to be loaded by a processor and to carry out in particular the steps of the method of recognizing voice over internet phone according to the present invention.
An apparatus for recognizing voice of a network telephone, comprising a processor adapted to implement instructions; and a storage medium adapted to store a plurality of instructions adapted to be loaded by the processor and to perform the steps of the method of recognizing voice over internet phone of the present invention.
Advantageous effects: the invention provides a method for recognizing network telephone voice that uses a trained first convolutional neural network structure and a trained second convolutional neural network structure to extract multi-domain depth features of the voice signal, and uses the extracted features to identify network telephone voice. Compared with the prior art, the method can effectively identify network telephone voice from a fixed source to a fixed terminal, and can also quickly and efficiently identify network telephone voice generated by an unknown source and an unknown terminal.
Drawings
Fig. 1 is a flowchart of a method for recognizing voice of a network phone according to an embodiment of the present invention.
FIG. 2a is a spectrogram of a mobile phone call recording at a sampling rate of 48 kHz.
FIG. 2b is a spectrogram of the voice recording of the Internet phone call at a sampling rate of 16 kHz.
FIG. 2c is a diagram of a voice spectrum of a mobile phone call recording at a resampling rate of 8 kHz.
FIG. 2d is a diagram of a voice spectrogram of an Internet phone call recording at a resampling rate of 8 kHz.
Fig. 3a is a histogram of the zero-crossing rate of mobile phone call recordings before and after high-pass filtering.
Fig. 3b is a histogram of the zero-crossing rate of the internet phone call recordings before and after the high-pass filtering.
FIG. 4 is a flow chart of framing and format conversion representation of a speech signal.
Fig. 5 shows the normalized logarithmic spectrogram, the normalized logarithmic Mel scale spectrogram and the normalized logarithmic inverse Mel scale spectrogram of high-pass filtered mobile phone and network telephone call recordings.
Fig. 6a is a schematic structural diagram of a first convolutional neural network structure.
Fig. 6b is a structural diagram of a second convolutional neural network structure.
FIG. 7 is a diagram of the fully-connected neural network classifier designed according to the present invention.
Fig. 8 is a block diagram of an apparatus for recognizing voice of a network phone according to the present invention.
Fig. 9 is a schematic diagram of the detection accuracy of the speech segments with different lengths in scene 0 and scene 5.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
The existing literature proposes identifying network telephones by a network telephone fingerprint. The designed fingerprint comprises two parts: the first part consists of features such as the packet loss rate of voice frames, the correlation between packets, statistical characteristics of the noise spectrum, and the voice quality of the call; the second part is a path-traversal signature formed by feeding the features extracted in the first part into a decision tree. Since the method relies on packet-loss characteristics, packet loss may become a less effective feature as network quality improves. In addition, the method assumes a fixed and unique identification fingerprint for each call source, so it can only detect a network telephone dialed from one fixed source to another designated terminal (e.g., fixed telephone to fixed telephone) and cannot identify network telephones from unknown sources.
The prior literature also provides a method for detecting network telephone voice by identifying the codec: the voice is first re-encoded with known codecs to form a multi-dimensional feature model containing the noise spectrum and a time-domain histogram of the voice signal; the obtained feature model is then compared with the reference feature models of candidate codecs to identify which codec processed a given segment of voice, and whether the voice under test passed through a VoIP network is judged from the codec type. However, since audio data is transmitted across different telephone networks and devices, the encoded voice stream is transcoded as it passes through different types of communication networks, and the last codec used in the PSTN (public switched telephone network) may mask most traces of the codec used in the VoIP network. In addition, VoIP and PSTN networks may use the same codec, such as G.711. Therefore, identifying the codec alone is not sufficient to recognize network telephone voice.
Based on the problems in the prior art, the present embodiment provides a method for recognizing voice of a network phone, as shown in fig. 1, which includes the steps of:
decompressing, resampling and high-pass filtering the received voice signal to obtain a filtered voice signal;
respectively converting the filtered voice signals into a standardized Mel scale spectrogram, a standardized inverse Mel scale spectrogram and a stacked waveform frame signal;
taking the stacked waveform frame signal as network input, and extracting time domain information of the voice signal through a first convolutional neural network structure;
respectively taking the standardized Mel scale spectrogram and the standardized inverse Mel scale spectrogram as network inputs, and extracting frequency domain information of the voice signal through a second convolutional neural network structure;
and inputting the time domain information and the frequency domain information of the voice signal into a trained classification module, and outputting a classification result.
In the embodiment, the trained first convolutional neural network structure and the trained second convolutional neural network structure are adopted to extract the multi-domain depth features of the voice information, and the extracted features are utilized to identify the voice of the network telephone. Compared with the prior art, the method can effectively identify the network telephone voice of the fixed source fixed terminal, and can also quickly and efficiently identify the network telephone voice generated by the unknown source and the unknown terminal.
In some embodiments, the decompressing, resampling and high-pass filtering of the received speech signal to obtain a filtered speech signal includes: decompressing the received voice signal and resampling it to an 8 kHz, 16-bit waveform signal; and high-pass filtering the resampled waveform signal with a second-order difference filter to obtain the filtered voice signal: x[n] = −0.5·s[n−1] + s[n] − 0.5·s[n+1], where n indexes the samples of the time-domain signal.
Specifically, when a handset is used to record the received speech signal, different handsets sample the speech at different default rates, e.g. 16 kHz, 44.1 kHz or 48 kHz, so the captured speech contains many unneeded high-frequency components. Because the bandwidth of VoIP and PSTN networks is limited, this embodiment resamples the decompressed voice signal to an 8 kHz, 16-bit waveform signal, which retains most of the voice content while avoiding the bandwidth limitation. Figs. 2a-2d show the spectrogram of a 48 kHz mobile phone call recording, the spectrogram of a 16 kHz network telephone call recording, and their spectrograms after resampling at 8 kHz; it can be seen that the high-frequency parts of the 48 kHz and 16 kHz voice signals contain little information, while the resampled signals still retain most of the information of the original voice.
This embodiment also high-pass filters the resampled voice signal: because the human ear is insensitive to the high-frequency part of a speech signal, network telephone voice and mobile phone call voice are difficult to distinguish by ear. This embodiment therefore uses a high-pass filter to suppress the low-frequency part of the speech signal and amplify the high-frequency information. Figs. 3a-3b show the zero-crossing rates of 12000 segments of 25 ms mobile phone and network telephone call recordings before and after passing through a high-pass filter; it can be seen that the difference in high-frequency content between the two is amplified after high-pass filtering. For a signal s[n], this embodiment applies a second-order difference filter, and the high-pass filtered signal x[n] is obtained by the following formula: x[n] = −0.5·s[n−1] + s[n] − 0.5·s[n+1], where n indexes the samples of the time-domain signal.
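As a concrete illustration, the second-order difference filter above can be applied sample by sample. The sketch below is a minimal pure-Python version; the zero-padding at the signal boundaries is an assumption, since the patent does not specify boundary handling:

```python
def high_pass(s):
    """Second-order difference high-pass filter from the text:
    x[n] = -0.5*s[n-1] + s[n] - 0.5*s[n+1].
    Out-of-range neighbours are treated as zero (an assumption)."""
    n = len(s)
    x = []
    for i in range(n):
        prev = s[i - 1] if i - 1 >= 0 else 0.0
        nxt = s[i + 1] if i + 1 < n else 0.0
        x.append(-0.5 * prev + s[i] - 0.5 * nxt)
    return x

# A constant (DC) signal is suppressed, a fast alternation is amplified:
print(high_pass([1.0, 1.0, 1.0, 1.0]))    # interior samples -> 0.0
print(high_pass([1.0, -1.0, 1.0, -1.0]))  # interior samples -> ±2.0
```

This matches the stated purpose of the filter: low-frequency (slowly varying) content is attenuated while high-frequency content is emphasized.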
In some embodiments, to facilitate subsequent feature extraction by deep learning, the voice signal is, after decompression, resampling and high-pass filtering, framed and converted into three forms: a Mel scale spectrogram, an inverse Mel scale spectrogram and stacked waveform frames, as shown in Fig. 4.
Specifically, in this embodiment the frame length of an audio frame is denoted by N and the frame shift by S; the nth sample of the ith frame is then obtained by the following formula: y_i[n] = w[n] · x[(i−1)·S + n], wherein w[n] is the Hamming window function, defined by the formula

w[n] = 0.54 − 0.46 · cos(2πn/(N−1)), 0 ≤ n ≤ N−1.
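The framing step can be sketched as follows; the frame length N = 8 and frame shift S = 4 below are illustrative values only, not parameters taken from the patent:

```python
import math

def hamming(N):
    # Standard Hamming window: w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1))
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def frame_signal(x, N, S):
    """Split x into frames of length N with frame shift S and apply the
    Hamming window: y_i[n] = w[n] * x[(i-1)*S + n] (i is 1-based in the text)."""
    w = hamming(N)
    frames = []
    start = 0
    while start + N <= len(x):
        frames.append([w[n] * x[start + n] for n in range(N)])
        start += S
    return frames

frames = frame_signal([1.0] * 20, N=8, S=4)
print(len(frames), len(frames[0]))  # 4 frames of 8 samples each
```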
And then calculating a Mel scale spectrogram, an inverse Mel scale spectrogram and a stacked waveform frame signal of the voice after framing.
This embodiment uses Mel(·) to denote the transformation from the Hertz scale to the Mel scale and Mel⁻¹(·) to denote the inverse transformation. Compared to the Hertz scale frequency f, the Mel scale frequency is calculated by the following formula:

Mel(f) = 2595 · log10(1 + f/700).

The inverse Mel scale can be calculated by the following formula:

Mel⁻¹(m) = 700 · (10^(m/2595) − 1).
wherein f_L, f_H and F_s are respectively the lowest frequency, the highest frequency and the sampling frequency; in this embodiment the sampling frequency is set to 8000 Hz, the lowest frequency to 0, and the highest frequency to half the sampling frequency, i.e. 4000 Hz. N is the number of discrete Fourier transform points, equal to the number of points in a frame. For the ith frame of the speech signal, its discrete Fourier transform is first computed:

Y_i[k] = Σ_{n=0}^{N−1} y_i[n] · e^(−j·2πnk/N), k = 0, 1, …, N−1,

wherein the frequency of the kth discrete Fourier transform data point is f_k = k · ΔF, ΔF = F_s/N being the frequency resolution.
the spectral energy of each frame is calculated by the following formula: ei[k]=|Yi[k]|2Since the spectrum is symmetrical, the present embodiment selects only half of the spectrum, i.e. the spectrum is symmetrical
Figure BDA0002261069560000123
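A minimal sketch of this step, computing E_i[k] = |Y_i[k]|² for the non-redundant half of the spectrum; a direct O(N²) DFT is used here for clarity rather than an FFT:

```python
import cmath

def frame_energy_spectrum(y):
    """Spectral energy E_i[k] = |Y_i[k]|^2 for k = 0..N/2 (only half of
    the symmetric spectrum), with Y_i the DFT of one frame."""
    N = len(y)
    half = []
    for k in range(N // 2 + 1):
        Yk = sum(y[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        half.append(abs(Yk) ** 2)
    return half

# A 4-point cosine at bin k = 1 puts all its energy in E[1]:
E = frame_energy_spectrum([1.0, 0.0, -1.0, 0.0])
print(E)
```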
Next, this embodiment filters the energy spectrum with M Mel triangular filters, wherein the mth triangular filter in the filter bank is defined as:

K_m[k] = 0, for k < K_b(m−1) or k > K_b(m+1);
K_m[k] = (k − K_b(m−1)) / (K_b(m) − K_b(m−1)), for K_b(m−1) ≤ k ≤ K_b(m);
K_m[k] = (K_b(m+1) − k) / (K_b(m+1) − K_b(m)), for K_b(m) < k ≤ K_b(m+1);

wherein the boundary points are K_b(m) = round( Mel⁻¹( Mel(f_L) + m · (Mel(f_H) − Mel(f_L)) / (M+1) ) / ΔF ), f_L = 0 is the lowest frequency, f_H = F_s/2 is the highest frequency, and K_b(m) (m ∈ {1, 2, …, M}) corresponds to the center frequency of the mth Mel-band filter. This embodiment then calculates the Mel scale spectrogram by the following formula:

S_mel[i, m] = log( Σ_{k=0}^{N/2} E_i[k] · K_m[k] ).
Finally, the Mel scale spectrogram is standardized to obtain the final standardized Mel scale spectrogram:

S̄_mel[i, m] = (S_mel[i, m] − μ) / σ,

wherein μ and σ are respectively the mean and standard deviation of the Mel scale spectrogram.
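The Mel filter bank construction described above can be sketched as follows. The exact rounding of the boundary points K_b is an assumption (the original equation images are not recoverable), and the parameter values M = 10, N = 64 are illustrative:

```python
import math

def hz_to_mel(f):
    # Mel(f) = 2595*log10(1 + f/700)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Mel^-1(m) = 700*(10^(m/2595) - 1)
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(M, N, fs):
    """M triangular filters over DFT bins 0..N//2, with boundary points
    spaced uniformly on the Mel scale between f_L = 0 and f_H = fs/2."""
    f_lo, f_hi = 0.0, fs / 2.0
    mels = [hz_to_mel(f_lo) + (hz_to_mel(f_hi) - hz_to_mel(f_lo)) * i / (M + 1)
            for i in range(M + 2)]
    kb = [int(round(mel_to_hz(m) / (fs / N))) for m in mels]  # boundary bins K_b
    filters = []
    for m in range(1, M + 1):
        filt = [0.0] * (N // 2 + 1)
        for k in range(kb[m - 1], kb[m]):          # rising slope
            filt[k] = (k - kb[m - 1]) / max(1, kb[m] - kb[m - 1])
        for k in range(kb[m], min(kb[m + 1], N // 2) + 1):  # falling slope
            filt[k] = (kb[m + 1] - k) / max(1, kb[m + 1] - kb[m])
        filters.append(filt)
    return filters

fb = mel_filterbank(M=10, N=64, fs=8000)
print(len(fb), len(fb[0]))  # 10 filters over 33 frequency bins
```

Applying each filter to a frame's energy spectrum and taking the logarithm of the filtered sums yields one column of the Mel scale spectrogram.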
in some embodiments, rather than calculating a mel scale spectrogram, the present implementation filters the energy spectrum using an inverse mel-triangle filter bank in a spherical inverse mel scale spectrogram, wherein the mth inverse mel filter is defined as:
Figure BDA0002261069560000131
Figure BDA0002261069560000132
calculating an inverse mel scale spectrum by the following formula:
Figure BDA0002261069560000133
and finally, carrying out standardization processing on the anti-Mel scale spectrogram to obtain a final standardized anti-Mel scale spectrogram:
Figure BDA0002261069560000134
wherein the content of the first and second substances,
Figure BDA0002261069560000135
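The standardization step, shared by the Mel and inverse Mel spectrograms, is a plain z-score. Whether the mean and standard deviation are computed over the whole spectrogram or per band is not stated in the text, so a global computation is assumed here:

```python
import math

def standardize(spec):
    """Z-score standardization of a spectrogram (list of rows):
    subtract the mean and divide by the standard deviation, both
    computed over the whole spectrogram (an assumption)."""
    flat = [v for row in spec for v in row]
    mu = sum(flat) / len(flat)
    sd = math.sqrt(sum((v - mu) ** 2 for v in flat) / len(flat))
    return [[(v - mu) / sd for v in row] for row in spec]

norm = standardize([[1.0, 2.0], [3.0, 4.0]])
print(norm)  # zero-mean, unit-variance values
```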
Fig. 5 shows normalized spectrograms of different scales for the high-pass filtered signals; it can be seen that the energy of the high-frequency part of a network telephone recording is weaker than that of a mobile phone call recording. Compared with the Hertz scale spectrogram, the Mel scale and inverse Mel scale spectrograms enlarge the difference between mobile phone call recordings and network telephone call recordings while reducing the differences among different network telephone call recordings.
In some embodiments, compared with the Mel scale spectrogram and the inverse Mel scale spectrogram, the stacked waveform frame retains the phase information of the speech signal, so using stacked waveform frames as input to the deep network captures information that cannot be captured in a spectrogram. The I frames y_i are stacked row by row into an I × N dimensional stacked waveform frame matrix Y = [y_1; y_2; …; y_I].
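A minimal sketch of building the stacked waveform frame matrix; whether the stacked frames are windowed or raw is not fully specified in the text, so raw frames are assumed:

```python
def stacked_waveform_frames(x, N, S):
    """I x N stacked waveform frame matrix: row i (0-based) is
    x[i*S : i*S + N]. Unlike a magnitude spectrogram, this keeps
    the phase information of the time-domain signal."""
    I = (len(x) - N) // S + 1
    return [x[i * S : i * S + N] for i in range(I)]

m = stacked_waveform_frames(list(range(20)), N=8, S=4)
print(len(m), len(m[0]))  # 4 rows of 8 samples -> a 4 x 8 matrix
```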
In some embodiments, two different convolutional neural network structures are designed: the time domain information and the frequency domain information of the speech signal are extracted through the first and the second convolutional neural network structure respectively, and the extracted depth features finally form a 2304-dimensional vector.
In some embodiments, the step of extracting time domain information of the speech signal by the first convolutional neural network structure with the stacked waveform frame signal as a network input comprises: as shown in fig. 6a, a first convolutional neural network structure is pre-constructed, the first convolutional neural network structure includes 6 groups of convolutional modules connected in series in sequence from an input end to an output end, the first 5 groups of convolutional modules from the input end each include a convolutional layer, a max pooling layer, a linear rectification function (ReLU), and two batch normalization layers (BN), and the last group of convolutional modules includes a convolutional layer, a batch normalization layer, a linear rectification function, and a global average pooling layer; in the embodiment, the first 3 groups of convolution modules are designed to be one-dimensional convolution for extracting the intra-frame features of each frame of the stacked waveform frame, the 4 th convolution module is designed to be two-dimensional convolution for extracting the intra-frame features and fusing the inter-frame extracted information, and the last two convolution modules are also designed to be one-dimensional convolution for fusing the inter-frame features.
In this embodiment, a main classifier composed of a fully-connected layer and a softmax function is connected to the output end of the first convolutional neural network, and an auxiliary classifier is connected to the output end of the 4th convolution module; network parameter training is performed on the first convolutional neural network structure through the main classifier and the auxiliary classifier, the total cost function of the two classifiers being Loss_A = α · loss_0 + β · loss_1, wherein loss_0 and loss_1 respectively denote the cross-entropy loss functions of the main classifier and the auxiliary classifier, and α and β are weights satisfying α + β = 1. After training of the first convolutional neural network structure is finished, the main classifier and the auxiliary classifier are removed, the stacked waveform frame signal is input into the first convolutional neural network structure (SWF-net), and the time domain information of the voice signal is extracted.
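The joint cost Loss_A = α·loss_0 + β·loss_1 can be illustrated numerically. The weights 0.7/0.3 below are arbitrary examples satisfying α + β = 1, not values taken from the patent:

```python
import math

def cross_entropy(probs, label):
    # Cross-entropy loss for one sample given predicted class probabilities
    return -math.log(probs[label])

def total_loss(main_probs, aux_probs, label, alpha=0.7, beta=0.3):
    """Loss_A = alpha*loss_0 + beta*loss_1 with alpha + beta = 1.
    loss_0 comes from the main classifier, loss_1 from the auxiliary one;
    the 0.7/0.3 split is illustrative only."""
    assert abs(alpha + beta - 1.0) < 1e-9
    return alpha * cross_entropy(main_probs, label) + beta * cross_entropy(aux_probs, label)

v = total_loss([0.9, 0.1], [0.8, 0.2], label=0)
print(v)  # a weighted combination of the two cross-entropy terms
```

The auxiliary classifier only shapes the gradients during training; both classifiers are discarded afterwards, as the text describes.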
In some embodiments, the step of taking the Mel scale spectrogram and the inverse Mel scale spectrogram respectively as network inputs and extracting the frequency domain information of the speech signal through a second convolutional neural network structure comprises: as shown in fig. 6b, a second convolutional neural network structure is pre-constructed, comprising 6 groups of convolution modules connected in series from the input end to the output end, wherein each of the first 5 groups of convolution modules from the input end comprises a convolutional layer, a max pooling layer, a linear rectification function and two batch normalization layers, and the last group comprises a convolutional layer, a batch normalization layer, a linear rectification function and a global average pooling layer. In this embodiment the second convolutional neural network structure differs from the first in two respects. The first difference is the dimensionality of the kernel functions in the convolution modules: since two adjacent data points of a spectrogram are correlated, this embodiment directly uses two-dimensional convolution kernels in the first 5 convolution modules of the second convolutional neural network structure to extract features, and designs specific pooling sizes and convolution strides so that the second convolutional neural network structure has the same output dimension as the first. The second difference is the location of the auxiliary classifier, which in this embodiment is connected at the output of the 3rd convolution module.
Network parameter training is performed on the second convolutional neural network structure through the main classifier and the auxiliary classifier, where the total cost function of the two classifiers is: LossB = γ*loss0 + δ*loss1, where loss0 and loss1 respectively represent the cross entropy loss functions of the main classifier and the auxiliary classifier, and γ and δ are weights satisfying γ + δ = 1. After the training of the second convolutional neural network structure is finished, the main classifier and the auxiliary classifier are removed, and the corresponding convolutional neural network models are named MS-net and IMS-net; the standardized Mel scale spectrogram and the standardized inverse Mel scale spectrogram are then respectively taken as network inputs, and the frequency domain information of the voice signal is extracted through the second convolutional neural network structure.
In some embodiments, after the depth features are extracted, a classification module is used to fuse the extracted features and make the final decision. Many classifiers can serve as the classification module at this stage; this embodiment uses a fully connected neural network classifier. As shown in fig. 7, the fully connected neural network classifier consists of two fully connected layers and one softmax function, where the numbers of nodes of the first and second fully connected layers are 768 and 2, respectively. The extracted 2304-dimensional feature vector is input into the trained fully connected neural network classifier, which outputs [0,1] or [1,0]; in this embodiment, the output [0,1] indicates that the classifier determines the voice signal to be detected to be a VoIP internet phone call recording, and the output [1,0] indicates that the classifier determines it to be a mobile phone call recording.
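A minimal sketch of this decision stage follows; the hidden-layer activation between the two fully connected layers is not specified in the text, so a linear rectification (ReLU) is assumed, and the weight matrices are placeholders standing in for trained parameters:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def fc_classifier(features, W1, b1, W2, b2):
    """Two fully connected layers (2304 -> 768 -> 2 in the embodiment) plus softmax.
    Returns a one-hot decision: [0, 1] -> VoIP call recording, [1, 0] -> mobile call recording."""
    h = np.maximum(0.0, W1 @ features + b1)  # first FC layer (ReLU assumed, not stated)
    probs = softmax(W2 @ h + b2)             # second FC layer + softmax
    decision = np.zeros(2, dtype=int)
    decision[int(np.argmax(probs))] = 1
    return decision
```

The one-hot output simply reports which of the two softmax probabilities is larger, matching the [0,1] / [1,0] convention described above.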
In some embodiments, a storage medium is also provided, in which a plurality of instructions are stored, the instructions being adapted to be loaded by a processor to perform the steps of the method for recognizing the voice of a network telephone according to the present invention.
In some embodiments, there is also provided an apparatus for recognizing a network voice signal, wherein, as shown in fig. 8, the apparatus comprises a processor 10 adapted to implement instructions; and a storage medium 20 adapted to store a plurality of instructions adapted to be loaded by the processor 10 and to carry out the steps of the method of recognizing voice over internet phone.
The recognition performance of the present invention on network voice signals is tested on the existing VPCID database as follows:
The present embodiment sets the frame length N to 240, the frame shift S to 120, the number M of triangular filters to 48, and the frame number L to 132, obtaining a 48 × 132 Mel scale spectrum, a 48 × 132 inverse Mel scale spectrum, and a 240 × 132 stacked waveform frame. As evaluation indexes, the present embodiment selects the true positive rate, the true negative rate, and the accuracy rate, where internet phone voice is the positive sample and mobile phone call voice is the negative sample.
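The stated dimensions can be sanity-checked with the standard framing formula, assuming 2-second segments at the 8 kHz resampling rate used by the method:

```python
N, S, M = 240, 120, 48      # frame length, frame shift, number of triangular filters
Fs = 8000                   # sampling rate after resampling (8 kHz)
samples = 2 * Fs            # one 2-second speech segment
L = (samples - N) // S + 1  # number of frames under these settings
assert L == 132             # matches the 48x132 spectrograms and the 240x132 stacked frames
```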
1. Verifying the effect brought by the auxiliary classifier:
Table 1 shows the detection performance on 2-second-long speech segments when classifiers with different weights are used in SWF-net, and Table 2 shows the detection performance on 2-second-long speech segments when classifiers with different weights are used in MS-net and IMS-net.
TABLE 1 Detection performance of classifiers in SWF-net using different weights
[Table 1 is presented as an image in the original publication]
TABLE 2 detection Performance of classifiers in MS-net and IMS-net using different weights
[Table 2 is presented as an image in the original publication]
As can be seen from Tables 1-2, the detection performance with the auxiliary classifier is better than that without the auxiliary classifier, and the weight of the auxiliary classifier leaves room for further optimization.
2. Overall effect of using different classification modules:
The classification module of the present invention can be composed of different classifiers or even decision methods, and good performance can be achieved with any of them. Table 3 shows the detection performance on 2-second-long speech segments when using a universal background model with Gaussian mixture models (UBM-GMM), an FLD ensemble classifier (FLD Ensemble), a majority voting mechanism (Majority Voting), and a fully connected neural network classifier (NN-based classifier) as the classification module.
TABLE 3 detection Performance Using different classifiers
[Table 3 is presented as an image in the original publication]
As can be seen from the table, very good detection performance can be achieved using different classification methods, and the fully connected neural network classifier designed by the present invention performs better than the existing classifiers and decision methods.
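One of the alternative classification modules in Table 3, the majority voting mechanism, reduces to a very simple decision rule; a sketch follows (the per-classifier binary labels are hypothetical inputs, with 1 standing for a VoIP decision and 0 for a mobile decision):

```python
from collections import Counter

def majority_vote(labels):
    """Return the label predicted by most of the base classifiers."""
    return Counter(labels).most_common(1)[0][0]
```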
3. Detecting the effect of different source voice signals:
Since the present invention starts from actual call recordings, the algorithm design process does not depend on data from any particular source, so the present invention can effectively detect call voices from different sources. To verify this, the present invention sets up different scenarios; Table 4 shows the degree of source mismatch between the training data and the test data in each scenario, where a flagged factor indicates that the training data and the test data are of different origins in that factor. Table 5 shows the detection performance of the present invention on 2-second-long speech segments in the different scenarios. As can be seen from Table 5, the method proposed by the present invention has good detection performance for voices from many different sources.
TABLE 4 mismatch factors in various detection scenarios
[Table 4 is presented as an image in the original publication]
TABLE 5 detection Performance of the present invention under various mismatch factor scenarios
[Table 5 is presented as an image in the original publication]
4. Detecting the effect of voice segments with different lengths:
The algorithm designed by the present invention extracts the intra-frame features of short-time audio frames and fuses the inter-frame features during feature extraction, so the present invention can be used to detect speech segments of different lengths. Fig. 9 shows the detection accuracy of the present invention for speech segments of 1, 2, 4, 6, and 8 seconds in scene 0 and scene 5. As can be seen from the figure, the detection performance of the present invention essentially saturates for speech segments longer than 6 seconds, and segments shorter than 6 seconds also achieve good detection accuracy.
In summary, based on the statistical characteristics of voice signals, the present invention finds that the energy of network telephone voice signals in the high-frequency part is weaker than that of mobile telephone voice signals, so for the problem of detecting network telephone voice signals, the present invention first performs high-pass filtering on the voice signal. By analyzing spectrograms of different scales, the present invention finds that, compared with a linear scale spectrogram, a nonlinear scale spectrogram can amplify the differences between different types of call recording signals and reduce the differences within the same type of voice signals, so the Mel scale spectrogram and the inverse Mel scale spectrogram are used as network inputs in the algorithm design. Compared with the prior art, the present invention abandons manually designed features and uses deep learning to automatically extract features from the network inputs: the first convolutional neural network structure designed by the present invention extracts intra-frame features first and then integrates inter-frame features to extract deep features, while the designed second convolutional neural network structure directly extracts frequency domain features from adjacent data points of the spectrogram. The information extracted by the three trained subnetworks can be effectively fused together; by combining these techniques, the present invention can effectively detect network telephone call recordings from the same source and robustly detect network telephone call recordings from different sources.
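The high-pass filtering step summarized above (the second-order differential filter given in claim 2) can be sketched as follows; boundary handling is not specified in the text, so only interior samples are returned here:

```python
import numpy as np

def highpass(s):
    """Second-order differential high-pass filter: x[n] = -0.5*s[n-1] + s[n] - 0.5*s[n+1]."""
    s = np.asarray(s, dtype=float)
    # vectorized over interior samples n = 1 .. len(s) - 2
    return -0.5 * s[:-2] + s[1:-1] - 0.5 * s[2:]
```

A constant signal is mapped to zero, confirming that the filter suppresses the low-frequency content and emphasizes the high-frequency part where the two recording types differ most.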
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (7)

1. A method for recognizing voice of a network telephone, comprising the steps of:
decompressing, resampling and high-pass filtering the received voice signal to obtain a filtered voice signal;
respectively converting the filtered voice signals into a standardized Mel scale spectrogram, a standardized inverse Mel scale spectrogram and a stacked waveform frame signal;
taking the stacked waveform frame signal as the network input, and extracting time domain information of a voice signal through a first convolutional neural network structure;
respectively taking the standardized Mel scale spectrogram and the standardized inverse Mel scale spectrogram as network inputs, and extracting frequency domain information of the voice signal through a second convolutional neural network structure;
inputting the time domain information and the frequency domain information of the voice signal into a trained classification module, and outputting a classification result;
the step of converting the filtered voice signal into a standardized Mel scale spectrogram, a standardized inverse Mel scale spectrogram and a stacked waveform frame signal respectively comprises:
performing framing processing on the filtered voice signal to obtain the nth sample of the ith frame: yi[n] = w[n]*x[(i-1)*S+n], wherein,
[Equation presented as an image in the original publication]
converting the filtered voice signal into a standardized Mel scale spectrogram:
[Equation presented as an image in the original publication]
wherein,
[Equations presented as images in the original publication]
is the Mel scale frequency, f is the Hertz scale frequency, Kb is a boundary point, fL = 0 is the lowest frequency, fH = Fs/2 is the highest frequency, Fs is the sampling frequency, ΔF is the frequency resolution, Fk is the frequency of the kth discrete Fourier transform data point, and Km (m ∈ {1, 2, …, M}) is the mth Mel-band filter;
converting the framed filtered voice signal into a standardized inverse Mel scale spectrogram:
[Equation presented as an image in the original publication]
wherein,
[Equations presented as images in the original publication]
is the inverse Mel scale frequency;
converting the framed filtered voice signal into a stacked waveform frame signal:
[Equation presented as an image in the original publication]
the step of taking the stacked waveform frame signal as the network input and extracting the time domain information of the voice signal through a first convolutional neural network structure comprises:
the method comprises the steps that a first convolution neural network structure is constructed in advance, the first convolution neural network structure comprises 6 groups of convolution modules which are sequentially connected in series from an input end to an output end, the first 5 groups of convolution modules from the input end comprise a convolution layer, a maximum pooling layer, a linear rectification function and two batch normalization layers, and the last group of convolution modules comprise a convolution layer, a batch normalization layer, a linear rectification function and a global average pooling layer;
performing network parameter training on the first convolutional neural network structure through a main classifier connected to the output end of the first convolutional neural network structure and an auxiliary classifier connected to the fourth group of convolution modules, wherein the total cost function of the two classifiers is: LossA = α*loss0 + β*loss1, wherein loss0 and loss1 respectively represent the cross entropy loss functions of the main classifier and the auxiliary classifier, α and β are weights, and α + β = 1;
after the training of the first convolutional neural network structure is finished, removing the main classifier and the auxiliary classifier, inputting the stacked waveform frame signal into the first convolutional neural network structure, and extracting time domain information of a voice signal;
the step of extracting the frequency domain information of the voice signal through a second convolutional neural network structure by respectively taking the normalized Mel scale spectrogram and the normalized inverse Mel scale spectrogram as network inputs comprises:
a second convolutional neural network structure is constructed in advance, the second convolutional neural network structure comprises 6 groups of convolutional modules which are sequentially connected in series from an input end to an output end, the first 5 groups of convolutional modules from the input end comprise a convolutional layer, a maximum pooling layer, a two-dimensional convolutional kernel function and two batch normalization layers, and the last group of convolutional modules comprise a convolutional layer, a batch normalization layer, a linear rectification function and a global average pooling layer;
performing network parameter training on the second convolutional neural network structure through a main classifier connected to the output end of the second convolutional neural network structure and an auxiliary classifier connected to the third group of convolution modules, wherein the total cost function of the two classifiers is: LossB = γ*loss0 + δ*loss1, wherein loss0 and loss1 respectively represent the cross entropy loss functions of the main classifier and the auxiliary classifier, γ and δ are weights, and γ + δ = 1;
and after the second convolutional neural network structure is trained, removing the main classifier and the auxiliary classifier, respectively taking the standardized Mel scale spectrogram and the standardized reverse Mel scale spectrogram as network inputs, and extracting frequency domain information of the voice signal through the second convolutional neural network structure.
2. The method of claim 1, wherein the step of decompressing, resampling and high-pass filtering the received voice signal to obtain a filtered voice signal comprises:
decompressing the received voice signal, and resampling it to an 8 kHz, 16-bit waveform signal;
and carrying out high-pass filtering on the resampled waveform signal by adopting a second-order differential filter to obtain a filtered voice signal: x[n] = -0.5*s[n-1] + s[n] - 0.5*s[n+1], where n represents a sample point in the time domain signal.
3. The method of claim 1, wherein the primary classifier comprises a full connectivity layer and a softmax function.
4. The method for recognizing voice over internet phone of claim 1, wherein the classification module is a fully-connected neural network classifier comprising a first fully-connected layer, a second fully-connected layer and a softmax function.
5. The method of claim 4, wherein the time domain information and the frequency domain information of the voice signal are input into a trained classification module, and the step of outputting the classification result comprises:
inputting the time domain information and the frequency domain information of the voice signal into the fully-connected neural network classifier, and respectively setting the node numbers of the first layer of fully-connected layer and the second layer of fully-connected layer to 768 and 2;
if the full-connection neural network classifier outputs [0,1], judging that the voice signal is the voice of the network telephone;
and if the full-connection neural network classifier outputs [1,0], judging that the voice signal is the voice of the mobile phone.
6. A storage medium comprising a plurality of instructions stored thereon, the instructions adapted to be loaded by a processor and to carry out the steps of the method of recognizing voice over internet phone of any of claims 1-5.
7. An apparatus for recognizing voice of a network telephone, comprising a processor adapted to implement instructions; and a storage medium adapted to store a plurality of instructions adapted to be loaded by the processor and to perform the steps of the method of recognizing voice over internet phone of any of claims 1-5.
CN201911071415.7A 2019-11-05 2019-11-05 Method, storage medium and device for identifying voice of network telephone Active CN110782901B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911071415.7A CN110782901B (en) 2019-11-05 2019-11-05 Method, storage medium and device for identifying voice of network telephone


Publications (2)

Publication Number Publication Date
CN110782901A CN110782901A (en) 2020-02-11
CN110782901B true CN110782901B (en) 2021-12-24

Family

ID=69389087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911071415.7A Active CN110782901B (en) 2019-11-05 2019-11-05 Method, storage medium and device for identifying voice of network telephone

Country Status (1)

Country Link
CN (1) CN110782901B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112566170B (en) * 2020-11-25 2022-07-29 中移(杭州)信息技术有限公司 Network quality evaluation method, device, server and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106504768A (en) * 2016-10-21 2017-03-15 百度在线网络技术(北京)有限公司 Phone testing audio frequency classification method and device based on artificial intelligence
CN108766419A (en) * 2018-05-04 2018-11-06 华南理工大学 A kind of abnormal speech detection method based on deep learning
CN109102005A (en) * 2018-07-23 2018-12-28 杭州电子科技大学 Small sample deep learning method based on shallow Model knowledge migration
CN109493874A (en) * 2018-11-23 2019-03-19 东北农业大学 A kind of live pig cough sound recognition methods based on convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9978374B2 (en) * 2015-09-04 2018-05-22 Google Llc Neural networks for speaker verification


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CNN architectures for large-scale audio classification; Shawn Hershey et al.; 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 20170110; pp. 1-5 *
Audio classification method based on convolutional neural network and random forest; Fu Wei et al.; Journal of Computer Applications; 20181225; pp. 58-62 *

Also Published As

Publication number Publication date
CN110782901A (en) 2020-02-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant