CN110782901B - Method, storage medium and device for identifying voice of network telephone - Google Patents

Method, storage medium and device for identifying voice of network telephone

Info

Publication number
CN110782901B
CN110782901B (application CN201911071415.7A)
Authority
CN
China
Prior art keywords
neural network
voice
voice signal
network structure
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911071415.7A
Other languages
Chinese (zh)
Other versions
CN110782901A (en)
Inventor
黄远坤
李斌
黄继武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN201911071415.7A priority Critical patent/CN110782901B/en
Publication of CN110782901A publication Critical patent/CN110782901A/en
Application granted granted Critical
Publication of CN110782901B publication Critical patent/CN110782901B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification

Abstract

The invention provides a method, a storage medium and a device for identifying network telephone (VoIP) voice, wherein the method comprises the following steps: respectively converting the filtered voice signal into a standardized Mel scale spectrogram, a standardized inverse Mel scale spectrogram and a stacked waveform frame signal; taking the stacked waveform frame signal as network input and extracting time domain information of the voice signal through a first convolutional neural network structure; respectively taking the standardized Mel scale spectrogram and the standardized inverse Mel scale spectrogram as network inputs and extracting frequency domain information of the voice signal through a second convolutional neural network structure; and inputting the time domain information and the frequency domain information of the voice signal into a trained classification module, which outputs the classification result. The invention can effectively identify network telephone voice from a fixed source to a fixed terminal, and can also quickly and efficiently identify network telephone voice generated by an unknown source and an unknown terminal.

Description

Method, storage medium and device for identifying voice of network telephone
Technical Field
The present invention relates to the field of voice-over-internet-protocol (VoIP) speech recognition, and in particular to a method, a storage medium and an apparatus for recognizing network telephone voice.
Background
With the development of the internet and the increasing maturity of audio compression technology, people's modes of communication have diversified. The advent of voice-over-IP (VoIP) technology lets people communicate more conveniently and at lower cost. VoIP therefore attracts a large number of users and is gradually replacing traditional fixed-line and mobile telephony as one of the main modes of communication. Unlike a user of a traditional fixed or mobile telephone, a VoIP user has no fixed telephone number and needs no SIM card to place a call; the user simply dials the other party's number through specific VoIP software. Moreover, some VoIP software allows users to set their caller number at will, and the modified number is never verified. Criminals can exploit this loophole to disguise their identity by setting their own number to a specific one (such as the number of a public security bureau, a bank or a government department) through number-changing VoIP software, and then commit fraud. Identifying whether an incoming call is a network telephone call can therefore help the called party judge the caller's identity and, to some extent, avoid such fraud.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, an object of the present invention is to provide a method, a storage medium and a device for recognizing network telephone voice, intended to solve the problem that the prior art cannot efficiently recognize network telephone voice from an unknown source.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a method for recognizing voice of a network phone, comprising the steps of:
decompressing, resampling and high-pass filtering the received voice signal to obtain a filtered voice signal;
respectively converting the filtered voice signals into a standardized Mel scale spectrogram, a standardized inverse Mel scale spectrogram and a stacked waveform frame signal;
taking the stacked waveform frame signal as network input, and extracting time domain information of the voice signal through a first convolutional neural network structure;
respectively taking the standardized Mel scale spectrogram and the standardized inverse Mel scale spectrogram as network inputs, and extracting frequency domain information of the voice signal through a second convolutional neural network structure;
and inputting the time domain information and the frequency domain information of the voice signal into a trained classification module, and outputting a classification result.
The method for recognizing the voice of the network telephone is characterized in that the step of decompressing, resampling and carrying out high-pass filtering processing on the received voice signal to obtain a filtered voice signal comprises the following steps:
decompressing the received voice signal, and resampling to 8kHz and 16 bit waveform signals;
and carrying out high-pass filtering on the resampled waveform signal by adopting a second-order difference filter to obtain a filtered voice signal: x[n] = −0.5·s[n−1] + s[n] − 0.5·s[n+1], where n indexes the samples of the time-domain signal.
The method for recognizing the voice of the network telephone is characterized in that the step of respectively converting the filtered voice signal into a standardized Mel scale spectrogram, a standardized inverse Mel scale spectrogram and a stacked waveform frame signal comprises the following steps:

performing framing processing on the filtered voice signal to obtain the nth sample of the ith frame: y_i[n] = w[n] · x[(i−1)·S + n], wherein w[n] is the Hamming window function

w[n] = 0.54 − 0.46 · cos(2πn/(N−1)), 0 ≤ n ≤ N−1;

converting the framed, filtered voice signal into a standardized Mel scale spectrogram

S̄_mel[i, m] = (S_mel[i, m] − μ) / σ, with S_mel[i, m] = log( Σ_{k=0}^{N/2} E_i[k] · K_m[k] ),

wherein E_i[k] = |Y_i[k]|² is the spectral energy and Y_i[k] the N-point discrete Fourier transform of the ith frame, Mel(f) = 2595 · log10(1 + f/700) is the Mel scale frequency, f is the Hertz scale frequency, K_b is a boundary point, f_L = 0 is the lowest frequency, f_H = F_s/2 is the highest frequency, F_s is the sampling frequency, ΔF is the frequency resolution, f_k = k · ΔF is the frequency of the kth discrete Fourier transform data point, K_m (m ∈ {1, 2, …, M}) is the mth Mel-band filter, and μ and σ are the mean and standard deviation of the spectrogram;

converting the framed, filtered voice signal into a standardized inverse Mel scale spectrogram

S̄_imel[i, m] = (S_imel[i, m] − μ) / σ,

computed in the same way but with the filter bank constructed on the inverse Mel scale Mel⁻¹(m) = 700 · (10^(m/2595) − 1);

converting the framed, filtered voice signal into a stacked waveform frame signal, i.e. stacking the I frames y_i row by row into an I × N stacked waveform frame matrix Y = [y_1; y_2; …; y_I].
the method for recognizing the voice of the network telephone, wherein the step of extracting the time domain information of the voice signal through a first convolutional neural network structure by taking the heap waveform frame signal as a network input, comprises the following steps:
the method comprises the steps that a first convolution neural network structure is constructed in advance, the first convolution neural network structure comprises 6 groups of convolution modules which are sequentially connected in series from an input end to an output end, the first 5 groups of convolution modules from the input end comprise a convolution layer, a maximum pooling layer, a linear rectification function and two batch normalization layers, and the last group of convolution modules comprise a convolution layer, a batch normalization layer, a linear rectification function and a global average pooling layer;
performing network parameter training on the first convolutional neural network structure through a main classifier connected to the output end of the first convolutional neural network structure and an auxiliary classifier connected to the 4 th group of convolutional modules, wherein the total cost functions of the two classifiers are: LossA ═ α loss0+β*loss1Wherein, loss0And Loss1Respectively representing cross entropy loss functions of a main classifier and an auxiliary classifier, wherein alpha and beta are weights and satisfy that alpha + beta is 1;
and after the training of the first convolutional neural network structure is finished, removing the main classifier and the auxiliary classifier, inputting the stacked waveform frame signal into the first convolutional neural network structure, and extracting time domain information of the voice signal.
The method for recognizing the voice of the network telephone is characterized in that the step of respectively taking the standardized Mel scale spectrogram and the standardized inverse Mel scale spectrogram as network inputs and extracting the frequency domain information of the voice signal through a second convolutional neural network structure comprises the following steps:

constructing the second convolutional neural network structure in advance, the structure comprising 6 groups of convolution modules connected in series from the input end to the output end, wherein each of the first 5 groups of convolution modules from the input end comprises a convolutional layer with a two-dimensional convolution kernel, a max pooling layer, a linear rectification function and two batch normalization layers, and the last group comprises a convolutional layer, a batch normalization layer, a linear rectification function and a global average pooling layer;

performing network parameter training on the second convolutional neural network structure through a main classifier connected to its output end and an auxiliary classifier connected to the 3rd group of convolution modules, the total cost function of the two classifiers being Loss_B = γ · loss_0 + δ · loss_1, wherein loss_0 and loss_1 respectively denote the cross-entropy loss functions of the main classifier and the auxiliary classifier, and γ and δ are weights satisfying γ + δ = 1;

after the second convolutional neural network structure is trained, removing the main classifier and the auxiliary classifier, respectively taking the standardized Mel scale spectrogram and the standardized inverse Mel scale spectrogram as network inputs, and extracting the frequency domain information of the voice signal through the second convolutional neural network structure.
The method for recognizing voice over internet phone, wherein the main classifier comprises a full connection layer and a softmax function.
The method for recognizing the voice of the network telephone is characterized in that the classification module is a fully-connected neural network classifier, and the fully-connected neural network classifier comprises a first fully-connected layer, a second fully-connected layer and a softmax function.
The method for recognizing the voice of the network telephone is characterized in that the time domain information and the frequency domain information of the voice signal are input into a trained classification module, and the step of outputting the classification result comprises the following steps:
inputting the time domain information and the frequency domain information of the voice signal into the fully-connected neural network classifier, the numbers of nodes of the first and second fully-connected layers being set to 768 and 2 respectively;
if the full-connection neural network classifier outputs [0,1], judging that the voice signal is the voice of the network telephone;
and if the full-connection neural network classifier outputs [1,0], judging that the voice signal is the voice of the mobile phone.
A storage medium comprising a plurality of instructions stored thereon, the instructions adapted to be loaded by a processor and to carry out in particular the steps of the method of recognizing voice over internet phone according to the present invention.
An apparatus for recognizing voice of a network telephone, comprising a processor adapted to implement instructions; and a storage medium adapted to store a plurality of instructions adapted to be loaded by the processor and to perform the steps of the method of recognizing voice over internet phone of the present invention.
Advantageous effects: the invention provides a method for recognizing network telephone voice that uses a trained first convolutional neural network structure and a trained second convolutional neural network structure to extract multi-domain depth features of the voice signal, and uses the extracted features to identify network telephone voice. Compared with the prior art, the method can effectively identify network telephone voice from a fixed source to a fixed terminal, and can also quickly and efficiently identify network telephone voice generated by an unknown source and an unknown terminal.
Drawings
Fig. 1 is a flowchart of a method for recognizing voice of a network phone according to an embodiment of the present invention.
FIG. 2a is a spectrogram of a mobile phone call recording at a sampling rate of 48 kHz.
FIG. 2b is a spectrogram of the voice recording of the Internet phone call at a sampling rate of 16 kHz.
FIG. 2c is a diagram of a voice spectrum of a mobile phone call recording at a resampling rate of 8 kHz.
FIG. 2d is a diagram of a voice spectrogram of an Internet phone call recording at a resampling rate of 8 kHz.
Fig. 3a is a histogram of the zero-crossing rate of mobile phone call recordings before and after high-pass filtering.
Fig. 3b is a histogram of the zero-crossing rate of the internet phone call recordings before and after the high-pass filtering.
FIG. 4 is a flow chart of framing and format conversion representation of a speech signal.
Fig. 5 shows the normalized logarithmic spectrogram, the normalized logarithmic Mel scale spectrogram and the normalized logarithmic inverse Mel scale spectrogram of high-pass filtered mobile phone and network telephone call recordings.
Fig. 6a is a schematic structural diagram of a first convolutional neural network structure.
Fig. 6b is a structural diagram of a second convolutional neural network structure.
FIG. 7 is a diagram of the fully-connected neural network classifier designed according to the present invention.
Fig. 8 is a block diagram of an apparatus for recognizing voice of a network phone according to the present invention.
Fig. 9 is a schematic diagram of the detection accuracy of the speech segments with different lengths in scene 0 and scene 5.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
The existing literature proposes identifying network telephones by a network telephone fingerprint. The designed fingerprint comprises two parts: the first part consists of features such as the packet loss rate of voice frames, the correlation between packets, statistical characteristics of the noise spectrum, and the voice quality of the call; the second part is a path-traversal signature formed by feeding the features extracted in the first part into a decision tree. Since the method relies on packet-loss characteristics, packet loss may become a less effective feature as network quality improves. In addition, the method assumes a fixed and unique identification fingerprint for each call source, so it can only detect a network telephone dialed from one fixed source to another designated terminal (e.g., fixed telephone to fixed telephone) and cannot identify network telephones from unknown sources.
The prior literature also provides a method for detecting network telephone voice by identifying the codec: the voice is first re-encoded with known codecs to form a multi-dimensional feature model containing the noise spectrum and a time-domain histogram of the voice signal; the obtained feature model is then compared with the reference feature models of candidate codecs to identify which codec processed a given segment of voice, and whether the voice under test passed through a VoIP network is judged from the codec type. However, since audio data is transmitted across different telephone networks and devices, the encoded voice stream is transcoded as it passes through different types of communication networks, and the last codec used in the PSTN (public switched telephone network) may mask most traces of the codec used in the VoIP network. In addition, VoIP and PSTN networks may use the same codec, such as G.711. Therefore, identifying the codec alone is not sufficient to recognize network telephone voice.
Based on the problems in the prior art, the present embodiment provides a method for recognizing voice of a network phone, as shown in fig. 1, which includes the steps of:
decompressing, resampling and high-pass filtering the received voice signal to obtain a filtered voice signal;
respectively converting the filtered voice signals into a standardized Mel scale spectrogram, a standardized inverse Mel scale spectrogram and a stacked waveform frame signal;
taking the stacked waveform frame signal as network input, and extracting time domain information of the voice signal through a first convolutional neural network structure;
respectively taking the standardized Mel scale spectrogram and the standardized inverse Mel scale spectrogram as network inputs, and extracting frequency domain information of the voice signal through a second convolutional neural network structure;
and inputting the time domain information and the frequency domain information of the voice signal into a trained classification module, and outputting a classification result.
In the embodiment, the trained first convolutional neural network structure and the trained second convolutional neural network structure are adopted to extract the multi-domain depth features of the voice information, and the extracted features are utilized to identify the voice of the network telephone. Compared with the prior art, the method can effectively identify the network telephone voice of the fixed source fixed terminal, and can also quickly and efficiently identify the network telephone voice generated by the unknown source and the unknown terminal.
In some embodiments, the decompressing, resampling and high-pass filtering of the received speech signal to obtain a filtered speech signal includes: decompressing the received voice signal and resampling it to an 8 kHz, 16-bit waveform signal; and high-pass filtering the resampled waveform signal with a second-order difference filter to obtain the filtered voice signal: x[n] = −0.5·s[n−1] + s[n] − 0.5·s[n+1], where n indexes the samples of the time-domain signal.
Specifically, when a handset is used to record the received speech signal, different handsets sample the speech at different default rates, e.g. 16 kHz, 44.1 kHz or 48 kHz, so the captured speech contains many unneeded high-frequency components. Because the bandwidth of VoIP and PSTN networks is limited, this embodiment resamples the decompressed voice signal to an 8 kHz, 16-bit waveform signal, which retains most of the voice content while avoiding the bandwidth limitation. Figs. 2a-2d show the spectrogram of a 48 kHz mobile phone call recording, the spectrogram of a 16 kHz network telephone call recording, and their spectrograms after resampling at 8 kHz; it can be seen that the high-frequency parts of the 48 kHz and 16 kHz voice signals contain little information, while the resampled signals still retain most of the information of the original voice.
This embodiment also high-pass filters the resampled voice signal: because the human ear is insensitive to the high-frequency part of a speech signal, network telephone voice and mobile phone call voice are difficult to distinguish by ear. This embodiment therefore uses a high-pass filter to suppress the low-frequency part of the speech signal and amplify the high-frequency information. Figs. 3a-3b show the zero-crossing rates of 12000 segments of 25 ms mobile phone and network telephone call recordings before and after passing through a high-pass filter; it can be seen that the difference in high-frequency content between the two is amplified after high-pass filtering. For a signal s[n], this embodiment applies a second-order difference filter, and the high-pass filtered signal x[n] is obtained by the following formula: x[n] = −0.5·s[n−1] + s[n] − 0.5·s[n+1], where n indexes the samples of the time-domain signal.
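As a concrete illustration, the second-order difference filter above can be applied sample by sample. The sketch below is a minimal pure-Python version; the zero-padding at the signal boundaries is an assumption, since the patent does not specify boundary handling:

```python
def high_pass(s):
    """Second-order difference high-pass filter from the text:
    x[n] = -0.5*s[n-1] + s[n] - 0.5*s[n+1].
    Out-of-range neighbours are treated as zero (an assumption)."""
    n = len(s)
    x = []
    for i in range(n):
        prev = s[i - 1] if i - 1 >= 0 else 0.0
        nxt = s[i + 1] if i + 1 < n else 0.0
        x.append(-0.5 * prev + s[i] - 0.5 * nxt)
    return x

# A constant (DC) signal is suppressed, a fast alternation is amplified:
print(high_pass([1.0, 1.0, 1.0, 1.0]))    # interior samples -> 0.0
print(high_pass([1.0, -1.0, 1.0, -1.0]))  # interior samples -> ±2.0
```

This matches the stated purpose of the filter: low-frequency (slowly varying) content is attenuated while high-frequency content is emphasized.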
In some embodiments, to facilitate subsequent feature extraction by deep learning, the voice signal is, after decompression, resampling and high-pass filtering, framed and converted into three forms: a Mel scale spectrogram, an inverse Mel scale spectrogram and stacked waveform frames, as shown in Fig. 4.
Specifically, in this embodiment the frame length of an audio frame is denoted by N and the frame shift by S; the nth sample of the ith frame is then obtained by the following formula: y_i[n] = w[n] · x[(i−1)·S + n], wherein w[n] is the Hamming window function, defined by the formula

w[n] = 0.54 − 0.46 · cos(2πn/(N−1)), 0 ≤ n ≤ N−1.
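The framing step can be sketched as follows; the frame length N = 8 and frame shift S = 4 below are illustrative values only, not parameters taken from the patent:

```python
import math

def hamming(N):
    # Standard Hamming window: w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1))
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def frame_signal(x, N, S):
    """Split x into frames of length N with frame shift S and apply the
    Hamming window: y_i[n] = w[n] * x[(i-1)*S + n] (i is 1-based in the text)."""
    w = hamming(N)
    frames = []
    start = 0
    while start + N <= len(x):
        frames.append([w[n] * x[start + n] for n in range(N)])
        start += S
    return frames

frames = frame_signal([1.0] * 20, N=8, S=4)
print(len(frames), len(frames[0]))  # 4 frames of 8 samples each
```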
And then calculating a Mel scale spectrogram, an inverse Mel scale spectrogram and a stacked waveform frame signal of the voice after framing.
This embodiment uses Mel(·) to denote the transformation from the Hertz scale to the Mel scale and Mel⁻¹(·) to denote the inverse transformation. Compared to the Hertz scale frequency f, the Mel scale frequency is calculated by the following formula:

Mel(f) = 2595 · log10(1 + f/700).

The inverse Mel scale can be calculated by the following formula:

Mel⁻¹(m) = 700 · (10^(m/2595) − 1).
wherein f_L, f_H and F_s are respectively the lowest frequency, the highest frequency and the sampling frequency; in this embodiment the sampling frequency is set to 8000 Hz, the lowest frequency to 0, and the highest frequency to half the sampling frequency, i.e. 4000 Hz. N is the number of discrete Fourier transform points, equal to the number of points in a frame. For the ith frame of the speech signal, its discrete Fourier transform is first computed:

Y_i[k] = Σ_{n=0}^{N−1} y_i[n] · e^(−j·2πnk/N), k = 0, 1, …, N−1,

wherein the frequency of the kth discrete Fourier transform data point is f_k = k · ΔF, ΔF = F_s/N being the frequency resolution.
the spectral energy of each frame is calculated by the following formula: ei[k]=|Yi[k]|2Since the spectrum is symmetrical, the present embodiment selects only half of the spectrum, i.e. the spectrum is symmetrical
Figure BDA0002261069560000123
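A minimal sketch of this step, computing E_i[k] = |Y_i[k]|² for the non-redundant half of the spectrum; a direct O(N²) DFT is used here for clarity rather than an FFT:

```python
import cmath

def frame_energy_spectrum(y):
    """Spectral energy E_i[k] = |Y_i[k]|^2 for k = 0..N/2 (only half of
    the symmetric spectrum), with Y_i the DFT of one frame."""
    N = len(y)
    half = []
    for k in range(N // 2 + 1):
        Yk = sum(y[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        half.append(abs(Yk) ** 2)
    return half

# A 4-point cosine at bin k = 1 puts all its energy in E[1]:
E = frame_energy_spectrum([1.0, 0.0, -1.0, 0.0])
print(E)
```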
Next, this embodiment filters the energy spectrum with M Mel triangular filters, wherein the mth triangular filter in the filter bank is defined as:

K_m[k] = 0, for k < K_b(m−1) or k > K_b(m+1);
K_m[k] = (k − K_b(m−1)) / (K_b(m) − K_b(m−1)), for K_b(m−1) ≤ k ≤ K_b(m);
K_m[k] = (K_b(m+1) − k) / (K_b(m+1) − K_b(m)), for K_b(m) < k ≤ K_b(m+1);

wherein the boundary points are K_b(m) = round( Mel⁻¹( Mel(f_L) + m · (Mel(f_H) − Mel(f_L)) / (M+1) ) / ΔF ), f_L = 0 is the lowest frequency, f_H = F_s/2 is the highest frequency, and K_b(m) (m ∈ {1, 2, …, M}) corresponds to the center frequency of the mth Mel-band filter. This embodiment then calculates the Mel scale spectrogram by the following formula:

S_mel[i, m] = log( Σ_{k=0}^{N/2} E_i[k] · K_m[k] ).
Finally, the Mel scale spectrogram is standardized to obtain the final standardized Mel scale spectrogram:

S̄_mel[i, m] = (S_mel[i, m] − μ) / σ,

wherein μ and σ are respectively the mean and standard deviation of the Mel scale spectrogram.
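The Mel filter bank construction described above can be sketched as follows. The exact rounding of the boundary points K_b is an assumption (the original equation images are not recoverable), and the parameter values M = 10, N = 64 are illustrative:

```python
import math

def hz_to_mel(f):
    # Mel(f) = 2595*log10(1 + f/700)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Mel^-1(m) = 700*(10^(m/2595) - 1)
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(M, N, fs):
    """M triangular filters over DFT bins 0..N//2, with boundary points
    spaced uniformly on the Mel scale between f_L = 0 and f_H = fs/2."""
    f_lo, f_hi = 0.0, fs / 2.0
    mels = [hz_to_mel(f_lo) + (hz_to_mel(f_hi) - hz_to_mel(f_lo)) * i / (M + 1)
            for i in range(M + 2)]
    kb = [int(round(mel_to_hz(m) / (fs / N))) for m in mels]  # boundary bins K_b
    filters = []
    for m in range(1, M + 1):
        filt = [0.0] * (N // 2 + 1)
        for k in range(kb[m - 1], kb[m]):          # rising slope
            filt[k] = (k - kb[m - 1]) / max(1, kb[m] - kb[m - 1])
        for k in range(kb[m], min(kb[m + 1], N // 2) + 1):  # falling slope
            filt[k] = (kb[m + 1] - k) / max(1, kb[m + 1] - kb[m])
        filters.append(filt)
    return filters

fb = mel_filterbank(M=10, N=64, fs=8000)
print(len(fb), len(fb[0]))  # 10 filters over 33 frequency bins
```

Applying each filter to a frame's energy spectrum and taking the logarithm of the filtered sums yields one column of the Mel scale spectrogram.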
in some embodiments, rather than calculating a mel scale spectrogram, the present implementation filters the energy spectrum using an inverse mel-triangle filter bank in a spherical inverse mel scale spectrogram, wherein the mth inverse mel filter is defined as:
Figure BDA0002261069560000131
Figure BDA0002261069560000132
calculating an inverse mel scale spectrum by the following formula:
Figure BDA0002261069560000133
and finally, carrying out standardization processing on the anti-Mel scale spectrogram to obtain a final standardized anti-Mel scale spectrogram:
Figure BDA0002261069560000134
wherein the content of the first and second substances,
Figure BDA0002261069560000135
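The standardization step, shared by the Mel and inverse Mel spectrograms, is a plain z-score. Whether the mean and standard deviation are computed over the whole spectrogram or per band is not stated in the text, so a global computation is assumed here:

```python
import math

def standardize(spec):
    """Z-score standardization of a spectrogram (list of rows):
    subtract the mean and divide by the standard deviation, both
    computed over the whole spectrogram (an assumption)."""
    flat = [v for row in spec for v in row]
    mu = sum(flat) / len(flat)
    sd = math.sqrt(sum((v - mu) ** 2 for v in flat) / len(flat))
    return [[(v - mu) / sd for v in row] for row in spec]

norm = standardize([[1.0, 2.0], [3.0, 4.0]])
print(norm)  # zero-mean, unit-variance values
```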
Fig. 5 shows normalized spectrograms of different scales for the high-pass filtered signals; it can be seen that the energy of the high-frequency part of a network telephone recording is weaker than that of a mobile phone call recording. Compared with the Hertz scale spectrogram, the Mel scale and inverse Mel scale spectrograms enlarge the difference between mobile phone call recordings and network telephone call recordings while reducing the differences among different network telephone call recordings.
In some embodiments, compared with the Mel scale spectrogram and the inverse Mel scale spectrogram, the stacked waveform frame retains the phase information of the speech signal, so using stacked waveform frames as input to the deep network captures information that cannot be captured in a spectrogram. The I frames y_i are stacked row by row into an I × N dimensional stacked waveform frame matrix Y = [y_1; y_2; …; y_I].
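A minimal sketch of building the stacked waveform frame matrix; whether the stacked frames are windowed or raw is not fully specified in the text, so raw frames are assumed:

```python
def stacked_waveform_frames(x, N, S):
    """I x N stacked waveform frame matrix: row i (0-based) is
    x[i*S : i*S + N]. Unlike a magnitude spectrogram, this keeps
    the phase information of the time-domain signal."""
    I = (len(x) - N) // S + 1
    return [x[i * S : i * S + N] for i in range(I)]

m = stacked_waveform_frames(list(range(20)), N=8, S=4)
print(len(m), len(m[0]))  # 4 rows of 8 samples -> a 4 x 8 matrix
```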
In some embodiments, two different convolutional neural network structures are designed: the time domain information and the frequency domain information of the speech signal are extracted through the first and the second convolutional neural network structure respectively, and the extracted depth features finally form a 2304-dimensional vector.
In some embodiments, the step of extracting time domain information of the speech signal by the first convolutional neural network structure with the stacked waveform frame signal as a network input comprises: as shown in fig. 6a, a first convolutional neural network structure is pre-constructed, the first convolutional neural network structure includes 6 groups of convolutional modules connected in series in sequence from an input end to an output end, the first 5 groups of convolutional modules from the input end each include a convolutional layer, a max pooling layer, a linear rectification function (ReLU), and two batch normalization layers (BN), and the last group of convolutional modules includes a convolutional layer, a batch normalization layer, a linear rectification function, and a global average pooling layer; in the embodiment, the first 3 groups of convolution modules are designed to be one-dimensional convolution for extracting the intra-frame features of each frame of the stacked waveform frame, the 4 th convolution module is designed to be two-dimensional convolution for extracting the intra-frame features and fusing the inter-frame extracted information, and the last two convolution modules are also designed to be one-dimensional convolution for fusing the inter-frame features.
In this embodiment, a main classifier composed of a fully-connected layer and a softmax function is connected to the output end of the first convolutional neural network, and an auxiliary classifier is connected to the output end of the 4th convolution module; network parameter training is performed on the first convolutional neural network structure through the main classifier and the auxiliary classifier, the total cost function of the two classifiers being Loss_A = α · loss_0 + β · loss_1, wherein loss_0 and loss_1 respectively denote the cross-entropy loss functions of the main classifier and the auxiliary classifier, and α and β are weights satisfying α + β = 1. After training of the first convolutional neural network structure is finished, the main classifier and the auxiliary classifier are removed, the stacked waveform frame signal is input into the first convolutional neural network structure (SWF-net), and the time domain information of the voice signal is extracted.
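The joint cost Loss_A = α·loss_0 + β·loss_1 can be illustrated numerically. The weights 0.7/0.3 below are arbitrary examples satisfying α + β = 1, not values taken from the patent:

```python
import math

def cross_entropy(probs, label):
    # Cross-entropy loss for one sample given predicted class probabilities
    return -math.log(probs[label])

def total_loss(main_probs, aux_probs, label, alpha=0.7, beta=0.3):
    """Loss_A = alpha*loss_0 + beta*loss_1 with alpha + beta = 1.
    loss_0 comes from the main classifier, loss_1 from the auxiliary one;
    the 0.7/0.3 split is illustrative only."""
    assert abs(alpha + beta - 1.0) < 1e-9
    return alpha * cross_entropy(main_probs, label) + beta * cross_entropy(aux_probs, label)

v = total_loss([0.9, 0.1], [0.8, 0.2], label=0)
print(v)  # a weighted combination of the two cross-entropy terms
```

The auxiliary classifier only shapes the gradients during training; both classifiers are discarded afterwards, as the text describes.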
In some embodiments, the step of taking the Mel scale spectrogram and the inverse Mel scale spectrogram respectively as network inputs and extracting the frequency domain information of the speech signal through a second convolutional neural network structure comprises: as shown in fig. 6b, a second convolutional neural network structure is pre-constructed, comprising 6 groups of convolution modules connected in series from the input end to the output end, wherein each of the first 5 groups of convolution modules from the input end comprises a convolutional layer, a max pooling layer, a linear rectification function and two batch normalization layers, and the last group comprises a convolutional layer, a batch normalization layer, a linear rectification function and a global average pooling layer. In this embodiment the second convolutional neural network structure differs from the first in two respects. The first difference is the dimensionality of the kernel functions in the convolution modules: since two adjacent data points of a spectrogram are correlated, this embodiment directly uses two-dimensional convolution kernels in the first 5 convolution modules of the second convolutional neural network structure to extract features, and designs specific pooling sizes and convolution strides so that the second convolutional neural network structure has the same output dimension as the first. The second difference is the location of the auxiliary classifier, which in this embodiment is connected at the output of the 3rd convolution module.
Network parameter training is performed on the second convolutional neural network structure through the main classifier and the auxiliary classifier, where the total cost function of the two classifiers is: LossB = γ*loss0 + δ*loss1, where loss0 and loss1 respectively represent the cross entropy loss functions of the main classifier and the auxiliary classifier, and γ and δ are weights satisfying γ + δ = 1. After the training of the second convolutional neural network structure is finished, the main classifier and the auxiliary classifier are removed, and the corresponding convolutional neural network models are named MS-net and IMS-net; the standardized Mel scale spectrogram and the standardized inverse Mel scale spectrogram are then respectively taken as network inputs, and the frequency domain information of the voice signal is extracted through the second convolutional neural network structure.
In some embodiments, after the depth features are extracted, a classification module is used to fuse the extracted features and make the final decision. Many classifiers can serve as the classification module at this stage; this embodiment uses a fully connected neural network classifier. As shown in fig. 7, the fully connected neural network classifier consists of two fully connected layers and one softmax function, where the numbers of nodes of the first and second fully connected layers are 768 and 2, respectively. The extracted 2304-dimensional feature vector is input into the trained fully connected neural network classifier, which outputs [0,1] or [1,0]; in this embodiment, the output [0,1] indicates that the classifier determines the voice signal to be detected to be a VoIP internet phone call recording, and the output [1,0] indicates that the classifier determines it to be a mobile phone call recording.
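A minimal sketch of this decision stage follows; the hidden-layer activation between the two fully connected layers is not specified in the text, so a linear rectification (ReLU) is assumed, and the weight matrices are placeholders standing in for trained parameters:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def fc_classifier(features, W1, b1, W2, b2):
    """Two fully connected layers (2304 -> 768 -> 2 in the embodiment) plus softmax.
    Returns a one-hot decision: [0, 1] -> VoIP call recording, [1, 0] -> mobile call recording."""
    h = np.maximum(0.0, W1 @ features + b1)  # first FC layer (ReLU assumed, not stated)
    probs = softmax(W2 @ h + b2)             # second FC layer + softmax
    decision = np.zeros(2, dtype=int)
    decision[int(np.argmax(probs))] = 1
    return decision
```

The one-hot output simply reports which of the two softmax probabilities is larger, matching the [0,1] / [1,0] convention described above.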
In some embodiments, a storage medium is also provided, in which a plurality of instructions are stored, the instructions being adapted to be loaded by a processor to perform the steps of the method for recognizing the voice of a network telephone according to the present invention.
In some embodiments, there is also provided an apparatus for recognizing a network voice signal, wherein, as shown in fig. 8, the apparatus comprises a processor 10 adapted to implement instructions; and a storage medium 20 adapted to store a plurality of instructions adapted to be loaded by the processor 10 and to carry out the steps of the method of recognizing voice over internet phone.
The recognition performance of the present invention on network voice signals is tested on the existing VPCID database as follows:
The present embodiment sets the frame length N to 240, the frame shift S to 120, the number M of triangular filters to 48, and the frame number L to 132, obtaining a 48 × 132 Mel scale spectrum, a 48 × 132 inverse Mel scale spectrum, and a 240 × 132 stacked waveform frame. As evaluation indexes, the present embodiment selects the true positive rate, the true negative rate, and the accuracy rate, where internet phone voice is the positive sample and mobile phone call voice is the negative sample.
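The stated dimensions can be sanity-checked with the standard framing formula, assuming 2-second segments at the 8 kHz resampling rate used by the method:

```python
N, S, M = 240, 120, 48      # frame length, frame shift, number of triangular filters
Fs = 8000                   # sampling rate after resampling (8 kHz)
samples = 2 * Fs            # one 2-second speech segment
L = (samples - N) // S + 1  # number of frames under these settings
assert L == 132             # matches the 48x132 spectrograms and the 240x132 stacked frames
```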
1. Verifying the effect brought by the auxiliary classifier:
Table 1 shows the detection performance on 2-second-long speech segments when classifiers with different weights are used in SWF-net, and Table 2 shows the detection performance on 2-second-long speech segments when classifiers with different weights are used in MS-net and IMS-net.
TABLE 1 Detection performance of classifiers in SWF-net using different weights
[Table 1 is presented as an image in the original publication]
TABLE 2 detection Performance of classifiers in MS-net and IMS-net using different weights
[Table 2 is presented as an image in the original publication]
As can be seen from Tables 1-2, the detection performance with the auxiliary classifier is better than that without the auxiliary classifier, and the weight of the auxiliary classifier leaves room for further optimization.
2. Overall effect of using different classification modules:
The classification module of the present invention can be composed of different classifiers or even decision methods, and good performance can be achieved with any of them. Table 3 shows the detection performance on 2-second-long speech segments when using a universal background model with Gaussian mixture models (UBM-GMM), an FLD ensemble classifier (FLD Ensemble), a majority voting mechanism (Majority Voting), and a fully connected neural network classifier (NN-based classifier) as the classification module.
TABLE 3 detection Performance Using different classifiers
[Table 3 is presented as an image in the original publication]
As can be seen from the table, very good detection performance can be achieved using different classification methods, and the fully connected neural network classifier designed by the present invention performs better than the existing classifiers and decision methods.
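One of the alternative classification modules in Table 3, the majority voting mechanism, reduces to a very simple decision rule; a sketch follows (the per-classifier binary labels are hypothetical inputs, with 1 standing for a VoIP decision and 0 for a mobile decision):

```python
from collections import Counter

def majority_vote(labels):
    """Return the label predicted by most of the base classifiers."""
    return Counter(labels).most_common(1)[0][0]
```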
3. Detecting the effect of different source voice signals:
Since the present invention starts from actual call recordings, the algorithm design process does not depend on data from any particular source, so the present invention can effectively detect call voices from different sources. To verify this, the present invention sets up different scenarios; Table 4 shows the degree of source mismatch between the training data and the test data in each scenario, where a flagged factor indicates that the training data and the test data are of different origins in that factor. Table 5 shows the detection performance of the present invention on 2-second-long speech segments in the different scenarios. As can be seen from Table 5, the method proposed by the present invention has good detection performance for voices from many different sources.
TABLE 4 mismatch factors in various detection scenarios
[Table 4 is presented as an image in the original publication]
TABLE 5 detection Performance of the present invention under various mismatch factor scenarios
[Table 5 is presented as an image in the original publication]
4. Detecting the effect of voice segments with different lengths:
The algorithm designed by the present invention extracts the intra-frame features of short-time audio frames and fuses the inter-frame features during feature extraction, so the present invention can be used to detect speech segments of different lengths. Fig. 9 shows the detection accuracy of the present invention for speech segments of 1, 2, 4, 6, and 8 seconds in scene 0 and scene 5. As can be seen from the figure, the detection performance of the present invention essentially saturates for speech segments longer than 6 seconds, and segments shorter than 6 seconds also achieve good detection accuracy.
In summary, based on the statistical characteristics of voice signals, the present invention finds that the energy of network telephone voice signals in the high-frequency part is weaker than that of mobile telephone voice signals, so for the problem of detecting network telephone voice signals, the present invention first performs high-pass filtering on the voice signal. By analyzing spectrograms of different scales, the present invention finds that, compared with a linear scale spectrogram, a nonlinear scale spectrogram can amplify the differences between different types of call recording signals and reduce the differences within the same type of voice signals, so the Mel scale spectrogram and the inverse Mel scale spectrogram are used as network inputs in the algorithm design. Compared with the prior art, the present invention abandons manually designed features and uses deep learning to automatically extract features from the network inputs: the first convolutional neural network structure designed by the present invention extracts intra-frame features first and then integrates inter-frame features to extract deep features, while the designed second convolutional neural network structure directly extracts frequency domain features from adjacent data points of the spectrogram. The information extracted by the three trained subnetworks can be effectively fused together; by combining these techniques, the present invention can effectively detect network telephone call recordings from the same source and robustly detect network telephone call recordings from different sources.
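The high-pass filtering step summarized above (the second-order differential filter given in claim 2) can be sketched as follows; boundary handling is not specified in the text, so only interior samples are returned here:

```python
import numpy as np

def highpass(s):
    """Second-order differential high-pass filter: x[n] = -0.5*s[n-1] + s[n] - 0.5*s[n+1]."""
    s = np.asarray(s, dtype=float)
    # vectorized over interior samples n = 1 .. len(s) - 2
    return -0.5 * s[:-2] + s[1:-1] - 0.5 * s[2:]
```

A constant signal is mapped to zero, confirming that the filter suppresses the low-frequency content and emphasizes the high-frequency part where the two recording types differ most.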
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (7)

1. A method for recognizing voice of a network telephone, comprising the steps of:
decompressing, resampling and high-pass filtering the received voice signal to obtain a filtered voice signal;
respectively converting the filtered voice signals into a standardized Mel scale spectrogram, a standardized inverse Mel scale spectrogram and a stacked waveform frame signal;
taking the stacked waveform frame signal as the network input, and extracting time domain information of a voice signal through a first convolutional neural network structure;
respectively taking the standardized Mel scale spectrogram and the standardized inverse Mel scale spectrogram as network inputs, and extracting frequency domain information of the voice signal through a second convolutional neural network structure;
inputting the time domain information and the frequency domain information of the voice signal into a trained classification module, and outputting a classification result;
the step of converting the filtered voice signal into a standardized Mel scale spectrogram, a standardized inverse Mel scale spectrogram and a stacked waveform frame signal respectively comprises:
performing framing processing on the filtered voice signal to obtain the nth sample of the ith frame: yi[n] = w[n]*x[(i-1)*S+n], wherein,
[Equation presented as an image in the original publication]
converting the filtered voice signal into a standardized Mel scale spectrogram:
[Equation presented as an image in the original publication]
wherein,
[Equations presented as images in the original publication]
is the Mel scale frequency, f is the Hertz scale frequency, Kb is a boundary point, fL = 0 is the lowest frequency, fH = Fs/2 is the highest frequency, Fs is the sampling frequency, ΔF is the frequency resolution, Fk is the frequency of the kth discrete Fourier transform data point, and Km (m ∈ {1, 2, …, M}) is the mth Mel-band filter;
converting the framed filtered voice signal into a standardized inverse Mel scale spectrogram:
[Equation presented as an image in the original publication]
wherein,
[Equations presented as images in the original publication]
is the inverse Mel scale frequency;
converting the framed filtered voice signal into a stacked waveform frame signal:
[Equation presented as an image in the original publication]
the step of taking the stacked waveform frame signal as the network input and extracting the time domain information of the voice signal through a first convolutional neural network structure comprises:
the method comprises the steps that a first convolution neural network structure is constructed in advance, the first convolution neural network structure comprises 6 groups of convolution modules which are sequentially connected in series from an input end to an output end, the first 5 groups of convolution modules from the input end comprise a convolution layer, a maximum pooling layer, a linear rectification function and two batch normalization layers, and the last group of convolution modules comprise a convolution layer, a batch normalization layer, a linear rectification function and a global average pooling layer;
performing network parameter training on the first convolutional neural network structure through a main classifier connected to the output end of the first convolutional neural network structure and an auxiliary classifier connected to the fourth group of convolution modules, wherein the total cost function of the two classifiers is: LossA = α*loss0 + β*loss1, wherein loss0 and loss1 respectively represent the cross entropy loss functions of the main classifier and the auxiliary classifier, α and β are weights, and α + β = 1;
after the training of the first convolutional neural network structure is finished, removing the main classifier and the auxiliary classifier, inputting the stacked waveform frame signal into the first convolutional neural network structure, and extracting time domain information of a voice signal;
the step of extracting the frequency domain information of the voice signal through a second convolutional neural network structure by respectively taking the normalized Mel scale spectrogram and the normalized inverse Mel scale spectrogram as network inputs comprises:
a second convolutional neural network structure is constructed in advance, the second convolutional neural network structure comprises 6 groups of convolutional modules which are sequentially connected in series from an input end to an output end, the first 5 groups of convolutional modules from the input end comprise a convolutional layer, a maximum pooling layer, a two-dimensional convolutional kernel function and two batch normalization layers, and the last group of convolutional modules comprise a convolutional layer, a batch normalization layer, a linear rectification function and a global average pooling layer;
performing network parameter training on the second convolutional neural network structure through a main classifier connected to the output end of the second convolutional neural network structure and an auxiliary classifier connected to the third group of convolution modules, wherein the total cost function of the two classifiers is: LossB = γ*loss0 + δ*loss1, wherein loss0 and loss1 respectively represent the cross entropy loss functions of the main classifier and the auxiliary classifier, γ and δ are weights, and γ + δ = 1;
and after the second convolutional neural network structure is trained, removing the main classifier and the auxiliary classifier, respectively taking the standardized Mel scale spectrogram and the standardized reverse Mel scale spectrogram as network inputs, and extracting frequency domain information of the voice signal through the second convolutional neural network structure.
2. The method of claim 1, wherein the step of decompressing, resampling and high-pass filtering the received voice signal to obtain a filtered voice signal comprises:
decompressing the received voice signal, and resampling it to an 8 kHz, 16-bit waveform signal;
and carrying out high-pass filtering on the resampled waveform signal by adopting a second-order differential filter to obtain a filtered voice signal: x[n] = -0.5*s[n-1] + s[n] - 0.5*s[n+1], where n represents a sample point in the time domain signal.
3. The method of claim 1, wherein the primary classifier comprises a full connectivity layer and a softmax function.
4. The method for recognizing voice over internet phone of claim 1, wherein the classification module is a fully-connected neural network classifier comprising a first fully-connected layer, a second fully-connected layer and a softmax function.
5. The method of claim 4, wherein the time domain information and the frequency domain information of the voice signal are input into a trained classification module, and the step of outputting the classification result comprises:
inputting the time domain information and the frequency domain information of the voice signal into the fully-connected neural network classifier, and respectively setting the node numbers of the first layer of fully-connected layer and the second layer of fully-connected layer to 768 and 2;
if the full-connection neural network classifier outputs [0,1], judging that the voice signal is the voice of the network telephone;
and if the full-connection neural network classifier outputs [1,0], judging that the voice signal is the voice of the mobile phone.
6. A storage medium comprising a plurality of instructions stored thereon, the instructions adapted to be loaded by a processor and to carry out the steps of the method of recognizing voice over internet phone of any of claims 1-5.
7. An apparatus for recognizing voice of a network telephone, comprising a processor adapted to implement instructions; and a storage medium adapted to store a plurality of instructions adapted to be loaded by the processor and to perform the steps of the method of recognizing voice over internet phone of any of claims 1-5.
CN201911071415.7A 2019-11-05 2019-11-05 Method, storage medium and device for identifying voice of network telephone Active CN110782901B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911071415.7A CN110782901B (en) 2019-11-05 2019-11-05 Method, storage medium and device for identifying voice of network telephone


Publications (2)

Publication Number Publication Date
CN110782901A CN110782901A (en) 2020-02-11
CN110782901B true CN110782901B (en) 2021-12-24

Family

ID=69389087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911071415.7A Active CN110782901B (en) 2019-11-05 2019-11-05 Method, storage medium and device for identifying voice of network telephone

Country Status (1)

Country Link
CN (1) CN110782901B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112566170B (en) * 2020-11-25 2022-07-29 中移(杭州)信息技术有限公司 Network quality evaluation method, device, server and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106504768A (en) * 2016-10-21 2017-03-15 百度在线网络技术(北京)有限公司 Phone testing audio frequency classification method and device based on artificial intelligence
CN108766419A (en) * 2018-05-04 2018-11-06 华南理工大学 A kind of abnormal speech detection method based on deep learning
CN109102005A (en) * 2018-07-23 2018-12-28 杭州电子科技大学 Small sample deep learning method based on shallow Model knowledge migration
CN109493874A (en) * 2018-11-23 2019-03-19 东北农业大学 A kind of live pig cough sound recognition methods based on convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9978374B2 (en) * 2015-09-04 2018-05-22 Google Llc Neural networks for speaker verification


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CNN architectures for large-scale audio classification; Shawn Hershey et al.; 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 20170110; pp. 1-5 *
Audio classification method based on convolutional neural network and random forest; Fu Wei et al.; Journal of Computer Applications; 20181225; pp. 58-62 *

Also Published As

Publication number Publication date
CN110782901A (en) 2020-02-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant