CN112738338A - Telephone recognition method, device, equipment and medium based on deep learning - Google Patents

Telephone recognition method, device, equipment and medium based on deep learning

Info

Publication number
CN112738338A
CN112738338A (application CN202011564958.5A; granted as CN112738338B)
Authority
CN
China
Prior art keywords
call voice
voice signal
call
network
spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011564958.5A
Other languages
Chinese (zh)
Other versions
CN112738338B (en)
Inventor
凌波
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011564958.5A priority Critical patent/CN112738338B/en
Publication of CN112738338A publication Critical patent/CN112738338A/en
Application granted granted Critical
Publication of CN112738338B publication Critical patent/CN112738338B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 3/00 Automatic or semi-automatic exchanges
    • H04M 3/22 Arrangements for supervision, monitoring or testing
    • H04M 3/2281 Call monitoring, e.g. for law enforcement purposes; Call tracing; Detection or prevention of malicious calls
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F 17/141 Discrete Fourier transforms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24133 Distances to prototypes
    • G06F 18/24137 Distances to cluster centroids
    • G06F 18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/12 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Mathematics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Technology Law (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Discrete Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and discloses a telephone recognition method based on deep learning, which comprises the following steps: collecting a call voice signal of a client; extracting features of the call voice signal; and inputting the features of the call voice signal into a voice classification model to obtain the classification of the call voice signal, wherein the classification comprises normal calls, harassing calls and fraudulent calls. The invention also relates to a telephone recognition device, an electronic device and a medium based on deep learning. The invention improves the accuracy of capturing effective features and improves the recognition rate of harassing calls and fraudulent calls.

Description

Telephone recognition method, device, equipment and medium based on deep learning
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a telephone identification method and device based on deep learning, electronic equipment and a computer readable storage medium.
Background
With the rapid development of mobile communication services worldwide, users enjoy the convenience of the mobile network but also face problems: a large proportion of users frequently suffer harassment and fraud from malicious or illegal calls and messages, which has become a significant problem for the industry.
To prevent harassment and fraud over mobile phones, the industry currently relies on manual labeling: when a recipient marks an incoming call as a harassing call, the calling number is recorded in a cloud database as a potentially risky number, and once a number has been marked more than a certain number of times it is treated as a blacklisted number. In recent years there have also been cases of identifying suspicious numbers by combining processing methods such as Bayesian algorithms with data mining techniques that exploit relationships in the data.
At present, the identification and interception of harassing and fraudulent calls is still at the research stage, and analysis of existing identification and interception techniques reveals several defects: 1) a large amount of manual marking is needed to label the numbers stored in the database; 2) the recognition rate of traditional anti-harassment techniques for harassing calls is low, and voice calls with possible fraudulent intent cannot be effectively recognized; and 3) the value of large-scale voice data for anti-harassment and anti-fraud recognition is not exploited.
Disclosure of Invention
The invention provides a telephone recognition method, an apparatus, an electronic device and a computer-readable storage medium based on deep learning, with the main aim of improving the accuracy of capturing effective features, giving voice recognition better accuracy, and improving the recognition rate of harassing and fraudulent calls.
In order to achieve the above object, the present invention provides a phone identification method based on deep learning, which includes:
collecting a call voice signal of a client;
extracting the characteristics of the call voice signal;
inputting the features of the call voice signal into a voice classification model to obtain the classification of the call voice signal, wherein the classification comprises normal calls, harassing calls and fraudulent calls;
wherein the step of extracting the feature of the call voice signal comprises:
extracting PLP features in the call voice signal by using openSMILE;
calling the config file corresponding to the PLP features with a script to generate PLP feature data corresponding to the call voice signal;
and performing feature re-extraction on the PLP feature data by using a Faster RCNN network to obtain the features of the call voice signal.
Optionally, the step of extracting PLP features in the call voice signal by using openSMILE includes:
after sampling, windowing and discrete Fourier transform of the call voice signal, taking the sum of the squares of the real and imaginary parts of the short-time speech spectrum to obtain a short-time power spectrum,

P(f) = Re[X(f)]^2 + Im[X(f)]^2

where X(f) is the short-time spectrum of the call voice signal, f is the frequency axis of the short-time spectrum, Re[X(f)]^2 is the square of the real part of the short-time spectrum, Im[X(f)]^2 is the square of the imaginary part, and P(f) is the short-time power spectrum of the call voice signal;
performing critical-band analysis on the short-time power spectrum of the call voice signal to obtain a plurality of critical-bandwidth auditory spectra θ(k) of the call voice signal;
performing equal-loudness pre-emphasis on the plurality of critical-bandwidth auditory spectra θ(k) by the following formula,

Γ(k) = E[f0(k)]·θ(k), (k = 1, 2, ..., 17)

where Γ(k) is the auditory spectrum after equal-loudness pre-emphasis, f0(k) denotes the center frequency of the k-th critical-bandwidth auditory spectrum, and E[f0(k)] is the value at f0(k) of the equal-loudness curve, which is obtained by the following formula (with ω = 2πf the angular frequency):

E(f) = [(ω^2 + 56.8×10^6)·ω^4] / [(ω^2 + 6.3×10^6)^2·(ω^2 + 0.38×10^9)]
performing intensity-loudness conversion on the plurality of critical-bandwidth auditory spectra after equal-loudness pre-emphasis by

φ(k) = Γ(k)^0.33

where φ(k) is the plurality of critical-bandwidth auditory spectra after intensity-loudness conversion;

and performing an inverse Fourier transform on the plurality of critical-bandwidth auditory spectra φ(k) after intensity-loudness conversion to obtain the inverse-Fourier-transformed call voice signal, computing an all-pole model, and solving for the cepstral coefficients of the call voice signal to obtain the PLP features.
Optionally, the step of performing critical band analysis on the short-time power spectrum of the call voice signal includes:
performing critical-band analysis on the short-time power spectrum of the call voice signal by the following formula,

Z(f) = 6·ln{ f/600 + [(f/600)^2 + 1]^(1/2) }

where Z(f) is the Bark-domain frequency;

mapping the frequency axis f of the spectrum P(f) to the Bark frequency Z to obtain 17 frequency bands, and multiplying the energy spectrum of each frequency band by a weighting coefficient and summing to obtain the critical-bandwidth auditory spectrum θ(k),

θ(k) = Σ_Z ψ(Z - Z0(k))·P(f(Z))

ψ(Z) = 0 for Z < -1.3; ψ(Z) = 10^(2.5(Z+0.5)) for -1.3 ≤ Z ≤ -0.5; ψ(Z) = 1 for -0.5 < Z < 0.5; ψ(Z) = 10^(-(Z-0.5)) for 0.5 ≤ Z ≤ 2.5; ψ(Z) = 0 for Z > 2.5

where Z0(k) represents the center frequency of the k-th critical-bandwidth auditory spectrum, ψ(Z - Z0(k)) the weighting coefficient corresponding to each frequency band, and P(f(Z)) the energy spectrum corresponding to each frequency band.
Optionally,
the construction steps of the Faster RCNN network comprise:
constructing the Faster RCNN network from a convolutional layer, an RPN (Region Proposal Network), an RoI pooling layer and a fully-connected layer;
extracting a feature map of the voice features through the convolutional layer;
generating candidate regions through the RPN;
judging the type of the anchor boxes by using softmax, and obtaining candidate regions by correcting the anchor boxes;
collecting, in the RoI pooling layer, the feature map extracted by the convolutional layer and the candidate regions generated by the RPN, and extracting a plurality of candidate feature maps;
and synthesizing the plurality of candidate feature maps through the fully-connected layer.
Optionally, the speech classification model is a Transformer network.
Optionally, the step of constructing the Transformer network comprises:
Constructing a Transformer network through an encoder and a decoder;
coding the characteristics of the call voice signal extracted by the Faster RCNN network through the coder to obtain a context semantic vector;
and performing data decoding on the obtained context semantic vector through the decoder, and obtaining classification categories through a layer of softmax.
Optionally, the method further comprises: combining the Faster RCNN and the Transformer network into a voice category recognition network, and uploading the voice category recognition network to the cloud.
In order to solve the above problem, the present invention further provides a phone recognition apparatus based on deep learning, the apparatus comprising:
the acquisition module is used for acquiring a call voice signal of the client;
the feature extraction module is used for extracting the features of the call voice signals collected by the collection module;
the classification module is used for constructing a voice classification model, inputting the features of the call voice signals extracted by the feature extraction module into the voice classification model, and obtaining the classification of the call voice signals, wherein the classification comprises normal calls, harassing calls and fraudulent calls;
wherein the feature extraction module comprises:
the first feature extraction submodule extracts PLP features in the call voice signal by using openSMILE;
the characteristic data generation submodule is used for calling a config file corresponding to the PLP characteristic extracted by the first characteristic extraction submodule by using a script to generate PLP characteristic data corresponding to the call voice signal;
and the second feature extraction submodule performs feature re-extraction on the PLP feature data generated by the feature data generation submodule by using the Faster RCNN network.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one instruction; and
and the processor executes the instructions stored in the memory to realize the telephone identification method based on deep learning.
In order to solve the above problem, the present invention further provides a computer-readable storage medium, which stores at least one instruction, where the at least one instruction is executed by a processor in an electronic device to implement the deep learning based phone recognition method described above.
The telephone recognition method, apparatus, electronic device and computer-readable storage medium based on deep learning extract PLP features to generate PLP data and then use the Faster RCNN network for secondary feature extraction; the secondary extraction discards useless feature data, improving the accuracy of capturing effective features and giving voice recognition better accuracy.
Drawings
Fig. 1 is a schematic flowchart of a deep learning-based phone recognition method according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating a method for extracting features of a call voice signal according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a method for extracting PLP features from a call voice signal by using openSMILE according to an embodiment of the present invention;
fig. 4 is a flowchart illustrating a method for performing critical band analysis on a short-time power spectrum of a call voice signal according to an embodiment of the present invention;
FIG. 5 is a block diagram of a deep learning based phone identification apparatus according to an embodiment of the present invention;
fig. 6 is a schematic internal structural diagram of an electronic device implementing a deep learning-based phone recognition method according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a telephone identification method based on deep learning. Referring to fig. 1, a flowchart of a deep learning-based phone recognition method according to an embodiment of the present invention is shown. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the phone recognition method based on deep learning includes:
step S100: collecting a call voice signal of a client;
step S200: extracting the features of the call voice signal, as shown in fig. 2, which includes:
step S210, extracting PLP (Perceptual Linear Prediction) features from the call voice signal by using openSMILE;
step S220, calling the config file corresponding to the PLP features with a script to generate PLP feature data corresponding to the call voice signal, preferably storing the data in a CSV file (see the sketch after step S300 below);
and step S230, performing feature re-extraction on the PLP feature data by using the Faster RCNN network.
Step S300: inputting the features of the call voice signal into a voice classification model and obtaining the classification of the call voice signal, wherein the classification comprises normal calls, harassing calls and fraudulent calls.
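Before the details of each step, the following is a minimal sketch of step S220 under stated assumptions: the SMILExtract command-line tool shipped with openSMILE is driven from a Python script to write PLP features to a CSV file. The config file path "config/PLP_E_D_A.conf" is illustrative; substitute the PLP config actually shipped with the openSMILE installation in use.

```python
import subprocess

def extract_plp(wav_path: str, csv_path: str,
                config: str = "config/PLP_E_D_A.conf") -> None:
    """Run openSMILE's SMILExtract on one call recording and write CSV."""
    subprocess.run(
        ["SMILExtract", "-C", config, "-I", wav_path, "-O", csv_path],
        check=True,
    )

if __name__ == "__main__":
    extract_plp("call_0001.wav", "call_0001_plp.csv")
```

Storing the output as CSV keeps the PLP feature data directly loadable by the re-extraction step S230.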
In one embodiment, as shown in fig. 3, step S210 includes: obtaining the spectrum of the voice signal through a Fourier transform, squaring the amplitude, performing critical-band integration, equal-loudness pre-emphasis and intensity-loudness conversion, and then performing an inverse Fourier transform and linear prediction to obtain the PLP features. Specifically:
step S211, performing spectrum analysis on the call voice signal, that is, performing a Fourier transform on the call voice signal to obtain its spectrum and then squaring the amplitude. Specifically: after sampling, windowing and discrete Fourier transform, the sum of the squares of the real and imaginary parts of the short-time speech spectrum gives the short-time power spectrum,

P(f) = Re[X(f)]^2 + Im[X(f)]^2    (1)

where X(f) is the short-time spectrum of the call voice signal, f is the frequency axis of the short-time spectrum, Re[X(f)]^2 is the square of the real part of the short-time spectrum, Im[X(f)]^2 is the square of the imaginary part, and P(f) is the short-time power spectrum of the call voice signal.
Step S212, performing critical-band analysis on the short-time power spectrum of the call voice signal to obtain a plurality of critical-bandwidth auditory spectra θ(k) of the call voice signal, where the division into critical bands reflects the masking effect of human hearing.
Step S213, performing equal-loudness pre-emphasis on the plurality of critical-bandwidth auditory spectra θ(k) by the following formula (5); preferably, θ(k) is emphasized using a simulated human-ear equal-loudness curve E(f) (at about 40 dB),

Γ(k) = E[f0(k)]·θ(k), (k = 1, 2, ..., 17)    (5)

where Γ(k) is the auditory spectrum after equal-loudness pre-emphasis, f0(k) denotes the center frequency of the k-th critical-bandwidth auditory spectrum, and E[f0(k)] is the value at f0(k) of the equal-loudness curve, which is obtained by the following formula (6) (with ω = 2πf the angular frequency):

E(f) = [(ω^2 + 56.8×10^6)·ω^4] / [(ω^2 + 6.3×10^6)^2·(ω^2 + 0.38×10^9)]    (6)
Step S214, performing intensity-loudness conversion on the plurality of critical-bandwidth auditory spectra after equal-loudness pre-emphasis by the following formula (7), approximating the nonlinear relation between sound intensity and the loudness perceived by the human ear,

φ(k) = Γ(k)^0.33    (7)

where φ(k) is the plurality of critical-bandwidth auditory spectra after intensity-loudness conversion.
Step S215, performing an inverse Fourier transform on the plurality of critical-bandwidth auditory spectra φ(k) after intensity-loudness conversion to obtain the inverse-Fourier-transformed call voice signal, computing an all-pole model, and solving for the cepstral coefficients of the call voice signal to obtain the PLP features.
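The following NumPy sketch ties steps S213-S215 together, assuming the critical-band auditory spectrum θ(k) and its center frequencies have already been computed in step S212; the Levinson-Durbin recursion and the LPC-to-cepstrum conversion are textbook implementations standing in for the all-pole model and cepstral-coefficient computation, not code from the patent.

```python
import numpy as np

def equal_loudness(f):
    """40 dB equal-loudness approximation E(f) of formula (6), w = 2*pi*f."""
    w2 = (2.0 * np.pi * np.asarray(f, dtype=float)) ** 2
    return ((w2 + 56.8e6) * w2 ** 2) / ((w2 + 6.3e6) ** 2 * (w2 + 0.38e9))

def levinson_durbin(r, order):
    """Autocorrelation -> all-pole (LPC) coefficients and prediction error."""
    a = np.zeros(order + 1)
    a[0], e = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        new_a = a.copy()
        new_a[i] = k
        new_a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a, e = new_a, e * (1.0 - k * k)
    return a, e

def lpc_to_cepstrum(a, gain, n_ceps):
    """Standard recursion from LPC coefficients to cepstral coefficients."""
    c = np.zeros(n_ceps)
    c[0] = np.log(gain)
    for n in range(1, n_ceps):
        acc = a[n] if n < len(a) else 0.0
        for k in range(1, n):
            if n - k < len(a):
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c

def plp_from_auditory_spectrum(theta, center_freqs, order=12, n_ceps=13):
    gamma = equal_loudness(center_freqs) * theta   # step S213, formula (5)
    phi = gamma ** 0.33                            # step S214, formula (7)
    # step S215: mirror the spectrum so the inverse DFT is real-valued, take
    # the autocorrelation, fit the all-pole model and derive the cepstrum
    spec = np.concatenate([phi, phi[-2:0:-1]])
    autocorr = np.fft.ifft(spec).real[: order + 1]
    a, err = levinson_durbin(autocorr, order)
    return lpc_to_cepstrum(a, err, n_ceps)
```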
In one embodiment, step S212 includes:
performing critical-band analysis on the short-time power spectrum of the call voice signal by the following formula (2),

Z(f) = 6·ln{ f/600 + [(f/600)^2 + 1]^(1/2) }    (2)

where Z(f) is the Bark-domain frequency;

mapping the frequency axis f of the spectrum P(f) to the Bark frequency Z to obtain 17 frequency bands (the Bark domain has 24 bands in the range 20 to 15500 Hz; 17 bands are obtained through the preceding processing step), and multiplying the energy spectrum in each of the 17 bands by the weighting coefficient of formula (4) and summing to obtain the critical-bandwidth auditory spectrum θ(k) of formula (3),

θ(k) = Σ_Z ψ(Z - Z0(k))·P(f(Z))    (3)

ψ(Z) = 0 for Z < -1.3; ψ(Z) = 10^(2.5(Z+0.5)) for -1.3 ≤ Z ≤ -0.5; ψ(Z) = 1 for -0.5 < Z < 0.5; ψ(Z) = 10^(-(Z-0.5)) for 0.5 ≤ Z ≤ 2.5; ψ(Z) = 0 for Z > 2.5    (4)

where Z0(k) represents the center frequency of the k-th critical-bandwidth auditory spectrum, ψ(Z - Z0(k)) the weighting coefficient corresponding to each frequency band, and P(f(Z)) the energy spectrum corresponding to each frequency band.
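The following sketch implements formulas (2)-(4), assuming the short-time power spectrum P(f) of formula (1) and its frequency axis are given; the Bark mapping and piecewise masking curve are the standard PLP ones, and the even placement of the 17 band centers is an illustrative choice rather than a value fixed by the patent.

```python
import numpy as np

def hz_to_bark(f):
    """Formula (2): Z(f) = 6*ln(f/600 + sqrt((f/600)^2 + 1))."""
    return 6.0 * np.arcsinh(np.asarray(f, dtype=float) / 600.0)

def critical_band_weight(dz):
    """Formula (4): piecewise critical-band masking curve psi(Z - Z0(k))."""
    dz = np.asarray(dz, dtype=float)
    w = np.zeros_like(dz)
    rising = (dz >= -1.3) & (dz <= -0.5)
    flat = (dz > -0.5) & (dz < 0.5)
    falling = (dz >= 0.5) & (dz <= 2.5)
    w[rising] = 10.0 ** (2.5 * (dz[rising] + 0.5))
    w[flat] = 1.0
    w[falling] = 10.0 ** (-(dz[falling] - 0.5))
    return w

def auditory_spectrum(power_spec, freqs, n_bands=17):
    """Formula (3): integrate P(f) into critical-band auditory spectra."""
    z = hz_to_bark(freqs)
    # illustrative: spread the 17 band centers evenly over the Bark range
    centers = np.linspace(z.min() + 0.5, z.max() - 0.5, n_bands)
    theta = np.array(
        [np.sum(critical_band_weight(z - z0) * power_spec) for z0 in centers]
    )
    return theta, centers
```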
In one embodiment, in step S230, the Faster RCNN network includes Conv Layers, a Region Proposal Network (RPN), RoI Pooling and Classification, where the Conv Layers extract feature maps of the voice features using basic CNN layers (conv, relu and pooling); the Region Proposal Network generates region proposals, judging the type of the anchors by softmax and then correcting the anchors to obtain accurate region proposals; and RoI Pooling collects the input feature maps and region proposals and combines the two kinds of information to extract proposal feature maps. That is, the step of constructing the Faster RCNN network includes (see the sketch after this list):
constructing the Faster RCNN network from a convolutional layer, an RPN, an RoI pooling layer and a fully-connected layer;
extracting a feature map of the voice features through the convolutional layer;
generating candidate regions through the RPN;
judging the type of the anchor boxes by using softmax, and obtaining candidate regions by correcting the anchor boxes;
collecting, in the RoI pooling layer, the feature map extracted by the convolutional layer and the candidate regions generated by the RPN, and extracting a plurality of candidate feature maps;
and synthesizing the plurality of candidate feature maps through the fully-connected layer.
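As a hedged sketch of this construction, a stock torchvision Faster R-CNN (convolutional backbone, RPN, RoI pooling and fully-connected heads) stands in for the patent's custom network; the input shape, the replication of the single PLP channel to three channels, and the two-class setting are illustrative assumptions.

```python
import torch
import torchvision

# Stock Faster R-CNN as a stand-in for the feature re-extraction network.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None,      # untrained; weights would come from labeled call data
    num_classes=2,     # illustrative: salient voice feature vs. background
)
model.eval()

plp = torch.rand(1, 17, 200)  # dummy PLP "image": 17 critical bands x 200 frames
with torch.no_grad():
    # the backbone expects 3 input channels, so the PLP plane is replicated
    detections = model([plp.expand(3, -1, -1)])
print(detections[0]["boxes"].shape)  # candidate regions over the feature map
```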
In one embodiment, step S300 includes:
the voice classification model is a Transformer network; that is, a Transformer network is adopted to classify the call voice data.
The Transformer network consists of two parts, Encoders and Decoders. The Encoders first encode the voice data features extracted by the Faster RCNN network to obtain a context semantic vector (context); the Decoders then decode the data using this context semantic vector, and finally a classification category is obtained through a layer of softmax (a sketch follows the steps below). That is, the step of constructing the Transformer network comprises:
Constructing a Transformer network through an encoder and a decoder;
coding the characteristics of the call voice signal extracted by the Faster RCNN network through the coder to obtain a context semantic vector;
and performing data decoding on the obtained context semantic vector through the decoder, and obtaining the classification category through a layer of softmax.
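A minimal PyTorch sketch of the classifier under stated assumptions: the patent specifies an encoder-decoder Transformer, while for a three-way classification an encoder-only variant is sketched here; the model dimension, head count, layer count and mean pooling are illustrative choices, not values from the patent.

```python
import torch
import torch.nn as nn

class CallClassifier(nn.Module):
    """Transformer encoder over re-extracted call features, with a final
    softmax over {normal, harassing, fraudulent}."""

    def __init__(self, d_model: int = 256, n_classes: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, d_model) features from the Faster RCNN stage
        context = self.encoder(feats)       # context semantic vectors
        pooled = context.mean(dim=1)        # pool over the time axis
        return self.head(pooled).softmax(dim=-1)

probs = CallClassifier()(torch.rand(2, 50, 256))  # two calls, 50 frames each
```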
In one embodiment, positional encoding is added in both the Encoders and the Decoders. The positional encoding formulas are:

PE(pos, 2i) = sin( pos / 10000^(2i/d_model) )

PE(pos, 2i+1) = cos( pos / 10000^(2i/d_model) )

where pos represents the position of the voice content after the Faster RCNN has extracted the voice data features, i indexes the dimension of the voice content, d_model is the dimension of the model, and PE is the positional encoding.
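The two formulas above can be computed directly; the following NumPy sketch (assuming an even d_model) is a transcription of the standard sinusoidal encoding, not code from the patent.

```python
import numpy as np

def positional_encoding(max_pos: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding; d_model is assumed to be even."""
    pos = np.arange(max_pos)[:, None]        # positions 0 .. max_pos-1
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 2i
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_pos, d_model))
    pe[:, 0::2] = np.sin(angle)              # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)              # PE(pos, 2i+1)
    return pe
```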
In one embodiment, step S300 further comprises:
the speech classification model is trained.
In one embodiment, the deep learning based phone recognition method further comprises:
dialing and message interception for blacklisted numbers, comprising: if the voice recognition result for a newly dialed number is an abnormal call (harassment or fraud), prompting the user to add it to the blacklist; numbers included in the blacklist are intercepted in subsequent dialing, are also shared with the mobile phone system's blacklist, and their messages are likewise intercepted.
In one embodiment, the deep learning based phone recognition method further comprises:
and feeding back the recognition result of the call voice data to the client.
The telephone recognition method based on deep learning realizes anti-harassment and anti-fraud recognition and interception at the telephone terminal, can analyze in real time the call voice of numbers that users encounter in daily life, and effectively recognizes high-risk numbers for harassment and fraud; following the risk prompt, the user can directly add the dialed number to a blacklist with the system, without any additional processing. In addition, junk messages from identified numbers can be intercepted according to the shared blacklist. Interception of dialing and messaging effectively helps users filter blacklisted numbers: on the one hand, users no longer need to answer intercepted harassing calls; on the other hand, the frequency of telephone fraud cases is reduced, effectively reducing losses to users' lives and property.
In one embodiment, in order to improve recognition efficiency and reduce energy consumption, the Faster RCNN and the Transformer network are combined into a voice category recognition network, the voice category recognition network is uploaded to the cloud, and feature analysis and recognition are performed on call voice uploaded by the client. If the recognition result is a normal call, the call recognition process ends; if the recognition result is an abnormal call, the result is pushed to the user with a suggestion to add the number to the blacklist, and after the user confirms, the confirmation is returned to the cloud. Specifically, as shown in fig. 4, the telephone recognition method based on deep learning includes (a sketch of this flow follows the steps below):
collecting a voice call signal of a client;
carrying out voice audio feature extraction on the voice call signal;
constructing a voice category identification network, and uploading the voice category identification network to a cloud, wherein the voice category identification network comprises a Faster RCNN and a Transformer network;
recognizing voice audio features through a voice category recognition network;
if the voice audio features are normal calls, the call recognition process is ended;
if the voice audio features indicate an abnormal call, judging whether it is a harassing call or a fraudulent call;
if the call is a harassing or fraudulent call, sending an early-warning signal to the client;
obtaining the client's feedback on the recognition result, and if the feedback confirms a harassing or fraudulent call, putting the number into the blacklist;
and if the feedback does not confirm a harassing or fraudulent call, ending the recognition process.
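The flow above can be summarized in a compact sketch; recognizer, notify_client and blacklist are hypothetical hooks introduced for illustration, not interfaces defined by the patent.

```python
def recognize_call(number: str, features, recognizer, notify_client,
                   blacklist) -> str:
    """Hedged sketch of the cloud-side decision flow described above."""
    label = recognizer(features)        # 'normal' | 'harassing' | 'fraud'
    if label == "normal":
        return "ended"                  # normal call: recognition ends here
    # abnormal call: push an early warning and wait for the client's feedback
    if notify_client(number, label):    # True if the user confirms the risk
        blacklist.add(number)           # intercept future dialing and messages
    return "ended"
```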
Preferably, the deep learning based phone recognition method further includes:
the method comprises the steps of collecting voice call signals of a client, sending an instruction whether to upload a voice file to a cloud end, uploading the voice file to the cloud end when the uploading instruction of the client is received, and identifying the voice file through a voice category identification network.
Preferably, the step of sending the warning signal to the client includes:
and intercepting the data to the client by dialing or sending information.
In one embodiment, the deep learning based phone recognition method further comprises:
a database that stores the call voice signals recognized as abnormal calls by the voice category recognition network but fed back as normal calls by the client, as well as other call voice signals.
The telephone recognition method based on deep learning adopts deep learning technology; as users feed recognition results back to the cloud, the labeled voice data needed to train the voice category recognition network accumulates, and the growing volume of voice data together with adjustment of the network model raises the voice recognition rate. Therefore, as recognition proceeds, continuously collected harassing voice calls, fraudulent voice calls and normal-scenario voice calls yield a visible improvement in the model recognition rate.
Fig. 5 is a functional block diagram of the telephone recognition device based on deep learning according to the present invention. The phone recognition apparatus 100 based on deep learning according to the present invention can be installed in an electronic device. According to the implemented functions, the deep learning based phone recognition apparatus may include a number acquisition module 110, a feature extraction module 120, and a classification module 130. A module according to the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the number acquisition module 110 acquires a call voice signal of the client;
the feature extraction module 120 extracts features of the call voice signal acquired by the acquisition module;
the classification module 130 constructs a voice classification model, inputs the features of the call voice signals extracted by the feature extraction module into the voice classification model, and obtains the classification of the call voice signals, wherein the classification comprises normal calls, harassing calls and fraudulent calls;
wherein the feature extraction module 120 comprises:
the first feature extraction submodule 121 extracts PLP features in the call voice signal by using openSMILE;
a feature data generation sub-module 122, which uses the script to call the config file corresponding to the PLP feature extracted by the first feature extraction sub-module to generate PLP feature data corresponding to the call voice signal;
the second feature extraction sub-module 123 performs feature re-extraction on the PLP feature data generated by the feature data generation sub-module by using the Faster RCNN network.
In one embodiment, the first feature extraction submodule 121 includes:
a short-time power spectrum obtaining unit, which, after sampling, windowing and discrete Fourier transform of the call voice signal, takes the sum of the squares of the real and imaginary parts of the short-time speech spectrum to obtain the short-time power spectrum,

P(f) = Re[X(f)]^2 + Im[X(f)]^2

where X(f) is the short-time spectrum of the call voice signal, f is the frequency axis of the short-time spectrum, Re[X(f)]^2 is the square of the real part of the short-time spectrum, Im[X(f)]^2 is the square of the imaginary part, and P(f) is the short-time power spectrum of the call voice signal;
a critical-band analysis unit, which performs critical-band analysis on the short-time power spectrum of the call voice signal to obtain a plurality of critical-bandwidth auditory spectra θ(k) of the call voice signal;
an equal-loudness pre-emphasis unit, which performs equal-loudness pre-emphasis on the plurality of critical-bandwidth auditory spectra θ(k) by the following formula,

Γ(k) = E[f0(k)]·θ(k), (k = 1, 2, ..., 17)

where Γ(k) is the auditory spectrum after equal-loudness pre-emphasis, f0(k) denotes the center frequency of the k-th critical-bandwidth auditory spectrum, and E[f0(k)] is the value at f0(k) of the equal-loudness curve, which is obtained by the following formula (with ω = 2πf):

E(f) = [(ω^2 + 56.8×10^6)·ω^4] / [(ω^2 + 6.3×10^6)^2·(ω^2 + 0.38×10^9)]
an intensity-loudness conversion unit, which performs intensity-loudness conversion on the plurality of critical-bandwidth auditory spectra after equal-loudness pre-emphasis by the following formula,

φ(k) = Γ(k)^0.33

where φ(k) is the plurality of critical-bandwidth auditory spectra after intensity-loudness conversion;

and a feature obtaining unit, which performs an inverse Fourier transform on the plurality of critical-bandwidth auditory spectra φ(k) after intensity-loudness conversion to obtain the inverse-Fourier-transformed call voice signal, computes an all-pole model, and solves for the cepstral coefficients of the call voice signal to obtain the PLP features.
Preferably, the critical band analyzing unit includes:
a band analysis subunit, which performs critical-band analysis on the short-time power spectrum of the call voice signal by the following formula,

Z(f) = 6·ln{ f/600 + [(f/600)^2 + 1]^(1/2) }

where Z(f) is the Bark-domain frequency;

and a critical-bandwidth auditory spectrum obtaining subunit, which maps the frequency axis f of the spectrum P(f) to the Bark frequency Z to obtain 17 frequency bands, and multiplies the energy spectrum of each frequency band by a weighting coefficient and sums to obtain the critical-bandwidth auditory spectrum θ(k),

θ(k) = Σ_Z ψ(Z - Z0(k))·P(f(Z))

ψ(Z) = 0 for Z < -1.3; ψ(Z) = 10^(2.5(Z+0.5)) for -1.3 ≤ Z ≤ -0.5; ψ(Z) = 1 for -0.5 < Z < 0.5; ψ(Z) = 10^(-(Z-0.5)) for 0.5 ≤ Z ≤ 2.5; ψ(Z) = 0 for Z > 2.5

where Z0(k) represents the center frequency of the k-th critical-bandwidth auditory spectrum, ψ(Z - Z0(k)) the weighting coefficient corresponding to each frequency band, and P(f(Z)) the energy spectrum corresponding to each frequency band.
In one embodiment, the Faster RCNN network includes a convolutional layer, an RPN, an RoI pooling layer and a fully-connected layer: the convolutional layer extracts a feature map of the voice features; the RPN generates candidate regions, judging the type of the anchor boxes by softmax and obtaining the candidate regions by correcting the anchor boxes; the RoI pooling layer combines the feature map extracted by the convolutional layer with the candidate regions from the RPN and extracts a plurality of candidate feature maps; and the fully-connected layer integrates the plurality of candidate feature maps.
In one embodiment, the voice classification model is a Transformer network comprising an encoder and a decoder: the encoder encodes the features of the call voice signal extracted by the Faster RCNN network to obtain a context semantic vector, the decoder performs data decoding on the obtained context semantic vector, and the classification category is obtained through a layer of softmax.
Fig. 6 is a schematic structural diagram of an electronic device implementing a deep learning-based phone recognition method according to the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a deep learning based phone identification program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of a phone recognition program based on deep learning, etc., but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (e.g., a phone recognition program based on deep learning, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 6 only shows an electronic device with certain components, and it will be understood by those skilled in the art that the structure shown in fig. 6 does not constitute a limitation of the electronic device 1, which may comprise fewer or more components than those shown, combine some components, or arrange the components differently.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The deep learning based phone recognition program 12 stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:
collecting a call voice signal of a client;
extracting the characteristics of the call voice signal;
establishing a voice classification model, inputting the features of the call voice signal into the voice classification model, and obtaining the classification of the call voice signal, wherein the classification comprises normal calls, harassing calls and fraudulent calls;
wherein the step of extracting the feature of the call voice signal comprises:
extracting PLP features in the call voice signal by using openSMILE;
calling the config file corresponding to the PLP features with a script to generate PLP feature data corresponding to the call voice signal;
and performing feature re-extraction on the PLP feature data by using the Faster RCNN network.
Specifically, for the implementation of the above instructions by the processor 10, reference may be made to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not repeated here. It should be emphasized that, to further ensure the privacy and security of the data, the data may also be stored in a node of a blockchain.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium may be non-volatile or volatile, and the computer-readable storage medium includes a computer program, where the computer program is executed by a processor, and the computer program implements the following operations:
collecting a call voice signal of a client;
extracting the characteristics of the call voice signal;
establishing a voice classification model, inputting the characteristics of the call voice signals into the voice classification model, and obtaining the classification of the call voice signals, wherein the classification comprises normal calls, harassing calls and fraud calls;
wherein the step of extracting the feature of the call voice signal comprises:
extracting PLP features in the call voice signal by using openSMILE;
calling the config file corresponding to the PLP features with a script to generate PLP feature data corresponding to the call voice signal;
and performing feature re-extraction on the PLP feature data by using the Faster RCNN network.
The specific implementation of the computer-readable storage medium of the present invention is substantially the same as the above-mentioned phone recognition method, device and electronic device based on deep learning, and will not be described herein again.
The telephone recognition method, device, electronic equipment and medium based on deep learning adopt deep learning technology, make better use of large amounts of voice data and the correlations among the data, and improve the recognition rate of call voice categories. The voice category recognition network model adopts the novel Transformer network structure, and GPU-accelerated computation can speed up model updating and recognition. Call voice data is reused: the voice data features are used to train the voice category recognition network model, which can analyze and classify the user's call voice content immediately and thus serve an early-warning function. A certain recognition rate is achieved for harassment prevention, greatly reducing the frequency of harassing calls, and blacklist processing is assisted for users. For fraud prevention, the system can give early warnings, reducing the rate of telephone fraud crimes and in particular protecting young and elderly people with weak precaution awareness.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means through software or hardware. Terms such as first and second are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method for phone recognition based on deep learning, the method comprising:
collecting a call voice signal of a client;
extracting the characteristics of the call voice signal;
inputting the features of the call voice signal into a voice classification model to obtain the classification of the call voice signal, wherein the classification comprises normal calls, harassing calls and fraudulent calls;
wherein the step of extracting the feature of the call voice signal comprises:
extracting PLP features in the call voice signal by using openSMILE;
calling the config file corresponding to the PLP features with a script to generate PLP feature data corresponding to the call voice signal;
and performing feature re-extraction on the PLP feature data by using a Faster RCNN network to obtain the features of the call voice signal.
2. The method for phone recognition based on deep learning of claim 1, wherein the step of extracting PLP features in the call voice signal by openSMILE comprises:
after sampling, windowing and discrete Fourier transform of the call voice signal, taking the sum of the squares of the real and imaginary parts of the short-time speech spectrum to obtain a short-time power spectrum,

P(f) = Re[X(f)]^2 + Im[X(f)]^2

where X(f) is the short-time spectrum of the call voice signal, f is the frequency axis of the short-time spectrum, Re[X(f)]^2 is the square of the real part of the short-time spectrum, Im[X(f)]^2 is the square of the imaginary part, and P(f) is the short-time power spectrum of the call voice signal;
performing critical frequency band analysis on the short-time power spectrum of the call voice signal to obtain a plurality of critical bandwidth auditory spectrums theta (k) of the call voice signal;
performing equal-loudness pre-emphasis on the plurality of critical-bandwidth auditory spectra θ(k) by the following formula,

Γ(k) = E[f0(k)]·θ(k), (k = 1, 2, ..., 17)

where Γ(k) is the auditory spectrum after equal-loudness pre-emphasis, f0(k) denotes the center frequency of the k-th critical-bandwidth auditory spectrum, and E[f0(k)] is the value at f0(k) of the equal-loudness curve, which is obtained by the following formula (with ω = 2πf):

E(f) = [(ω^2 + 56.8×10^6)·ω^4] / [(ω^2 + 6.3×10^6)^2·(ω^2 + 0.38×10^9)]
performing intensity-loudness conversion on the plurality of critical-bandwidth auditory spectra after equal-loudness pre-emphasis by

φ(k) = Γ(k)^0.33

where φ(k) is the plurality of critical-bandwidth auditory spectra after intensity-loudness conversion;

and performing an inverse Fourier transform on the plurality of critical-bandwidth auditory spectra φ(k) after intensity-loudness conversion to obtain the inverse-Fourier-transformed call voice signal, computing an all-pole model, and solving for the cepstral coefficients of the call voice signal to obtain the PLP features.
3. The deep learning-based phone recognition method of claim 2, wherein the step of performing critical band analysis on the short-time power spectrum of the call voice signal comprises:
the short-time power spectrum of the call voice signal is subjected to critical band analysis by the following formula,
Figure FDA0002861551750000021
wherein Z (f) is Bark domain frequency;
mapping the frequency axis f of the short-time power spectrum P(f) to the Bark frequency Z to obtain 17 frequency bands, and multiplying the energy spectrum of each frequency band by a weighting coefficient to obtain the critical bandwidth auditory spectrum θ(k),
θ(k) = Σ ψ(Z − Z0(k))·P(f(Z)), (k = 1, 2, …, 17), the sum running over Z0(k) − 1.3 ≤ Z ≤ Z0(k) + 2.5
ψ(x) = 0, x < −1.3
ψ(x) = 10^(2.5·(x + 0.5)), −1.3 ≤ x ≤ −0.5
ψ(x) = 1, −0.5 < x < 0.5
ψ(x) = 10^(−(x − 0.5)), 0.5 ≤ x ≤ 2.5
ψ(x) = 0, x > 2.5
with x = Z − Z0(k),
wherein Z0(k) represents the center frequency of the k-th critical bandwidth auditory spectrum, ψ(Z − Z0(k)) is the weighting coefficient corresponding to each frequency band, and P(f(Z)) is the energy spectrum corresponding to each frequency band.
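A numpy sketch of this critical band analysis follows, using the standard PLP Bark mapping and the standard piecewise weighting curve; the choice of 17 band centers spread over the Bark axis is an illustrative assumption.

```python
import numpy as np

def hz_to_bark(f):
    """Bark mapping Z(f) = 6 * ln(f/600 + sqrt((f/600)^2 + 1))."""
    return 6.0 * np.log(f / 600.0 + np.sqrt((f / 600.0) ** 2 + 1.0))

def weighting(dz):
    """Piecewise critical-band weighting psi(Z - Z0(k))."""
    return np.select(
        [dz < -1.3, dz <= -0.5, dz < 0.5, dz <= 2.5],
        [0.0, 10.0 ** (2.5 * (dz + 0.5)), 1.0, 10.0 ** (-(dz - 0.5))],
        default=0.0,
    )

def critical_band_spectrum(P, freqs, n_bands=17):
    z = hz_to_bark(freqs)                          # map frequency axis to Bark
    z0 = np.linspace(z[1], z[-1] - 1.0, n_bands)   # assumed band centers Z0(k)
    # theta(k): weighted sum of the energy spectrum around each band center
    return np.array([np.sum(weighting(z - c) * P) for c in z0])

freqs = np.linspace(0.0, 8000.0, 257)    # rfft bins of a 512-point frame at 16 kHz
P = np.abs(np.fft.rfft(np.random.randn(512))) ** 2   # stand-in power spectrum
theta = critical_band_spectrum(P, freqs)             # 17 auditory spectrum values
```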
4. The deep learning-based phone recognition method of claim 1, wherein the Faster RCNN network is constructed by:
constructing the Faster RCNN network from a convolutional layer, an RPN (region proposal network), a comprehensive convolutional layer and a fully connected layer;
extracting a feature map of the voice features through the convolutional layer;
generating candidate regions through the RPN;
judging the category of each anchor box by using softmax, and obtaining the candidate regions by correcting the anchor boxes;
combining, in the comprehensive convolutional layer, the feature map extracted by the convolutional layer with the candidate regions obtained by the RPN, and extracting a plurality of candidate feature maps;
and synthesizing the plurality of candidate feature maps through the fully connected layer.
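A much-simplified PyTorch sketch of these four stages follows, treating the PLP feature matrix as a one-channel "image". Channel sizes, the anchor count, and the externally supplied proposals are illustrative stand-ins; a full Faster RCNN also decodes proposals from the RPN via anchor regression and non-maximum suppression, and the RoI pooling call here stands in for the comprehensive convolutional layer.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class TinyFasterRCNN(nn.Module):
    """Toy version of the four claim-4 stages for PLP feature maps."""
    def __init__(self, n_anchors=9, out_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(                  # convolutional layer
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.rpn_cls = nn.Conv2d(64, 2 * n_anchors, 1)  # RPN objectness scores
        self.fc = nn.Linear(64 * 7 * 7, out_dim)        # fully connected layer

    def forward(self, x, proposals):
        fmap = self.backbone(x)                 # feature map of the input
        scores = self.rpn_cls(fmap)
        # softmax over object / background per anchor position
        objectness = torch.softmax(scores.view(x.size(0), 2, -1), dim=1)
        # combine the feature map with the candidate regions and extract
        # fixed-size candidate feature maps
        rois = roi_pool(fmap, proposals, output_size=(7, 7))
        return self.fc(rois.flatten(1)), objectness

model = TinyFasterRCNN()
plp_map = torch.randn(1, 1, 64, 64)               # PLP frames as a 1-channel map
boxes = [torch.tensor([[0.0, 0.0, 32.0, 32.0]])]  # stand-in candidate region
features, objectness = model(plp_map, boxes)      # re-extracted features
```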
5. The deep learning-based phone recognition method of claim 1, wherein the voice classification model is a Transformer network.
6. The deep learning-based phone recognition method of claim 5, wherein the Transformer network is constructed by:
constructing the Transformer network from an encoder and a decoder;
encoding, by the encoder, the features of the call voice signal extracted by the Faster RCNN network to obtain a context semantic vector;
and decoding, by the decoder, the obtained context semantic vector, and obtaining the classification category through a softmax layer.
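A hedged PyTorch sketch of this claim-6 construction: an encoder-decoder Transformer whose encoder consumes the Faster RCNN features and whose decoder output passes through a softmax layer over the three call categories. The model dimensions, layer counts, and the single learned decoder query are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CallClassifier(nn.Module):
    """Encoder-decoder Transformer with a softmax layer over 3 call classes."""
    def __init__(self, d_model=256, n_classes=3):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.query = nn.Parameter(torch.zeros(1, 1, d_model))  # decoder input
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, feats):               # feats: (batch, seq, d_model)
        q = self.query.expand(feats.size(0), -1, -1)
        ctx = self.transformer(feats, q)    # encode features, decode context
        return torch.softmax(self.head(ctx[:, 0]), dim=-1)

clf = CallClassifier()
probs = clf(torch.randn(4, 10, 256))        # P(normal / harassing / fraud)
```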
7. The deep learning-based phone recognition method of claim 5, further comprising: combining the Faster RCNN network and the Transformer network into a voice type recognition network, and uploading the voice type recognition network to the cloud.
8. An apparatus for phone recognition based on deep learning, the apparatus comprising:
the acquisition module is used for acquiring a call voice signal of a client;
the feature extraction module is used for extracting features of the call voice signal acquired by the acquisition module;
the classification module is used for constructing a voice classification model and inputting the features of the call voice signal extracted by the feature extraction module into the voice classification model to obtain the category of the call voice signal, wherein the categories comprise normal calls, harassing calls and fraud calls;
wherein the feature extraction module comprises:
the first feature extraction submodule is used for extracting PLP features from the call voice signal by using openSMILE;
the feature data generation submodule is used for calling, by a script, the config file corresponding to the PLP features extracted by the first feature extraction submodule to generate PLP feature data corresponding to the call voice signal;
and the second feature extraction submodule is used for performing feature re-extraction on the PLP feature data generated by the feature data generation submodule by using a Faster RCNN network.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the deep learning-based phone recognition method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the deep learning-based phone recognition method according to any one of claims 1 to 7.
CN202011564958.5A 2020-12-25 2020-12-25 Telephone recognition method, device, equipment and medium based on deep learning Active CN112738338B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011564958.5A CN112738338B (en) 2020-12-25 2020-12-25 Telephone recognition method, device, equipment and medium based on deep learning

Publications (2)

Publication Number Publication Date
CN112738338A true CN112738338A (en) 2021-04-30
CN112738338B CN112738338B (en) 2022-10-14

Family

ID=75616376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011564958.5A Active CN112738338B (en) 2020-12-25 2020-12-25 Telephone recognition method, device, equipment and medium based on deep learning

Country Status (1)

Country Link
CN (1) CN112738338B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120116763A1 (en) * 2009-07-16 2012-05-10 Nec Corporation Voice data analyzing device, voice data analyzing method, and voice data analyzing program
CN102982801A (en) * 2012-11-12 2013-03-20 中国科学院自动化研究所 Phonetic feature extracting method for robust voice recognition
CN109525700A (en) * 2018-12-25 2019-03-26 出门问问信息科技有限公司 Incoming call recognition methods, device, computer equipment and readable storage medium storing program for executing
CN111222025A (en) * 2019-12-27 2020-06-02 南京中新赛克科技有限责任公司 Fraud number identification method and system based on convolutional neural network
CN111970400A (en) * 2019-05-20 2020-11-20 中国移动通信集团陕西有限公司 Crank call identification method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114449106A (en) * 2022-02-10 2022-05-06 恒安嘉新(北京)科技股份公司 Abnormal telephone number identification method, device, equipment and storage medium
CN114449106B (en) * 2022-02-10 2024-04-30 恒安嘉新(北京)科技股份公司 Method, device, equipment and storage medium for identifying abnormal telephone number
CN115334509A (en) * 2022-06-18 2022-11-11 阮荣军 Conversation wind control system applying big data service
CN115334509B (en) * 2022-06-18 2023-10-31 义乌中国小商品城大数据有限公司 Communication wind control system applying big data service
CN117555916A (en) * 2023-11-06 2024-02-13 广东电网有限责任公司佛山供电局 Voice interaction method and system based on natural language processing
CN117555916B (en) * 2023-11-06 2024-05-31 广东电网有限责任公司佛山供电局 Voice interaction method and system based on natural language processing

Also Published As

Publication number Publication date
CN112738338B (en) 2022-10-14

Similar Documents

Publication Publication Date Title
CN112738338B (en) Telephone recognition method, device, equipment and medium based on deep learning
CN110910901B (en) Emotion recognition method and device, electronic equipment and readable storage medium
CN106683680A (en) Speaker recognition method and device and computer equipment and computer readable media
CN107103903A (en) Acoustic training model method, device and storage medium based on artificial intelligence
CN110047510A (en) Audio identification methods, device, computer equipment and storage medium
CN109726372B (en) Method and device for generating work order based on call records and computer readable medium
CN113488024B (en) Telephone interrupt recognition method and system based on semantic recognition
CN113903363A (en) Violation detection method, device, equipment and medium based on artificial intelligence
CN114338623B (en) Audio processing method, device, equipment and medium
CN107492153A (en) Attendance checking system, method, work attendance server and attendance record terminal
CN111276119A (en) Voice generation method and system and computer equipment
CN113191787A Telecommunication data processing method, device, electronic equipment and storage medium
CN113707173A (en) Voice separation method, device and equipment based on audio segmentation and storage medium
CN114155832A (en) Speech recognition method, device, equipment and medium based on deep learning
CN109545226A (en) A kind of audio recognition method, equipment and computer readable storage medium
CN113539243A (en) Training method of voice classification model, voice classification method and related device
CN111552832A (en) Risk user identification method and device based on voiceprint features and associated map data
CN110556114A (en) Speaker identification method and device based on attention mechanism
CN116092503A (en) Fake voice detection method, device, equipment and medium combining time domain and frequency domain
CN116072119A (en) Voice control system, method, electronic equipment and medium for emergency command
CN108010533A (en) The automatic identifying method and device of voice data code check
CN113555003B (en) Speech synthesis method, device, electronic equipment and storage medium
CN111985231B (en) Unsupervised role recognition method and device, electronic equipment and storage medium
CN114842880A (en) Intelligent customer service voice rhythm adjusting method, device, equipment and storage medium
CN109065066B (en) Call control method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant