CN113611329B - Voice abnormality detection method and device - Google Patents

Voice abnormality detection method and device

Info

Publication number
CN113611329B
CN113611329B (application CN202110750572.1A)
Authority
CN
China
Prior art keywords
voice
voice information
detected
information
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110750572.1A
Other languages
Chinese (zh)
Other versions
CN113611329A (en)
Inventor
王喜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202110750572.1A priority Critical patent/CN113611329B/en
Publication of CN113611329A publication Critical patent/CN113611329A/en
Application granted granted Critical
Publication of CN113611329B publication Critical patent/CN113611329B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 — Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique


Abstract

This specification discloses a voice abnormality detection method and device. The method acquires voice information to be detected, determines its corresponding voice information features, and inputs those features into a pre-trained recognition model to obtain a recognition result, which is used to perform voice abnormality detection on the voice information to be detected. If the voice information to be detected is determined to be abnormal, and voice information re-acquired from the corresponding voice source is still abnormal after standby service equipment is brought online, a network voice attack is deemed to exist and the voice source is handled through a preset exception-handling strategy. By performing abnormality detection on intelligent voice response sessions in this way, attacks by malicious actors can be identified and handled, preventing the attacks and keeping the system stable.

Description

Voice abnormality detection method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and apparatus for detecting voice anomalies.
Background
At present, intelligent voice response technology has developed rapidly and is applied in fields such as intelligent customer service and information search, effectively improving the execution efficiency of services in these fields and bringing great convenience to users' daily lives.
In practical applications, malicious actors may attack a platform that provides intelligent voice response, for example by continuously occupying the platform's intelligent voice response windows so that other users who need the voice response service cannot use it. To cope with the adverse effects of such situations, the prior art typically expands capacity, that is, increases the number of devices providing the voice response service. However, this greatly increases the platform's operating cost and does nothing to reduce the attacks themselves.
Therefore, how to effectively prevent malicious attacks on the voice response service provided by a platform is a problem to be solved.
Disclosure of Invention
The present disclosure provides a method and apparatus for detecting voice anomalies, so as to partially solve the above problems in the prior art.
The technical scheme adopted in the specification is as follows:
the specification provides a method for detecting voice anomalies, which comprises the following steps:
acquiring voice information to be detected;
determining voice information features corresponding to the voice information to be detected, wherein the voice information features comprise: at least one of a voiceprint feature, a data transmission feature, and a voice session request feature, wherein the data transmission feature characterizes the voice information to be detected in terms of data transmission amount, and the voice session request feature characterizes the session window of the voice session request corresponding to the voice information to be detected;
inputting the voice information characteristics into a pre-trained recognition model to obtain a recognition result aiming at the voice information to be detected;
according to the recognition result, performing voice abnormality detection on the voice information to be detected;
if the voice information to be detected is determined to be abnormal voice information, and voice information re-acquired from the voice source corresponding to the voice information to be detected is still abnormal after standby service equipment is enabled, determining that the voice source exhibits a network voice attack, and processing the voice source through a preset exception-handling strategy.
Optionally, the voiceprint feature includes: at least one of a speech speed characteristic corresponding to the voice information to be detected and a volume characteristic corresponding to the voice information to be detected;
the data transmission features include: at least one of the data size of the data transmission packet corresponding to the voice information to be detected, the number of bytes per second of the voice information to be detected in the transmission process and the number of the data transmission packet corresponding to the voice information to be detected;
the voice session request feature includes: at least one of a session window size of a voice session request corresponding to the voice information to be detected and a header size corresponding to the session window of the voice information to be detected.
Optionally, the recognition model comprises a speech reconstruction model and a speech anomaly recognition model;
inputting the voice information characteristics into a pre-trained recognition model to obtain a recognition result aiming at the voice information to be detected, wherein the recognition result specifically comprises the following steps:
inputting the voice information characteristics into a pre-trained voice reconstruction model and a voice abnormality recognition model to determine a reconstruction score corresponding to the voice information to be detected through the voice reconstruction model and determine an abnormality score aiming at the voice information to be detected through the voice abnormality recognition model;
According to the recognition result, detecting the voice abnormality of the voice information to be detected, specifically including:
and detecting the voice abnormality of the voice information to be detected according to the abnormality score and the reconstruction score.
Optionally, according to the anomaly score and the reconstruction score, performing voice anomaly detection on the voice information to be detected specifically includes:
determining the confidence corresponding to the abnormal score;
and carrying out voice abnormality detection on the voice information to be detected according to the confidence level, the abnormality score and the reconstruction score.
Optionally, according to the confidence, the anomaly score and the reconstruction score, performing voice anomaly detection on the voice information to be detected specifically includes:
determining a penalty weight corresponding to the confidence according to the magnitude of the confidence, wherein the lower the confidence, the higher the penalty weight;
determining a compensated anomaly score according to the confidence level, the penalty weight and the anomaly score;
and carrying out voice abnormality detection on the voice information to be detected according to the compensated abnormal score and the reconstruction score.
Optionally, training the speech reconstruction model specifically includes:
acquiring first sample voice information;
inputting the first sample voice information into the voice reconstruction model to reconstruct the first sample voice information to obtain reconstructed voice information;
and training the voice reconstruction model according to the deviation between the reconstructed voice information and the first sample voice information.
Optionally, training the speech abnormality recognition model specifically includes:
acquiring second sample voice information;
inputting the second sample voice information into the voice abnormality recognition model to obtain an abnormality score corresponding to the second sample voice information;
and training the voice abnormality recognition model with the optimization target of minimizing the deviation between the abnormality score corresponding to the second sample voice information and the labeling information corresponding to the second sample voice information.
The present specification provides a device for detecting speech anomalies, comprising:
the acquisition module is used for acquiring the voice information to be detected;
the determining module is configured to determine the voice information features corresponding to the voice information to be detected, where the voice information features include: at least one of a voiceprint feature, a data transmission feature, and a voice session request feature, wherein the data transmission feature characterizes the voice information to be detected in terms of data transmission amount, and the voice session request feature characterizes the session window of the voice session request corresponding to the voice information to be detected;
The input module is used for inputting the voice information characteristics into a pre-trained recognition model to obtain a recognition result aiming at the voice information to be detected;
the detection module is used for detecting voice abnormality of the voice information to be detected according to the identification result;
the abnormality processing module is configured to, if the voice information to be detected is determined to be abnormal voice information and voice information re-acquired from the corresponding voice source is still abnormal after standby service equipment is enabled, determine that the voice source exhibits a network voice attack and process the voice source through a preset exception-handling strategy.
The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the above-described method of speech abnormality detection.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of speech anomaly detection described above when executing the program.
At least one of the technical solutions adopted in this specification can achieve the following beneficial effects:
In the voice abnormality detection method provided in this specification, voice information to be detected is first acquired, and its corresponding voice information features are determined. These features include at least one of a voiceprint feature, a data transmission feature, and a voice session request feature, where the data transmission feature characterizes the voice information in terms of data transmission amount, and the voice session request feature characterizes the session window of the corresponding voice session request. The features are input into a pre-trained recognition model to obtain a recognition result, and voice abnormality detection is performed on the voice information to be detected accordingly. If the voice information to be detected is determined to be abnormal, and voice information re-acquired from the corresponding voice source is still abnormal after standby service equipment is enabled, the voice source is determined to exhibit a network voice attack and is processed through a preset exception-handling strategy.
According to the method, the attack behaviors of the illegal persons can be identified by detecting voice abnormality of the voice information corresponding to the intelligent voice response service, and a preset abnormality processing strategy is adopted for the attack behaviors of the illegal persons to perform abnormality processing on the voice source corresponding to the abnormal voice information so as to prevent the attack behaviors of the illegal persons and maintain the stability of the system.
Drawings
The accompanying drawings, which are included to provide a further understanding of this specification, illustrate exemplary embodiments of the specification and, together with the description, serve to explain it; they are not intended to limit the specification unduly. In the drawings:
FIG. 1 is a flow chart of a method for detecting speech anomalies in the present disclosure;
FIG. 2 is a schematic diagram of the recognition model training process provided herein;
FIG. 3 is a schematic diagram of an apparatus for detecting speech anomalies provided in the present disclosure;
fig. 4 is a schematic diagram of an electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a flow chart of a method for detecting voice anomalies in the present specification, which includes the following steps:
step S100, obtaining the voice information to be detected.
To address the problem that a platform providing intelligent voice response may be attacked by malicious actors who continuously occupy intelligent voice response windows, preventing other users from executing voice response services, this specification provides a voice abnormality detection method. The method performs voice abnormality detection on ongoing voice sessions in the intelligent voice response windows and takes exception-handling measures once voice information is determined to be abnormal, thereby reducing the impact of attacks on the intelligent voice response platform, keeping the system stable, and improving the execution efficiency of the intelligent voice response service.
The method for detecting voice abnormality provided in the present specification may be executed by a terminal device such as a desktop computer, or may be executed by a platform or a server that provides an intelligent voice response service. In the following, for convenience of description, only a platform providing an intelligent voice response service will be described as an execution subject.
In a specific implementation, the platform can perform voice abnormality detection for each intelligent voice response service, so the voice information to be detected can be the voice information involved in each such service. In practice, most intelligent voice response services are normal and only a small fraction are attacks initiated by malicious actors. The platform can therefore monitor the duration of each intelligent voice response service and initiate voice abnormality detection for a service once its duration reaches a set threshold.
Step S102, determining a voice information feature corresponding to the voice information to be detected, where the voice information feature includes: at least one of voiceprint characteristics, data transmission characteristics and voice conversation request characteristics, wherein the data transmission characteristics are used for representing characteristics of the voice information to be detected on data transmission quantity, and the voice conversation request characteristics are used for representing conversation window characteristics of voice conversation requests corresponding to the voice information to be detected.
In a specific implementation, after acquiring the voice information to be detected, the platform extracts its voice information features. In practice, a typical intelligent voice response is a voice dialogue established between the platform and a terminal owned by a client (such as a mobile phone, tablet computer, or notebook computer); the user sends voice information to the platform through the terminal, and this voice is usually produced by a person. When a malicious actor attacks a platform providing intelligent voice response, the attack is usually carried out by a program; that is, the voice information sent from the attacker's terminal is machine-generated (robot) voice. There should therefore be a large difference between the voice information features of attack voice and those of human voice.
In an actual service scenario, the client needs to hold a voice dialogue with the platform, communicate the service content, and give voice responses according to the platform's replies. The voice information involved in each round of interaction varies considerably in length, and the speech rate and volume change with the content of the dialogue. Once the service related to the dialogue content is completed, the client immediately ends the dialogue and releases the intelligent voice response window.
An attacker, by contrast, has no intention of genuine communication, so the speech rate and volume of attack voice do not change with the dialogue content, and the differences between the voice information across rounds of interaction are much smaller than for voice sent by a real client. Moreover, when the voice dialogue corresponding to an attack occupies an intelligent voice response window for a long time, the platform allocates more resources to that dialogue to meet its demand.
Thus, in this specification, abnormal voice information can be identified from at least three aspects: voiceprint features, data transmission features, and voice session request features.
The voiceprint feature characterizes the acoustic spectrum of the voice information to be detected and may include: speech-rate features (such as the speech rate and its variation range) and volume features (such as the volume and its maximum value) corresponding to the voice information to be detected.
In practice, the speech rate and volume of a client's voice change with the dialogue content, whereas a robot's speech rate is uniform and its volume stable. Thus, the larger the variation range of the speech rate of the voice information to be detected, the more likely it is normal voice information; otherwise, the more likely it is abnormal. Likewise, the larger the volume maximum, the more likely the voice information is normal; otherwise, the more likely it is abnormal.
A data transmission characteristic for characterizing a characteristic of the voice information to be detected in terms of data transmission amount, the data transmission characteristic may include: the data size of the data transmission packet corresponding to the voice information to be detected (for example, average packet size (Average Packet Size), standard deviation of packet size in forward stream (Fwd Packet Length Std), etc.), the number of bytes per second (for example, number of bytes per second flowing (Flow bytes/s)) of the voice information to be detected in the transmission process, the number of data transmission Packets corresponding to the voice information to be detected (for example, number of Packets per second flowing (Flow Packets/s), number of Packets in reverse stream (total Bwd Packets)), etc.
In practice, a client's voice information changes with the dialogue content, the dialogue content of each round varies in length, and once the business communication is complete the voice dialogue ends immediately and the intelligent voice response window is released. Consequently, for normal voice information the data volume of each transmission packet tends to be smaller, the standard deviation of packet sizes larger, and the number of packets smaller. For robot voice, the data volume is relatively large, the standard deviation of packet sizes relatively small, and the number of packets relatively large.
Thus, the larger the data volume of the data transmission packets corresponding to the voice information to be detected, the more likely it is abnormal voice information; conversely, the more likely it is normal. The larger the standard deviation of the packet data volume, the more likely the voice information is normal; conversely, the more likely it is abnormal.
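The flow-level transmission statistics named above (Average Packet Size, Fwd Packet Length Std, Flow Bytes/s, Flow Packets/s) can be sketched as follows. This is a minimal illustration assuming per-packet sizes have already been captured; the function and field names are not from the patent:

```python
import numpy as np

def flow_statistics(packet_sizes, duration_s):
    """Compute illustrative flow-level transmission features.

    packet_sizes: size in bytes of each packet in the flow.
    duration_s: duration of the flow in seconds.
    """
    sizes = np.asarray(packet_sizes, dtype=float)
    return {
        "avg_packet_size": sizes.mean(),                # Average Packet Size
        "packet_size_std": sizes.std(),                 # cf. Fwd Packet Length Std
        "flow_bytes_per_s": sizes.sum() / duration_s,   # Flow Bytes/s
        "flow_packets_per_s": len(sizes) / duration_s,  # Flow Packets/s
        "total_packets": len(sizes),                    # packet count
    }
```

Per the heuristics above, a flow with a large average packet size, a small packet-size standard deviation, and a high packet rate would lean toward the abnormal (robot) side.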
A voice session request feature for characterizing a session window feature of a voice session request corresponding to voice information to be detected, the voice session request feature comprising: a session window size (e.g., TCP window size) of a voice session request corresponding to the voice information to be detected, a packet header size (e.g., traffic packet header size) corresponding to the session window of the voice information to be detected, and so on.
In an actual service scenario, once the service content of the dialogue corresponding to a client's voice information has been communicated, the voice dialogue ends immediately and the intelligent voice response window is released; robot voice, however, occupies the window for a long time, causing the platform to allocate more session resources to the abnormal dialogue to meet its demand. Thus, the larger the TCP window corresponding to the voice information to be detected, the more likely it is abnormal voice information; conversely, the more likely it is normal. The same holds for the traffic packet header size: the larger it is, the more likely the voice information is abnormal.
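Taken together, the three feature groups discussed above can be flattened into a single input vector for the recognition model. The sketch below is illustrative only: the patent does not specify the vector layout, and all argument names and feature choices are assumptions.

```python
import numpy as np

def build_feature_vector(speech_rates, volumes, packet_sizes,
                         flow_bytes_per_s, tcp_window_size, header_size):
    """Flatten voiceprint, transmission, and session features into one vector."""
    voiceprint = [
        float(np.mean(speech_rates)),  # average speech rate
        float(np.ptp(speech_rates)),   # speech-rate variation range
        float(np.mean(volumes)),       # average volume
        float(np.max(volumes)),        # volume maximum
    ]
    transmission = [
        float(np.mean(packet_sizes)),  # average packet size
        float(np.std(packet_sizes)),   # packet-size standard deviation
        float(flow_bytes_per_s),       # bytes per second in the flow
        float(len(packet_sizes)),      # number of packets
    ]
    session = [
        float(tcp_window_size),        # TCP session window size
        float(header_size),            # traffic packet header size
    ]
    return np.array(voiceprint + transmission + session, dtype=np.float32)
```

A vector of this kind would then be fed to the recognition model in step S104 below.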
Step S104, inputting the voice information characteristics into a pre-trained recognition model to obtain a recognition result aiming at the voice information to be detected.
Step S106, according to the recognition result, voice abnormality detection is carried out on the voice information to be detected.
In specific implementation, after the platform acquires the voice information characteristics corresponding to the voice information to be detected, the voice information characteristics are input into a pre-trained recognition model, so that the voice information to be detected is recognized through the recognition model, and whether the voice information to be detected is normal voice information or abnormal voice information is judged. The recognition model may include a speech reconstruction model and a speech anomaly recognition model.
When the platform carries out abnormal voice information recognition through the recognition model, the voice information features are input into a pre-trained voice reconstruction model, so that the reconstruction scores corresponding to the voice information to be detected are determined through the voice reconstruction model. Meanwhile, the platform also inputs the voice information characteristics into a pre-trained voice abnormality recognition model, and determines an abnormality score for the voice information to be detected through the voice abnormality recognition model. And then, the platform detects voice abnormality of the voice information to be detected according to the obtained abnormality score and the reconstruction score.
In practice, when determining the reconstruction score, the platform inputs the voice information features into the pre-trained voice reconstruction model. The model encodes the features with a preset autoencoder and then decodes them to obtain reconstructed voice information. The platform then compares the reconstructed voice information with the voice information to be detected to determine the reconstruction score: the larger the difference between them, the larger the score, and the higher the probability that the voice information to be detected is abnormal.
The difference between the reconstructed voice information and the voice information to be detected can be represented by the distance between the characterization vector corresponding to the reconstructed voice information feature and the characterization vector corresponding to the voice information feature. The smaller the distance between the two, the smaller the difference between the reconstructed speech information and the speech information to be detected. In practical application, the distance between the reconstructed voice information features corresponding to the reconstructed voice information and the voice information features corresponding to the voice information to be detected can be directly used as the reconstructed score corresponding to the voice information to be detected.
In practical application, the difference between the obtained reconstructed voice information and the normal voice information before encoding and decoding is smaller after the normal voice information is encoded and decoded by the voice reconstruction model, however, the difference between the obtained reconstructed voice information and the abnormal voice information before encoding and decoding is larger after the abnormal voice information is encoded and decoded by the voice reconstruction model. Therefore, in the present specification, the voice abnormality detection can be performed on the voice information to be detected by the voice reconstruction model.
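The scoring step described above can be sketched as follows, with `encode` and `decode` standing in for the trained autoencoder (a hypothetical API, since the patent does not give one) and Euclidean distance as an illustrative choice of metric:

```python
import numpy as np

def reconstruction_score(features, encode, decode):
    """Score a feature vector by the distance to its encode->decode
    reconstruction; larger scores suggest more abnormal input."""
    reconstructed = decode(encode(features))
    return float(np.linalg.norm(features - reconstructed))
```

For example, with a toy encoder that keeps only the first two components and a decoder that zero-pads them back, `reconstruction_score(np.array([1.0, 2.0, 3.0, 4.0]), ...)` measures exactly the information lost in the bottleneck.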
When the platform determines the abnormal score, the characteristics of the voice information to be detected are input into a pre-trained voice abnormal recognition model, and the voice information to be detected is classified by the voice abnormal recognition model so as to determine the abnormal score aiming at the voice information to be detected. The higher the probability that the voice information to be detected is abnormal voice information, the higher the abnormal score of the voice information to be detected.
Further, in this specification, a confidence corresponding to the abnormal score of the voice information to be detected is set, so as to reduce the influence of recognition errors of the voice abnormality recognition model on the recognition result. When the confidence corresponding to the abnormal score is high, the abnormal score is a reliable reference and can reflect the abnormal condition of the voice information to be detected. Conversely, when the confidence is low, the classification result of the voice abnormality recognition model is quite possibly wrong regardless of whether the abnormal score is high or low, and the abnormal score has little reference value. In this case, a penalty weight corresponding to the confidence may be determined according to the magnitude of the confidence, where the lower the confidence, the higher the penalty weight.
And finally, the platform determines the compensated abnormal score according to the confidence level, the punishment weight and the abnormal score, and performs voice abnormality detection on the voice information to be detected according to the compensated abnormal score and the reconstruction score.
In this specification, the total anomaly score corresponding to the voice information to be detected may be expressed in a form such as:

S_i = η·ŝ_i + R_i;

wherein S_i is the total abnormal score of the voice information i to be detected;

ŝ_i is the compensated abnormal score of the voice information i to be detected, determined from the abnormal score s_i, its confidence c_i, and the penalty weight e;

s_i is the abnormal score of the voice information i to be detected;

c_i is the confidence corresponding to the abnormal score of the voice information i to be detected;

η is the weight coefficient corresponding to the abnormal score term of the voice information i to be detected;

e is the penalty weight set according to the magnitude of the confidence, wherein the smaller the confidence, the larger the value of e, the larger the compensated abnormal score, and the higher the total anomaly score;

R_i is the reconstruction score corresponding to the voice information i to be detected.
The higher the total abnormal score corresponding to the voice information to be detected, the more likely the voice information to be detected is abnormal voice information. In the present specification, when the total abnormal score corresponding to the voice information to be detected is determined to be greater than a set abnormal score threshold, the voice information to be detected may be determined to be abnormal voice information.
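The scoring and thresholding described above can be sketched as follows. The original formula image is not reproduced in this text, so the penalty form `e = 1/confidence` below is only one illustrative choice consistent with "the lower the confidence, the higher the penalty weight"; the function names are likewise hypothetical:

```python
def total_anomaly_score(anomaly_score, confidence, reconstruction_score, eta=1.0):
    """Combine the classifier's anomaly score with the reconstruction score.
    The penalty weight e grows as confidence shrinks (e = 1/confidence here,
    one possible choice), so a low-confidence classification raises the total
    score and the sample is treated more cautiously."""
    e = 1.0 / max(confidence, 1e-6)       # penalty weight (assumed form)
    compensated = e * anomaly_score       # compensated anomaly score
    return eta * compensated + reconstruction_score

def is_abnormal(total_score, threshold):
    """Flag the sample when the total score exceeds the set threshold."""
    return total_score > threshold
```

With full confidence the compensated score equals the raw anomaly score; halving the confidence doubles its contribution.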
Step S108, if the voice information to be detected is determined to be abnormal voice information, and the voice information re-acquired from the voice source corresponding to the voice information to be detected is still the abnormal voice information under the condition of starting the standby service equipment, determining that the voice source has network voice attack behaviors, and processing the voice source through a preset abnormal processing strategy.
In specific implementation, after the platform determines that the voice information to be detected is abnormal voice information, it starts the standby service equipment, which is used for executing the voice interaction service. Then, with the standby service equipment running, the platform re-acquires voice information from the voice source corresponding to the voice information to be detected and performs abnormality detection on it; if this voice information is still abnormal, the voice source corresponding to the abnormal voice information is processed through the preset abnormality handling strategy. In this way, starting the standby service equipment increases the platform's service capacity, avoiding cases where normal voice information is recognized as abnormal due to limited service capacity, and thereby improves the accuracy of voice abnormality detection and the service execution efficiency of the intelligent voice response service.
This is because, in an actual service scenario, when the platform's load is too heavy, the platform may fail to respond to a client's voice in time, so that the intelligent voice response service for that client continuously occupies an intelligent voice response window; at that point, the platform is very likely to misjudge the client's voice information as abnormal. Accordingly, after the platform recognizes abnormal voice information, it can re-detect the recognized abnormal voice information in the above manner, eliminating normal voice information misjudged as abnormal due to the platform's own overload, and retaining the abnormal voice information that actually corresponds to attack behavior, so that the platform can subsequently apply abnormality processing to it.
In the present disclosure, when the platform processes a voice source corresponding to abnormal voice information, the Internet Protocol (IP) address of that voice source may be added to a blacklist to block further attacks from the lawbreaker corresponding to the abnormal voice information, so as to maintain the stability of the system.
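A minimal sketch of this exception-handling strategy (the class and method names are hypothetical, and a plain in-memory set stands in for whatever blacklist store the platform actually uses):

```python
class AnomalySourceHandler:
    """Sketch of the preset exception-handling strategy: once a voice
    source is confirmed abnormal even with the standby device running,
    its IP address is added to a blacklist and later requests refused."""

    def __init__(self):
        self.blacklist = set()

    def handle_abnormal_source(self, ip_address):
        # Add the source IP to the blacklist.
        self.blacklist.add(ip_address)

    def is_blocked(self, ip_address):
        # Later requests from this source can be checked against the list.
        return ip_address in self.blacklist
```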
Through the steps, the platform can detect voice abnormality of voice information corresponding to the intelligent voice response service, recognize attack behaviors of illegal persons, and adopt a preset abnormality processing strategy aiming at the attack behaviors of the illegal persons to perform abnormality processing on voice sources corresponding to the abnormal voice information so as to achieve the purposes of preventing the attack behaviors of the illegal persons and maintaining the system stable.
For the above voice abnormality detection scheme, this specification also provides the training process of the voice reconstruction model and the training process of the voice abnormality recognition model adopted when implementing the voice abnormality detection method.
In specific implementation, when performing model training on the speech reconstruction model, the platform first acquires first sample speech information and inputs it into the speech reconstruction model. The speech reconstruction model encodes the first sample speech information with a preset automatic encoder and then decodes it to obtain reconstructed speech information, and the speech reconstruction model is trained according to the deviation between the reconstructed speech information and the first sample speech information.
The coding function implemented by the automatic encoder can be expressed as:

z_i = f(x_i);

wherein x_i denotes the i-th speech information to be detected and f(·) denotes the encoder.

Then, the reconstruction loss can be expressed as:

L_rec = Σ_i L(x_i, g(f(x_i)));

wherein g(·) denotes the decoder corresponding to the encoder f(·), and L(·,·) denotes a distance function between the i-th speech information to be detected and its reconstruction.
The automatic encoder may include: a stacked autoencoder (Stacked Autoencoder, SAE), a variational autoencoder (Variational Autoencoder, VAE), a denoising autoencoder (Denoising Autoencoder), a Ladder Network, and the like.
In addition, when training the voice anomaly recognition model, the platform first acquires second sample voice information, then inputs it into the voice anomaly recognition model to obtain an anomaly score corresponding to the second sample voice information, and finally trains the voice anomaly recognition model by taking minimizing the deviation between the anomaly score corresponding to the second sample voice information and the labeling information corresponding to the second sample voice information as the optimization target.
The first sample voice information used in training the voice reconstruction model and the second sample voice information used in training the voice anomaly recognition model may be the same or different.
In this specification, because the cost of acquiring tag information for sample voice information is high, a large amount of the sample voice information acquired by the platform carries no tag information. Since it is difficult to obtain enough sample voice information with tag information, the platform cannot train the model through fully supervised learning on labeled data. In view of this, this specification provides a model training method matching the above voice abnormality detection scheme.
In specific implementation, the platform needs to preprocess the acquired sample voice data before performing model training. For example, after obtaining the sample voice data, the platform cleans the raw data, removing repeated data and invalid data (for example, sample voice data with excessive missing values). The platform can then perform model training on the preprocessed sample voice information.
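The preprocessing step can be sketched as follows; the duplicate key and the missing-value threshold are illustrative assumptions, not part of the specification:

```python
def preprocess_samples(raw_samples, max_missing_ratio=0.5):
    """Clean raw sample voice records before training: drop exact
    duplicates and drop records with too many missing fields (the 0.5
    threshold is an illustrative choice)."""
    seen, cleaned = set(), []
    for sample in raw_samples:
        key = tuple(sample)
        if key in seen:
            continue  # remove repeated data
        missing = sum(1 for v in sample if v is None)
        if missing / len(sample) > max_missing_ratio:
            continue  # remove invalid data with excessive missing values
        seen.add(key)
        cleaned.append(sample)
    return cleaned
```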
When training the model, the platform first performs a first round of training on the model with minimizing the loss function as the optimization target, using the sample voice information with label information, to obtain a trained recognition model (comprising the voice reconstruction model and the voice anomaly recognition model). The platform then selects part of the sample voice information without label information, inputs it into the trained recognition model to obtain classification results, and takes those classification results as the label information for that part of the sample voice information.

Next, the platform re-inputs all sample voice information that now carries label information into the trained recognition model and performs a second round of training with minimizing the loss function as the optimization target, obtaining a twice-trained recognition model. It again selects part of the remaining unlabeled sample voice information, inputs it into the twice-trained recognition model to obtain classification results, and takes those results as the label information for that part of the sample voice information.

Then, the platform once more re-inputs all sample voice information with label information into the model and performs a third round of training with minimizing the loss function as the optimization target, obtaining a thrice-trained recognition model; it again selects part of the remaining unlabeled sample voice information, inputs it into the thrice-trained recognition model to obtain classification results, and takes those results as the label information for that part of the sample voice information.
These steps are repeated in a loop until it is determined that the preset training conditions are met, at which point model training is complete.
In the present specification, the preset training condition may have various forms, for example, when the turn of the model training reaches the set turn, it may be determined that the preset training condition is satisfied; for another example, the model is validated using a validation sample after each round of training, and after determining that the validation passes, it is determined that a preset training condition is satisfied, and so on. Other ways are not illustrated in detail herein.
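The iterative pseudo-labeling procedure above can be sketched as follows; `model_fit` and `model_predict` are placeholders for the actual training and inference routines, and a fixed round count stands in for the preset training condition:

```python
def self_training(model_fit, model_predict, labeled, unlabeled,
                  rounds=3, batch=10):
    """Iterative pseudo-labeling sketch: train on all currently-labeled
    samples, label a batch of unlabeled samples with the trained model,
    fold them back into the labeled set, and repeat until the preset
    round count is reached (or no unlabeled samples remain)."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):                 # preset condition: round count
        model = model_fit(labeled)          # train, minimizing the loss
        take, unlabeled = unlabeled[:batch], unlabeled[batch:]
        labeled += [(x, model_predict(model, x)) for x in take]
        if not unlabeled:
            break
    return labeled
```

In practice, each round would also re-validate the model before its pseudo-labels are trusted, as the validation-based stopping condition in the text suggests.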
In an actual service scenario, the amount of abnormal voice information is far smaller than that of normal voice information, so when the platform trains the model, the number of abnormal samples in the obtained training set is far smaller than the number of normal samples. In this specification, to avoid the trained model overfitting to normal voice information and thus suffering poor recognition accuracy, a focal loss function is introduced in the model training process, which may be expressed as:
PL(P_t) = -(1 - P_t)^γ · log(P_t);

wherein P_t represents the probability that the sample voice information t is abnormal voice information;

PL(P_t) represents the focal loss corresponding to the sample voice information t;

γ is the hyperparameter of the focal loss function.
Thus, when the number of abnormal voice samples is far smaller than the number of normal ones, i.e., when the distribution of positive and negative samples is extremely unbalanced, the influence of the majority class (normal voice information) on model training can be reduced.
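The focal loss above can be written directly; γ is the hyperparameter, and at γ = 0 the expression reduces to ordinary cross-entropy:

```python
import math

def focal_loss(p_t, gamma=2.0):
    """Focal loss PL(p_t) = -(1 - p_t)^gamma * log(p_t): easy,
    well-classified samples (p_t near 1) are down-weighted so the
    abundant normal speech does not dominate training."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

A confidently correct prediction contributes almost nothing, while a hard sample keeps a large gradient signal.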
Further, referring to fig. 2, in this specification, the speech reconstruction model and the speech anomaly recognition model may be jointly trained.
During model training, the platform extracts the voice information feature Y_i from the sample voice information X_i and inputs Y_i into the voice reconstruction model. The voice reconstruction model encodes and then decodes Y_i to obtain the reconstructed voice information X'_i, and the reconstruction loss corresponding to the sample voice information X_i is determined according to the deviation between X'_i and X_i.
At the same time, the platform inputs the voice information feature Y_i of the sample voice information X_i into the voice anomaly recognition model to obtain the classification result corresponding to X_i and the confidence corresponding to that classification result. The platform then determines the classification loss corresponding to X_i according to the deviation between the classification result and the label information of X_i, and determines the confidence loss corresponding to X_i according to the deviation between the predicted confidence and the label information for that confidence. From the classification loss and the confidence loss corresponding to X_i, the platform further determines the compensated classification loss corresponding to X_i.
Then, according to the reconstruction loss corresponding to the sample voice information X_i and the compensated classification loss corresponding to X_i, the platform constructs the loss function for model training, which can be expressed as:

L = μ·L_rec + L_cls;

wherein L_rec represents the reconstruction loss corresponding to the sample voice information, L_cls represents the compensated classification loss corresponding to the sample voice information, and μ is a hyperparameter adjusting the relative weights of the reconstruction loss and the classification loss: the larger μ is, the greater the influence of the reconstruction loss on the model.
Finally, the platform carries out joint training on the voice reconstruction model and the voice abnormality recognition model by minimizing the loss function L to obtain a trained voice reconstruction model and a trained voice abnormality recognition model.
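The joint objective can be sketched in one line; μ is the hyperparameter trading off the two losses:

```python
def joint_loss(rec_loss, cls_loss, mu=1.0):
    """Joint training objective L = mu * L_rec + L_cls: mu trades off
    the reconstruction loss of the speech reconstruction model against
    the compensated classification loss of the anomaly recognition
    model. Both models are updated by minimizing this single value."""
    return mu * rec_loss + cls_loss
```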
Wherein, if the sample voice information X_i is sample voice information with label information among the training samples acquired by the platform, the confidence of its classification result may be set to 1. If X_i is sample voice information without label information among the training samples, its label information is the classification result determined for X_i by the trained recognition model, and the confidence of that classification result may be given by means of manual checking.
It should be noted that, the scheme of voice anomaly detection provided in the present specification may be applied to a variety of intelligent voice response services, such as voice prompting, intelligent customer service, information searching, etc., and the present specification does not limit specific usage scenarios.
Based on the same concept as the method for voice anomaly detection provided above for one or more embodiments of the present specification, the present specification further provides a corresponding device for voice anomaly detection, as shown in fig. 3.
Fig. 3 is a schematic diagram of a device for detecting voice anomalies provided in the present specification, including:
the acquisition module 300 is used for acquiring the voice information to be detected;
a determining module 301, configured to determine a voice information feature corresponding to the voice information to be detected, where the voice information feature includes: at least one of voiceprint characteristics, data transmission characteristics and voice conversation request characteristics, wherein the data transmission characteristics are used for representing characteristics of the voice information to be detected on data transmission quantity, and the voice conversation request characteristics are used for representing conversation window characteristics of a voice conversation request corresponding to the voice information to be detected;
the input module 302 is configured to input the voice information feature into a pre-trained recognition model, so as to obtain a recognition result for the voice information to be detected;
the detection module 303 is configured to detect a voice abnormality of the voice information to be detected according to the recognition result;
the exception handling module 304 is configured to determine that a network voice attack behavior exists in the voice source if the voice information to be detected is determined to be the exception voice information, and if the voice information re-acquired from the voice source corresponding to the voice information to be detected is still the exception voice information under the condition that the standby service equipment is started, and process the voice source through a preset exception handling policy.
Optionally, the voiceprint feature includes: at least one of a speech speed characteristic corresponding to the voice information to be detected and a volume characteristic corresponding to the voice information to be detected;
the data transmission features include: at least one of the data size of the data transmission packet corresponding to the voice information to be detected, the number of bytes per second in the transmission process of the voice information to be detected and the number of the data transmission packet corresponding to the voice information to be detected;
the voice session request feature includes: at least one of a session window size of a voice session request corresponding to the voice information to be detected and a header size corresponding to the session window of the voice information to be detected.
Optionally, the recognition model comprises a speech reconstruction model and a speech anomaly recognition model; the input module 302 is specifically configured to input the voice information feature into a pre-trained voice reconstruction model and a voice anomaly recognition model, so as to determine, through the voice reconstruction model, a reconstruction score corresponding to the voice information to be detected, and determine, through the voice anomaly recognition model, an anomaly score for the voice information to be detected; according to the recognition result, detecting the voice abnormality of the voice information to be detected, specifically including: and detecting the voice abnormality of the voice information to be detected according to the abnormality score and the reconstruction score.
Optionally, the input module 302 is specifically configured to determine a confidence level corresponding to the anomaly score; and carrying out voice abnormality detection on the voice information to be detected according to the confidence level, the abnormality score and the reconstruction score.
Optionally, the input module 302 is specifically configured to determine a penalty weight corresponding to the confidence level according to the confidence level, where if the confidence level is lower, the penalty weight is higher; determining a compensated anomaly score according to the confidence level, the penalty weight and the anomaly score; and carrying out voice abnormality detection on the voice information to be detected according to the compensated abnormal score and the reconstruction score.
Optionally, the apparatus further comprises:
a speech reconstruction model training module 305, configured to obtain first sample speech information; inputting the first sample voice information into the voice reconstruction model to reconstruct the first sample voice information to obtain reconstructed voice information; and training the voice reconstruction model according to the deviation between the reconstructed voice information and the first sample voice information.
Optionally, the apparatus further comprises:
A speech anomaly recognition model training module 306, configured to obtain second sample speech information; input the second sample speech information into the speech anomaly recognition model to obtain an anomaly score corresponding to the second sample speech information; and train the speech anomaly recognition model by taking minimizing the deviation between the anomaly score corresponding to the second sample speech information and the labeling information corresponding to the second sample speech information as the optimization target.
The present specification also provides a computer readable storage medium storing a computer program operable to perform a method of speech anomaly detection as provided in fig. 1 above.
The present specification also provides a schematic structural diagram of an electronic device corresponding to fig. 1 shown in fig. 4. At the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, as described in fig. 4, although other hardware required by other services may be included. The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to implement the method for detecting the voice abnormality described in fig. 1. Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present description, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or logic devices.
In the 1990s, an improvement to a technology could clearly be distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). With the development of technology, however, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit, so it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must be written in a specific programming language called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), of which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using several of the hardware description languages described above.
The controller may be implemented in any suitable manner, for example, the controller may take the form of, for example, a microprocessor or processor and a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, application specific integrated circuits (Application Specific Integrated Circuit, ASIC), programmable logic controllers, and embedded microcontrollers, examples of which include, but are not limited to, the following microcontrollers: ARC 625D, atmel AT91SAM, microchip PIC18F26K20, and Silicone Labs C8051F320, the memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller in a pure computer readable program code, it is well possible to implement the same functionality by logically programming the method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc. Such a controller may thus be regarded as a kind of hardware component, and means for performing various functions included therein may also be regarded as structures within the hardware component. Or even means for achieving the various functions may be regarded as either software modules implementing the methods or structures within hardware components.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
This specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant details, reference may be made to the corresponding parts of the description of the method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.

Claims (8)

1. A method for detecting speech anomalies, comprising:
acquiring voice information to be detected;
determining voice information characteristics corresponding to the voice information to be detected, wherein the voice information characteristics comprise: at least one of voiceprint characteristics, data transmission characteristics and voice conversation request characteristics, wherein the data transmission characteristics are used for representing characteristics of the voice information to be detected on data transmission quantity, and the voice conversation request characteristics are used for representing conversation window characteristics of a voice conversation request corresponding to the voice information to be detected;
inputting the voice information characteristics into a pre-trained recognition model to obtain a recognition result aiming at the voice information to be detected;
according to the recognition result, carrying out voice abnormality detection on the voice information to be detected;
if the voice information to be detected is determined to be abnormal voice information, and the voice information re-acquired from the voice source corresponding to the voice information to be detected is still abnormal voice information with the standby service equipment enabled, determining that the voice source exhibits a network voice attack behavior, and processing the voice source through a preset exception handling strategy;
the voiceprint feature includes: at least one of a speech speed characteristic corresponding to the voice information to be detected and a volume characteristic corresponding to the voice information to be detected;
the data transmission features include: at least one of the data size of the data transmission packet corresponding to the voice information to be detected, the number of bytes per second in the transmission process of the voice information to be detected and the number of the data transmission packet corresponding to the voice information to be detected;
the voice session request feature includes: at least one of a session window size of a voice session request corresponding to the voice information to be detected and a header size corresponding to the session window of the voice information to be detected.
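For illustration only, the three feature groups recited in claim 1 could be assembled into a single feature vector along the following lines; every name and example value here is a hypothetical placeholder, since the claim does not prescribe a concrete representation:

```python
# Hypothetical sketch of the feature vector implied by claim 1.
# All parameter names and example values are illustrative assumptions.

def extract_features(speech_rate, volume,                      # voiceprint features
                     packet_size, bytes_per_second, packet_count,  # data transmission features
                     window_size, header_size):                # voice session request features
    """Concatenate the three feature groups into one vector."""
    voiceprint = [speech_rate, volume]
    transmission = [packet_size, bytes_per_second, packet_count]
    session_request = [window_size, header_size]
    return voiceprint + transmission + session_request

features = extract_features(4.2, 62.0, 1500.0, 8000.0, 120.0, 65535.0, 40.0)
```

Any subset of these features may be used, since the claim requires only at least one feature from at least one group.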
2. The method of claim 1, wherein the recognition model comprises a speech reconstruction model and a speech anomaly recognition model;
inputting the voice information characteristics into a pre-trained recognition model to obtain a recognition result aiming at the voice information to be detected, wherein the recognition result specifically comprises the following steps:
inputting the voice information characteristics into a pre-trained voice reconstruction model and a voice abnormality recognition model to determine a reconstruction score corresponding to the voice information to be detected through the voice reconstruction model and determine an abnormality score aiming at the voice information to be detected through the voice abnormality recognition model;
according to the recognition result, detecting the voice abnormality of the voice information to be detected, specifically including:
and detecting the voice abnormality of the voice information to be detected according to the abnormality score and the reconstruction score.
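One reading of claim 2 is that the detection decision combines the two model outputs. A minimal sketch, assuming simple thresholding; the threshold values and the either-model-fires rule are assumptions, as the claim only states that detection uses both scores:

```python
# Sketch of combining the two scores from claim 2 into a detection decision.
# Thresholds and the combination rule are illustrative assumptions.

def detect_anomaly(anomaly_score, reconstruction_score,
                   anomaly_threshold=0.5, reconstruction_threshold=0.3):
    """Return True when the voice information is judged abnormal."""
    return (anomaly_score > anomaly_threshold
            or reconstruction_score > reconstruction_threshold)

is_abnormal = detect_anomaly(anomaly_score=0.8, reconstruction_score=0.1)
```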
3. The method of claim 2, wherein performing voice anomaly detection on the voice information to be detected according to the anomaly score and the reconstruction score specifically comprises:
determining the confidence corresponding to the abnormal score;
and carrying out voice abnormality detection on the voice information to be detected according to the confidence level, the abnormality score and the reconstruction score.
4. The method of claim 3, wherein performing voice anomaly detection on the voice information to be detected according to the confidence level, the anomaly score and the reconstruction score specifically comprises:
determining a penalty weight corresponding to the confidence according to the magnitude of the confidence, wherein the lower the confidence, the higher the penalty weight;
determining a compensated anomaly score according to the confidence level, the penalty weight and the anomaly score;
and carrying out voice abnormality detection on the voice information to be detected according to the compensated abnormal score and the reconstruction score.
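Claims 3 and 4 together describe a confidence compensation step. The sketch below takes only the monotonic relation from the claim (lower confidence yields a higher penalty weight); the concrete linear formulas are assumptions:

```python
# Hypothetical compensation of the anomaly score per claims 3-4.
# Only the monotonicity (lower confidence -> higher penalty weight) comes
# from the claim; the formulas below are illustrative assumptions.

def penalty_weight(confidence):
    """Monotonically decreasing in confidence, within [0, 1]."""
    return 1.0 - confidence

def compensated_score(anomaly_score, confidence):
    w = penalty_weight(confidence)
    # Pull a low-confidence score toward the uncertain midpoint 0.5.
    return confidence * anomaly_score + w * 0.5

score = compensated_score(0.9, confidence=1.0)  # fully trusted: score unchanged
```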
5. The method of claim 2, wherein training the speech reconstruction model comprises:
acquiring first sample voice information;
inputting the first sample voice information into the voice reconstruction model to reconstruct the first sample voice information to obtain reconstructed voice information;
and training the voice reconstruction model according to the deviation between the reconstructed voice information and the first sample voice information.
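Claim 5 is the standard autoencoder recipe: reconstruct the first sample voice information and train on the deviation between input and reconstruction. A minimal numpy sketch; the linear encoder/decoder architecture and all hyperparameters are assumptions:

```python
# Minimal sketch of the training loop in claim 5: reconstruct the input and
# descend on the input/reconstruction deviation. Architecture and
# hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))                  # first sample voice features
W_enc = rng.normal(scale=0.1, size=(8, 4))    # encoder weights
W_dec = rng.normal(scale=0.1, size=(4, 8))    # decoder weights

loss_before = float(np.mean((X @ W_enc @ W_dec - X) ** 2))
lr = 0.01
for _ in range(200):
    Z = X @ W_enc                                  # encode
    err = Z @ W_dec - X                            # reconstruction deviation
    W_dec -= lr * Z.T @ err / len(X)               # gradient step on the decoder
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)   # gradient step on the encoder
loss_after = float(np.mean((X @ W_enc @ W_dec - X) ** 2))
```

At detection time, a high reconstruction error on unseen input can then serve as the reconstruction score of claim 2.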
6. The method of claim 2, wherein training the speech anomaly recognition model comprises:
acquiring second sample voice information;
inputting the second sample voice information into the voice abnormality recognition model to obtain an abnormality score corresponding to the second sample voice information;
and training the voice abnormality recognition model with minimizing the deviation between the abnormality score corresponding to the second sample voice information and the labeling information corresponding to the second sample voice information as the optimization target.
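Claim 6 describes supervised training that drives the predicted anomaly score toward each sample's label. A logistic unit stands in for the unspecified recognition model in this sketch; all names and hyperparameters are assumptions:

```python
# Sketch of claim 6: minimize the deviation between the predicted anomaly
# score and the labeling information. A logistic unit stands in for the
# unspecified model; hyperparameters are illustrative assumptions.
import math

def predict(w, b, x):
    """Anomaly score in (0, 1) for one feature vector."""
    return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

def train(samples, labels, epochs=500, lr=0.1):
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = predict(w, b, x) - y  # deviation from the labeling information
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

weights, bias = train([[-1.0], [1.0], [-0.5], [0.7]], [0, 1, 0, 1])
```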
7. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-6.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-6 when executing the program.
CN202110750572.1A 2021-07-02 2021-07-02 Voice abnormality detection method and device Active CN113611329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110750572.1A CN113611329B (en) 2021-07-02 2021-07-02 Voice abnormality detection method and device


Publications (2)

Publication Number Publication Date
CN113611329A CN113611329A (en) 2021-11-05
CN113611329B true CN113611329B (en) 2023-10-24

Family

ID=78303933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110750572.1A Active CN113611329B (en) 2021-07-02 2021-07-02 Voice abnormality detection method and device

Country Status (1)

Country Link
CN (1) CN113611329B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114338089B (en) * 2021-12-06 2024-02-13 科大讯飞股份有限公司 Anti-attack method, device, equipment and computer readable storage medium
CN114937455B (en) * 2022-07-21 2022-10-11 中国科学院自动化研究所 Voice detection method and device, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101231035B1 (en) * 2011-09-06 2013-02-07 건국대학교 산학협력단 A system of invite flooding attack detection and defense using sip in voip service and the mehtod thereof
WO2016180222A1 (en) * 2015-07-24 2016-11-17 中兴通讯股份有限公司 Abnormal call determination method and device
CN110767216A (en) * 2019-09-10 2020-02-07 浙江工业大学 Voice recognition attack defense method based on PSO algorithm
CN111275858A (en) * 2020-01-22 2020-06-12 广东快车科技股份有限公司 Credit granting method and system for voiceprint recognition
CN111402903A (en) * 2018-12-28 2020-07-10 英特尔公司 Ultrasonic attack detection with deep learning
CN111666502A (en) * 2020-07-08 2020-09-15 腾讯科技(深圳)有限公司 Abnormal user identification method and device based on deep learning and storage medium
CN112201255A (en) * 2020-09-30 2021-01-08 浙江大学 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method
CN112860968A (en) * 2021-02-02 2021-05-28 北京三快在线科技有限公司 Abnormity detection method and device


Also Published As

Publication number Publication date
CN113611329A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
CN113611329B (en) Voice abnormality detection method and device
KR101826865B1 (en) Methods and systems of using boosted decision stumps and joint feature selection and culling algorithms for the efficient classification of mobile device behaviors
CN107437416A (en) A kind of consultation service processing method and processing device based on speech recognition
WO2023246393A1 (en) Intent recognition model training and user intent recognition
CN112632961A (en) Natural language understanding processing method, device and equipment based on context reasoning
CN112735407B (en) Dialogue processing method and device
CN108986825A (en) Context acquisition methods and equipment based on interactive voice
WO2023202496A1 (en) Data processing method, apparatus and system, and device
WO2020164331A1 (en) Claim service processing method and device
CN117076650B (en) Intelligent dialogue method, device, medium and equipment based on large language model
CN104361311A (en) Multi-modal online incremental access recognition system and recognition method thereof
CN113315874A (en) System and method for call classification
WO2018231472A1 (en) Identifying relationships from communication content
CN115545002A (en) Method, device, storage medium and equipment for model training and business processing
CN110659361B (en) Conversation method, device, equipment and medium
CN112908315B (en) Question and answer intention judging method based on sound characteristics and voice recognition
US20200099982A1 (en) Filter and Prevent Sharing of Videos
CN113190155A (en) Information processing method, device and storage medium
CN115545572B (en) Method, device, equipment and storage medium for business wind control
WO2019233144A1 (en) Method for decoding graphic code and client
CN112738344B (en) Method and device for identifying user identity, storage medium and electronic equipment
CN115171735A (en) Voice activity detection method, storage medium and electronic equipment
CN111324778B (en) Data and service processing method and device and electronic equipment
CN112750448B (en) Sound scene recognition method, device, equipment and storage medium
US20190141185A1 (en) Microphone monitoring and analytics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant