CN113611329A - Method and device for detecting abnormal voice

Method and device for detecting abnormal voice

Info

Publication number
CN113611329A
CN113611329A (application CN202110750572.1A; granted publication CN113611329B)
Authority
CN
China
Prior art keywords
voice
voice information
detected
information
speech
Prior art date
Legal status
Granted
Application number
CN202110750572.1A
Other languages
Chinese (zh)
Other versions
CN113611329B (en)
Inventor
王喜
Current Assignee
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN202110750572.1A
Publication of CN113611329A
Application granted
Publication of CN113611329B
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Abstract

The specification discloses a method and a device for detecting abnormal voice. Voice information to be detected is obtained and its corresponding voice information features are determined; the features are then input into a pre-trained recognition model to obtain a recognition result for the voice information to be detected, and voice anomaly detection is performed on that basis. If the voice information to be detected is determined to be abnormal voice information, and the voice information re-acquired from the corresponding voice source is still abnormal after a standby service device has been started, it is determined that a network voice attack behavior exists, and the voice source is handled through a preset exception handling strategy. By performing anomaly detection on intelligent voice response window conversations in this way, the attack behavior of lawbreakers can be identified and the voice source handled with the preset exception handling strategy, thereby preventing such attacks and maintaining the stability of the system.

Description

Method and device for detecting abnormal voice
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for detecting a speech anomaly.
Background
At present, intelligent voice response technology has developed rapidly and is applied in fields such as intelligent customer service and information search, effectively improving the execution efficiency of services in these fields and bringing great convenience to users' daily lives.
In practical applications, lawbreakers may attack the platform providing the intelligent voice response, for example by continuously occupying the platform's intelligent voice response windows, so that other users who need the voice response service cannot execute it smoothly. To deal with the adverse effects of this situation, the prior art generally relies on capacity expansion, that is, increasing the number of devices providing the voice response service. However, this approach not only greatly increases the operating cost of the platform but also does nothing to reduce the lawbreakers' attacks on it.
Therefore, how to effectively prevent attacks launched by lawbreakers against the voice response service provided by the platform is an urgent problem to be solved.
Disclosure of Invention
The present specification provides a method and an apparatus for detecting speech anomaly, so as to partially solve the above problems in the prior art.
The technical scheme adopted by the specification is as follows:
the present specification provides a method of speech anomaly detection, comprising:
acquiring voice information to be detected;
determining voice information features corresponding to the voice information to be detected, wherein the voice information features comprise: at least one of a voiceprint feature, a data transmission feature and a voice session request feature, wherein the data transmission feature is used for characterizing the voice information to be detected in terms of data transmission quantity, and the voice session request feature is used for characterizing the session window of the voice session request corresponding to the voice information to be detected;
inputting the voice information characteristics into a pre-trained recognition model to obtain a recognition result aiming at the voice information to be detected;
performing voice anomaly detection on the voice information to be detected according to the recognition result;
if the voice information to be detected is determined to be abnormal voice information, and the voice information re-acquired from the voice source corresponding to the voice information to be detected is still abnormal voice information under the condition that the standby service equipment is started, determining that the voice source has a network voice attack behavior, and processing the voice source through a preset abnormal processing strategy.
Optionally, the voiceprint features comprise: at least one of a speech rate characteristic corresponding to the voice information to be detected and a volume characteristic corresponding to the voice information to be detected;
the data transmission features include: at least one of the data size of the data transmission packet corresponding to the voice information to be detected, the number of bytes per second of the voice information to be detected in the transmission process, and the number of the data transmission packet corresponding to the voice information to be detected;
the voice session request feature includes: at least one of the size of the conversation window of the voice conversation request corresponding to the voice information to be detected and the size of the header corresponding to the conversation window of the voice information to be detected.
Optionally, the recognition model includes a speech reconstruction model and a speech anomaly recognition model;
inputting the voice information features into a pre-trained recognition model to obtain a recognition result for the voice information to be detected, and specifically comprising:
inputting the voice information characteristics into a pre-trained voice reconstruction model and a voice abnormity recognition model, so as to determine a reconstruction score corresponding to the voice information to be detected through the voice reconstruction model and determine an abnormity score aiming at the voice information to be detected through the voice abnormity recognition model;
according to the recognition result, carrying out voice abnormity detection on the voice information to be detected, which specifically comprises the following steps:
and performing voice anomaly detection on the voice information to be detected according to the anomaly score and the reconstruction score.
Optionally, performing voice anomaly detection on the voice information to be detected according to the anomaly score and the reconstruction score, specifically including:
determining a confidence degree corresponding to the abnormal score;
and performing voice anomaly detection on the voice information to be detected according to the confidence coefficient, the anomaly score and the reconstruction score.
Optionally, performing voice anomaly detection on the voice information to be detected according to the confidence level, the anomaly score and the reconstruction score, and specifically includes:
determining a penalty weight corresponding to the confidence coefficient according to the size of the confidence coefficient, wherein if the confidence coefficient is lower, the penalty weight is higher;
determining a compensated abnormal score according to the confidence coefficient, the penalty weight and the abnormal score;
and performing voice anomaly detection on the voice information to be detected according to the compensated anomaly score and the reconstructed score.
Optionally, training the speech reconstruction model specifically includes:
acquiring first sample voice information;
inputting the first sample voice information into the voice reconstruction model to reconstruct the first sample voice information to obtain reconstructed voice information;
and training the voice reconstruction model according to the deviation between the reconstructed voice information and the first sample voice information.
Optionally, training the speech anomaly recognition model specifically includes:
acquiring second sample voice information;
inputting the second sample voice information into the voice abnormity recognition model to obtain an abnormity score corresponding to the second sample voice information;
and training the voice anomaly recognition model with minimization of the deviation between the anomaly score corresponding to the second sample voice information and the label information corresponding to the second sample voice information as the optimization target.
This specification provides a device for voice anomaly detection, comprising:
the acquisition module is used for acquiring the voice information to be detected;
a determining module, configured to determine a voice information feature corresponding to the voice information to be detected, where the voice information feature includes: at least one of a voiceprint feature, a data transmission feature and a voice session request feature, where the data transmission feature is used to characterize the feature of the voice information to be detected in terms of data transmission quantity, and the voice session request feature is used to characterize the session window feature of the voice session request corresponding to the voice information to be detected;
the input module is used for inputting the voice information characteristics into a pre-trained recognition model to obtain a recognition result aiming at the voice information to be detected;
the detection module is used for carrying out voice abnormity detection on the voice information to be detected according to the recognition result;
and the exception handling module is used for determining that the voice source has a network voice attack behavior if the voice information to be detected is determined to be the abnormal voice information and the voice information re-acquired from the voice source corresponding to the voice information to be detected is still the abnormal voice information under the condition that the standby service equipment is started, and handling the voice source through a preset exception handling strategy.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described method of voice anomaly detection.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the above-mentioned method for detecting a speech anomaly.
The technical scheme adopted by the specification can achieve the following beneficial effects:
in the method for detecting a speech anomaly provided in this specification, speech information to be detected is first obtained, and then the speech information features corresponding to it are determined, where the speech information features include: at least one of a voiceprint feature, a data transmission feature and a voice session request feature. The speech information features are then input into a pre-trained recognition model to obtain a recognition result for the speech information to be detected, and voice anomaly detection is performed on the speech information to be detected according to the recognition result. Then, if the speech information to be detected is determined to be abnormal speech information, and the speech information re-acquired from the corresponding voice source is still abnormal speech information with the standby service device started, it is determined that a network voice attack behavior exists at that voice source, and the voice source is handled through a preset exception handling strategy.
With this method, voice anomaly detection can be performed on the voice information corresponding to the intelligent voice response service, the attack behavior of lawbreakers can be identified, and the voice source corresponding to the abnormal voice information can be handled with the preset exception handling strategy, thereby preventing the lawbreakers' attacks and maintaining the stability of the system.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification and are incorporated in and constitute a part of this specification, illustrate embodiments of the specification and together with the description serve to explain the specification without limiting it. In the drawings:
FIG. 1 is a schematic flow chart of a method for detecting speech anomalies in accordance with the present disclosure;
FIG. 2 is a schematic diagram of a recognition model training process provided herein;
FIG. 3 is a schematic diagram of a speech anomaly detection apparatus provided herein;
fig. 4 is a schematic diagram of an electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure clearer, the technical solutions of the present disclosure will be clearly and completely described below with reference to specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the described embodiments are only some of the embodiments of the present disclosure, and not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this specification without creative effort fall within the protection scope of this specification.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a method for detecting a speech anomaly in this specification, which includes the following steps:
and S100, acquiring the voice information to be detected.
In order to solve the problem that a platform providing intelligent voice response is attacked by lawbreakers who continuously occupy the intelligent voice response windows, preventing other users from smoothly executing the voice response service, this specification provides a method for detecting voice anomalies. The method performs voice anomaly detection on the ongoing voice conversation in an intelligent voice response window and takes exception handling measures when the voice information is determined to be abnormal voice information, so as to reduce the impact of lawbreakers' attacks on the intelligent voice response platform, maintain system stability, and improve the execution efficiency of the intelligent voice response service.
The execution body of the method for detecting voice anomalies provided in this specification may be a terminal device such as a desktop computer, or a platform or server providing the intelligent voice response service. For convenience of description, the following takes the platform providing the intelligent voice response service as the execution body.
In specific implementation, the platform can perform voice anomaly detection for each intelligent voice response service, so the voice information to be detected may be the voice information involved in each intelligent voice response service. In practical applications, most intelligent voice response services are normal and only a few are attacks initiated by lawbreakers. Therefore, in a specific implementation of the voice anomaly detection scheme in this specification, each intelligent voice response service can be monitored, and voice anomaly detection is initiated once the monitored duration of the service reaches a set duration.
Step S102, determining voice information characteristics corresponding to the voice information to be detected, wherein the voice information characteristics comprise: the voice conversation processing method comprises at least one of a voiceprint feature, a data transmission feature and a voice conversation request feature, wherein the data transmission feature is used for representing the feature of the voice information to be detected on data transmission quantity, and the voice conversation request feature is used for representing the conversation window feature of a voice conversation request corresponding to the voice information to be detected.
In specific implementation, after the platform acquires the voice information to be detected, it extracts the voice information features from it. In practical applications, an ordinary intelligent voice response is a voice conversation established between the platform and a terminal owned by the client (such as a mobile phone, tablet computer or notebook computer); the user sends voice information to the platform through the terminal, and this voice information is usually spoken by a person. When lawbreakers attack a platform providing intelligent voice response, the attack is generally carried out by a program, that is, the voice information sent from the lawbreaker's terminal is robot-generated voice information. Therefore, there should be a large difference between the voice information features of the voice information corresponding to these attacks and those of human voice information.
In an actual service scenario, for the platform's intelligent voice response with a client, the client needs to hold a voice conversation with the platform, communicate the service content, and give corresponding voice responses according to the platform's replies. The voice information involved in each round of voice interaction in the conversation differs considerably in length, the speech rate and volume also change with the content of the conversation, and the client ends the conversation and releases the intelligent voice response window as soon as the service related to the conversation content is completed.
However, the purpose of an attack is not conversational communication, so the speech rate and volume of the attack's voice information do not change with the content of the conversation; in each round of voice interaction, the differences in the length of the voice information are relatively smaller than for voice information sent by a real client; and when the voice conversation corresponding to the attack occupies the intelligent voice response window for a long time, the platform has to allocate more resources to that conversation to meet its demands.
Thus, in this specification the abnormal voice information can be recognized from at least three aspects: the voiceprint feature, the data transmission feature and the voice session request feature.
The voiceprint feature is used to characterize the spectral characteristics of the voice information to be detected, and may include: the speech rate features corresponding to the voice information to be detected (e.g., the speech rate and its variation range) and the volume features corresponding to the voice information to be detected (e.g., the volume and its maximum value).
In practical applications, the speech rate and volume of a client's voice information change with the conversation content, whereas the speech rate of robot voice information is relatively uniform and its volume relatively stable. Therefore, the larger the variation range of the speech rate corresponding to the voice information to be detected, the more likely it is normal voice information; otherwise, the more likely it is abnormal voice information. Likewise, the larger the maximum volume corresponding to the voice information to be detected, the more likely it is normal voice information; otherwise, the more likely it is abnormal voice information.
The data transmission features are used to characterize the voice information to be detected in terms of data transmission quantity, and may include: the data size of the data transmission packets corresponding to the voice information to be detected (e.g., the average packet size (Average Packet Size), the standard deviation of packet length in the forward flow (Fwd Packet Length Std)), the number of bytes per second of the voice information to be detected during transmission (e.g., bytes per second of the flow (Flow Bytes/s)), and the number of data transmission packets corresponding to the voice information to be detected (e.g., packets per second of the flow (Flow Packets/s), the total number of packets in the backward flow (Total Bwd Packets)).
In practical applications, a client's voice information changes with the conversation content, the content of each round of conversation varies in length, and once the communication of the related service content is completed the voice conversation ends immediately and the intelligent voice response window is released. In contrast, the data size corresponding to robot voice information is likely to be large, the standard deviation of the packet sizes small, and the number of packets large.
Therefore, the larger the data volume of the data transmission packets corresponding to the voice information to be detected, the more likely it is abnormal voice information; conversely, the more likely it is normal voice information. The larger the standard deviation of the packet sizes corresponding to the voice information to be detected, the more likely it is normal voice information; conversely, the more likely it is abnormal voice information.
The voice session request feature is used to characterize the session window of the voice session request corresponding to the voice information to be detected, and includes: the size of the session window of the voice session request corresponding to the voice information to be detected (e.g., the TCP window size) and the size of the packet header corresponding to the session window of the voice information to be detected (e.g., the size of the traffic packet header).
In an actual service scenario, once the communication of the service content related to the client's conversation is completed, the voice conversation ends immediately and the intelligent voice response window is released. Therefore, the larger the TCP window corresponding to the voice information to be detected, the more likely it is abnormal voice information; conversely, the more likely it is normal voice information. Likewise, the larger the traffic packet header corresponding to the voice information to be detected, the more likely it is abnormal voice information; conversely, the more likely it is normal voice information.
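As an illustration only (not part of the original disclosure), the three groups of voice information features described above could be gathered into a single feature vector before being fed to the recognition model. The following Python sketch assumes hypothetical field names; it is not the feature set fixed by this specification:

    # Hypothetical sketch: collecting the voiceprint, data transmission and
    # voice session request features into one vector for the recognition model.
    from dataclasses import dataclass

    @dataclass
    class VoiceFeatures:
        speech_rate_range: float      # voiceprint: variation range of the speech rate
        volume_max: float             # voiceprint: maximum volume
        avg_packet_size: float        # transmission: average packet size
        fwd_packet_len_std: float     # transmission: std of packet length in the forward flow
        flow_bytes_per_s: float       # transmission: bytes per second
        flow_packets_per_s: float     # transmission: packets per second
        total_bwd_packets: float      # transmission: packets in the backward flow
        tcp_window_size: float        # session request: TCP window size
        packet_header_size: float     # session request: traffic packet header size

        def to_vector(self) -> list:
            # Flatten into the vector that is input to the recognition model.
            return [
                self.speech_rate_range, self.volume_max,
                self.avg_packet_size, self.fwd_packet_len_std,
                self.flow_bytes_per_s, self.flow_packets_per_s,
                self.total_bwd_packets,
                self.tcp_window_size, self.packet_header_size,
            ]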
And step S104, inputting the voice information characteristics into a pre-trained recognition model to obtain a recognition result aiming at the voice information to be detected.
And S106, performing voice abnormity detection on the voice information to be detected according to the recognition result.
In specific implementation, after the platform acquires the voice information characteristics corresponding to the voice information to be detected, the voice information characteristics are input into a pre-trained recognition model, so that the voice information to be detected is recognized through the recognition model, and whether the voice information to be detected is normal voice information or abnormal voice information is judged. The recognition model may include a speech reconstruction model and a speech anomaly recognition model.
When the platform identifies abnormal voice information through the identification model, the voice information characteristics are input into a pre-trained voice reconstruction model, and a reconstruction score corresponding to the voice information to be detected is determined through the voice reconstruction model. Meanwhile, the platform also inputs the voice information characteristics into a pre-trained voice abnormity recognition model, and determines the abnormity score aiming at the voice information to be detected through the voice abnormity recognition model. And then, the platform carries out voice anomaly detection on the voice information to be detected according to the obtained anomaly score and the reconstruction score.
In practical applications, when determining the reconstruction score, the platform may input the voice information features into the pre-trained voice reconstruction model. The voice reconstruction model encodes the voice information features with a preset auto-encoder and then decodes the encoded features to obtain reconstructed voice information. The platform then compares the reconstructed voice information with the voice information to be detected to determine the reconstruction score corresponding to the voice information to be detected. The greater the difference between the reconstructed voice information and the voice information to be detected, the larger the corresponding reconstruction score, and the more likely the voice information to be detected is abnormal voice information.
The difference between the reconstructed voice information and the voice information to be detected can be represented by the distance between the feature vector corresponding to the reconstructed voice information features and the feature vector corresponding to the voice information features: the smaller the distance, the smaller the difference between the two. In practical applications, the distance between the reconstructed voice information features and the voice information features of the voice information to be detected can be used directly as the reconstruction score corresponding to the voice information to be detected.
In practical applications, after normal voice information is encoded and decoded by the voice reconstruction model, the difference between the reconstructed voice information and the original is relatively small, whereas after abnormal voice information is encoded and decoded, the difference between the reconstructed voice information and the original is relatively large. Therefore, in this specification the voice reconstruction model can be used to perform voice anomaly detection on the voice information to be detected.
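A minimal sketch of this reconstruction-score computation is given below for illustration. It assumes a simple fully connected auto-encoder (implemented here with PyTorch) and uses the Euclidean distance as the distance measure; both choices are assumptions rather than requirements of this specification:

    import torch
    import torch.nn as nn

    class AutoEncoder(nn.Module):
        # Encode the voice information features, then decode (reconstruct) them.
        def __init__(self, feat_dim: int, hidden_dim: int = 8):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
            self.decoder = nn.Linear(hidden_dim, feat_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.decoder(self.encoder(x))

    def reconstruction_score(model: AutoEncoder, features: torch.Tensor) -> torch.Tensor:
        # Distance between the original features and their reconstruction;
        # a larger distance means the voice information is more likely abnormal.
        with torch.no_grad():
            reconstructed = model(features)
        return torch.norm(reconstructed - features, dim=-1)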
When determining the anomaly score, the platform inputs the features of the voice information to be detected into the pre-trained voice anomaly recognition model, which classifies the voice information to be detected and thereby determines the anomaly score for it. The higher the probability that the voice information to be detected is abnormal voice information, the higher its anomaly score.
Furthermore, this specification also sets a confidence corresponding to the anomaly score of the voice information to be detected, so as to reduce the influence of recognition errors of the voice anomaly recognition model on the recognition result. When the confidence corresponding to the anomaly score is high, the anomaly score is a reliable reference and can reflect the anomaly status of the voice information to be detected. Conversely, when the confidence corresponding to the anomaly score is low, the classification result of the voice anomaly recognition model is very likely to be wrong regardless of whether the anomaly score is high or low, and the anomaly score is not a reliable reference. In this case, a penalty weight corresponding to the confidence can be determined according to the confidence, where the lower the confidence, the higher the penalty weight.
Finally, the platform determines the compensated anomaly score from the confidence, the penalty weight and the anomaly score, and performs voice anomaly detection on the voice information to be detected according to the compensated anomaly score and the reconstruction score.
In this specification, the total anomaly score corresponding to the voice information to be detected may be expressed as:

S_total(i) = η · S_comp(i) + S_rec(i)

wherein S_total(i) is the total anomaly score of the voice information i to be detected; S_comp(i) is the compensated anomaly score of the voice information i to be detected, determined from the anomaly score S_anom(i) of the voice information i to be detected, the confidence c(i) corresponding to that anomaly score, and the penalty weight e; η is the weight coefficient corresponding to the anomaly score term of the voice information i to be detected; e is the penalty weight set according to the confidence, where the smaller the confidence, the larger e, the larger S_comp(i), and hence the higher the total anomaly score; and S_rec(i) is the reconstruction score corresponding to the voice information i to be detected.
The higher the total anomaly score corresponding to the voice information to be detected, the more likely the voice information to be detected is abnormal voice information. In this specification, when the total anomaly score corresponding to the voice information to be detected is determined to be greater than a set threshold score, the voice information to be detected can be determined to be abnormal voice information.
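For illustration only, the scoring step can be sketched as below under the assumptions made above: the total score is a weighted sum of a confidence-compensated anomaly score and the reconstruction score, and the penalty weight is taken here as an exponential of (1 - confidence), a hypothetical choice rather than the formula fixed by this specification:

    import math

    def total_anomaly_score(anomaly_score: float,
                            confidence: float,
                            reconstruction_score: float,
                            eta: float = 1.0) -> float:
        # Hypothetical compensation: the penalty grows as the confidence shrinks,
        # so low-confidence classifications push the total score upward.
        penalty = math.exp(1.0 - confidence)
        compensated = penalty * anomaly_score
        return eta * compensated + reconstruction_score

    def is_abnormal(total_score: float, threshold: float) -> bool:
        # The voice information is judged abnormal once the total score exceeds
        # the set threshold score.
        return total_score > threshold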
Step S108, if the voice information to be detected is determined to be abnormal voice information and the voice information re-acquired from the voice source corresponding to the voice information to be detected is still abnormal voice information under the condition that the standby service equipment is started, determining that the voice source has a network voice attack behavior, and processing the voice source through a preset abnormal processing strategy.
In specific implementation, after determining that the voice information to be detected is abnormal voice information, the platform starts a standby service device, which is used to execute the voice interaction service. Then, with the standby service device started, the platform re-acquires voice information from the voice source corresponding to the voice information to be detected and performs anomaly detection on it; if this voice information is determined to still be abnormal voice information, the voice source corresponding to the abnormal voice information is handled through the preset exception handling strategy. By starting the standby service device, the platform raises its service capacity and avoids situations in which normal voice information is identified as abnormal merely because the platform's service capacity is limited, thereby improving the accuracy of voice anomaly detection and the execution efficiency of the intelligent voice response service.
In an actual service scenario, when the platform's load is too high, it may fail to respond to the client's voice in time, so that the client's intelligent voice response service keeps occupying the intelligent voice response window; in this case the platform is very likely to misjudge the client's voice information as abnormal voice information. Therefore, after recognizing abnormal voice information, the platform can re-check the recognized abnormal voice information in the above manner, exclude normal voice information that was misjudged as abnormal because of the excessive load, and retain the abnormal voice information that actually corresponds to attacks, so that the platform can subsequently handle it.
In this specification, when the platform handles the voice source corresponding to the abnormal voice information, the Internet Protocol (IP) address of that voice source may be added to a blacklist to block the attack behavior of the lawbreaker corresponding to the abnormal voice information and maintain the stability of the system.
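Purely as an illustration of this exception handling flow (the start_standby_device, reacquire_voice and detect_abnormal callables are hypothetical stand-ins for the platform's own services), one might sketch it as:

    def handle_suspected_attack(voice_source_ip: str,
                                detect_abnormal,       # callable: voice info -> bool
                                start_standby_device,  # callable: () -> None
                                reacquire_voice,       # callable: ip -> voice info
                                blacklist: set) -> bool:
        # Re-check a suspected voice source with the standby service device started;
        # blacklist its IP only if the re-acquired voice information is still abnormal.
        start_standby_device()
        voice_info = reacquire_voice(voice_source_ip)
        if detect_abnormal(voice_info):
            blacklist.add(voice_source_ip)  # preset exception handling strategy
            return True                     # network voice attack behavior confirmed
        return False                        # likely a false positive caused by platform load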
Through the above steps, the platform can perform voice anomaly detection on the voice information corresponding to the intelligent voice response service, identify the attack behavior of lawbreakers, and apply the preset exception handling strategy to the voice source corresponding to the abnormal voice information, so as to prevent the lawbreakers' attacks and maintain the stability of the system.
For the voice anomaly detection scheme described above, this specification also provides the training process of the voice reconstruction model and the training process of the voice anomaly recognition model used when implementing the method for detecting voice anomalies.
In specific implementation, when training the voice reconstruction model, the platform first acquires first sample voice information and inputs it into the voice reconstruction model, which encodes the first sample voice information with a preset auto-encoder and decodes the encoded result to obtain reconstructed voice information; the platform then trains the voice reconstruction model according to the deviation between the reconstructed voice information and the first sample voice information.
The encoding function implemented by the auto-encoder can be expressed as:

z_i = f(x_i)

wherein x_i denotes the i-th voice information to be detected and f(·) denotes the encoder.
Then, the reconstruction loss can be expressed as:

L_rec = L(x_i, g(f(x_i)))

wherein g(·) represents the decoder corresponding to the encoder f(·), and L(·) represents a distance function between the reconstructed voice information corresponding to the i-th voice information to be detected and the i-th voice information to be detected itself.
The auto-encoder may include: a stacked deep auto-encoder (SAE), a variational auto-encoder (VAE), a denoising auto-encoder, a ladder network, and the like.
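A minimal sketch of this training loop, assuming an auto-encoder like the one sketched earlier and mean squared error as the distance function L(·), is given below for illustration:

    import torch
    import torch.nn as nn

    def train_reconstruction_model(model: nn.Module,
                                   samples: torch.Tensor,  # first sample voice information features
                                   epochs: int = 10,
                                   lr: float = 1e-3) -> None:
        # Train the voice reconstruction model by minimizing the deviation between
        # each sample and its reconstruction; MSE is one possible distance function.
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.MSELoss()
        for _ in range(epochs):
            optimizer.zero_grad()
            reconstructed = model(samples)            # g(f(x_i))
            loss = criterion(reconstructed, samples)  # L(x_i, g(f(x_i)))
            loss.backward()
            optimizer.step()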
In addition, when training the voice anomaly recognition model, the platform first acquires second sample voice information, then inputs it into the voice anomaly recognition model to obtain the anomaly score corresponding to the second sample voice information, and finally trains the voice anomaly recognition model with minimization of the deviation between the anomaly score corresponding to the second sample voice information and the label information corresponding to the second sample voice information as the optimization target.
The first sample voice information used in training the voice reconstruction model and the second sample voice information used in training the voice anomaly recognition model may be the same or different.
In this specification, since the cost of obtaining label information for sample voice information is high, a large amount of the sample voice information obtained by the platform has no label information. The platform therefore cannot train the models through fully supervised learning with labels, because it is difficult to obtain enough sample voice information with label information. For this situation, this specification provides a set of model training methods matching the above voice anomaly detection scheme.
In specific implementation, before model training, the platform first needs to preprocess the acquired sample voice data. For example, after acquiring the sample voice data, the platform cleans the raw data, removing duplicate data and invalid data (e.g., sample voice data with too much missing data). The platform can then perform model training on the preprocessed sample voice information.
During model training, the platform first performs a first round of training on the models, using the sample voice information with label information and taking minimization of the loss function as the optimization target, to obtain trained recognition models (comprising the voice reconstruction model and the voice anomaly recognition model). Then the platform selects part of the sample voice information without label information, inputs it into the trained recognition models to obtain the corresponding classification results, and uses those classification results as the label information for that part of the sample voice information.
Next, the platform again inputs all the sample voice information that now carries label information into the trained recognition models and performs a second round of training with minimization of the loss function as the optimization target, obtaining the twice-trained recognition models; it then again selects part of the sample voice information without label information, inputs it into the twice-trained recognition models to obtain the corresponding classification results, and uses those results as the label information for that part of the sample voice information.
Then the platform once more inputs all the sample voice information carrying label information into the recognition models and performs a third round of training with minimization of the loss function as the optimization target, obtaining the thrice-trained recognition models; it again selects part of the sample voice information without label information, inputs it into the thrice-trained recognition models to obtain the corresponding classification results, and uses those results as the label information for that part of the sample voice information.
The above steps are repeated in this way until the preset training condition is met, at which point model training is determined to be complete.
In this specification, the preset training condition may take various forms: for example, the preset training condition can be determined to be satisfied when the number of training rounds reaches a set number; or the models are verified with validation samples after each round of training, and the condition is determined to be met once the verification passes. Other forms are not enumerated here.
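For illustration, the iterative self-labeling procedure described above might be sketched as follows; train_one_round and predict_labels are hypothetical stand-ins for the supervised training step and the classification step of the recognition models:

    def iterative_training(labeled, unlabeled,
                           train_one_round,  # callable: labeled samples -> trained model
                           predict_labels,   # callable: (model, samples) -> labels
                           max_rounds: int = 5,
                           batch_size: int = 100):
        # Alternate supervised training with pseudo-labeling of unlabeled samples
        # until the preset training condition (here, a simple round limit) is met.
        model = None
        for _ in range(max_rounds):
            model = train_one_round(labeled)        # minimize the loss function
            if not unlabeled:
                break
            picked = unlabeled[:batch_size]         # select part of the unlabeled samples
            unlabeled = unlabeled[batch_size:]
            pseudo = predict_labels(model, picked)  # classification results used as labels
            labeled = labeled + list(zip(picked, pseudo))
        return model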
In an actual service scenario, the amount of abnormal voice information is far smaller than that of normal voice information, so in the training samples obtained by the platform the amount of abnormal sample voice information is likewise far smaller than that of normal sample voice information. In this specification, to avoid poor recognition accuracy caused by the trained models being dominated by normal voice information, a focal loss function is introduced in the model training process, which can be expressed as:
FL(p_t) = -(1 - p_t)^γ · log(p_t)

wherein p_t represents the probability that the sample voice information t is abnormal voice information; FL(p_t) represents the focal loss corresponding to the sample voice information t; and γ is the hyper-parameter of the focal loss function.
In this way, when the amount of abnormal voice information in the sample voice information is far smaller than that of normal voice information, that is, when the distribution of positive and negative sample voice information is extremely unbalanced, the influence of the majority class (normal voice information) on model training can be reduced.
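For illustration, a standard implementation of the focal loss is sketched below; here p_t is taken as the predicted probability of the sample's true class, and gamma = 2 is a commonly used default rather than a value fixed by this specification:

    import math

    def focal_loss(p_abnormal: float, is_abnormal: bool, gamma: float = 2.0) -> float:
        # FL(p_t) = -(1 - p_t)^gamma * log(p_t): easy, majority-class samples get a
        # small (1 - p_t)^gamma factor and therefore contribute little to the loss.
        eps = 1e-12
        p_t = p_abnormal if is_abnormal else 1.0 - p_abnormal
        return -((1.0 - p_t) ** gamma) * math.log(p_t + eps)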
Further, referring to fig. 2, in the present specification, the speech reconstruction model and the speech anomaly recognition model may be jointly trained.
During model training, the platform extracts the voice information features Y_i from the sample voice information X_i and inputs Y_i into the voice reconstruction model, which encodes and decodes Y_i to obtain the reconstructed voice information X'_i; the reconstruction loss corresponding to the sample voice information X_i is then determined from the deviation between X'_i and X_i.

At the same time, the platform inputs the voice information features Y_i of the sample voice information X_i into the voice anomaly recognition model to obtain the classification result corresponding to X_i and the confidence of that classification result. Then, the platform determines the classification loss corresponding to X_i according to the deviation between the classification result and the label information of X_i, and determines the confidence loss corresponding to X_i according to the deviation between the confidence of the classification result and the label information for that confidence. Further, the compensated classification loss corresponding to X_i is determined according to the classification loss and the confidence loss corresponding to X_i.
Then, according to the reconstruction loss corresponding to the sample voice information X_i and the compensated classification loss corresponding to the sample voice information X_i, the platform constructs the loss function for model training, which can be represented by the following formula:

L = μ · L_rec + L_cls

wherein L_rec represents the reconstruction loss corresponding to the sample voice information, L_cls represents the compensated classification loss corresponding to the sample voice information, and μ is a hyper-parameter for adjusting the relative weight of the reconstruction loss and the classification loss: the larger μ, the greater the influence of the reconstruction loss on the model.
And finally, the platform performs combined training on the voice reconstruction model and the voice anomaly recognition model by minimizing the loss function L to obtain the trained voice reconstruction model and the trained voice anomaly recognition model.
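As a rough, non-authoritative sketch of this joint objective (using PyTorch, and simplifying the compensated classification loss to an ordinary cross-entropy for illustration):

    import torch
    import torch.nn as nn

    def joint_loss(features: torch.Tensor,    # voice information features Y_i
                   labels: torch.Tensor,      # 1 = abnormal, 0 = normal
                   reconstruction_model: nn.Module,
                   anomaly_model: nn.Module,  # outputs one logit per sample
                   mu: float = 1.0) -> torch.Tensor:
        # L = mu * L_rec + L_cls; here L_cls stands in for the compensated
        # classification loss described in the specification.
        reconstructed = reconstruction_model(features)
        l_rec = nn.functional.mse_loss(reconstructed, features)
        logits = anomaly_model(features).squeeze(-1)
        l_cls = nn.functional.binary_cross_entropy_with_logits(logits, labels.float())
        return mu * l_rec + l_cls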
If the sample voice information X_i is sample voice information with label information among the training samples acquired by the platform, the confidence of the classification result corresponding to that sample voice information may be set to 1. If the sample voice information X_i is sample voice information without label information, the classification result of X_i determined by the trained recognition model is used as the label information of X_i, and a confidence is then assigned to that classification result by manual checking.
It should be noted that the scheme for detecting voice anomaly provided in this specification can be applied to various services, such as voice collection, intelligent customer service, information search, and the like, and this specification does not limit specific use scenarios.
Based on the same idea, the present specification further provides a corresponding apparatus for detecting a speech anomaly, as shown in fig. 3.
Fig. 3 is a schematic diagram of a speech anomaly detection apparatus provided in this specification, including:
an obtaining module 300, configured to obtain voice information to be detected;
a determining module 301, configured to determine a voice information feature corresponding to the voice information to be detected, where the voice information feature includes: at least one of a voiceprint feature, a data transmission feature and a voice session request feature, where the data transmission feature is used to characterize the feature of the voice information to be detected in terms of data transmission quantity, and the voice session request feature is used to characterize the session window feature of the voice session request corresponding to the voice information to be detected;
an input module 302, configured to input the speech information features into a pre-trained recognition model, so as to obtain a recognition result for the speech information to be detected;
the detection module 303 is configured to perform voice anomaly detection on the voice information to be detected according to the recognition result;
an exception handling module 304, configured to determine that a network voice attack behavior exists in the voice source if it is determined that the voice information to be detected is abnormal voice information and the voice information re-acquired from the voice source corresponding to the voice information to be detected is still abnormal voice information under the condition that the standby service device is started, and handle the voice source through a preset exception handling policy.
Optionally, the voiceprint features comprise: at least one of a speech rate characteristic corresponding to the voice information to be detected and a volume characteristic corresponding to the voice information to be detected;
the data transmission features include: at least one of the data size of the data transmission packet corresponding to the voice information to be detected, the number of bytes per second of the voice information to be detected in the transmission process, and the number of the data transmission packet corresponding to the voice information to be detected;
the voice session request feature includes: at least one of the size of the conversation window of the voice conversation request corresponding to the voice information to be detected and the size of the header corresponding to the conversation window of the voice information to be detected.
Optionally, the recognition model includes a speech reconstruction model and a speech anomaly recognition model; the input module 302 is specifically configured to input the voice information features into a pre-trained voice reconstruction model and a voice anomaly recognition model, so as to determine a reconstruction score corresponding to the to-be-detected voice information through the voice reconstruction model, and determine an anomaly score for the to-be-detected voice information through the voice anomaly recognition model; according to the recognition result, carrying out voice abnormity detection on the voice information to be detected, which specifically comprises the following steps: and performing voice anomaly detection on the voice information to be detected according to the anomaly score and the reconstruction score.
Optionally, the input module 302 is specifically configured to determine a confidence level corresponding to the anomaly score; and performing voice anomaly detection on the voice information to be detected according to the confidence coefficient, the anomaly score and the reconstruction score.
Optionally, the input module 302 is specifically configured to determine a penalty weight corresponding to the confidence level according to the magnitude of the confidence level, where if the confidence level is lower, the penalty weight is higher; determining a compensated abnormal score according to the confidence coefficient, the penalty weight and the abnormal score; and performing voice anomaly detection on the voice information to be detected according to the compensated anomaly score and the reconstructed score.
Optionally, the apparatus further comprises:
a speech reconstruction model training module 305, configured to obtain first sample speech information; inputting the first sample voice information into the voice reconstruction model to reconstruct the first sample voice information to obtain reconstructed voice information; and training the voice reconstruction model according to the deviation between the reconstructed voice information and the first sample voice information.
Optionally, the apparatus further comprises:
a speech anomaly recognition model training module 306, configured to obtain second sample speech information; inputting the second sample voice information into the voice abnormity recognition model to obtain an abnormity score corresponding to the second sample voice information; and training the voice anomaly recognition model by taking the minimized anomaly score corresponding to the second sample voice information and the labeled information corresponding to the second sample voice information as optimization targets.
The present specification also provides a computer-readable storage medium storing a computer program, which can be used to execute a method of speech anomaly detection provided in fig. 1 above.
This specification also provides a schematic block diagram of an electronic device corresponding to that of figure 1, shown in figure 4. As shown in fig. 4, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and may also include hardware required for other services. The processor reads a corresponding computer program from the non-volatile memory into the memory and then runs the computer program to implement the method for detecting a voice anomaly as described in fig. 1 above. Of course, besides the software implementation, the present specification does not exclude other implementations, such as logic devices or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may be hardware or logic devices.
In the 1990s, improvements to a technology could be clearly distinguished as either hardware improvements (e.g., improvements to circuit structures such as diodes, transistors and switches) or software improvements (improvements to method flows). However, as technology has advanced, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized with hardware entity modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. Designers program a digital system "onto" a PLD by themselves, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually making integrated circuit chips, this programming is now mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must also be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logical method flow can be readily obtained merely by slightly logically programming the method flow into an integrated circuit using the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (for example, software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, the same functions can be implemented entirely by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included within it for implementing various functions may also be regarded as structures within the hardware component. Or even, the means for implementing various functions may be regarded both as software modules for implementing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above device is described by dividing its functions into various units, each described separately. Of course, when implementing this specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The description has been presented with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the description. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises that element.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, its description is relatively brief, and for relevant points reference may be made to the corresponding description of the method embodiment.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims (10)

1. A method of speech anomaly detection, comprising:
acquiring voice information to be detected;
determining voice information characteristics corresponding to the voice information to be detected, wherein the voice information characteristics comprise: at least one of a voiceprint feature, a data transmission feature and a voice session request feature, where the data transmission feature is used to characterize the feature of the voice information to be detected in terms of data transmission quantity, and the voice session request feature is used to characterize the session window feature of the voice session request corresponding to the voice information to be detected;
inputting the voice information characteristics into a pre-trained recognition model to obtain a recognition result aiming at the voice information to be detected;
performing voice anomaly detection on the voice information to be detected according to the recognition result;
if the voice information to be detected is determined to be abnormal voice information, and the voice information re-acquired from the voice source corresponding to the voice information to be detected is still abnormal voice information after the standby service device is enabled, determining that a network voice attack behavior exists at the voice source, and handling the voice source through a preset exception-handling policy.
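For illustration only, the following Python sketch outlines the overall decision flow of claim 1 above; the feature extractor, recognition model, standby-device switch, and handling policy are hypothetical placeholders, not interfaces defined by this specification.

# Non-binding sketch of the flow in claim 1. All callables passed in
# (extract_features, model, reacquire_on_standby, block_source) are
# hypothetical placeholders.
def detect_and_handle(voice_source, extract_features, model,
                      reacquire_on_standby, block_source, threshold=0.5):
    voice_info = voice_source.read()
    features = extract_features(voice_info)   # voiceprint / data transmission / session features
    score = model(features)                    # recognition result
    if score <= threshold:
        return "normal"
    # Abnormal once: enable the standby service device and re-acquire.
    retried_info = reacquire_on_standby(voice_source)
    if model(extract_features(retried_info)) > threshold:
        block_source(voice_source)             # preset exception-handling policy
        return "network voice attack"
    return "transient anomaly"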
2. The method of claim 1, wherein the voiceprint features comprise: at least one of a speech rate characteristic corresponding to the voice information to be detected and a volume characteristic corresponding to the voice information to be detected;
the data transmission features include: at least one of the data size of the data transmission packet corresponding to the voice information to be detected, the number of bytes per second of the voice information to be detected in the transmission process, and the number of the data transmission packet corresponding to the voice information to be detected;
the voice session request feature includes: at least one of the size of the conversation window of the voice conversation request corresponding to the voice information to be detected and the size of the header corresponding to the conversation window of the voice information to be detected.
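A minimal sketch of how the three feature groups of claim 2 might be assembled into a single vector follows; the capture object and its fields are assumed for illustration and are not part of the claim.

# Illustrative feature assembly for claim 2. `capture` is a hypothetical
# object summarizing transport-level statistics of the voice session.
def build_feature_vector(duration_s, word_count, rms_volume, capture):
    voiceprint = [
        word_count / max(duration_s, 1e-6),   # speech-rate feature
        rms_volume,                           # volume feature
    ]
    data_transmission = [
        capture.packet_bytes,                               # data size of the transmission packets
        capture.packet_bytes / max(duration_s, 1e-6),       # bytes per second during transmission
        capture.packet_count,                               # number of transmission packets
    ]
    session_request = [
        capture.session_window_size,          # session window size of the voice session request
        capture.header_bytes,                 # header size of the session window
    ]
    return voiceprint + data_transmission + session_request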
3. The method of claim 1, wherein the recognition models comprise a speech reconstruction model and a speech anomaly recognition model;
inputting the voice information features into a pre-trained recognition model to obtain a recognition result for the voice information to be detected, and specifically comprising:
inputting the voice information features into a pre-trained speech reconstruction model and a speech anomaly recognition model, so as to determine a reconstruction score corresponding to the voice information to be detected through the speech reconstruction model and determine an anomaly score for the voice information to be detected through the speech anomaly recognition model;
performing voice anomaly detection on the voice information to be detected according to the recognition result specifically comprises:
and performing voice anomaly detection on the voice information to be detected according to the anomaly score and the reconstruction score.
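The two-score inference of claim 3 could look roughly like the sketch below; the mean-squared reconstruction error and the simple either-or decision rule are assumptions, not requirements of the claim.

# Rough sketch of claim 3: one score from the speech reconstruction model,
# one from the speech anomaly recognition model. Both models and the
# thresholds are illustrative placeholders.
import numpy as np

def recognize(features, reconstruction_model, anomaly_model):
    reconstructed = reconstruction_model(features)
    reconstruction_score = float(np.mean((features - reconstructed) ** 2))  # higher = harder to rebuild
    anomaly_score = float(anomaly_model(features))
    return anomaly_score, reconstruction_score

def is_abnormal(anomaly_score, reconstruction_score,
                score_threshold=0.5, recon_threshold=0.1):
    # One possible combination rule (assumption): flag if either signal is high.
    return anomaly_score > score_threshold or reconstruction_score > recon_threshold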
4. The method according to claim 3, wherein performing speech anomaly detection on the speech information to be detected according to the anomaly score and the reconstruction score specifically includes:
determining a confidence corresponding to the anomaly score;
and performing voice anomaly detection on the voice information to be detected according to the confidence, the anomaly score, and the reconstruction score.
5. The method according to claim 4, wherein performing voice anomaly detection on the voice information to be detected according to the confidence, the anomaly score, and the reconstruction score specifically comprises:
determining a penalty weight corresponding to the confidence according to the magnitude of the confidence, wherein a lower confidence corresponds to a higher penalty weight;
determining a compensated anomaly score according to the confidence, the penalty weight, and the anomaly score;
and performing voice anomaly detection on the voice information to be detected according to the compensated anomaly score and the reconstruction score.
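One possible reading of the compensation in claims 4 and 5 is sketched below; the specific weighting formula, in which a lower confidence inflates the anomaly score more strongly, is an assumption chosen only to illustrate the "lower confidence, higher penalty weight" relation.

# Illustrative compensation for claims 4-5; the exact formula is an assumption.
def compensate(anomaly_score, confidence, max_penalty=2.0):
    # Lower confidence -> higher penalty weight (between 1.0 and max_penalty).
    penalty_weight = 1.0 + (1.0 - confidence) * (max_penalty - 1.0)
    return anomaly_score * penalty_weight      # compensated anomaly score

def detect_with_confidence(anomaly_score, reconstruction_score, confidence,
                           score_threshold=0.5, recon_threshold=0.1):
    compensated = compensate(anomaly_score, confidence)
    return compensated > score_threshold or reconstruction_score > recon_threshold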
6. The method of claim 3, wherein training the speech reconstruction model comprises:
acquiring first sample voice information;
inputting the first sample voice information into the speech reconstruction model to reconstruct the first sample voice information to obtain reconstructed voice information;
and training the speech reconstruction model according to the deviation between the reconstructed voice information and the first sample voice information.
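Claim 6 describes training the speech reconstruction model on first sample voice information; the following autoencoder-style sketch is one plausible realization, with the architecture and mean-squared-error loss assumed for illustration.

# Hypothetical autoencoder realization of claim 6; architecture and loss
# are assumptions, not details fixed by the claim.
import torch
import torch.nn as nn

class SpeechReconstructor(nn.Module):
    def __init__(self, feat_dim=7, hidden_dim=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_reconstructor(model, data_loader, epochs=10, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for first_sample_features in data_loader:
            reconstructed = model(first_sample_features)
            # Deviation between the reconstruction and the first sample voice information.
            loss = nn.functional.mse_loss(reconstructed, first_sample_features)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model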
7. The method of claim 3, wherein training the speech anomaly recognition model specifically comprises:
acquiring second sample voice information;
inputting the second sample voice information into the speech anomaly recognition model to obtain an anomaly score corresponding to the second sample voice information;
and training the speech anomaly recognition model with the optimization objective of minimizing the deviation between the anomaly score corresponding to the second sample voice information and the annotation information corresponding to the second sample voice information.
8. An apparatus for speech anomaly detection, comprising:
the acquisition module is used for acquiring the voice information to be detected;
a determining module, configured to determine a voice information feature corresponding to the voice information to be detected, where the voice information feature includes: at least one of a voiceprint feature, a data transmission feature and a voice session request feature, where the data transmission feature is used to characterize the feature of the voice information to be detected in terms of data transmission quantity, and the voice session request feature is used to characterize the session window feature of the voice session request corresponding to the voice information to be detected;
the input module is used for inputting the voice information characteristics into a pre-trained recognition model to obtain a recognition result aiming at the voice information to be detected;
the detection module is used for performing voice anomaly detection on the voice information to be detected according to the recognition result;
and the exception handling module is used for determining that a network voice attack behavior exists at the voice source if the voice information to be detected is determined to be abnormal voice information and the voice information re-acquired from the voice source corresponding to the voice information to be detected is still abnormal voice information after the standby service device is enabled, and handling the voice source through a preset exception-handling policy.
9. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1 to 7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 7 when executing the program.
CN202110750572.1A 2021-07-02 2021-07-02 Voice abnormality detection method and device Active CN113611329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110750572.1A CN113611329B (en) 2021-07-02 2021-07-02 Voice abnormality detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110750572.1A CN113611329B (en) 2021-07-02 2021-07-02 Voice abnormality detection method and device

Publications (2)

Publication Number Publication Date
CN113611329A true CN113611329A (en) 2021-11-05
CN113611329B CN113611329B (en) 2023-10-24

Family

ID=78303933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110750572.1A Active CN113611329B (en) 2021-07-02 2021-07-02 Voice abnormality detection method and device

Country Status (1)

Country Link
CN (1) CN113611329B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101231035B1 (en) * 2011-09-06 2013-02-07 건국대학교 산학협력단 A system of invite flooding attack detection and defense using sip in voip service and the mehtod thereof
WO2016180222A1 (en) * 2015-07-24 2016-11-17 中兴通讯股份有限公司 Abnormal call determination method and device
CN111402903A (en) * 2018-12-28 2020-07-10 英特尔公司 Ultrasonic attack detection with deep learning
CN110767216A (en) * 2019-09-10 2020-02-07 浙江工业大学 Voice recognition attack defense method based on PSO algorithm
CN111275858A (en) * 2020-01-22 2020-06-12 广东快车科技股份有限公司 Credit granting method and system for voiceprint recognition
CN111666502A (en) * 2020-07-08 2020-09-15 腾讯科技(深圳)有限公司 Abnormal user identification method and device based on deep learning and storage medium
CN112201255A (en) * 2020-09-30 2021-01-08 浙江大学 Voice signal spectrum characteristic and deep learning voice spoofing attack detection method
CN112860968A (en) * 2021-02-02 2021-05-28 北京三快在线科技有限公司 Abnormity detection method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114338089A (en) * 2021-12-06 2022-04-12 科大讯飞股份有限公司 Anti-attack method, device, equipment and computer readable storage medium
CN114338089B (en) * 2021-12-06 2024-02-13 科大讯飞股份有限公司 Anti-attack method, device, equipment and computer readable storage medium
CN114937455A (en) * 2022-07-21 2022-08-23 中国科学院自动化研究所 Voice detection method and device, equipment and storage medium
CN114937455B (en) * 2022-07-21 2022-10-11 中国科学院自动化研究所 Voice detection method and device, equipment and storage medium

Also Published As

Publication number Publication date
CN113611329B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN107818301B (en) Method and device for updating biological characteristic template and electronic equipment
CN113611329A (en) Method and device for detecting abnormal voice
WO2023231785A1 (en) Data processing method, apparatus, and device
CN108986825A (en) Context acquisition methods and equipment based on interactive voice
CN111968625A (en) Sensitive audio recognition model training method and recognition method fusing text information
CN112786066B (en) Audio signal screening method and device and electronic equipment
WO2020164331A1 (en) Claim service processing method and device
CN114819614A (en) Data processing method, device, system and equipment
CN115545002A (en) Method, device, storage medium and equipment for model training and business processing
CN112908315B (en) Question and answer intention judging method based on sound characteristics and voice recognition
CN111292725B (en) Voice decoding method and device
CN112735407A (en) Conversation processing method and device
CN112735374A (en) Automatic voice interaction method and device
TWI818427B (en) Method and system for correcting speaker diarisation using speaker change detection based on text
CN112738344B (en) Method and device for identifying user identity, storage medium and electronic equipment
CN115171735A (en) Voice activity detection method, storage medium and electronic equipment
CN110443746B (en) Picture processing method and device based on generation countermeasure network and electronic equipment
CN115294974A (en) Voice recognition method, device, equipment and storage medium
CN111242322B (en) Detection method and device for rear door sample and electronic equipment
CN110222846B (en) Information security method and information security system for internet terminal
CN111078877B (en) Data processing method, training method of text classification model, and text classification method and device
CN109325127B (en) Risk identification method and device
CN114416863A (en) Method, apparatus, and medium for performing model-based parallel distributed reasoning
CN113342978A (en) City event processing method and device
CN113220949A (en) Construction method and device of private data identification system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant