CN115565525A

CN115565525A - Audio anomaly detection method and device, electronic equipment and storage medium

Info

Publication number: CN115565525A
Application number: CN202211552884.2A
Authority: CN
Inventors: 张伟; 郑子强; 何得淮; 何行知; 姚佳; 唐怀都; 朱鑫海; 路浩
Original assignee: Sichuan Provincial Prison Administration; West China Hospital of Sichuan University
Current assignee: Sichuan Provincial Prison Administration; West China Hospital of Sichuan University
Priority date: 2022-12-06
Filing date: 2022-12-06
Publication date: 2023-01-03

Abstract

Embodiments of the invention provide an audio anomaly detection method and device, electronic equipment, and a storage medium, and relate to the field of data processing. The audio anomaly detection method comprises: constructing an initial detection model; processing initial clock-in audio data to generate an audio feature tensor; inputting the audio feature tensor into the initial detection model, which outputs a first random variable and a second random variable; training the initial detection model according to an optimization function to obtain a corrected detection model; inputting the first random variable and the second random variable into the corrected detection model to generate a reconstruction tensor; performing an anomaly evaluation calculation on the reconstruction tensor to obtain an anomaly score; and, if the anomaly score is greater than or equal to an anomaly threshold, determining that the initial clock-in audio data is abnormal. The embodiments jointly encode temporal and spatial data, can be used to monitor the daily state of personnel, the running state of machines, and the like, give timely early warning, and help enterprises, institutions, and similar organizations manage better.

Description

Audio anomaly detection method, device, electronic device and storage medium

Technical Field

The present invention relates to the technical field of data processing, and in particular to an audio anomaly detection method, device, electronic equipment, and storage medium.

Background

Existing audio anomaly detection tasks mainly detect suspicious events, such as vehicle collisions, shouting, or gunshots, in order to improve the reliability of security systems or to monitor equipment status. Compared with images and text, an experimental environment for audio is harder to set up and audio is more expensive to label, so abnormal human states are rarely detected directly from audio.

Existing research mainly focuses on emotion recognition from a single audio recording. The audio datasets are built by professional actors through emotional guidance, scene recall, changes of environment, and similar means, and are annotated by experts. Such datasets have two main problems: the authenticity of the emotions cannot be guaranteed, and individuals differ from one another. In addition, manually labeling audio data takes a great deal of time and labor, and how to find abnormal audio in a large volume of unlabeled audio data has not yet been studied.

Summary of the Invention

To solve the above technical problems, embodiments of the present application provide an audio anomaly detection method, device, electronic equipment, and storage medium.

In a first aspect, an embodiment of the present application provides an audio anomaly detection method, the method comprising:

constructing an initial detection model based on a variational network and a generative network;

generating an audio feature tensor based on initial clock-in audio data;

inputting the audio feature tensor into the initial detection model, and outputting a first random variable and a second random variable through the initial detection model;

training the initial detection model according to an optimization function to obtain a corrected detection model;

inputting the first random variable and the second random variable into the corrected detection model to generate a reconstruction tensor corresponding to the audio feature tensor;

performing an anomaly evaluation calculation on the reconstruction tensor to obtain an anomaly score corresponding to the audio feature tensor;

if the anomaly score is greater than or equal to an anomaly threshold, determining that the initial clock-in audio data is abnormal.

In one embodiment, the step of generating an audio feature tensor based on the initial clock-in audio data includes:

acquiring N₁ pieces of initial clock-in audio data;

preprocessing each piece of initial clock-in audio data to obtain N₁ pieces of corrected clock-in audio data;

converting each piece of corrected clock-in audio data into N₂ corresponding pieces of feature data, and concatenating the N₂ pieces of feature data into a feature vector;

concatenating the N₁ feature vectors into the audio feature tensor.

In one embodiment, the step of preprocessing the plurality of pieces of initial clock-in audio data includes:

removing the background noise from each piece of initial clock-in audio data to obtain noise-reduced clock-in audio data;

sampling the noise-reduced clock-in audio data at a preset frequency.

In one embodiment, the initial detection model includes: a preset convolution layer, a preset deconvolution layer, a gated recurrent layer, a linear transformation layer, and a fully connected layer;

the variational network is composed of the preset convolution layer, the preset deconvolution layer, and the gated recurrent layer;

the generative network is composed of the preset deconvolution layer, the gated recurrent layer, the linear transformation layer, and the fully connected layer.

In one embodiment, the step of training the initial detection model according to the optimization function includes:

The optimization function is:

[formula not reproduced in the source text]

where the terms of the formula denote, respectively, the training loss, the mathematical expectation over the audio feature tensor, the posterior probability of the generative network for the audio feature tensor, the posterior probability of the variational network for the audio feature tensor, the KL divergence, and a constant; θ denotes the layer parameters of the generative network, and ϕ denotes the layer parameters of the variational network;

θ and ϕ are adjusted by stochastic gradient variational estimation and reparameterization, and the training loss is computed from the adjusted θ and ϕ; when the training loss is less than the loss threshold, the adjusted θ and ϕ are saved.

The step of generating the reconstruction tensor corresponding to the audio feature tensor includes:

mapping the first random variable through the linear transformation layer to obtain a mapping result;

inputting the second random variable into the preset deconvolution layer to obtain a deconvolution result;

concatenating the mapping result and the deconvolution result to obtain a concatenation result;

decoding the concatenation result through the fully connected layer to obtain the reconstruction tensor.

In one embodiment, the step of performing an anomaly evaluation calculation on the reconstruction tensor to obtain the anomaly score corresponding to the audio feature tensor includes:

sampling the reconstruction tensor to obtain L reconstructed samples;

performing Monte Carlo integration on the L reconstructed samples to obtain a reconstruction probability;

taking the negative of the reconstruction probability to obtain the anomaly score corresponding to the audio feature tensor.

In a second aspect, an embodiment of the present application provides an audio anomaly detection device, the audio anomaly detection device including:

a construction module, configured to construct an initial detection model based on a variational network and a generative network;

a first generation module, configured to generate an audio feature tensor based on initial clock-in audio data;

an input module, configured to input the audio feature tensor into the initial detection model and to output a first random variable and a second random variable through the initial detection model;

a training module, configured to train the initial detection model according to an optimization function to obtain a corrected detection model;

a second generation module, configured to input the first random variable and the second random variable into the corrected detection model and to generate a reconstruction tensor corresponding to the audio feature tensor;

a calculation module, configured to perform an anomaly evaluation calculation on the reconstruction tensor to obtain an anomaly score corresponding to the audio feature tensor;

a determination module, configured to determine that the initial clock-in audio data is abnormal if the anomaly score is greater than or equal to an anomaly threshold.

In a third aspect, an embodiment of the present application provides an electronic device including a memory and a processor, the memory being used to store a computer program which, when run by the processor, executes the audio anomaly detection method provided in the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when run on a processor, executes the audio anomaly detection method provided in the first aspect.

The audio anomaly detection method provided by the present application uses a variational autoencoder to construct an initial detection model; processes the initial clock-in audio data to generate an audio feature tensor; inputs the audio feature tensor into the initial detection model, which outputs a first random variable and a second random variable; trains the initial detection model according to an optimization function to obtain a corrected detection model; inputs the first random variable and the second random variable into the corrected detection model to generate the reconstruction tensor corresponding to the audio feature tensor; performs an anomaly evaluation calculation on the reconstruction tensor to obtain the anomaly score corresponding to the audio feature tensor; and, if the anomaly score is greater than or equal to the anomaly threshold, determines that the initial clock-in audio data is abnormal. The embodiments of the present application jointly encode temporal and spatial data and, for the first time, perform anomaly detection on consecutive clock-in audio from the same target. They can be used to monitor the daily state of personnel, the running state of machines, and the like, give timely early warning, and help enterprises, government bodies, and similar organizations manage better.

Brief Description of the Drawings

To explain the technical solution of the present application more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present application and should therefore not be regarded as limiting its scope of protection. In the drawings, similar components are given similar reference numerals.

Fig. 1 shows a schematic flow chart of the audio anomaly detection method provided by an embodiment of the present application;

Fig. 2 shows a schematic structural diagram of the initial detection model provided by an embodiment of the present application;

Fig. 3 shows a schematic diagram of the one-dimensional feature vector provided by an embodiment of the present application;

Fig. 4 shows a schematic diagram of the seven-day clock-in audio feature tensor provided by an embodiment of the present application;

Fig. 5 shows another schematic diagram, of the time series, provided by an embodiment of the present application;

Fig. 6 shows a schematic structural diagram of the audio anomaly detection device provided by an embodiment of the present application.

Reference numerals: 210 - variational network, 220 - generative network;

510 - fundamental-frequency feature anomaly on the time series, 520 - silent-segment-percentage feature anomaly on the time series, 530 - multi-feature anomaly on the time series;

600 - audio anomaly detection device, 610 - construction module, 620 - first generation module, 630 - input module, 640 - training module, 650 - second generation module, 660 - calculation module, 670 - determination module.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application.

The components of the embodiments of the present application, as generally described and illustrated in the drawings, may be arranged and designed in a variety of different configurations. Accordingly, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed application, but merely represents selected embodiments of the application. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present application without creative effort fall within the scope of protection of the present application.

Hereinafter, the terms "comprising", "having", and their cognates, as used in the various embodiments of the present application, are intended only to denote particular features, numbers, steps, operations, elements, components, or combinations of the foregoing, and should not be understood as excluding the existence of, or the possibility of adding, one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.

In addition, the terms "first", "second", "third", and so on are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

Unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which the various embodiments of the present application belong. Terms such as those defined in commonly used dictionaries are to be interpreted as having the same meaning as their contextual meaning in the relevant technical field, and are not to be interpreted in an idealized or overly formal sense unless clearly so defined in the various embodiments of the present application.

Example 1

An embodiment of the present disclosure provides an audio anomaly detection method.

Specifically, referring to Fig. 1, the audio anomaly detection method includes:

Step S110, constructing an initial detection model based on the variational network 210 and the generative network 220;

In one embodiment, referring to Fig. 2, the initial detection model includes: a preset convolution layer Conv1D, a preset deconvolution layer deConv1D, a gated recurrent layer GRU, a linear transformation layer linear, and a fully connected layer dense. The variational network is composed of the preset convolution layer Conv1D, the preset deconvolution layer deConv1D, and the gated recurrent layer GRU; the generative network is composed of the preset deconvolution layer deConv1D, the gated recurrent layer GRU, the linear transformation layer linear, and the fully connected layer dense. The variational network is labeled 210 and the generative network is labeled 220. For ease of description, all subsequent formulas use English notation.

Step S120, generating an audio feature tensor based on the initial clock-in audio data;

In one embodiment, the step of generating an audio feature tensor based on the initial clock-in audio data includes: acquiring N₁ pieces of initial clock-in audio data;

In one embodiment, a clock-in machine collects daily audio clock-in data as the initial clock-in audio data. Two questions are preset in the clock-in machine, and 15 s of answering time is reserved after each question. The person clocking in answers each question after the machine asks it, and the machine records the answers, yielding 30 s of clock-in audio data per person per day. In one embodiment, the initial clock-in audio data can be collected continuously for one week, in which case N₁ is 7.

In one embodiment, the step of preprocessing the plurality of pieces of initial clock-in audio data includes: removing the background noise from each piece of initial clock-in audio data to obtain noise-reduced clock-in audio data; and sampling the noise-reduced clock-in audio data at a preset frequency.

In one embodiment, audio noise reduction removes the audio noise floor with a filter, and audio downsampling fixes the audio sampling rate at 16 kHz to simplify subsequent computation.
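As an illustrative sketch only, the resampling half of this preprocessing can be written as follows. The function name `resample_linear`, the use of linear interpolation, and the absence of an anti-aliasing filter are assumptions made for brevity; the source states only that the rate is fixed at 16 kHz, and it does not specify the noise-reduction filter, which is therefore omitted here.

```python
def resample_linear(samples, src_rate, dst_rate=16_000):
    """Resample a mono signal to dst_rate by linear interpolation.

    Hypothetical helper: a production pipeline would low-pass filter
    before downsampling to limit aliasing.
    """
    if not samples:
        return []
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index into the source
        lo = min(int(pos), len(samples) - 1)   # left neighbour
        hi = min(lo + 1, len(samples) - 1)     # right neighbour (clamped at the end)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```

Under this sketch, one second of audio recorded at 44.1 kHz becomes 16,000 samples, so a 30 s clock-in recording would occupy 480,000 samples.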

Each piece of corrected clock-in audio data is converted into N₂ corresponding pieces of feature data, and the N₂ pieces of feature data are concatenated into a feature vector; the N₁ feature vectors are then concatenated into the audio feature tensor.

In one embodiment, Fig. 3 shows a schematic diagram of the one-dimensional feature vector provided by an embodiment of the present application. The N₂ pieces of feature data comprise 1 fundamental frequency, 1 silent-segment percentage, 1 average energy value, 40 Mel-spectrum values, 13 Mel-cepstrum values, and 12 first-order Mel-cepstrum values. The concatenated feature vector is a one-dimensional vector of length 68, that is, N₂ equals 68 in this case.

The daily clock-in audio feature vectors of the same person are concatenated to obtain the audio feature tensor, which is characterized by a feature dimension and a time length t [the symbols and expressions for the tensor are not reproduced in the source text]. For ease of description, the symbols introduced here are reused in the text that follows. In one embodiment, Fig. 4 shows a schematic diagram of the seven-day clock-in audio feature tensor obtained by concatenating the one-dimensional feature vectors of the same person for seven consecutive days.
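The assembly of the feature vector and of the feature tensor can be sketched as below. This is a minimal illustration, not the disclosed implementation: the feature names in `FEATURE_LAYOUT` and both function names are hypothetical, and feature extraction itself is assumed to have been done elsewhere. Only the group sizes (1, 1, 1, 40, 13, 12) and the seven-day length come from the description.

```python
# Feature groups and sizes as listed in the description; the key names are assumed.
FEATURE_LAYOUT = [
    ("fundamental_frequency", 1),
    ("silent_segment_percentage", 1),
    ("average_energy", 1),
    ("mel_spectrum", 40),
    ("mel_cepstrum", 13),
    ("first_order_mel_cepstrum", 12),
]
N2 = sum(size for _, size in FEATURE_LAYOUT)  # length of one feature vector


def build_feature_vector(features):
    """Concatenate one day's feature groups into a flat vector of length N2."""
    vec = []
    for name, size in FEATURE_LAYOUT:
        values = list(features[name])
        if len(values) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(values)}")
        vec.extend(values)
    return vec


def build_feature_tensor(daily_features):
    """Stack N1 daily feature vectors into an N1 x N2 nested list."""
    return [build_feature_vector(day) for day in daily_features]
```

With seven days of input, `build_feature_tensor` returns seven rows of 68 values each, matching the seven-day tensor described for Fig. 4.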

Step S130, inputting the audio feature tensor into the initial detection model, and outputting the first random variable and the second random variable through the initial detection model;

In this embodiment, a variational autoencoder is used to construct and train the initial detection model. The variational network can be expressed as [expression not reproduced in the source text], in terms of the input audio tensor, the layer parameters ϕ of the variational network, and two random latent variables, namely the first random variable and the second random variable. The first random variable learns the dependency-information embedding between features, and the second random variable learns the temporal embedding between features. The second random variable is obtained from the input audio tensor through the preset convolution layer; see Formula 1:

[Formula 1 not reproduced in the source text]

where k denotes the length of the second random variable after the convolution operation, which is determined by the number of convolution kernels and the sliding-window stride. The second random variable is restored to its original size through the deconvolution layer, in preparation for subsequent decoding.
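How a one-dimensional convolution shortens a sequence to length k, and how a transposed convolution restores the original length, can be illustrated with the sketch below. It is single-channel and unpadded, and it takes k from the kernel size and stride, which is an assumption made here; the function names and the example values are likewise hypothetical and not taken from the source.

```python
def conv1d(x, kernel, stride=1):
    """Unpadded 1-D cross-correlation; output length is (len(x) - len(kernel)) // stride + 1."""
    k = len(kernel)
    n_out = (len(x) - k) // stride + 1
    return [
        sum(x[i * stride + j] * kernel[j] for j in range(k))
        for i in range(n_out)
    ]


def conv_transpose1d(z, kernel, stride=1):
    """Transposed 1-D convolution; output length is (len(z) - 1) * stride + len(kernel)."""
    k = len(kernel)
    out = [0.0] * ((len(z) - 1) * stride + k)
    for i, value in enumerate(z):
        for j in range(k):
            out[i * stride + j] += value * kernel[j]  # scatter each input over k outputs
    return out
```

For a seven-step sequence with kernel size 3 and stride 2, `conv1d` yields 3 values, and `conv_transpose1d` with the same kernel size and stride yields 7 again. The restored values are not the original ones; only the length is recovered, and the layer weights must be learned.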

Step S140, training the initial detection model according to the optimization function to obtain the corrected detection model;

In the embodiments of the present application, the model is trained by optimizing the evidence lower bound (ELBO). In one embodiment, the step of training the initial detection model according to the optimization function includes:

For the optimization function, see Formula 2:

[Formula 2 not reproduced in the source text]

Expanding Formula 2 gives:

[expanded formula not reproduced in the source text]

where the terms denote, respectively, the training loss, the mathematical expectation over the audio feature tensor, the posterior probability of the generative network for the audio feature tensor, the posterior probability of the variational network for the audio feature tensor, and the KL divergence.

θ and ϕ are adjusted by stochastic gradient variation and reparameterization, and the training loss is computed from the adjusted θ and ϕ; when the training loss is less than the loss threshold, the adjusted θ and ϕ are saved.
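For orientation only, an objective of the kind described, with a reconstruction expectation, a KL regularization term, and a weighting constant, is commonly written in the following form. This is a hedged reconstruction under assumptions, not the formula of the source: the symbols x (audio feature tensor), z₁ and z₂ (first and second random variables), β (the constant), the prior p_θ(z₁, z₂), and the sign convention (a loss to be minimized, that is, the negative ELBO) are all assumed.

```latex
\mathcal{L}(\theta, \phi; x)
  = -\,\mathbb{E}_{q_\phi(z_1, z_2 \mid x)}\!\left[ \log p_\theta(x \mid z_1, z_2) \right]
    + \beta \, D_{\mathrm{KL}}\!\left( q_\phi(z_1, z_2 \mid x) \,\middle\|\, p_\theta(z_1, z_2) \right)
```

Under this reading, minimizing the first term raises the probability of reconstructing the input from the two random variables, while the second term keeps the variational distribution close to the prior.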

The KL divergence describes the difference between two probability distributions. Here the KL term serves as a regularization term, whose role is to give the variational distribution a certain degree of randomness. The optimization objective is for the variational distribution to be as close as possible to the posterior distribution, and for the probability of reconstructing the input audio tensor from the first random variable and the second random variable to be as high as possible. Stochastic gradient variational Bayes (SGVB) estimation and reparameterization can therefore be used to optimize the parameters θ and ϕ so that the training loss is minimized.

Specifically, several points can first be sampled from the distribution and integrated by Monte Carlo integration. However, the sampled data are discrete; in other words, the sampled data are not differentiable, so the parameters cannot subsequently be optimized by gradient back-propagation. The reparameterization trick can then be applied: a parameter of known form is introduced so that the sampling becomes differentiable.
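The reparameterization trick can be sketched as follows, assuming a diagonal Gaussian variational distribution, which is the usual choice but is not stated in the source. The function names are hypothetical. The noise ε is drawn from a fixed standard normal, so the sample z = μ + σ·ε is a deterministic, differentiable function of μ and σ.

```python
import random


def reparameterize(mu, sigma, eps):
    """Return z = mu + sigma * eps element-wise.

    With eps held fixed, dz/dmu = 1 and dz/dsigma = eps, so gradients can
    flow back to the parameters that produced mu and sigma.
    """
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


def sample_latent(mu, sigma, rng):
    """Draw eps ~ N(0, 1) per dimension, then reparameterize."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return reparameterize(mu, sigma, eps), eps
```

For example, `sample_latent([0.0, 0.0], [1.0, 1.0], random.Random(0))` returns a sample equal to the noise itself, since μ is zero and σ is one.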

Step S150, inputting the first random variable and the second random variable into the corrected detection model to generate the reconstruction tensor corresponding to the audio feature tensor;

The first random variable is mapped through the linear transformation layer to obtain a mapping result; the second random variable is input into the preset deconvolution layer to obtain a deconvolution result; the mapping result and the deconvolution result are concatenated to obtain a concatenation result; and the concatenation result is decoded through the fully connected layer to obtain the reconstruction tensor.

As shown in Fig. 2, the input audio tensor passes through the preset convolution layer to yield the second random variable. Because the feature data may contain anomalous data, overfitting tends to occur while the autoencoder is trained. To prevent the model from overfitting the anomalous data, a moving average is therefore applied to the second random variable to eliminate anomalous feature points. After the anomalous feature points are eliminated, the result is fed into the gated recurrent layer GRU for encoding, which yields the first random variable. The first random variable learns the dependency-information embedding between features, and its length is the same as that of the input; see Formula 3:

[Formula 3 not reproduced in the source text]

where the dimension of the first random variable is determined by the output-layer dimension of the gated recurrent layer GRU.
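The moving-average step can be sketched as below. The centered window, the default width of 3, and the shrinking of the window at the sequence boundaries are assumptions; the source says only that a moving average is applied to suppress anomalous feature points. The function name is hypothetical.

```python
def moving_average(values, window=3):
    """Centered moving average that keeps the output the same length as the input.

    Assumes an odd window; at the edges the window shrinks to the available samples.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out
```

Applied to `[1.0, 1.0, 10.0, 1.0, 1.0]`, the spike of 10.0 is flattened to 4.0 and spread over its neighbours, which illustrates both the damping effect and the cost of blurring adjacent points.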

The generative network can be expressed as [expression not reproduced in the source text], where θ denotes the layer parameters of the generative network and the inputs are the first random variable and the second random variable. The first random variable is mapped to obtain a mapping result; the second random variable is input into the preset deconvolution layer to obtain a deconvolution result; the mapping result and the deconvolution result are joined by a concatenation function (the concat function) to obtain a concatenation result; and the fully connected layer decodes the concatenated result, that is, the dependency-information embedding and the temporal embedding between features are decoded jointly, to generate the reconstruction tensor of the original audio, whose size is the same as that of the original input; see Formula 4:

[Formula 4 not reproduced in the source text]
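The decoding path for a single time step (linear mapping, concatenation, fully connected decoding) can be sketched as follows. All names, shapes, and weights are hypothetical; the deconvolution result is taken as an already computed input, and activation functions, which the source does not mention, are omitted.

```python
def linear(x, weights, bias):
    """Affine map y = W x + b for one vector; weights is a list of rows."""
    return [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def decode_step(z1_t, deconv_t, map_w, map_b, dense_w, dense_b):
    """Decode one time step from the two embeddings.

    z1_t     : first random variable at this step (dependency embedding)
    deconv_t : deconvolution result for this step (temporal embedding)
    """
    mapped = linear(z1_t, map_w, map_b)      # linear transformation layer
    joined = mapped + deconv_t               # concat of the two results
    return linear(joined, dense_w, dense_b)  # fully connected layer
```

In a full model, `dense_w` would have as many rows as the feature dimension of the input, so that each decoded step has the same size as the corresponding step of the original tensor.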

Step S160, performing an anomaly evaluation calculation on the reconstruction tensor to obtain the anomaly score corresponding to the audio feature tensor;

The step of performing an anomaly evaluation calculation on the reconstruction tensor includes:

sampling the reconstruction tensor to obtain L reconstructed samples; performing Monte Carlo integration on the L reconstructed samples to obtain a reconstruction probability; and taking the negative of the reconstruction probability to obtain the anomaly score corresponding to the audio feature tensor. Specifically, see Formula 5:

[Formula 5 not reproduced in the source text]

where the formula defines the anomaly score, which is to be understood as the mathematical expectation of the outlier value of the reconstruction tensor, by Monte Carlo integration over the L reconstructed samples; each reconstructed sample is obtained by sampling, and the formula involves the probability of the l-th reconstructed sample.

In anomaly detection, the reconstruction probability is used as the anomaly indicator. Suppose the input consists of observed data and missing data, and suppose the missing data follow the distribution of the observed data, so that the missing data can be sampled from that distribution. The observations are reconstructed given the observed data to obtain the missing values, which then conform to, that is, lie close to, the normal pattern of the observed data. For the reconstructed data, the reconstruction probability can be computed by Monte Carlo integration over L samples, and the anomaly score is the negative of the reconstruction probability; the calculation is given by Formula 5 above.

Step S170, if the anomaly score is greater than or equal to the anomaly threshold, determining that the initial clock-in audio data is abnormal. An anomaly threshold is set; when the computed anomaly score is greater than the threshold, the initial clock-in audio data is flagged as abnormal.
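The scoring and thresholding of steps S160 and S170 can be sketched as follows. The Gaussian likelihood, the representation of each of the L decoder draws as a (mean, standard deviation) pair, and all function names are assumptions; the source specifies only a Monte Carlo estimate over L samples, negation of the reconstruction probability, and a comparison with a threshold.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def reconstruction_probability(x, decoder_samples):
    """Monte Carlo average of p(x | mu_l, sigma_l) over L decoder draws.

    decoder_samples is a list of L (mu, sigma) pairs, each a per-feature list.
    A real implementation would work in the log domain to avoid underflow
    when many features are multiplied together.
    """
    total = 0.0
    for mu, sigma in decoder_samples:
        p = 1.0
        for xi, m, s in zip(x, mu, sigma):
            p *= gaussian_pdf(xi, m, s)
        total += p
    return total / len(decoder_samples)


def anomaly_score(x, decoder_samples):
    """Negative reconstruction probability: higher means more anomalous."""
    return -reconstruction_probability(x, decoder_samples)


def is_abnormal(score, threshold):
    """Step S170 comparison: abnormal when the score reaches the threshold."""
    return score >= threshold
```

Because the score is a negated probability, it is never positive under this sketch, and a usable threshold is therefore a negative number close to zero; how the threshold is chosen is not described in the source.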

Referring to Fig. 4 and Fig. 5, in a specific embodiment, clock-in audio data were collected from 10 volunteers for 7 consecutive days. Fig. 4 shows the spatial sequence corresponding to the processed clock-in audio of one volunteer with anomalies over the 7 consecutive days, and Fig. 3 shows the one-dimensional feature vector on the corresponding time series for that volunteer. The 7 consecutive days of clock-in data are converted into an audio feature tensor, and anomaly monitoring is then performed on both the time series and the spatial sequence. The model can detect data that are clearly abnormal on the time series, and can also detect anomalies between features within the audio of a single day: the trends of the fundamental frequency on the first day (510 in Fig. 5) and of the silent-segment percentage on the sixth day (520 in Fig. 5) run opposite to the usual trend between data features, and the features of the fourth day (530 in Fig. 5) are clearly abnormal relative to the data of the preceding three days. After the clock-in on the fourth day, the corrected detection model issued a timely warning. An interview with the volunteer revealed that, owing to the effects of sleep, the volunteer had felt bored and resistant when clocking in; after psychological counseling, the subsequent clock-in data returned to normal.

The audio anomaly detection method provided in this embodiment uses a variational autoencoder to jointly encode temporal and spatial data and, for the first time, performs anomaly detection on consecutive clock-in audio from the same target. It can be used to monitor the daily state of personnel, the operating state of machines, and the like, to issue timely warnings, and to help enterprises, government agencies, and similar organizations manage more effectively.

Embodiment 2

In addition, an embodiment of the present disclosure provides an audio anomaly detection device.

Specifically, as shown in Figure 6, the audio anomaly detection device 600 includes:

a construction module 610, configured to construct an initial detection model based on a variational network and a generation network;

a first generation module 620, configured to generate an audio feature tensor based on initial clock-in audio data;

an input module 630, configured to input the audio feature tensor into the initial detection model and to output a first random variable and a second random variable through the initial detection model;

a training module 640, configured to train the initial detection model according to an optimization function to obtain a corrected detection model;

a second generation module 650, configured to input the first random variable and the second random variable into the corrected detection model and to generate a reconstruction tensor corresponding to the audio feature tensor;

a calculation module 660, configured to perform an anomaly evaluation calculation on the reconstruction tensor to obtain an anomaly score corresponding to the audio feature tensor;

a determination module 670, configured to determine that the initial clock-in audio data is abnormal if the anomaly score is greater than or equal to an anomaly threshold.

The audio anomaly detection device 600 provided in this embodiment can implement the audio anomaly detection method provided in Embodiment 1; to avoid repetition, the details are not repeated here.

The audio anomaly detection device provided in this embodiment uses a variational autoencoder to jointly encode temporal and spatial data and, for the first time, performs anomaly detection on consecutive clock-in audio from the same target. It can be used to monitor the daily state of personnel, the operating state of machines, and the like, to issue timely warnings, and to help enterprises, government agencies, and similar organizations manage more effectively.

Embodiment 3

In addition, an embodiment of the present disclosure provides an electronic device including a memory and a processor. The memory stores a computer program which, when run on the processor, executes the audio anomaly detection method provided in Embodiment 1.

The electronic device provided in this embodiment of the present invention can execute the steps that the audio anomaly detection device can execute in the foregoing method embodiment; the details are not repeated here.

The electronic device provided in this embodiment uses a variational autoencoder to jointly encode temporal and spatial data and, for the first time, performs anomaly detection on consecutive clock-in audio from the same target. It can be used to monitor the daily state of personnel, the operating state of machines, and the like, to issue timely warnings, and to help enterprises, government agencies, and similar organizations manage more effectively.

Embodiment 4

The present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the audio anomaly detection method provided in Embodiment 1.

In this embodiment, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.

The computer-readable storage medium provided in this embodiment can implement the audio anomaly detection method provided in Embodiment 1; to avoid repetition, the details are not repeated here.

It should be noted that, in this document, the terms "comprise", "include", and any other variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal that comprises the element.

From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or alternatively by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disk) and includes several instructions that cause a terminal (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to execute the methods described in the embodiments of the present application.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the specific implementations described, which are illustrative rather than restrictive. Guided by the present application, and without departing from its purpose or from the scope protected by the claims, those of ordinary skill in the art may devise many other forms, all of which fall within the protection of the present application.

Claims

1. An audio anomaly detection method, the method comprising:

constructing an initial detection model based on a variational network and a generation network;

generating an audio feature tensor based on initial clock-in audio data;

inputting the audio feature tensor into the initial detection model, and outputting a first random variable and a second random variable through the initial detection model;

training the initial detection model according to an optimization function to obtain a corrected detection model;

inputting the first random variable and the second random variable into the corrected detection model, and generating a reconstruction tensor corresponding to the audio feature tensor;

performing an anomaly evaluation calculation on the reconstruction tensor to obtain an anomaly score corresponding to the audio feature tensor;

and if the anomaly score is greater than or equal to an anomaly threshold, determining that the initial clock-in audio data is abnormal.

2. The audio anomaly detection method according to claim 1, wherein the step of generating an audio feature tensor based on the initial clock-in audio data comprises:

obtaining N₁ pieces of initial clock-in audio data;

preprocessing each piece of initial clock-in audio data to obtain N₁ pieces of corrected clock-in audio data;

converting each piece of corrected clock-in audio data into N₂ corresponding feature data, and concatenating the N₂ feature data into a feature vector;

concatenating the N₁ feature vectors into the audio feature tensor.
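As an illustration of the tensor assembly recited in claim 2, the following minimal Python sketch builds an N₁ × N₂ structure from N₁ recordings. The three features used here (RMS energy, zero-crossing rate, and silent-sample fraction), the function names, and the `silence_level` parameter are assumptions chosen for simplicity; the claim itself does not say which N₂ features are extracted.

```python
import math
from typing import List


def extract_features(samples: List[float], silence_level: float = 0.01) -> List[float]:
    """Turn one recording into a feature vector of length N2 = 3 (hypothetical features)."""
    n = len(samples)
    # Root-mean-square energy of the recording.
    rms = math.sqrt(sum(s * s for s in samples) / n)
    # Fraction of adjacent sample pairs whose sign differs.
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (n - 1)
    # Fraction of samples whose magnitude is below the silence level.
    silent_fraction = sum(1 for s in samples if abs(s) < silence_level) / n
    return [rms, zcr, silent_fraction]


def build_feature_tensor(recordings: List[List[float]]) -> List[List[float]]:
    """Stack the N1 feature vectors row by row into an N1 x N2 structure."""
    return [extract_features(r) for r in recordings]


# N1 = 3 synthetic recordings standing in for preprocessed clock-in audio.
recordings = [
    [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)],
    [0.0] * 100,
    [0.5 if t % 2 == 0 else -0.5 for t in range(100)],
]
tensor = build_feature_tensor(recordings)
print(len(tensor), len(tensor[0]))
```

Each row corresponds to one clock-in recording, so consecutive days line up along the first axis.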

3. The audio anomaly detection method according to claim 2, wherein the step of preprocessing each piece of initial clock-in audio data comprises:

removing the background noise from each piece of initial clock-in audio data to obtain noise-reduced clock-in audio data;

and sampling the noise-reduced clock-in audio data at a preset frequency.

4. The audio anomaly detection method according to claim 1, wherein the initial detection model comprises: a preset convolution layer, a preset deconvolution layer, a gated recurrent layer, a linear transformation layer, and a fully connected layer;

the variational network consists of the preset convolution layer, the preset deconvolution layer, and the gated recurrent layer;

the generation network consists of the preset deconvolution layer, the gated recurrent layer, the linear transformation layer, and the fully connected layer.

5. The audio anomaly detection method according to claim 4, wherein the step of training the initial detection model according to an optimization function comprises:

defining the optimization function [formula not reproduced in this text], wherein its terms denote, in order: the training loss; the mathematical expectation with respect to the audio feature tensor; the posterior probability of the generation network for the audio feature tensor; the posterior probability of the variational network for the audio feature tensor; the KL divergence; and a constant; θ is a layer parameter of the generation network, and ϕ is a layer parameter of the variational network;

adjusting θ and ϕ by stochastic gradient variational estimation and reparameterization, and computing the training loss from the adjusted θ and ϕ;

when the training loss is smaller than a loss threshold, storing the adjusted θ and ϕ.
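For orientation, the quantities named in claim 5 match the terms of a weighted variational autoencoder objective. The expression below is a hedged sketch of that general form, not the patent's own formula: the symbols x (audio feature tensor), z (a latent variable standing in for the first and second random variables together), β (the constant), the prior p(z), and the sign convention (a loss to be minimized) are all assumptions.

```latex
\mathcal{L}(\theta,\phi;x)
  = -\,\mathbb{E}_{q_\phi(z\mid x)}\bigl[\log p_\theta(x\mid z)\bigr]
    + \beta\, D_{\mathrm{KL}}\bigl(q_\phi(z\mid x)\,\|\,p(z)\bigr),
\qquad
z = \mu_\phi(x) + \sigma_\phi(x)\odot\varepsilon,\quad
\varepsilon\sim\mathcal{N}(0,I)
```

The second expression is the usual reparameterization step: the noise ε is sampled independently of the parameters, so gradients with respect to θ and ϕ can pass through the sampled z.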

6. The audio anomaly detection method according to claim 5, wherein the step of generating the reconstruction tensor corresponding to the audio feature tensor comprises:

mapping the first random variable through the linear transformation layer to obtain a mapping result;

inputting the second random variable into the preset deconvolution layer to obtain a deconvolution result;

connecting the mapping result and the deconvolution result to obtain a connection result;

and decoding the connection result through the fully connected layer to obtain the reconstruction tensor.
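The data flow of claim 6 can be traced with a small pure-Python sketch. It is illustrative only: the weights are hand-picked rather than trained, the function and variable names are hypothetical, a one-dimensional transposed convolution stands in for the preset deconvolution layer, and the gated recurrent layer is omitted.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Affine map y = W x + b, used for both the linear and the fully connected layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def transposed_conv1d(x: Vector, kernel: Vector, stride: int = 1) -> Vector:
    """1-D transposed convolution: each input value scatters a scaled copy of the kernel."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, xi in enumerate(x):
        for j, kj in enumerate(kernel):
            out[i * stride + j] += xi * kj
    return out


def reconstruct(z1: Vector, z2: Vector, map_w: Matrix, map_b: Vector,
                kernel: Vector, fc_w: Matrix, fc_b: Vector) -> Vector:
    mapped = linear(map_w, map_b, z1)                  # mapping result
    deconv = transposed_conv1d(z2, kernel, stride=2)   # deconvolution result
    connected = mapped + deconv                        # connection (concatenation) result
    return linear(fc_w, fc_b, connected)               # decoded reconstruction


# Hand-picked illustrative inputs and weights (not trained values).
z1, z2 = [1.0, 2.0], [1.0, 1.0]
map_w, map_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
kernel = [1.0, 0.5]
fc_w = [[1.0] * 6, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
fc_b = [0.0, 0.0]

reconstruction = reconstruct(z1, z2, map_w, map_b, kernel, fc_w, fc_b)
print(reconstruction)
```

A real implementation would use a deep learning framework with learned parameters; the sketch only makes the map, deconvolve, concatenate, and decode order concrete.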

7. The audio anomaly detection method according to claim 1, wherein the step of performing an anomaly evaluation calculation on the reconstruction tensor to obtain an anomaly score corresponding to the audio feature tensor comprises:

sampling the reconstruction tensor to obtain L reconstruction samples;

performing Monte Carlo integration over the L reconstruction samples to obtain a reconstruction probability;

and taking the negative of the reconstruction probability to obtain the anomaly score.
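The scoring procedure of claim 7 can be sketched as a Monte Carlo average over L reconstruction samples. The sketch rests on several assumptions not stated in the claim: an isotropic Gaussian likelihood with a fixed `sigma`, averaging log-densities rather than raw probabilities for numerical stability, and random draws around a fixed center as a stand-in for sampling the reconstruction tensor. All names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def gaussian_log_density(x: Sequence[float], mean: Sequence[float], sigma: float) -> float:
    """Log-density of x under an isotropic Gaussian centered at `mean`."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (xi - mi) ** 2 / (2 * sigma ** 2)
        for xi, mi in zip(x, mean)
    )


def anomaly_score(x: Sequence[float],
                  reconstruction_samples: List[List[float]],
                  sigma: float = 1.0) -> float:
    """Negative of the Monte Carlo estimate of the reconstruction log-probability."""
    num_samples = len(reconstruction_samples)  # L in the claim
    recon_log_prob = sum(
        gaussian_log_density(x, sample, sigma) for sample in reconstruction_samples
    ) / num_samples
    return -recon_log_prob


random.seed(0)
# Stand-in for L = 50 samples drawn from the reconstruction of a "normal" input.
center = [1.0, 2.0, 3.0]
samples = [[c + random.gauss(0.0, 0.1) for c in center] for _ in range(50)]

normal_score = anomaly_score([1.0, 2.0, 3.0], samples)
odd_score = anomaly_score([4.0, -1.0, 8.0], samples)
print(normal_score < odd_score)
```

An input that the model reconstructs poorly receives a low reconstruction probability and therefore a high score, which is then compared with the anomaly threshold.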

8. An audio anomaly detection device, the device comprising:

a construction module, configured to construct an initial detection model based on a variational network and a generation network;

a first generation module, configured to generate an audio feature tensor based on initial clock-in audio data;

an input module, configured to input the audio feature tensor into the initial detection model and to output a first random variable and a second random variable through the initial detection model;

a training module, configured to train the initial detection model according to an optimization function to obtain a corrected detection model;

a second generation module, configured to input the first random variable and the second random variable into the corrected detection model and to generate a reconstruction tensor corresponding to the audio feature tensor;

a calculation module, configured to perform an anomaly evaluation calculation on the reconstruction tensor to obtain an anomaly score corresponding to the audio feature tensor;

and a determination module, configured to determine that the initial clock-in audio data is abnormal if the anomaly score is greater than or equal to an anomaly threshold.

9. An electronic device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, performs the audio anomaly detection method of any one of claims 1-7.

10. A computer-readable storage medium, characterized in that it stores a computer program which, when run on a processor, performs the audio anomaly detection method of any one of claims 1 to 7.