CN114202891A - Method and device for sending alarm indication - Google Patents

Method and device for sending alarm indication

Info

Publication number
CN114202891A
CN114202891A (application number CN202111630626.7A)
Authority
CN
China
Prior art keywords
audio signal
audio
alarm
signal
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111630626.7A
Other languages
Chinese (zh)
Inventor
袁欢
黄凯明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Streamax Technology Co Ltd
Original Assignee
Streamax Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Streamax Technology Co Ltd filed Critical Streamax Technology Co Ltd
Priority to CN202111630626.7A priority Critical patent/CN114202891A/en
Publication of CN114202891A publication Critical patent/CN114202891A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G08 SIGNALLING
    • G08B SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B21/00 Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
    • G08B21/02 Alarms for ensuring the safety of persons
    • G08B21/04 Alarms for ensuring the safety of persons responsive to non-activity, e.g. of elderly persons
    • G08B21/0407 Alarms for ensuring the safety of persons responsive to non-activity, e.g. of elderly persons based on behaviour analysis

Landscapes

  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Gerontology & Geriatric Medicine (AREA)
  • Business, Economics & Management (AREA)
  • Emergency Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Psychiatry (AREA)
  • Psychology (AREA)
  • Social Psychology (AREA)
  • Alarm Systems (AREA)

Abstract

The application provides a method and a device for sending an alarm indication, and relates to the field of electronic technology. The method includes: obtaining a first audio signal; determining a first audio feature corresponding to the first audio signal; inputting the first audio feature into an alarm model to determine whether the first audio signal includes an alarm signal; inputting the first audio feature into a voiceprint model to determine whether the first audio signal includes an audio signal of a preset user; and sending an alarm indication when the first audio signal includes both an alarm signal and an audio signal of the preset user. By sending the alarm indication only upon detecting that the audio signal includes both an alarm signal and the preset user's audio signal, the method solves the problem that the preset user cannot manually dial a call to raise an alarm.

Description

Method and device for sending alarm indication
Technical Field
The application belongs to the technical field of electronics, and particularly relates to a method and a device for sending an alarm indication.
Background
In daily life, accidents are ubiquitous. A driver of a transportation device (e.g., a truck) who drives for a long time may suffer from fatigue, which can lead to a traffic accident; at that moment the driver may be unable to manually place an alarm call to seek help, delaying the optimal rescue window. How to solve this problem has therefore become a focus of attention.
Disclosure of Invention
The embodiments of the application provide a method and a device for sending an alarm indication, which send the alarm indication upon detecting that an audio signal includes both an alarm signal and an audio signal of a preset user, thereby solving the problem that the preset user cannot manually dial a call to raise an alarm.
In order to achieve the above object, in a first aspect, an embodiment of the present application provides a method for sending an alarm indication, where the method includes:
acquiring a first audio signal;
determining a first audio characteristic corresponding to the first audio signal;
inputting the first audio feature into an alarm model, and determining whether the first audio signal comprises an alarm signal;
inputting the first audio feature into a voiceprint model, and determining whether the first audio signal comprises an audio signal of a preset user;
and sending an alarm indication under the condition that the first audio signal comprises an alarm signal and the first audio signal comprises an audio signal of a preset user.
In the above scheme, the first audio feature corresponding to the acquired first audio signal may be input into the alarm model to determine whether the first audio signal comprises an alarm signal, and into the voiceprint model to determine whether the first audio signal comprises an audio signal of a preset user. That is, if the preset user needs to raise an alarm in an emergency, the preset user can initiate a voice alarm; after the audio signal corresponding to the voice alarm is obtained, the alarm indication is sent when the first audio signal comprises both the alarm signal and the audio signal of the preset user. Alarming through this indication avoids the problem that the preset user cannot manually dial a call, so that emergency rescue can be obtained.
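The decision rule described above can be sketched as follows; the model callables and the feature argument are hypothetical stand-ins, not the patent's actual implementation:

```python
# Minimal sketch of the claimed decision rule, assuming each model is a
# callable returning a boolean detection result for the given features.
def should_send_alarm(audio_features, alarm_model, voiceprint_model) -> bool:
    """Send an alarm indication only when the audio contains an alarm
    signal AND the voice of the preset user."""
    contains_alarm = alarm_model(audio_features)
    is_preset_user = voiceprint_model(audio_features)
    return contains_alarm and is_preset_user
```

A shout by a stranger, or ordinary speech by the preset user, would each fail one of the two checks and send no indication.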
Optionally, the alarm model and the voiceprint model are obtained by using the same training data and test data.
In the above scheme, since the alarm model and the voiceprint model are obtained using the same training data and test data, the same audio feature can be input into the different models to obtain different detection results: the first audio feature is input into the alarm model to determine whether the first audio signal comprises the alarm signal, and into the voiceprint model to determine whether the first audio signal comprises the audio signal of the preset user.
Optionally, the method further comprises:
inputting the first audio feature into a human voice model, and determining whether the first audio signal comprises a human voice signal;
in the method, when the first audio signal comprises an alarm signal and the first audio signal comprises an audio signal of a preset user, sending an alarm indication comprises:
and sending an alarm indication under the condition that the first audio signal comprises a human voice signal, the first audio signal comprises an alarm signal and the first audio signal comprises an audio signal of a preset user.
Optionally, inputting the first audio feature into an alarm model, and determining whether the first audio signal includes an alarm signal, includes:
if the first audio signal comprises a human voice signal, inputting the first audio feature into the alarm model, and determining whether the first audio signal comprises an alarm signal; and/or,
inputting the first audio characteristic into a voiceprint model, and determining whether the first audio signal comprises an audio signal of a preset user, wherein the method comprises the following steps:
and if the first audio signal comprises the alarm signal, inputting the first audio characteristic into the voiceprint model, and determining whether the first audio signal comprises the audio signal of the preset user.
Optionally, inputting the first audio feature into a human voice model, and determining whether the first audio signal includes a human voice signal, includes:
inputting the first audio feature into the human voice model to obtain a first probability value that the first audio signal includes a human voice signal;
if the first probability value is larger than a first preset probability threshold, determining that the first audio signal comprises a human voice signal;
and if the first probability value is smaller than or equal to the first preset probability threshold, determining that the first audio signal does not comprise a human voice signal.
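The thresholding rule above (strictly greater counts as containing a human voice; equal or below does not) reduces to a one-line comparison; the 0.5 default here is illustrative, not from the patent:

```python
def includes_human_voice(first_probability: float,
                         first_threshold: float = 0.5) -> bool:
    # Strictly greater than the first preset probability threshold counts
    # as containing a human voice signal; equal or below does not.
    return first_probability > first_threshold
```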
Optionally, the method further comprises:
acquiring a video signal of an environment corresponding to a first audio signal;
determining whether the environment corresponding to the first audio signal is in an abnormal state or not according to the video signal;
in the method, when the first audio signal comprises an alarm signal and the first audio signal comprises an audio signal of a preset user, sending an alarm indication comprises:
and sending an alarm indication under the condition that the environment corresponding to the first audio signal is in an abnormal state, the first audio signal comprises a human voice signal, the first audio signal comprises an alarm signal, and the first audio signal comprises an audio signal of a preset user.
In the above scheme, if the video signal of the environment corresponding to the first audio signal were not obtained and the abnormal state of the environment were not determined from it, the alarm indication would be sent merely upon detecting that the first audio signal comprises a human voice signal, an alarm signal and an audio signal of the preset user; the alarm could then be triggered when the preset user accidentally speaks the alarm phrase without intending to alarm. Checking the video signal as well therefore avoids false alarms.
Optionally, inputting the first audio feature into an alarm model, and determining whether the first audio signal includes an alarm signal, includes:
inputting the first audio feature into the alarm model to obtain a second probability value that the first audio signal includes the alarm signal;
if the second probability value is larger than a second preset probability threshold, determining that the first audio signal comprises an alarm signal;
and if the second probability value is smaller than or equal to the second preset probability threshold, determining that the first audio signal does not comprise an alarm signal.
Optionally, inputting the first audio feature into a voiceprint model, and determining whether the first audio signal includes an audio signal of a preset user, including:
inputting the first audio feature into the voiceprint model to obtain a first embedding vector of the voiceprint feature corresponding to the first audio signal;
determining a first cosine distance value according to the first embedding vector and a second embedding vector, wherein the second embedding vector is obtained by inputting an audio signal of a preset user into a voiceprint model;
if the first cosine distance value is smaller than a first preset distance threshold value, determining that the first audio signal comprises an audio signal of a preset user;
and if the first cosine distance value is greater than or equal to a first preset distance threshold value, determining that the first audio signal does not comprise the audio signal of the preset user.
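A minimal sketch of the cosine-distance comparison described above, assuming the embeddings are plain numeric vectors; the threshold value used here is illustrative:

```python
import math

def cosine_distance(a, b) -> float:
    """1 minus the cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def matches_preset_user(first_embedding, second_embedding,
                        first_distance_threshold: float) -> bool:
    # Strictly below the first preset distance threshold counts as the
    # preset user's audio; equal or above does not.
    return cosine_distance(first_embedding, second_embedding) < first_distance_threshold
```

Identical voiceprints give distance 0; unrelated ones drift toward 1, so a small threshold accepts only close matches.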
Optionally, determining a first audio characteristic corresponding to the first audio signal includes:
extracting a mel frequency characteristic of the first audio signal;
from the mel-frequency features, a first audio feature is determined.
In the above scheme, extracting mel frequency features retains the most distinctive information in the first audio signal, for example the speaker's voiceprint characteristics, while discarding irrelevant components such as background noise. This not only reduces the data complexity of the audio signal but also improves the accuracy of audio signal recognition.
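A numpy-only sketch of mel frequency feature extraction for a single frame; this is the standard triangular mel filterbank construction, not the patent's own implementation, and the frame and band sizes are illustrative:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=26, n_fft=512, sr=16000):
    """Triangular mel filters mapping an FFT power spectrum to mel bands."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):           # rising slope of the triangle
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling slope of the triangle
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_features(frame, sr=16000, n_fft=512, n_mels=26):
    """Log-mel energies of one audio frame: power spectrum -> mel bands -> log."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    return np.log(mel_filterbank(n_mels, n_fft, sr) @ spectrum + 1e-10)
```

The resulting per-frame vector (here 26 log-mel energies) is what a CNN front end would consume.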
Optionally, the mel-frequency features are input into a convolutional neural network structure, audio features including human voice features, alarm features and voiceprint features are extracted, and the audio features are determined as first audio features.
In a second aspect, an embodiment of the present application provides an apparatus for sending an alarm indication, where the apparatus includes:
the acquisition module is used for acquiring a first audio signal;
the determining module is used for determining a first audio characteristic corresponding to the first audio signal;
the determining module is further used for inputting the first audio characteristics into the alarm model and determining whether the first audio signals comprise alarm signals;
the determining module is further configured to input the first audio feature into the voiceprint model, and determine whether the first audio signal includes an audio signal of a preset user;
and the sending module is used for sending an alarm indication under the condition that the first audio signal comprises an alarm signal and the first audio signal comprises an audio signal of a preset user.
Optionally, the determining module is further configured to input the first audio feature into a human voice model, and determine whether the first audio signal includes a human voice signal.
Optionally, the sending module is specifically configured to send an alarm indication when the first audio signal includes a human voice signal, the first audio signal includes an alarm signal, and the first audio signal includes an audio signal of a preset user.
Optionally, the determining module is specifically configured to, if the first audio signal includes a human voice signal, input the first audio feature into the alarm model, and determine whether the first audio signal includes an alarm signal; and/or,
and if the first audio signal comprises the alarm signal, inputting the first audio characteristic into the voiceprint model, and determining whether the first audio signal comprises the audio signal of the preset user.
Optionally, the determining module is further specifically configured to input the first audio feature into a human voice model, so as to obtain a first probability value that the first audio signal includes a human voice signal;
if the first probability value is larger than a first preset probability threshold value, determining that the first audio signal comprises a human voice signal;
and if the first probability value is smaller than or equal to the first preset probability threshold, determine that the first audio signal does not include a human voice signal.
Optionally, the obtaining module is further configured to obtain a video signal of an environment corresponding to the first audio signal.
Optionally, the determining module is further configured to determine whether an environment corresponding to the first audio signal is in an abnormal state according to the video signal.
Optionally, the sending module is further specifically configured to send an alarm indication when an environment corresponding to the first audio signal is in an abnormal state, the first audio signal includes a human voice signal, the first audio signal includes an alarm signal, and the first audio signal includes an audio signal of a preset user.
Optionally, the determining module is further specifically configured to input the first audio feature into the alarm model, so as to obtain a second probability value of the first audio signal including the alarm signal;
if the second probability value is larger than a second preset probability threshold value, determining that the first audio signal comprises an alarm signal;
and if the second probability value is smaller than or equal to a second preset probability threshold value, determining that the first audio signal does not comprise the alarm signal.
Optionally, the determining module is further specifically configured to input the first audio feature into the voiceprint model to obtain a first embedding vector of the voiceprint feature corresponding to the first audio signal;
determining a first cosine distance value according to the first embedding vector and a second embedding vector, wherein the second embedding vector is obtained by inputting an audio signal of a preset user into a voiceprint model;
if the first cosine distance value is smaller than a first preset distance threshold value, determining that the first audio signal comprises an audio signal of a preset user;
and if the first cosine distance value is greater than or equal to a first preset distance threshold value, determining that the first audio signal does not comprise the audio signal of the preset user.
Optionally, the determining module is further configured to determine the first audio feature according to the mel-frequency feature after extracting the mel-frequency feature of the first audio signal.
Optionally, the determining module is further specifically configured to input the mel-frequency features into a convolutional neural network structure, extract audio features including a human voice feature, an alarm feature, and a voiceprint feature, and determine the audio features as the first audio features.
In a third aspect, an embodiment of the present application provides an apparatus for sending an alarm indication, including a processor, coupled with a memory, where the processor is configured to execute a computer program or instructions stored in the memory to implement the method of the first aspect or any implementation manner of the first aspect.
Compared with the prior art, the embodiments of the application have the following advantage: when a preset user needs to give an alarm in an emergency, the preset user can initiate a voice alarm; after the audio signal corresponding to the voice alarm is obtained, the alarm indication is sent once it is determined that the first audio signal comprises both the alarm signal and the audio signal of the preset user. Alarming through this indication avoids the problem that the preset user cannot manually dial a call, so that emergency rescue can be obtained.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic view of a transportation device according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating a method for sending an alarm indication according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of obtaining a first audio feature from a first audio signal according to an embodiment of the present application;
FIG. 4(a) is a schematic flowchart of a process for determining whether to send an alarm indication according to an embodiment of the present application;
FIG. 4(b) is a schematic view of another process for determining whether to send an alarm indication according to an embodiment of the present application;
FIG. 4(c) is a schematic view of another process for determining whether to send an alarm indication according to an embodiment of the present application;
FIG. 5 is a schematic flow chart diagram illustrating another method for determining whether to send an alarm indication according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an apparatus for sending an alarm indication according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an apparatus for sending an alarm indication according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described in detail below with reference to the embodiments of the present application.
It should be understood that the modes, situations, categories and divisions of the embodiments of the present application are for convenience only and do not limit the present application, and the features of the various modes, categories, situations and embodiments can be combined without contradiction.
It should also be understood that "first" and "second" in the embodiments of the present application are merely for distinction and do not constitute any limitation to the present application. It should also be understood that the sequence numbers of the steps do not imply an execution order; the execution order is determined by the internal logic of the steps and does not constitute any limitation on the implementation of the embodiments of the present application.
In daily life, accidents are ubiquitous. For example, when a driver of a transportation device (e.g., the truck shown in fig. 1) drives for a long time, fatigue driving may cause the truck to roll over and the driver's arm to be pinned by the vehicle seat; at that moment the driver cannot manually call the police to seek help, delaying the optimal rescue window. Therefore, how to solve this problem has become a focus of attention.
Based on the problems in the related art, the application provides a method and a device for sending an alarm indication: a first audio signal is obtained; a first audio feature corresponding to the first audio signal is determined; the first audio feature is input into an alarm model to determine whether the first audio signal comprises an alarm signal; the first audio feature is input into a voiceprint model to determine whether the first audio signal comprises an audio signal of a preset user; and an alarm indication is sent when the first audio signal comprises both an alarm signal and an audio signal of the preset user. That is, if the preset user needs to alarm in an emergency, the preset user can initiate a voice alarm; after the audio signal corresponding to the voice alarm is obtained and the first audio signal is determined to comprise both the alarm signal and the audio signal of the preset user, the alarm indication is sent. Alarming through this indication avoids the problem that the preset user cannot manually dial a call, so that emergency rescue can be obtained.
The technical solutions of the present application are described in detail below with specific embodiments, which may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
Because the implementation of the scheme of the application is completed based on the deep learning network, and the deep learning network model needs to be trained and tested before being used, the process of acquiring the alarm model, the voiceprint model and the human voice model is explained first.
For a better understanding of the solution of the present application, an example of obtaining an alarm model is provided as follows:
before the trained and tested alarm models are not obtained, the model in the training process and the model in the testing process are called a first model.
The first model comprises a fully connected layer, a Rectified Linear Unit (ReLU) activation function, a dropout function and a Softmax function. The ReLU activation function introduces nonlinearity into the first model and enhances its learning capacity; the dropout function reduces the effective parameter count of the first model and improves its generalization ability; the Softmax function maps the output of the first model to values in (0, 1) that sum to 1, satisfying the properties of a probability distribution. The acquisition of the first model is divided into a training process and a testing process. Before the first model is trained, audio data corresponding to audio signals need to be collected. After mel frequency feature extraction is performed on the obtained audio data, audio features including human voice features, alarm features and voiceprint features are extracted through a Convolutional Neural Network (CNN) structure, and the data corresponding to the audio features are divided into training data for model training and test data for model testing.
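The Softmax property described above, mapping raw outputs into (0, 1) so that they sum to 1, is the standard function:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Map raw model outputs to values in (0, 1) that sum to 1."""
    z = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return z / z.sum()
```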
For example, after extracting mel frequency features from audio data corresponding to 100 hours of collected audio signals, audio feature data including human voice features, alarm features and voiceprint features are extracted through a CNN network structure, wherein 90 hours of audio feature data are used as training data, and 10 hours of audio feature data are used as test data.
The first model training process: the alarm-signal feature in the training data (i.e., the alarm feature) is labeled 1, and the training data are input into the fully connected layer of the first model to train its weights. In each training iteration the first model produces a second probability value that the audio signal includes the alarm signal, and whether training is complete is judged from the cross entropy loss (Softmax loss) function: if the value of the Softmax loss function no longer decreases, the training of the first model is complete.
For example, after the alarm-signal feature in the training data is labeled 1 and the training data are input into the fully connected layer of the first model, the first model produces a second probability value in each iteration: at the 200th iteration the second probability value is 0.7 and the Softmax loss is −log 0.7 = 0.155; at the 210th iteration it is 0.8 with loss −log 0.8 = 0.097; at the 211th iteration it is again 0.8 with loss −log 0.8 = 0.097. Since the value of the Softmax loss function no longer changes, the training of the first model is complete.
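The worked numbers above use a base-10 logarithm; a sketch of the loss value and the stop-when-no-longer-decreasing rule:

```python
import math

def softmax_loss(probability: float) -> float:
    # Base-10 log reproduces the figures in the example above
    # (-log 0.7 = 0.155, -log 0.8 = 0.097).
    return -math.log10(probability)

def training_complete(previous_loss: float, current_loss: float) -> bool:
    # Training stops once the loss value no longer decreases.
    return current_loss >= previous_loss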
The first model testing process: the test data are input into the trained first model for testing. The second probability value that the audio signal includes the alarm signal is analyzed for each test, the minimum of these probability values is obtained, and this minimum probability value is taken as the second preset probability threshold.
For example, the second probability values of the plurality of audio signals including the alarm signal are specifically 0.4, 0.5, 0.41 and 0.45, wherein the minimum probability value is 0.4, and 0.4 is taken as the second preset probability threshold.
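The threshold selection in this test procedure, taking the smallest probability observed for audio that truly includes the alarm signal, reduces to a minimum:

```python
def pick_probability_threshold(second_probability_values):
    """Per the test procedure, the minimum probability observed for audio
    that truly includes the alarm signal becomes the preset threshold."""
    return min(second_probability_values)
```

With the example values above, 0.4, 0.5, 0.41 and 0.45, the selected threshold is 0.4.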
And finally, taking the trained and tested first model as an alarm model for testing whether the first audio signal comprises an alarm signal.
For a better understanding of the solution of the present application, an example of obtaining a voiceprint model is provided as follows:
before the trained and tested voiceprint models are not obtained, the model in the training process and the model in the testing process are called a second model.
The second model comprises a Recurrent Neural Network (RNN), a Rectified Linear Unit (ReLU) activation function, a dropout function and a Softmax function. The ReLU activation function introduces nonlinearity into the second model and enhances its learning capability; the dropout function reduces the effective parameter count of the second model and improves its generalization ability; the Softmax function maps the output of the second model to values in (0, 1) that sum to 1, satisfying the properties of a probability distribution. The acquisition of the second model is divided into a training process and a testing process. Before the second model is trained, audio data corresponding to audio signals need to be collected. After mel frequency feature extraction is performed on the obtained audio data, audio features including human voice features, alarm features and voiceprint features are extracted through a CNN structure, and the data corresponding to the audio features are divided into training data for model training and test data for model testing. The second model is trained to better extract the embedding vector of the voiceprint feature corresponding to an audio signal.
For example, after extracting mel frequency features from audio data corresponding to 100 hours of collected audio signals, audio feature data including human voice features, alarm features and voiceprint features are extracted through a CNN network structure, wherein 90 hours of audio feature data are used as training data, and 10 hours of audio feature data are used as test data.
The process of training the second model: the voiceprint feature corresponding to the audio signal of the training user in the training data is labeled 1, and the training data are input into the RNN structure of the second model to train its weights. In each training iteration the second model produces a third probability value that the audio signal belongs to the training user, and whether training is complete is judged from the cross entropy loss function: if its value no longer decreases, the training of the second model is complete.
For example, after the voiceprint feature corresponding to the audio signal of the first user in the training data is labeled 1 and the training data are input into the RNN structure of the second model, the second model produces a third probability value in each iteration: at the 240th iteration the third probability value is 0.8 and the Softmax loss is −log 0.8 = 0.097; at the 241st iteration it is 0.82 with loss −log 0.82 = 0.086; at the 242nd iteration it is again 0.82 with loss −log 0.82 = 0.086. Since the value of the Softmax loss function no longer changes, the training of the second model is complete.
The process of testing the second model: the test data are input into the trained second model. For each test input, a second cosine distance value is computed between the embedded vector of the voiceprint feature corresponding to the test user's audio signal and the embedded vector of the voiceprint feature of the test user obtained before testing. The minimum of the resulting second cosine distance values is taken as the first preset distance threshold.
For example, if the second cosine distance values obtained for audio signals of the test user are 0.32, 0.20, 0.31 and 0.24, the minimum value 0.20 is taken as the first preset distance threshold.
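Using the example values above, the threshold selection reduces to taking a minimum; a minimal sketch:

```python
# Second cosine distance values observed for genuine test-user audio
second_cosine_distances = [0.32, 0.20, 0.31, 0.24]

# The smallest observed distance becomes the first preset distance threshold
first_preset_distance_threshold = min(second_cosine_distances)
```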
Finally, the trained and tested second model is used as the voiceprint model for detecting whether the first audio signal includes the audio signal of a preset user.
For a better understanding of the solution of the present application, an example of obtaining the human voice model is provided as follows:
Before the trained and tested human voice model is obtained, the model in the training process and the model in the testing process are called the third model.
The third model comprises a fully connected layer, a Rectified Linear Unit (ReLU) activation function, a random deactivation (dropout) function and a Softmax function. The ReLU activation function introduces a nonlinear factor into the third model and enhances its learning capability; the dropout function randomly deactivates a portion of the units during training, which reduces overfitting and improves the generalization ability of the third model; the Softmax function maps the output of the third model to values in (0, 1) that sum to 1, satisfying the properties of a probability distribution. The acquisition process of the third model is divided into a training process and a testing process. Before the third model is trained, audio data corresponding to audio signals need to be collected; after Mel frequency feature extraction is performed on the collected audio data, audio features including human voice features, alarm features and voiceprint features are extracted through a CNN network structure, and the data corresponding to these audio features are divided into training data, used for training the model, and test data, used for testing the model.
For example, after extracting mel frequency features from audio data corresponding to 100 hours of collected audio signals, audio feature data including human voice features, alarm features and voiceprint features are extracted through a CNN network structure, wherein 90 hours of audio feature data are used as training data, and 10 hours of audio feature data are used as test data.
The process of training the third model: the human voice features in the training data are labeled (the label is set to 1), and the training data is input into the fully connected layer in the third model to train the weights of the fully connected layer. During training, the third model obtains a first probability value that the audio signal includes a human voice signal at each iteration, and whether the training of the third model is complete is judged according to the cross entropy Loss (Softmax Loss) function; specifically: if the value of the Softmax loss function no longer decreases, the training of the third model is complete.
For example, after the human voice features included in the training data are labeled with label 1, the training data is input into the fully connected layer in the third model, and the third model obtains a first probability value that the audio signal includes a human voice signal at each iteration. For example, at the 219th iteration the first probability value is 0.8 and the Softmax loss is -log 0.8 ≈ 0.097; at the 220th iteration the first probability value is again 0.8 and the loss is -log 0.8 ≈ 0.097. Since the value of the Softmax loss function no longer changes, the training of the third model is complete.
The process of testing the third model: the test data are input into the trained third model, and the first probability values obtained each time for audio signals that include a human voice signal are analyzed. The minimum of these first probability values is taken as the first preset probability threshold.
For example, if the first probability values obtained for audio signals that include a human voice signal are 0.54, 0.51, 0.65 and 0.56, the minimum value 0.51 is taken as the first preset probability threshold.
Finally, the trained and tested third model is used as the human voice model for detecting whether the first audio signal includes a human voice signal.
It should be understood that the training data used in the training processes of the three models (the first model, the second model and the third model) are the same, as are the test data used in their testing processes. Moreover, the network structures in the three models are not limited. For example, the first model and the third model may omit the random deactivation (dropout) function and the Softmax function, and the CNN structures in these two models may be replaced by an RNN structure, a Gated Recurrent Unit (GRU) network structure, a Long Short-Term Memory (LSTM) network structure, etc.; similarly, the RNN network in the second model may be replaced by a GRU or LSTM network structure, and the dropout function and the Softmax function may likewise be omitted.
Fig. 2 is a schematic flow chart of a method for sending an alarm indication according to an embodiment of the present application. As shown in fig. 2, the method is applied to a first device, for example a transportation device such as a truck, and includes the following steps:
s210, the first device acquires a first audio signal.
Optionally, the first device acquires the first audio signal by a camera.
S220, the first device determines a first audio feature corresponding to the first audio signal.
In an implementation manner, a mel frequency feature of the first audio signal is extracted;
inputting the Mel frequency features into a CNN network structure, extracting audio features including human voice features, alarm features and voiceprint features, and determining these audio features as the first audio feature.
Alternatively, the Mel frequency feature may be any one of Mel-Frequency Cepstral Coefficients (MFCCs), Filter Banks (Fbank) and Per-Channel Energy Normalization (PCEN) features.
In the above scheme, extracting the Mel frequency features from the first audio signal retains the most distinctive audio features in the first audio signal, such as the voiceprint features of a speaker, and removes irrelevant features such as background noise, thereby reducing the data complexity of the audio signal. The Mel frequency features are then input into a CNN network structure, in which multiple convolution kernels acquire the required features from the first audio signal, and pooling reduces the dimensionality and data volume of those features.
In the above scheme, the Mel frequency features are input into the CNN network structure, the audio features including the human voice features, the alarm features and the voiceprint features are extracted with multiple convolution kernels, and these audio features are determined as the first audio feature. In other words, the first audio feature may simultaneously include the human voice, alarm and voiceprint features: the first audio signal undergoes a single feature extraction covering all three, instead of three separate extractions for the human voice, alarm and voiceprint features respectively. Performing the extraction three times requires roughly three times as many weights as performing it once, and when convolving with multiple kernels, three separate extractions occupy correspondingly more resources to store the weights of the CNN network structures. A single feature extraction therefore reduces resource occupation and speeds up inference.
For example, as shown in fig. 3, a flow for obtaining the first audio feature is given, taking the Mel frequency feature to be an MFCC: the first audio signal is pre-emphasized, framed and windowed; a Fast Fourier Transform (FFT) is applied to the resulting frames to obtain a power spectrum; the power spectrum is passed through a Mel filter bank and the logarithm of the filter-bank energies is taken; a Discrete Cosine Transform (DCT) is then applied to the logarithm result to obtain the MFCCs; finally, the MFCCs are input into a CNN network structure and convolved with multiple convolution kernels to obtain the first audio feature, which takes the specific form of a multi-channel two-dimensional matrix, with the different weights of each convolution kernel representing the human voice feature, the alarm feature and the voiceprint feature.
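The MFCC stages listed above can be sketched end to end in NumPy; the frame length, hop, FFT size and filter counts below are assumed typical values, not parameters stated in the present application:

```python
import numpy as np

def mfcc_sketch(signal, sr=16000, frame_len=400, hop=160,
                n_fft=512, n_filters=26, n_mfcc=13):
    # 1) Pre-emphasis: boost high frequencies
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2) Framing and Hamming windowing
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # 3) FFT -> power spectrum
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # 4) Triangular Mel filter bank applied to the power spectrum
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    hz_pts = 700 * (10 ** (np.linspace(0, mel_max, n_filters + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 5) Logarithm of the Mel energies
    log_mel = np.log(power @ fbank.T + 1e-10)
    # 6) DCT-II decorrelates; keep the first n_mfcc coefficients
    basis = np.cos(np.pi / n_filters *
                   (np.arange(n_filters)[None, :] + 0.5) *
                   np.arange(n_mfcc)[:, None])
    return log_mel @ basis.T

# One second of a 440 Hz tone as a stand-in for the first audio signal
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
mfcc = mfcc_sketch(tone)  # shape: (frames, n_mfcc)
```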
It should be understood that the present application does not limit the specific type of Mel frequency feature input into the CNN network structure; the MFCC is only an example, and it may be replaced by Fbank or PCEN features, which are likewise input into the CNN network structure and convolved with multiple kernels to obtain the first audio feature.
Optionally, after the audio features including the human voice feature, the alarm feature and the voiceprint feature are extracted, the audio features may be subjected to pooling, and the obtained features are determined as the first audio features.
Alternatively, pooling may be maximum pooling or average pooling.
S230, the first device inputs the first audio characteristic into an alarm model and determines whether the first audio signal comprises an alarm signal.
Optionally, inputting the first audio characteristic into an alarm model to obtain a second probability value of the first audio signal including the alarm signal;
if the second probability value is larger than a second preset probability threshold value, determining that the first audio signal comprises an alarm signal;
and if the second probability value is smaller than or equal to a second preset probability threshold value, determining that the first audio signal does not comprise the alarm signal.
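The two comparisons above reduce to a strict greater-than check; a minimal sketch, with 0.5 as an assumed second preset probability threshold:

```python
def includes_alarm_signal(second_probability, threshold=0.5):
    # Greater than the threshold -> alarm signal present;
    # smaller than or equal -> alarm signal absent
    return second_probability > threshold
```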
S240, the first device inputs the first audio characteristic into a voiceprint model, and determines whether the first audio signal comprises an audio signal of a preset user.
Optionally, inputting the first audio feature into the voiceprint model to obtain a first embedded vector of the voiceprint feature corresponding to the first audio signal;
determining a first cosine distance value according to the first embedding vector and a second embedding vector, wherein the second embedding vector is obtained by inputting an audio signal of a preset user into a voiceprint model;
if the first cosine distance value is smaller than a first preset distance threshold value, determining that the first audio signal comprises an audio signal of a preset user;
and if the first cosine distance value is greater than or equal to a first preset distance threshold value, determining that the first audio signal does not comprise the audio signal of the preset user.
For example, the process of detecting whether the first audio signal includes the audio signal of a preset user using the voiceprint model is given as follows: before detection, the preset user inputs an audio signal carrying voiceprint features into the voiceprint model, and the voiceprint model extracts the second embedded vector of the voiceprint features corresponding to the preset user's audio signal. During detection, the voiceprint model extracts the first embedded vector of the voiceprint features corresponding to the acquired first audio signal, computes the first cosine distance between the first embedded vector and the second embedded vector, and compares it with the first preset distance threshold: if the first cosine distance is smaller than the first preset distance threshold, it is determined that the first audio signal includes the audio signal of the preset user; if the first cosine distance value is greater than or equal to the first preset distance threshold, it is determined that the first audio signal does not include the audio signal of the preset user.
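The enrollment/detection flow described above can be sketched with a cosine distance helper; the embedding values and the 0.20 threshold are illustrative assumptions:

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity between two embedding vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def is_preset_user(first_embedding, enrolled_embedding, threshold=0.20):
    # Below the first preset distance threshold -> same speaker
    return bool(cosine_distance(first_embedding, enrolled_embedding) < threshold)

enrolled = np.array([0.6, 0.8])   # second embedded vector (enrollment)
probe = np.array([0.59, 0.81])    # first embedded vector (detection)
match = is_preset_user(probe, enrolled)
```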
And S250, under the condition that the first audio signal comprises an alarm signal and the first audio signal comprises an audio signal of a preset user, the first equipment sends an alarm instruction.
Optionally, the method further comprises:
the first audio features are input into a human voice model, and whether the first audio signals comprise human voice signals or not is determined.
Optionally, inputting the first audio feature into a human voice model, and determining whether the first audio signal includes a human voice signal, includes:
inputting the first audio feature into the human voice model to obtain a first probability value that the first audio signal includes a human voice signal;
if the first probability value is larger than a first preset probability threshold value, determining that the first audio signal comprises a human voice signal;
and if the first probability value is smaller than or equal to the first preset probability threshold, determining that the first audio signal does not comprise the human voice signal.
Optionally, step S230 specifically includes: if the first audio signal comprises a human voice signal, inputting the first audio feature into the alarm model, and determining whether the first audio signal comprises an alarm signal; and/or,
step S240 specifically includes: and if the first audio signal comprises the alarm signal, inputting the first audio characteristic into the voiceprint model, and determining whether the first audio signal comprises the audio signal of the preset user.
Optionally, step S250 specifically includes: and sending an alarm indication under the condition that the first audio signal comprises a human voice signal, the first audio signal comprises an alarm signal and the first audio signal comprises an audio signal of a preset user.
It should be understood that the above scheme of step S250 only describes the conditions the first audio signal must satisfy; the alarm indication is sent regardless of the order in which it is detected whether the first audio signal includes a human voice signal, whether it includes an alarm signal, and whether it includes the audio signal of a preset user.
Based on the specific schemes in steps S230, S240 and S250, fig. 4 lists three flowcharts for detecting whether the first audio signal includes a human voice signal, an alarm signal and the audio signal of a preset user, and for deciding whether to send an alarm indication. As shown in fig. 4(a), it is first detected whether the first audio signal includes a human voice signal; if so, it is then detected whether the first audio signal includes an alarm signal; if so, it is finally detected whether the first audio signal includes the audio signal of a preset user; if so, an alarm indication is sent, and otherwise no alarm indication is sent. As shown in fig. 4(b), it is first detected whether the first audio signal includes an alarm signal; if so, it is then detected whether the first audio signal includes a human voice signal; if so, it is finally detected whether the first audio signal includes the audio signal of a preset user; if so, an alarm indication is sent, and otherwise no alarm indication is sent. As shown in fig. 4(c), it is first detected whether the first audio signal includes a human voice signal; if so, it is then detected whether the first audio signal includes the audio signal of a preset user; if so, it is finally detected whether the first audio signal includes an alarm signal; if so, an alarm indication is sent, and otherwise no alarm indication is sent.
It should be understood that fig. 4 only lists 3 detection sequences of whether the first audio signal includes a human voice signal, whether the first audio signal includes an alarm signal, and whether the first audio signal includes an audio signal of a preset user, and other detection sequences are not described herein again.
Optionally, acquiring a video signal of an environment corresponding to the first audio signal;
determining whether the environment corresponding to the first audio signal is in an abnormal state or not according to the video signal;
and if the environment corresponding to the first audio signal is in an abnormal state, and the first audio signal includes a human voice signal, an alarm signal and the audio signal of a preset user, an alarm indication is sent.
Optionally, if the environment corresponding to the first audio signal is in a normal state, the alarm indication is not sent even when the first audio signal includes a human voice signal, an alarm signal and the audio signal of a preset user.
Alternatively, if the environment is a scene in which a freight truck is traveling at high speed, the abnormal state may be a traffic accident involving the truck or a breakdown of the vehicle.
Alternatively, if the environment is a living room of a certain user, the abnormal state may be that the living room is in fire.
In the above scheme, if the video signal of the environment corresponding to the first audio signal were not obtained and checked for an abnormal state, the decision to send an alarm indication would depend only on detecting that the first audio signal includes a human voice signal, an alarm signal and the audio signal of a preset user; the alarm would then also be triggered when the preset user accidentally speaks the alarm phrase without intending to trigger an alarm. By additionally determining from the video signal whether the environment is in an abnormal state before sending the alarm indication, such false alarms can be avoided.
For example, as shown in fig. 5, a flowchart for determining whether to send an alarm indication is given. The first audio feature corresponding to the first audio signal is input into the human voice model, the alarm model and the voiceprint model, respectively, to determine whether the first audio signal includes a human voice signal, an alarm signal and the audio signal of a preset user. If all three are included, the video signal of the environment corresponding to the first audio signal is obtained and it is detected from the video signal whether the environment is in an abnormal state; if the environment is abnormal, an alarm indication is sent. In all other cases, as shown in fig. 5, no alarm indication is sent.
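The overall decision in fig. 5 is a conjunction of four conditions; a minimal sketch (the condition names are illustrative, not identifiers from the present application):

```python
def should_send_alarm(has_voice, has_alarm, is_preset_user, env_abnormal):
    # Send the alarm indication only when the first audio signal contains
    # a human voice signal, an alarm signal and the preset user's audio,
    # AND the video shows the environment in an abnormal state; the last
    # check suppresses false alarms from accidental utterances
    return has_voice and has_alarm and is_preset_user and env_abnormal
```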
It should be understood that, for the detection sequence of fig. 5 as to whether the first audio signal includes a human voice signal, whether the first audio signal includes an alarm signal, and whether the first audio signal includes an audio signal of a preset user, the detection sequence may be a sequence as shown in fig. 4(a) or fig. 4(b) or fig. 4 (c).
Fig. 6 is a schematic structural diagram of an apparatus for sending an alarm indication according to an embodiment of the present application, and as shown in fig. 6, the apparatus according to the embodiment includes:
an obtaining module 610, configured to obtain a first audio signal;
a determining module 620, configured to determine a first audio feature corresponding to the first audio signal;
the determining module is further used for inputting the first audio characteristics into the alarm model and determining whether the first audio signals comprise alarm signals;
the determining module is further configured to input the first audio feature into the voiceprint model, and determine whether the first audio signal includes an audio signal of a preset user;
the sending module 630 is configured to send an alarm indication when the first audio signal includes an alarm signal and the first audio signal includes an audio signal of a preset user.
Optionally, the determining module is further configured to input the first audio feature into a human voice model, and determine whether the first audio signal includes a human voice signal.
Optionally, the sending module is specifically configured to send an alarm indication when the first audio signal includes a human voice signal, the first audio signal includes an alarm signal, and the first audio signal includes an audio signal of a preset user.
Optionally, the determining module is specifically configured to, if the first audio signal includes a human voice signal, input the first audio feature into the alarm model, and determine whether the first audio signal includes an alarm signal; and/or,
and if the first audio signal comprises the alarm signal, inputting the first audio characteristic into the voiceprint model, and determining whether the first audio signal comprises the audio signal of the preset user.
Optionally, the determining module is further specifically configured to input the first audio feature into a human voice model, so as to obtain a first probability value that the first audio signal includes a human voice signal;
if the first probability value is larger than a first preset probability threshold value, determining that the first audio signal comprises a human voice signal;
and if the first probability value is smaller than or equal to the first preset probability threshold, determining that the first audio signal does not comprise the human voice signal.
Optionally, the obtaining module is further configured to obtain a video signal of an environment corresponding to the first audio signal.
Optionally, the determining module is further configured to determine whether an environment corresponding to the first audio signal is in an abnormal state according to the video signal.
Optionally, the sending module is further specifically configured to send an alarm indication when an environment corresponding to the first audio signal is in an abnormal state, the first audio signal includes a human voice signal, the first audio signal includes an alarm signal, and the first audio signal includes an audio signal of a preset user.
Optionally, the determining module is further specifically configured to input the first audio feature into the alarm model, so as to obtain a second probability value of the first audio signal including the alarm signal;
if the second probability value is larger than a second preset probability threshold value, determining that the first audio signal comprises an alarm signal;
and if the second probability value is smaller than or equal to a second preset probability threshold value, determining that the first audio signal does not comprise the alarm signal.
Optionally, the determining module is further specifically configured to input the first audio feature into the voiceprint model to obtain a first embedded vector of the voiceprint feature corresponding to the first audio signal;
determining a first cosine distance value according to the first embedding vector and a second embedding vector, wherein the second embedding vector is obtained by inputting an audio signal of a preset user into a voiceprint model;
if the first cosine distance value is smaller than a first preset distance threshold value, determining that the first audio signal comprises an audio signal of a preset user;
and if the first cosine distance value is greater than or equal to a first preset distance threshold value, determining that the first audio signal does not comprise the audio signal of the preset user.
Optionally, the determining module is further configured to determine the first audio feature according to the mel-frequency feature after extracting the mel-frequency feature of the first audio signal.
Optionally, the determining module is further specifically configured to input the mel-frequency feature into a CNN network structure, extract an audio feature including a human voice feature, an alarm feature, and a voiceprint feature, and determine the audio feature as the first audio feature.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the above-mentioned apparatus may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Based on the same inventive concept, fig. 7 shows a device for sending an alarm indication according to an embodiment of the present application. As shown in fig. 7, the device includes a processor coupled with a memory, and the processor is configured to execute a computer program or instructions stored in the memory to implement the method of the above embodiments. Alternatively, the device may be a passenger car, a freight truck, or the like.
The integrated units described above, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable storage medium may include at least: any entity or device capable of carrying the computer program code to the photographing apparatus/terminal apparatus, a recording medium, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunication signals.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/device and method may be implemented in other ways. For example, the above-described apparatus/device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another apparatus, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to" determining "or" in response to detecting ". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method of transmitting an alarm indication, the method comprising:
acquiring a first audio signal;
determining a first audio characteristic corresponding to the first audio signal;
inputting the first audio characteristic into an alarm model, and determining whether an alarm signal is included in the first audio signal;
inputting the first audio features into a voiceprint model, and determining whether the first audio signals comprise audio signals of a preset user;
and sending an alarm indication when the first audio signal comprises an alarm signal and comprises an audio signal of a preset user.
2. The method of claim 1, further comprising:
inputting the first audio characteristic into a human voice model, and determining whether the first audio signal comprises a human voice signal;
wherein, under the condition that the first audio signal comprises an alarm signal and the first audio signal comprises an audio signal of a preset user, sending an alarm indication, comprising:
sending the alarm indication when the first audio signal comprises a human voice signal, comprises an alarm signal, and comprises an audio signal of a preset user.
3. The method of claim 2, wherein inputting the first audio characteristic into an alarm model to determine whether the first audio signal includes an alarm signal comprises:
if the first audio signal comprises a human voice signal, inputting the first audio characteristic into an alarm model, and determining whether the first audio signal comprises an alarm signal; and/or,
the inputting the first audio feature into a voiceprint model and determining whether the first audio signal includes an audio signal of a preset user includes:
and if the first audio signal comprises an alarm signal, inputting the first audio characteristic into a voiceprint model, and determining whether the first audio signal comprises an audio signal of a preset user.
4. The method of claim 2, wherein said inputting the first audio feature into a human voice model, determining whether the first audio signal includes a human voice signal, comprises:
inputting the first audio feature into the human voice model to obtain a first probability value that the first audio signal comprises a human voice signal;
if the first probability value is larger than a first preset probability threshold value, determining that the first audio signal comprises a human voice signal;
and if the first probability value is smaller than or equal to the first preset probability threshold value, determining that the first audio signal does not include a human voice signal.
5. The method of claim 2, wherein the method further comprises:
acquiring a video signal of an environment corresponding to the first audio signal;
determining whether the environment corresponding to the first audio signal is in an abnormal state or not according to the video signal;
wherein, under the condition that the first audio signal comprises a human voice signal, the first audio signal comprises an alarm signal, and the first audio signal comprises an audio signal of a preset user, the sending of the alarm indication comprises:
sending the alarm indication when the environment corresponding to the first audio signal is in an abnormal state and the first audio signal comprises a human voice signal, comprises an alarm signal, and comprises an audio signal of a preset user.
6. The method of any of claims 1-5, wherein said inputting the first audio characteristic into an alarm model to determine whether an alarm signal is included in the first audio signal comprises:
inputting the first audio feature into an alarm model to obtain a second probability value that the first audio signal comprises an alarm signal;
if the second probability value is larger than a second preset probability threshold value, determining that the first audio signal comprises an alarm signal;
and if the second probability value is smaller than or equal to the second preset probability threshold value, determining that the first audio signal does not comprise an alarm signal.
7. The method according to any one of claims 1-5, wherein said inputting the first audio feature into a voiceprint model and determining whether the first audio signal comprises an audio signal of a preset user comprises:
inputting the first audio feature into a voiceprint model to obtain a first embedding vector of the voiceprint feature corresponding to the first audio signal;
determining a first cosine distance value according to the first embedding vector and a second embedding vector, wherein the second embedding vector is obtained by inputting the audio signal of the preset user into the voiceprint model;
if the first cosine distance value is smaller than a first preset distance threshold value, determining that the first audio signal comprises an audio signal of a preset user;
and if the first cosine distance value is greater than or equal to the first preset distance threshold value, determining that the first audio signal does not include the audio signal of the preset user.
8. The method of any of claims 1-5, wherein the determining the corresponding first audio characteristic of the first audio signal comprises:
extracting a mel frequency feature of the first audio signal;
determining the first audio feature according to the Mel frequency feature.
9. An apparatus for transmitting an alarm indication, the apparatus comprising:
the acquisition module is used for acquiring a first audio signal;
a determining module, configured to determine a first audio feature corresponding to the first audio signal;
the determining module is further configured to input the first audio feature into an alarm model, and determine whether an alarm signal is included in the first audio signal;
the determining module is further configured to input the first audio feature into a voiceprint model, and determine whether the first audio signal includes an audio signal of a preset user;
and the sending module is used for sending an alarm indication under the condition that the first audio signal comprises an alarm signal and the first audio signal comprises an audio signal of a preset user.
10. An apparatus for sending an alarm indication, comprising a processor coupled with a memory, wherein the processor, when executing a computer program or instructions stored in the memory, implements the method of any one of claims 1-8.
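The decision flow recited in claims 1-6 can be sketched as follows. The three models are treated as opaque callables; the function names, threshold values, and toy stand-in models below are illustrative assumptions, not part of the patent.

```python
# Sketch of the claimed decision flow (claims 1-6): a human-voice gate,
# then an alarm-sound gate, then a voiceprint check of the preset user.
# Thresholds p_voice / p_alarm correspond to the "first" and "second"
# preset probability thresholds; their values here are arbitrary.

def should_send_alarm(audio_feature,
                      voice_model, alarm_model, voiceprint_check,
                      p_voice=0.5, p_alarm=0.5):
    # Claim 3 variant: run the human-voice gate first and skip the
    # later models when it fails.
    if voice_model(audio_feature) <= p_voice:   # claim 4 threshold test
        return False
    if alarm_model(audio_feature) <= p_alarm:   # claim 6 threshold test
        return False
    # Claims 1/7: only alert if the speaker is the preset user.
    return voiceprint_check(audio_feature)

# Toy stand-in models for demonstration only.
alarm = should_send_alarm(
    audio_feature=[0.1, 0.2],
    voice_model=lambda f: 0.9,
    alarm_model=lambda f: 0.8,
    voiceprint_check=lambda f: True,
)
print(alarm)  # True: all three gates passed
```

Gating the heavier models behind the cheap human-voice check mirrors the claim 3 ordering and avoids running the voiceprint model on silence or background noise.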
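Claim 7's voiceprint test compares an embedding of the incoming audio against an embedding of the preset user's enrolled audio by cosine distance. A minimal NumPy sketch, with an illustrative threshold and toy vectors (the patent fixes neither the embedding dimension nor the threshold value):

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity; smaller means more similar voiceprints.
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_preset_user(first_embedding, enrolled_embedding, threshold=0.4):
    # Claim 7: declare a match when the distance between the first
    # embedding vector and the enrolled (second) embedding vector is
    # below the first preset distance threshold.
    return cosine_distance(first_embedding, enrolled_embedding) < threshold

e1 = [1.0, 0.0, 1.0]           # embedding of the incoming audio
e2 = [0.9, 0.1, 1.1]           # enrolled embedding, nearly parallel to e1
print(is_preset_user(e1, e2))  # True
```

Cosine distance depends only on the angle between embeddings, so it tolerates loudness differences between the enrollment recording and the live signal, which is why it is the usual metric for voiceprint verification.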
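Claim 8 derives the first audio feature from Mel-frequency features of the signal. The sketch below shows only the standard Hz-to-Mel mapping (HTK convention) that underlies a Mel filterbank; a full extractor would place triangular filters at these points and apply them to an STFT power spectrum. The band count and frequency range here are illustrative, not from the patent.

```python
import numpy as np

# Hz <-> Mel conversion in the common HTK convention. The Mel scale
# spaces bands more densely at low frequencies, matching human pitch
# perception, which is why it underlies MFCC / log-Mel features.

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, float) / 2595.0) - 1.0)

# Centre frequencies for an illustrative 8-band filterbank over 0-8 kHz:
# 10 equally spaced Mel edge points give 8 interior band centres.
edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 10)
centres_hz = mel_to_hz(edges_mel[1:-1])
print(np.round(centres_hz).astype(int))
```

Equal spacing on the Mel axis becomes logarithmic spacing in Hz, so the printed centres cluster toward the low end of the 0-8 kHz range.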
CN202111630626.7A 2021-12-28 2021-12-28 Method and device for sending alarm indication Pending CN114202891A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111630626.7A CN114202891A (en) 2021-12-28 2021-12-28 Method and device for sending alarm indication


Publications (1)

Publication Number Publication Date
CN114202891A 2022-03-18

Family

ID=80657013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111630626.7A Pending CN114202891A (en) 2021-12-28 2021-12-28 Method and device for sending alarm indication

Country Status (1)

Country Link
CN (1) CN114202891A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108010289A (en) * 2017-12-28 2018-05-08 深圳市永达电子信息股份有限公司 A kind of internet alarm method and system based on Application on Voiceprint Recognition
CN108538024A (en) * 2018-05-25 2018-09-14 合肥东恒锐电子科技有限公司 A kind of personal safety intellectual analysis alarm method and system by all kinds of means
CN109257362A (en) * 2018-10-11 2019-01-22 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium of voice print verification
CN110277097A (en) * 2019-06-24 2019-09-24 北京声智科技有限公司 Data processing method and relevant device
CN110782907A (en) * 2019-11-06 2020-02-11 腾讯科技(深圳)有限公司 Method, device and equipment for transmitting voice signal and readable storage medium
CN110969805A (en) * 2018-09-30 2020-04-07 杭州海康威视数字技术股份有限公司 Safety detection method, device and system
CN111063162A (en) * 2019-12-05 2020-04-24 恒大新能源汽车科技(广东)有限公司 Silent alarm method and device, computer equipment and storage medium
CN111784971A (en) * 2019-04-04 2020-10-16 北京地平线机器人技术研发有限公司 Alarm processing method and system, computer readable storage medium and electronic device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination