CN112562727A - Audio scene classification method, device and equipment applied to audio monitoring - Google Patents


Info

Publication number
CN112562727A
CN112562727A
Authority
CN
China
Prior art keywords
audio
processing
scene classification
monitoring
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011506902.4A
Other languages
Chinese (zh)
Other versions
CN112562727B (en)
Inventor
黄真明
陆春亮
王毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202011506902.4A
Publication of CN112562727A
Application granted
Publication of CN112562727B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention discloses an audio scene classification method, device, and equipment applied to audio monitoring, which address the time and resource consumption of existing audio scene classification approaches from two directions. On one hand, to ensure processing timeliness, the quality of each audio segment is judged in real time so that only valid audio segments are detected; with the audio segment as the processing unit, once a classification result is detected, the processing of the remaining available segments is stopped, greatly reducing unnecessary detection. On the other hand, by exploiting the characteristics of the RNN architecture, the information of every audio segment need not be stored during scene type detection; only the result of the previous step is used as the input of subsequent processing, which substantially saves resource space. The method therefore guarantees processing timeliness and keeps the system lightweight, so that it can be flexibly applied to audio monitoring environments of various scales.

Description

Audio scene classification method, device and equipment applied to audio monitoring
Technical Field
The present invention relates to the field of audio monitoring, and in particular, to an audio scene classification method, apparatus, and device for audio monitoring.
Background
After years of development in the field of audio monitoring, the importance of audio monitoring systems within high-quality intelligent security systems has become increasingly prominent. Because audio is not affected by obstructed sight lines, poor illumination, and similar conditions, it can compensate for visual shortcomings and provide additional information for situation assessment, playing a role that video cannot replace. Audio scene analysis is mainly used to analyze, decide on, and give early warning of abnormal behaviors occurring in the monitored environment. Its core technology raises alarms for abnormal events based on the time-domain and frequency-domain characteristics of various abnormal audio, combined with pattern-recognition classification methods; intelligently extracting and analyzing the information carried in audio signals is therefore a key link of audio scene classification technology.
In the field of audio monitoring, the currently adopted method is to slice the input audio signal into segments, extract classification information from each segment based on sparse coding, and calibrate the extracted classification information with a calibration model. Each audio signal segment is then pre-classified by a classification model, the pre-classification results are fused to obtain fused classification information, and finally the fused classification information of all segments is statistically analyzed to obtain the target classification result.
However, analysis shows that this existing scene classification process consumes both processing time and processing resources, making it difficult to meet the requirements of some specific application environments.
Disclosure of Invention
In view of the foregoing, the present invention is directed to an audio scene classification method, apparatus and device applied to audio monitoring, and accordingly provides a computer-readable storage medium and a computer program product, which can effectively solve the problem that the scene classification process required in the field of audio monitoring is time-consuming and resource-consuming.
The technical scheme adopted by the invention is as follows:
in a first aspect, the present invention provides an audio scene classification method applied to audio monitoring, including:
carrying out real-time segmentation on the acquired current audio data to obtain a plurality of audio segments;
judging whether the current audio clip is available or not according to the quality of the audio clip in real time in the segmentation process;
extracting audio features of the audio segments judged to be available in real time, and carrying out scene classification detection according to a scene classification model trained on the basis of a recurrent neural network architecture in advance;
when a scene classification result is determined based on at least one of the audio clips, the processing of the current audio data is ended.
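The four steps above can be read as a single control loop. The following is an illustrative sketch only, not the patented implementation: `classify_stream`, `is_usable`, and `step_classifier` are hypothetical names, and the quality check and classifier are supplied by the caller.

```python
# Illustrative sketch of the claimed flow: real-time segmentation, quality
# screening, per-segment RNN-style detection, and early termination.
# Helper names are hypothetical; quality check and classifier are injected.

def classify_stream(samples, seg_len, is_usable, step_classifier):
    """Feed fixed-length segments to a stateful classifier and return the
    first scene label it commits to, or None if the audio ends undecided."""
    state = None
    for start in range(0, len(samples) - seg_len + 1, seg_len):
        segment = samples[start:start + seg_len]
        if not is_usable(segment):                      # step 2: quality screening
            continue
        label, state = step_classifier(segment, state)  # step 3: RNN-style step
        if label is not None:                           # step 4: stop early
            return label
    return None
```

For example, with a trivial quality check (non-silence) and a classifier that commits after two usable segments, the loop returns without ever touching later segments.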
In at least one possible implementation manner, the scene classification detection process includes the following steps:
converting the available audio clips to obtain corresponding spectrogram;
extracting auxiliary acoustic features of the audio segment;
obtaining matching degrees of the audio segments and a plurality of preset scene audio types on an acoustic level based on the spectrogram, its N-order difference spectrum, and the auxiliary acoustic features;
and determining a scene classification result according to the matching degree.
In at least one possible implementation, the auxiliary acoustic feature includes: ambient noise features and/or speech articulation features.
In at least one possible implementation manner, the method further comprises:
if the classification result is not detected based on the currently input audio clip, judging whether the detection process exceeds a preset first timing;
if yes, terminating the processing of the current audio data and acquiring new audio data again;
and if not, continuously detecting the next section of the audio clip.
In at least one possible implementation manner, after the ending of the processing of the current audio data, the method further includes:
and after a preset second timing elapses, continuing to acquire new audio data and performing the foregoing processing.
In at least one possible implementation manner, the determining whether the current audio segment is available according to the quality of the audio segment in real time in the segmentation process includes:
calculating the short-time energy value of the current audio clip in real time;
and judging whether the current audio clip is available or not according to the relation between the short-time energy value and a preset energy threshold value.
In a second aspect, the present invention provides an audio scene classification apparatus for audio monitoring, wherein the apparatus comprises:
the segmentation module is used for segmenting the acquired current audio data in real time to obtain a plurality of audio segments;
the audio clip screening module is used for judging whether the current audio clip is available or not according to the quality of the audio clip in real time in the segmentation process;
the scene type detection module is used for extracting audio features of the audio clips which are judged to be available in real time and carrying out scene classification detection according to a scene classification model trained on the basis of a recurrent neural network architecture in advance;
and the processing termination module is used for finishing the processing of the current audio data when a scene classification result is determined based on at least one audio fragment.
In at least one possible implementation manner, the scene type detection module includes:
the frequency spectrum characteristic acquisition unit is used for converting the available audio frequency segments to obtain corresponding frequency spectrum diagrams;
the auxiliary acoustic feature acquisition unit is used for extracting auxiliary acoustic features of the audio segments;
the matching degree calculation unit is used for obtaining the matching degrees of the audio segments and the plurality of preset scene audio types on the acoustic level based on the spectrogram, its N-order difference spectrum, and the auxiliary acoustic features;
and the classification result determining unit is used for determining a scene classification result according to the matching degree.
In at least one possible implementation, the auxiliary acoustic feature includes: ambient noise features and/or speech articulation features.
In at least one possible implementation manner, the apparatus further includes a processing aging monitoring module, where the processing aging monitoring module specifically includes:
a detection timeout determining unit configured to determine whether a detection process exceeds a preset first timing if a classification result is not detected based on the currently input audio clip;
the loop processing unit is used for terminating the processing of the current audio data and re-acquiring new audio data when the output of the detection timeout determining unit is yes;
and the continuation processing unit is used for continuing to detect the next audio segment when the output of the detection timeout determining unit is no.
In at least one possible implementation manner, the apparatus further includes: a cyclic processing module;
and the cyclic processing module is used for, after the processing of the current audio data is completed and a preset second timing has elapsed, continuing to acquire new audio data and performing the foregoing processing.
In at least one possible implementation manner, the audio segment filtering module includes:
the energy calculation unit is used for calculating the short-time energy value of the current audio clip in real time;
and the audio clip screening unit is used for judging whether the current audio clip is available according to the relation between the short-time energy value and a preset energy threshold value.
In a third aspect, the present invention provides an audio scene classification device applied to audio monitoring, wherein the audio scene classification device comprises:
one or more processors; a memory, which may employ a non-volatile storage medium; and one or more computer programs stored in the memory, the one or more computer programs comprising instructions which, when executed by the device, cause the device to perform the method of the first aspect or any possible implementation of the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform at least the method as described in the first aspect or any of its possible implementations.
In a fifth aspect, the present invention also provides a computer program product for performing at least the method of the first aspect or any of its possible implementations, when the computer program product is executed by a computer.
In at least one possible implementation manner of the fifth aspect, the relevant program related to the product may be stored in whole or in part on a memory packaged with the processor, or may be stored in part or in whole on a storage medium not packaged with the processor.
The invention is conceived to solve the time and resource consumption of the existing audio scene classification approach from two directions. On one hand, for processing timeliness, quality judgment is performed on each audio segment in real time, ensuring that only valid segments meeting the quality standard are detected; to guarantee timeliness, the audio segment is taken as the processing unit, and once a classification result is detected in real time from the current segment, the processing of the other available segments of the input audio is terminated. This segment quality screening together with the per-segment processing strategy greatly reduces unnecessary detection and thus solves the time-consumption problem of detection. On the other hand, by exploiting the characteristics of the RNN architecture, the information of each audio segment need not be stored during scene type detection; only the result of the previous step is used as the input of subsequent processing, which substantially saves resource space. The invention can therefore greatly shorten the overall detection time, guarantee processing timeliness, and keep the system lightweight, so that it is flexibly applicable to audio monitoring environments of various scales.
Furthermore, in some embodiments of the present invention, starting from the acoustic level, the input audio is matched against the acoustic characteristics of various specific scene audio types without identifying the concrete content of the audio, which reduces the detection burden and improves processing timeliness.
Furthermore, in some embodiments of the present invention, a timing mechanism is provided for the detection link to ensure that the processing time meets the predetermined timeliness requirement, thereby better adapting to various audio monitoring needs.
Drawings
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings, in which:
fig. 1 is a flowchart of an embodiment of an audio scene classification method applied to audio monitoring according to the present invention;
FIG. 2 is a flow chart of a preferred embodiment of a scene classification method provided by the present invention;
FIG. 3 is a schematic diagram of an embodiment of an audio scene classification apparatus for audio monitoring according to the present invention;
fig. 4 is a schematic diagram of an embodiment of an audio scene classification device applied to audio monitoring provided in the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
Before introducing the specific technical solution of the invention, the disadvantages of the existing processing approach are explained. As mentioned above, the existing method makes the whole process time-consuming: for example, if the whole audio is divided into 10 segments numbered 1 to 10 but only segments 4 to 8 actually contain specific audio scene information, the existing approach still waits until segment 10 has been processed before outputting the final scene classification result.
Moreover, the existing scene classification method needs to store the pre-classification information of each audio segment, because all of this information must be collected before the fused classification information can be statistically processed. Since every segment must be processed, the existing scheme inevitably occupies a large amount of memory and CPU resources, making it difficult to run on small edge computing nodes.
In addition, some existing scene classification methods may also compromise the accuracy of the final classification result through incomplete audio segmentation. For example, suppose the input audio is divided into 100 segments numbered 1 to 100, and the segments containing the specific audio scene information are segments 3 to 6. If every 5 segments are treated as one processing unit during scene type detection, say segments 1 to 5 as one unit and segments 6 to 10 as another, the prediction result may be biased because the scene information is split across units and each unit sees incomplete audio information.
In view of this, the inventor considered addressing the above problems in an integrated manner, so as to avoid complicated solutions, high cost, and information loss. Specifically, the present invention provides an embodiment of an audio scene classification method applied to audio monitoring, which, referring to fig. 1, may specifically include:
and step S1, carrying out real-time segmentation on the acquired current audio data to obtain a plurality of audio segments.
The audio segmentation method in this step may follow the prior art and is not described in detail here. It should be noted, however, that to provide a fast processing scheme for audio monitoring, the concern for timeliness is embodied in segmenting in real time: as soon as continuous audio data is collected at the input, the current audio data can be divided according to a predetermined segment length while collection is still in progress.
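The real-time segmentation described here might be sketched as a generator that emits fixed-length segments while samples are still arriving. This is a hypothetical helper; the segment length and buffering strategy are assumptions.

```python
def segment_stream(chunks, seg_len):
    """Yield fixed-length segments as soon as enough samples have arrived,
    so segmentation overlaps with acquisition instead of waiting for the
    whole recording. `chunks` is any iterable of sample sequences, e.g.
    the buffers delivered by an audio capture callback."""
    buf = []
    for chunk in chunks:
        buf.extend(chunk)
        while len(buf) >= seg_len:      # emit as soon as a segment is full
            yield buf[:seg_len]
            buf = buf[seg_len:]         # keep the remainder for later
```

Because it is a generator, downstream quality screening and detection can consume each segment the moment it is complete, in line with the real-time requirement above.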
And step S2, judging whether the current audio clip is available according to the quality of the audio clip in real time in the segmentation process.
To ensure the timeliness of the whole process, this step is embodied, on the one hand, in evaluating the quality of each audio segment synchronously while the audio is being divided; on the other hand, the purpose of evaluating each divided segment is to eliminate invalid audio information before type detection, which both sufficiently reduces the number of subsequent detection objects and makes the detection result more accurate. For example, in some embodiments of the present invention, the short-time energy value of the current divided audio segment is calculated in real time, and whether the segment is usable is then judged from the relationship between this short-time energy value and a preset energy threshold. That is, while the audio is being divided, each segment is checked in real time against the preset quality requirement, and segments that do not meet it can be discarded.
It should be noted that the energy estimation itself, such as the calculation of short-time energy, may refer to the related art and is not detailed here.
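For illustration, one common definition of short-time energy (mean squared amplitude) and the threshold comparison described above might look as follows. The exact energy formula and threshold policy used by the invention are not specified, so this is an assumption.

```python
def short_time_energy(segment):
    """Mean squared amplitude of a segment (one common definition)."""
    return sum(x * x for x in segment) / len(segment)

def is_usable(segment, energy_threshold):
    """A segment passes quality screening when its short-time energy meets
    the preset threshold; near-silent segments are discarded before the
    scene-type detection stage ever sees them."""
    return short_time_energy(segment) >= energy_threshold
```

The threshold would typically be tuned to the noise floor of the deployment environment.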
And step S3, extracting audio features of the audio segments judged to be available in real time, and carrying out scene classification detection according to a scene classification model trained on the basis of a recurrent neural network architecture in advance.
Similarly, to ensure the timeliness of the whole process, this step provides that once the current audio segment is determined to be valid by the foregoing steps, it is sent to the detection stage in real time; features such as frequency, amplitude, phase, pitch, loudness, and timbre are extracted at the acoustic level, and scene type detection is then performed by a scene classification model built on an RNN architecture. It should be emphasized that the RNN is not an arbitrary choice among neural network architectures. Given the time and resource consumption of existing scene classification schemes set out above, the RNN architecture is selected precisely because it does not record all inputs and operation results, but only feeds the intermediate result of the previous operation into the subsequent operation, which sufficiently relieves resource pressure. In practice, a person skilled in the art will understand that the RNN referred to in this embodiment is a generic concept covering the RNN itself as well as improved or variant forms based on the RNN idea, such as LSTM and GRU; the present invention therefore does not limit the specific form of the recurrent neural network actually used.
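To illustrate why a recurrent architecture keeps memory constant, here is a toy Elman-style RNN step in NumPy: between segments only the fixed-size hidden vector `h` is retained, never the per-segment inputs or outputs. The weights are random and untrained; this is a didactic sketch, not the patent's model.

```python
import numpy as np

class TinyRNNClassifier:
    """Toy Elman-style recurrent step: between segments only the fixed-size
    hidden vector is retained, so memory does not grow with the number of
    segments processed. Weights are random and untrained (didactic only)."""

    def __init__(self, n_features, n_hidden, n_scenes, seed=0):
        rng = np.random.default_rng(seed)
        self.Wx = rng.normal(size=(n_hidden, n_features)) * 0.1
        self.Wh = rng.normal(size=(n_hidden, n_hidden)) * 0.1
        self.Wo = rng.normal(size=(n_scenes, n_hidden)) * 0.1
        self.n_hidden = n_hidden

    def step(self, features, h=None):
        if h is None:
            h = np.zeros(self.n_hidden)
        h = np.tanh(self.Wx @ features + self.Wh @ h)  # recurrent update
        logits = self.Wo @ h
        probs = np.exp(logits) / np.exp(logits).sum()  # scene probabilities
        return probs, h                                # h is all that is kept
```

A production system would use a trained LSTM or GRU, but the storage property is the same: the state carried between segments has a fixed size regardless of how many segments have been seen.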
It should be added that some preferred implementation manners refer to, for example, the scene classification detection method shown in fig. 2, which may include the following steps:
step S31, converting the available audio frequency segments to obtain corresponding spectrogram;
step S32, extracting auxiliary acoustic features of the audio clips;
step S33, obtaining matching degrees of the audio clip and a plurality of preset scene audio types on an acoustic level based on the spectrogram, the N-order difference spectrum thereof and the auxiliary acoustic features;
and step S34, determining a scene classification result according to the matching degree.
This embodiment gives preferred audio features for reference: the spectrogram and its N-order difference features, supplemented by other acoustic features. It should be noted that the present invention does not limit the type of the input current audio data, which may be any known audio, including but not limited to human voice, because application environments in the audio monitoring field vary: some are based on speech scenes and some on specific non-speech audio scenes. In practice, therefore, the spectrogram mentioned in this embodiment may be adapted to the application environment; for example, when audio monitoring targets a speech environment, it may be converted into a speech spectrogram. The auxiliary acoustic features may likewise be chosen as actually needed, for example, but not limited to, environmental noise features and/or speech pronunciation features. That is, in a non-speech environment, speech pronunciation features need not be considered, and environmental noise features alone serve as the auxiliary acoustic information for scene classification; in a speech environment, environmental noise features and speech pronunciation features may be considered together, or speech pronunciation features may be considered alone as the auxiliary classification information. Of course, in some specific applications no auxiliary acoustic feature is available, and it may default to null; this is not a limitation of the present invention.
To facilitate an understanding of the preferred embodiment, one specific implementation is exemplified here:
Receive the t-th valid fixed-length speech segment (denoted D_t); D_t may then be purified to filter out interference factors, yielding a clean speech segment E_t. Next, extract the speech pronunciation features V_t of E_t (such as, but not limited to, prosody, rhythm, tone, and emotion), perform a frequency-domain transform on E_t to obtain the spectrogram T_t, and simultaneously compute its N-order difference spectrum T'_t. Stack the spectrogram T_t, the N-order difference spectrum T'_t, and the pronunciation features V_t as U_t = [T_t, T'_t, V_t]; finally, the RNN derives a classification result vector R from U_t. The derivation is related to model training, and the specific training method is not the focus of the present invention. It should be emphasized that the classification logic of this preferred embodiment is to analyze the acoustic information of the input audio, match and associate it with pre-annotated typical audio characteristics of a number of specific scene types, rank the matching results that meet the established relevance criteria, and select one or more final scene classification results.
It should be noted that, regarding the feature extraction, the spectrogram conversion, the calculation of the N-order difference, the model training, and the like, which are related above, reference may be made to the existing mature technical solution, which is not described in detail herein.
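As a hedged illustration of the feature stacking U_t = [T_t, T'_t, V_t] described above, the following NumPy sketch frames a segment, takes an FFT magnitude spectrogram, computes difference spectra along the time axis up to order N, and appends auxiliary features as extra rows. The frame length, the use of a plain FFT per frame, and the tiling of V_t across frames are all assumptions, not details from the patent.

```python
import numpy as np

def stack_features(segment, frame_len=64, n_diff=2, aux=None):
    """Build U_t = [T_t, T'_t, ..., V_t]: an FFT magnitude spectrogram T_t,
    its difference spectra along the time axis up to order n_diff, and any
    auxiliary features `aux` tiled across frames as extra rows."""
    segment = np.asarray(segment, dtype=float)
    n_frames = len(segment) // frame_len
    frames = segment[:n_frames * frame_len].reshape(n_frames, frame_len)
    T = np.abs(np.fft.rfft(frames, axis=1)).T        # (freq_bins, n_frames)
    parts, d = [T], T
    for _ in range(n_diff):                          # N-order difference spectra
        d = np.diff(d, axis=1, prepend=d[:, :1])
        parts.append(d)
    U = np.vstack(parts)
    if aux is not None:                              # auxiliary features V_t
        v = np.asarray(aux, dtype=float)[:, None]
        U = np.vstack([U, np.broadcast_to(v, (len(aux), U.shape[1]))])
    return U
```

The stacked matrix U_t would then be fed to the recurrent classifier one segment at a time, matching the per-segment processing described above.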
Step S4, when a scene classification result is determined based on at least one of the audio clips, ending the processing of the current audio data.
Following the same design line, the concept of ensuring timeliness is also embodied in this step. Since detection is performed in real time on each delivered segment, once the current audio segment yields a result meeting the type matching criteria, audio segmentation, quality screening, and the processing of the other available segments are no longer carried out; that is, the classification processing of the current audio data is terminated, avoiding the long processing time and wasted computing resources caused by unnecessary processing.
Of course, to further strengthen the timeliness that the present invention is concerned with, some other preferred embodiments insert a timing mechanism into the whole process. Examples include, but are not limited to, the following: if no final classification result is detected from the current audio segment, judge whether the whole detection process has exceeded a preset first timing. Once it has timed out, terminate the processing of the current audio data and re-acquire new audio data; that is, do not spend further time and resources on the current audio data, but continue collecting audio from the current application environment (such as speech continuously produced by a speaker). If the first timing has not been exceeded, the next available segment of the current audio data can be received and scene type detection continued, with the timeout judgment repeated whenever a segment is detected without producing a classification result.
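The first-timing mechanism might be sketched as a deadline check after each undecided segment. This is illustrative only; `step_classifier` and the timeout policy are assumptions.

```python
import time

def detect_with_deadline(segments, step_classifier, first_timing_s):
    """Detect the scene type segment by segment; after each undecided
    segment, check the first-timing deadline. On timeout (or exhaustion)
    return None so the caller can abandon this audio and acquire fresh
    data, rather than spending further time on the current input."""
    deadline = time.monotonic() + first_timing_s
    state = None
    for segment in segments:
        label, state = step_classifier(segment, state)
        if label is not None:            # result found within the time budget
            return label
        if time.monotonic() > deadline:  # first timing exceeded: give up
            return None
    return None
```

Using a monotonic clock rather than wall-clock time keeps the deadline immune to system clock adjustments during monitoring.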
Finally, it may be added that, in other embodiments, after the final classification result is obtained and before audio data acquisition resumes, the invention contemplates entering a quiescent state, that is, providing a buffer and a marker for the next round of audio classification detection. A specific implementation may reuse the aforementioned timing policy: after the classification detection of the current audio data is completed, a preset second timing elapses, and then the acquisition of new audio data continues and the foregoing embodiments and preferred solutions are executed; this is not limited here.
In summary, the concept of the present invention is to solve the time and resource consumption of the existing audio scene classification approach from two directions. On one hand, for processing timeliness, the quality of each audio segment is judged in real time so that only valid segments meeting the quality standard are detected; with the audio segment as the processing unit, once a classification result is detected in real time from the current segment, the processing of the other available segments of the input audio is terminated. This segment quality screening together with the per-segment processing strategy greatly reduces unnecessary detection and thus solves the time-consumption problem of detection. On the other hand, by exploiting the characteristics of the RNN architecture, the information of each audio segment need not be stored during scene type detection; only the result of the previous step is used as the input of subsequent processing, which substantially saves resource space. The invention can therefore greatly shorten the overall detection time, guarantee processing timeliness, and keep the system lightweight, so that it is flexibly applicable to audio monitoring environments of various scales.
Corresponding to the above embodiments and preferred solutions, the present invention further provides an embodiment of an audio scene classification apparatus applied to audio monitoring, as shown in fig. 3, which may specifically include the following components:
the segmentation module 1 is used for segmenting the acquired current audio data in real time to obtain a plurality of audio segments;
the audio segment screening module 2 is used for judging in real time, during the segmentation process, whether the current audio segment is available according to its quality;
the scene type detection module 3 is used for extracting, in real time, audio features of the audio segments judged to be available, and performing scene classification detection according to a scene classification model trained in advance on a recurrent neural network architecture;
and the processing termination module 4 is used for terminating the processing of the current audio data when a scene classification result is determined based on at least one of the audio segments.
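The cooperation of the above four modules may be sketched in a minimal Python form; the names `is_available` and `detect` are hypothetical stand-ins for the screening and detection logic described above, and the early-exit structure illustrates the termination strategy rather than the literal claimed implementation:

```python
def classify_stream(segments, is_available, detect):
    """Process segments one at a time; skip low-quality segments; stop as soon
    as any segment yields a scene label (early termination)."""
    for seg in segments:
        if not is_available(seg):   # audio segment screening module
            continue
        label = detect(seg)          # scene type detection; None if undecided
        if label is not None:
            return label             # processing termination: remaining segments skipped
    return None                      # no confident result from this audio data
```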
In at least one possible implementation manner, the scene type detection module includes:
the spectral feature acquisition unit, which is used for converting the available audio segments to obtain corresponding spectrograms;
the auxiliary acoustic feature acquisition unit, which is used for extracting auxiliary acoustic features of the audio segments;
the matching degree calculation unit, which is used for obtaining, based on the spectrogram, the N-order difference spectrum, and the auxiliary acoustic features, the matching degrees of the audio segment with a plurality of preset scene audio types at the acoustic level;
and the classification result determining unit, which is used for determining a scene classification result according to the matching degrees.
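The spectral feature steps above may be illustrated by the following minimal sketch. The frame length, hop size, and FFT-based magnitude spectrogram are illustrative assumptions, since the embodiment does not fix these parameters, and the argmax over matching degrees merely exemplifies the classification result determining unit:

```python
import numpy as np

def log_spectrogram(segment, frame_len=400, hop=160):
    """Framed FFT magnitude spectrogram (a minimal stand-in for the
    spectrogram conversion step; frame/hop values are assumed)."""
    frames = [segment[i:i + frame_len]
              for i in range(0, len(segment) - frame_len + 1, hop)]
    return np.log1p(np.abs(np.fft.rfft(np.stack(frames), axis=1)))

def order_n_deltas(spec, n=2):
    """First- through N-th order difference spectra along the time axis."""
    deltas, cur = [], spec
    for _ in range(n):
        cur = np.diff(cur, axis=0)
        deltas.append(cur)
    return deltas

def pick_scene(match_scores):
    """Choose the preset scene audio type with the highest matching degree."""
    return max(match_scores, key=match_scores.get)
```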
In at least one possible implementation manner, the auxiliary acoustic features include: ambient noise features and/or speech articulation features.
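The embodiment does not prescribe how the auxiliary acoustic features are computed. Purely for illustration, two common acoustic descriptors that could serve as an ambient noise feature and as a speech articulation proxy are sketched below; both functions are hypothetical examples, not the claimed features:

```python
import numpy as np

def spectral_flatness(segment):
    """Ratio of geometric to arithmetic mean of the power spectrum; values
    near 1 suggest noise-like (flat) content, values near 0 tonal/speech-like
    content. A plausible (assumed) ambient noise feature."""
    power = np.abs(np.fft.rfft(segment)) ** 2 + 1e-12
    return float(np.exp(np.mean(np.log(power))) / np.mean(power))

def zero_crossing_rate(segment):
    """Fraction of adjacent sample pairs whose signs differ; often used as a
    rough (assumed) proxy for speech articulation activity."""
    return float(np.mean(np.signbit(segment[:-1]) != np.signbit(segment[1:])))
```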
In at least one possible implementation manner, the apparatus further includes a processing timeliness monitoring module, which specifically includes:
the detection timeout judging unit, which is used for judging whether the detection process exceeds a preset first timing if no classification result is detected based on the currently input audio segment;
the loop processing unit, which is used for terminating the processing of the current audio data and re-acquiring new audio data when the output of the detection timeout judging unit is yes;
and the continuation processing unit, which is used for continuing to detect the next audio segment when the output of the detection timeout judging unit is no.
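The behavior of the processing timeliness monitoring module may be sketched as follows; the injected `clock` parameter and the None-on-timeout return convention are illustrative assumptions introduced so the deadline logic can be shown in isolation:

```python
import time

def detect_with_deadline(segments, detect, first_timing=5.0, clock=time.monotonic):
    """Abandon the current audio data if no classification result is obtained
    within `first_timing` seconds; the caller then re-acquires new audio."""
    start = clock()
    for seg in segments:
        if clock() - start > first_timing:
            return None          # timed out: give up on this audio data
        label = detect(seg)      # scene classification on the next segment
        if label is not None:
            return label         # result found before the deadline
    return None
```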
In at least one possible implementation manner, the apparatus further includes a cyclic processing module;
the cyclic processing module is used for continuing to acquire new audio data and performing the aforementioned processing after the processing of the current audio data is finished and a preset second timing has elapsed.
In at least one possible implementation manner, the audio segment screening module includes:
the energy calculation unit, which is used for calculating the short-time energy value of the current audio segment in real time;
and the audio segment screening unit, which is used for judging whether the current audio segment is available according to the relation between the short-time energy value and a preset energy threshold.
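As a minimal illustration of the energy-based screening, a short-time energy check may be written as follows; the mean-squared-amplitude definition and the threshold value are illustrative placeholders rather than values taken from the embodiment:

```python
import numpy as np

def short_time_energy(frame):
    """Mean squared amplitude of one audio frame, a common short-time
    energy definition (one of several possible choices)."""
    return float(np.mean(frame.astype(np.float64) ** 2))

def is_segment_available(segment, energy_threshold=1e-4):
    """A segment is considered usable only if its short-time energy exceeds
    the threshold, i.e. it is unlikely to be silence or faint background."""
    return short_time_energy(segment) > energy_threshold
```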
It should be understood that the division of the components of the audio scene classification apparatus shown in fig. 3 is merely a division by logical function; in an actual implementation, these components may be wholly or partially integrated into one physical entity, or may be kept physically separate. The components may all be implemented as software invoked by a processing element; they may all be implemented in hardware; or some may be implemented as software invoked by a processing element while others are implemented in hardware. For example, a certain module may be a separately established processing element, or may be integrated into a chip of the electronic device; the other components are implemented similarly. In addition, all or some of the components may be integrated together, or each may be implemented independently. In implementation, each step of the above method, or each of the above components, may be completed by an integrated logic circuit of hardware in a processor element, or by instructions in the form of software.
For example, the above components may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). For another example, these components may be integrated together and implemented in the form of a System-on-a-Chip (SoC).
In view of the foregoing embodiments and their preferred solutions, it will be appreciated by those skilled in the art that, in practice, the technical idea underlying the present invention may be applied through a variety of carriers, which are schematically illustrated as follows:
(1) an audio scene classification device applied to audio monitoring. The device may specifically include: one or more processors, memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions, which when executed by the apparatus, cause the apparatus to perform the steps/functions of the foregoing embodiments or an equivalent implementation.
Fig. 4 is a schematic structural diagram of an embodiment of an audio scene classification device applied to audio monitoring provided in the present invention, where the device may be a server, a desktop PC, a notebook computer, an intelligent terminal, and the like.
As specifically shown in fig. 4, the audio scene classification device 900 applied to audio monitoring includes a processor 910 and a memory 930. The processor 910 and the memory 930 communicate with each other through an internal connection path to transfer control and/or data signals; the memory 930 is used for storing a computer program, and the processor 910 is used for calling and running the computer program from the memory 930. The processor 910 and the memory 930 may be combined into a single processing device or, alternatively, remain separate components, with the processor 910 executing the program code stored in the memory 930 to implement the functions described above. In a specific implementation, the memory 930 may be integrated in the processor 910 or may be separate from the processor 910.
In addition, to make the functions of the audio scene classification device 900 applied to audio monitoring more complete, the device 900 may further include one or more of an input unit 960, a display unit 970, an audio circuit 980, a camera 990, a sensor 901, and the like, where the audio circuit may further include a speaker 982, a microphone 984, and so on, and the display unit 970 may include a display screen.
Further, the apparatus 900 may also include a power supply 950 for providing power to various devices or circuits within the apparatus 900.
It should be understood that, for the operation and/or function of each component of the above apparatus 900, reference may be made to the foregoing description of the method, system, and other embodiments; a detailed description is appropriately omitted here to avoid repetition.
It should be understood that the processor 910 in the audio scene classification device 900 shown in fig. 4 may be a system-on-chip (SoC). The processor 910 may include a Central Processing Unit (CPU) and may further include other types of processors, such as a Graphics Processing Unit (GPU), as further described below.
In summary, various portions of the processors or processing units within the processor 910 may cooperate to implement the foregoing method flows, and corresponding software programs for the various portions of the processors or processing units may be stored in the memory 930.
(2) A readable storage medium, on which a computer program or the above-mentioned apparatus is stored, and which, when the computer program is executed, causes the computer to perform the steps/functions of the foregoing embodiments or equivalent implementations.
In the several embodiments provided by the present invention, any function, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part thereof that substantially contributes to the prior art, may be embodied in the form of a software product, as described below.
(3) A computer program product (which may include the above apparatus) which, when run on a terminal device, causes the terminal device to execute the audio scene classification method applied to audio monitoring of the foregoing embodiments or equivalent implementations.
From the above description of the embodiments, it is clear to those skilled in the art that all or part of the steps in the above implementation methods can be completed by software plus a necessary general-purpose hardware platform. Based on this understanding, the above computer program product may include, but is not limited to, an APP. The aforementioned device/terminal may be a computer device whose hardware structure may specifically include: at least one processor, at least one communication interface, at least one memory, and at least one communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus. The processor may be a central processing unit (CPU), a DSP, a microcontroller, or a digital signal processor, and may further include a GPU, an embedded Neural-network Processing Unit (NPU), and an Image Signal Processor (ISP); the processor may further include an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The processor may run one or more software programs, which may be stored in a storage medium such as the memory. The aforementioned memory/storage medium may include non-volatile memories such as a non-removable magnetic disk, a U-disk, a removable hard disk, and an optical disk, as well as read-only memories (ROM), random access memories (RAM), and the like.
In the embodiments of the present invention, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that A exists alone, A and B exist simultaneously, or B exists alone, where A and B may each be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" and similar expressions refer to any combination of the listed items, including any combination of single or plural items. For example, "at least one of a, b, and c" may represent: a; b; c; a and b; a and c; b and c; or a, b, and c, where each of a, b, and c may be single or multiple.
Those skilled in the art will appreciate that the modules, units, and method steps described in the embodiments disclosed in this specification can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether such functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementation should not be considered beyond the scope of the present invention.
In addition, the embodiments in this specification are described in a progressive manner, and the same or similar parts among the embodiments may refer to each other. In particular, for embodiments of devices, apparatuses, and the like, since they are substantially similar to the method embodiments, their description is relatively simple, and reference may be made to the corresponding parts of the method embodiments. The above-described embodiments of devices and apparatuses are merely illustrative; the modules and units described as separate components may or may not be physically separate, and may be located in one place or distributed in multiple places, for example on nodes of a system network. Some or all of the modules or units may be selected according to actual needs to achieve the purpose of the embodiment, which can be understood and implemented by those skilled in the art without inventive effort.
The structure, features, and effects of the present invention have been described in detail above with reference to the embodiments shown in the drawings. The above embodiments are, however, merely preferred embodiments of the present invention, and the technical features involved in them and their preferred modes can be reasonably combined and configured into various equivalent schemes by those skilled in the art without departing from or changing the design idea and technical effects of the invention. Therefore, the invention is not limited to the embodiments shown in the drawings, and all modifications and equivalent embodiments made according to the idea of the invention fall within its protection scope as long as they do not depart from the spirit of the description and the drawings.

Claims (10)

1. An audio scene classification method applied to audio monitoring is characterized by comprising the following steps:
carrying out real-time segmentation on the acquired current audio data to obtain a plurality of audio segments;
judging whether the current audio clip is available or not according to the quality of the audio clip in real time in the segmentation process;
extracting audio features of the audio segments judged to be available in real time, and carrying out scene classification detection according to a scene classification model trained on the basis of a recurrent neural network architecture in advance;
when a scene classification result is determined based on at least one of the audio clips, the processing of the current audio data is ended.
2. The audio scene classification method applied to audio monitoring of claim 1, wherein the scene classification detection is performed as follows:
converting the available audio clips to obtain corresponding spectrogram;
extracting auxiliary acoustic features of the audio segment;
obtaining matching degrees of the audio clips and a plurality of preset scene audio types on an acoustic level based on the spectrogram, the N-order difference spectrum and the auxiliary acoustic features;
and determining a scene classification result according to the matching degree.
3. The audio scene classification method applied to audio monitoring of claim 2, wherein the auxiliary acoustic features comprise: ambient noise features and/or speech articulation features.
4. The method for audio scene classification applied to audio monitoring of claim 1, further comprising:
if the classification result is not detected based on the currently input audio clip, judging whether the detection process exceeds a preset first timing;
if yes, terminating the processing of the current audio data and acquiring new audio data again;
and if not, continuously detecting the next section of the audio clip.
5. The audio scene classification method applied to audio monitoring of claim 1, wherein after the ending of the processing of the current audio data, the method further comprises:
and after the preset second timing, continuously acquiring new audio data and carrying out the processing.
6. The audio scene classification method applied to audio monitoring as claimed in any one of claims 1 to 5, wherein the determining whether the current audio segment is available according to the quality of the audio segment in real time during the segmentation process comprises:
calculating the short-time energy value of the current audio clip in real time;
and judging whether the current audio clip is available or not according to the relation between the short-time energy value and a preset energy threshold value.
7. An audio scene classification device applied to audio monitoring is characterized by comprising:
the segmentation module is used for segmenting the acquired current audio data in real time to obtain a plurality of audio segments;
the audio clip screening module is used for judging whether the current audio clip is available or not according to the quality of the audio clip in real time in the segmentation process;
the scene type detection module is used for extracting audio features of the audio clips which are judged to be available in real time and carrying out scene classification detection according to a scene classification model trained on the basis of a recurrent neural network architecture in advance;
and the processing termination module is used for finishing the processing of the current audio data when a scene classification result is determined based on at least one audio fragment.
8. The audio scene classification device applied to audio monitoring of claim 7, wherein the scene type detection module comprises:
the frequency spectrum characteristic acquisition unit is used for converting the available audio frequency segments to obtain corresponding frequency spectrum diagrams;
the auxiliary acoustic feature acquisition unit is used for extracting auxiliary acoustic features of the audio segments;
the matching degree calculation unit is used for obtaining the matching degree of the audio clip and the preset scene audio types on the acoustic level based on the spectrogram, the N-order difference spectrum and the auxiliary acoustic features;
and the classification result determining unit is used for determining a scene classification result according to the matching degree.
9. The audio scene classification device applied to audio monitoring according to claim 7 or 8, wherein the device further comprises a processing timeliness monitoring module, and the processing timeliness monitoring module specifically comprises:
a detection timeout determining unit configured to determine whether a detection process exceeds a preset first timing if a classification result is not detected based on the currently input audio clip;
the loop processing unit is used for terminating the processing of the current audio data and acquiring new audio data again when the output of the detection overtime judging unit is yes;
and the continuation processing unit is used for continuously detecting the next section of the audio clip when the output of the detection timeout judging unit is negative.
10. An audio scene classification device applied to audio monitoring, comprising:
one or more processors, a memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions which, when executed by the apparatus, cause the apparatus to perform the audio scene classification method applied to audio monitoring of any of claims 1 to 6.
CN202011506902.4A 2020-12-18 2020-12-18 Audio scene classification method, device and equipment applied to audio monitoring Active CN112562727B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011506902.4A CN112562727B (en) 2020-12-18 2020-12-18 Audio scene classification method, device and equipment applied to audio monitoring


Publications (2)

Publication Number Publication Date
CN112562727A true CN112562727A (en) 2021-03-26
CN112562727B CN112562727B (en) 2024-04-26

Family

ID=75030890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011506902.4A Active CN112562727B (en) 2020-12-18 2020-12-18 Audio scene classification method, device and equipment applied to audio monitoring

Country Status (1)

Country Link
CN (1) CN112562727B (en)


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3223253A1 (en) * 2016-03-23 2017-09-27 Thomson Licensing Multi-stage audio activity tracker based on acoustic scene recognition
CN107799126A (en) * 2017-10-16 2018-03-13 深圳狗尾草智能科技有限公司 Sound end detecting method and device based on Supervised machine learning
US20180166067A1 (en) * 2016-12-14 2018-06-14 International Business Machines Corporation Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier
US20190362741A1 (en) * 2018-05-24 2019-11-28 Baidu Online Network Technology (Beijing) Co., Ltd Method, apparatus and device for recognizing voice endpoints
CN110610722A (en) * 2019-09-26 2019-12-24 北京工业大学 Short-time energy and Mel cepstrum coefficient combined novel low-complexity dangerous sound scene discrimination method based on vector quantization
CN110827795A (en) * 2018-08-07 2020-02-21 阿里巴巴集团控股有限公司 Voice input end judgment method, device, equipment, system and storage medium
CN110866143A (en) * 2019-11-08 2020-03-06 山东师范大学 Audio scene classification method and system
CN111279414A (en) * 2017-11-02 2020-06-12 华为技术有限公司 Segmentation-based feature extraction for sound scene classification
CN111653290A (en) * 2020-05-29 2020-09-11 北京百度网讯科技有限公司 Audio scene classification model generation method, device, equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NI Ning; LU Gang; BU Jiajun: "Video Scene Detection Based on Audio Analysis", Computer Simulation, no. 08 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712809A (en) * 2021-03-29 2021-04-27 北京远鉴信息技术有限公司 Voice detection method and device, electronic equipment and storage medium
CN113033707A (en) * 2021-04-25 2021-06-25 北京有竹居网络技术有限公司 Video classification method and device, readable medium and electronic equipment
CN113033707B (en) * 2021-04-25 2023-08-04 北京有竹居网络技术有限公司 Video classification method and device, readable medium and electronic equipment

Also Published As

Publication number Publication date
CN112562727B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN107393526B (en) Voice silence detection method, device, computer equipment and storage medium
CN106531190B (en) Voice quality evaluation method and device
CN110600059B (en) Acoustic event detection method and device, electronic equipment and storage medium
CN110890102A (en) Engine defect detection algorithm based on RNN voiceprint recognition
CN110706694A (en) Voice endpoint detection method and system based on deep learning
KR20060082465A (en) Method and apparatus for classifying voice and non-voice using sound model
CN112562727A (en) Audio scene classification method, device and equipment applied to audio monitoring
CN107622773B (en) Audio feature extraction method and device and electronic equipment
CN113205820B (en) Method for generating voice coder for voice event detection
CN104781862A (en) Real-time traffic detection
CN111862951A (en) Voice endpoint detection method and device, storage medium and electronic equipment
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN107154996B (en) Incoming call interception method and device, storage medium and terminal
US11322137B2 (en) Video camera
US20180108345A1 (en) Device and method for audio frame processing
CN115438658A (en) Entity recognition method, recognition model training method and related device
CN114171057A (en) Transformer event detection method and system based on voiceprint
CN114944152A (en) Vehicle whistling sound identification method
CN113593603A (en) Audio category determination method and device, storage medium and electronic device
CN112735436A (en) Voiceprint recognition method and voiceprint recognition system
CN106340310A (en) Speech detection method and device
CN111190045A (en) Voltage abnormity prediction method and device and electronic equipment
CN113555037B (en) Method and device for detecting tampered area of tampered audio and storage medium
CN117636909B (en) Data processing method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant