CN112562727B - Audio scene classification method, device and equipment applied to audio monitoring


Info

Publication number
CN112562727B
CN112562727B (application number CN202011506902.4A)
Authority
CN
China
Prior art keywords
audio
scene classification
processing
fragment
scene
Prior art date
Legal status: Active
Application number
CN202011506902.4A
Other languages
Chinese (zh)
Other versions
CN112562727A
Inventor
黄真明
陆春亮
王毅
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Priority claimed from application CN202011506902.4A
Publication of CN112562727A
Application granted
Publication of CN112562727B

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/48 — Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 — Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination


Abstract

The invention discloses an audio scene classification method, device and equipment applied to audio monitoring, aimed at the time consumption and resource waste of existing audio scene classification approaches. On the one hand, for processing timeliness, the quality of each audio segment is evaluated in real time so that only valid segments are detected; segments serve as the processing unit, and once a classification result is obtained, processing of the remaining available segments stops, which greatly reduces unnecessary detection. On the other hand, by exploiting the characteristics of the RNN architecture, the scene-type detection process does not need to store information about every audio segment; only the result of the previous step is fed into the subsequent processing, which substantially saves resource space. The invention thus guarantees processing timeliness while keeping the system lightweight, so it can be flexibly applied to audio monitoring environments of various scales.

Description

Audio scene classification method, device and equipment applied to audio monitoring
Technical Field
The present invention relates to the field of audio monitoring, and in particular, to a method, an apparatus, and a device for classifying audio scenes applied to audio monitoring.
Background
Audio monitoring has developed over many years, and its importance within high-quality intelligent security systems has become increasingly prominent. Unlike video, audio is not affected by blocked sight lines or poor illumination, so it can compensate for visual blind spots and supply additional information for situational judgment — a role video cannot replace. Audio scene analysis is mainly used to analyze, decide on, and give early warning of abnormal behavior occurring in the monitored environment. Its key technique is to raise alarms for abnormal events using classification methods that combine pattern recognition with the time-domain and frequency-domain characteristics of various abnormal audio; intelligently extracting and analyzing the information carried in audio signals is the key link of audio scene classification technology.
In the audio monitoring field, the currently adopted approach slices the input audio signal, extracts classification information from each slice based on sparse coding, and calibrates the extracted information with a calibration model. Each audio slice is then pre-classified by a classification model, the pre-classification results are fused, and finally the fused classification information of all slices is statistically analyzed to obtain the target classification result.
However, analysis shows that this existing scene classification process consumes both processing time and processing resources, making it difficult to meet the requirements of certain application environments.
Disclosure of Invention
In view of the foregoing, the present invention aims to provide an audio scene classification method, apparatus and device applied to audio monitoring, together with a corresponding computer-readable storage medium and computer program product, which can effectively solve the problem that the scene classification processing required in the audio monitoring field is time- and resource-consuming.
The technical scheme adopted by the invention is as follows:
In a first aspect, the present invention provides an audio scene classification method applied to audio monitoring, including:
segmenting the acquired current audio data in real time to obtain a plurality of audio segments;
judging in real time, during segmentation, whether the current audio segment is available according to its quality;
extracting in real time the audio features of segments judged to be available, and performing scene classification detection with a scene classification model trained in advance on a recurrent neural network architecture;
ending the processing of the current audio data when a scene classification result is determined based on at least one audio segment.
In at least one possible implementation manner, the scene classification detection process is as follows:
converting the available audio segment to obtain a corresponding spectrogram;
extracting auxiliary acoustic features of the audio segment;
obtaining, based on the spectrogram, its N-order differential spectrum and the auxiliary acoustic features, the degree to which the audio segment matches a plurality of preset scene audio types at the acoustic level;
determining a scene classification result according to the matching degree.
In at least one possible implementation thereof, the auxiliary acoustic feature includes: ambient noise characteristics and/or phonetic pronunciation characteristics.
In at least one possible implementation thereof, the method further comprises:
if no classification result is detected based on the currently input audio segment, judging whether the detection process has exceeded a preset first timing;
if it has timed out, terminating the processing of the current audio data and collecting new audio data again;
if not, continuing to detect the next audio segment.
In at least one possible implementation thereof, after the processing of the current audio data ends, the method further comprises:
continuing, after a preset second timing, to collect new audio data and perform the foregoing processing.
In at least one possible implementation manner, the determining whether the current audio segment is available in real time according to the quality of the audio segment during the segmentation includes:
calculating the short-time energy value of the current audio segment in real time;
judging whether the current audio segment is available according to the relation between the short-time energy value and a preset energy threshold.
In a second aspect, the present invention provides an audio scene classification apparatus for audio monitoring, including:
the segmentation module, configured to segment the acquired current audio data in real time to obtain a plurality of audio segments;
the audio segment screening module, configured to judge in real time, during segmentation, whether the current audio segment is available according to its quality;
the scene type detection module, configured to extract in real time the audio features of segments judged to be available and to perform scene classification detection with a scene classification model trained in advance on a recurrent neural network architecture;
the processing termination module, configured to end the processing of the current audio data when a scene classification result is determined based on at least one audio segment.
In at least one possible implementation manner, the scene type detection module includes:
the spectral feature acquisition unit, configured to convert the available audio segment to obtain a corresponding spectrogram;
the auxiliary acoustic feature acquisition unit, configured to extract auxiliary acoustic features of the audio segment;
the matching degree calculation unit, configured to obtain, based on the spectrogram, its N-order differential spectrum and the auxiliary acoustic features, the degree to which the audio segment matches a plurality of preset scene audio types at the acoustic level;
the classification result determination unit, configured to determine a scene classification result according to the matching degree.
In at least one possible implementation thereof, the auxiliary acoustic feature includes: ambient noise characteristics and/or phonetic pronunciation characteristics.
In at least one possible implementation manner, the device further comprises a processing timeliness monitoring module, which specifically comprises:
the detection timeout judging unit, configured to judge, if no classification result is detected based on the currently input audio segment, whether the detection process has exceeded a preset first timing;
the circulation processing unit, configured to terminate the processing of the current audio data and collect new audio data again when the detection timeout judging unit outputs yes;
the continuation processing unit, configured to continue detecting the next audio segment when the detection timeout judging unit outputs no.
In at least one possible implementation manner, the apparatus further includes: a cyclic processing module;
the circulation processing module, configured to continue collecting new audio data after a preset second timing, once the processing of the current audio data has ended, and to perform the foregoing processing.
In at least one possible implementation manner, the audio clip screening module includes:
the energy calculation unit, configured to calculate the short-time energy value of the current audio segment in real time;
the audio segment screening unit, configured to judge whether the current audio segment is available according to the relation between the short-time energy value and a preset energy threshold.
In a third aspect, the present invention provides an audio scene classification apparatus for audio monitoring, comprising:
one or more processors, a memory (which may employ a non-volatile storage medium), and one or more computer programs, wherein the one or more computer programs are stored in the memory and comprise instructions which, when executed by the device, cause the device to perform the method of the first aspect or any possible implementation of the first aspect.
In a fourth aspect, the present invention provides a computer readable storage medium having a computer program stored therein, which when run on a computer causes the computer to perform at least the method as in the first aspect or any of the possible implementations of the first aspect.
In a fifth aspect, the invention also provides a computer program product for performing at least the method of the first aspect or any of the possible implementations of the first aspect, when the computer program product is executed by a computer.
In at least one possible implementation manner of the fifth aspect, the relevant program related to the product may be stored in whole or in part on a memory packaged with the processor, or may be stored in part or in whole on a storage medium not packaged with the processor.
To solve the time consumption and resource waste of existing audio scene classification approaches, the invention on the one hand, for processing timeliness, evaluates the quality of each audio segment in real time so that only valid segments meeting the quality standard are detected; to guarantee timeliness, audio segments serve as the processing unit, and processing of the remaining available segments of the input audio is terminated as soon as a classification result is detected in real time from the current segment. This segment-quality screening together with the per-segment processing strategy greatly reduces unnecessary detection and thus solves the detection-time problem. On the other hand, by exploiting the characteristics of the RNN architecture, the scene-type detection process does not need to store information about every audio segment; only the result of the previous step is used as the input of the subsequent processing, which substantially saves resource space. The invention therefore greatly shortens the overall detection time, guarantees processing timeliness, and keeps the system lightweight, so it can be flexibly applied to audio monitoring environments of various scales.
Further, in some embodiments of the present invention, the input audio is matched against the characteristic audio of various scenes at the acoustic level only — the specific content of the audio is never recognized — which reduces the detection burden and improves processing timeliness.
Furthermore, in some embodiments of the present invention, a timing mechanism is provided for the detection link to ensure that the processing time meets the predetermined timeliness requirement, thereby better adapting to various audio monitoring needs.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of an embodiment of an audio scene classification method applied to audio monitoring according to the present invention;
Fig. 2 is a flowchart of a preferred embodiment of the scene classification method provided by the present invention;
Fig. 3 is a schematic diagram of an embodiment of an audio scene classification apparatus applied to audio monitoring according to the present invention;
Fig. 4 is a schematic diagram of an embodiment of an audio scene classification device applied to audio monitoring.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
Before introducing the specific technical scheme of the invention, the drawbacks of the existing processing approach are described. As mentioned above, the existing method makes the whole process time-consuming: for example, if the whole audio is divided into 10 segments numbered 1-10 but only segments 4-8 actually contain the specific audio scene information, the existing approach still has to wait until segment 10 has been processed before outputting the final scene classification result.
Furthermore, the existing scene classification method has to store the pre-classification information of every audio segment, because it must first collect all pre-classification information and then statistically process the fused classification information. Since every segment must be processed, the existing scheme inevitably occupies a large amount of memory and CPU resources, which makes it difficult to run on small computing nodes such as edge computing nodes.
In addition, some existing scene classification methods can also suffer in final classification accuracy due to incomplete audio segmentation. For example, if the input audio is divided into 100 segments numbered 1-100 and segments 3-6 contain the specific audio scene information, then when scene type detection treats every 5 segments as a processing unit — segments 1-5 as one unit and 6-10 as another — the prediction deviates because the audio information within each unit is incomplete.
In view of this, the inventors sought to solve both problems — complex, costly processing and information loss — in an integrated way. Specifically, the present invention provides an embodiment of an audio scene classification method applied to audio monitoring; referring to fig. 1, the method may specifically include:
Step S1: the acquired current audio data is segmented in real time to obtain a plurality of audio segments.
The audio segmentation method used in this step may follow the prior art and is not repeated here. It should be pointed out, however, that to provide a fast audio monitoring scheme, this step segments the audio data in real time for timeliness: as soon as continuous audio data from an input is being collected, the current audio data is cut into segments of a preset length while collection is still in progress.
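As a hedged illustration only — the chunk iterator, sample rate and segment length below are assumptions for the sketch, not details from the patent — the real-time split of a continuous input into fixed-length segments might look like this in Python:

```python
import numpy as np

def stream_segments(sample_stream, sample_rate=16000, segment_seconds=0.5):
    """Yield fixed-length segments as samples arrive, without waiting
    for the full recording (illustrative sketch of the real-time split)."""
    segment_len = int(sample_rate * segment_seconds)
    buffer = np.empty(0, dtype=np.float32)
    for chunk in sample_stream:          # chunk: 1-D array of new samples
        buffer = np.concatenate([buffer, np.asarray(chunk, dtype=np.float32)])
        while len(buffer) >= segment_len:
            yield buffer[:segment_len]   # emit a complete segment immediately
            buffer = buffer[segment_len:]
```

Because the generator emits each segment as soon as enough samples exist, downstream quality screening and detection can begin while collection is still in progress, which is the timeliness property the step describes.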
Step S2: during segmentation, whether the current audio segment is available is judged in real time according to its quality.
Likewise, to guarantee the timeliness of the entire process, this step is embodied, on the one hand, in evaluating the quality of already-divided segments synchronously while segmentation continues; on the other hand, evaluating each segment's quality serves to eliminate invalid audio information before type detection, which sufficiently reduces the number of subsequent detection objects and yields more accurate detection results. The specific quality evaluation method and quality index may be chosen as required — such as, but not limited to, signal-to-noise ratio, reverberation factor, or audio energy. For example, in some embodiments of the present invention, the short-time energy value of the segmented current audio segment is calculated in real time, and whether the segment is available is judged from the relation between that value and a preset energy threshold; that is, while the audio is being segmented, it is determined in real time whether each segment meets the preset quality requirement, and segments that do not may be discarded.
In addition, it should be noted that, regarding the foregoing energy evaluation itself, for example, for the calculation of the short-time energy, reference may be made to the existing related art, which is not described in detail in the present invention.
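As an illustration of such an energy-based availability check — the frame length, hop and energy threshold below are assumed values for the sketch, not figures from the patent:

```python
import numpy as np

def short_time_energy(segment, frame_len=400, hop=160):
    """Mean per-frame energy of an audio segment (sum of squared samples)."""
    frames = [segment[i:i + frame_len]
              for i in range(0, len(segment) - frame_len + 1, hop)]
    return float(np.mean([np.sum(f.astype(np.float64) ** 2) for f in frames]))

def is_available(segment, energy_threshold=1e-3):
    """Keep only segments whose short-time energy reaches the threshold."""
    return short_time_energy(segment) >= energy_threshold
```

A near-silent segment falls below the threshold and is discarded before detection, while a segment with audible content passes through to the scene classification model.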
Step S3: the audio features of segments judged to be available are extracted in real time, and scene classification detection is performed with a scene classification model trained in advance on a recurrent neural network architecture.
Likewise, to guarantee the timeliness of the whole process, this step feeds the current audio segment to the detection link in real time as soon as the preceding steps judge it valid; features such as frequency, amplitude, phase, pitch, loudness and timbre are extracted at the acoustic level, and scene type detection is then performed by the scene classification model of the RNN architecture. It should be emphasized that the RNN model in this embodiment is not chosen arbitrarily from among the many neural network architectures: given the time and resource consumption of prior scene classification schemes described above, the architecture was selected by analysis — since an RNN does not record all inputs and intermediate results, but only feeds the intermediate result of the previous operation into the subsequent one, it can fully relieve resource pressure. In practice, those skilled in the art will understand that "RNN" is used here as a generic concept covering the basic RNN as well as improved or variant forms based on it, such as LSTM and GRU; the invention therefore does not limit the specific form of recurrent neural network actually used.
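To illustrate why an RNN needs no memory of past segments, the sketch below carries only a hidden state between segments; it is an untrained toy model, and all dimensions and weights are assumptions, not the patented model:

```python
import numpy as np

class TinyRNNClassifier:
    """Minimal Elman-style RNN: for each segment it keeps only the hidden
    state, never the past segments themselves (illustrative, untrained)."""
    def __init__(self, feat_dim, hidden_dim, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.Wx = rng.standard_normal((hidden_dim, feat_dim)) * 0.1
        self.Wh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
        self.Wo = rng.standard_normal((n_classes, hidden_dim)) * 0.1
        self.h = np.zeros(hidden_dim)    # the only state carried forward

    def step(self, features):
        """Consume one segment's feature vector; return class probabilities."""
        self.h = np.tanh(self.Wx @ features + self.Wh @ self.h)
        logits = self.Wo @ self.h
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()
```

The memory footprint stays constant however many segments arrive, which is the resource property the paragraph attributes to the RNN architecture.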
It should also be noted that some preferred implementations follow the example of a scene classification detection method shown in fig. 2, which may include the following steps:
Step S31: convert the available audio segment to obtain a corresponding spectrogram;
Step S32: extract auxiliary acoustic features of the audio segment;
Step S33: obtain, based on the spectrogram, its N-order differential spectrum and the auxiliary acoustic features, the degree to which the audio segment matches a plurality of preset scene audio types at the acoustic level;
Step S34: determine a scene classification result according to the matching degree.
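Steps S31 and S33 involve a spectrogram and its N-order differential spectrum; a minimal sketch of both (frame length, hop and window choice are assumptions of this sketch) might look like:

```python
import numpy as np

def spectrogram(segment, frame_len=400, hop=160):
    """Magnitude spectrogram via framed, windowed FFT (time x frequency)."""
    frames = np.stack([segment[i:i + frame_len]
                       for i in range(0, len(segment) - frame_len + 1, hop)])
    window = np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames * window, axis=1))

def n_order_diff(spec, n=2):
    """N-order difference of the spectrogram along the time axis."""
    return np.diff(spec, n=n, axis=0)
```

The differential spectrum captures how the spectral envelope changes between frames, complementing the static spectrogram as an input to the matching in step S33.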
This embodiment refers to preferred audio features: the spectrogram, its N-order difference features, and other acoustic features used as assistance. The invention does not limit the type of the input current audio data — it may be any known audio, including but not limited to human voice — because application environments in the audio monitoring field vary: some center on speech scenes, others on specific non-speech audio scenes. In practice, therefore, the spectrogram mentioned in this embodiment can be adjusted to the application environment; for example, when monitoring a speech environment it can be replaced by a speech spectrogram. The auxiliary acoustic features can likewise be set as actually needed, including but not limited to environmental noise features and/or voice pronunciation features: in a non-speech application environment, voice pronunciation features need not be considered, and only environmental noise features serve as auxiliary acoustic information for scene classification; in a speech application environment, environmental noise features and voice pronunciation features may be considered together, or voice pronunciation features alone may serve as auxiliary classification information. Of course, in certain specific applications no such auxiliary acoustic feature is available, and the auxiliary acoustic feature may default to null; the invention is not limited in this regard.
To facilitate understanding of the preferred embodiments, one specific implementation is illustrated:
The t-th valid fixed-length voice segment (denoted D_t) is received in real time. D_t may be further purified by filtering out interference factors to obtain a clean voice segment E_t. The voice pronunciation features V_t of E_t (such as, but not limited to, rhythm, tone and emotion) are extracted; a frequency-domain transform of E_t yields the spectrogram T_t, and its N-order differential spectrum T'_t is calculated. The feature spectrogram T_t, the differential spectrum T'_t and the pronunciation features V_t are stacked into U_t = [T_t, T'_t, V_t], and finally the RNN obtains a classification result vector R based on U_t. How R is solved relates to model training, and the specific training method is not the focus of the invention. What is emphasized is the classification logic of this preferred embodiment: the acoustic information of the input audio is analyzed and matched against the classic audio features of a number of pre-labeled specific scene types, the matches meeting the established correlation standard are sorted, and one or more final scene classification results are selected.
It should be noted that, regarding the feature extraction, spectrogram transformation, calculation of the N-order difference, model training, etc. that are referred to above, reference may be made to the existing mature technical scheme, which is not repeated in the present invention.
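A hedged sketch of the feature stacking U_t = [T_t, T'_t, V_t] and the final ranking of the result vector R described above — the flattened layout, scene names and correlation threshold are illustrative assumptions, not details from the patent:

```python
import numpy as np

def stack_features(spec, diff_spec, voice_feats):
    """Stack spectrogram T_t, differential spectrum T'_t and pronunciation
    features V_t into a single RNN input vector U_t (layout assumed)."""
    return np.concatenate([np.ravel(spec), np.ravel(diff_spec),
                           np.asarray(voice_feats, dtype=np.float64)])

def select_scenes(result_vector, scene_names, threshold=0.5):
    """Sort scene types by their score in the classification result vector R
    and keep those meeting the correlation standard (threshold assumed)."""
    ranked = sorted(zip(scene_names, result_vector),
                    key=lambda p: p[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]
```

Sorting then filtering mirrors the described logic of ranking matches that meet the established correlation standard and selecting one or more final scene classification results.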
Step S4: when a scene classification result is determined based on at least one audio segment, the processing of the current audio data ends.
Following the same design, the concept of guaranteeing timeliness is again embodied in this step: since the preceding steps feed segments into detection in real time and detect them as they arrive, as soon as the current audio segment yields a result meeting the type matching standard, the audio segmentation, quality screening and processing of the remaining available segments can stop — that is, the classification processing of the current audio data is terminated — avoiding both the long processing time and the wasted computing resources of unnecessary processing.
Of course, to further strengthen the timeliness this invention focuses on, some other preferred embodiments also incorporate a timing mechanism into the above process, such as, but not limited to, the following example: if no final classification result is detected based on the current audio segment, it is judged whether the whole detection process has exceeded a preset first timing. Once timed out, processing of the current audio data stops and new audio data is collected again — no more time or resources are spent on the current audio data, and audio from the current application environment (such as speech a speaker continues to provide) keeps being acquired. If it has not timed out, scene type detection continues with the next available segment of the current audio data; if that segment still yields no classification result, the timeout judgment is made again after its detection.
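The first-timing mechanism and the early termination on a detected result can be sketched together as follows; the detect() callback and the timing budget are hypothetical placeholders for this illustration:

```python
import time

def classify_with_timeout(segments, detect, first_timing=5.0):
    """Stop as soon as a scene result is found; abandon the current audio
    if the first-timing budget is exceeded (illustrative sketch)."""
    start = time.monotonic()
    for seg in segments:
        result = detect(seg)              # scene result, or None
        if result is not None:
            return result                 # terminate remaining segments
        if time.monotonic() - start > first_timing:
            return None                   # timed out: re-collect new audio
    return None
```

Returning immediately on the first result implements the per-segment early termination of step S4, while the elapsed-time check implements the first-timing cutoff described above.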
Finally, it may be added that in other embodiments, after the final classification result is obtained and before collection continues, the method enters a quiescent state that provides buffering and demarcation for the next stage of audio classification detection. The concrete means may follow the aforementioned timing strategy: after classification detection of the current audio data completes, new audio data may be collected after a preset second timing and the foregoing embodiments and preferred schemes executed again; the invention is not limited in this respect.
In summary, to solve the time consumption and resource waste of existing audio scene classification approaches, the present invention on the one hand, for processing timeliness, evaluates the quality of each audio segment in real time so that only valid segments meeting the quality standard are detected; to guarantee timeliness, audio segments serve as the processing unit, and processing of the remaining available segments of the input audio is terminated as soon as a classification result is detected in real time from the current segment. This segment-quality screening together with the per-segment processing strategy greatly reduces unnecessary detection and thus solves the detection-time problem. On the other hand, by exploiting the characteristics of the RNN architecture, the scene-type detection process does not need to store information about every audio segment; only the result of the previous step is used as the input of the subsequent processing, which substantially saves resource space. The invention therefore greatly shortens the overall detection time, guarantees processing timeliness, and keeps the system lightweight, so it can be flexibly applied to audio monitoring environments of various scales.
Corresponding to the above embodiments and preferred solutions, the present invention further provides an embodiment of an audio scene classification device applied to audio monitoring, as shown in fig. 3, which may specifically include the following components:
a segmentation module 1, configured to segment the acquired current audio data in real time to obtain a plurality of audio fragments;
an audio fragment screening module 2, configured to judge in real time, during segmentation, whether the current audio fragment is available according to the quality of the audio fragment;
a scene type detection module 3, configured to extract in real time the audio features of the audio fragments judged to be available, and to perform scene classification detection according to a scene classification model trained in advance on a recurrent neural network architecture; and
a processing termination module 4, configured to end the processing of the current audio data when a scene classification result is determined based on at least one audio fragment.
In at least one possible implementation manner, the scene type detection module includes:
a spectral feature acquisition unit, configured to convert the available audio fragment to obtain a corresponding spectrogram;
an auxiliary acoustic feature acquisition unit, configured to extract auxiliary acoustic features of the audio fragment;
a matching degree calculation unit, configured to obtain, based on the spectrogram, its N-order differential spectrum and the auxiliary acoustic features, the matching degree at an acoustic level between the audio fragment and a plurality of preset scene audio types; and
a classification result determination unit, configured to determine the scene classification result according to the matching degree.
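A minimal sketch of the spectral feature stage is given below: a framed-FFT magnitude spectrogram plus its first- and second-order time differences, one plausible reading of the "N-order differential spectrum". The frame length, hop size, and Hann windowing are assumptions for illustration, not parameters taken from the patent.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a framed real FFT with a Hann window."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (frames, bins)

def differential_spectra(spec, order=2):
    """First- through Nth-order differences of the spectrogram along the
    time axis; each difference drops one frame."""
    diffs, cur = [], spec
    for _ in range(order):
        cur = np.diff(cur, axis=0)
        diffs.append(cur)
    return diffs
```

In a full pipeline, the spectrogram, its differential spectra, and the auxiliary acoustic features would be stacked and passed to the matching degree calculation.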
In at least one possible implementation manner, the auxiliary acoustic features include: ambient noise features and/or speech pronunciation features.
In at least one possible implementation manner, the device further comprises a processing timeliness monitoring module, which specifically comprises:
a detection timeout determination unit, configured to judge, if no classification result is detected based on the currently input audio fragment, whether the detection process exceeds a preset first timing;
a circulation processing unit, configured to terminate the processing of the current audio data and re-collect new audio data when the detection timeout determination unit outputs yes; and
a continuation processing unit, configured to continue detection with the next audio fragment when the detection timeout determination unit outputs no.
In at least one possible implementation manner, the apparatus further includes a circulation processing module;
the circulation processing module is configured to, after the processing of the current audio data ends and after a preset second timing, continue to collect new audio data and perform the foregoing processing.
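The two-timer control flow (the first timing bounding detection of one clip, the second timing buffering before the next collection) might be sketched as follows; the timer values and the `collect_audio`/`detect_segment` callables are hypothetical placeholders standing in for the acquisition and detection components described above.

```python
import time

def monitor_loop(collect_audio, detect_segment, first_timing=5.0,
                 second_timing=2.0, rounds=1):
    """Sketch of the two-timer strategy: abandon a clip whose detection
    exceeds `first_timing`, and wait `second_timing` before collecting
    the next clip. All timer values are illustrative."""
    results = []
    for _ in range(rounds):
        deadline = time.monotonic() + first_timing
        result = None
        for segment in collect_audio():
            result = detect_segment(segment)
            if result is not None:           # scene decided: stop early
                break
            if time.monotonic() > deadline:  # first timing exceeded
                break                        # drop clip, re-collect fresh audio
        results.append(result)
        time.sleep(second_timing)            # quiescent buffer before next round
    return results
```

For example, calling `monitor_loop` with a stub detector that fires on the second segment returns that result without touching later segments.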
In at least one possible implementation manner, the audio fragment screening module includes:
an energy calculation unit, configured to calculate the short-time energy value of the current audio fragment in real time; and
an audio fragment screening unit, configured to judge whether the current audio fragment is available according to the relation between the short-time energy value and a preset energy threshold.
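The short-time-energy screening performed by these two units can be sketched as follows, assuming mean squared amplitude as the energy definition; the threshold value is an illustrative placeholder, not a value taken from the patent.

```python
import numpy as np

def short_time_energy(segment):
    """Mean squared amplitude of the segment (one common definition of
    short-time energy)."""
    x = np.asarray(segment, dtype=float)
    return float(np.mean(x * x))

def is_available(segment, energy_threshold=1e-4):
    """Keep only segments whose short-time energy clears the threshold;
    near-silent segments are dropped before any feature extraction,
    so no detection effort is wasted on them."""
    return short_time_energy(segment) >= energy_threshold
```

Screening in this way before feature extraction is what guarantees that only valid, quality-meeting segments reach the scene classification model.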
It should be understood that the above division of components in the audio scene classification apparatus for audio monitoring shown in fig. 3 is only a division by logical function; in practice the components may be fully or partially integrated into one physical entity, or physically separated. All of these components may be implemented as software invoked by a processing element, or all in hardware, or some as software invoked by a processing element and some in hardware. For example, some of the above modules may be separately provided processing elements, or may be integrated in a chip of the electronic device; the implementation of the other components is similar. In addition, all or some of the components may be integrated together or implemented independently. In implementation, each step of the above method, or each component above, may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above components may be one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs). For another example, these components may be integrated together and implemented in the form of a system-on-a-chip (SoC).
In view of the foregoing embodiments and their preferred schemes, those skilled in the art will appreciate that in practice the technical concept of the present invention may be applied in a variety of embodiments; the invention is schematically illustrated by the following carriers:
(1) An audio scene classification device for audio monitoring. The device may specifically include: one or more processors, memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions, which when executed by the device, cause the device to perform the steps/functions of the foregoing embodiments or equivalent implementations.
Fig. 4 is a schematic structural diagram of an embodiment of an audio scene classification device applied to audio monitoring, where the device may be a server, a desktop PC, a notebook computer, an intelligent terminal, etc.
As shown in fig. 4, the audio scene classification apparatus 900 applied to audio monitoring includes a processor 910 and a memory 930. The processor 910 and the memory 930 may communicate with each other via an internal connection to transfer control and/or data signals; the memory 930 is configured to store a computer program, and the processor 910 is configured to call and execute the computer program from the memory 930. The processor 910 and the memory 930 may be combined into a single processing device, or, more commonly, be components independent of each other, with the processor 910 executing the program code stored in the memory 930 to realize the functions described above. In a specific implementation, the memory 930 may also be integrated within the processor 910, or be separate from the processor 910.
In addition, to further improve its functions, the audio scene classification apparatus 900 applied to audio monitoring may further include one or more of an input unit 960, a display unit 970, an audio circuit 980, a camera 990, a sensor 901, and the like, wherein the audio circuit may further include a speaker 982, a microphone 984, and so on. The display unit 970 may include a display screen.
Further, the apparatus 900 may also include a power supply 950 for providing electrical power to various devices or circuits in the apparatus 900.
It should be appreciated that the operation and/or function of the various components in the apparatus 900 may be found in particular in the foregoing description of embodiments of the method, system, etc., and detailed descriptions thereof are omitted here as appropriate to avoid redundancy.
It should be understood that the processor 910 in the audio scene classification apparatus 900 applied to audio monitoring shown in fig. 4 may be a system-on-a-chip (SoC); the processor 910 may include a central processing unit (CPU), and may further include other types of processors, for example a graphics processing unit (GPU), and so on.
In general, portions of the processors or processing units within the processor 910 may cooperate to implement the preceding method flows, and corresponding software programs for the portions of the processors or processing units may be stored in the memory 930.
(2) A readable storage medium having stored thereon a computer program or the above-mentioned means, which when executed, causes a computer to perform the steps/functions of the foregoing embodiments or equivalent implementations.
In several embodiments provided by the present invention, any function, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, may be embodied in the form of a software product as described below.
(3) A computer program product (which may comprise the apparatus described above) which, when run on a terminal device, causes the terminal device to perform the audio scene classification method of the previous embodiment or equivalent implementation applied to audio monitoring.
From the above description of embodiments, it will be apparent to those skilled in the art that all or part of the steps of the above methods may be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the above computer program product may include, but is not limited to, an APP; in connection with the foregoing, the device/terminal may be a computer device whose hardware structure may specifically further include: at least one processor, at least one communication interface, at least one memory and at least one communication bus, the processor, the communication interface and the memory all communicating with one another through the communication bus. The processor may be a central processing unit (CPU), a DSP, a microcontroller or a digital signal processor, and may further include a GPU, an embedded neural-network processing unit (NPU) and an image signal processor (ISP); the processor may further include an ASIC, or one or more integrated circuits configured to implement embodiments of the present invention. In addition, the processor may have the function of operating one or more software programs, which may be stored in a storage medium such as the memory; and the aforementioned memory/storage medium may include: nonvolatile memory, such as a non-removable magnetic disk, a USB flash disk, a removable hard disk or an optical disk, as well as read-only memory (ROM), random access memory (RAM), and so on.
In the embodiments of the present invention, "at least one" means one or more, and "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relations are possible; for example, A and/or B may indicate that A exists alone, that A and B exist together, or that B exists alone, where A and B may each be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" and similar expressions refer to any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b and c may represent: a; b; c; a and b; a and c; b and c; or a, b and c, where each of a, b and c may be single or multiple.
Those skilled in the art will appreciate that the modules, units, and method steps described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Those skilled in the art may implement the described functionality in different ways for each particular application, but such implementations should not be considered as going beyond the scope of the present invention.
Moreover, the embodiments in this specification are described in a progressive manner, and identical or similar parts of the embodiments may be referred to one another. In particular, since the embodiments of the apparatus, device, and the like are substantially similar to the method embodiments, their description is relatively brief, and reference may be made to the corresponding parts of the method embodiments. The above-described embodiments of the apparatus, device, etc. are merely illustrative: modules and units illustrated as separate components may or may not be physically separate, that is, they may be located in one place or distributed over multiple places, such as nodes of a system network; some or all of the modules and units may be selected according to actual needs to achieve the purpose of the embodiment scheme. Those skilled in the art can understand and implement this without creative effort.
The construction, features and effects of the present invention have been described in detail with reference to the embodiments shown in the drawings. The above are, however, only preferred embodiments of the invention, and those skilled in the art may reasonably combine and match the technical features of the above embodiments and their preferred modes into various equivalent schemes without departing from or altering the design concept and technical effect of the invention. The invention is therefore not limited to the embodiments shown in the drawings; all changes made according to the concept of the invention, and all equivalent embodiments whose modifications do not depart from the spirit covered by the specification and drawings, remain within the scope of the invention.

Claims (9)

1. An audio scene classification method applied to audio monitoring, comprising:
segmenting the acquired current audio data in real time to obtain a plurality of audio fragments;
judging in real time, during segmentation, whether the current audio fragment is available according to the quality of the audio fragment, including calculating a short-time energy value of the current audio fragment in real time and judging whether the current audio fragment is available according to the relation between the short-time energy value and a preset energy threshold;
extracting audio features of the audio fragments judged to be available in real time, and performing scene classification detection according to a scene classification model trained in advance on a recurrent neural network architecture; and
ending the processing of the current audio data when a scene classification result is determined based on at least one audio fragment.
2. The audio scene classification method applied to audio monitoring according to claim 1, wherein the scene classification detection process is as follows:
converting the available audio fragment to obtain a corresponding spectrogram;
extracting auxiliary acoustic features of the audio fragment;
obtaining, based on the spectrogram, its N-order differential spectrum and the auxiliary acoustic features, the matching degree at an acoustic level between the audio fragment and a plurality of preset scene audio types; and
determining a scene classification result according to the matching degree.
3. The audio scene classification method applied to audio monitoring according to claim 2, wherein the auxiliary acoustic features comprise: ambient noise features and/or speech pronunciation features.
4. The method for audio scene classification applied to audio monitoring according to claim 1, further comprising:
if no classification result is detected based on the currently input audio fragment, judging whether the detection process exceeds a preset first timing;
if it does, terminating the processing of the current audio data and re-collecting new audio data; and
if it does not, continuing detection with the next audio fragment.
5. The audio scene classification method applied to audio monitoring according to claim 1, characterized in that after said ending the processing of the current audio data, the method further comprises:
and after the preset second timing, continuing to collect new audio data and performing the processing.
6. An audio scene classification device for audio monitoring, comprising:
a segmentation module, configured to segment the acquired current audio data in real time to obtain a plurality of audio fragments;
an audio fragment screening module, configured to judge in real time, during segmentation, whether the current audio fragment is available according to the quality of the audio fragment, including calculating a short-time energy value of the current audio fragment in real time and judging whether the current audio fragment is available according to the relation between the short-time energy value and a preset energy threshold;
a scene type detection module, configured to extract in real time the audio features of the audio fragments judged to be available, and to perform scene classification detection according to a scene classification model trained in advance on a recurrent neural network architecture; and
a processing termination module, configured to end the processing of the current audio data when a scene classification result is determined based on at least one audio fragment.
7. The audio scene classification device applied to audio monitoring according to claim 6, wherein the scene type detection module comprises:
a spectral feature acquisition unit, configured to convert the available audio fragment to obtain a corresponding spectrogram;
an auxiliary acoustic feature acquisition unit, configured to extract auxiliary acoustic features of the audio fragment;
a matching degree calculation unit, configured to obtain, based on the spectrogram, its N-order differential spectrum and the auxiliary acoustic features, the matching degree at an acoustic level between the audio fragment and a plurality of preset scene audio types; and
a classification result determination unit, configured to determine the scene classification result according to the matching degree.
8. The audio scene classification device applied to audio monitoring according to claim 6 or 7, further comprising a processing timeliness monitoring module, which specifically comprises:
a detection timeout determination unit, configured to judge, if no classification result is detected based on the currently input audio fragment, whether the detection process exceeds a preset first timing;
a circulation processing unit, configured to terminate the processing of the current audio data and re-collect new audio data when the detection timeout determination unit outputs yes; and
a continuation processing unit, configured to continue detection with the next audio fragment when the detection timeout determination unit outputs no.
9. An audio scene classification device for audio monitoring, comprising:
one or more processors, memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions, which when executed by the device, cause the device to perform the audio scene classification method of any of claims 1-5 that is applied to audio monitoring.
CN202011506902.4A 2020-12-18 2020-12-18 Audio scene classification method, device and equipment applied to audio monitoring Active CN112562727B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011506902.4A CN112562727B (en) 2020-12-18 2020-12-18 Audio scene classification method, device and equipment applied to audio monitoring

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011506902.4A CN112562727B (en) 2020-12-18 2020-12-18 Audio scene classification method, device and equipment applied to audio monitoring

Publications (2)

Publication Number Publication Date
CN112562727A CN112562727A (en) 2021-03-26
CN112562727B true CN112562727B (en) 2024-04-26

Family

ID=75030890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011506902.4A Active CN112562727B (en) 2020-12-18 2020-12-18 Audio scene classification method, device and equipment applied to audio monitoring

Country Status (1)

Country Link
CN (1) CN112562727B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712809B (en) * 2021-03-29 2021-06-18 北京远鉴信息技术有限公司 Voice detection method and device, electronic equipment and storage medium
CN113033707B (en) * 2021-04-25 2023-08-04 北京有竹居网络技术有限公司 Video classification method and device, readable medium and electronic equipment

Citations (7)

Publication number Priority date Publication date Assignee Title
EP3223253A1 (en) * 2016-03-23 2017-09-27 Thomson Licensing Multi-stage audio activity tracker based on acoustic scene recognition
CN107799126A (en) * 2017-10-16 2018-03-13 深圳狗尾草智能科技有限公司 Sound end detecting method and device based on Supervised machine learning
CN110610722A (en) * 2019-09-26 2019-12-24 北京工业大学 Short-time energy and Mel cepstrum coefficient combined novel low-complexity dangerous sound scene discrimination method based on vector quantization
CN110827795A (en) * 2018-08-07 2020-02-21 阿里巴巴集团控股有限公司 Voice input end judgment method, device, equipment, system and storage medium
CN110866143A (en) * 2019-11-08 2020-03-06 山东师范大学 Audio scene classification method and system
CN111279414A (en) * 2017-11-02 2020-06-12 华为技术有限公司 Segmentation-based feature extraction for sound scene classification
CN111653290A (en) * 2020-05-29 2020-09-11 北京百度网讯科技有限公司 Audio scene classification model generation method, device, equipment and storage medium

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10546575B2 (en) * 2016-12-14 2020-01-28 International Business Machines Corporation Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier
CN108766418B (en) * 2018-05-24 2020-01-14 百度在线网络技术(北京)有限公司 Voice endpoint recognition method, device and equipment

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
EP3223253A1 (en) * 2016-03-23 2017-09-27 Thomson Licensing Multi-stage audio activity tracker based on acoustic scene recognition
CN107799126A (en) * 2017-10-16 2018-03-13 深圳狗尾草智能科技有限公司 Sound end detecting method and device based on Supervised machine learning
CN111279414A (en) * 2017-11-02 2020-06-12 华为技术有限公司 Segmentation-based feature extraction for sound scene classification
CN110827795A (en) * 2018-08-07 2020-02-21 阿里巴巴集团控股有限公司 Voice input end judgment method, device, equipment, system and storage medium
CN110610722A (en) * 2019-09-26 2019-12-24 北京工业大学 Short-time energy and Mel cepstrum coefficient combined novel low-complexity dangerous sound scene discrimination method based on vector quantization
CN110866143A (en) * 2019-11-08 2020-03-06 山东师范大学 Audio scene classification method and system
CN111653290A (en) * 2020-05-29 2020-09-11 北京百度网讯科技有限公司 Audio scene classification model generation method, device, equipment and storage medium

Non-Patent Citations (1)

Title
Video scene detection based on audio analysis; Ni Ning; Lu Gang; Bu Jiajun; Computer Simulation (08); full text *

Also Published As

Publication number Publication date
CN112562727A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
US11216729B2 (en) Recognition system and recognition method
Chatlani et al. Local binary patterns for 1-D signal processing
CN109978890B (en) Target extraction method and device based on image processing and terminal equipment
CN108172213B (en) Surge audio identification method, surge audio identification device, surge audio identification equipment and computer readable medium
KR101910540B1 (en) Apparatus and method for recognizing radar waveform using time-frequency analysis and neural network
CN112562727B (en) Audio scene classification method, device and equipment applied to audio monitoring
CN110890102A (en) Engine defect detection algorithm based on RNN voiceprint recognition
KR100957716B1 (en) Extraction Method of Skin-Colored Region using Variable Skin Color Model
US11435719B2 (en) System and method for identifying manufacturing defects
CN110718235A (en) Abnormal sound detection method, electronic device and storage medium
JP7305046B2 (en) Image classification method, device and equipment
CN106303524B (en) Video dual-compression detection method based on prediction residual error abnormal mode
CN116705059B (en) Audio semi-supervised automatic clustering method, device, equipment and medium
CN113705361A (en) Method and device for detecting model in living body and electronic equipment
CN110718270B (en) Method, device, equipment and storage medium for detecting type of gene sequencing result
CN113255766B (en) Image classification method, device, equipment and storage medium
CN115670397A (en) PPG artifact identification method and device, storage medium and electronic equipment
CN115438658A (en) Entity recognition method, recognition model training method and related device
CN114171057A (en) Transformer event detection method and system based on voiceprint
US20200252587A1 (en) Video camera
CN109273025B (en) Chinese ethnic five-tone emotion recognition method and system
CN113571092A (en) Method for identifying abnormal sound of engine and related equipment thereof
CN111669575A (en) Method, system, electronic device, medium and terminal for testing image processing effect
CN112712791B (en) Mute voice detection method, mute voice detection device, terminal equipment and storage medium
CN115631448B (en) Audio and video quality inspection processing method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant