CN114283841B - Audio classification method, system, device and storage medium - Google Patents


Info

Publication number: CN114283841B
Application number: CN202111560886.1A
Authority: CN (China)
Prior art keywords: audio, determining, audio signal, short-time average
Legal status: Active (assumed; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN114283841A
Inventor: Wang Wei (王伟)
Current Assignee: iMusic Culture and Technology Co Ltd
Original Assignee: iMusic Culture and Technology Co Ltd
Application filed by iMusic Culture and Technology Co Ltd
Priority to CN202111560886.1A
Publication of application CN114283841A
Application granted; publication of CN114283841B

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in wireless communication networks

Landscapes

  • Signal Processing For Digital Recording And Reproducing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an audio classification method, system, device, and storage medium. The method comprises the following steps: acquiring a first audio signal to be classified and framing it to obtain a second audio signal; performing endpoint detection on the second audio signal and removing low-energy audio segments at its head and tail to obtain a third audio signal; determining the short-time average zero-crossing rate of each audio frame in the third audio signal, and from it the number of first audio frames whose short-time average zero-crossing rate is greater than or equal to a preset first threshold and the fluctuation of the short-time average zero-crossing rate; and classifying the first audio signal according to the number of first audio frames and the fluctuation. By framing the audio signal, performing endpoint detection, and determining the short-time average zero-crossing rate of the audio frames, the invention can classify audio signals and identify pure music audio, pure voice audio, and mixed audio, improving the accuracy of audio classification, and can be widely applied in the technical field of audio classification.

Description

Audio classification method, system, device and storage medium
Technical Field
The invention relates to the technical field of audio classification, and in particular to an audio classification method, system, device, and storage medium.
Background
All sounds audible to the human ear are called audio. By form of expression, audio can be classified into voice, music, silence, environmental sound, and noise; voice and music are the two most important audio data types.
Large-scale media databases contain both pure voice audio and pure music audio, such as broadcast recordings and piano pieces, as well as audio in which voice and music are mixed, such as recitations with emotional background music and songs with accompaniment. In the prior art, music and voice can be classified accurately when many characteristic parameters are extracted, but the classification effect on mixed audio containing both voice and music remains poor.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art to a certain extent.
Therefore, an object of the embodiments of the present invention is to provide an audio classification method that performs framing and endpoint detection on an audio signal and determines the short-time average zero-crossing rate of its audio frames, so as to determine the number of first audio frames whose short-time average zero-crossing rate is greater than or equal to a preset first threshold and the fluctuation of the short-time average zero-crossing rate. The audio signal can thereby be classified and pure music audio, pure voice audio, and mixed audio identified, improving the accuracy of audio classification.
It is another object of an embodiment of the present invention to provide an audio classification system.
In order to achieve the technical purpose, the technical scheme adopted by the embodiment of the invention comprises the following steps:
in a first aspect, an embodiment of the present invention provides an audio classification method, including the steps of:
acquiring a first audio signal to be classified, and framing the first audio signal to obtain a second audio signal;
performing endpoint detection on the second audio signal, and removing low-energy audio segments positioned at the head and tail of the second audio signal to obtain a third audio signal;
determining a short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining the number of first audio frames with the short-time average zero-crossing rate being greater than or equal to a preset first threshold value and the fluctuation condition of the short-time average zero-crossing rate;
classifying the first audio signal according to the first audio frame number and the fluctuation condition.
Further, in one embodiment of the present invention, the step of framing the first audio signal to obtain a second audio signal specifically includes:
framing the first audio signal to obtain a plurality of second audio frames, wherein two adjacent second audio frames have an overlapping portion;
determining a second audio signal from the plurality of second audio frames.
Further, in one embodiment of the present invention, the step of performing endpoint detection on the second audio signal and removing low energy audio segments located at the head and tail of the second audio signal to obtain a third audio signal specifically includes:
determining first short-time frame energies of a plurality of second audio frames located at the head of the second audio signal and second short-time frame energies of a plurality of second audio frames located at the tail of the second audio signal;
comparing the first short time frame energy and the second short time frame energy with a preset second threshold;
and when the energy of the first short-time frame is smaller than the second threshold value, removing a second audio frame corresponding to the energy of the first short-time frame, and when the energy of the second short-time frame is smaller than the second threshold value, removing the second audio frame corresponding to the energy of the second short-time frame, so as to obtain a third audio signal.
Further, in an embodiment of the present invention, the step of determining a short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining a first number of audio frames with a short-time average zero-crossing rate greater than or equal to a preset first threshold and a fluctuation condition of the short-time average zero-crossing rate specifically includes:
determining a short-time average zero-crossing rate for each audio frame in the third audio signal;
comparing the short-time average zero-crossing rate with the first threshold value, and determining the number of first audio frames with the short-time average zero-crossing rate being greater than or equal to the first threshold value;
and performing curve fitting according to the short-time average zero crossing rate to obtain a short-time average zero crossing rate fluctuation curve, and further determining the change rate of the short-time average zero crossing rate fluctuation curve at each moment.
Further, in one embodiment of the present invention, the step of classifying the first audio signal according to the first audio frame number and the fluctuation condition specifically includes:
comparing the change rate with a preset third threshold value, and determining that the first audio signal is pure music audio when the change rate at each moment is smaller than or equal to the third threshold value, or determining that the first audio signal is non-pure music audio otherwise;
comparing the first audio frame number with a preset fourth threshold value, and determining that the first audio signal is pure voice audio when the first audio frame number is smaller than or equal to the fourth threshold value, or determining that the first audio signal is non-pure voice audio when the first audio frame number is greater than the fourth threshold value;
and when the first audio signal is the impure music audio and the impure voice audio at the same time, determining that the first audio signal is the mixed audio.
Further, in one embodiment of the present invention, when the first audio signal is determined to be mixed audio, the audio classification method further includes the steps of:
the first audio signal is segmented to obtain a plurality of first audio segments, the impure audio frequency part of the first audio segment is determined, and then the audio deviation of the first audio signal is determined according to the audio length of the impure audio frequency part.
Further, in one embodiment of the present invention, the step of segmenting the first audio signal to obtain a plurality of first audio segments, and determining an impure audio portion of the first audio segments, and further determining an audio bias of the first audio signal according to an audio length of the impure audio portion specifically includes:
dividing the first audio signal into a plurality of first audio segments with the same audio length;
performing end point detection on the first audio segment by a double-threshold method, determining a pure audio frequency part and an impure audio frequency part of each first audio segment, and further determining a first audio frequency length of the impure audio frequency part;
comparing the first audio length with a preset fifth threshold value, and determining that the corresponding first audio segment is an audio segment when the first audio length is smaller than or equal to the fifth threshold value, otherwise, determining that the corresponding first audio segment is a voice audio segment;
and determining the audio bias of the first audio signal according to the number of the audio frequency segments and the voice audio frequency segments.
In a second aspect, an embodiment of the present invention provides an audio classification system, including:
the framing processing module is used for acquiring a first audio signal to be classified, and framing the first audio signal to obtain a second audio signal;
the endpoint detection module is used for performing endpoint detection on the second audio signal, and removing low-energy audio segments located at the head and tail of the second audio signal to obtain a third audio signal;
the short-time average zero-crossing rate determining module is used for determining the short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining the number of first audio frames whose short-time average zero-crossing rate is greater than or equal to a preset first threshold value and the fluctuation of the short-time average zero-crossing rate;
and the classification module is used for classifying the first audio signal according to the number of first audio frames and the fluctuation.
In a third aspect, an embodiment of the present invention provides an audio classification apparatus, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement an audio classification method as described above.
In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium in which a processor-executable program is stored, which when executed by a processor is configured to perform an audio classification method as described above.
The advantages and benefits of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
An embodiment of the invention acquires a first audio signal to be classified and frames it to obtain a second audio signal. Endpoint detection is then performed on the second audio signal, and low-energy audio segments at its head and tail are removed to obtain a third audio signal. The short-time average zero-crossing rate of each audio frame in the third audio signal is determined, from which the number of first audio frames whose short-time average zero-crossing rate is greater than or equal to a preset first threshold and the fluctuation of the short-time average zero-crossing rate are determined; the first audio signal is then classified according to the number of first audio frames and the fluctuation. By framing the audio signal, performing endpoint detection, and determining the short-time average zero-crossing rate of the audio frames, the embodiment of the invention can classify audio signals and identify pure music audio, pure voice audio, and mixed audio, improving the accuracy of audio classification.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following description will refer to the drawings that are needed in the embodiments of the present invention, and it should be understood that the drawings in the following description are only for convenience and clarity to describe some embodiments in the technical solutions of the present invention, and other drawings may be obtained according to these drawings without any inventive effort for those skilled in the art.
Fig. 1 is a flowchart of steps of an audio classification method according to an embodiment of the present invention;
fig. 2 is a block diagram of an audio classification system according to an embodiment of the present invention;
fig. 3 is a block diagram of an audio classification device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the description of the present invention, the plurality means two or more, and if the description is made to the first and second for the purpose of distinguishing technical features, it should not be construed as indicating or implying relative importance or implicitly indicating the number of the indicated technical features or implicitly indicating the precedence of the indicated technical features. Furthermore, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art.
Referring to fig. 1, an embodiment of the present invention provides an audio classification method, which specifically includes the following steps:
s101, acquiring a first audio signal to be classified, and carrying out framing processing on the first audio signal to obtain a second audio signal.
In particular, both the characteristics of an audio signal and the parameters characterizing its essential features vary with time, so the signal as a whole is a non-stationary process and cannot be analyzed with digital signal processing techniques designed for stationary signals. However, different voiced sounds are responses generated by particular shapes of the vocal tract, formed by movements of a person's oral muscles, and these movements are very slow relative to the frequencies of the voice. Although the voice signal is time-varying, its characteristics therefore remain essentially unchanged, i.e. relatively stable, over a short time range (generally considered to be 10-30 ms), so it can be regarded as a quasi-stationary process: the voice signal has short-time stationarity. The analysis and processing of any voice signal must be based on this "short time", i.e. a "short-time analysis" in which the voice signal is divided into segments, each of which is called a "frame"; the frame length is typically 10-30 ms. The entire voice signal is thus analyzed as a time series of the characteristic parameters of each frame.
Further as an optional implementation manner, the step of framing the first audio signal to obtain the second audio signal specifically includes:
a1, carrying out framing treatment on a first audio signal to obtain a plurality of second audio frames, wherein two adjacent second audio frames have overlapping parts;
a2, determining a second audio signal according to the plurality of second audio frames.
Specifically, the framing in the embodiment of the present invention adopts overlapping segmentation, that is, adjacent audio frames share an overlapping portion, so that frames transition smoothly into one another and continuity between frames is maintained. The overlapping portion of the previous frame and the next frame is called the frame shift; in the embodiment of the present invention, the ratio of the frame shift to the frame length lies in the range (0, 1/2).
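As a concrete illustration, the overlapping framing described above can be sketched as follows. This is a minimal NumPy sketch, not the patent's implementation; the 16 kHz sample rate, 25 ms frame length, and 10 ms overlap are illustrative choices (the overlap-to-frame-length ratio of 0.4 falls in the patent's range (0, 1/2)):

```python
import numpy as np

def frame_signal(signal, frame_len, overlap):
    """Split a 1-D signal into overlapping frames.

    `overlap` is the number of samples shared by adjacent frames
    (the "frame shift" in the patent's terminology).
    """
    hop = frame_len - overlap          # step between successive frame starts
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

# Example: 25 ms frames at 16 kHz with a 10 ms overlap
sr = 16000
frame_len = int(0.025 * sr)   # 400 samples per frame
overlap = int(0.010 * sr)     # 160 shared samples (ratio 0.4 < 1/2)
x = np.random.randn(sr)       # 1 s of dummy audio
frames = frame_signal(x, frame_len, overlap)
```

Each frame's last 160 samples reappear as the next frame's first 160 samples, which is exactly the smooth-transition property the overlapping segmentation is meant to provide.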
S102, performing end point detection on the second audio signal, and removing low-energy audio segments positioned at the head and tail of the second audio signal to obtain a third audio signal.
In particular, endpoint detection, also called voice activity detection (VAD), aims to distinguish audio regions from non-audio regions. In plain terms, endpoint detection accurately locates the starting point and ending point of the audio within a noisy recording, removes the silent and noise parts, and finds the truly effective content of a piece of audio. Step S102 specifically includes the following steps:
s1021, determining first short-time frame energies of a plurality of second audio frames positioned at the head of the second audio signal and second short-time frame energies of a plurality of second audio frames positioned at the tail of the second audio signal;
s1022, comparing the first short-time frame energy and the second short-time frame energy with a preset second threshold;
s1023, when the energy of the first short time frame is smaller than a second threshold value, removing a second audio frame corresponding to the energy of the first short time frame, and when the energy of the second short time frame is smaller than the second threshold value, removing a second audio frame corresponding to the energy of the second short time frame, and obtaining a third audio signal.
Specifically, the embodiment of the invention detects the second audio signal with an endpoint detection method based on short-time energy, and removes the lower-energy audio segments at the head and tail of the second audio signal to obtain a third audio signal.
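A minimal sketch of this head-and-tail trimming, assuming the framed signal from the previous step; the energy threshold value is illustrative, since the patent leaves the second threshold to be tuned:

```python
import numpy as np

def trim_low_energy(frames, second_threshold):
    """Remove low-energy frames from the head and tail of a framed signal.

    Only the leading and trailing runs of frames whose short-time energy
    falls below the threshold are removed, mirroring the head/tail removal
    in step S102; interior low-energy frames are kept.
    """
    energy = np.sum(frames.astype(float) ** 2, axis=1)  # short-time frame energy
    keep = np.flatnonzero(energy >= second_threshold)
    if keep.size == 0:
        return frames[:0]                 # every frame was below the threshold
    return frames[keep[0] : keep[-1] + 1]

# Silence at head and tail, plus one quiet frame in the middle that survives
frames = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.1], [1.0, 1.0], [0.0, 0.0]])
trimmed = trim_low_energy(frames, second_threshold=0.5)  # keeps frames 1..3
```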
S103, determining the short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining the number of the first audio frames with the short-time average zero-crossing rate being larger than or equal to a preset first threshold value and the fluctuation condition of the short-time average zero-crossing rate.
Specifically, because voice is an alternation of unvoiced and voiced sounds, and the short-time average zero-crossing rate of voiced sounds is smaller than that of unvoiced sounds, the short-time average zero-crossing rate of a voice audio signal fluctuates strongly. Step S103 specifically includes the following steps:
s1031, determining a short-time average zero-crossing rate of each audio frame in the third audio signal;
s1032, comparing the short-time average zero-crossing rate with a first threshold value, and determining the number of first audio frames with the short-time average zero-crossing rate being greater than or equal to the first threshold value;
s1033, performing curve fitting according to the short-time average zero-crossing rate to obtain a short-time average zero-crossing rate fluctuation curve, and further determining the change rate of the short-time average zero-crossing rate fluctuation curve at each moment.
Specifically, after the short-time average zero-crossing rate of each audio frame is determined, curve fitting is performed according to the time-sequence relation of the audio frames to obtain a short-time average zero-crossing rate fluctuation curve that changes over time. The rate of change of this fluctuation curve at each moment can then be determined, and this rate of change reflects the fluctuation of the short-time average zero-crossing rate.
Each audio signal segment is processed and intercepted with the same parameter settings. A first threshold is preset for distinguishing voice signals from music signals, and the number of audio frames in each segment whose short-time average zero-crossing rate is greater than or equal to the first threshold is counted, giving the number of first audio frames.
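Under the same assumptions as the earlier sketches, the short-time average zero-crossing rate, the first-audio-frame count, and a fitted fluctuation curve might be computed as follows. The polynomial fit and its degree are illustrative choices, since the patent does not specify the curve-fitting method:

```python
import numpy as np

def short_time_zcr(frames):
    """Short-time average zero-crossing rate of each frame."""
    signs = np.sign(frames)
    crossings = 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)
    return crossings / frames.shape[1]

def count_first_audio_frames(zcr, first_threshold):
    """Number of frames whose ZCR is >= the preset first threshold."""
    return int(np.sum(zcr >= first_threshold))

def zcr_change_rate(zcr, degree=3):
    """Fit a polynomial fluctuation curve to the ZCR sequence and
    return its rate of change at each frame instant."""
    t = np.arange(len(zcr))
    coeffs = np.polyfit(t, zcr, degree)
    return np.polyval(np.polyder(coeffs), t)

frames = np.array([[1.0, -1.0, 1.0, -1.0],   # rapidly alternating: high ZCR
                   [1.0, 1.0, 1.0, 1.0]])    # no sign change: zero ZCR
zcr = short_time_zcr(frames)                          # [0.75, 0.0]
n_first = count_first_audio_frames(zcr, first_threshold=0.5)  # 1
```

On a longer ZCR sequence, `zcr_change_rate` gives the per-moment rate of change used by the classification step that follows.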
S104, classifying the first audio signals according to the number of the first audio frames and the fluctuation condition.
Further as an alternative embodiment, step S104 specifically includes the steps of:
s1041, comparing the change rate with a preset third threshold value, and determining that the first audio signal is pure music audio when the change rate at each moment is smaller than or equal to the third threshold value, otherwise, determining that the first audio signal is non-pure music audio;
s1042, comparing the number of the first audio frames with a preset fourth threshold, and determining that the first audio signal is pure voice audio when the number of the first audio frames is smaller than or equal to the fourth threshold, otherwise, determining that the first audio signal is non-pure voice audio;
s1043, when the first audio signal is the impure music audio and the impure voice audio at the same time, determining that the first audio signal is the mixed audio.
Specifically, since pure voice audio and mixed audio both contain voice signals, their short-time average zero-crossing rates fluctuate strongly, whereas the short-time average zero-crossing rate of pure music audio fluctuates little. Non-pure music audio can therefore be identified by comparing the rate of change obtained in the previous step with the preset third threshold.
Since pure music audio and mixed audio both contain music signals, the short-time average zero-crossing rate of most of their audio frames exceeds the preset threshold, i.e. the number of audio frames whose short-time average zero-crossing rate is greater than or equal to the first threshold exceeds the fourth threshold. Comparing the number of first audio frames obtained in the previous step with the preset fourth threshold therefore identifies non-pure voice audio, while for pure voice audio the number of first audio frames is smaller than or equal to the fourth threshold.
By classifying the first audio signal twice in this way, the first audio signal can be determined to be mixed audio when it is both non-pure music audio and non-pure voice audio.
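The two comparisons and the final mixed-audio decision can be sketched as a single function. This is a sketch under the assumption that the thresholds have already been tuned; taking the absolute value of the change rate is my reading of "change rate", and the handling of the degenerate case where both tests pass is my addition, not specified in the patent:

```python
def classify_audio(n_first_frames, change_rates, third_threshold, fourth_threshold):
    """Two-stage decision: fluctuation test for music, frame-count test for voice."""
    # Stage 1: small ZCR fluctuation at every moment => pure music audio
    pure_music = all(abs(r) <= third_threshold for r in change_rates)
    # Stage 2: few high-ZCR frames => pure voice audio
    pure_voice = n_first_frames <= fourth_threshold
    if pure_music and not pure_voice:
        return "pure music audio"
    if pure_voice and not pure_music:
        return "pure voice audio"
    if not pure_music and not pure_voice:
        return "mixed audio"
    return "indeterminate"  # both tests passed; thresholds likely need retuning
```

For example, a signal with many high-ZCR frames and strong fluctuation fails both tests and is labeled mixed audio.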
Further as an alternative embodiment, when determining that the first audio signal is mixed audio, the audio classification method further comprises the steps of:
the first audio signal is segmented to obtain a plurality of first audio segments, the impure audio frequency part of the first audio segment is determined, and then the audio deviation of the first audio signal is determined according to the audio length of the impure audio frequency part.
Further, as an alternative embodiment, the step of segmenting the first audio signal to obtain a plurality of first audio segments, determining an impure audio portion of the first audio segments, and determining an audio bias of the first audio signal according to an audio length of the impure audio portion specifically includes:
b1, dividing the first audio signal into a plurality of first audio segments with the same audio length;
b2, performing endpoint detection on the first audio segments by a double-threshold method, determining the pure audio portion and the non-pure audio portion of each first audio segment, and then determining the first audio length of the non-pure audio portion;
b3, comparing the first audio length with a preset fifth threshold value, determining that the corresponding first audio segment is a music audio segment when the first audio length is smaller than or equal to the fifth threshold value, and otherwise determining that the corresponding first audio segment is a voice audio segment;
b4, determining the audio bias of the first audio signal according to the numbers of music audio segments and voice audio segments.
Specifically, when the first audio signal is mixed audio, it is segmented into an odd number of first audio segments of equal audio length. Endpoint detection is then performed on the first audio segments with a double-threshold method to determine the pure audio portion and the non-pure audio portion of each first audio segment; the pure audio portions are treated as noise, the intervals between the non-pure audio portions are removed, and the first audio length of the non-pure audio portion is determined.
A fifth threshold is preset for judging whether a first audio segment is a music audio segment biased towards music or a voice audio segment biased towards voice. The first audio length of the non-pure audio portion of each first audio segment is compared with the fifth threshold: a first audio segment whose first audio length is smaller than or equal to the fifth threshold (strictly smaller, when the number of segments is odd) is a music audio segment; otherwise it is a voice audio segment.
When the number of music audio segments in the first audio signal is greater than the number of voice audio segments, the first audio signal is determined to be mixed audio biased towards music; otherwise, the first audio signal is determined to be mixed audio biased towards voice.
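The segment-level vote in steps B1-B4 might look like the sketch below. The inputs are the first audio lengths of the non-pure audio portions of each segment (names and the 0.5 s threshold are illustrative), and the odd segment count guarantees the vote cannot tie:

```python
def audio_bias(non_pure_lengths, fifth_threshold):
    """Vote over segments: short non-pure portion => music segment, else voice."""
    assert len(non_pure_lengths) % 2 == 1, "patent divides into an odd number of segments"
    music = sum(1 for length in non_pure_lengths if length <= fifth_threshold)
    voice = len(non_pure_lengths) - music
    return "music-biased" if music > voice else "voice-biased"

# Three equal-length segments with non-pure portions of 0.2 s, 0.9 s, 0.3 s
bias = audio_bias([0.2, 0.9, 0.3], fifth_threshold=0.5)  # "music-biased"
```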
It should be appreciated that the embodiments of the present invention may take a plurality of known classes of audio signals as samples, perform a plurality of tests to determine each threshold, and adjust each threshold according to the accuracy of the classification until a preset test accuracy is achieved.
The method steps of the embodiments of the present invention are described above. It can be understood that by framing the audio signal, performing endpoint detection, and determining the short-time average zero-crossing rate of the audio frames, the embodiment of the invention determines the number of first audio frames whose short-time average zero-crossing rate is greater than or equal to the preset first threshold and the fluctuation of the short-time average zero-crossing rate, thereby classifying the audio signal, identifying pure music audio, pure voice audio, and mixed audio, and improving the accuracy of audio classification.
In addition, the embodiment of the invention can not only recognize and classify pure voice or pure music, but also separate out audio in which voice and music are mixed and classify the bias of such mixed audio. For example, given four pieces of audio (broadcast audio, a performance on a single instrument, an emotional recitation with background music, and a song), the embodiment processes each piece: it first identifies the single-instrument performance from the fluctuation of the short-time average zero-crossing rate and classifies it as pure music audio; it then classifies the broadcast audio as pure voice audio by comparing the short-time zero-crossing rate count with the threshold; finally, it lists the recitation with background music and the song separately as mixed audio and classifies their bias.
Referring to fig. 2, an embodiment of the present invention provides an audio classification system, including:
the framing processing module is used for acquiring a first audio signal to be classified, and framing the first audio signal to obtain a second audio signal;
the endpoint detection module is used for performing endpoint detection on the second audio signal and removing low-energy audio segments located at the head and tail of the second audio signal to obtain a third audio signal;
the short-time average zero-crossing rate determining module is used for determining the short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining the number of first audio frames whose short-time average zero-crossing rate is greater than or equal to a preset first threshold and the fluctuation condition of the short-time average zero-crossing rate;
and the classification module is used for classifying the first audio signal according to the number of the first audio frames and the fluctuation condition.
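The head/tail trimming performed by the endpoint detection module can be sketched as follows; the short-time energy measure and the threshold handling are illustrative assumptions, not the embodiment's exact procedure:

```python
import numpy as np

def trim_low_energy_ends(frames, second_thresh):
    """Drop frames at the head and tail of a framed signal whose
    short-time energy is below `second_thresh`, keeping everything
    between the first and last sufficiently energetic frames."""
    energies = np.array([np.sum(np.asarray(f, dtype=float) ** 2) for f in frames])
    keep = energies >= second_thresh
    if not keep.any():
        return []  # the entire signal is low-energy
    first = int(np.argmax(keep))                     # first energetic frame
    last = len(keep) - 1 - int(np.argmax(keep[::-1]))  # last energetic frame
    return frames[first:last + 1]
```

Note that interior low-energy frames are deliberately kept: only leading and trailing silence is removed, matching the description above.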
The content of the method embodiment is applicable to this system embodiment; the system embodiment realizes the same functions and achieves the same beneficial effects as the method embodiment.
Referring to fig. 3, an embodiment of the present invention provides an audio classification apparatus, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement an audio classification method as described above.
The content of the method embodiment is likewise applicable to this device embodiment; the device embodiment realizes the same functions and achieves the same beneficial effects as the method embodiment.
The embodiment of the present invention also provides a computer-readable storage medium in which a processor-executable program is stored; when executed by a processor, the program performs the audio classification method described above.
The computer-readable storage medium of the embodiment of the invention can carry out the audio classification method provided by the method embodiments of the invention, executing any combination of the steps of those embodiments, and thus has the corresponding functions and beneficial effects of the method.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device may read the computer instructions from the computer-readable storage medium and execute them, causing the computer device to perform the method shown in fig. 1.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the present invention has been described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the functions and/or features described above may be integrated in a single physical device and/or software module or one or more of the functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The above functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solution of the present invention, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the above-described methods of the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber device, and a portable Compact Disc Read-Only Memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium upon which the program is printed, as the program may be electronically captured (for instance, via optical scanning of the paper or other medium), then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, Programmable Gate Arrays (PGAs), Field-Programmable Gate Arrays (FPGAs), and the like.
In the description of the present specification, reference to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (6)

1. An audio classification method, comprising the steps of:
acquiring a first audio signal to be classified, and performing framing processing on the first audio signal to obtain a second audio signal;
performing endpoint detection on the second audio signal, and removing low-energy audio segments positioned at the head and tail of the second audio signal to obtain a third audio signal;
determining a short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining the number of first audio frames with the short-time average zero-crossing rate being greater than or equal to a preset first threshold value and the fluctuation condition of the short-time average zero-crossing rate;
classifying the first audio signal according to the number of the first audio frames and the fluctuation condition, determining whether the first audio signal is pure music audio and whether it is pure voice audio, and determining that the first audio signal is mixed audio when the first audio signal is neither pure music audio nor pure voice audio;
segmenting the first audio signal to obtain a plurality of first audio segments, determining an impure audio part of each first audio segment, and further determining the audio bias of the first audio signal according to the audio length of the impure audio part;
the step of determining a short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining a first audio frame number with the short-time average zero-crossing rate being greater than or equal to a preset first threshold value and a fluctuation condition of the short-time average zero-crossing rate specifically includes:
determining a short-time average zero-crossing rate for each audio frame in the third audio signal;
comparing the short-time average zero-crossing rate with the first threshold value, and determining the number of first audio frames with the short-time average zero-crossing rate being greater than or equal to the first threshold value;
performing curve fitting according to the short-time average zero crossing rate to obtain a short-time average zero crossing rate fluctuation curve, and further determining the change rate of the short-time average zero crossing rate fluctuation curve at each moment;
the step of classifying the first audio signal according to the number of the first audio frames and the fluctuation condition, and determining whether the first audio signal is pure audio or not and whether the first audio signal is pure voice audio or not specifically includes:
comparing the change rate with a preset third threshold value, determining that the first audio signal is pure music audio when the change rate at each moment is smaller than or equal to the third threshold value, and otherwise determining that the first audio signal is non-pure music audio;
comparing the first audio frame number with a preset fourth threshold value, determining that the first audio signal is pure voice audio when the first audio frame number is smaller than or equal to the fourth threshold value, and determining that the first audio signal is non-pure voice audio when the first audio frame number is greater than the fourth threshold value;
the step of segmenting the first audio signal to obtain a plurality of first audio segments, determining an impure audio portion of the first audio segment, and determining an audio bias of the first audio signal according to an audio length of the impure audio portion, includes:
dividing the first audio signal into a plurality of first audio segments with the same audio length;
performing endpoint detection on the first audio segments by a double-threshold method, determining a pure audio part and an impure audio part of each first audio segment, and further determining a first audio length of the impure audio part;
comparing the first audio length with a preset fifth threshold value, determining that the corresponding first audio segment is a music audio segment when the first audio length is smaller than or equal to the fifth threshold value, and otherwise determining that the corresponding first audio segment is a voice audio segment;
and determining the audio bias of the first audio signal according to the numbers of the music audio segments and the voice audio segments.
2. The method for audio classification according to claim 1, wherein the step of framing the first audio signal to obtain a second audio signal comprises:
performing framing processing on the first audio signal to obtain a plurality of second audio frames, wherein two adjacent second audio frames have an overlapping part;
a second audio signal is determined from a plurality of the second audio frames.
3. The method of audio classification according to claim 2, wherein the step of performing endpoint detection on the second audio signal and removing low-energy audio segments located at the head and tail of the second audio signal to obtain a third audio signal comprises:
determining first short-time frame energies of a plurality of second audio frames located at the head of the second audio signal and second short-time frame energies of a plurality of second audio frames located at the tail of the second audio signal;
comparing the first short time frame energy and the second short time frame energy with a preset second threshold;
and when the energy of the first short-time frame is smaller than the second threshold value, removing a second audio frame corresponding to the energy of the first short-time frame, and when the energy of the second short-time frame is smaller than the second threshold value, removing the second audio frame corresponding to the energy of the second short-time frame, so as to obtain a third audio signal.
4. An audio classification system, comprising:
the framing processing module is used for acquiring a first audio signal to be classified, and framing the first audio signal to obtain a second audio signal;
the endpoint detection module is used for performing endpoint detection on the second audio signal and removing low-energy audio segments located at the head and tail of the second audio signal to obtain a third audio signal;
the short-time average zero-crossing rate determining module is used for determining the short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining the number of first audio frames whose short-time average zero-crossing rate is greater than or equal to a preset first threshold value and the fluctuation condition of the short-time average zero-crossing rate;
the classification module is used for classifying the first audio signal according to the number of the first audio frames and the fluctuation condition, determining whether the first audio signal is pure music audio and whether it is pure voice audio, and determining that the first audio signal is mixed audio when the first audio signal is neither pure music audio nor pure voice audio;
the audio deviation determining module is used for segmenting the first audio signal to obtain a plurality of first audio segments, determining an impure audio part of each first audio segment, and further determining the audio bias of the first audio signal according to the audio length of the impure audio part;
the short-time average zero-crossing rate determining module is specifically configured to:
determining a short-time average zero-crossing rate for each audio frame in the third audio signal;
comparing the short-time average zero-crossing rate with the first threshold value, and determining the number of first audio frames with the short-time average zero-crossing rate being greater than or equal to the first threshold value;
performing curve fitting according to the short-time average zero crossing rate to obtain a short-time average zero crossing rate fluctuation curve, and further determining the change rate of the short-time average zero crossing rate fluctuation curve at each moment;
the classification module is specifically configured to:
comparing the change rate with a preset third threshold value, determining that the first audio signal is pure music audio when the change rate at each moment is smaller than or equal to the third threshold value, and otherwise determining that the first audio signal is non-pure music audio;
comparing the first audio frame number with a preset fourth threshold value, determining that the first audio signal is pure voice audio when the first audio frame number is smaller than or equal to the fourth threshold value, and determining that the first audio signal is non-pure voice audio when the first audio frame number is greater than the fourth threshold value;
the audio deviation determining module is specifically configured to:
dividing the first audio signal into a plurality of first audio segments with the same audio length;
performing endpoint detection on the first audio segments by a double-threshold method, determining a pure audio part and an impure audio part of each first audio segment, and further determining a first audio length of the impure audio part;
comparing the first audio length with a preset fifth threshold value, determining that the corresponding first audio segment is a music audio segment when the first audio length is smaller than or equal to the fifth threshold value, and otherwise determining that the corresponding first audio segment is a voice audio segment;
and determining the audio bias of the first audio signal according to the numbers of the music audio segments and the voice audio segments.
5. An audio classification device, comprising:
at least one processor;
at least one memory for storing at least one program;
when said at least one program is executed by said at least one processor, said at least one processor is caused to implement an audio classification method as claimed in any one of claims 1 to 3.
6. A computer readable storage medium, in which a processor executable program is stored, characterized in that the processor executable program, when being executed by a processor, is for performing an audio classification method according to any of claims 1 to 3.
CN202111560886.1A 2021-12-20 2021-12-20 Audio classification method, system, device and storage medium Active CN114283841B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111560886.1A CN114283841B (en) 2021-12-20 2021-12-20 Audio classification method, system, device and storage medium

Publications (2)

Publication Number Publication Date
CN114283841A CN114283841A (en) 2022-04-05
CN114283841B (en) 2023-06-06

Family

ID=80873106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111560886.1A Active CN114283841B (en) 2021-12-20 2021-12-20 Audio classification method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN114283841B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018197875A (en) * 2013-08-06 2018-12-13 Huawei Technologies Co., Ltd. Audio signal classification method and device

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6901362B1 (en) * 2000-04-19 2005-05-31 Microsoft Corporation Audio segmentation and classification
US6785645B2 (en) * 2001-11-29 2004-08-31 Microsoft Corporation Real-time speech and music classifier
CN102446504B (en) * 2010-10-08 2013-10-09 华为技术有限公司 Voice/Music identifying method and equipment
CN108320756B (en) * 2018-02-07 2021-12-03 广州酷狗计算机科技有限公司 Method and device for detecting whether audio is pure music audio
US11037583B2 (en) * 2018-08-29 2021-06-15 International Business Machines Corporation Detection of music segment in audio signal
CN110599987A (en) * 2019-08-25 2019-12-20 南京理工大学 Piano note recognition algorithm based on convolutional neural network
CN110648656A (en) * 2019-08-28 2020-01-03 北京达佳互联信息技术有限公司 Voice endpoint detection method and device, electronic equipment and storage medium
CN111951834A (en) * 2020-08-18 2020-11-17 珠海声原智能科技有限公司 Method and device for detecting voice existence based on ultralow computational power of zero crossing rate calculation
CN113192531B (en) * 2021-05-28 2024-04-16 腾讯音乐娱乐科技(深圳)有限公司 Method, terminal and storage medium for detecting whether audio is pure audio

Also Published As

Publication number Publication date
CN114283841A (en) 2022-04-05

Similar Documents

Publication Publication Date Title
KR101269296B1 (en) Neural network classifier for separating audio sources from a monophonic audio signal
JP5543640B2 (en) Perceptual tempo estimation with scalable complexity
CN109034046B (en) Method for automatically identifying foreign matters in electric energy meter based on acoustic detection
EP3190702A2 (en) Volume leveler controller and controlling method
EP1309964A2 (en) Fast frequency-domain pitch estimation
US8489404B2 (en) Method for detecting audio signal transient and time-scale modification based on same
CN108682432B (en) Speech emotion recognition device
Qazi et al. A hybrid technique for speech segregation and classification using a sophisticated deep neural network
CN112270933B (en) Audio identification method and device
CN108288465A (en) Intelligent sound cuts the method for axis, information data processing terminal, computer program
Radmard et al. A new method of voiced/unvoiced classification based on clustering
CN109903775B (en) Audio popping detection method and device
Mehrotra et al. Improved Frame‐Wise Segmentation of Audio Signals for Smart Hearing Aid Using Particle Swarm Optimization‐Based Clustering
EP2328143B1 (en) Human voice distinguishing method and device
CN114283841B (en) Audio classification method, system, device and storage medium
Chandrashekar et al. Region based prediction and score combination for automatic intelligibility assessment of dysarthric speech
CN112489692A (en) Voice endpoint detection method and device
CN114302301B (en) Frequency response correction method and related product
Ijitona et al. Improved silence-unvoiced-voiced (SUV) segmentation for dysarthric speech signals using linear prediction error variance
Jingzhou et al. Audio segmentation and classification approach based on adaptive CNN in broadcast domain
Gump Unsupervised methods for evaluating speech representations
Ogura et al. X-vector based voice activity detection for multi-genre broadcast speech-to-text
Danayi et al. A novel algorithm based on time-frequency analysis for extracting melody from human whistling
Cai A modified multi-feature voiced/unvoiced speech classification method
CN115359809A (en) Self-adaptive second-order segmentation method and system for long-term emotion voice

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant