CN114283841A - Audio classification method, system, device and storage medium - Google Patents

Audio classification method, system, device and storage medium

Info

Publication number
CN114283841A
CN114283841A
Authority
CN
China
Prior art keywords: audio, audio signal, short, determining, crossing rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111560886.1A
Other languages
Chinese (zh)
Other versions
CN114283841B (en)
Inventor
王伟 (Wang Wei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iMusic Culture and Technology Co Ltd
Original Assignee
iMusic Culture and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iMusic Culture and Technology Co Ltd
Priority to CN202111560886.1A
Publication of CN114283841A
Application granted
Publication of CN114283841B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Signal Processing For Digital Recording And Reproducing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an audio classification method, system, device and storage medium. The method comprises the following steps: acquiring a first audio signal to be classified, and performing framing processing on the first audio signal to obtain a second audio signal; performing endpoint detection on the second audio signal, and removing low-energy audio segments located at the head and the tail of the second audio signal to obtain a third audio signal; determining the short-time average zero-crossing rate of each audio frame in the third audio signal, and determining the number of first audio frames whose short-time average zero-crossing rate is greater than or equal to a preset first threshold as well as the fluctuation of the short-time average zero-crossing rate; and classifying the first audio signal according to the number of first audio frames and the fluctuation. By framing the audio signal, performing endpoint detection and determining the short-time average zero-crossing rate of each audio frame, the audio signal can be classified so that pure music audio, pure voice audio and mixed audio are identified, improving the accuracy of audio classification. The method can be widely applied in the technical field of audio classification.

Description

Audio classification method, system, device and storage medium
Technical Field
The present invention relates to the field of audio classification technologies, and in particular, to an audio classification method, system, device, and storage medium.
Background
All sound audible to the human ear is called audio. According to its form of expression, audio can be classified into speech, music, silence, ambient sound and noise, of which speech and music are the two most important audio data types.
In a large-scale media database there are both pure speech audio and pure music audio, such as broadcast recordings and piano tunes, as well as mixed speech-and-music audio, such as emotional recitations and songs with background music. In the prior art, music and speech can be classified accurately when many feature parameters are extracted, but the classification of mixed audio containing both speech and music remains poor.
Disclosure of Invention
The present invention aims to solve, at least to some extent, one of the technical problems existing in the prior art.
Therefore, an object of the embodiments of the present invention is to provide an audio classification method that performs framing and endpoint detection on an audio signal and determines the short-time average zero-crossing rate of each audio frame, from which it determines the number of first audio frames whose short-time average zero-crossing rate is greater than or equal to a preset first threshold and the fluctuation of the short-time average zero-crossing rate, so that the audio signal can be classified, pure music audio, pure speech audio and mixed audio can be identified, and the accuracy of audio classification is improved.
It is another object of embodiments of the present invention to provide an audio classification system.
In order to achieve the technical purpose, the technical scheme adopted by the embodiment of the invention comprises the following steps:
in a first aspect, an embodiment of the present invention provides an audio classification method, including the following steps:
acquiring a first audio signal to be classified, and performing framing processing on the first audio signal to obtain a second audio signal;
carrying out endpoint detection on the second audio signal, and removing low-energy audio segments positioned at the head and the tail of the second audio signal to obtain a third audio signal;
determining the short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining the number of first audio frames with the short-time average zero-crossing rate being greater than or equal to a preset first threshold value and the fluctuation condition of the short-time average zero-crossing rate;
and classifying the first audio signal according to the number of the first audio frames and the fluctuation condition.
Further, in an embodiment of the present invention, the step of performing framing processing on the first audio signal to obtain a second audio signal specifically includes:
performing framing processing on the first audio signal to obtain a plurality of second audio frames, wherein two adjacent second audio frames have overlapping parts;
a second audio signal is determined from a plurality of the second audio frames.
Further, in an embodiment of the present invention, the step of performing endpoint detection on the second audio signal and removing the low-energy audio segments located at the head and the tail of the second audio signal to obtain a third audio signal specifically includes:
determining first short-time frame energies of a number of the second audio frames located at the head of the second audio signal and second short-time frame energies of a number of the second audio frames located at the tail of the second audio signal;
comparing the first short-time frame energy and the second short-time frame energy with a preset second threshold;
and when the energy of the first short-time frame is smaller than the second threshold, removing the second audio frame corresponding to the energy of the first short-time frame, and when the energy of the second short-time frame is smaller than the second threshold, removing the second audio frame corresponding to the energy of the second short-time frame to obtain a third audio signal.
Further, in an embodiment of the present invention, the step of determining the short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining the number of first audio frames with the short-time average zero-crossing rate being greater than or equal to a preset first threshold and the fluctuation condition of the short-time average zero-crossing rate specifically includes:
determining a short-time average zero crossing rate of each audio frame in the third audio signal;
comparing the short-time average zero-crossing rate with the first threshold value, and determining the number of first audio frames of which the short-time average zero-crossing rate is greater than or equal to the first threshold value;
and performing curve fitting according to the short-time average zero-crossing rate to obtain a short-time average zero-crossing rate fluctuation curve, and further determining the change rate of the short-time average zero-crossing rate fluctuation curve at each moment.
Further, in an embodiment of the present invention, the step of classifying the first audio signal according to the number of first audio frames and the fluctuation condition specifically includes:
comparing the change rate with a preset third threshold, determining that the first audio signal is a pure music audio when the change rate at each moment is less than or equal to the third threshold, otherwise, determining that the first audio signal is a non-pure music audio;
comparing the number of the first audio frames with a preset fourth threshold, when the number of the first audio frames is less than or equal to the fourth threshold, determining that the first audio signal is a pure voice audio, otherwise, determining that the first audio signal is a non-pure voice audio;
and when the first audio signal is the non-pure music audio and the non-pure voice audio at the same time, determining that the first audio signal is the mixed audio.
Further, in an embodiment of the present invention, when it is determined that the first audio signal is mixed audio, the audio classification method further includes the steps of:
segmenting the first audio signal to obtain a plurality of first audio segments, determining an impure music audio part of the first audio segments, and further determining the audio deviation of the first audio signal according to the audio length of the impure music audio part.
Further, in an embodiment of the present invention, the step of segmenting the first audio signal to obtain a plurality of first audio segments, determining an impure music audio part of the first audio segments, and determining an audio bias of the first audio signal according to an audio length of the impure music audio part specifically includes:
dividing the first audio signal into a plurality of first audio segments with the same audio length;
carrying out end point detection on the first audio segments by a double-threshold method, determining a pure music audio part and an impure music audio part of each first audio segment, and further determining a first audio length of the impure music audio part;
comparing the first audio length with a preset fifth threshold, when the first audio length is less than or equal to the fifth threshold, determining that the corresponding first audio segment is a music audio segment, otherwise, determining that the corresponding first audio segment is a voice audio segment;
determining an audio bias of the first audio signal based on the number of music audio segments and the number of speech audio segments.
In a second aspect, an embodiment of the present invention provides an audio classification system, including:
the framing processing module is used for acquiring a first audio signal to be classified and performing framing processing on the first audio signal to obtain a second audio signal;
the endpoint detection module is used for performing endpoint detection on the second audio signal and removing low-energy audio segments located at the head and the tail of the second audio signal to obtain a third audio signal;
the short-time average zero-crossing rate determining module is used for determining the short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining the number of first audio frames with the short-time average zero-crossing rate being greater than or equal to a preset first threshold and the fluctuation condition of the short-time average zero-crossing rate;
and the classification module is used for classifying the first audio signal according to the number of the first audio frames and the fluctuation condition.
In a third aspect, an embodiment of the present invention provides an audio classification apparatus, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement a method of audio classification as described above.
In a fourth aspect, the present invention also provides a computer-readable storage medium, in which a program executable by a processor is stored, and the program executable by the processor is used for executing the audio classification method.
Advantages and benefits of the present invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention:
the method comprises the steps of obtaining a first audio signal to be classified, performing framing processing on the first audio signal to obtain a second audio signal, performing endpoint detection on the second audio signal, removing low-energy audio segments located at the head and the tail of the second audio signal to obtain a third audio signal, determining the short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining the number of first audio frames with the short-time average zero-crossing rate being greater than or equal to a preset first threshold value and the fluctuation condition of the short-time average zero-crossing rate, so that the first audio signal can be classified according to the number of the first audio frames and the fluctuation condition. According to the embodiment of the invention, the audio signals are subjected to framing processing and end point detection, the short-time average zero-crossing rate of the audio frames is determined, and then the number of the first audio frames with the short-time average zero-crossing rate being more than or equal to the preset first threshold value and the fluctuation condition of the short-time average zero-crossing rate are determined, so that the audio signals can be classified, pure music audio, pure voice audio and mixed audio are identified, and the accuracy of audio classification is improved.
Drawings
In order to more clearly illustrate the technical solution in the embodiment of the present invention, the following description is made on the drawings required to be used in the embodiment of the present invention, and it should be understood that the drawings in the following description are only for convenience and clarity of describing some embodiments in the technical solution of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart illustrating steps of an audio classification method according to an embodiment of the present invention;
fig. 2 is a block diagram of an audio classification system according to an embodiment of the present invention;
fig. 3 is a block diagram of an audio classification apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
In the description of the present invention, "a plurality" means two or more. Where "first" and "second" are used to distinguish technical features, they are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or the precedence of the indicated technical features. Furthermore, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art.
Referring to fig. 1, an embodiment of the present invention provides an audio classification method, which specifically includes the following steps:
s101, obtaining a first audio signal to be classified, and performing framing processing on the first audio signal to obtain a second audio signal.
Specifically, the characteristics of an audio signal as a whole, and the parameters that characterize its essential features, vary with time, so the signal is a non-stationary process and cannot be analyzed with digital signal processing techniques designed for stationary signals. However, different speech sounds are responses produced by a vocal tract shaped by the movements of the oral muscles, and these movements are very slow relative to the speech frequencies. Although a speech signal is therefore time-varying, its characteristics remain essentially unchanged within a short time range (generally considered to be 10-30 ms), i.e. they are relatively stable, so the signal can be regarded as a quasi-steady-state process: the speech signal has short-time stationarity. The analysis and processing of any speech signal must therefore be based on a "short time", i.e. a "short-time analysis" is performed in which the speech signal is divided into segments, each called a "frame", whose characteristic parameters are analyzed; the frame length is generally 10-30 ms. For the whole speech signal, the analyzed time sequence of characteristic parameters is thus composed of the characteristic parameters of each frame.
As a further optional implementation manner, the step of performing framing processing on the first audio signal to obtain a second audio signal specifically includes:
a1, framing the first audio signal to obtain a plurality of second audio frames, wherein two adjacent second audio frames have overlapping parts;
a2, determining a second audio signal according to the plurality of second audio frames.
In particular, the framing in the embodiment of the present invention adopts an overlapping segmentation method, i.e. there is an overlapping portion between adjacent audio frames, so that consecutive frames transition smoothly and continuity is maintained. The overlapping part of the previous frame and the next frame is called the frame shift, and in the embodiment of the invention the ratio of the frame shift to the frame length lies in the range (0, 1/2).
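As a rough illustration of step S101 (and not code taken from the patent), the following Python sketch frames a mono signal into overlapping frames; the 25 ms frame length and the 0.4 overlap ratio are illustrative assumptions consistent with the 10-30 ms frame length and the (0, 1/2) ratio mentioned above.

```python
import numpy as np

def frame_signal(x, sample_rate, frame_ms=25.0, overlap_ratio=0.4):
    """Split a 1-D signal into overlapping frames of shape (num_frames, frame_len).

    Following the description above, adjacent frames share an overlapping part
    whose ratio to the frame length lies in (0, 1/2); overlap_ratio=0.4 is an
    illustrative choice, not a value disclosed in the patent.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    overlap = int(frame_len * overlap_ratio)
    hop = frame_len - overlap                      # step between frame starts
    num_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(num_frames)])
```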
And S102, carrying out endpoint detection on the second audio signal, and removing low-energy audio segments positioned at the head and the tail of the second audio signal to obtain a third audio signal.
In particular, endpoint detection, also called voice activity detection (VAD), aims to distinguish audio regions from non-audio regions. In plain terms, endpoint detection accurately locates the start point and end point of the useful audio in a noisy recording, removes the silent and noisy parts, and keeps the truly effective piece of audio. Step S102 specifically includes the following steps:
s1021, determining first short-time frame energy of a plurality of second audio frames positioned at the head of the second audio signal and second short-time frame energy of a plurality of second audio frames positioned at the tail of the second audio signal;
s1022, comparing the first short-time frame energy and the second short-time frame energy with a preset second threshold;
and S1023, when the energy of the first short-time frame is smaller than a second threshold value, removing the second audio frame corresponding to the energy of the first short-time frame, and when the energy of the second short-time frame is smaller than the second threshold value, removing the second audio frame corresponding to the energy of the second short-time frame to obtain a third audio signal.
Specifically, the embodiment of the present invention detects the second audio signal using an endpoint detection method based on short-time energy and removes the low-energy audio segments at the head and tail of the second audio signal to obtain the third audio signal.
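A minimal sketch of this short-time-energy trimming, assuming the energy threshold (the "second threshold") has already been chosen; the threshold value itself is not given in the patent text.

```python
import numpy as np

def trim_by_short_time_energy(frames, energy_threshold):
    """Drop low-energy frames at the head and tail of a framed signal.

    `frames` is the (num_frames, frame_len) array produced by frame_signal();
    `energy_threshold` plays the role of the second threshold and must be
    tuned on sample data.
    """
    energy = np.sum(frames.astype(np.float64) ** 2, axis=1)   # short-time energy per frame
    keep = energy >= energy_threshold
    if not keep.any():                                        # everything is low energy
        return frames[:0]
    first = int(np.argmax(keep))                              # first frame above threshold
    last = len(keep) - int(np.argmax(keep[::-1]))             # one past the last such frame
    return frames[first:last]
```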
S103, determining the short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining the number of the first audio frames with the short-time average zero-crossing rate being greater than or equal to a preset first threshold value and the fluctuation condition of the short-time average zero-crossing rate.
Specifically, speech consists of alternating unvoiced and voiced sounds, and the short-time average zero-crossing rate of voiced sounds is smaller than that of unvoiced sounds, which causes the short-time average zero-crossing rate of a speech audio signal to fluctuate strongly. The embodiment of the present invention therefore determines the short-time average zero-crossing rate of each audio frame in the third audio signal, from which it can determine the number of frames whose short-time average zero-crossing rate is greater than or equal to a preset threshold and the fluctuation of the short-time average zero-crossing rate, facilitating the subsequent classification of audio signals. Step S103 specifically includes the following steps:
s1031, determining the short-time average zero crossing rate of each audio frame in the third audio signal;
s1032, comparing the short-time average zero-crossing rate with a first threshold value, and determining the number of first audio frames of which the short-time average zero-crossing rate is greater than or equal to the first threshold value;
s1033, performing curve fitting according to the short-time average zero-crossing rate to obtain a short-time average zero-crossing rate fluctuation curve, and further determining the change rate of the short-time average zero-crossing rate fluctuation curve at each moment.
Specifically, after the short-time average zero-crossing rate of each audio frame is determined, curve fitting is performed according to the time sequence of the audio frames to obtain a short-time average zero-crossing rate fluctuation curve that changes over time, from which the change rate of the curve at each moment can be determined; this change rate reflects the fluctuation of the short-time average zero-crossing rate.
Each audio signal is processed segment by segment, with segments of equal length intercepted using the same parameters. A first threshold is set to distinguish speech signals from music signals, and the number of audio frames in each segment whose short-time average zero-crossing rate is greater than or equal to the first threshold, i.e. the number of first audio frames, is counted.
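The sketch below computes the short-time average zero-crossing rate, counts the first audio frames, and models the fluctuation by fitting a polynomial to the zero-crossing-rate (ZCR) sequence and differentiating it. The textbook ZCR formula and the polynomial fit (the patent only says "curve fitting") are assumptions of this illustration.

```python
import numpy as np

def short_time_zcr(frames):
    """Short-time average zero-crossing rate of each frame,
    Z = 0.5 * mean(|sgn(x[m]) - sgn(x[m-1])|) (textbook definition, assumed here)."""
    signs = np.sign(frames)
    signs[signs == 0] = 1                      # treat exact zeros as positive
    return 0.5 * np.mean(np.abs(np.diff(signs, axis=1)), axis=1)

def zcr_statistics(frames, first_threshold, fit_degree=6):
    """Return (number of first audio frames, rate of change of the fitted ZCR curve)."""
    zcr = short_time_zcr(frames)
    first_frame_count = int(np.sum(zcr >= first_threshold))
    t = np.linspace(0.0, 1.0, len(zcr))        # normalised time axis for a stable fit
    degree = min(fit_degree, max(1, len(zcr) - 1))
    coeffs = np.polyfit(t, zcr, degree)        # fitted fluctuation curve
    rate_of_change = np.polyval(np.polyder(coeffs), t)
    return first_frame_count, rate_of_change
```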
And S104, classifying the first audio signal according to the number of the first audio frames and the fluctuation condition.
Further as an optional implementation manner, step S104 specifically includes the following steps:
s1041, comparing the change rate with a preset third threshold, when the change rate at each moment is less than or equal to the third threshold, determining that the first audio signal is a pure music audio, otherwise, determining that the first audio signal is a non-pure music audio;
s1042, comparing the number of the first audio frames with a preset fourth threshold, when the number of the first audio frames is less than or equal to the fourth threshold, determining that the first audio signal is a pure voice audio, otherwise, determining that the first audio signal is a non-pure voice audio;
and S1043, when the first audio signal is the non-pure music audio and the non-pure voice audio at the same time, determining that the first audio signal is the mixed audio.
Specifically, pure speech audio and mixed audio both contain speech signals, so their short-time average zero-crossing rates fluctuate strongly. By comparing the change rate obtained in the previous step with the preset third threshold, they can be determined to be non-pure music audio, whereas the fluctuation of the short-time average zero-crossing rate of pure music audio is small.
For pure music audio and mixed audio, both of which contain music signals, the short-time average zero-crossing rate of most audio frames is greater than the first threshold, that is, the number of audio frames whose short-time average zero-crossing rate is greater than or equal to the first threshold exceeds the fourth threshold. Non-pure speech audio can therefore be determined by comparing the number of first audio frames obtained in the previous steps with the preset fourth threshold, while the number of first audio frames of pure speech audio is less than or equal to the fourth threshold.
By classifying the first audio signal twice, it can be determined that the first audio signal is a mixed audio when the first audio signal is a non-pure music audio and a non-pure speech audio at the same time.
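A sketch of the decision logic of step S104, reusing the statistics from the previous sketch. Comparing the magnitude of the change rate and defaulting to pure music when both tests pass are assumptions; the patent states only the two threshold tests and that failing both marks the signal as mixed audio.

```python
import numpy as np

def classify_audio(first_frame_count, rate_of_change, third_threshold, fourth_threshold):
    """Label the signal as pure music, pure speech, or mixed audio."""
    pure_music = bool(np.all(np.abs(rate_of_change) <= third_threshold))
    pure_speech = first_frame_count <= fourth_threshold
    if pure_music and not pure_speech:
        return "pure music audio"
    if pure_speech and not pure_music:
        return "pure speech audio"
    if not pure_music and not pure_speech:
        return "mixed audio"
    return "pure music audio"   # both tests passed; this tie-break is an assumption
```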
As a further optional implementation manner, when the first audio signal is determined to be mixed audio, the audio classification method further includes the following steps:
the method comprises the steps of segmenting a first audio signal to obtain a plurality of first audio segments, determining a non-pure music audio part of the first audio segments, and determining the audio deviation of the first audio signal according to the audio length of the non-pure music audio part.
As a further optional implementation, the step of segmenting the first audio signal to obtain a plurality of first audio segments, determining an impure music audio portion of the first audio segment, and determining an audio bias of the first audio signal according to an audio length of the impure music audio portion specifically includes:
b1, dividing the first audio signal into a plurality of first audio segments with the same audio length;
b2, performing endpoint detection on the first audio segments by a double-threshold method, determining a pure music audio part and an impure music audio part of each first audio segment, and further determining a first audio length of the impure music audio part;
b3, comparing the first audio length with a preset fifth threshold, when the first audio length is less than or equal to the fifth threshold, determining that the corresponding first audio segment is a music audio segment, otherwise, determining that the corresponding first audio segment is a voice audio segment;
b4, determining the audio bias of the first audio signal based on the number of music audio pieces and speech audio pieces.
Specifically, when the first audio signal is mixed audio, the first audio signal is segmented into an odd number of first audio segments with the same audio length. Endpoint detection is performed on the first audio segments using a double-threshold method to determine the pure music audio part and the non-pure music audio part of each first audio segment; the pure music audio parts are treated as noise, and the intervals between the non-pure music audio parts are removed, so that the first audio length of the non-pure music audio part can be determined.
A fifth threshold is preset for judging whether a first audio segment is a music audio segment biased toward music or a speech audio segment biased toward speech. The first audio length of the non-pure music audio part of each first audio segment is compared with the fifth threshold: a first audio segment whose first audio length is less than or equal to the fifth threshold (strictly less than, when the segment count is odd) is a music audio segment; otherwise it is a speech audio segment.
When the number of music audio segments in the first audio signal is greater than the number of speech audio segments, the first audio signal is determined to be music-biased mixed audio; otherwise, it is determined to be speech-biased mixed audio.
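For the bias decision on mixed audio, the following sketch cuts the framed signal into an odd number of equal segments and approximates the non-pure music part of each segment by its frames whose ZCR reaches the first threshold; the double-threshold endpoint detection is only named in the text, so this substitution, the frame-based length measure, and all threshold values are assumptions.

```python
import numpy as np

def bias_of_mixed_audio(frames, num_segments, first_threshold, fifth_threshold):
    """Decide whether mixed audio leans toward music or speech.

    Reuses short_time_zcr() from the earlier sketch. fifth_threshold is
    expressed in frames here, which is an illustrative simplification.
    """
    num_segments |= 1                                  # force an odd segment count
    segment_indices = np.array_split(np.arange(len(frames)), num_segments)
    music_segments = speech_segments = 0
    for idx in segment_indices:
        zcr = short_time_zcr(frames[idx])
        non_music_len = int(np.sum(zcr >= first_threshold))   # length of the non-pure music part
        if non_music_len <= fifth_threshold:
            music_segments += 1
        else:
            speech_segments += 1
    return "music-biased" if music_segments > speech_segments else "speech-biased"
```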
It should be appreciated that the embodiment of the present invention may take a plurality of audio signals of known classes as samples, perform a plurality of tests to determine each threshold, and adjust each threshold according to the classification accuracy until a preset test accuracy is reached.
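One way to realise this tuning is sketched below as a small grid search over labelled clips; the candidate grids, the energy threshold, and the target accuracy are all illustrative, and the helper functions are the sketches defined above rather than the patent's own procedure.

```python
from itertools import product

def tune_thresholds(labelled_clips, sample_rate, target_accuracy=0.9):
    """Grid-search the classification thresholds on clips of known class.

    `labelled_clips` is a list of (signal, label) pairs with labels in
    {"pure music audio", "pure speech audio", "mixed audio"}.
    """
    first_grid = [0.1, 0.2, 0.3]                 # candidate first thresholds (ZCR)
    third_grid = [0.005, 0.01, 0.02]             # candidate third thresholds (rate of change)
    fourth_ratio_grid = [0.3, 0.5, 0.7]          # fourth threshold as a fraction of frames
    best = None
    for t1, t3, r4 in product(first_grid, third_grid, fourth_ratio_grid):
        correct = 0
        for signal, label in labelled_clips:
            frames = frame_signal(signal, sample_rate)
            frames = trim_by_short_time_energy(frames, 1e-3)
            count, rate = zcr_statistics(frames, t1)
            predicted = classify_audio(count, rate, t3, int(r4 * len(frames)))
            correct += int(predicted == label)
        accuracy = correct / max(1, len(labelled_clips))
        if best is None or accuracy > best[0]:
            best = (accuracy, {"first": t1, "third": t3, "fourth_ratio": r4})
        if accuracy >= target_accuracy:
            break
    return best
```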
The method steps of the embodiments of the present invention are described above. It can be understood that, in the embodiment of the present invention, the audio signal is subjected to framing processing, endpoint detection, and the short-time average zero crossing rate of the audio frame is determined, so as to determine the number of the first audio frames with the short-time average zero crossing rate being greater than or equal to the preset first threshold value and the fluctuation condition of the short-time average zero crossing rate, thereby classifying the audio signal, identifying pure music audio, pure speech audio, and mixed audio, and improving the accuracy of audio classification.
In addition, the embodiment of the present invention can not only identify and classify pure speech or pure music, but also treat mixed speech-and-music audio separately and classify its bias. For example, suppose four audio segments are selected: broadcast audio, a performance by a single musical instrument, an emotional recitation with background music, and a song. Each segment is processed according to the embodiment of the present invention: the single-instrument performance is first identified from the fluctuation of the short-time average zero-crossing rate and classified as pure music audio; the broadcast audio is classified as pure speech audio by comparing the short-time average zero-crossing rate with the threshold; and the emotional recitation with background music and the song are treated separately and classified according to their bias.
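An end-to-end usage sketch of the functions defined in the earlier sketches, run on a synthetic signal; the sample rate, the synthetic tone standing in for a decoded file, and every threshold value are assumptions chosen only to make the example run.

```python
import numpy as np

sr = 16000                                            # assumed sample rate
t = np.arange(3 * sr) / sr
signal = 0.6 * np.sin(2 * np.pi * 440 * t) + 0.05 * np.random.randn(len(t))  # stand-in for a decoded clip

frames = frame_signal(signal, sr)                     # S101: framing
frames = trim_by_short_time_energy(frames, 1e-3)      # S102: endpoint trimming (illustrative threshold)
count, rate = zcr_statistics(frames, first_threshold=0.3)    # S103: ZCR statistics
label = classify_audio(count, rate, third_threshold=0.01,
                       fourth_threshold=len(frames) // 2)    # S104: classification
if label == "mixed audio":
    label += ", " + bias_of_mixed_audio(frames, 5, 0.3, fifth_threshold=len(frames) // 10)
print(label)
```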
Referring to fig. 2, an embodiment of the present invention provides an audio classification system, including:
the frame processing module is used for acquiring a first audio signal to be classified and performing frame processing on the first audio signal to obtain a second audio signal;
the endpoint detection module is used for carrying out endpoint detection on the second audio signal and removing low-energy audio segments positioned at the head and the tail of the second audio signal to obtain a third audio signal;
the short-time average zero-crossing rate determining module is used for determining the short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining the number of the first audio frames with the short-time average zero-crossing rate being greater than or equal to a preset first threshold value and the fluctuation condition of the short-time average zero-crossing rate;
and the classification module is used for classifying the first audio signal according to the number of the first audio frames and the fluctuation condition.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
Referring to fig. 3, an embodiment of the present invention provides an audio classification apparatus, including:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one program causes the at least one processor to implement the audio classification method.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
Embodiments of the present invention also provide a computer-readable storage medium, in which a program executable by a processor is stored, and the program executable by the processor is used for executing the audio classification method.
The computer-readable storage medium of the embodiment of the invention can execute the audio classification method provided by the embodiment of the method of the invention, can execute any combination of the implementation steps of the embodiment of the method, and has corresponding functions and beneficial effects of the method.
The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor to cause the computer device to perform the method illustrated in fig. 1.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the above-described functions and/or features may be integrated in a single physical device and/or software module, or one or more of the functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The above functions, if implemented in the form of software functional units and sold or used as a separate product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the above method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Further, the computer readable medium could even be paper or another suitable medium upon which the above described program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the foregoing description of the specification, reference to the description of "one embodiment/example," "another embodiment/example," or "certain embodiments/examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method of audio classification, comprising the steps of:
acquiring a first audio signal to be classified, and performing framing processing on the first audio signal to obtain a second audio signal;
carrying out endpoint detection on the second audio signal, and removing low-energy audio segments positioned at the head and the tail of the second audio signal to obtain a third audio signal;
determining the short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining the number of first audio frames with the short-time average zero-crossing rate being greater than or equal to a preset first threshold value and the fluctuation condition of the short-time average zero-crossing rate;
and classifying the first audio signal according to the number of the first audio frames and the fluctuation condition.
2. The audio classification method according to claim 1, wherein the step of performing framing processing on the first audio signal to obtain a second audio signal specifically comprises:
performing framing processing on the first audio signal to obtain a plurality of second audio frames, wherein two adjacent second audio frames have overlapping parts;
a second audio signal is determined from a plurality of the second audio frames.
3. The audio classification method according to claim 2, wherein the step of performing endpoint detection on the second audio signal and removing the low-energy audio segments located at the head and tail of the second audio signal to obtain a third audio signal specifically comprises:
determining first short-time frame energies of a number of the second audio frames located at the head of the second audio signal and second short-time frame energies of a number of the second audio frames located at the tail of the second audio signal;
comparing the first short-time frame energy and the second short-time frame energy with a preset second threshold;
and when the energy of the first short-time frame is smaller than the second threshold, removing the second audio frame corresponding to the energy of the first short-time frame, and when the energy of the second short-time frame is smaller than the second threshold, removing the second audio frame corresponding to the energy of the second short-time frame to obtain a third audio signal.
4. The audio classification method according to claim 1, wherein the step of determining the short-time average zero-crossing rate of each audio frame in the third audio signal, and further determining the number of first audio frames with the short-time average zero-crossing rate greater than or equal to a preset first threshold and the fluctuation of the short-time average zero-crossing rate specifically comprises:
determining a short-time average zero crossing rate of each audio frame in the third audio signal;
comparing the short-time average zero-crossing rate with the first threshold value, and determining the number of first audio frames of which the short-time average zero-crossing rate is greater than or equal to the first threshold value;
and performing curve fitting according to the short-time average zero-crossing rate to obtain a short-time average zero-crossing rate fluctuation curve, and further determining the change rate of the short-time average zero-crossing rate fluctuation curve at each moment.
5. The audio classification method according to claim 4, wherein the step of classifying the first audio signal according to the number of first audio frames and the fluctuation condition specifically comprises:
comparing the change rate with a preset third threshold, determining that the first audio signal is a pure music audio when the change rate at each moment is less than or equal to the third threshold, otherwise, determining that the first audio signal is a non-pure music audio;
comparing the number of the first audio frames with a preset fourth threshold, when the number of the first audio frames is less than or equal to the fourth threshold, determining that the first audio signal is a pure voice audio, otherwise, determining that the first audio signal is a non-pure voice audio;
and when the first audio signal is the non-pure music audio and the non-pure voice audio at the same time, determining that the first audio signal is the mixed audio.
6. The audio classification method according to claim 5, wherein when the first audio signal is determined to be mixed audio, the audio classification method further comprises the steps of:
segmenting the first audio signal to obtain a plurality of first audio segments, determining an impure music audio part of the first audio segments, and further determining the audio deviation of the first audio signal according to the audio length of the impure music audio part.
7. The audio classification method according to claim 6, wherein the step of segmenting the first audio signal into a plurality of first audio segments, determining the non-pure musical audio portion of the first audio segments, and determining the audio bias of the first audio signal according to the audio length of the non-pure musical audio portion comprises:
dividing the first audio signal into a plurality of first audio segments with the same audio length;
carrying out end point detection on the first audio segments by a double-threshold method, determining a pure music audio part and an impure music audio part of each first audio segment, and further determining a first audio length of the impure music audio part;
comparing the first audio length with a preset fifth threshold, when the first audio length is less than or equal to the fifth threshold, determining that the corresponding first audio segment is a music audio segment, otherwise, determining that the corresponding first audio segment is a voice audio segment;
determining an audio bias of the first audio signal based on the number of music audio segments and the number of speech audio segments.
8. An audio classification system, comprising:
the frame processing module is used for acquiring a first audio signal to be classified and performing frame processing on the first audio signal to obtain a second audio signal;
the endpoint detection module is used for carrying out endpoint detection on the second audio signal and removing low-energy audio segments positioned at the head and the tail of the second audio signal to obtain a third audio signal;
a short-time average zero-crossing rate determining module, configured to determine a short-time average zero-crossing rate of each audio frame in the third audio signal, and further determine a number of first audio frames with the short-time average zero-crossing rate being greater than or equal to a preset first threshold and a fluctuation condition of the short-time average zero-crossing rate;
and the classification module is used for classifying the first audio signal according to the number of the first audio frames and the fluctuation condition.
9. An audio classification apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the audio classification method according to any one of claims 1 to 7.
10. A computer readable storage medium in which a processor executable program is stored, wherein the processor executable program, when executed by a processor, is for performing an audio classification method as claimed in any one of claims 1 to 7.
CN202111560886.1A | Priority date: 2021-12-20 | Filing date: 2021-12-20 | Audio classification method, system, device and storage medium | Active | Granted publication: CN114283841B (en)

Priority Applications (1)

Application Number: CN202111560886.1A (granted as CN114283841B) | Priority date: 2021-12-20 | Filing date: 2021-12-20 | Title: Audio classification method, system, device and storage medium

Applications Claiming Priority (1)

Application Number: CN202111560886.1A (granted as CN114283841B) | Priority date: 2021-12-20 | Filing date: 2021-12-20 | Title: Audio classification method, system, device and storage medium

Publications (2)

Publication Number | Publication Date
CN114283841A (en) | 2022-04-05
CN114283841B (en) | 2023-06-06

Family

ID=80873106

Family Applications (1)

Application Number: CN202111560886.1A | Status: Active | Granted publication: CN114283841B (en)

Country Status (1)

Country Link
CN (1) CN114283841B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030101050A1 (en) * 2001-11-29 2003-05-29 Microsoft Corporation Real-time speech and music classifier
US20040210436A1 (en) * 2000-04-19 2004-10-21 Microsoft Corporation Audio segmentation and classification
CN102446504A (en) * 2010-10-08 2012-05-09 华为技术有限公司 Voice/Music identifying method and equipment
CN104347067A (en) * 2013-08-06 2015-02-11 华为技术有限公司 Audio signal classification method and device
CN108320756A (en) * 2018-02-07 2018-07-24 广州酷狗计算机科技有限公司 It is a kind of detection audio whether be absolute music audio method and apparatus
CN110599987A (en) * 2019-08-25 2019-12-20 南京理工大学 Piano note recognition algorithm based on convolutional neural network
CN110648656A (en) * 2019-08-28 2020-01-03 北京达佳互联信息技术有限公司 Voice endpoint detection method and device, electronic equipment and storage medium
US20200075042A1 (en) * 2018-08-29 2020-03-05 International Business Machines Corporation Detection of music segment in audio signal
CN111951834A (en) * 2020-08-18 2020-11-17 珠海声原智能科技有限公司 Method and device for detecting voice existence based on ultralow computational power of zero crossing rate calculation
CN113192531A (en) * 2021-05-28 2021-07-30 腾讯音乐娱乐科技(深圳)有限公司 Method, terminal and storage medium for detecting whether audio is pure music audio

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040210436A1 (en) * 2000-04-19 2004-10-21 Microsoft Corporation Audio segmentation and classification
US20030101050A1 (en) * 2001-11-29 2003-05-29 Microsoft Corporation Real-time speech and music classifier
CN102446504A (en) * 2010-10-08 2012-05-09 华为技术有限公司 Voice/Music identifying method and equipment
CN104347067A (en) * 2013-08-06 2015-02-11 华为技术有限公司 Audio signal classification method and device
JP2018197875A (en) * 2013-08-06 2018-12-13 華為技術有限公司Huawei Technologies Co.,Ltd. Audio signal classification method and device
CN108320756A (en) * 2018-02-07 2018-07-24 广州酷狗计算机科技有限公司 It is a kind of detection audio whether be absolute music audio method and apparatus
US20200075042A1 (en) * 2018-08-29 2020-03-05 International Business Machines Corporation Detection of music segment in audio signal
CN110599987A (en) * 2019-08-25 2019-12-20 南京理工大学 Piano note recognition algorithm based on convolutional neural network
CN110648656A (en) * 2019-08-28 2020-01-03 北京达佳互联信息技术有限公司 Voice endpoint detection method and device, electronic equipment and storage medium
CN111951834A (en) * 2020-08-18 2020-11-17 珠海声原智能科技有限公司 Method and device for detecting voice existence based on ultralow computational power of zero crossing rate calculation
CN113192531A (en) * 2021-05-28 2021-07-30 腾讯音乐娱乐科技(深圳)有限公司 Method, terminal and storage medium for detecting whether audio is pure music audio

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
孙慧芳 等 (Sun Huifang et al.): "基于过零率及频谱的语音音乐分类算法" (Speech and music classification algorithm based on zero-crossing rate and spectrum), 《云南大学学报(自然科学版)》 (Journal of Yunnan University, Natural Sciences Edition), 8 July 2017 *

Also Published As

Publication number Publication date
CN114283841B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
Kos et al. Acoustic classification and segmentation using modified spectral roll-off and variance-based features
US20070083365A1 (en) Neural network classifier for separating audio sources from a monophonic audio signal
Qazi et al. A hybrid technique for speech segregation and classification using a sophisticated deep neural network
Huijbregts et al. Robust speech/non-speech classification in heterogeneous multimedia content
CN110324726B (en) Model generation method, video processing method, model generation device, video processing device, electronic equipment and storage medium
Petermann et al. The cocktail fork problem: Three-stem audio separation for real-world soundtracks
Radmard et al. A new method of voiced/unvoiced classification based on clustering
CN109903775B (en) Audio popping detection method and device
CN114283841B (en) Audio classification method, system, device and storage medium
Krishnamoorthy et al. Hierarchical audio content classification system using an optimal feature selection algorithm
Chandrashekar et al. Region based prediction and score combination for automatic intelligibility assessment of dysarthric speech
CN114302301B (en) Frequency response correction method and related product
Thompson Discrimination between singing and speech in real-world audio
Jingzhou et al. Audio segmentation and classification approach based on adaptive CNN in broadcast domain
Aurchana et al. Musical instruments sound classification using GMM
de León et al. Blind separation of overlapping partials in harmonic musical notes using amplitude and phase reconstruction
Pardo et al. Applying source separation to music
Ogura et al. X-vector based voice activity detection for multi-genre broadcast speech-to-text
Uzun et al. A preliminary examination technique for audio evidence to distinguish speech from non-speech using objective speech quality measures
Gump Unsupervised methods for evaluating speech representations
Dutta et al. A hierarchical approach for silence/speech/music classification
Li et al. An Automatic Segmentation Method of Popular Music Based on SVM and Self-similarity
Walczyński et al. Comparison of selected acoustic signal parameterization methods in the problem of machine recognition of classical music styles
Gil Moreno Speech/music audio classification for publicity insertion and DRM
Karunarathna Classification of Voice Content in Public Radio Broadcasting Context

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant