CN111613213B - Audio classification method, device, equipment and storage medium - Google Patents

Audio classification method, device, equipment and storage medium Download PDF

Info

Publication number
CN111613213B
Authority
CN
China
Prior art keywords
audio
intermediate data
data
determining
classification model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010358102.6A
Other languages
Chinese (zh)
Other versions
CN111613213A (en)
Inventor
吕俊领
卢传泽
邱威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huanju Shidai Information Technology Co Ltd
Original Assignee
Guangzhou Huanju Shidai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huanju Shidai Information Technology Co Ltd filed Critical Guangzhou Huanju Shidai Information Technology Co Ltd
Priority to CN202010358102.6A priority Critical patent/CN111613213B/en
Publication of CN111613213A publication Critical patent/CN111613213A/en
Application granted granted Critical
Publication of CN111613213B publication Critical patent/CN111613213B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65 Clustering; Classification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an audio classification method, apparatus, device, and storage medium, and belongs to the technical field of computers. The method comprises the following steps: acquiring audio data to be classified; acquiring audio data of a unit duration from a target audio stream in time sequence; each time audio data of a unit duration is obtained, determining the audio type corresponding to the audio data based on an audio classification model; when first audio data is detected to be of a human voice type and the audio data preceding it is of a non-human voice type, determining the first audio data to be human voice start-point audio data; when second audio data is detected to be of a non-human voice type and the audio data preceding it is of a human voice type, determining the second audio data to be human voice end-point audio data; and determining a human voice audio segment in the target audio stream based on the human voice start-point audio data and the human voice end-point audio data, and performing target processing on the human voice audio segment. The method and apparatus can improve the accuracy of audio classification.

Description

Audio classification method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for audio classification.
Background
With the development of network technology, sending voice messages or talking to each other directly has become one of the most common ways people communicate, which gives rise to a need for voice content detection. Since audio data must first be intercepted before voice detection can be performed, technicians have adopted the following approach:
the computer equipment intercepts audio data with unit time length in a target audio stream, acquires the frequency range of an audio frame in the audio data, detects whether the frequency range of the audio frame belongs to a human voice frequency range, if so, the audio frame is of a human voice type, and if not, the audio frame is of a non-human voice type, and then intercepts the audio frame of the human voice type in the audio data with unit time length, so as to obtain intercepted audio data.
In carrying out the present application, the inventors have found that the prior art has at least the following problems:
in the prior art, audio is classified by detecting the frequency range of each audio frame. In practice, however, many sounds have frequency ranges very close to or even identical to that of the human voice, so the accuracy of capturing human voice segments in an audio stream is poor.
Disclosure of Invention
The embodiments of the present application provide an audio classification method, apparatus, device, and storage medium, which can solve the problem of poor accuracy in intercepting human voice segments from an audio stream. The technical solution is as follows:
in one aspect, a method of audio classification is provided, the method comprising:
acquiring audio data of unit duration in time sequence in a target audio stream;
each time audio data of a unit duration is obtained, determining an audio type corresponding to the audio data based on the audio classification model, wherein the audio type comprises a human voice type and a non-human voice type, and the audio classification model comprises a full-connection layer and a long-short-time memory layer;
when it is detected that first audio data is of a human voice type and that the audio data preceding the first audio data is of a non-human voice type, determining that the first audio data is human voice start-point audio data;
when it is detected that second audio data is of a non-human voice type and that the audio data preceding the second audio data is of a human voice type, determining that the second audio data is human voice end-point audio data;
and determining a human voice audio segment in the target audio stream based on the human voice start-point audio data and the human voice end-point audio data, and performing target processing on the human voice audio segment.
Optionally, the determining, based on the audio classification model, the audio type corresponding to the audio data includes:
determining that the audio data corresponds to first intermediate data based on the audio data and an input layer of an audio classification model;
determining second intermediate data corresponding to the audio data based on the first intermediate data and a full connection layer of the audio classification model;
determining third intermediate data corresponding to the audio data based on the second intermediate data and a long-short-time memory layer of the audio classification model;
and determining the audio type corresponding to the audio data based on the third intermediate data and the output layer of the audio classification model.
Optionally, the determining, based on the first intermediate data and the full connection layer of the audio classification model, second intermediate data corresponding to the audio data includes:
determining fourth intermediate data corresponding to the audio data based on the first intermediate data and a convolution layer of the audio classification model;
and determining second intermediate data corresponding to the audio data based on the fourth intermediate data and the full connection layer of the audio classification model.
Optionally, the audio data includes sub-audio data corresponding to a plurality of audio frames, the first intermediate data includes first sub-intermediate data corresponding to a plurality of audio frames, the second intermediate data includes second sub-intermediate data corresponding to a plurality of audio frames, and the fourth intermediate data includes fourth sub-intermediate data corresponding to a plurality of audio frames.
Optionally, the determining, based on the audio data and the input layer of the audio classification model, that the audio data corresponds to the first intermediate data includes: respectively inputting each piece of sub-audio data into an input layer of an audio classification model to obtain first sub-intermediate data corresponding to a plurality of audio frames;
the determining, based on the first intermediate data and the convolution layer of the audio classification model, fourth intermediate data corresponding to the audio data includes: respectively inputting each first sub intermediate data into a convolution layer of an audio classification model to obtain fourth sub intermediate data corresponding to a plurality of audio frames;
the determining, based on the fourth intermediate data and the full connection layer of the audio classification model, second intermediate data corresponding to the audio data includes: respectively inputting each fourth sub intermediate data into a full connection layer of the audio classification model to obtain second sub intermediate data corresponding to a plurality of audio frames;
the determining, based on the second intermediate data and the long-short-time memory layer of the audio classification model, third intermediate data corresponding to the audio data includes: combining the plurality of second sub-intermediate data according to the time sequence of the corresponding audio frames, and inputting the second sub-intermediate data into a long-short-time memory layer of the audio classification model to obtain third intermediate data corresponding to the audio data;
The determining, based on the third intermediate data and the output layer of the audio classification model, the audio type corresponding to the audio data includes: splitting the third intermediate data into third sub-intermediate data corresponding to a plurality of audio frames; and respectively inputting third sub intermediate data corresponding to a plurality of audio frames into an output layer of the audio classification model, and determining the audio type corresponding to each audio frame.
Optionally, the audio types include a first audio type and a second audio type, the third sub-intermediate data corresponding to the plurality of audio frames is input into an output layer of the audio classification model respectively, and determining the audio type corresponding to each audio frame includes:
respectively inputting third sub intermediate data corresponding to a plurality of audio frames into an output layer of the audio classification model to obtain the probability that each audio frame is of a first audio type;
and determining the audio frame with the corresponding probability value larger than the preset threshold value as a first audio type, and determining the audio frame with the corresponding probability value smaller than the preset threshold value as a second audio type.
In another aspect, an apparatus for audio classification is provided, the apparatus comprising:
the acquisition module is used for acquiring the audio data of the unit time length in the target audio stream according to the time sequence;
The determining module is used for determining the audio type corresponding to the audio data based on the audio classification model, wherein the audio type comprises a human voice type and a non-human voice type, and the audio classification model comprises a full connection layer and a long-short-time memory layer;
the detection module is used for determining that first audio data is human voice start-point audio data when it is detected that the first audio data is of a human voice type and that the audio data preceding it is of a non-human voice type;
the detection module is further used for determining that second audio data is human voice end-point audio data when it is detected that the second audio data is of a non-human voice type and that the audio data preceding it is of a human voice type;
and the processing module is used for determining a human voice audio segment in the target audio stream based on the human voice start-point audio data and the human voice end-point audio data, and for performing target processing on the human voice audio segment.
Optionally, the determining module is configured to:
acquiring audio data to be classified;
determining that the audio data corresponds to first intermediate data based on the audio data and an input layer of an audio classification model;
Determining second intermediate data corresponding to the audio data based on the first intermediate data and a full connection layer of the audio classification model;
determining third intermediate data corresponding to the audio data based on the second intermediate data and a long-short-time memory layer of the audio classification model;
and determining the audio type corresponding to the audio data based on the third intermediate data and the output layer of the audio classification model.
Optionally, the determining module is configured to:
determining fourth intermediate data corresponding to the audio data based on the first intermediate data and a convolution layer of the audio classification model;
and determining second intermediate data corresponding to the audio data based on the fourth intermediate data and the full connection layer of the audio classification model.
Optionally, the audio data includes sub-audio data corresponding to a plurality of audio frames, the first intermediate data includes first sub-intermediate data corresponding to a plurality of audio frames, the second intermediate data includes second sub-intermediate data corresponding to a plurality of audio frames, and the fourth intermediate data includes fourth sub-intermediate data corresponding to a plurality of audio frames.
Optionally, the determining module is configured to: respectively inputting each piece of sub-audio data into an input layer of an audio classification model to obtain first sub-intermediate data corresponding to a plurality of audio frames;
The determining module is used for: respectively inputting each first sub intermediate data into a convolution layer of an audio classification model to obtain fourth sub intermediate data corresponding to a plurality of audio frames;
the determining module is used for: respectively inputting each fourth sub intermediate data into a full connection layer of the audio classification model to obtain second sub intermediate data corresponding to a plurality of audio frames;
the determining module is used for: combining the plurality of second sub-intermediate data according to the time sequence of the corresponding audio frames, and inputting the second sub-intermediate data into a long-short-time memory layer of the audio classification model to obtain third intermediate data corresponding to the audio data;
the determining module is used for: splitting the third intermediate data into third sub-intermediate data corresponding to a plurality of audio frames; and respectively inputting third sub intermediate data corresponding to a plurality of audio frames into an output layer of the audio classification model, and determining the audio type corresponding to each audio frame.
Optionally, the audio type includes a first audio type and a second audio type, and the determining module is configured to:
respectively inputting third sub intermediate data corresponding to a plurality of audio frames into an output layer of the audio classification model to obtain the probability that each audio frame is of a first audio type;
And determining the audio frame with the corresponding probability value larger than the preset threshold value as a first audio type, and determining the audio frame with the corresponding probability value smaller than the preset threshold value as a second audio type.
In yet another aspect, a computer device is provided that includes one or more processors and one or more memories having stored therein at least one instruction that is loaded and executed by the one or more processors to perform operations performed by the method of audio classification.
In yet another aspect, a computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to perform operations performed by the method of audio classification is provided.
The beneficial effects that technical scheme that this application embodiment provided brought are:
According to the method, audio data of a unit duration is acquired from the target audio stream in time sequence, the audio data is classified by the audio classification model, and the human voice start-point audio data and the human voice end-point audio data are determined; the human voice audio segment in the target audio stream is then determined from the start-point and end-point audio data. Detecting the human voice start-point and end-point audio data with an audio classification model is more accurate than detecting them from the human voice frequency range as in the prior art, so the accuracy of capturing human voice segments in an audio stream is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an environmental schematic diagram of a method of audio classification provided by an embodiment of the present application;
FIG. 2 is a flow chart of a method for audio classification provided by an embodiment of the present application;
FIG. 3 is a flow chart of a method for audio classification provided by an embodiment of the present application;
FIG. 4 is a process flow diagram of an audio classification model of a method of audio classification provided by an embodiment of the present application;
fig. 5 is a schematic structural diagram of an apparatus for audio classification according to an embodiment of the present application;
fig. 6 is a schematic diagram of a terminal structure provided in an embodiment of the present application;
fig. 7 is a schematic diagram of a server structure according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Embodiments of the present application provide an audio classification method, which may be implemented by a computer device. The computer device may be a terminal, such as a mobile phone, a desktop computer, a tablet computer, or an intelligent wearable device. The terminal may have computing and data-receiving functions and may be installed with applications such as a map application, a chat application, a takeaway application, or a live-streaming application. The embodiments are described by taking a chat application as an example, which is not described in detail here. The computer device may also be a server, for example the background server of the chat application. The server may be a single server or a server group. If it is a single server, it may be responsible for all the processing that the server needs to perform in the following schemes; if it is a server group, different servers in the group may be responsible for different parts of the processing, and the specific allocation may be set arbitrarily by a technician according to actual requirements, which will not be repeated herein.
When a user uses the chat application, the user can create a virtual chat room and invite other users to join it. In the virtual chat room, all users can receive or send audio through their terminals. During this process, the terminal used by each user, or the background server of the chat application, can detect the audio. As shown in fig. 1, the terminal or the background server can process the audio through three mechanisms, namely VAD (Voice Activity Detection), ASR (Automatic Speech Recognition), and MDD (Mispronunciation Detection and Diagnosis), so that the playback quality of the audio is improved and every user gets a better listening experience.
Fig. 2 is a flowchart of a method for audio classification according to an embodiment of the present application. Referring to fig. 2, the process includes:
step 201, audio data of a unit duration is acquired in time sequence in a target audio stream.
Step 202, determining an audio type corresponding to the audio data based on an audio classification model, wherein the audio type comprises a human voice type and a non-human voice type, and the audio classification model comprises a full-connection layer and a long-short-time memory layer.
Step 203, when it is detected that first audio data is of a human voice type and that the audio data preceding it is of a non-human voice type, determining that the first audio data is human voice start-point audio data.
Step 204, when it is detected that second audio data is of a non-human voice type and that the audio data preceding it is of a human voice type, determining that the second audio data is human voice end-point audio data.
Step 205, determining a human voice audio segment in the target audio stream based on the human voice start-point audio data and the human voice end-point audio data, and performing target processing on the human voice audio segment.
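The following is a minimal Python sketch of how steps 201 to 205 fit together. The chunk labels "voice" / "non_voice" and the helper classify_chunk (a stand-in for the audio classification model) are assumptions made only for illustration.

```python
def find_voice_segments(chunks, classify_chunk):
    """Walk unit-duration chunks in time order and return (start, end) index pairs.

    A chunk whose label flips from non-human voice to human voice is the start-point
    audio data; the chunk where it flips back is the end-point audio data.
    """
    segments, start, prev = [], None, "non_voice"
    for i, chunk in enumerate(chunks):
        label = classify_chunk(chunk)            # stand-in for the audio classification model
        if label == "voice" and prev == "non_voice":
            start = i                            # human voice start-point audio data
        elif label == "non_voice" and prev == "voice" and start is not None:
            segments.append((start, i))          # human voice end-point audio data
            start = None
        prev = label
    return segments

# Example with precomputed labels instead of a real model:
labels = ["non_voice", "voice", "voice", "non_voice", "voice", "non_voice"]
print(find_voice_segments(labels, lambda x: x))  # [(1, 3), (4, 5)]
```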
Fig. 3 is a flowchart of a method for audio classification according to an embodiment of the present application. Referring to fig. 3, the process includes:
Step 301, obtaining audio data to be classified.
In implementation, when a user uses the chat application, the user can input a piece of audio to the device. The chat application first classifies the audio through the VAD mechanism, which can be implemented by the audio classification model. After the user inputs a piece of audio, the audio classification model converts it into audio data, and this audio data is the audio data to be classified.
Step 302, determining that the audio data corresponds to the first intermediate data based on the audio data and the input layer of the audio classification model.
Wherein the first intermediate data comprises first sub-intermediate data corresponding to a plurality of audio frames.
In implementation, after the audio data to be classified is obtained, it may be split into a plurality of pieces of sub-audio data, and each piece of sub-audio data is then input into the input layer of the audio classification model to obtain first sub-intermediate data corresponding to the plurality of audio frames. The specific processing may be as follows:
After the audio data to be classified is obtained, it can be split into a plurality of audio frames, i.e., a plurality of pieces of sub-audio data. Each audio frame is then input into the input layer of the audio classification model. The input layer performs feature extraction on each audio frame and obtains the audio feature corresponding to each audio frame, i.e., the first sub-intermediate data corresponding to that audio frame. For example, 39-dimensional MFCC (Mel-Frequency Cepstral Coefficients) features are extracted from the time-domain signal of each audio frame. After all the first sub-intermediate data of the audio data to be classified are obtained through this processing, the first intermediate data is obtained.
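One possible way to obtain such per-frame features is sketched below with librosa, which is not named by the patent. The split of 13 MFCCs plus first- and second-order deltas into a 39-dimensional vector, the 16 kHz sample rate, and the file name are assumptions; the patent only states that 39-dimensional MFCC features are extracted per frame.

```python
import numpy as np
import librosa  # assumption: librosa is only one possible tool for this step

signal, sr = librosa.load("clip.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)       # (13, num_frames)
delta1 = librosa.feature.delta(mfcc)                          # first-order deltas
delta2 = librosa.feature.delta(mfcc, order=2)                 # second-order deltas
first_intermediate = np.vstack([mfcc, delta1, delta2]).T      # (num_frames, 39)
```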
Optionally, if the framework used to build the audio classification model is not compatible with raw audio data, the audio features of the audio data may be extracted first when the audio classification model receives the audio data, and the resulting first intermediate data may be saved in a text format.
For example, the input layer may be built with the TensorFlow machine learning framework. If the TensorFlow framework is not compatible with raw audio data, then when the audio classification model receives the audio data, it may split the audio data into a plurality of pieces of sub-audio data, i.e., audio frames, perform feature extraction on each piece of sub-audio data to obtain the corresponding audio features, i.e., the first sub-intermediate data, and then store all the obtained audio features in a text format for the audio classification model to call directly.
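A small sketch of this optional text-format step is given below; the file name and the "one row of 39 values per frame" layout are assumptions.

```python
import numpy as np

first_intermediate = np.random.randn(100, 39)        # placeholder for the extracted features
np.savetxt("clip_features.txt", first_intermediate)  # persist as plain text
reloaded = np.loadtxt("clip_features.txt")           # shape (100, 39), read back by the classifier
```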
Step 303, determining second intermediate data corresponding to the audio data based on the first intermediate data and the full connection layer of the audio classification model.
In practice, this can be done by the following steps:
first, fourth intermediate data corresponding to the audio data is determined based on the first intermediate data and the convolution layer of the audio classification model.
After the input layer obtains the first intermediate data, the audio classification model can input a preset number of pieces of first sub-intermediate data from the first intermediate data into the convolution layer. The preset number can be generated randomly, which can improve the accuracy of the audio classification model when processing audio data of different lengths. In the convolution layer, a three-dimensional matrix can be generated based on the preset number of pieces of first sub-intermediate data, and the computed audio data is then obtained through two convolution operations with different convolution kernels. For example, convolution operations with kernel sizes of 5 and 3 are performed on the three-dimensional matrix in sequence, so that finer audio features can be obtained.
After these operations are finished, in order to ensure that the size of the audio data obtained after the convolution operations is not reduced, i.e., that the number of audio frames is not reduced, the convolution layer in this scheme can pad the output audio data with zeros. For example, the padding parameter of the convolution layer can be set to 'same', so that after the operations are finished the convolution layer automatically pads the computed audio data with zeros, yielding the zero-padded audio data, namely the fourth intermediate data.
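A minimal sketch of these two convolutions is shown below: kernel sizes 5 and 3 in sequence, with padding='same' so the number of audio frames is preserved. The filter count (64) and the use of 1-D convolution over the frame axis are assumptions not fixed by the patent.

```python
import numpy as np
import tensorflow as tf

inp = tf.keras.Input(shape=(None, 39))                       # (batch, num_frames, 39-dim features)
x = tf.keras.layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(inp)
x = tf.keras.layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(x)
conv_block = tf.keras.Model(inp, x)

demo = np.random.randn(1, 100, 39).astype("float32")         # 100 audio frames
print(conv_block(demo).shape)                                 # (1, 100, 64): frame count preserved
```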
And a second step of determining second intermediate data corresponding to the audio data based on the fourth intermediate data and the full connection layer of the audio classification model.
After the fourth intermediate data is obtained through the above steps, it can be split into a plurality of pieces of fourth sub-intermediate data, which are then input into the full connection layer of the audio classification model to obtain second sub-intermediate data corresponding to the plurality of audio frames. After all the second sub-intermediate data are obtained, the second intermediate data is obtained. The specific processing may be as follows:
First, after the fourth intermediate data is obtained, since the input of the full connection layer is in the format of individual data frames, the audio classification model splits the fourth intermediate data into pieces of fourth sub-intermediate data.
Second, the second sub-intermediate data corresponding to the audio data is determined based on the fourth sub-intermediate data and the full connection layer of the audio classification model.
When the fourth sub-intermediate data is input into the full connection layer, the full connection layer may perform two full connection operations on it according to two preset node numbers. For example, the two preset node numbers may be 512 and 256, and the full connection layer performs, in sequence, full connection operations with 512 and 256 nodes, so as to further extract audio features that better express each audio frame, i.e., to obtain the second sub-intermediate data corresponding to each piece of fourth sub-intermediate data. The second intermediate data is obtained after all the second sub-intermediate data are obtained.
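A sketch of the two full connection operations with 512 and 256 nodes applied to each frame's features is given below; the ReLU activation and the 64-dimensional input are assumptions.

```python
import tensorflow as tf

fc_block = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
])
fourth_sub = tf.random.normal([100, 64])      # one row of fourth sub-intermediate data per frame
second_sub = fc_block(fourth_sub)             # (100, 256): second sub-intermediate data
print(second_sub.shape)
```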
And step 304, determining third intermediate data corresponding to the audio data based on the second intermediate data and the long-short-time memory layer of the audio classification model.
In implementation, after the second intermediate data is obtained, a plurality of second sub intermediate data are combined according to the time sequence of the corresponding audio frames, and are input into a long-short-time memory layer of an audio classification model to obtain third intermediate data corresponding to the audio data, and specific processing may be as follows:
firstly, combining a plurality of second sub intermediate data according to the time sequence of corresponding audio frames, and inputting the second sub intermediate data into a long-short-time memory layer of an audio classification model;
The time order of the audio frames in the audio data is obtained, and the pieces of second sub-intermediate data obtained in the above steps are combined according to the time order of their corresponding audio frames, yielding a series of second sub-intermediate data arranged in time order. This series is then input into the long-short-time memory layer of the audio classification model.
And secondly, determining third intermediate data corresponding to the audio data based on the second sub intermediate data arranged in time sequence and the long-short-time memory layer of the audio classification model.
After the series of time-ordered second sub-intermediate data is input into the long-short-time memory layer of the audio classification model, the temporal relationships among the pieces of second sub-intermediate data can be computed by that layer. The long-short-time memory layer can be built from an LSTM (Long Short-Term Memory) network. Considering the number of parameters, this scheme can adopt a forward (unidirectional) LSTM, which sequentially mines the relationships among the time-ordered second sub-intermediate data through its forget gate, input gate, cell-state update, and output gate, and produces the third intermediate data according to the configuration of the long-short-time memory layer set in advance.
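A sketch of the forward-only long-short-time memory layer modelling the temporal relationship between the per-frame vectors is shown below; the 128-unit size is an assumption.

```python
import tensorflow as tf

lstm_layer = tf.keras.layers.LSTM(128, return_sequences=True)
second_intermediate = tf.random.normal([1, 100, 256])    # frames combined in time order
third_intermediate = lstm_layer(second_intermediate)     # (1, 100, 128): one vector per frame
print(third_intermediate.shape)
```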
Step 305, determining the audio type corresponding to the audio data based on the third intermediate data and the output layer of the audio classification model.
In an implementation, the third intermediate data is split into third sub-intermediate data corresponding to the plurality of audio frames, the third sub-intermediate data corresponding to each audio frame is then input into the output layer of the audio classification model, and the audio type corresponding to each audio frame is determined. The specific process may be as follows:
First, third sub intermediate data corresponding to a plurality of audio frames are respectively input into an output layer of an audio classification model, and the probability that each audio frame is of a first audio type is obtained.
And secondly, determining the audio frames with the corresponding probability values larger than a preset threshold value as a first audio type, namely normal audio classification, and determining the audio frames with the corresponding probability values smaller than the preset threshold value as a second audio type, namely abnormal audio classification.
For example, if the preset threshold is 0.5, an audio frame having a probability of the first audio type greater than 0.5 is determined as a normal audio classification, and an audio frame having a probability of the first audio type not greater than 0.5 is determined as an abnormal audio classification.
Optionally, the probabilities of the same third sub-intermediate data corresponding to different classifications may also be calculated, so as to determine the audio type corresponding to each third sub-intermediate data.
First, two frame vectors may be determined for each third sub-intermediate data corresponding to a normal audio classification and to an abnormal audio classification, respectively.
Second, each audio frame may be classified based on Softmax, which is formulated as follows:
softmax(s_i) = exp(s_i) / Σ_{j=1}^{N} exp(s_j)
where softmax(s_i) represents the probability that the frame belongs to class i, s_i represents the score of a certain class, i.e., one of the two classes mentioned above, and N is the number of classes, which can be set to N = 2 for the two classes of this scheme.
Through the Softmax formula, the probability that each audio frame input into the audio classification model belongs to the normal audio class and the probability that it belongs to the abnormal audio class can be obtained, and whether each audio frame belongs to the normal audio class or the abnormal audio class can then be judged through a preset threshold.
For example, if the preset threshold is 0.5 and the user inputs an audio frame into the audio classification model, the probability of the normal audio class and the probability of the abnormal audio class are obtained through the above steps and the Softmax formula. If the probability of the normal audio class is greater than 0.5, the audio frame corresponding to the third sub-intermediate data belongs to the normal audio class; if the probability of the abnormal audio class is greater than 0.5, it belongs to the abnormal audio class.
Here, the sum of the probability of the normal audio classification and the probability of the abnormal audio classification is 1.
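The per-frame Softmax and threshold decision described above can be sketched as follows, with N = 2 classes. The column order (0 = normal / human voice, 1 = abnormal) is an assumption.

```python
import numpy as np

def classify_frames(frame_scores, threshold=0.5):
    shifted = frame_scores - frame_scores.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)                       # each row sums to 1
    return np.where(probs[:, 0] > threshold, "normal", "abnormal")

print(classify_frames(np.array([[2.0, 0.1], [-1.0, 1.5]])))            # ['normal' 'abnormal']
```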
It should be noted that before the above steps are performed, the audio classification model may be trained. As shown in fig. 4, labeled audio data is input into the audio classification model, and an operation result is obtained sequentially through the input layer, the convolution layer, the full connection layer, the long-short-time memory layer, and the output layer. The result is then compared with the labels of the audio data to compute a difference value, the difference value is fed into a verification algorithm to obtain a verification value, from which an adjustment value for each parameter is obtained, and each parameter is adjusted based on its adjustment value. This process is repeated many times to obtain the audio classification model that can be used in the above steps.
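An end-to-end sketch of the architecture described above and in fig. 4 (input features, convolution, full connection, forward LSTM, per-frame output) is shown below, trained against frame-level labels. Layer sizes, optimizer, and loss function are assumptions the patent does not fix.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(None, 39))                                    # per-frame MFCC
x = tf.keras.layers.Conv1D(64, 5, padding="same", activation="relu")(inputs)
x = tf.keras.layers.Conv1D(64, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.Dense(512, activation="relu")(x)
x = tf.keras.layers.Dense(256, activation="relu")(x)
x = tf.keras.layers.LSTM(128, return_sequences=True)(x)
outputs = tf.keras.layers.Dense(2, activation="softmax")(x)                  # per-frame probabilities
vad_model = tf.keras.Model(inputs, outputs)

vad_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# train_mfcc: (num_clips, num_frames, 39); train_labels: (num_clips, num_frames), 1 = human voice
# vad_model.fit(train_mfcc, train_labels, epochs=20, batch_size=32)
```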
It should be noted that after the training of the audio classification model is completed and before the model is used, the format of the audio classification model may be converted, i.e., converted into a format suitable for the deployment environment. For example, if the deployment environment of the audio classification model is a C++ environment while the model was built in a Python environment, the audio classification model in the ckpt format of the Python environment is converted into an audio classification model in the pb format suitable for the C++ environment.
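One way such a ckpt-to-pb conversion can be done is sketched below with the TF1-compatible freezing utilities. The checkpoint path and the output node name ("output/Softmax") are assumptions; the real names depend on how the graph was built.

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()
with tf.compat.v1.Session() as sess:
    # Restore the Python-side checkpoint, then freeze variables into constants.
    saver = tf.compat.v1.train.import_meta_graph("vad_model.ckpt.meta")
    saver.restore(sess, "vad_model.ckpt")
    frozen_graph = tf.compat.v1.graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), ["output/Softmax"])
    tf.io.write_graph(frozen_graph, ".", "vad_model.pb", as_text=False)
```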
According to the method, audio data of a unit duration is acquired from the target audio stream in time sequence, the audio data is classified by the audio classification model, and the human voice start-point audio data and the human voice end-point audio data are determined; the human voice audio segment in the target audio stream is then determined from the start-point and end-point audio data. Detecting the human voice start-point and end-point audio data with an audio classification model is more accurate than detecting them from the human voice frequency range as in the prior art, so the accuracy of capturing human voice segments in an audio stream is improved.
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
Fig. 5 is a schematic structural diagram of an apparatus for audio classification according to an embodiment of the present application. Referring to fig. 5, the apparatus may be a terminal, and the apparatus includes:
an acquisition module 510, configured to acquire audio data of a unit duration in a time sequence in a target audio stream;
a determining module 520, configured to determine, based on the audio classification model, an audio type corresponding to the audio data, where the audio type includes a human voice type and a non-human voice type, and the audio classification model includes a full connection layer and a long and short time memory layer;
a detection module 530, configured to determine that first audio data is human voice start-point audio data when it is detected that the first audio data is of a human voice type and that the audio data preceding it is of a non-human voice type;
the detection module 530 is further configured to determine that second audio data is human voice end-point audio data when it is detected that the second audio data is of a non-human voice type and that the audio data preceding it is of a human voice type;
and a processing module 540, configured to determine a human voice audio segment in the target audio stream based on the human voice start-point audio data and the human voice end-point audio data, and to perform target processing on the human voice audio segment.
Optionally, the determining module is configured to:
acquiring audio data to be classified;
determining that the audio data corresponds to first intermediate data based on the audio data and an input layer of an audio classification model;
determining second intermediate data corresponding to the audio data based on the first intermediate data and a full connection layer of the audio classification model;
determining third intermediate data corresponding to the audio data based on the second intermediate data and a long-short-time memory layer of the audio classification model;
and determining the audio type corresponding to the audio data based on the third intermediate data and the output layer of the audio classification model.
Optionally, the determining module 520 is configured to:
determining fourth intermediate data corresponding to the audio data based on the first intermediate data and a convolution layer of the audio classification model;
and determining second intermediate data corresponding to the audio data based on the fourth intermediate data and the full connection layer of the audio classification model.
Optionally, the audio data includes sub-audio data corresponding to a plurality of audio frames, the first intermediate data includes first sub-intermediate data corresponding to a plurality of audio frames, the second intermediate data includes second sub-intermediate data corresponding to a plurality of audio frames, and the fourth intermediate data includes fourth sub-intermediate data corresponding to a plurality of audio frames.
Optionally, the determining module 520 is configured to: respectively inputting each piece of sub-audio data into an input layer of an audio classification model to obtain first sub-intermediate data corresponding to a plurality of audio frames;
the determining module 520 is configured to: respectively inputting each first sub intermediate data into a convolution layer of an audio classification model to obtain fourth sub intermediate data corresponding to a plurality of audio frames;
the determining module 520 is configured to: respectively inputting each fourth sub intermediate data into a full connection layer of the audio classification model to obtain second sub intermediate data corresponding to a plurality of audio frames;
the determining module 520 is configured to: combining the plurality of second sub-intermediate data according to the time sequence of the corresponding audio frames, and inputting the second sub-intermediate data into a long-short-time memory layer of the audio classification model to obtain third intermediate data corresponding to the audio data;
The determining module 520 is configured to: splitting the third intermediate data into third sub-intermediate data corresponding to a plurality of audio frames; and respectively inputting third sub intermediate data corresponding to a plurality of audio frames into an output layer of the audio classification model, and determining the audio type corresponding to each audio frame.
Optionally, the audio types include a first audio type and a second audio type, and the determining module 520 is configured to:
respectively inputting third sub intermediate data corresponding to a plurality of audio frames into an output layer of the audio classification model to obtain the probability that each audio frame is of a first audio type;
and determining the audio frame with the corresponding probability value larger than the preset threshold value as a first audio type, and determining the audio frame with the corresponding probability value smaller than the preset threshold value as a second audio type.
According to the method, audio data of a unit duration is acquired from the target audio stream in time sequence, the audio data is classified by the audio classification model, and the human voice start-point audio data and the human voice end-point audio data are determined; the human voice audio segment in the target audio stream is then determined from the start-point and end-point audio data. Detecting the human voice start-point and end-point audio data with an audio classification model is more accurate than detecting them from the human voice frequency range as in the prior art, so the accuracy of capturing human voice segments in an audio stream is improved.
It should be noted that: in the audio classification device provided in the above embodiment, only the division of the above functional modules is used for illustration, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the embodiments of the method for classifying audio provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the embodiments of the method are detailed in the method embodiments, which are not described herein again.
Fig. 6 shows a block diagram of a terminal 600 according to an exemplary embodiment of the present application. The terminal 600 may be: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 600 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal, or other names.
In general, the terminal 600 includes: a processor 601 and a memory 602.
The processor 601 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 601 may be implemented in at least one of the following hardware forms: DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor. The main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 601 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 601 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 602 is used to store at least one instruction for execution by processor 601 to implement the method of audio classification provided by the method embodiments in the present application.
In some embodiments, the terminal 600 may further optionally include: a peripheral interface 603, and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by a bus or signal line. The individual peripheral devices may be connected to the peripheral device interface 603 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 604, a touch display 605, a camera assembly 606, audio circuitry 607, a positioning assembly 608, and a power supply 609.
Peripheral interface 603 may be used to connect at least one Input/Output (I/O) related peripheral to processor 601 and memory 602. In some embodiments, the processor 601, memory 602, and peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 601, memory 602, and peripheral interface 603 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 604 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 604 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 604 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. The radio frequency circuit 604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity ) networks. In some embodiments, the radio frequency circuitry 604 may also include NFC (Near Field Communication, short range wireless communication) related circuitry, which is not limited in this application.
The display screen 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 605 is a touch display, the display 605 also has the ability to collect touch signals at or above the surface of the display 605. The touch signal may be input as a control signal to the processor 601 for processing. At this point, the display 605 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards. In some embodiments, the display 605 may be one, providing a front panel of the terminal 600; in other embodiments, the display 605 may be at least two, respectively disposed on different surfaces of the terminal 600 or in a folded design; in still other embodiments, the display 605 may be a flexible display, disposed on a curved surface or a folded surface of the terminal 600. Even more, the display 605 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The display 605 may be made of LCD (Liquid Crystal Display ), OLED (Organic Light-Emitting Diode) or other materials.
The camera assembly 606 is used to capture images or video. Optionally, the camera assembly 606 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each of which is any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, Virtual Reality (VR) shooting, or other fused shooting functions. In some embodiments, the camera assembly 606 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 607 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 601 for processing, or inputting the electric signals to the radio frequency circuit 604 for voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones may be respectively disposed at different portions of the terminal 600. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 607 may also include a headphone jack.
The positioning component 608 is used to locate the current geographic location of the terminal 600 to enable navigation or LBS (Location Based Service). The positioning component 608 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
A power supply 609 is used to power the various components in the terminal 600. The power source 609 may be alternating current, direct current, disposable battery or rechargeable battery. When the power source 609 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 600 further includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyroscope sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.
The acceleration sensor 611 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 611 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 601 may control the touch display screen 605 to display a user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 611. The acceleration sensor 611 may also be used for the acquisition of motion data of a game or a user.
The gyro sensor 612 may detect a body direction and a rotation angle of the terminal 600, and the gyro sensor 612 may collect a 3D motion of the user on the terminal 600 in cooperation with the acceleration sensor 611. The processor 601 may implement the following functions based on the data collected by the gyro sensor 612: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 613 may be disposed at a side frame of the terminal 600 and/or at a lower layer of the touch display screen 605. When the pressure sensor 613 is disposed at the side frame of the terminal 600, a grip signal of the user on the terminal 600 may be detected, and the processor 601 may perform left/right-hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed at the lower layer of the touch display screen 605, the processor 601 controls the operability controls on the UI interface according to the pressure operation of the user on the touch display screen 605. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 614 is used for collecting the fingerprint of the user, and the processor 601 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the identity of the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying for and changing settings, etc. The fingerprint sensor 614 may be provided on the front, back, or side of the terminal 600. When a physical key or vendor Logo is provided on the terminal 600, the fingerprint sensor 614 may be integrated with the physical key or vendor Logo.
The optical sensor 615 is used to collect ambient light intensity. In one embodiment, processor 601 may control the display brightness of touch display 605 based on the intensity of ambient light collected by optical sensor 615. Specifically, when the intensity of the ambient light is high, the display brightness of the touch display screen 605 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 605 is turned down. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 based on the ambient light intensity collected by the optical sensor 615.
The proximity sensor 616, also referred to as a distance sensor, is typically provided on the front panel of the terminal 600. The proximity sensor 616 is used to collect the distance between the user and the front face of the terminal 600. In one embodiment, when the proximity sensor 616 detects that the distance between the user and the front face of the terminal 600 gradually decreases, the processor 601 controls the touch display screen 605 to switch from the bright-screen state to the off-screen state; when the proximity sensor 616 detects that the distance between the user and the front face of the terminal 600 gradually increases, the processor 601 controls the touch display screen 605 to switch from the off-screen state to the bright-screen state.
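Purely as an illustration of the display-control behavior described above, a minimal self-contained Python sketch might look as follows; the function names, sensor values, and thresholds are assumptions of this sketch, not values specified by this application.

# Hypothetical sketch of the ambient-light and proximity control logic (illustrative only).
def display_brightness(ambient_lux, max_lux=500.0):
    # higher ambient light intensity -> higher display brightness, clamped to [0, 1]
    return min(1.0, max(0.0, ambient_lux / max_lux))

def screen_state(previous_distance, current_distance):
    # distance to the front face decreasing -> switch bright screen to off screen,
    # distance increasing -> switch off screen back to bright screen
    if current_distance < previous_distance:
        return "off"
    if current_distance > previous_distance:
        return "on"
    return "unchanged"

print(display_brightness(120.0))   # 0.24 under the assumed 500 lux full-brightness point
print(screen_state(0.30, 0.05))    # "off": the user is approaching the screen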
Those skilled in the art will appreciate that the structure shown in fig. 5 does not constitute a limitation of the terminal 600, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
Fig. 7 is a schematic structural diagram of a server provided in an embodiment of the present application. The server may be a background server of the chat application. The server 700 may vary considerably in configuration or performance and may include one or more processors (central processing units, CPU) 701 and one or more memories 702, where at least one instruction is stored in the memory 702, and the at least one instruction is loaded and executed by the processor 701 to implement the methods provided in the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may further include other components for implementing the functions of the device, which are not described herein.
In an exemplary embodiment, a computer-readable storage medium is also provided, for example a memory comprising instructions, where the instructions are executable by a processor in a terminal to perform the audio classification method in the above embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The foregoing description of the preferred embodiments of the present application is not intended to limit the present application to the particular embodiments described, nor to limit the scope of the present application to those embodiments.

Claims (14)

1. A method of audio classification, the method comprising:
acquiring, in time sequence, audio data of a unit duration in a target audio stream;
each time audio data of a unit duration is acquired, determining an audio type corresponding to the audio data based on an audio classification model, wherein the audio type comprises a human voice type and a non-human voice type, and the audio classification model comprises a full connection layer and a long-short-time memory layer;
when it is detected that first audio data is of the human voice type and that the audio data preceding the first audio data is of the non-human voice type, determining that the first audio data is human voice start point audio data;
when it is detected that second audio data is of the non-human voice type and that the audio data preceding the second audio data is of the human voice type, determining that the second audio data is human voice end point audio data;
and determining a human voice audio segment in the target audio stream based on the human voice start point audio data and the human voice end point audio data, and performing target processing on the human voice audio segment.
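Purely as an illustration, and not as part of the claims, the start point/end point logic recited in claim 1 could be sketched in Python as follows; classify_unit stands in for the audio classification model, and the "voice"/"non-voice" labels for the human voice and non-human voice types are assumptions of this sketch.

# Illustrative sketch of the segment detection in claim 1 (not the claimed implementation).
def detect_human_voice_segments(audio_units, classify_unit):
    # audio_units: unit-duration chunks of the target audio stream, in time order
    # classify_unit: callable returning "voice" or "non-voice" for one chunk
    segments = []
    previous_type = "non-voice"   # treat the stream as starting from a non-human-voice state
    start_index = None
    for index, unit in enumerate(audio_units):
        current_type = classify_unit(unit)
        if current_type == "voice" and previous_type == "non-voice":
            start_index = index                      # human voice start point audio data
        elif current_type == "non-voice" and previous_type == "voice" and start_index is not None:
            segments.append((start_index, index))    # index holds the human voice end point audio data
            start_index = None
        previous_type = current_type
    return segments

The returned index pairs delimit the human voice audio segments on which the target processing of claim 1 would then be performed.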
2. The method of claim 1, wherein the determining the audio type corresponding to the audio data based on the audio classification model comprises:
determining first intermediate data corresponding to the audio data based on the audio data and an input layer of the audio classification model;
determining second intermediate data corresponding to the audio data based on the first intermediate data and a full connection layer of the audio classification model;
determining third intermediate data corresponding to the audio data based on the second intermediate data and a long-short-time memory layer of the audio classification model;
and determining the audio type corresponding to the audio data based on the third intermediate data and the output layer of the audio classification model.
3. The method of claim 2, wherein the determining the second intermediate data corresponding to the audio data based on the first intermediate data and the fully connected layer of the audio classification model comprises:
determining fourth intermediate data corresponding to the audio data based on the first intermediate data and a convolution layer of the audio classification model;
and determining second intermediate data corresponding to the audio data based on the fourth intermediate data and the full connection layer of the audio classification model.
4. The method of claim 3, wherein the audio data comprises sub-audio data corresponding to a plurality of audio frames, the first intermediate data comprises first sub-intermediate data corresponding to a plurality of audio frames, the second intermediate data comprises second sub-intermediate data corresponding to a plurality of audio frames, and the fourth intermediate data comprises fourth sub-intermediate data corresponding to a plurality of audio frames.
5. The method of claim 4, wherein the determining first intermediate data corresponding to the audio data based on the audio data and the input layer of the audio classification model comprises: respectively inputting each piece of sub-audio data into the input layer of the audio classification model to obtain first sub-intermediate data corresponding to a plurality of audio frames;
the determining, based on the first intermediate data and the convolution layer of the audio classification model, fourth intermediate data corresponding to the audio data includes: respectively inputting each first sub intermediate data into a convolution layer of an audio classification model to obtain fourth sub intermediate data corresponding to a plurality of audio frames;
the determining, based on the fourth intermediate data and the full connection layer of the audio classification model, second intermediate data corresponding to the audio data includes: respectively inputting each fourth sub intermediate data into a full connection layer of the audio classification model to obtain second sub intermediate data corresponding to a plurality of audio frames;
the determining, based on the second intermediate data and the long-short-time memory layer of the audio classification model, third intermediate data corresponding to the audio data includes: combining the plurality of second sub-intermediate data according to the time sequence of the corresponding audio frames, and inputting the second sub-intermediate data into a long-short-time memory layer of the audio classification model to obtain third intermediate data corresponding to the audio data;
the determining, based on the third intermediate data and the output layer of the audio classification model, the audio type corresponding to the audio data includes: splitting the third intermediate data into third sub-intermediate data corresponding to a plurality of audio frames; and respectively inputting third sub intermediate data corresponding to a plurality of audio frames into an output layer of the audio classification model, and determining the audio type corresponding to each audio frame.
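As a rough, non-authoritative sketch of the layer flow recited in claims 2 through 5, the following PyTorch module processes each audio frame through a convolution layer and a full connection layer, combines the per-frame results in time order for the long-short-time memory layer, and maps each frame back to an output; the feature dimension, layer sizes, and activation choices are assumptions of this sketch, and the pre-computed per-frame feature vectors stand in for the first sub-intermediate data produced by the input layer.

import torch
import torch.nn as nn

class AudioClassifierSketch(nn.Module):
    # Illustrative only: layer sizes and feature dimensions are assumed, not taken from the patent.
    def __init__(self, feat_dim=40, conv_channels=16, fc_dim=64, lstm_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(1, conv_channels, kernel_size=3, padding=1)  # convolution layer
        self.fc = nn.Linear(conv_channels * feat_dim, fc_dim)              # full connection layer
        self.lstm = nn.LSTM(fc_dim, lstm_dim, batch_first=True)            # long-short-time memory layer
        self.out = nn.Linear(lstm_dim, 1)                                  # output layer

    def forward(self, frame_features):
        # frame_features: (num_frames, feat_dim), one feature vector per audio frame,
        # standing in for the first sub-intermediate data of each frame
        x = frame_features.unsqueeze(1)                 # (num_frames, 1, feat_dim)
        x = torch.relu(self.conv(x))                    # per-frame fourth sub-intermediate data
        x = torch.relu(self.fc(x.flatten(1)))           # per-frame second sub-intermediate data
        x, _ = self.lstm(x.unsqueeze(0))                # combined in time order: third intermediate data
        probs = torch.sigmoid(self.out(x.squeeze(0)))   # per-frame output of the output layer
        return probs.squeeze(-1)                        # (num_frames,) probabilities

model = AudioClassifierSketch()
frame_probs = model(torch.randn(100, 40))               # 100 frames of assumed 40-dimensional features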
6. The method of claim 5, wherein the audio types include a first audio type and a second audio type, and wherein the respectively inputting third sub-intermediate data corresponding to the plurality of audio frames into the output layer of the audio classification model and determining the audio type corresponding to each audio frame comprises:
respectively inputting third sub intermediate data corresponding to a plurality of audio frames into an output layer of the audio classification model to obtain the probability that each audio frame is of a first audio type;
and determining the audio frame with the corresponding probability value larger than the preset threshold value as a first audio type, and determining the audio frame with the corresponding probability value smaller than the preset threshold value as a second audio type.
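A minimal illustration of the thresholding in claim 6, with 0.5 chosen purely as an example value (the claim only requires some preset threshold):

def frame_audio_types(frame_probabilities, preset_threshold=0.5):
    # probability above the threshold -> first audio type, otherwise second audio type
    return ["first audio type" if p > preset_threshold else "second audio type"
            for p in frame_probabilities]

print(frame_audio_types([0.1, 0.8, 0.9, 0.2]))
# ['second audio type', 'first audio type', 'first audio type', 'second audio type']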
7. An apparatus for audio classification, the apparatus comprising:
an acquisition module, used to acquire, in time sequence, audio data of a unit duration in a target audio stream;
a determining module, used to determine, each time audio data of a unit duration is acquired, an audio type corresponding to the audio data based on an audio classification model, wherein the audio type comprises a human voice type and a non-human voice type, and the audio classification model comprises a full connection layer and a long-short-time memory layer;
a detection module, used to determine that first audio data is human voice start point audio data when it is detected that the first audio data is of the human voice type and that the audio data preceding the first audio data is of the non-human voice type;
the detection module is further used to determine that second audio data is human voice end point audio data when it is detected that the second audio data is of the non-human voice type and that the audio data preceding the second audio data is of the human voice type;
and a processing module, used to determine a human voice audio segment in the target audio stream based on the human voice start point audio data and the human voice end point audio data, and to perform target processing on the human voice audio segment.
8. The apparatus of claim 7, wherein the determining module is configured to:
determining first intermediate data corresponding to the audio data based on the audio data and an input layer of the audio classification model;
determining second intermediate data corresponding to the audio data based on the first intermediate data and a full connection layer of the audio classification model;
determining third intermediate data corresponding to the audio data based on the second intermediate data and a long-short-time memory layer of the audio classification model;
and determining the audio type corresponding to the audio data based on the third intermediate data and the output layer of the audio classification model.
9. The apparatus of claim 8, wherein the determining module is configured to:
determining fourth intermediate data corresponding to the audio data based on the first intermediate data and a convolution layer of the audio classification model;
and determining second intermediate data corresponding to the audio data based on the fourth intermediate data and the full connection layer of the audio classification model.
10. The apparatus of claim 9, wherein the audio data comprises sub-audio data corresponding to a plurality of audio frames, the first intermediate data comprises first sub-intermediate data corresponding to a plurality of audio frames, the second intermediate data comprises second sub-intermediate data corresponding to a plurality of audio frames, and the fourth intermediate data comprises fourth sub-intermediate data corresponding to a plurality of audio frames.
11. The apparatus of claim 10, wherein the determining module is configured to: respectively inputting each piece of sub-audio data into an input layer of an audio classification model to obtain first sub-intermediate data corresponding to a plurality of audio frames;
the determining module is used for: respectively inputting each first sub intermediate data into a convolution layer of an audio classification model to obtain fourth sub intermediate data corresponding to a plurality of audio frames;
the determining module is used for: respectively inputting each fourth sub intermediate data into a full connection layer of the audio classification model to obtain second sub intermediate data corresponding to a plurality of audio frames;
the determining module is used for: combining the plurality of second sub-intermediate data according to the time sequence of the corresponding audio frames, and inputting the second sub-intermediate data into a long-short-time memory layer of the audio classification model to obtain third intermediate data corresponding to the audio data;
the determining module is used for: splitting the third intermediate data into third sub-intermediate data corresponding to a plurality of audio frames; and respectively inputting third sub intermediate data corresponding to a plurality of audio frames into an output layer of the audio classification model, and determining the audio type corresponding to each audio frame.
12. The apparatus of claim 11, wherein the audio types comprise a first audio type and a second audio type, the determining module to:
respectively inputting third sub intermediate data corresponding to a plurality of audio frames into an output layer of the audio classification model to obtain the probability that each audio frame is of a first audio type;
and determining the audio frame with the corresponding probability value larger than the preset threshold value as a first audio type, and determining the audio frame with the corresponding probability value smaller than the preset threshold value as a second audio type.
13. A computer device comprising a processor and a memory having stored therein at least one instruction that is loaded and executed by the processor to perform the operations performed by the method of audio classification of any of claims 1 to 6.
14. A computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to perform the operations performed by the method of audio classification of any of claims 1 to 6.
CN202010358102.6A 2020-04-29 2020-04-29 Audio classification method, device, equipment and storage medium Active CN111613213B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010358102.6A CN111613213B (en) 2020-04-29 2020-04-29 Audio classification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010358102.6A CN111613213B (en) 2020-04-29 2020-04-29 Audio classification method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111613213A CN111613213A (en) 2020-09-01
CN111613213B true CN111613213B (en) 2023-07-04

Family

ID=72198073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010358102.6A Active CN111613213B (en) 2020-04-29 2020-04-29 Audio classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111613213B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112397073B (en) * 2020-11-04 2023-11-21 北京三快在线科技有限公司 Audio data processing method and device
CN112270933B (en) * 2020-11-12 2024-03-12 北京猿力未来科技有限公司 Audio identification method and device
CN115334349B (en) * 2022-07-15 2024-01-02 北京达佳互联信息技术有限公司 Audio processing method, device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108538311A (en) * 2018-04-13 2018-09-14 腾讯音乐娱乐科技(深圳)有限公司 Audio frequency classification method, device and computer readable storage medium
CN109545191A (en) * 2018-11-15 2019-03-29 电子科技大学 The real-time detection method of voice initial position in a kind of song
CN111081277A (en) * 2019-12-19 2020-04-28 广州酷狗计算机科技有限公司 Audio evaluation method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108877778B (en) * 2018-06-13 2019-09-17 百度在线网络技术(北京)有限公司 Sound end detecting method and equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108538311A (en) * 2018-04-13 2018-09-14 腾讯音乐娱乐科技(深圳)有限公司 Audio frequency classification method, device and computer readable storage medium
CN109545191A (en) * 2018-11-15 2019-03-29 电子科技大学 The real-time detection method of voice initial position in a kind of song
CN111081277A (en) * 2019-12-19 2020-04-28 广州酷狗计算机科技有限公司 Audio evaluation method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Generalized recognition method for specific types of audio streams; Luo Senlin et al.; Transactions of Beijing Institute of Technology; Vol. 31, No. 10; full text *

Also Published As

Publication number Publication date
CN111613213A (en) 2020-09-01

Similar Documents

Publication Publication Date Title
CN108615526B (en) Method, device, terminal and storage medium for detecting keywords in voice signal
CN112907725B (en) Image generation, training of image processing model and image processing method and device
CN111696532B (en) Speech recognition method, device, electronic equipment and storage medium
CN111445901B (en) Audio data acquisition method and device, electronic equipment and storage medium
CN111613213B (en) Audio classification method, device, equipment and storage medium
CN111276122B (en) Audio generation method and device and storage medium
CN110798327B (en) Message processing method, device and storage medium
CN110808021B (en) Audio playing method, device, terminal and storage medium
CN111681655A (en) Voice control method and device, electronic equipment and storage medium
CN113963707A (en) Audio processing method, device, equipment and storage medium
CN111862972B (en) Voice interaction service method, device, equipment and storage medium
CN110152309B (en) Voice communication method, device, electronic equipment and storage medium
CN110992954A (en) Method, device, equipment and storage medium for voice recognition
CN110837557A (en) Abstract generation method, device, equipment and medium
CN108231091B (en) Method and device for detecting whether left and right sound channels of audio are consistent
CN112992127A (en) Voice recognition method and device
CN113362836B (en) Vocoder training method, terminal and storage medium
CN113301444B (en) Video processing method and device, electronic equipment and storage medium
CN111145723B (en) Method, device, equipment and storage medium for converting audio
CN115035187A (en) Sound source direction determining method, device, terminal, storage medium and product
CN111028846B (en) Method and device for registration of wake-up-free words
CN113744736A (en) Command word recognition method and device, electronic equipment and storage medium
CN109344284B (en) Song file playing method, device, equipment and storage medium
CN111681654A (en) Voice control method and device, electronic equipment and storage medium
CN108347672B (en) Method, device and storage medium for playing audio

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210903

Address after: 511442 23 floors of B-1 Building, Wanda Commercial Square North District, Wanbo Business District, 79 Wanbo Second Road, Nancun Town, Panyu District, Guangzhou City, Guangdong Province (office use only)

Applicant after: Guangzhou Huanju Shidai Information Technology Co., Ltd.

Address before: 510000 2803, floor 28, building B-1, North District, Wanda Commercial Plaza, Wanbo business district, No. 79, Wanbo Second Road, Nancun Town, Panyu District, Guangzhou City, Guangdong Province

Applicant before: Guangzhou Sanxing Yibai Education Technology Co.,Ltd.

GR01 Patent grant