CN112102846A - Audio processing method and device, electronic equipment and storage medium - Google Patents

Audio processing method and device, electronic equipment and storage medium

Info

Publication number
CN112102846A
Authority
CN
China
Prior art keywords
audio
processed
features
mel
sample
Prior art date
Legal status
Granted
Application number
CN202010931958.8A
Other languages
Chinese (zh)
Other versions
CN112102846B (en)
Inventor
赵苑珺
夏咸军
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010931958.8A
Publication of CN112102846A
Application granted
Publication of CN112102846B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit

Abstract

The embodiments of the application provide an audio processing method and apparatus, an electronic device and a storage medium, applied to the field of artificial intelligence. The method specifically comprises the following steps: acquiring audio to be processed and extracting static audio features of the audio to be processed; performing difference processing on the static audio features to obtain dynamic audio features of the audio to be processed; and combining the static audio features and the dynamic audio features into target audio features of the audio to be processed, and recognizing the target audio features to obtain the audio type of the audio to be processed. By the method and the device, the accuracy and efficiency of audio type recognition are improved.

Description

Audio processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of internet technologies, and in particular, to an audio processing method and apparatus, an electronic device, and a storage medium.
Background
With the continuous development of internet technology, music detection technology plays an increasingly important role in the field of audio processing, and especially has higher requirements on music detection technology in remote music education, remote multi-person conferences and other scenes.
Existing music detection technology generally detects the audio to be processed manually: a listener judges the audio type only after listening to the entire audio to be processed. This manual approach is strongly affected by subjectivity, and a person's subjective judgment may lead to misclassification of the audio to be processed, so the identification result of the audio type is not accurate enough; moreover, manual detection is inefficient.
Disclosure of Invention
The embodiment of the application provides an audio processing method and device, electronic equipment and a storage medium, and accuracy and efficiency of audio type identification are improved.
An aspect of an embodiment of the present application provides an audio processing method, including:
acquiring audio to be processed, and extracting static audio features of the audio to be processed;
carrying out differential processing on the static audio features to obtain dynamic audio features of the audio to be processed;
and combining the static audio features and the dynamic audio features into target audio features of the audio to be processed, and identifying the target audio features to obtain the audio type of the audio to be processed.
An aspect of an embodiment of the present application provides an audio processing apparatus, including:
the acquisition unit is used for acquiring audio to be processed;
the extraction unit is used for extracting the static audio features of the audio to be processed;
the processing unit is used for carrying out differential processing on the static audio features to obtain dynamic audio features of the audio to be processed;
a combination unit, configured to combine the static audio feature and the dynamic audio feature into a target audio feature of the audio to be processed;
and the identification unit is used for identifying the target audio characteristics to obtain the audio type of the audio to be processed.
In an aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to execute the method in the foregoing embodiments.
An aspect of the embodiments of the present application provides a computer storage medium, in which a computer program is stored, where the computer program includes program instructions, and when the program instructions are executed by a processor, the method in the foregoing embodiments is performed.
An aspect of the embodiments of the present application provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium, and when the computer instructions are executed by a processor of a computer device, the computer instructions perform the methods in the embodiments described above.
According to the audio processing method of the application, feature extraction is performed on the audio to be processed to obtain its static audio features, the static audio features are subjected to difference processing to obtain the dynamic audio features of the audio to be processed, and the static and dynamic audio features are then recognized to determine the audio type of the audio to be processed. Compared with manually identifying the audio type, the method is completed automatically by the electronic device, so the recognition efficiency is higher than that of manual identification, the interference of human subjective judgment is avoided, and the accuracy of identifying the audio type of the audio to be processed is improved. Furthermore, because both the static audio features and the dynamic audio features of the audio to be processed are extracted and used as the basis for identifying the audio type, the extracted features are richer, which further improves the accuracy of identifying the audio type of the audio to be processed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic architecture diagram of an audio processing system according to an embodiment of the present application;
figs. 2a-2d are schematic diagrams of an audio processing scene provided by an embodiment of the present application;
fig. 3 is a schematic flowchart of an audio processing method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of extracting static audio features of audio to be processed according to an embodiment of the present application;
fig. 5a is a schematic flowchart of a first-order difference processing provided in an embodiment of the present application;
fig. 5b is a schematic flowchart of a second order difference processing according to an embodiment of the present disclosure;
fig. 5c is a schematic flowchart of audio processing provided by an embodiment of the present application;
fig. 6 is a schematic flowchart of determining an audio type of audio to be processed according to an embodiment of the present application;
fig. 7 is a schematic flowchart of obtaining a sample audio set according to an embodiment of the present application;
fig. 8a is a schematic flowchart of sample audio set collection provided by an embodiment of the present application;
FIG. 8b is a schematic diagram of a sample audio set provided by an embodiment of the present application;
fig. 9 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the descriptions of "first", "second", etc. referred to in the embodiments of the present application are only for descriptive purposes and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a technical feature defined as "first" or "second" may explicitly or implicitly include at least one such feature.
Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology and the like based on the cloud computing business model; it can form a resource pool, is used on demand, and is flexible and convenient. Background services of technical network systems, such as video websites, image websites and other web portals, currently require a large amount of computing and storage resources. With the rapid development and application of the internet industry, each item may have its own identification mark that needs to be transmitted to a background system for logical processing, data at different levels are processed separately, and all kinds of industry data require strong background system support, which can only be achieved through cloud computing.
At present, cloud technologies are mainly divided into a basic cloud technology class and a cloud application class. The basic cloud technology class can be further subdivided into cloud computing, cloud storage, databases, big data, and the like; the cloud application class can be further subdivided into medical cloud, cloud Internet of Things, cloud security, cloud calls, private cloud, public cloud, hybrid cloud, cloud gaming, cloud education, cloud conferencing, cloud social networking, artificial intelligence cloud services, and so on.
From the perspective of basic technology, the audio processing method relates to cloud computing under the cloud technology; from the application perspective, the audio processing method relates to cloud education and cloud conferences belonging to the cloud technology.
Cloud computing (cloud computing) is a computing model that distributes computing tasks over a pool of resources formed by a large number of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Resources in the "cloud" appear to the user as being infinitely expandable and available at any time, available on demand, expandable at any time, and paid for on-demand.
In the application, identifying the target audio characteristics and obtaining the audio type of the audio to be processed involves large-scale calculation, and requires huge calculation power and storage space, so in the application, the electronic device can obtain sufficient calculation power and storage space through a cloud computing technology, further extract the static audio characteristics and the dynamic audio characteristics of the audio to be processed, and determine the audio type of the audio to be processed according to the static audio characteristics and the dynamic audio characteristics of the audio to be processed.
Cloud Computing Education (CCEDU) refers to an educational platform service based on Cloud Computing business model applications. On the cloud platform, all education institutions, training institutions, enrollment service institutions, propaganda institutions, industry associations, management institutions, industry media, legal structures and the like are integrated into a resource pool in a centralized cloud mode, all resources are mutually displayed and interacted and communicated according to needs to achieve intentions, so that education cost is reduced, and efficiency is improved.
A cloud conference is an efficient, convenient and low-cost conference form based on cloud computing technology. Through a simple, easy-to-use internet interface, users can quickly and efficiently share voice, data files and video with teams and clients all over the world, while complex technologies such as data transmission and processing within the conference are handled by the cloud conference service provider. At present, domestic cloud conferences mainly focus on service content in the Software as a Service (SaaS) mode, including service forms such as telephone, network and video; a video conference based on cloud computing is called a cloud conference. In the cloud conference era, data transmission, processing and storage are all handled by the computing resources of the video conference provider, so users do not need to purchase expensive hardware or install complicated software; an efficient teleconference can be held simply by opening a browser and logging in to the corresponding interface. A cloud conference system supports dynamic multi-server cluster deployment and provides multiple high-performance servers, greatly improving conference stability, security and availability. In recent years, video conferencing has been welcomed by many users because it greatly improves communication efficiency, continuously reduces communication cost and upgrades internal management, and it is widely used in government, military, transportation, finance, operators, education, enterprises and other fields. With cloud computing, video conferencing becomes even more attractive in terms of convenience, speed and ease of use, which will surely stimulate a new wave of video conference applications.
The audio processing method can be packaged into a cloud conference service or a cloud education service, and only one interface is exposed to the outside. When the audio type function for identifying the audio frequency is required to be used in service scenes such as cloud conferences or cloud education, the audio type of the audio frequency can be identified by calling the interface.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The scheme provided by the embodiment of the application belongs to the voice processing technology belonging to the field of artificial intelligence.
The key technologies of speech technology are automatic speech recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising human-computer interaction modes.
In this application, the audio type of the audio is mainly identified through an artificial intelligence model, and the identified audio type can be used to judge whether noise reduction of the audio data is needed, for example in cloud conference, cloud education and other fields.
The application can be applied to the following scenes: when the audio type of a certain section of audio needs to be identified, the audio to be processed is obtained, the characteristic of the audio to be processed is extracted to obtain the static audio characteristic of the audio to be processed, the dynamic audio characteristic of the audio to be processed is determined according to the static audio characteristic of the audio to be processed, and the audio type of the audio to be processed is obtained according to the static audio characteristic and the dynamic audio characteristic. Subsequently, the audio to be processed may be processed based on the audio type of the audio to be processed, for example, noise reduction is performed on the audio to be processed belonging to a certain audio type, accurate audio recommendation is performed according to the identified audio type, and the like.
Specifically, if the audio type of the audio to be processed is a music type, outputting the audio to be processed; and if the audio type of the audio to be processed is a non-music type, performing noise reduction processing on the audio to be processed, and outputting the audio to be processed after the noise reduction processing.
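As an illustration of this branch, a minimal sketch follows; the function names and the noise-reduction routine are placeholders assumed for illustration, not part of this application:

```python
def postprocess(audio, audio_type, denoise):
    """Hypothetical post-processing: output music as-is, denoise everything else."""
    if audio_type == "music":
        return audio           # music type: output the audio to be processed directly
    return denoise(audio)      # non-music type: noise reduction, then output
```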
Fig. 1 is a system architecture diagram of audio processing according to an embodiment of the present application. The system architecture comprises a server 140 and an electronic device cluster, where the electronic device cluster may include electronic device 110, electronic device 120, electronic device 130, and so on. The electronic device cluster and the server 140 may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
The server 140 shown in fig. 1 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like.
The electronic device 110, the electronic device 120, the electronic device 130, and the like shown in fig. 1 may be a mobile phone, a tablet computer, a notebook computer, a palm computer, a Mobile Internet Device (MID), a vehicle, a roadside device, an aircraft, a wearable device, such as a smart watch, a smart bracelet, a pedometer, and the like, and may be an intelligent device having an audio processing function.
Taking the electronic device 110 as an example, the electronic device 110 obtains the audio to be processed, and the electronic device 110 sends the audio to be processed to the server 140. The server 140 performs feature extraction on the audio to be processed to obtain a static audio feature of the audio to be processed; the static audio features extracted by the server 140 may specifically be mel-frequency spectrum features and constant Q transformation features; the server 140 performs differential processing on the static audio features to obtain dynamic audio features of the audio to be processed, where the differential processing may specifically be first-order differential processing and second-order differential processing; the server 140 invokes the audio discrimination model to determine the audio type corresponding to the static audio features of the audio to be processed and the dynamic audio features of the audio to be processed.
The server 140 may transmit the audio type of the resulting audio to be processed to the electronic device 110. The electronic device 110 stores the received audio type of the audio to be processed in association with the audio to be processed, and when the electronic device 110 receives an audio type acquisition request of a target user for the audio to be processed, the electronic device 110 outputs the audio type of the audio to be processed; or, the electronic device 110 further performs post-processing on the audio to be processed according to the audio type of the audio to be processed to output a post-processing result, where a manner of the electronic device 110 further performing post-processing on the audio to be processed according to the audio type of the audio to be processed may include: filtering or noise reduction processing, etc.
Extracting the static features of the audio to be processed, determining the dynamic features from the static features, and determining the audio type of the audio to be processed from the dynamic and static features may also be performed by the electronic device 110 or any other electronic device in the electronic device cluster.
It is to be understood that the system architecture diagram described in the embodiment of the present application is for more clearly illustrating the technical solution of the embodiment of the present application, and does not constitute a limitation to the technical solution provided in the embodiment of the present application, and as a person having ordinary skill in the art knows that along with the evolution of the system architecture and the appearance of a new service scenario, the technical solution provided in the embodiment of the present application is also applicable to similar technical problems.
Please refer to fig. 2a-2d, which are schematic views of an audio processing scenario according to an embodiment of the present disclosure. The scene schematic diagram may specifically be a cloud conference scene or a cloud education scene. As shown in fig. 2a, the present scenario includes an electronic device 210, an electronic device 220, an electronic device 230, an electronic device 240, and the like. The electronic device 210, the electronic device 220, the electronic device 230, and the electronic device 240 are communicatively connected to each other, and may communicate through a wireless network or a wired network. It should be noted that the types of the electronic devices in the scenario diagram may be the same or different, and the number of the electronic devices in the embodiment of the present application is only used as an example, and does not constitute a limitation to an application scenario to which the present application is specifically applied.
When the current scene is a cloud conference scene, the electronic device 210 may be an electronic device where a conference speaker is located, and the electronic device 220, the electronic device 230, and the electronic device 240 may be electronic devices where participants are located, where the electronic device 220, the electronic device 230, and the electronic device 240 may be located at the same physical location as the electronic device 210, or may be located at different physical locations.
Taking the electronic device 220 as an example, the electronic device 220 and the electronic device 210 are in different physical locations, which is illustrated in a cloud conference scenario. The microphone of the electronic device 210 captures audio of a conference speaker (pending audio) and transmits the captured pending audio to the electronic device 220. As shown in fig. 2b, the electronic device 220 performs feature extraction on the received audio to be processed, where the feature extraction may specifically be mel-frequency spectrum feature extraction and constant Q transformation feature extraction, where the mel-frequency spectrum feature extraction is performed on the audio to be processed to obtain mel-frequency spectrum static features, the constant Q transformation feature extraction is performed on the audio to be processed to obtain constant Q transformation static features, and the mel-frequency spectrum static features and the constant Q transformation static features are combined to obtain static feature information of the audio to be processed; the electronic device 220 performs a difference process, such as a first-order difference process, on the static feature information of the audio to be processed to obtain a first-order dynamic feature, performs a second-order difference process on the static feature information of the audio to be processed to obtain a second-order dynamic feature, and combines the first-order dynamic feature and the second-order dynamic feature to obtain the dynamic feature information of the audio to be processed. As shown in fig. 2c, the electronic device 220 invokes the audio discrimination model to identify the audio to be processed according to the dynamic feature information and the static feature information, so as to obtain the audio type of the audio to be processed.
In fig. 2d, after the electronic device 220 calls the audio discrimination model to identify the audio to be processed according to the dynamic feature information and the static feature information and obtains the audio type of the audio to be processed, the electronic device can perform post-processing according to that audio type. For example, "open" and "close" options of a music mode are displayed on the terminal interface of the electronic device 220, and the user may turn the music mode on or off. If the user clicks the "open" button, i.e., turns the music mode on, and the audio type of the audio to be processed is identified as the music type, the electronic device 220 may play the audio to be processed directly without performing noise reduction on it; specifically, while the electronic device 220 plays the audio to be processed, the text information corresponding to the audio (for example, the lyrics of the song being played, as shown in the figure) may be displayed on its terminal interface. If the user has turned the music mode on and the audio type of the audio to be processed is identified as a non-music type, the electronic device 220 filters the audio to be processed, so the user cannot hear the audio to be processed coming from the electronic device 210. Through this application, in a real-time audio playing scene the electronic device can identify the audio type of the audio to be processed according to the audio processing method provided herein, and the user can choose, according to his or her own needs, whether to output the audio to be processed or to apply noise reduction to it based on its audio type. Compared with existing audio processing technology, this improves the accuracy of identifying the audio type of the audio to be processed and improves the user experience.
Referring to fig. 3, fig. 3 is a schematic flowchart of audio processing according to an embodiment of the present disclosure. The method is applied to an electronic device, and as shown in fig. 3, the data processing method may include steps S310 to S330. Wherein:
step S310: and acquiring the audio to be processed, and extracting the static audio features of the audio to be processed.
The electronic device may acquire the audio to be processed from an audio database in which a large amount of audio is stored; alternatively, the electronic device may acquire the audio output by another electronic device connected to it, i.e., the audio output by that other electronic device is the audio to be processed, where the two electronic devices are communicatively connected and can perform network communication with each other. The audio output by the other electronic device may be of a pure-music type, a music-plus-noise type, or a pure-noise type. It should be noted that the audio to be processed may be a segment of audio with a specified duration, or may be the minimum unit, a single frame of audio, contained in a segment of audio. This scheme is described in detail taking the audio to be processed as one frame of audio as an example.
It should be noted that the audio to be processed referred to in this application may be a small segment of a longer piece of audio. The piece of audio may be divided into several segments of audio to be processed, say m segments. Each segment is processed by the electronic device in the same way as described in this scheme, so m audio types are obtained, and the audio type of the whole piece of audio can then be determined from these m results. One way of doing so is majority voting: the audio type to which most of the m results belong is determined to be the audio type of the piece of audio, as shown in the sketch below.
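A minimal sketch of this majority-vote decision, assuming a per-segment classifier `classify_segment` already exists (both names are illustrative and not part of this application):

```python
from collections import Counter

def classify_piece_of_audio(segments, classify_segment):
    """Classify each of the m segments, then take the most frequent audio type."""
    segment_types = [classify_segment(seg) for seg in segments]   # m audio types
    majority_type, _ = Counter(segment_types).most_common(1)[0]   # majority vote
    return majority_type
```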
In a possible implementation manner, the electronic device performs mel-frequency spectrum feature extraction on the audio to be processed to obtain mel-frequency spectrum static features, performs constant Q transformation feature extraction on the audio to be processed to obtain constant Q transformation spectrum static features, and then combines the mel-frequency spectrum static features and the constant Q transformation spectrum static features into the static audio features of the audio to be processed.
Referring to fig. 4, fig. 4 is a schematic flowchart of extracting static audio features of audio to be processed according to an embodiment of the present application. The static audio features of the audio to be processed include a Mel-spectrum static feature, which applies a Mel filter bank to an ordinary spectrogram in order to simulate the sensitivity of human hearing to actual frequencies. The specific process of extracting the Mel-spectrum static feature is as follows. The audio to be processed is a time-domain signal (a time-domain signal is one whose independent variable is time and whose dependent variable is amplitude or power); the process may specifically include steps S410 to S450.
Step S410: framing and windowing.
In a specific implementation, the electronic device performs framing and sliding-window operations on the audio to be processed to obtain a plurality of unit audios. The electronic device processes each unit audio in the same manner.
Step S420: and (4) performing fast Fourier transform.
In a specific implementation, the electronic device converts each unit audio obtained by the division into a unit frequency-domain signal. The electronic device may do so by performing a Fast Fourier Transform (FFT), Short-Time Fourier Transform (STFT), Discrete Fourier Transform (DFT) or the like on the unit audio to obtain the spectral energy distribution at each frequency point of the unit audio, i.e., converting the unit audio from a time-domain signal into a frequency-domain signal.
Step S430: taking the square.
In a specific implementation, the electronic device may perform squaring on each unit frequency domain signal to obtain a squared unit frequency domain signal.
Step S440: a mel filter bank.
In a specific implementation, the electronic device passes each squared unit frequency-domain signal through a Mel filter bank to filter it, obtaining a unit Mel-spectrum static feature (an N-dimensional vector) for each unit frequency-domain signal. The unit Mel-spectrum static features of all unit audios determined in this way are combined into the Mel-spectrum static feature of the audio to be processed, which is an N × K feature matrix, where K is the number of unit Mel-spectrum static features and N is the feature dimension of each unit Mel-spectrum static feature.
Step S450: and taking a logarithm.
In a specific implementation, the obtained unit Mel-spectrum features can be subjected to logarithm processing to obtain the unit log-Mel-spectrum static feature of each unit audio, and these unit log-Mel-spectrum static features are then combined into the log-Mel-spectrum static feature of the audio to be processed. The electronic device can use the log-Mel-spectrum static feature as the static feature of the audio to be processed when judging the audio type.
Of course, if step S450 is not executed, the K unit mel-frequency spectrum static features obtained in step S440 may be directly combined into the mel-frequency spectrum static features, and then the audio type may be determined based on the mel-frequency spectrum static features.
The conversion relationship between the mel frequency and the normal scale frequency can be expressed by the following formula (1):

$$f_{Mel} = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right) \qquad (1)$$

In formula (1), f represents the normal scale frequency, i.e., the frequency of the time-domain signal, and $f_{Mel}$ represents the mel frequency.
Mel-spectrum feature extraction of the audio to be processed converts the time-domain audio into a frequency-domain signal; each unit frequency-domain signal passes through the Mel filter bank to obtain the energy of the corresponding unit audio, and the energies of the unit audios are combined to obtain the Mel-spectrum feature of the audio to be processed. By summing the energy within each Mel frequency band, only the amount of energy in each frequency region is considered, which matches the way the human ear distinguishes sounds, so the Mel-spectrum feature obtained in this way is more discriminative. Finally, the Mel-spectrum feature is subjected to logarithm processing. This also reflects the human auditory system: human perception of sound intensity is not linear, and roughly speaking, doubling the perceived loudness requires about eight times the energy. To compress the energy range, the Mel-spectrum feature is therefore further processed logarithmically to obtain the log-Mel-spectrum feature of the audio to be processed. The log-Mel-spectrum feature characterizes the audio well, can later be used to train the audio discrimination model, and serves as important feature information on which the audio discrimination model recognizes the audio to be processed. A sketch of this extraction is given below.
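As a sketch of steps S410-S450 (framing and windowing, FFT, squaring, Mel filtering, logarithm), the librosa library can compute the log-Mel static feature matrix; the sampling rate, frame length, hop length and number of Mel filters below are illustrative assumptions rather than values fixed by this application:

```python
import librosa

def log_mel_static_feature(path, sr=16000, n_fft=1024, hop_length=512, n_mels=64):
    """Return an N x K log-Mel static feature matrix (N = n_mels, K = number of unit audios)."""
    y, sr = librosa.load(path, sr=sr)          # audio to be processed (time-domain signal)
    mel = librosa.feature.melspectrogram(      # framing/windowing + FFT + squaring + Mel filter bank
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels, power=2.0)
    return librosa.power_to_db(mel)            # logarithm (step S450)
```

Omitting the final `power_to_db` call corresponds to using the plain Mel-spectrum static feature of step S440 directly.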
The static audio features of the audio to be processed also include a constant-Q-transform spectrum static feature; again, the audio to be processed is a time-domain signal. First, the electronic device performs framing and sliding-window operations on the audio to be processed to obtain a plurality of unit audios; each unit audio is processed in the same way. The electronic device then obtains a quality factor Q and determines a window length for each unit audio, the window length varying with frequency in the constant Q transform. Next, the electronic device performs time-frequency conversion, namely constant Q transform processing, on each unit audio according to the quality factor Q and its window length, obtaining the unit constant-Q-transform spectrum static feature of each unit audio. Finally, the electronic device combines the unit constant-Q-transform spectrum static features of the plurality of unit audios into the constant-Q-transform spectrum static feature.
In one possible implementation, the electronic device performs constant-Q-transform feature extraction on the audio to be processed in a manner similar to computing a short-time Fourier transform (STFT), with the difference that the window length used in the constant Q transform varies with frequency and the frequency axis is on a logarithmic scale.
As is known, the short-time Fourier transform of a signal is calculated as in equation (2):

$$X(k) = \sum_{n=0}^{N-1} w(n)\, x(n)\, e^{-j 2\pi k n / N} \qquad (2)$$

where w(n) is the window function, x(n) is the signal amplitude of the input signal at sample n, N is the window length, and k is the frequency index.
The center frequency of the k-th filter is $f_k$ and its bandwidth (frequency resolution) is $\delta f_k$; the quality factor Q is defined by formula (3):

$$Q = \frac{f_k}{\delta f_k} \qquad (3)$$
the value of Q is set to be constant and does not change as the center frequency of the filter changes. The window length N is expressed as formula (4):
Figure BDA0002667194870000113
wherein f issTo sample frequency, fkIs the center frequency. As can be seen from equation (4), the center frequency increasesHigh, the shorter the window length. Then, in order to make each octave on the frequency axis represented by 12n grid, the linear frequency should be changed to a non-linear frequency based on log2, i.e., to be used as a frequency generator
Figure BDA0002667194870000121
To replace
Figure BDA0002667194870000122
The computational expression of the constant Q transform can finally be obtained as formula (5):

$$X^{CQ}(k) = \frac{1}{N[k]} \sum_{n=0}^{N[k]-1} w[k,n]\, x(n)\, e^{-j 2\pi Q n / N[k]} \qquad (5)$$

where N[k] represents the window length of the constant Q transform and depends on the frequency, i.e., the value of N[k] is related to the value of k; w[k,n] represents the window function; x(n) represents the signal to be processed; and k is the frequency index of the constant-Q-transform spectrum.
Obtaining the static features of the audio to be processed through the constant Q transform has the advantage that the constant Q transform changes the length of the filter window as the spectral-line frequency changes, so that in a constant-Q spectrogram a higher frequency resolution is obtained in the low-frequency region and a higher time resolution in the high-frequency region, and a logarithmic frequency axis matches the human auditory system better than a linear one. Because the constant Q transform follows the same distribution as musical scale frequencies, the audio features obtained by applying it to the audio to be processed better reflect the characteristics of music-type audio; using the static audio features extracted by the constant Q transform as training-data features for the audio discrimination model therefore allows the model to recognize the audio to be processed more accurately. A corresponding extraction sketch is given below.
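A corresponding sketch for the constant-Q-transform static feature, again using librosa; the number of bins, bins per octave and minimum frequency are illustrative assumptions:

```python
import librosa
import numpy as np

def cqt_static_feature(path, sr=16000, hop_length=512, n_bins=84, bins_per_octave=12, fmin=32.7):
    """Return an N x M constant-Q-transform static feature matrix (magnitudes)."""
    y, sr = librosa.load(path, sr=sr)
    cqt = librosa.cqt(y, sr=sr, hop_length=hop_length, fmin=fmin,
                      n_bins=n_bins, bins_per_octave=bins_per_octave)
    return np.abs(cqt)                         # magnitude spectrum per unit audio
```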
Step S320: and carrying out differential processing on the static audio features to obtain the dynamic audio features of the audio to be processed.
The static audio features include the Mel-spectrum static feature. The electronic device performs first-order difference processing on the Mel-spectrum static feature to obtain the first-order Mel-spectrum dynamic feature of the audio to be processed. For example, as shown in fig. 5a, which is a schematic flowchart of first-order difference processing provided in an embodiment of the present application, the Mel-spectrum static feature is an N × K feature matrix, where N is the feature dimension of each unit Mel-spectrum static feature; generally N is a fixed value, and different feature dimensions can be set for different Mel filter banks. K denotes the K unit Mel-spectrum static features. In the example the Mel-spectrum feature matrix is a 4 × 4 two-dimensional matrix, i.e., x1 denotes the 1st unit Mel-spectrum feature, x2 the 2nd, x3 the 3rd and x4 the 4th. x1, x2, x3 and x4 are the unit Mel-spectrum features of the unit audios obtained along the time axis of the audio to be processed.
For example, in the Mel-spectrum feature matrix, x1 is (m11, m21, m31, m41), x2 is (m12, m22, m32, m42), x3 is (m13, m23, m33, m43) and x4 is (m14, m24, m34, m44). The first-order difference processing of the Mel-spectrum feature matrix may specifically be as follows: the elements of the first column are kept unchanged, i.e., the first column of the first-order Mel-spectrum feature matrix is the same as the first column of the Mel-spectrum feature matrix, so the first column x1' of the first-order matrix is (m11, m21, m31, m41); each element of the first column is subtracted from the corresponding element of the second column to obtain the second column of the first-order matrix, x2' = (m12-m11, m22-m21, m32-m31, m42-m41); similarly, the third column of the first-order matrix is x3' = (m13-m12, m23-m22, m33-m32, m43-m42) and the fourth column is x4' = (m14-m13, m24-m23, m34-m33, m44-m43). Combining the columns x1', x2', x3' and x4' yields a 4 × 4 first-order Mel-spectrum feature matrix, which represents the first-order Mel-spectrum dynamic feature.
For example, given a concrete Mel-spectrum static feature matrix, the first-order Mel-spectrum dynamic feature matrix of the audio to be processed is obtained in the manner described above.
In a possible implementation manner, the electronic device performs the first-order difference processing on the first-order mel-frequency spectrum dynamic characteristic again to obtain the second-order mel-frequency spectrum dynamic characteristic of the audio to be processed. Referring to fig. 5b, fig. 5b is a schematic flow diagram of a second order difference processing provided in the embodiment of the present application, and on the basis of the first order processing, the first order difference processing is performed on the first order mel spectrum dynamic feature of the audio to be processed, so as to obtain the second order mel spectrum dynamic feature of the audio to be processed. It should be noted that a processing mode of the electronic device obtaining the second-order mel-frequency spectrum dynamic feature matrix through the first-order mel-frequency spectrum dynamic feature matrix is the same as a processing mode of the electronic device obtaining the first-order mel-frequency spectrum dynamic feature matrix through the mel-frequency spectrum static feature matrix, and a specific calculation process of each sequence x1 ", x 2", x3 "and x 4" in the second-order mel-frequency spectrum dynamic feature matrix may refer to a calculation process of each sequence x1 ', x 2', x3 'and x 4' in the previous first-order mel-frequency spectrum dynamic feature matrix, which is not described herein again.
For example, starting from the same Mel-spectrum static feature matrix, applying the first-order difference processing again in this way yields the second-order Mel-spectrum dynamic feature matrix of the audio to be processed.
Therefore, the electronic equipment combines the first-order Mel spectrum dynamic characteristics with the second-order Mel spectrum dynamic characteristics to obtain the Mel spectrum dynamic characteristics of the audio to be processed.
Of course, the mel-frequency spectrum static feature of the audio to be processed may also be a logarithmic mel-frequency spectrum static feature, and the processing mode of the electronic device on the logarithmic mel-frequency spectrum static feature is the same as the processing mode of the electronic device on the mel-frequency spectrum static feature.
In one possible implementation, the static audio features include the constant-Q-transform spectrum static feature. The electronic device performs first-order difference processing on the constant-Q-transform static feature to obtain the first-order constant-Q-transform dynamic feature of the audio to be processed, and performs second-order difference processing on the constant-Q-transform static feature to obtain the second-order constant-Q-transform dynamic feature of the audio to be processed. The electronic device combines the first-order and second-order constant-Q-transform dynamic features into the constant-Q-transform dynamic feature of the audio to be processed.
It should be noted that, the processing mode of the electronic device obtaining the first-order constant Q transform spectrum dynamic characteristic according to the constant Q transform spectrum static characteristic and the processing mode of the electronic device obtaining the second-order constant Q transform spectrum dynamic characteristic according to the constant Q transform spectrum static characteristic may be referred to specifically as the processing mode of the electronic device obtaining the first-order mel spectrum dynamic characteristic according to the mel spectrum static characteristic and the processing mode of the electronic device obtaining the second-order mel spectrum dynamic characteristic according to the mel spectrum static characteristic, which are not described herein again.
In one possible implementation, the electronic device combines the mel-frequency spectrum dynamic feature of the audio to be processed and the constant Q-transform dynamic feature of the audio to be processed into the dynamic audio feature of the audio to be processed.
It should be noted that, in the embodiment of the present application, the dynamic audio feature of the audio to be processed is obtained by performing first-order differential processing and second-order differential processing on the static audio feature of the audio to be processed, and in some service scenarios, according to different scenarios and requirements, third-order differential processing or even higher-order differential processing may be performed on the static audio feature of the audio to be processed to obtain a third-order dynamic audio feature and a higher-order dynamic audio feature of the corresponding audio to be processed; then, the electronic device combines the first order dynamic audio features, the second order dynamic audio features, the third order dynamic audio features and the like obtained by the differential processing, and finally obtains the dynamic audio features of the audio to be processed.
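A minimal numpy sketch of the column-wise differencing described above (the first column is kept unchanged and each later column is replaced by its difference from the previous one); it applies equally to the Mel-spectrum and constant-Q-transform static feature matrices, and higher-order dynamic features are obtained simply by applying it repeatedly:

```python
import numpy as np

def first_order_difference(static):
    """First-order difference of an N x K static feature matrix, column by column."""
    dynamic = np.empty_like(static)
    dynamic[:, 0] = static[:, 0]                     # first column kept unchanged
    dynamic[:, 1:] = static[:, 1:] - static[:, :-1]  # x_k' = x_k - x_{k-1}
    return dynamic

def dynamic_features(static):
    """First- and second-order dynamic feature matrices of a static feature matrix."""
    first = first_order_difference(static)           # first-order dynamic feature
    second = first_order_difference(first)           # second-order dynamic feature
    return first, second
```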
Through the above processing, performing difference processing on the static audio features of the audio to be processed to obtain its dynamic audio features expresses the feature information of the audio to be processed in a dynamic dimension, better characterizes the audio to be processed, and makes the extracted features more accurate.
Step S330: and combining the static audio features and the dynamic audio features into target audio features of the audio to be processed, and identifying the target audio features to obtain the audio type of the audio to be processed.
In one possible implementation, the electronic device combines the mel-frequency spectrum static characteristics, the first-order mel-frequency spectrum dynamic characteristics and the second-order mel-frequency spectrum dynamic characteristics into a mel-frequency spectrum characteristic sequence; the electronic equipment combines the static characteristic of the constant Q-conversion frequency spectrum, the dynamic characteristic of the first-order constant Q-conversion frequency spectrum and the dynamic characteristic of the second-order constant Q-conversion frequency spectrum into a constant Q-conversion frequency spectrum characteristic sequence. And combining the Mel frequency spectrum characteristic sequence and the constant Q transformation frequency spectrum characteristic sequence into a target audio characteristic. The combined target audio features are 2 three-dimensional matrixes, wherein the Mel frequency spectrum feature sequence is a three-dimensional matrix, the constant Q transformation frequency spectrum feature sequence is a three-dimensional matrix, and specifically, if the Mel frequency spectrum static features are an N × K two-dimensional matrix, the first-order Mel frequency spectrum dynamic features and the second-order Mel frequency spectrum dynamic features are both N × K two-dimensional matrixes, and the 3 two-dimensional matrixes are combined into an N × K × 3 three-dimensional matrix, namely the Mel frequency spectrum feature sequence in the target audio features has the size of N × K × 3; if the constant Q transform spectrum static feature is a two-dimensional matrix of NxM, the first order constant Q transform dynamic feature and the second order constant Q transform dynamic feature are both two-dimensional matrices of NxM, and the 3 two-dimensional matrices are combined into a three-dimensional matrix of NxM x3, namely the size of the constant Q transform spectrum feature sequence in the target audio feature is NxM x 3.
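A sketch of this combination, reusing the `dynamic_features` helper from the differencing sketch above; the stacking along a third axis is a straightforward reading of the N × K × 3 and N × M × 3 description:

```python
import numpy as np

def build_target_audio_feature(mel_static, cqt_static):
    """Stack static, first-order and second-order features into two 3-D matrices."""
    mel_d1, mel_d2 = dynamic_features(mel_static)    # from the differencing sketch above
    cqt_d1, cqt_d2 = dynamic_features(cqt_static)
    mel_sequence = np.stack([mel_static, mel_d1, mel_d2], axis=-1)  # N x K x 3
    cqt_sequence = np.stack([cqt_static, cqt_d1, cqt_d2], axis=-1)  # N x M x 3
    return mel_sequence, cqt_sequence                # the target audio feature (two tensors)
```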
In one possible implementation, the electronic device calls a first audio discrimination model and determines a first matching probability set between the Mel-spectrum feature sequence and the multiple audio types in the first audio discrimination model; the electronic device calls a second audio discrimination model and determines a second matching probability set between the constant-Q-transform spectrum feature sequence and the multiple audio types in the second audio discrimination model. Finally, the electronic device determines the audio type of the audio to be processed according to the first matching probability set and the second matching probability set. It should be noted that the multiple audio types in the first audio discrimination model and those in the second audio discrimination model are aligned in the same order. Suppose the audio types in the first model are type 1, type 2 and type 3, and the audio types in the second model are also type 1, type 2 and type 3; being aligned means that type 1 in the first model corresponds to the same audio type as type 1 in the second model, type 2 corresponds to type 2, and type 3 corresponds to type 3.
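This passage does not state how the two matching probability sets are fused; one simple possibility, assumed here purely for illustration, is to average them over the aligned audio types and pick the most probable type:

```python
import numpy as np

def fuse_probability_sets(first_probs, second_probs, audio_types):
    """Hypothetical fusion by averaging; the application's actual rule may differ."""
    fused = (np.asarray(first_probs) + np.asarray(second_probs)) / 2.0
    return audio_types[int(np.argmax(fused))]        # audio type of the audio to be processed
```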
Referring to fig. 5c, fig. 5c is a schematic flowchart of an audio processing method according to an embodiment of the present disclosure. An audio signal collector of the electronic equipment can collect audio to be processed; then, the electronic device extracts the audio features of the audio to be processed, and specifically, the manner of extracting the audio features of the audio to be processed may be: extracting Mel frequency spectrum characteristics and constant Q transformation characteristics; further, the static audio features of the audio to be processed are subjected to differential processing, so that the dynamic audio features of the audio to be processed can be obtained; and finally, inputting the static audio features and the dynamic audio features of the audio to be processed into a GRU (audio discrimination model), performing audio recognition on the audio to be processed, and determining whether the audio type of the audio to be processed is a music type or a non-music type.
According to the audio processing method, Mel frequency spectrum feature extraction is carried out on the audio to be processed to obtain its Mel frequency spectrum static features, and constant Q transform feature extraction is carried out on the audio to be processed to obtain its constant Q transform static features; then difference processing is carried out on the Mel frequency spectrum static features and the constant Q transform static features respectively to obtain Mel frequency spectrum dynamic features and constant Q transform dynamic features. Because more comprehensive feature information of the audio to be processed is extracted in this way, the accuracy with which the audio discrimination model identifies the audio to be processed can be improved. Finally, the Mel frequency spectrum static and dynamic features are input into a first audio discrimination model for recognition to obtain a first matching probability set, the constant Q transform static and dynamic features are input into a second audio discrimination model for recognition to obtain a second matching probability set, and the audio type of the audio to be processed is determined according to the first matching probability set and the second matching probability set. Because two audio discrimination models are used to identify the target audio features of the audio to be processed, the accuracy of identifying the audio type of the audio to be processed can be further improved compared with a single audio discrimination model.
Referring to fig. 6, fig. 6 is a schematic flowchart of a process for determining an audio type of an audio to be processed according to an embodiment of the present application, where the determining the audio type of the audio to be processed includes the following steps S610 to S630, and the steps S610 to S630 are specific embodiments of the step S330 in the embodiment corresponding to fig. 3:
step S610: and calling a first audio frequency discrimination model, and determining a first matching probability set between the Mel frequency spectrum characteristic sequence and a plurality of audio frequency types in the first audio frequency discrimination model.
In one possible implementation, the mel-frequency spectrum feature sequence comprises a first mel-frequency spectrum vector feature and a second mel-frequency spectrum vector feature, wherein the first mel-frequency spectrum vector feature is any column vector in the mel-frequency spectrum feature sequence. The electronic device determines a first hidden feature and a first output feature of the first mel-frequency spectrum vector feature based on the first audio discrimination model, the first mel-frequency spectrum vector feature and an initial hidden feature; determines a second hidden feature and a second output feature of the second mel-frequency spectrum vector feature based on the first audio discrimination model, the second mel-frequency spectrum vector feature and the first hidden feature; and performs full-connection processing on the first output feature and the second output feature to obtain a first matching probability set among the multiple audio types in the first audio discrimination model.
For example, the first audio discrimination model may be a recurrent neural network model, such as an RNN (Recurrent Neural Network) model, an LSTM (Long Short-Term Memory) model, or a GRU (Gated Recurrent Unit) model. Considering calculation efficiency, and to reduce the model size while maintaining detection accuracy, the first audio discrimination model adopts a GRU model.
For example, taking the first audio discrimination model as a GRU model for detailed description: first, the electronic device inputs the initial hidden feature h0 and the first mel-frequency spectrum vector feature x1 into the GRU model, encodes the first mel-frequency spectrum vector feature x1 through the GRU model, and outputs a first hidden feature h1 and a first output feature y1 of the first mel-frequency spectrum vector feature; then, the electronic device inputs the first hidden feature h1 and the second mel-frequency spectrum vector feature x2 into the GRU model, encodes the second mel-frequency spectrum vector feature x2 through the GRU model, and outputs a second hidden feature h2 and a second output feature y2 of the second mel-frequency spectrum vector feature; finally, the electronic device passes the first output feature y1 and the second output feature y2 through a plurality of fully connected layers and activation functions, and outputs a first matching probability set among the multiple audio types in the first audio discrimination model. It should be noted that the first mel-frequency spectrum vector feature and the second mel-frequency spectrum vector feature mentioned in this application are merely examples of processing any two inputs of the GRU model; the remaining vector features of the mel-frequency spectrum feature sequence input into the GRU model are processed in the same way. It should also be noted that the present application may use a cross entropy function as the loss function for model training. The cross entropy function is calculated as formula (6):
H(p, q) = -∑_x p(x) · log q(x)    (6)
where p represents the true value (set of audio type tags) and q represents the predicted value (set of audio prediction types).
In a possible implementation manner, the GRU model related to the present application uses a nonlinear activation function to help prevent overfitting during training, thereby improving generalization. Specifically, a rectified linear unit (ReLU) function can alleviate saturation when the values of the first output feature y1 and the second output feature y2 are too large, can also help prevent overfitting and accelerate training, so the ReLU function is selected as the activation function of this scheme. To further prevent overfitting (that is, the model fitting the training set so closely that the error on the test set increases), a regularization norm penalty term is added. Considering the stability and saturating characteristics of the sigmoid function, the output layer of this scheme selects the sigmoid function as the output function. The sigmoid function, also called the Logistic function, has a value range of (0, 1) and maps any real number to the interval (0, 1). It should be noted that the activation functions involved in the GRU model include, but are not limited to: sigmoid, tanh, ReLU, leaky ReLU, ELU, maxout, etc., which are not limited in this application.
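To make the step-wise GRU processing and the activation choices above concrete, the following PyTorch sketch strings the pieces together: a GRU cell that maps (x_t, h_{t-1}) to h_t, a ReLU-activated projection producing the per-step output y_t, and a fully connected output layer with a sigmoid yielding a matching probability set over the audio types. The hidden size, the use of a single fully connected layer, and the mean-pooling of the per-step outputs before that layer are illustrative assumptions rather than the exact structure of the patented model.

```python
import torch
import torch.nn as nn

class GRUAudioDiscriminator(nn.Module):
    """Minimal sketch of the first audio discrimination model (GRU-based)."""
    def __init__(self, feat_dim, hidden_dim=64, num_types=2):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, hidden_dim)   # (x_t, h_{t-1}) -> h_t
        self.proj = nn.Linear(hidden_dim, hidden_dim)  # h_t -> output feature y_t
        self.fc = nn.Linear(hidden_dim, num_types)     # fully connected output layer

    def forward(self, x_seq):
        # x_seq: (T, feat_dim), one column vector of the mel feature sequence per step
        h = torch.zeros(1, self.cell.hidden_size)      # initial hidden feature h0
        outputs = []
        for x_t in x_seq:                              # x1 -> (h1, y1); x2 -> (h2, y2); ...
            h = self.cell(x_t.unsqueeze(0), h)
            outputs.append(torch.relu(self.proj(h)))   # ReLU activation, as chosen above
        pooled = torch.stack(outputs).mean(dim=0)      # aggregate y1, y2, ... (assumption)
        return torch.sigmoid(self.fc(pooled))          # sigmoid output layer, as chosen above

# Hypothetical usage: a 128-dimensional mel feature sequence with 100 frames
first_probs = GRUAudioDiscriminator(feat_dim=128)(torch.randn(100, 128)).squeeze(0)
```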
Step S620: calling a second audio frequency discrimination model, and determining a second matching probability set between the constant Q transformation frequency spectrum characteristic sequence and a plurality of audio frequency types in the second audio frequency discrimination model; the multiple audio types in the first audio discrimination model are the same as the multiple audio types in the second audio discrimination model.
It should be noted that the second audio discrimination model is of the same type as the first audio discrimination model, or the second audio discrimination model and the first audio discrimination model are the same model. For the specific processing flow in which the electronic device calls the second audio discrimination model to determine the second matching probability set between the constant Q transform spectrum feature sequence and the multiple audio types in the second audio discrimination model, reference may be made to the processing flow in step S610 in which the electronic device calls the first audio discrimination model to determine the first matching probability set between the Mel frequency spectrum feature sequence and the multiple audio types in the first audio discrimination model, which is not repeated here.
In one possible implementation manner, the electronic device trains the first audio discrimination model and the second audio discrimination model as follows: a sample audio set is acquired, and the first audio discrimination model is trained with the sample audio set. How the electronic device trains the first audio discrimination model is described in detail below:
The electronic device obtains a sample audio set, the sample audio set being collected by audio receiving devices belonging to multiple device types and comprising a plurality of sample audios. The electronic device extracts the sample static audio features of each sample audio, performs difference processing on the static audio features of each sample audio to obtain the dynamic audio features of each sample audio, and combines the static audio features and the dynamic audio features of each sample audio into the sample audio features of that sample audio. It should be noted that the way in which the electronic device processes each sample audio to obtain its sample audio features may refer to the way in which the electronic device processes the audio to be processed to obtain the target audio features of the audio to be processed in the embodiment of fig. 3, which is not repeated here.
Firstly, the electronic device calls a sample first audio discrimination model to identify and process each sample audio feature, and determines an audio prediction type set of the sample audio set. Secondly, the electronic device acquires an audio type label set of the sample audio set, and trains the sample first audio discrimination model according to the audio prediction type set and the audio type label set. It should be noted that the process of training the first audio discrimination model by the electronic device is iterative; this application only takes one of the multiple training rounds as an example. Then, when the sample first audio discrimination model satisfies the model convergence condition, the sample first audio discrimination model is used as the first audio discrimination model.
The model convergence condition may be any of the following: when the number of training rounds of the sample first audio discrimination model reaches a preset training threshold, for example 100, the sample first audio discrimination model satisfies the model convergence condition, that is, the sample first audio discrimination model after 100 training rounds is taken as the first audio discrimination model; when the error between the audio prediction type set and the audio type label set is smaller than an error threshold, the sample first audio discrimination model satisfies the model convergence condition; or when the change between the audio prediction type sets obtained in two adjacent training rounds of the sample first audio discrimination model is smaller than a change threshold, the sample first audio discrimination model satisfies the model convergence condition.
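A compact training-loop sketch of this iterative procedure, with the three convergence conditions just described, is given below. The Adam optimizer, the threshold values and a model that returns unnormalized scores per audio type are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_sample_model(model, sample_features, type_labels,
                       max_rounds=100, error_threshold=1e-3, change_threshold=1e-4):
    """Iteratively train a sample audio discrimination model until one of the
    convergence conditions above is met; model(sample_features) is assumed to
    return one row of unnormalized scores per sample audio."""
    criterion = nn.CrossEntropyLoss()                 # cross entropy loss of formula (6)
    optimizer = torch.optim.Adam(model.parameters())
    previous_predictions = None
    for _ in range(max_rounds):                       # condition 1: preset training threshold
        optimizer.zero_grad()
        predictions = model(sample_features)          # audio prediction type set
        loss = criterion(predictions, type_labels)    # error against the audio type label set
        loss.backward()
        optimizer.step()
        if loss.item() < error_threshold:             # condition 2: error threshold
            break
        if previous_predictions is not None:
            change = (predictions.detach() - previous_predictions).abs().max().item()
            if change < change_threshold:             # condition 3: change between adjacent rounds
                break
        previous_predictions = predictions.detach()
    return model
```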
Step S630: and determining the audio type of the audio to be processed according to the first matching probability set and the second matching probability set.
In a possible implementation manner, the electronic device performs weighted average operation on the first matching probability set and the second matching probability set to obtain a target matching probability set; and the electronic equipment determines the maximum matching probability from the target matching probability set, and if the maximum matching probability is greater than a preset probability threshold, the audio type corresponding to the maximum matching probability is determined as the audio type of the audio to be processed.
For example, assume that the first matching probability set is (0.7, 0.3), where the audio type corresponding to 0.7 is a music type, the audio type corresponding to 0.3 is a non-music type, and the second matching probability set is (0.8, 0.2), and similarly, the audio type corresponding to 0.8 is a music type, and the audio type corresponding to 0.2 is a non-music type; the electronic device obtains a target matching probability set of (0.75, 0.25) according to the matching probability sets of (0.7, 0.3) and (0.8, 0.2), and if the preset probability threshold is 0.6, the maximum matching probability 0.75 in the target matching probability set is greater than the preset probability threshold of 0.6, and then the audio type corresponding to 0.75 is determined as the audio type of the audio to be processed, that is, the audio type of the audio to be processed is the music type.
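The following short sketch reproduces this fusion step. The equal weights in the weighted average and the threshold of 0.6 follow the example above, and the two-type (music / non-music) setting is taken from this embodiment.

```python
import numpy as np

def fuse_matching_probabilities(first_set, second_set,
                                audio_types=("music", "non-music"),
                                probability_threshold=0.6):
    """Weighted-average the two matching probability sets and apply the
    preset probability threshold to decide the audio type."""
    target_set = (np.asarray(first_set, dtype=float) + np.asarray(second_set, dtype=float)) / 2.0
    best = int(np.argmax(target_set))                 # maximum matching probability
    decided_type = audio_types[best] if target_set[best] > probability_threshold else None
    return decided_type, target_set

audio_type, target_set = fuse_matching_probabilities((0.7, 0.3), (0.8, 0.2))
# target_set -> array([0.75, 0.25]); audio_type -> "music", since 0.75 > 0.6
```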
In one possible implementation manner, the audio to be processed is a minimum-unit frame audio in a segment of audio that also contains a plurality of other minimum-unit frame audios, and the electronic device acquires a preset number of minimum-unit frame audios that are forward adjacent to the audio to be processed in the segment of audio, where forward adjacent means that the timestamps of the acquired minimum-unit frame audios are earlier than the timestamp of the audio to be processed. Assuming the preset number is 20, the electronic device obtains the 20 minimum-unit frame audios preceding the audio to be processed, multiplies each element in the 20 target matching probability sets corresponding to those frames element-wise (position by position) with the corresponding element in the target matching probability set of the audio to be processed to obtain a multiplied target matching probability set, and determines the audio type corresponding to the maximum matching probability in the multiplied target matching probability set as the audio type of the audio to be processed. It should be noted that if the number of minimum-unit frame audios before the audio to be processed is less than 20, for example 10, then each element in the 10 target matching probability sets corresponding to those 10 preceding minimum-unit frame audios is multiplied element-wise with the corresponding element in the target matching probability set of the audio to be processed to obtain the multiplied target matching probability set, and the audio type corresponding to the maximum matching probability in the multiplied target matching probability set is determined as the audio type of the audio to be processed.
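A sketch of this element-wise multiplication with the preceding frames' probability sets follows. The history is assumed to be supplied as a list of earlier target matching probability sets, and the cap of 20 frames mirrors the example above.

```python
import numpy as np

def smooth_with_preceding_frames(current_set, preceding_sets, max_frames=20):
    """Multiply the current frame's target matching probability set element-wise
    with the sets of up to `max_frames` forward-adjacent (earlier) frames and
    return the multiplied set plus the index of its maximum probability."""
    multiplied = np.asarray(current_set, dtype=float).copy()
    for past_set in preceding_sets[-max_frames:]:     # uses fewer if fewer exist (e.g. 10)
        multiplied *= np.asarray(past_set, dtype=float)
    return multiplied, int(np.argmax(multiplied))
```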
In a possible implementation manner, after determining the audio type of the audio to be processed, the electronic device further performs post-processing on the audio to be processed according to its audio type. Specifically, if the audio type of the audio to be processed is the target audio type, the target audio type being a music type, the audio to be processed is output; if the audio type of the audio to be processed is a non-target audio type, the non-target audio type being a non-music type, noise reduction is performed on the audio to be processed and the noise-reduced audio is output. For example, in a scene where a teacher and a student are conducting online music education, if the audio acquired by the student's terminal is of the music type, the audio can be output for the student to learn from; if the audio acquired by the student's terminal is of the non-music type, meaning the audio may be mixed with a lot of noise or consist entirely of noise and is irrelevant to the lesson, the terminal device used by the student automatically filters the audio, thereby improving the teaching quality of the online classroom and the learning efficiency of the student.
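The post-processing branch can be summarized as below. The noise-reduction routine is passed in as a placeholder, because this embodiment does not prescribe a specific noise-reduction algorithm.

```python
def post_process(audio, audio_type, reduce_noise):
    """Output music-type audio directly; denoise non-music-type audio first."""
    if audio_type == "music":        # target audio type: output as-is
        return audio
    return reduce_noise(audio)       # non-target (non-music) type: noise reduction before output
```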
According to the embodiment of the application, a first matching probability set is obtained by calling the first audio discrimination model to identify the Mel frequency spectrum feature sequence, a second matching probability set is obtained by calling the second audio discrimination model to identify the constant Q transform spectrum feature sequence, and the audio type of the audio to be processed is then determined according to the first matching probability set and the second matching probability set. Because the two audio discrimination models respectively identify different feature sequences of the audio to be processed, the finally obtained audio type of the audio to be processed is more accurate, improving the accuracy of audio recognition.
Referring to fig. 7, fig. 7 is a schematic flowchart of a process for acquiring a sample audio set according to an embodiment of the present application. Acquiring the sample audio set includes the following steps S710 to S730, which are a specific embodiment of how the electronic device acquires the sample audio set described for step S610 in the embodiment corresponding to fig. 6:
step S710: the electronic equipment acquires a plurality of first sample audios from audio receiving equipment belonging to a plurality of equipment types; the plurality of first sample audios are acquired when the audio playing device plays audio to be acquired and each audio receiving device starts an audio processing function.
Referring to fig. 8a, fig. 8a is a schematic flowchart of sample audio set collection according to an embodiment of the present disclosure. The details are described by taking as an example how the electronic device 220 obtains the sample audio set. The audio playing device plays the audio to be collected through a loudspeaker, where the audio to be collected comprises HiFi high-quality audio and scene noise without music. The high-quality audio is audio in a lossless format; for example, the lossless format may be FLAC, APE, WAV, etc. The high-quality audio is used as positive samples, and the scene noise without music is used as negative samples.
For example, the multiple device types of audio receiving devices may include a first device type, a second device type, a third device type and a fourth device type. The audio receiving device of the first device type may specifically be an apple mac notebook, the audio receiving device of the second device type may specifically be an apple ipad tablet, the audio receiving device of the third device type may specifically be a terminal device of the android system, and the audio receiving device of the fourth device type may specifically be a PC notebook of the Windows system. Turning on the audio echo filtering function means that each audio receiving device of the multiple device types turns on the 3A algorithms, which specifically include: an echo cancellation algorithm (AEC), automatic gain control (AGC) and active noise control (ANC).
In a possible implementation manner, the audio playing device plays audio to be acquired of a pure music type, and the plurality of first sample audios are the audio to be acquired respectively collected by the audio receiving devices of the multiple device types. Specifically, the plurality of first sample audios consist of: the audio to be acquired collected by the apple mac notebook while the audio playing device plays the pure music type, the audio to be acquired collected by the apple ipad tablet while the audio playing device plays the pure music type, the audio to be acquired collected by the terminal device of the android system while the audio playing device plays the pure music type, and the audio to be acquired collected by the PC notebook of the Windows system while the audio playing device plays the pure music type.
In a possible implementation manner, the audio playing device plays audio to be acquired of a pure noise type, and the plurality of first sample audios are the audio to be acquired respectively collected by the audio receiving devices of the multiple device types. Specifically, the plurality of first sample audios consist of: the audio to be acquired collected by the apple mac notebook while the audio playing device plays the pure noise type, the audio to be acquired collected by the apple ipad tablet while the audio playing device plays the pure noise type, the audio to be acquired collected by the terminal device of the android system while the audio playing device plays the pure noise type, and the audio to be acquired collected by the PC notebook of the Windows system while the audio playing device plays the pure noise type.
Step S720: the electronic device acquires a plurality of second sample audios from the audio receiving devices belonging to the multiple device types; the second sample audios are acquired when the audio playing device plays audio to be acquired and each audio receiving device turns off the audio processing function.
For example, turning off the audio echo filtering function means that each audio receiving device of the multiple device types turns off the 3A algorithms. In a possible implementation manner, the audio playing device plays audio to be acquired of a pure noise type, and the plurality of second sample audios are the audio to be acquired respectively collected by the audio receiving devices of the multiple device types. Specifically, the plurality of second sample audios consist of: the audio to be acquired collected by the apple mac notebook while the audio playing device plays the pure noise type, the audio to be acquired collected by the apple ipad tablet while the audio playing device plays the pure noise type, the audio to be acquired collected by the terminal device of the android system while the audio playing device plays the pure noise type, and the audio to be acquired collected by the PC notebook of the Windows system while the audio playing device plays the pure noise type. It should be noted that the audio to be acquired played by the audio playing device here is the same segment of audio as that played when the audio receiving devices have the audio processing function turned on.
In one possible implementation, the audio playing device plays audio to be collected of a music + noise type, and the audio receiving devices of the multiple device types collect a plurality of third sample audios. Specifically, the plurality of third sample audios consist of: the audio to be acquired collected by the apple mac notebook while the audio playing device plays the music + noise type, the audio to be acquired collected by the apple ipad tablet while the audio playing device plays the music + noise type, the audio to be acquired collected by the terminal device of the android system while the audio playing device plays the music + noise type, and the audio to be acquired collected by the PC notebook of the Windows system while the audio playing device plays the music + noise type.
In one possible implementation, pulse code modulation (PCM) is performed on the audio acquired by the audio receiving devices of the multiple device types. Specifically, PCM includes three processes: sampling, quantizing and encoding; after pulse code modulation, the acquired audio can be output in a binary code format, and after obtaining the modulated audio, the audio receiving device places it into the sample audio set. As shown in fig. 8b, fig. 8b is a schematic diagram of a sample audio set provided in an embodiment of the present application. The electronic device takes audio 1, audio 2, audio 3 and audio 4, which have undergone pulse code modulation, as sample audios and stores them in the sample audio set, where audio 1 and audio 2 are positive samples and audio 3 and audio 4 are negative samples.
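For reference, the quantize-and-encode part of PCM can be sketched as below; a 16-bit depth is an illustrative assumption (the sampling step corresponds to the recording itself).

```python
import numpy as np

def to_pcm16(samples):
    """Quantize floating-point samples in [-1, 1] to 16-bit integers and encode
    them as a binary byte string, i.e. the binary code format mentioned above."""
    clipped = np.clip(np.asarray(samples, dtype=float), -1.0, 1.0)
    quantized = np.round(clipped * 32767).astype(np.int16)   # quantization
    return quantized.tobytes()                               # encoding to binary
```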
It should be noted that, because audio receiving devices of different device types are affected differently by the 3A algorithms, the present application may also consider turning the audio processing function on or off only for audio receiving devices of specified types, where the specified types may include a PC notebook of the Windows system and a terminal device of the android system.
Step S730: the electronic device takes the first sample audio and the second sample audio as the sample audio, and combines all sample audio into the sample audio set.
For example, the electronic device stores, as sample audios in the sample audio set, the plurality of first sample audios acquired by the audio receiving devices of the multiple device types and the plurality of second sample audios acquired by the audio receiving devices of the multiple device types. It should be noted that this embodiment of the application only illustrates the case in which the audio playing device plays the audio to be acquired once and the audio receiving devices of the multiple device types collect it; the audio playing device may play different audio to be acquired multiple times, the audio receiving devices collect sample audio in the same manner each time, and the sample audio collected over the multiple plays is stored in the sample audio set.
The method and the device consider the differences in the collected audio to be processed caused by differences among audio receiving devices of various device types, and take the sample audio received by audio receiving devices of various device types as training data for the audio discrimination model, so that the data sources are richer and the model's ability to recognize the audio to be processed can be improved. Moreover, the influence of the 3A algorithm function of the specified types of audio receiving devices is considered: the audio to be collected is required to be the same segment of audio played by the audio playing device, and the specified types of audio receiving devices collect it in two states, with the audio processing function turned on and with it turned off. This further enriches the training data of the audio discrimination model and accounts for the influence of turning the audio processing function on and off on the collected audio data, thereby further improving the recognition capability and recognition accuracy of the audio discrimination model for the audio to be processed.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present disclosure. The audio processing apparatus can be applied to the electronic device in the method embodiments corresponding to fig. 3 to fig. 8 b. The audio processing means may be a computer program (comprising program code) running in a computer device, e.g. the audio processing means being an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. The audio processing apparatus may include:
an obtaining unit 910, configured to obtain an audio to be processed;
an extracting unit 920, configured to extract a static audio feature of the audio to be processed;
a processing unit 930, configured to perform differential processing on the static audio features to obtain dynamic audio features of the audio to be processed;
a combining unit 940, configured to combine the static audio feature and the dynamic audio feature into a target audio feature of the audio to be processed;
the identifying unit 950 is configured to identify the target audio feature to obtain an audio type of the audio to be processed.
In one possible implementation, the static audio features include mel-frequency spectrum static features and constant Q-transform spectrum static features;
the processing unit 930 performs differential processing on the static audio features to obtain dynamic audio features of the audio to be processed, including:
carrying out differential processing on the Mel frequency spectrum static characteristics to obtain the Mel frequency spectrum dynamic characteristics of the audio to be processed;
carrying out differential processing on the constant Q transform frequency spectrum static characteristics to obtain constant Q transform frequency spectrum dynamic characteristics of the audio to be processed;
combining the Mel spectral dynamics and the constant Q transformed spectral dynamics into the dynamic audio features.
In a possible implementation manner, the processing unit 930 performs a difference processing on the mel-spectrum static feature to obtain the mel-spectrum dynamic feature of the audio to be processed, including:
performing first-order difference processing on the Mel frequency spectrum static characteristics to obtain first-order Mel frequency spectrum dynamic characteristics of the audio to be processed;
performing first-order difference processing on the first-order Mel frequency spectrum dynamic characteristics to obtain second-order Mel frequency spectrum dynamic characteristics of the audio to be processed;
and combining the first order Mel spectral dynamics and the second order Mel spectral dynamics into the Mel spectral dynamics.
In one possible implementation, the audio to be processed belongs to a time-domain signal;
the extracting unit 920 extracts the static audio feature of the audio to be processed, including:
dividing the audio to be processed into a plurality of unit audios, and respectively converting each unit audio into a unit frequency domain signal;
filtering each unit frequency domain signal through a Mel filter to obtain a unit Mel frequency spectrum static characteristic of each unit frequency domain signal, and combining a plurality of unit Mel frequency spectrum static characteristics into a Mel frequency spectrum static characteristic of the audio to be processed;
acquiring a quality factor, and determining the window length of each unit audio according to the central frequency of each unit audio;
performing time-frequency conversion processing on each unit audio according to the quality factor and the window length of each unit audio to obtain a unit constant Q transform spectrum static characteristic of each unit audio, and combining the unit constant Q transform spectrum static characteristics of a plurality of unit audios into a constant Q transform spectrum static characteristic;
and combining the Mel frequency spectrum static characteristics and the constant Q transformation frequency spectrum static characteristics into static audio characteristics of the audio to be processed.
In one possible implementation manner, the target audio features include a mel-frequency spectrum feature sequence and a constant Q-transform spectrum feature sequence, the mel-frequency spectrum feature sequence is formed by combining a mel-frequency spectrum static feature and a mel-frequency spectrum dynamic feature, and the constant Q-transform spectrum feature sequence is formed by combining a constant Q-transform spectrum static feature and a constant Q-transform spectrum dynamic feature;
the identifying unit 950 identifies the target audio feature to obtain the audio type of the audio to be processed, including:
calling a first audio frequency discrimination model, and determining a first matching probability set between the Mel frequency spectrum characteristic sequence and a plurality of audio frequency types in the first audio frequency discrimination model;
calling a second audio frequency discrimination model, and determining a second matching probability set between the constant Q transformation frequency spectrum characteristic sequence and a plurality of audio frequency types in the second audio frequency discrimination model; the multiple audio types in the first audio distinguishing model are the same as the multiple audio types in the second audio distinguishing model;
and determining the audio type of the audio to be processed according to the first matching probability set and the second matching probability set.
In one possible implementation, the obtaining unit 910 obtains a sample audio set, where the sample audio set is collected by audio receiving devices belonging to multiple device types, and the sample audio set includes multiple sample audios;
the extracting unit 920 extracts a sample static audio feature of each sample audio;
the processing unit 930 performs differential processing on the sample static audio features of each sample audio to obtain sample dynamic audio features of each sample audio;
the combining unit 940 combines the static audio features of each sample and the dynamic audio features of each sample into the audio features of each sample, and calls a sample audio discrimination model to respectively identify and process the audio features of each sample to obtain an audio prediction type set of the sample audio set;
an obtaining unit 910 obtains an audio type label set of the sample audio set, and trains the sample audio discrimination model according to the audio prediction type set and the audio type label set; the sample audio distinguishing model comprises a sample first audio distinguishing model and a sample second audio distinguishing model;
when the sample audio frequency discrimination model satisfies the model convergence condition, the processing unit 930 takes the sample first audio frequency discrimination model as the first audio frequency discrimination model and takes the sample second audio frequency discrimination model as the second audio frequency discrimination model.
In one possible implementation, the obtaining unit 910 obtains a sample audio set, including:
acquiring a plurality of first sample audios from audio receiving devices belonging to a plurality of device types; the plurality of first sample audios are acquired when the audio playing equipment plays audio to be acquired and each audio receiving equipment starts an audio echo filtering function;
obtaining a plurality of second sample audios from audio receiving devices belonging to a plurality of device types; the second sample audios are acquired when the audio playing equipment plays audio to be acquired and each audio receiving equipment closes the audio echo filtering function;
and taking the first sample audio and the second sample audio as the sample audio, and combining all the sample audio into the sample audio set.
In one possible implementation, the sequence of mel-frequency spectral features includes a first mel-frequency spectral vector feature and a second mel-frequency spectral vector feature;
the identifying unit 950 invokes a first audio discriminant model to determine a first set of matching probabilities between the mel-frequency spectrum feature sequence and a plurality of audio types in the first audio discriminant model, including:
determining a first hidden feature and a first output feature of the first mel-frequency spectrum vector feature based on the first audio discrimination model, the first mel-frequency spectrum vector feature and an initial hidden feature;
determining a second hidden feature and a second output feature of the second mel-frequency spectrum vector feature based on the first audio discrimination model, the second mel-frequency spectrum vector feature and the first hidden feature;
and carrying out full-connection processing on the first output characteristic and the second output characteristic to obtain a first matching probability set among multiple audio types in the first audio discriminant model.
In a possible implementation manner, the determining, by the identifying unit 950, the audio type of the audio to be processed according to the first matching probability set and the second matching probability set includes:
carrying out weighted average operation on the first matching probability set and the second matching probability set to obtain a target matching probability set;
and determining the maximum matching probability from the target matching probability set, and if the maximum matching probability is greater than a preset probability threshold, determining the audio type corresponding to the maximum matching probability as the audio type of the audio to be processed.
In one possible implementation, the audio processing apparatus further includes: an output unit 960.
An output unit 960, configured to output the to-be-processed audio if the audio type of the to-be-processed audio is a target audio type;
the output unit 960 is further configured to, if the audio type of the to-be-processed audio is a non-target audio type, perform noise reduction on the to-be-processed audio, and output the to-be-processed audio after the noise reduction; the target audio type is a music type and the non-target audio type is a non-music type.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. The electronic device in the embodiments corresponding to fig. 3 to 8b may be the electronic device 1000. As shown in fig. 10, the electronic device 1000 may include: a user interface 1002, a processor 1004, an encoder 1006, and a memory 1008. A signal receiver 1016 is used to receive or transmit data via a cellular interface 1010 and a WIFI interface 1012. The encoder 1006 encodes the received data into a computer-processed data format. The memory 1008 stores a computer program by which the processor 1004 is arranged to perform the steps of any of the method embodiments described above. The memory 1008 may include volatile memory (e.g., dynamic random access memory, DRAM) and may also include non-volatile memory (e.g., one-time programmable read-only memory, OTPROM). In some instances, the memory 1008 may further include memory located remotely from the processor 1004, which may be connected to the electronic device 1000 via a network. The user interface 1002 may include: a keyboard 1018 and a display 1020.
In the electronic device 1000 shown in fig. 10, the processor 1004 may be configured to call the memory 1008 to store a computer program to implement:
acquiring audio to be processed, and extracting static audio features of the audio to be processed;
carrying out differential processing on the static audio features to obtain dynamic audio features of the audio to be processed;
and combining the static audio features and the dynamic audio features into target audio features of the audio to be processed, and identifying the target audio features to obtain the audio type of the audio to be processed.
In one possible implementation, the static audio features include mel-frequency spectrum static features and constant Q-transform spectrum static features;
the processor 1004 performs differential processing on the static audio features to obtain dynamic audio features of the audio to be processed, including:
carrying out differential processing on the Mel frequency spectrum static characteristics to obtain the Mel frequency spectrum dynamic characteristics of the audio to be processed;
carrying out differential processing on the constant Q transform frequency spectrum static characteristics to obtain constant Q transform frequency spectrum dynamic characteristics of the audio to be processed;
combining the Mel spectral dynamics and the constant Q transformed spectral dynamics into the dynamic audio features.
In a possible implementation manner, the differentiating the mel-spectrum static feature by the processor 1004 to obtain the mel-spectrum dynamic feature of the audio to be processed includes:
performing first-order difference processing on the Mel frequency spectrum static characteristics to obtain first-order Mel frequency spectrum dynamic characteristics of the audio to be processed;
performing first-order difference processing on the first-order Mel frequency spectrum dynamic characteristics to obtain second-order Mel frequency spectrum dynamic characteristics of the audio to be processed;
and combining the first order Mel spectral dynamics and the second order Mel spectral dynamics into the Mel spectral dynamics.
In one possible implementation, the audio to be processed belongs to a time-domain signal;
the processor 1004 extracts the static audio features of the audio to be processed, including:
dividing the audio to be processed into a plurality of unit audios, and respectively converting each unit audio into a unit frequency domain signal;
filtering each unit frequency domain signal through a Mel filter to obtain a unit Mel frequency spectrum static characteristic of each unit frequency domain signal, and combining a plurality of unit Mel frequency spectrum static characteristics into a Mel frequency spectrum static characteristic of the audio to be processed;
acquiring a quality factor, and determining the window length of each unit audio according to the central frequency of each unit audio;
performing time-frequency conversion processing on each unit audio according to the quality factor and the window length of each unit audio to obtain a unit constant Q transform spectrum static characteristic of each unit audio, and combining the unit constant Q transform spectrum static characteristics of a plurality of unit audios into a constant Q transform spectrum static characteristic;
and combining the Mel frequency spectrum static characteristics and the constant Q transformation frequency spectrum static characteristics into static audio characteristics of the audio to be processed.
In one possible implementation manner, the target audio features include a mel-frequency spectrum feature sequence and a constant Q-transform spectrum feature sequence, the mel-frequency spectrum feature sequence is formed by combining a mel-frequency spectrum static feature and a mel-frequency spectrum dynamic feature, and the constant Q-transform spectrum feature sequence is formed by combining a constant Q-transform spectrum static feature and a constant Q-transform spectrum dynamic feature;
the processor 1004 identifies the target audio feature, and obtains an audio type of the audio to be processed, including:
calling a first audio frequency discrimination model, and determining a first matching probability set between the Mel frequency spectrum characteristic sequence and a plurality of audio frequency types in the first audio frequency discrimination model;
calling a second audio frequency discrimination model, and determining a second matching probability set between the constant Q transformation frequency spectrum characteristic sequence and a plurality of audio frequency types in the second audio frequency discrimination model; the multiple audio types in the first audio distinguishing model are the same as the multiple audio types in the second audio distinguishing model;
and determining the audio type of the audio to be processed according to the first matching probability set and the second matching probability set.
In one possible implementation, the processor 1004 may be further configured to call the memory 1008 to store a computer program for performing the following steps:
obtaining a sample audio set, wherein the sample audio set is collected by audio receiving equipment belonging to a plurality of equipment types, and the sample audio set comprises a plurality of sample audios;
extracting sample static audio features of each sample audio;
carrying out differential processing on the sample static audio features of each sample audio to obtain sample dynamic audio features of each sample audio;
combining the static audio features and the dynamic audio features of each sample into the audio features of each sample audio, and calling a sample audio discrimination model to respectively identify and process the audio features of each sample to obtain an audio prediction type set of the sample audio set;
acquiring an audio type label set of the sample audio set, and training the sample audio distinguishing model according to the audio prediction type set and the audio type label set; the sample audio distinguishing model comprises a sample first audio distinguishing model and a sample second audio distinguishing model;
and when the sample audio discriminant model meets a model convergence condition, taking the sample first audio discriminant model as the first audio discriminant model, and taking the sample second audio discriminant model as the second audio discriminant model.
In one possible implementation, the processor 1004 obtains a sample audio set, including:
acquiring a plurality of first sample audios from audio receiving devices belonging to a plurality of device types; the plurality of first sample audios are acquired when the audio playing equipment plays audio to be acquired and each audio receiving equipment starts an audio echo filtering function;
obtaining a plurality of second sample audios from audio receiving devices belonging to a plurality of device types; the second sample audios are acquired when the audio playing equipment plays audio to be acquired and each audio receiving equipment closes the audio echo filtering function;
and taking the first sample audio and the second sample audio as the sample audio, and combining all the sample audio into the sample audio set.
In one possible implementation, the sequence of mel-frequency spectral features includes a first mel-frequency spectral vector feature and a second mel-frequency spectral vector feature;
the processor 1004 invokes a first audio discriminant model to determine a first set of matching probabilities between the sequence of mel-frequency features and a plurality of audio types in the first audio discriminant model, including:
determining a first hidden feature and a first output feature of the first mel-frequency spectrum vector feature based on the first audio discrimination model, the first mel-frequency spectrum vector feature and an initial hidden feature;
determining a second hidden feature and a second output feature of the second mel-frequency spectrum vector feature based on the first audio discrimination model, the second mel-frequency spectrum vector feature and the first hidden feature;
and carrying out full-connection processing on the first output characteristic and the second output characteristic to obtain a first matching probability set among multiple audio types in the first audio discriminant model.
In one possible implementation manner, the determining, by the processor 1004, the audio type of the audio to be processed according to the first matching probability set and the second matching probability set includes:
carrying out weighted average operation on the first matching probability set and the second matching probability set to obtain a target matching probability set;
and determining the maximum matching probability from the target matching probability set, and if the maximum matching probability is greater than a preset probability threshold, determining the audio type corresponding to the maximum matching probability as the audio type of the audio to be processed.
In one possible implementation, the processor 1004 further performs the following steps:
if the audio type of the audio to be processed is the target audio type, outputting the audio to be processed;
if the audio type of the audio to be processed is a non-target audio type, performing noise reduction processing on the audio to be processed, and outputting the audio to be processed after the noise reduction processing; the target audio type is a music type and the non-target audio type is a non-music type.
It should be understood that the electronic device 1000 described in the embodiment of the present invention may perform the description of the audio processing method in the embodiment corresponding to fig. 3 to fig. 8b, and may also perform the description of the audio processing apparatus in the embodiment corresponding to fig. 9, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, here, it is to be noted that: an embodiment of the present invention further provides a computer storage medium, where the computer storage medium stores a computer program executed by the aforementioned audio processing apparatus, and the computer program includes program instructions, and when the processor executes the program instructions, the method in the embodiment corresponding to fig. 3 to 8b can be executed, and therefore, details will not be repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in the embodiments of the computer storage medium to which the present invention relates, reference is made to the description of the method embodiments of the present invention. By way of example, program instructions may be deployed to be executed on one computer device or on multiple computer devices at one site or distributed across multiple sites and interconnected by a communication network, which may comprise a block chain system.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and executes the computer instruction, so that the computer device can execute the method in the embodiment corresponding to fig. 3 to 8b, and therefore, the detailed description thereof will not be repeated here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure is only a preferred embodiment of the present invention and is of course not intended to limit the scope of the claims; therefore, equivalent changes made according to the claims of the present invention still fall within the scope of the present invention.

Claims (13)

1. A method of audio processing, the method comprising:
acquiring audio to be processed, and extracting static audio features of the audio to be processed;
carrying out differential processing on the static audio features to obtain dynamic audio features of the audio to be processed;
and combining the static audio features and the dynamic audio features into target audio features of the audio to be processed, and identifying the target audio features to obtain the audio type of the audio to be processed.
2. The method of claim 1, wherein the static audio features comprise mel-frequency spectral static features and constant Q-transform spectral static features;
the differential processing is performed on the static audio features to obtain the dynamic audio features of the audio to be processed, and the differential processing comprises the following steps:
carrying out differential processing on the Mel frequency spectrum static characteristics to obtain the Mel frequency spectrum dynamic characteristics of the audio to be processed;
carrying out differential processing on the constant Q transform frequency spectrum static characteristics to obtain constant Q transform frequency spectrum dynamic characteristics of the audio to be processed;
combining the Mel spectral dynamics and the constant Q transformed spectral dynamics into the dynamic audio features.
3. The method of claim 2, wherein the differentiating the mel-spectrum static features to obtain mel-spectrum dynamic features of the audio to be processed comprises:
performing first-order difference processing on the Mel frequency spectrum static characteristics to obtain first-order Mel frequency spectrum dynamic characteristics of the audio to be processed;
performing first-order difference processing on the first-order Mel frequency spectrum dynamic characteristics to obtain second-order Mel frequency spectrum dynamic characteristics of the audio to be processed;
and combining the first order Mel spectral dynamics and the second order Mel spectral dynamics into the Mel spectral dynamics.
4. The method according to claim 2, wherein the audio to be processed belongs to a time domain signal;
the extracting the static audio features of the audio to be processed comprises:
dividing the audio to be processed into a plurality of unit audios, and respectively converting each unit audio into a unit frequency domain signal;
filtering each unit frequency domain signal through a Mel filter to obtain a unit Mel frequency spectrum static characteristic of each unit frequency domain signal, and combining a plurality of unit Mel frequency spectrum static characteristics into a Mel frequency spectrum static characteristic of the audio to be processed;
acquiring a quality factor, and determining the window length of each unit audio according to the central frequency of each unit audio;
performing time-frequency conversion processing on each unit audio according to the quality factor and the window length of each unit audio to obtain a unit constant Q transform spectrum static characteristic of each unit audio, and combining the unit constant Q transform spectrum static characteristics of a plurality of unit audios into a constant Q transform spectrum static characteristic;
and combining the Mel frequency spectrum static characteristics and the constant Q transformation frequency spectrum static characteristics into static audio characteristics of the audio to be processed.
5. The method according to claim 1, wherein the target audio features comprise a mel-frequency spectrum feature sequence and a constant Q-transform spectrum feature sequence, the mel-frequency spectrum feature sequence is formed by combining mel-frequency spectrum static features and mel-frequency spectrum dynamic features, and the constant Q-transform spectrum feature sequence is formed by combining constant Q-transform spectrum static features and constant Q-transform spectrum dynamic features;
the identifying the target audio characteristics to obtain the audio type of the audio to be processed includes:
calling a first audio frequency discrimination model, and determining a first matching probability set between the Mel frequency spectrum characteristic sequence and a plurality of audio frequency types in the first audio frequency discrimination model;
calling a second audio frequency discrimination model, and determining a second matching probability set between the constant Q transformation frequency spectrum characteristic sequence and a plurality of audio frequency types in the second audio frequency discrimination model; the multiple audio types in the first audio distinguishing model are the same as the multiple audio types in the second audio distinguishing model;
and determining the audio type of the audio to be processed according to the first matching probability set and the second matching probability set.
6. The method of claim 5, further comprising:
obtaining a sample audio set, wherein the sample audio set is collected by audio receiving devices belonging to a plurality of device types, and the sample audio set comprises a plurality of sample audios;
extracting sample static audio features of each sample audio;
performing differential processing on the sample static audio features of each sample audio to obtain sample dynamic audio features of each sample audio;
combining the sample static audio features and the sample dynamic audio features of each sample audio into sample audio features, and calling a sample audio discrimination model to respectively perform identification processing on the sample audio features to obtain an audio prediction type set of the sample audio set;
acquiring an audio type label set of the sample audio set, and training the sample audio discrimination model according to the audio prediction type set and the audio type label set, wherein the sample audio discrimination model comprises a sample first audio discrimination model and a sample second audio discrimination model;
and when the sample audio discrimination model satisfies a model convergence condition, taking the sample first audio discrimination model as the first audio discrimination model, and taking the sample second audio discrimination model as the second audio discrimination model.
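Below is an illustrative training loop for the sample discrimination models of claim 6, written with PyTorch. The model objects, the way features and labels are batched, the optimizer, and the hyperparameters are all assumptions made for the sake of the sketch; the patent does not specify them.

```python
import torch
import torch.nn as nn

def train_sample_models(sample_features, sample_labels,
                        mel_model, cqt_model, epochs=10, lr=1e-3):
    # sample_features: list of (mel_sequence, cqt_sequence) tensor pairs
    # sample_labels:   tensor of audio-type label indices (the label set)
    params = list(mel_model.parameters()) + list(cqt_model.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for (mel_seq, cqt_seq), label in zip(sample_features, sample_labels):
            optimizer.zero_grad()
            # Each sample model predicts an audio type from its own feature sequence.
            logits_mel = mel_model(mel_seq.unsqueeze(0))
            logits_cqt = cqt_model(cqt_seq.unsqueeze(0))
            # Both models are trained against the same audio type label.
            loss = (loss_fn(logits_mel, label.unsqueeze(0)) +
                    loss_fn(logits_cqt, label.unsqueeze(0)))
            loss.backward()
            optimizer.step()
    return mel_model, cqt_model
```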
7. The method of claim 6, wherein the obtaining the sample audio set comprises:
acquiring a plurality of first sample audios from audio receiving devices belonging to a plurality of device types, wherein the plurality of first sample audios are collected while an audio playing device plays audio to be collected and each audio receiving device has an audio echo filtering function enabled;
acquiring a plurality of second sample audios from audio receiving devices belonging to the plurality of device types, wherein the plurality of second sample audios are collected while the audio playing device plays the audio to be collected and each audio receiving device has the audio echo filtering function disabled;
and taking the first sample audios and the second sample audios as the sample audios, and combining all the sample audios into the sample audio set.
8. The method of claim 5, wherein the Mel spectrum feature sequence comprises a first Mel spectrum vector feature and a second Mel spectrum vector feature;
the calling a first audio discrimination model and determining a first matching probability set between the Mel spectrum feature sequence and a plurality of audio types in the first audio discrimination model comprises:
determining a first hidden feature and a first output feature of the first Mel spectrum vector feature based on the first audio discrimination model, the first Mel spectrum vector feature and an initial hidden feature;
determining a second hidden feature and a second output feature of the second Mel spectrum vector feature based on the first audio discrimination model, the second Mel spectrum vector feature and the first hidden feature;
and performing full-connection processing on the first output feature and the second output feature to obtain the first matching probability set between the Mel spectrum feature sequence and the plurality of audio types in the first audio discrimination model.
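A minimal sketch of the step-wise recurrent processing in claim 8: each Mel spectrum vector feature is fed to the model together with the previous hidden feature, and the per-step output features are pooled through a fully connected layer. A GRU cell is used here purely as an assumed stand-in, since the patent does not name a specific recurrent structure.

```python
import torch
import torch.nn as nn

class MelDiscriminationModel(nn.Module):
    def __init__(self, feat_dim, hidden_dim, num_types):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, num_types)

    def forward(self, mel_vectors):
        # mel_vectors: (num_steps, feat_dim); the first and second Mel
        # spectrum vector features correspond to the first two steps.
        hidden = torch.zeros(1, self.cell.hidden_size)     # initial hidden feature
        outputs = []
        for step in mel_vectors:
            hidden = self.cell(step.unsqueeze(0), hidden)  # next hidden feature
            outputs.append(hidden)                         # per-step output feature
        # Full-connection processing of the accumulated output features
        # yields one matching probability per audio type.
        pooled = torch.stack(outputs).mean(dim=0)
        return torch.softmax(self.fc(pooled), dim=-1)
```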
9. The method of claim 5, wherein the determining the audio type of the audio to be processed according to the first matching probability set and the second matching probability set comprises:
performing a weighted average operation on the first matching probability set and the second matching probability set to obtain a target matching probability set;
and determining a maximum matching probability from the target matching probability set, and if the maximum matching probability is greater than a preset probability threshold, determining the audio type corresponding to the maximum matching probability as the audio type of the audio to be processed.
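A sketch of the probability fusion in claim 9 follows; the equal fusion weights and the probability threshold are illustrative assumptions, not values given in the patent.

```python
import numpy as np

def fuse_and_decide(p1, p2, audio_types, w1=0.5, w2=0.5, threshold=0.6):
    # Weighted average of the two matching probability sets gives the
    # target matching probability set.
    target_probs = w1 * np.asarray(p1) + w2 * np.asarray(p2)

    best = int(np.argmax(target_probs))
    # Accept the most likely audio type only if its probability clears the threshold.
    if target_probs[best] > threshold:
        return audio_types[best]
    return None  # no audio type determined
```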
10. The method according to any one of claims 1 to 9, further comprising:
if the audio type of the audio to be processed is a target audio type, outputting the audio to be processed;
and if the audio type of the audio to be processed is a non-target audio type, performing noise reduction on the audio to be processed and outputting the noise-reduced audio, wherein the target audio type is a music type and the non-target audio type is a non-music type.
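An illustrative routing step for claim 10; `reduce_noise` and `play` are hypothetical helpers standing in for whatever noise-reduction and output paths the surrounding system provides.

```python
def route_audio(audio, audio_type, reduce_noise, play, target_type="music"):
    if audio_type == target_type:
        # Music-type audio is output as-is.
        play(audio)
    else:
        # Non-music audio is noise-reduced before being output.
        play(reduce_noise(audio))
```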
11. An audio processing apparatus, comprising:
an acquisition unit, configured to acquire audio to be processed;
an extraction unit, configured to extract static audio features of the audio to be processed;
a processing unit, configured to perform differential processing on the static audio features to obtain dynamic audio features of the audio to be processed;
a combination unit, configured to combine the static audio features and the dynamic audio features into target audio features of the audio to be processed;
and an identification unit, configured to identify the target audio features to obtain the audio type of the audio to be processed.
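A rough sketch of how the units of claim 11 might be organized as one processing pipeline; the class name and the callables passed in are assumptions, and all behavior is delegated to placeholders.

```python
class AudioProcessingApparatus:
    def __init__(self, extractor, differentiator, recognizer):
        self.extractor = extractor            # extraction unit
        self.differentiator = differentiator  # processing unit (differencing)
        self.recognizer = recognizer          # identification unit

    def process(self, audio):                 # acquisition unit supplies `audio`
        static_feat = self.extractor(audio)                # static audio features
        dynamic_feat = self.differentiator(static_feat)    # dynamic audio features
        target_feat = (static_feat, dynamic_feat)          # combination unit
        return self.recognizer(target_feat)                # audio type
```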
12. An electronic device, comprising a memory and a processor, wherein the memory stores a set of program codes, and the processor calls the program codes stored in the memory to perform the method according to any one of claims 1 to 10.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method according to any one of claims 1 to 10.
CN202010931958.8A 2020-09-04 2020-09-04 Audio processing method and device, electronic equipment and storage medium Active CN112102846B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010931958.8A CN112102846B (en) 2020-09-04 2020-09-04 Audio processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010931958.8A CN112102846B (en) 2020-09-04 2020-09-04 Audio processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112102846A true CN112102846A (en) 2020-12-18
CN112102846B CN112102846B (en) 2021-08-17

Family

ID=73751675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010931958.8A Active CN112102846B (en) 2020-09-04 2020-09-04 Audio processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112102846B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050192795A1 (en) * 2004-02-26 2005-09-01 Lam Yin H. Identification of the presence of speech in digital audio data
CN105719661A (en) * 2016-01-29 2016-06-29 西安交通大学 Automatic discrimination method for playing timbre of string instrument
CN105788592A (en) * 2016-04-28 2016-07-20 乐视控股(北京)有限公司 Audio classification method and apparatus thereof
CN106919662A (en) * 2017-02-14 2017-07-04 复旦大学 A kind of music recognition methods and system
US10403303B1 (en) * 2017-11-02 2019-09-03 Gopro, Inc. Systems and methods for identifying speech based on cepstral coefficients and support vector machines
CN111063338A (en) * 2018-09-29 2020-04-24 阿里巴巴集团控股有限公司 Audio signal identification method, device, equipment, system and storage medium
CN111128229A (en) * 2019-08-05 2020-05-08 上海海事大学 Voice classification method and device and computer storage medium
CN111354375A (en) * 2020-02-25 2020-06-30 咪咕文化科技有限公司 Cry classification method, device, server and readable storage medium
CN111179975A (en) * 2020-04-14 2020-05-19 深圳壹账通智能科技有限公司 Voice endpoint detection method for emotion recognition, electronic device and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205829A (en) * 2021-04-29 2021-08-03 上海智大电子有限公司 Method and system for comprehensively monitoring running state of equipment
CN113689863A (en) * 2021-09-24 2021-11-23 广东电网有限责任公司 Voiceprint feature extraction method, device, equipment and storage medium
CN113689863B (en) * 2021-09-24 2024-01-16 广东电网有限责任公司 Voiceprint feature extraction method, voiceprint feature extraction device, voiceprint feature extraction equipment and storage medium
CN114449339A (en) * 2022-02-16 2022-05-06 深圳万兴软件有限公司 Background sound effect conversion method and device, computer equipment and storage medium
CN114449339B (en) * 2022-02-16 2024-04-12 深圳万兴软件有限公司 Background sound effect conversion method and device, computer equipment and storage medium
CN116597856A (en) * 2023-07-18 2023-08-15 山东贝宁电子科技开发有限公司 Voice quality enhancement method based on frogman intercom
CN116597856B (en) * 2023-07-18 2023-09-22 山东贝宁电子科技开发有限公司 Voice quality enhancement method based on frogman intercom
CN116758934A (en) * 2023-08-18 2023-09-15 深圳市微克科技有限公司 Method, system and medium for realizing intercom function of intelligent wearable device
CN116758934B (en) * 2023-08-18 2023-11-07 深圳市微克科技有限公司 Method, system and medium for realizing intercom function of intelligent wearable device

Also Published As

Publication number Publication date
CN112102846B (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN112102846B (en) Audio processing method and device, electronic equipment and storage medium
CN109087669B (en) Audio similarity detection method and device, storage medium and computer equipment
CN111933115B (en) Speech recognition method, apparatus, device and storage medium
JP2019212288A (en) Method and device for outputting information
CN110600014B (en) Model training method and device, storage medium and electronic equipment
CN111508511A (en) Real-time sound changing method and device
WO2023222088A1 (en) Voice recognition and classification method and apparatus
CN115602165B (en) Digital employee intelligent system based on financial system
CN112071330A (en) Audio data processing method and device and computer readable storage medium
CN111951823A (en) Audio processing method, device, equipment and medium
CN114783459B (en) Voice separation method and device, electronic equipment and storage medium
CN114338623B (en) Audio processing method, device, equipment and medium
CN108322770A (en) Video frequency program recognition methods, relevant apparatus, equipment and system
CN113823303A (en) Audio noise reduction method and device and computer readable storage medium
CN116825123A (en) Tone quality optimization method and system based on audio push
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN116312570A (en) Voice noise reduction method, device, equipment and medium based on voiceprint recognition
CN117059068A (en) Speech processing method, device, storage medium and computer equipment
CN114386406B (en) Text processing method, device, computer equipment and storage medium
CN114333891A (en) Voice processing method and device, electronic equipment and readable medium
CN114333874A (en) Method for processing audio signal
CN113571063A (en) Voice signal recognition method and device, electronic equipment and storage medium
Elbaz et al. End to end deep neural network frequency demodulation of speech signals
CN116741193B (en) Training method and device for voice enhancement network, storage medium and computer equipment
CN112750456A (en) Voice data processing method and device in instant messaging application and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40036245

Country of ref document: HK

GR01 Patent grant