CN110880329A - Audio identification method and equipment and storage medium - Google Patents


Info

Publication number
CN110880329A
CN110880329A
Authority
CN
China
Prior art keywords
audio
detected
frame
audio data
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811038406.3A
Other languages
Chinese (zh)
Other versions
CN110880329B (en)
Inventor
陈均
赵旭峰
沈锦龙
樊征
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201811038406.3A
Publication of CN110880329A
Application granted
Publication of CN110880329B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Abstract

The invention provides an audio recognition method, an audio recognition device, and a storage medium. The audio recognition method comprises the following steps: acquiring reference audio data and audio data to be detected; performing validity detection on the reference audio data and the audio data to be detected, and extracting valid reference audio data and valid audio data to be detected, wherein validity denotes the portion of the audio data that carries the most information; extracting Mel-frequency cepstral coefficient features from the valid reference audio data and the valid audio data to be detected to obtain reference audio features and audio features to be detected; performing time matching on the reference audio features and the audio features to be detected; and, when the reference audio features and the audio features to be detected match in time, comparing their similarity through feature matching and recognizing the reference audio data and the audio data to be detected according to the comparison result.

Description

Audio identification method and equipment and storage medium
Technical Field
The present invention relates to audio processing technologies in the field of computer applications, and in particular, to an audio recognition method and apparatus, and a storage medium.
Background
With the development of the information society and the spread of Internet technology, digitized audio content pervades daily life, and the information carried in audio (e.g., sound) is being explored and exploited ever more deeply.
Most existing research on audio content concerns audio classification, audio retrieval, audio (speech) recognition, and similar tasks, none of which can be implemented in practice without comparing audio similarity. Specifically, features such as the waveform, envelope, and zero-crossing rate can be extracted from the audio data and compared against standard audio features, and similarity is then judged by comparing the result against a similarity threshold.
However, feature extraction and processing in current audio processing methods are simplistic, and the handling of feature parameters directly affects the accuracy of similarity determination; such simple, one-dimensional feature extraction and processing algorithms therefore yield poor robustness in similarity determination.
Disclosure of Invention
Embodiments of the present invention provide an audio recognition method, an audio recognition device, and a storage medium, which can improve the accuracy of feature extraction and thereby improve the robustness of similarity calculation.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides an audio identification method, which comprises the following steps:
acquiring reference audio data and audio data to be detected;
performing validity detection on the reference audio data and the audio data to be detected, and extracting valid reference audio data and valid audio data to be detected, wherein validity denotes the portion of the audio data that carries the most information;
extracting Mel-frequency cepstral coefficient features from the valid reference audio data and the valid audio data to be detected to obtain reference audio features and audio features to be detected;
performing time matching on the reference audio features and the audio features to be detected;
and, when the reference audio features and the audio features to be detected match in time, comparing the similarity of the reference audio features and the audio features to be detected through feature matching, and recognizing the reference audio data and the audio data to be detected according to the comparison result.
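The steps above can be sketched as a toy pipeline. Everything below is a hedged illustration, not the patented implementation: the function names are invented, `toy_features` stands in for the MFCC extraction the claim specifies, and the validity step here uses a simple per-frame peak-amplitude threshold rather than the intensity measure detailed later in the specification.

```python
def trim_invalid(frames, threshold):
    # Validity detection (simplified): drop leading/trailing frames whose
    # peak amplitude is below the threshold, keeping the information-rich run.
    keep = [max(abs(s) for s in f) >= threshold for f in frames]
    if not any(keep):
        return []
    first = keep.index(True)
    last = len(keep) - 1 - keep[::-1].index(True)
    return frames[first:last + 1]

def toy_features(frame):
    # Stand-in for MFCC extraction: any per-frame feature vector
    # illustrates the shape of the pipeline.
    return (sum(abs(s) for s in frame) / len(frame),)

def recognize(ref_frames, cand_frames, amplitude_threshold=0.1,
              frame_diff_limit=2, feature_tolerance=0.05):
    ref = [toy_features(f) for f in trim_invalid(ref_frames, amplitude_threshold)]
    cand = [toy_features(f) for f in trim_invalid(cand_frames, amplitude_threshold)]
    # Time matching: clips must have a comparable number of valid frames.
    if abs(len(ref) - len(cand)) > frame_diff_limit:
        return 0.0
    # Feature matching: fraction of aligned frames whose features agree.
    n = min(len(ref), len(cand))
    hits = sum(1 for i in range(n)
               if all(abs(a - b) <= feature_tolerance
                      for a, b in zip(ref[i], cand[i])))
    return hits / n if n else 0.0
```

The return value plays the role of the comparison result: a clip that fails time matching scores 0.0, and otherwise the score is the fraction of matching frames.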
An embodiment of the present invention provides an audio recognition device, including:
the acquisition unit is used for acquiring reference audio data and audio data to be detected;
the validity detection unit is configured to perform validity detection on the reference audio data and the audio data to be detected and to extract valid reference audio data and valid audio data to be detected, wherein validity denotes the portion of the audio data that carries the most information;
the feature extraction unit is configured to perform Mel-frequency cepstral coefficient feature extraction on the valid reference audio data and the valid audio data to be detected to obtain reference audio features and audio features to be detected;
the matching unit is configured to perform time matching on the reference audio features and the audio features to be detected;
and the similarity recognition unit is configured to, when the reference audio features and the audio features to be detected match in time, compare the similarity of the reference audio features and the audio features to be detected through feature matching, and to recognize the reference audio data and the audio data to be detected according to the comparison result.
An embodiment of the present invention provides an audio recognition device, including:
a processor, a memory, and a communication bus over which the processor and the memory communicate;
the memory is configured to store executable audio recognition instructions;
the processor is configured to implement the audio recognition method provided by the embodiments of the present invention when executing the executable audio recognition instructions stored in the memory.
The embodiment of the invention provides a computer-readable storage medium, which stores executable audio identification instructions and is used for causing a processor to execute the audio identification method provided by the embodiment of the invention.
The embodiment of the invention has the following beneficial effects:
According to the audio recognition method provided by the embodiments of the present invention, reference audio data and audio data to be detected are acquired; validity detection is performed on both, and valid reference audio data and valid audio data to be detected are extracted, wherein validity denotes the portion of the audio data that carries the most information; Mel-frequency cepstral coefficient features are extracted from the valid reference audio data and the valid audio data to be detected to obtain reference audio features and audio features to be detected; the reference audio features and the audio features to be detected are matched in time; and, when they match in time, their similarity is compared through feature matching and the reference audio data and the audio data to be detected are recognized according to the comparison result.
With this technical solution, the audio recognition device first performs validity detection on the audio data (both the reference audio data and the audio data to be detected) and keeps only the portion carrying the most information, which removes the unnecessary data redundancy introduced by leading or trailing silence. During feature processing, time-based matching is performed first, so that only audio data to be detected whose timing matches is passed on to similarity processing. This screen-first approach improves the accuracy of feature processing before similarity is computed, raising the accuracy of feature extraction and thereby improving the robustness of the similarity calculation.
Drawings
FIG. 1 is a schematic diagram of an alternative architecture of an audio recognition system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an alternative structure of an audio recognition device provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of an alternative structure of an audio recognition device according to an embodiment of the present invention;
FIG. 4 is a first flowchart of an alternative audio recognition method according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart illustrating an exemplary audio recognition method provided by an embodiment of the invention;
FIG. 6 is an alternative flow diagram of an exemplary time domain process of an audio recognition method provided by an embodiment of the invention;
FIG. 7 is a schematic flow chart of an exemplary MFCC feature extraction of an audio recognition method provided by an embodiment of the present invention;
FIG. 8 is a schematic diagram of an alternative flow chart of an audio recognition method according to an embodiment of the present invention;
FIG. 9 is a schematic diagram illustrating an alternative flow chart of an audio recognition method according to an embodiment of the present invention;
FIG. 10 is an alternative flowchart of an exemplary validity check of an audio recognition method provided by an embodiment of the present invention;
FIG. 11 is a fourth alternative flow chart of the audio recognition method according to the embodiment of the present invention;
FIG. 12 is a schematic diagram illustrating an alternative flow chart of an audio recognition method according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of an alternative scenario of an audio recognition method according to an embodiment of the present invention;
FIG. 14 is a first schematic diagram of an alternative terminal interface of the audio recognition method according to the embodiment of the present invention;
FIG. 15 is a schematic diagram of an alternative terminal interface of the audio recognition method according to the embodiment of the present invention;
FIG. 16 is a schematic diagram of an alternative terminal interface of the audio recognition method according to the embodiment of the present invention;
FIG. 17 is a schematic diagram of an alternative terminal interface of the audio recognition method according to the embodiment of the present invention;
FIG. 18 is a schematic diagram of an alternative terminal interface of the audio recognition method according to the embodiment of the present invention;
FIG. 19 is a time domain diagram of alternative original reference audio data of the audio recognition method provided by the embodiment of the invention;
FIG. 20 is a time domain diagram of alternative original audio data to be detected of the audio recognition method according to the embodiment of the present invention;
FIG. 21 is a schematic diagram of an alternative RMS energy diagram for an audio recognition method according to an embodiment of the invention;
FIG. 22 is a schematic time domain diagram of alternative valid reference audio data for the audio recognition method provided by the embodiment of the invention;
FIG. 23 is a schematic time domain diagram of alternative valid audio data to be detected of the audio recognition method according to the embodiment of the present invention;
FIG. 24 is a diagram of an alternative reference audio feature spectrum of the audio recognition method provided by an embodiment of the invention;
FIG. 25 is a schematic diagram of an alternative frequency spectrum of an audio feature to be detected in the audio recognition method according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention; all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
It should be noted that the terms "first", "second", and the like in the embodiments of the present invention are used to distinguish similar objects and do not imply a particular ordering. Where permissible, "first", "second", and the like may exchange their specific order or sequence, so that the embodiments of the present invention described herein can be implemented in an order other than the one illustrated or described.
Before the embodiments of the present invention are described in further detail, the terms and expressions used in the embodiments are explained; the following explanations apply to these terms and expressions throughout.
1) Feature extraction: converting raw features into a set of features with clear physical meaning (e.g., Gabor features; geometric features such as corner points and invariants; texture features such as LBP and HOG), statistical meaning, or kernels. In the embodiments of the present invention, feature extraction refers to extracting, from audio data, the feature quantities that carry the important audio information.
2) Windowing: truncating a conceptually infinite signal with a finite-length window function so that it becomes a finite-length signal that a computer can process. In the embodiments of the present invention, windowing is applied before the fast Fourier transform: because the transform implicitly assumes the audio data is periodic and infinite, a portion of the audio data is cut out as the data to be processed, which reduces spectral distortion and leakage in the subsequent transform. Windowing thus acts as a correction to the audio data and improves the accuracy of data processing.
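As an illustration of the windowing just described (not code from the patent), the sketch below applies a Hann window, a common choice that tapers a frame toward zero at both ends before the FFT to reduce spectral leakage:

```python
import math

def hann_window(n):
    # Hann coefficients 0.5 - 0.5*cos(2*pi*i/(n-1)): the taper reaches
    # zero at both ends of the frame and peaks at 1.0 in the middle.
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def apply_window(frame):
    # Multiply the frame sample-by-sample by the window before the FFT.
    return [s * c for s, c in zip(frame, hann_window(len(frame)))]
```

The windowed frame, rather than the raw frame, is then handed to the fast Fourier transform.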
3) Framing: because a speech signal (i.e., audio data) is only quasi-stationary, it is usually divided into frames of about 10 ms to 30 ms during processing; within such a frame the signal can be treated as stationary. In speech signal processing, framing reduces the influence of the non-stationarity and time variation of the signal as a whole.
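A minimal framing sketch, assuming PCM samples as a plain list; the 25 ms frame length and 10 ms hop are conventional defaults within the 10-30 ms range cited above, not values taken from the patent:

```python
def frame_signal(samples, sample_rate, frame_ms=25, hop_ms=10):
    # Split the sample stream into short, overlapping, quasi-stationary
    # frames: frame_ms samples per frame, advancing hop_ms between frames.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]
```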
An exemplary application of the audio recognition device implementing an embodiment of the present invention is described below. The audio recognition device provided by the embodiments of the present invention may be implemented as a terminal, where the terminal may be any of various types of user terminals that run application functions, such as a mobile phone, a computer, a digital broadcast terminal, an information transceiver, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant; it may also be implemented as a server, such as a service server corresponding to the terminal running the application function. The following describes an exemplary application, covering a terminal, for the case where the audio recognition device is implemented as a server.
Referring to fig. 1, fig. 1 is an alternative architecture diagram of an audio recognition system 100 according to an embodiment of the present invention. To support an exemplary application, a terminal 400 (exemplary terminals 400-1 and 400-2 are shown) is connected to a server 300 through a network 200, where the network 200 may be a wide area network, a local area network, or a combination of the two, and uses wireless links for data transmission.
The terminal 400 is configured to receive the reference audio data and the audio data to be detected, send them to the corresponding server for audio recognition, and implement an application function according to the recognition result returned by the server. That is, the terminal provides graphical interfaces for various applications, displayed on a graphical interface 410 (for example, graphical interface 410-1 and graphical interface 410-2), to implement different application functions, which may include entertainment functions involving voice recognition, such as a voice red-envelope function or a karaoke scoring function. The server 300 is configured to acquire the reference audio data and the audio data to be detected; perform validity detection on them and extract valid reference audio data and valid audio data to be detected, wherein validity denotes the portion of the audio data that carries the most information; extract Mel-frequency cepstral coefficient features from the valid reference audio data and the valid audio data to be detected to obtain reference audio features and audio features to be detected; match the reference audio features and the audio features to be detected in time; and, when they match in time, compare their similarity through feature matching and recognize the reference audio data and the audio data to be detected according to the comparison result, returning the recognition result to the terminal so that the terminal can implement the application function.
The audio identification device provided by the embodiment of the present invention may be implemented in hardware or a combination of hardware and software, and various exemplary implementations of the audio identification device provided by the embodiment of the present invention are described below.
Referring to fig. 2, fig. 2 is a schematic diagram of an alternative structure of a server 300 according to an embodiment of the present invention, where the server 300 is a service server corresponding to the application function run by a terminal such as a mobile phone, a computer, a digital broadcast terminal, an information transceiver, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant. An exemplary structure of the audio recognition device when implemented as a terminal can be envisaged from the structure of the server 300, so the structure described here should not be considered limiting; for example, some components described below may be omitted, or components not described below may be added to meet the special requirements of certain applications.
The server 300 shown in fig. 2 includes: at least one processor 310, a memory 340, at least one network interface 320, and a user interface 330. The components of the server 300 are coupled together by a communication bus 350, which enables connection and communication between them. Besides a data bus, the communication bus 350 includes a power bus, a control bus, and a status signal bus; for clarity of illustration, however, the various buses are all labeled as communication bus 350 in fig. 2.
The user interface 330 may include a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad or touch screen, and the like.
The memory 340 may be volatile memory, non-volatile memory, or a combination of both. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), flash memory, or the like. The volatile memory may be a random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM) and synchronous static random access memory (SSRAM). The memory 340 described in the embodiments of the present invention is intended to comprise these and any other suitable types of memory.
Memory 340 in embodiments of the present invention is capable of storing data to support the operation of server 300. Examples of such data include: any computer program for operating on the server 300, such as an operating system and application programs. The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application program may include various application programs.
As an example of implementing the method provided by the embodiments of the present invention with a combination of hardware and software, the method may be directly embodied as a combination of software modules executed by the processor 310. The software modules may be located in a storage medium in the memory 340; the processor 310 reads the executable audio recognition instructions included in the software modules in the memory 340 and, together with the necessary hardware (for example, the processor 310 and the other components connected to the communication bus 350), completes the audio recognition method provided by the embodiments of the present invention.
By way of example, the Processor 310 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor or the like.
That is, exemplarily, an embodiment of the present invention provides an audio recognition apparatus, including at least:
a processor 310, a memory 340, and a communication bus 350 through which the processor 310 and the memory 340 communicate;
the memory 340 for storing executable audio recognition instructions;
the processor 310 is configured to implement the audio recognition method provided by the embodiment of the present invention when executing the executable audio recognition instruction stored in the memory 340.
An exemplary structure of software modules is described below, and in some embodiments, as shown in FIG. 3, the software modules in server 300 may include: the device comprises an acquisition unit 10, an effectiveness detection unit 11, a feature extraction unit 12, a matching unit 13 and a similarity identification unit 14; wherein the content of the first and second substances,
an acquiring unit 10, configured to acquire reference audio data and audio data to be detected;
the validity detection unit 11 is configured to perform validity detection on the reference audio data and the audio data to be detected and to extract valid reference audio data and valid audio data to be detected, wherein validity denotes the portion of the audio data that carries the most information;
a feature extraction unit 12, configured to perform mel-frequency cepstrum coefficient feature extraction on the effective reference audio data and the effective audio data to be detected to obtain a reference audio feature and an audio feature to be detected;
the matching unit 13 is configured to perform time matching on the reference audio features and the audio features to be detected;
and the similarity recognition unit 14 is configured to, when the reference audio feature and the audio feature to be detected are temporally matched, implement similarity comparison based on the reference audio feature and the audio feature to be detected through feature matching, and implement audio recognition of the reference audio data and the audio data to be detected according to a comparison result.
In some embodiments of the present invention, the obtaining unit 10 is specifically configured to obtain original reference audio data and original audio data to be detected; and performing time domain processing on the original reference audio data and the original audio data to be detected to obtain the discrete reference audio data and the discrete audio data to be detected.
In some embodiments of the present invention, the validity detection unit 11 is specifically configured to calculate an audio intensity for each frame of the reference audio data and of the audio data to be detected; truncate, according to the per-frame audio intensity of the reference audio data, the audio data in a first specific frame of the reference audio data whose audio intensity is below a preset audio intensity threshold, obtaining the valid reference audio data; and truncate, according to the per-frame audio intensity of the audio data to be detected, the audio data in a second specific frame of the audio data to be detected whose audio intensity is below the preset audio intensity threshold, obtaining the valid audio data to be detected;
wherein the first specific frame and the second specific frame are each the first frame and/or the last frame of the respective audio data.
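The embodiment above can be sketched as follows. The RMS-in-dB intensity measure and the -40 dB threshold are illustrative assumptions (the specification only speaks of a per-frame audio intensity and a preset threshold), and the trimming removes below-threshold frames from the start and the end, matching the "first frame and/or last frame" wording:

```python
import math

def frame_intensity_db(frame, eps=1e-12):
    # Per-frame audio intensity as RMS energy in dB (0 dB = full scale).
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20 * math.log10(max(rms, eps))

def truncate_silence(frames, threshold_db=-40.0):
    # Drop leading and trailing frames whose intensity is below the
    # threshold, keeping the contiguous information-rich middle.
    levels = [frame_intensity_db(f) for f in frames]
    first = 0
    while first < len(frames) and levels[first] < threshold_db:
        first += 1
    last = len(frames)
    while last > first and levels[last - 1] < threshold_db:
        last -= 1
    return frames[first:last]
```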
In some embodiments of the present invention, the feature extraction unit 12 is specifically configured to perform Mel-frequency cepstral coefficient feature extraction on each frame of the valid reference audio data and of the valid audio data to be detected, obtaining per-frame valid reference audio features and per-frame valid audio features to be detected; sort each frame's valid reference audio features and valid audio features to be detected by audio intensity, obtaining the N-dimensional Mel-frequency cepstral coefficient reference audio features and the N-dimensional Mel-frequency cepstral coefficient audio features to be detected with the highest audio intensity; and take the N-dimensional Mel-frequency cepstral coefficient reference audio features corresponding to each frame of the valid reference audio data as the reference audio features, and the N-dimensional Mel-frequency cepstral coefficient audio features to be detected corresponding to each frame of the valid audio data to be detected as the audio features to be detected.
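One possible reading of this embodiment, sketched below under that stated assumption: each frame's already-computed MFCC vector (e.g., from librosa or python_speech_features) is ranked by coefficient magnitude, and the N strongest coefficients are kept as the frame's N-dimensional feature. The function names are invented:

```python
def top_n_coefficients(mfcc_frame, n):
    # Rank one frame's cepstral coefficients by magnitude and keep
    # the N strongest (strongest first).
    return sorted(mfcc_frame, key=abs, reverse=True)[:n]

def select_features(mfcc_frames, n):
    # Apply the per-frame selection across all frames of a clip.
    return [top_n_coefficients(f, n) for f in mfcc_frames]
```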
In some embodiments of the present invention, the matching unit 13 is specifically configured to obtain a first frame count of the reference audio features and a second frame count of the audio features to be detected; compare the first frame count with the second frame count to obtain a frame-count difference; when the frame-count difference falls within the time-difference threshold range, determine that the reference audio features and the audio features to be detected match in time; and when the frame-count difference falls outside the time-difference threshold range, determine that they do not match in time.
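The frame-count comparison reduces to a few lines. Below is a hedged sketch; a symmetric threshold is assumed, whereas the patent only specifies a "time difference threshold range":

```python
def time_matched(ref_feats, cand_feats, max_frame_diff):
    # Frame counts stand in for durations: with a fixed hop size, the
    # number of feature frames is proportional to the clip length.
    return abs(len(ref_feats) - len(cand_feats)) <= max_frame_diff
```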
In some embodiments of the present invention, the similarity recognition unit 14 is specifically configured to perform feature matching on the reference audio feature and the audio feature to be detected according to a frame level to obtain a feature matching degree; and obtaining the similarity based on the feature matching degree and a preset similarity model.
In some embodiments of the present invention, the similarity recognition unit 14 is further specifically configured to obtain the (i-M)-th through (i+L)-th frames of the audio features to be detected and the i-th frame of the reference audio features, wherein M is an integer greater than or equal to 0, i-M is greater than or equal to 1, i-M is less than i+L, L is an integer greater than or equal to 1, and i is greater than or equal to 1 and less than or equal to the number of frames of the audio features to be detected or of the reference audio features; search the (i-M)-th through (i+L)-th frames of the audio features to be detected for a target audio feature to be detected that matches the i-th frame of the reference audio features; and record the i-th matching result and proceed to matching the (i+1)-th frame of the audio features to be detected against the (i+1)-th frame of the reference audio features, until the audio features to be detected have been fully matched, recording the matching results of all frames to obtain the feature matching degree.
In some embodiments of the present invention, the similarity recognition unit 14 is further configured to, after searching whether there is a target audio feature to be detected that matches the i-th frame reference audio feature from the i-M-th frame audio feature to be detected to the i + L-th frame audio feature, and before recording an i-th matching result, characterize that the i-th frame audio feature to be detected matches the i-th frame reference audio feature when the target audio feature to be detected exists, and the i-th matching result is matching; or, when the target audio features to be detected do not exist, the ith frame audio features to be detected are represented to be not matched with the ith frame reference audio features, and the ith matching result is not matched.
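As a sketch of the frame-level search just described, in which the i-M th to i+L th frames of the features under test are scanned for a match with the i-th reference frame and the per-frame results accumulate into a feature matching degree, the following is illustrative only; the Euclidean-distance criterion, the threshold value, and the function name are assumptions, since the embodiment does not fix a particular per-frame matching test:

```python
import numpy as np

def frame_match_degree(ref_feats, det_feats, M=2, L=2, dist_threshold=1.0):
    """For each reference frame i, search frames i-M .. i+L of the features
    under test for one whose Euclidean distance to the reference frame falls
    below a threshold (an assumed criterion); record a per-frame match
    result; the fraction of matched frames is the feature matching degree."""
    n = min(len(ref_feats), len(det_feats))
    matches = []
    for i in range(n):
        lo = max(0, i - M)                       # clamp window start to frame 1
        hi = min(len(det_feats), i + L + 1)      # clamp window end to last frame
        window = det_feats[lo:hi]
        dists = np.linalg.norm(window - ref_feats[i], axis=1)
        matches.append(bool(np.any(dists < dist_threshold)))
    return sum(matches) / n
```

Identical feature sequences yield a matching degree of 1.0, and completely dissimilar ones yield 0.0; the matching degree would then be fed, together with the preset weight database, into the similarity model.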
In some embodiments of the present invention, the similarity identification unit 14 is further specifically configured to obtain a preset weight database, where the preset weight database corresponds to the feature matching degree; and inputting the feature matching degree and the preset weight database into the preset similarity model, and outputting the similarity.
In some embodiments of the present invention, the similarity identification unit is further configured to, after the time matching is performed on the reference audio feature and the audio feature to be detected, identify that the audio to be detected does not match the reference audio data when the reference audio feature and the audio feature to be detected do not match in time.
As an example of the audio recognition method provided by the embodiment of the present invention implemented by hardware, the method provided by the embodiment of the present invention may be directly implemented by the processor 310 in the form of a hardware decoding processor, for example, the method provided by the embodiment of the present invention may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable gate arrays (FPGAs), or other electronic components.
An audio recognition method implementing an embodiment of the present invention will be described below in conjunction with the foregoing exemplary application and implementation of an audio recognition device implementing an embodiment of the present invention.
Referring to fig. 4, fig. 4 is an alternative flow chart of the audio recognition method according to the embodiment of the present invention, which will be described with reference to the steps shown in fig. 4.
S101, acquiring reference audio data and audio data to be detected.
S102, validity detection is carried out on the reference audio data and the audio data to be detected, and effective reference audio data and effective audio data to be detected are intercepted, where validity characterizes the continuous portion of the audio data that carries the most information.
S103, extracting the Mel frequency cepstrum coefficient characteristics of the effective reference audio data and the effective audio data to be detected to obtain the reference audio characteristics and the audio characteristics to be detected.
And S104, performing time matching on the reference audio features and the audio features to be detected.
And S105, when the reference audio features and the audio features to be detected are matched in time, realizing similarity comparison based on the reference audio features and the audio features to be detected through feature matching, and realizing audio identification of the reference audio data and the audio data to be detected according to a comparison result.
In the embodiment of the present invention, as shown in fig. 5, an audio identification device reads and processes original reference audio data and original audio data to be detected, so as to realize audio data acquisition, obtain reference audio data and audio data to be detected, perform validity detection on the reference audio data and the audio data to be detected, obtain effective reference audio data and effective audio data to be detected, extract mel-frequency cepstrum coefficient features from the effective reference audio data and the effective audio data to be detected, process the features, calculate time matching and feature matching degrees of the original reference audio data and the original audio data to be detected, and calculate similarity, so as to realize audio identification.
In S101, the audio identification device may acquire original reference audio data (as shown in fig. 19) and original audio data to be detected (as shown in fig. 20) through a user operating the terminal, where the acquisition process of the original reference audio data and the original audio data to be detected is acquisition of the audio data. The original reference audio data is input audio data used for performing audio identification reference, and the original audio data to be detected is input data to be detected, which is used for performing comparison matching with the reference audio data and realizing a certain application function through an identification result.
That is to say, in the embodiment of the present invention, the audio identification device may acquire original reference audio data and original audio data to be detected; the audio identification equipment performs time domain processing on original reference audio data and original audio data to be detected to obtain discrete reference audio data and discrete audio data to be detected, namely the audio identification equipment obtains the reference audio data and the audio data to be detected.
In the embodiment of the present invention, the specific processing of the audio identification device performing time domain processing on the original reference audio data and the original audio data to be detected may be that the audio identification device performs sampling, framing and windowing on the original reference audio data and the original audio data to be detected, and outputs a discrete time domain audio signal amplitude sequence (for example, the reference audio data and the audio data to be detected).
It should be noted that, before the speech signals (i.e. the original audio data, such as the original reference audio data and the original audio data to be detected) are analyzed and processed, pre-processing operations such as pre-emphasis, framing and windowing are necessary. The aim of these operations is to eliminate the influence on speech quality of aliasing, higher harmonic distortion, high-frequency components and other factors introduced by the human vocal organs and by the equipment that acquires the speech signal, so that the signal obtained by subsequent speech processing is as uniform and smooth as possible and provides high-quality parameters for signal parameter extraction, thereby improving the speech processing quality. Framing, which runs throughout the entire audio analysis process, is a "short-time analysis" technique. A speech signal is time-varying, but within a short range (generally considered to be 10-30 ms) its characteristics remain substantially unchanged, i.e., relatively stable, so it can be regarded as a quasi-steady-state process; in other words, the speech signal has short-time stationarity. Any analysis and processing of a speech signal must therefore be built on "short-time" analysis: the signal is segmented and the characteristic parameters of each segment are analyzed, where each segment is called a "frame" and the frame length is typically 10-30 ms. For the whole speech signal, the analyzed time sequence of characteristic parameters is thus composed of the characteristic parameters of each frame. The purpose of windowing is to let the amplitude of a frame of the speech signal taper to 0 at both ends. This tapering benefits the Fourier transform and can improve the resolution of the transformed result (i.e. the spectrum).
The cost of windowing is that the portions at the two ends of a frame signal are attenuated and no longer carry the same weight as the central portion. The embodiment of the invention compensates for this cost by taking frames that overlap one another during framing, rather than taking them back to back. The time difference between the start positions of two adjacent frames is called the frame shift; the frame shift adopted in the embodiment of the present invention is half of the frame length, and may also be fixed at 10 milliseconds.
In some embodiments of the present invention, when the original audio data to be detected and the original reference audio data are received through a microphone during audio acquisition, the number of receiving channels needs to be set; a speech signal is received as a mono channel, and a music signal is received as a dual (stereo) channel.
In the embodiment of the present invention, the original reference audio data and the original audio data to be detected may be audio files in the WAV format. When such a file is acquired by the audio identification device, the WAV file needs to be parsed and then read according to the sampling rate (single-channel reading is assumed here; each channel of a multi-channel file is processed in exactly the same way). The sampling rates of current mainstream audio acquisition are concentrated between 8000 Hz and 16000 Hz; the embodiment of the invention may adopt a sampling rate of 44100 Hz, which satisfies the Nyquist sampling theorem for the full audible band and therefore gives high accuracy.
For example, the audio identification device may perform audio data acquisition on a plurality of sampling points of the original reference audio data and the original audio data to be detected and then perform framing, taking an audio signal (audio data) of 512 sampling points as one frame, so that the duration of each frame is about 11.6 ms. The framing rule is that the frame shift between two adjacent frames is half the frame length, and a final segment whose signal length is less than one frame is still counted as one frame. After framing is finished, a window function is applied to each frame of the audio signal, yielding a discrete time-domain audio-signal amplitude sequence (that is, the original reference audio data and the original audio data to be detected are respectively sampled, framed and windowed to obtain the reference audio data and the audio data to be detected).
In this embodiment of the present invention, the window function may be a Hamming window, a rectangular window, a triangular window, a Blackman window, a Kaiser window, or the like; the embodiment of the present invention is not limited thereto.
Illustratively, the hamming window function may be formula (1):
w(k) = 0.54 - 0.46·cos(2πk/(K - 1)), 0 ≤ k ≤ K - 1    formula (1)
where K is the length of the window function and w (K) is a representation of the window function.
In the embodiment of the present invention, the audio identification device may multiply the audio signal after being framed by a hamming window function, and output a final discrete time domain audio signal amplitude sequence.
For example, as shown in fig. 6, for the time-domain processing of the original audio data, assume that the original audio data is a WAV audio file, the sampling rate is 44.1 kHz, 512 sampling points form one frame, and a Hamming window is adopted as the window function. The process by which the audio recognition device performs time-domain processing on the original audio data (e.g., the original reference audio data and the original audio data to be detected) may then be: sample the WAV audio file at a sampling rate of 44.1 kHz, divide every 512 sampling points into one frame with a frame shift of half the frame length, and finally multiply each frame by the Hamming window function to obtain the discrete time-domain audio-signal amplitude sequence.
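As one possible illustration of the sampling, framing and windowing flow just described (512-point frames, half-frame shift, Hamming window), a minimal NumPy sketch might look as follows; the function name and the zero-padding of the trailing partial frame are assumptions:

```python
import numpy as np

def frame_and_window(samples, frame_len=512):
    """Split a 1-D amplitude sequence into frames of frame_len points with a
    frame shift of half the frame length, count a trailing partial segment as
    one (zero-padded) frame, and apply a Hamming window to each frame."""
    hop = frame_len // 2  # frame shift = half the frame length
    n_frames = max(1, int(np.ceil((len(samples) - frame_len) / hop)) + 1)
    padded = np.zeros(frame_len + (n_frames - 1) * hop)
    padded[:len(samples)] = samples
    # Hamming window: w(k) = 0.54 - 0.46*cos(2*pi*k/(K-1))
    window = np.hamming(frame_len)
    frames = np.stack([padded[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames
```

At a 44.1 kHz sampling rate each 512-point frame spans about 11.6 ms, matching the figure quoted in the text.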
In S102, after the audio identification device acquires the reference audio data and the audio data to be detected, the audio identification device may perform validity detection on the reference audio data and the audio data to be detected, and intercept effective reference audio data (as shown in fig. 22) and effective audio data to be detected (as shown in fig. 23), where the validity represents a characteristic of a continuous audio with most information in the audio data.
It should be noted that, in the embodiment of the present invention, the audio identification device may perform effective signal (i.e., validity) detection on the reference audio data and the audio data to be detected, detect a portion of an effective signal in the audio data, and cut off sampling points before and after an effective signal segment in the audio data, that is, only retain continuous audio sampling point data with the most information in the audio data.
It can be understood that, because the beginning and the end of the audio data may have the situations of blank and environmental noise with high probability, in the embodiment of the present invention, effective signal segments excluding the portions where the audio intensity is weak at the beginning and the end may be found through the audio intensity, so that the remaining effective audio data (for example, effective reference audio data and effective audio data to be detected) are continuous frame signals and are also the portions where the information amount is most accurate, thereby improving the accuracy of data processing.
In some embodiments of the present invention, the audio identification device may characterize or calculate the audio intensity of the audio data or the audio signal by a short-time average zero-crossing rate, a short-time energy, an energy entropy, a spectrum center, a spectrum spread, a spectrum entropy, a spectrum flux, a spectrum roll-off point, and the like.
In S103, after the audio identification device acquires the effective reference audio data and the effective audio data to be detected, in the embodiment of the present invention, the audio identification device only performs processing on the effective audio data, so that the processing amount and the accuracy of the audio data are improved, the audio identification device performs mel-frequency cepstrum coefficient feature extraction on the effective reference audio data and the effective audio data to be detected to obtain a reference audio feature and an audio feature to be detected, and performs matching and similarity calculation between features based on the reference audio feature and the audio feature to be detected.
In some embodiments of the present invention, the audio recognition device performs feature extraction by using Mel-Frequency Cepstrum coefficients, that is, obtains Mel-Frequency Cepstrum Coefficient (MFCC) features.
It should be noted that a person produces sound through the vocal tract, and the shape of the vocal tract, which is determined by the tongue, the teeth, and so on, determines what sound is produced. Moreover, the shape of the vocal tract shows up in the envelope of the short-time power spectrum of speech, and the MFCC feature is a feature that accurately describes this envelope. The human auditory system is a special nonlinear system whose sensitivity differs for signals of different frequencies. In terms of extracting audio features, the human auditory system can extract not only semantic information but also the personal features of a speaker; therefore, if an audio recognition system can simulate the characteristics of human auditory perception processing, the recognition rate of speech can be improved.
The Mel frequency cepstral coefficients take human auditory characteristics into consideration. As shown in fig. 7, the audio recognition device performs a Fast Fourier Transform (FFT) on the effective audio data (e.g., the effective reference audio data and the effective audio data to be detected) to obtain a linear spectrum, then maps the linear spectrum to a Mel nonlinear spectrum based on auditory perception, and finally converts it to a cepstrum. Specifically, the Mel nonlinear spectrum is obtained by passing the linear spectrum of the audio data through a set of Mel filters; cepstrum analysis can then be performed on the Mel nonlinear spectrum. In the embodiment of the present invention, the cepstrum analysis includes logarithmic processing and a Discrete Cosine Transform (DCT), and finally outputs the MFCC feature, which is the feature of one frame of audio data (e.g., the reference audio feature shown in fig. 24 and the audio feature to be detected shown in fig. 25). It should be noted that the feature extraction in the embodiment of the present invention is performed for each frame of audio data; in this way, the Mel frequency cepstrum coefficients simulate the characteristics of human auditory perception processing and improve the recognition rate of speech.
Illustratively, the process of passing a linear spectrum through a set of Mel-filters by an audio recognition device to obtain a Mel-nonlinear spectrum is represented by formula (2), a logarithmic operation is represented by formula (3), and a DCT transform is represented by formula (4):
log X[k] = log(Mel-Spectrum)    formula (2)
log X[k] = log H[k] + log E[k]    formula (3)
x[k] = h[k] + e[k]    formula (4)
where X[k] denotes the linear spectrum, H[k] denotes the envelope of the linear spectrum, E[k] denotes the details of the linear spectrum, x[k] denotes the Mel nonlinear spectrum, h[k] denotes the Mel frequency cepstral coefficients, i.e., the MFCC features, and e[k] denotes the details of the Mel spectrum.
In the embodiment of the present invention, the audio identification device may perform the mel-frequency cepstrum coefficient feature extraction on the effective reference audio data and the effective audio data to be detected, respectively, so as to obtain the reference audio feature and the audio feature to be detected.
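The FFT, Mel filter bank, logarithm and DCT flow described above can be sketched per frame as follows; the filter count, the number of output coefficients, and the triangular-filter construction are illustrative assumptions not fixed by this excerpt:

```python
import numpy as np

def mfcc_from_frame(frame, sample_rate=44100, n_filters=26, n_coeffs=13):
    """Per-frame MFCC sketch: FFT -> power spectrum -> triangular Mel
    filter bank -> log -> DCT-II. Constants 2595 and 700 in the Mel scale
    are the commonly used approximation, assumed here."""
    # FFT and power spectrum of one windowed frame
    power = np.abs(np.fft.rfft(frame)) ** 2
    n_bins = len(power)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter edge frequencies spaced evenly on the Mel scale, mapped to FFT bins
    mel_points = np.linspace(mel(0.0), mel(sample_rate / 2.0), n_filters + 2)
    bin_idx = np.floor((n_bins - 1) * inv_mel(mel_points)
                       / (sample_rate / 2.0)).astype(int)
    # Triangular band-pass filters rising to 1 at the centre bin
    fbank = np.zeros((n_filters, n_bins))
    for j in range(n_filters):
        left, centre, right = bin_idx[j], bin_idx[j + 1], bin_idx[j + 2]
        for k in range(left, centre):
            fbank[j, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[j, k] = (right - k) / max(right - centre, 1)
    # Logarithm of the filter-bank energies (epsilon avoids log(0))
    log_mel = np.log(fbank @ power + 1e-10)
    # DCT-II of the log Mel energies yields the cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1)
                 / (2 * n_filters))
    return dct @ log_mel
```

Applying this to every frame of the effective reference audio data and the effective audio data to be detected would yield the reference audio features and the audio features to be detected described in S103.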
In S104, the audio recognition device may perform time matching on the reference audio feature and the audio feature to be detected, which are obtained by extracting the mel-frequency cepstrum coefficient feature, and perform detailed feature matching or comparison after the time matching.
In the embodiment of the present invention, since the reference audio features and the audio features to be detected processed by the audio recognition device are framed data, that is, data in units of frames, the audio recognition device can obtain the first frame number of the reference audio features and the second frame number of the audio features to be detected. The first frame number is the total number of frames of the reference audio features, and the second frame number is the total number of frames of the audio features to be detected. The audio identification device compares the first frame number with the second frame number to obtain a frame-number difference. When the frame-number difference falls within the time-difference threshold range, the reference audio feature is characterized as matching the audio feature to be detected in time; when it does not, the reference audio feature is characterized as not matching the audio feature to be detected in time. That is to say, the audio recognition device determines the relative time difference between the audio feature to be detected and the reference audio feature, the time difference being represented by the frame-number difference. It can be understood that if the time difference between the two is too large, calculating the similarity between the two features is meaningless and the probability of mismatching is very high.
Illustratively, the relative time difference, i.e. the frame-number difference, can be obtained according to formula (5), wherein formula (5) is:
diff_time = |y - m| / max(y, m)    formula (5)
where diff_time is the relative time difference, y is the first frame number, and m is the second frame number.
In some embodiments of the present invention, the relative time-difference threshold range may be set to around 30%, although embodiments of the present invention are not limited thereto.
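Under the assumption that the relative frame-number difference of formula (5) is normalised by the larger of the two frame counts (the original formula image is not reproduced in this excerpt), the time-matching test might be sketched as:

```python
def time_matched(y, m, threshold=0.3):
    """Return True when the relative frame-number difference between the
    first frame number y and the second frame number m falls within the
    time-difference threshold range (30% here, per the text)."""
    diff_time = abs(y - m) / max(y, m)  # assumed normalisation
    return diff_time <= threshold
```

A pair of feature sequences failing this test would be reported as not matching without any similarity calculation, as S106 describes.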
In S105, when the reference audio feature and the audio feature to be detected are temporally matched, the reference audio feature and the audio feature to be detected may be further compared in terms of features, specifically, the similarity comparison based on the reference audio feature and the audio feature to be detected is realized through the feature matching, and the audio recognition of the reference audio data and the audio data to be detected is realized according to the comparison result.
In the embodiment of the present invention, the audio recognition device completes the implementation of the application function according to the audio recognition result after the audio recognition, and the description of the scene application will be performed in the subsequent embodiments.
It can be understood that, because the audio recognition device first detects the validity of the audio data (including the reference audio data and the audio data to be detected) and keeps only the part with the most information, unnecessary data redundancy caused by retaining blank segments can be reduced. In the feature processing of the audio data, time-based matching is carried out first, the audio data to be detected that matches in time is selected, and only then is the similarity processed. That is, screening comes first; after the accuracy of feature processing is improved, similarity processing and recognition are performed. This improves the accuracy of the extracted features and achieves the purpose of improving the robustness of the similarity calculation.
In some embodiments of the present invention, referring to fig. 8, fig. 8 is an optional flowchart of the audio recognition method provided in the embodiments of the present invention, and based on fig. 4, after S104, S106 may also be performed. The following were used:
and S106, when the reference audio features and the audio features to be detected are not matched in time, identifying that the audio to be detected is not matched with the reference audio data.
After the audio recognition device performs time matching on the reference audio features and the audio features to be detected, because the reference audio features and the audio features to be detected processed by the audio recognition device are data after framing, that is, the reference audio features and the audio features to be detected are data in units of frames, the audio recognition device can acquire a first frame number of the reference audio features and a second frame number of the audio features to be detected from the reference audio features and the audio features to be detected. The first frame number is the total frame number corresponding to the reference audio features, and the second frame number is the total frame number corresponding to the audio features to be detected. And the audio identification equipment compares the first frame number with the second frame number to obtain a frame number difference value. And when the frame number difference value belongs to the time difference threshold range, representing that the reference audio feature is matched with the audio feature to be detected in time. And when the frame number difference does not belong to the time difference threshold range, representing that the reference audio feature is not matched with the audio feature to be detected in time. 
That is to say, the audio recognition device may determine the relative time difference between the audio feature to be detected and the reference audio feature, the time difference being represented by the frame-number difference. It can be understood that, if the time difference between the audio feature to be detected and the reference audio feature is too large, it is meaningless to calculate the similarity between the two features, and the probability of mismatching is very large. Therefore, in the embodiment of the present invention, a time-difference threshold range is set: as long as the frame-number difference is within this range, the relative time difference between the audio feature to be detected and the reference audio feature can be considered very small, the two match in time, and further feature comparison can be performed. When the reference audio feature and the audio feature to be detected do not match in time, however, the device can directly identify that the audio to be detected does not match the reference audio data and prompt the user with the identification result, which simplifies the calculation process and improves the identification accuracy.
In some embodiments of the present invention, referring to fig. 9, fig. 9 is an optional flowchart of the audio recognition method provided in the embodiments of the present invention, and S102 shown in fig. 9 may be implemented by S1021 to S1023, which will be described with reference to the steps.
And S1021, calculating the audio intensity of each frame of data in the reference audio data and the audio data to be detected respectively.
And S1022, according to the audio intensity of each frame of data of the reference audio data, cutting off the audio data of which the audio intensity in the first specific frame is smaller than the preset audio intensity threshold value in the reference audio data, so as to obtain the effective reference audio data.
S1023, according to the audio intensity of each frame of data of the audio data to be detected, cutting off the audio data of which the audio intensity in a second specific frame is smaller than the preset audio intensity threshold value in the audio data to be detected, to obtain the effective audio data to be detected; wherein the first specific frame and the second specific frame are the foremost frame and/or the last frame in the respective audio data.
In the embodiment of the present invention, since the reference audio data and the audio data to be detected processed by the audio recognition device are both framed, that is, data in units of frames, the audio recognition device processes the data frame by frame. The audio identification device calculates the audio intensity of each frame of audio data, and the embodiment of the invention sets a preset audio intensity threshold for filtering the energy or information of the audio data. According to the audio intensity of each frame of the reference audio data, the audio identification device cuts off the audio data whose audio intensity in a first specific frame (i.e. the foremost frame and/or the last frame of the reference audio data) is less than the preset audio intensity threshold; that is, it cuts off the sampling points before and after the effective signal segment, keeping only the continuous audio sampling-point data with the most information in the reference audio data, and thus obtains the effective reference audio data. Likewise, according to the audio intensity of each frame of the audio data to be detected, it cuts off the audio data whose audio intensity in a second specific frame (i.e. the foremost frame and/or the last frame of the audio data to be detected) is less than the preset audio intensity threshold, keeping only the continuous audio sampling-point data with the most information in the audio data to be detected, and thus obtains the effective audio data to be detected. The first specific frame and the second specific frame are the foremost frame and/or the last frame of the respective audio data.
The first frame and the last frame may be one frame or multiple frames, which is not limited in the embodiment of the present invention and is obtained by filtering according to actual calculation.
In some embodiments of the present invention, the audio identification device may characterize or calculate the audio intensity of the audio data or the audio signal by a short-time average zero-crossing rate, a short-time energy, an energy entropy, a spectrum center, a spectrum spread, a spectrum entropy, a spectrum flux, a spectrum roll-off point, and the like.
The calculation of audio intensity as energy is described here by way of example.
For example, as shown in fig. 10, assuming that there are 512 sampling points per frame of audio data, the calculation process for each frame is as follows: for the 512 sampling points of each frame (i.e. each frame of the discrete time-domain signal amplitude sequence), calculate the Root-Mean-Square Energy of that frame (as shown in fig. 21); set a preset root-mean-square energy threshold (i.e. the preset audio intensity threshold); then, by comparing each frame's root-mean-square energy against this threshold, discard the frame signals before and after the effective signal whose energy is smaller than the threshold, and keep the frame signals in the middle whether or not they are greater than or equal to the threshold, so that the retained continuous frame signals constitute the effective signal.
The frame root mean square energy can be calculated by using formula (6), as follows:
RMSE = sqrt( (1/D) · Σ_{k=1..D} x(k)² )    formula (6)
wherein, RMSE is frame root mean square energy, x (k) is sample point data of each frame of audio data, and D is the number of sample points.
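A minimal sketch of the validity interception based on per-frame root-mean-square energy (formula (6)) might look as follows; the function name is an assumption. Note how frames lying between the first and last above-threshold frames are kept regardless of their own energy, as the text describes:

```python
import numpy as np

def trim_by_rmse(frames, threshold):
    """Compute the root-mean-square energy of each frame (formula (6)) and
    discard leading/trailing frames below the threshold, keeping the
    contiguous middle segment whether or not each of its frames exceeds
    the threshold."""
    rmse = np.sqrt(np.mean(frames ** 2, axis=1))   # per-frame RMSE
    keep = np.where(rmse >= threshold)[0]          # above-threshold frames
    if len(keep) == 0:
        return frames[:0]                          # no effective signal
    return frames[keep[0]:keep[-1] + 1]            # contiguous middle segment
```

Applied to both the reference audio data and the audio data to be detected, this yields the effective audio data of S1022 and S1023.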
In some embodiments of the present invention, the audio identification device may further perform the validity determination by calculating a square or an absolute value of a frame energy of each frame of audio data, and embodiments of the present invention are not limited thereto.
It can be understood that, because the beginning (the first few frames) and the end (the last few frames) of the audio data may have a large probability of occurrence of blank and environmental noise, and the like, in the embodiment of the present invention, effective signal segments excluding portions where the audio intensity is weak at the beginning and the end may be found by the audio intensity, so that the remaining effective audio data (for example, effective reference audio data and effective audio data to be detected) are continuous frame signals, and are also portions where the information amount is most accurate, thereby improving the accuracy of data processing.
In some embodiments of the present invention, referring to fig. 11, fig. 11 is an optional flowchart of the audio recognition method provided in the embodiments of the present invention, and S103 shown in fig. 11 may be implemented by S1031 to S1033, which will be described with reference to the steps.
And S1031, respectively carrying out Mel frequency cepstrum coefficient feature extraction on each frame of data of the effective reference audio data and the effective audio data to be detected, and obtaining each frame of effective reference audio features and each frame of effective audio features to be detected.
S1032, sorting each frame of effective reference audio features and each frame of effective audio features to be detected respectively according to audio intensity, and obtaining the sorted N-dimensional mel frequency cepstrum coefficient reference audio features with the highest audio intensity and the sorted N-dimensional mel frequency cepstrum coefficient to-be-detected audio features with the highest audio intensity.
S1033, taking the N-dimensional mel frequency cepstrum coefficient reference audio features corresponding to each frame of data of the effective reference audio data as reference audio features, and taking the N-dimensional mel frequency cepstrum coefficient to-be-detected audio features corresponding to each frame of data of the effective to-be-detected audio data as to-be-detected audio features.
In the embodiment of the invention, after the audio identification device acquires the effective reference audio data and the effective audio data to be detected, the audio identification device only processes the effective audio data, so that the processing amount of the audio data is reduced and the accuracy is improved.
In detail, the process by which the audio recognition device extracts mel frequency cepstrum coefficient features from each frame of data of the effective reference audio data and the effective audio data to be detected, obtaining each frame of effective reference audio features and each frame of effective audio features to be detected, is as follows: the audio identification device performs a fast Fourier transform on the sampling points of each frame of audio data (namely, each frame of effective reference audio data and each frame of effective audio data to be detected) to obtain the frequency spectrum of each frame, and takes the squared modulus of the frequency spectrum to obtain the power spectrum of each frame of audio features (each frame of effective reference audio features and each frame of effective audio features to be detected).
Illustratively, the audio identification device performs a fast Fourier transform on the 512 sample points of each frame of signal (i.e., each frame of audio data), where the fast Fourier transform is as in equation (7), and then filters each frame of signal using a mel filter bank; assume that the energy spectrum of each frame of signal is filtered using a bank of 128 mel-scale triangular band-pass filters. The mel scale describes the nonlinear characteristics of human hearing, and its relation to frequency can be approximately expressed by formula (8). Logarithms are then taken of the energy values output by the triangular band-pass filter bank, and a Discrete Cosine Transform (DCT) is performed on the logarithmic-energy mel spectrum, obtaining the mel frequency cepstrum coefficient features, namely each frame of effective reference audio features and each frame of effective audio features to be detected.
$$X(k) = \sum_{n=0}^{B-1} x(n)\, e^{-j 2\pi nk/B}, \quad 0 \le k \le B-1 \tag{7}$$
where x(n) is the input speech signal, and the number of Fourier transform points B is 512.
$$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) \tag{8}$$
It can be understood that the mel frequency cepstrum coefficient simulates the processing characteristics of human auditory perception, so the similarity comparison result conforms to human auditory sensitivity; such a comparison result can accurately reflect a normal person's perception of the degree of sound similarity, and the recognition rate of voice or audio can therefore be improved.
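A minimal sketch of the per-frame MFCC pipeline described above (FFT per formula (7), mel filterbank per formula (8), logarithm, then DCT), assuming a 44.1 kHz sampling rate (so 512 samples is roughly the 11.6 ms frame length mentioned later) and 128 triangular filters; all function names are illustrative:

```python
import numpy as np

def hz_to_mel(f):
    # Formula (8): Mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular band-pass filters with centers spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                     # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                    # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def frame_mfcc(frame, sr=44100, n_filters=128):
    """One 512-sample frame -> FFT -> power spectrum -> mel filterbank
    -> log -> DCT-II, yielding the frame's cepstral coefficients."""
    power = np.abs(np.fft.rfft(frame, n=512)) ** 2        # formula (7), B = 512
    log_mel = np.log(mel_filterbank(n_filters, 512, sr) @ power + 1e-10)
    k = np.arange(n_filters)[:, None]
    n = np.arange(n_filters)[None, :]
    dct_mat = np.cos(np.pi / n_filters * (n + 0.5) * k)   # DCT-II basis
    return dct_mat @ log_mel
```

The DCT here plays the role of the "inverse Fourier transform in the form of a discrete cosine transform" described in the application example below.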
In the embodiment of the invention, the audio identification device re-screens each frame of the obtained effective audio features (whose dimension corresponds to the number of time domain sampling points per frame), selects by sorting the N-dimensional audio features with concentrated energy (namely, high audio intensity) as the final mel frequency cepstrum coefficient audio features of each frame, and combines the mel frequency cepstrum coefficient audio features of all frames into the audio features (such as the reference audio features or the audio features to be detected), so that the dimensionality of each frame of mel frequency cepstrum coefficient audio features is consistent and the frames are aligned, facilitating the subsequent frame-by-frame similarity comparison.
Specifically, the audio identification device sorts each frame of effective reference audio features and each frame of effective audio features to be detected respectively according to audio intensity, obtaining the N-dimensional mel frequency cepstrum coefficient reference audio features with the highest audio intensity and the N-dimensional mel frequency cepstrum coefficient to-be-detected audio features with the highest audio intensity; finally, the N-dimensional mel frequency cepstrum coefficient reference audio features corresponding to each frame of data of the effective reference audio data are used as the reference audio features, and the N-dimensional mel frequency cepstrum coefficient to-be-detected audio features corresponding to each frame of data of the effective audio data to be detected are used as the audio features to be detected.
In some embodiments of the present invention, N is preferably 6, and embodiments of the present invention are not limited thereto.
Illustratively, each frame of the valid signal (e.g., the valid reference audio data and the valid audio data to be detected) is represented as a 6-dimensional feature, and the number of frames is the time length of the valid signal (hereinafter, simply referred to as "duration") divided by the frame length of 11.6 ms. After processing, each piece of valid reference audio data and valid audio data to be detected corresponds to a feature matrix (i.e., the reference audio features and the audio features to be detected) whose number of rows is 6 and whose number of columns is the number of frames. For example, the reference audio features form a <6, y> reference audio feature matrix, and the audio features to be detected form a <6, m> to-be-detected audio feature matrix, so that the audio recognition device can perform matching according to the feature matrices.
It can be understood that the dimensions of the audio features obtained by the processing are consistent, data alignment can be performed, the difficulty of subsequent matching is reduced, and the matching accuracy is improved.
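One plausible reading of the N-dimensional re-screening above (with N = 6) is to keep, per frame, the six cepstral coefficients of largest magnitude and stack the frames into a <6, num_frames> matrix like the <6, y> and <6, m> matrices just described. The exact selection criterion is an assumption here, since the text does not fix it precisely:

```python
import numpy as np

def top_n_features(frame_mfccs, n=6):
    """Reduce each frame's MFCC vector to its n highest-energy coefficients
    (largest magnitude), preserving the coefficients' original order, and
    stack frames column-wise into an (n, num_frames) feature matrix."""
    feats = []
    for coeffs in frame_mfccs:
        idx = np.argsort(np.abs(coeffs))[::-1][:n]  # n largest by |value|
        feats.append(coeffs[np.sort(idx)])          # keep original ordering
    return np.stack(feats, axis=1)
```

Because every frame is reduced to the same dimension, the resulting matrices are frame-aligned, which is exactly what makes the subsequent per-frame comparison straightforward.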
In some embodiments of the present invention, referring to fig. 12, fig. 12 is an optional flowchart of the audio recognition method provided in the embodiment of the present invention, and the implementation of the similarity comparison based on the reference audio feature and the audio feature to be detected in S105 shown in fig. 12 through feature matching may be implemented through S1051 to S1053, which will be described with reference to the steps.
S1051, the audio recognition device can perform feature matching on the reference audio features and the audio features to be detected according to the frame level to obtain the feature matching degree.
And S1052, obtaining the similarity based on the feature matching degree and a preset similarity model.
In the embodiment of the invention, the audio recognition equipment processes the audio in units of frames, performs feature matching on each frame to obtain a feature matching result for each frame, and finally combines the per-frame results into the feature matching degree.
In some embodiments of the present invention, the audio recognition device likewise performs feature matching on each frame of audio features, and finally obtains an overall feature matching degree of the audio features.
Here, the following description is made with respect to the matching process of each frame of audio features:
S10511, the audio recognition equipment acquires the (i-M)-th frame to the (i+L)-th frame of audio features to be detected from the audio features to be detected, and the i-th frame reference audio features of the reference audio features; wherein M is an integer greater than or equal to 0, i-M is greater than or equal to 1, i-M is less than i+L, L is a positive number greater than or equal to 1, and i is greater than or equal to 1 and less than or equal to the number of frames of the audio features to be detected or the number of frames of the reference audio features.
When the audio recognition device matches the i-th frame of audio features, it can expand the i-th frame of to-be-detected audio features into the (i-M)-th through (i+L)-th frames of to-be-detected audio features, and carry out the matching of the i-th frame of to-be-detected audio features against the i-th frame of reference audio features by comparing the (i-M)-th through (i+L)-th frames of to-be-detected audio features with the i-th frame of reference audio features.
It should be noted that, the values of M and L are not limited in the embodiments of the present invention, but when the i-M starts from the first frame, M is 0, i-M is greater than or equal to 1, and i-M is smaller than i + L. Wherein, M and L corresponding to each frame may be variable or may be fixed, and the embodiment of the present invention is not limited.
Illustratively, when the 1st frame of audio features to be detected is matched with the 1st frame of reference audio features, the audio recognition device acquires the 1st through 5th frames of audio features to be detected and matches these against the 1st frame of reference audio features.
S10512, searching whether a target to-be-detected audio feature matching the i-th frame reference audio feature exists among the (i-M)-th through (i+L)-th frames of audio features to be detected.
S10513, when the target to-be-detected audio feature exists, it indicates that the i-th frame of audio features to be detected matches the i-th frame reference audio feature, and the i-th matching result is a match.
S10514, when the target to-be-detected audio feature does not exist, it indicates that the i-th frame of audio features to be detected does not match the i-th frame reference audio feature, and the i-th matching result is a mismatch.
When the audio recognition equipment performs matching for the i-th frame of audio features, the matching of the i-th frame of to-be-detected audio features against the i-th frame reference audio features is carried out by comparing the (i-M)-th through (i+L)-th frames of to-be-detected audio features with the i-th frame reference audio features. Specifically, when the audio recognition equipment finds, among the (i-M)-th through (i+L)-th frames of to-be-detected audio features, a target to-be-detected audio feature that matches the i-th frame reference audio feature, this indicates that the i-th frame of to-be-detected audio features matches the i-th frame reference audio feature, and the i-th matching result is a match; when no such target to-be-detected audio feature is found, this indicates that the i-th frame of to-be-detected audio features does not match the i-th frame reference audio feature, and the i-th matching result is a mismatch. Thus, the audio recognition device obtains the matching result of the i-th frame.
S10515, recording the ith matching result, matching the to-be-detected audio features of the (i + 1) th frame with the reference audio features of the (i + 1) th frame until the to-be-detected audio features are matched, and recording the matching results of all frames to obtain the feature matching degree.
After the audio recognition device obtains the matching result of the i-th frame, it records the i-th matching result, ends the matching of the i-th frame of to-be-detected audio features, and proceeds to match the (i+1)-th frame of to-be-detected audio features against the (i+1)-th frame of reference audio features; that is, i is incremented by 1 and the processes of S10511 to S10515 are executed until all frames of the audio features to be detected have been matched, whereupon the matching results recorded for all frames are counted and the feature matching degree is finally obtained.
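The windowed frame-level matching of S10511 through S10515 might be sketched as follows. The per-frame matching criterion (Euclidean distance under a tolerance `tol`) is an assumption, as the text leaves the comparison itself open:

```python
import numpy as np

def frame_match_degree(ref, det, m=0, l=4, tol=1.0):
    """For each reference frame i, search frames i-m .. i+l of the features
    to be detected for a target frame whose distance to reference frame i is
    below `tol`; record a match or mismatch per frame and return the fraction
    of matched frames. `ref` and `det` are (N, num_frames) feature matrices."""
    num = min(ref.shape[1], det.shape[1])
    hits = 0
    for i in range(num):
        lo = max(i - m, 0)                       # window start (i-M, clamped)
        hi = min(i + l + 1, det.shape[1])        # window end (i+L, clamped)
        dists = np.linalg.norm(det[:, lo:hi] - ref[:, i:i + 1], axis=0)
        if np.any(dists < tol):                  # target frame found
            hits += 1
    return hits / num
```

With identical inputs the degree is 1.0; the window (m backward, l forward frames) gives the tolerance to small timing drift that motivates the i-M .. i+L expansion.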
In some embodiments of the present invention, each frame of the reference audio features and of the audio features to be detected is data of the same dimension; therefore, the feature matching degree obtained by the audio identification device may be a feature matching degree expressed in units of the dimensions of each frame of audio features, obtained after normalizing the matching results of all frames.
For example, when N is 6, after matching with the reference audio features, the feature matching degrees obtained after normalization are respectively denoted as [n_1, n_2, n_3, n_4, n_5, n_6].
In some embodiments of the present invention, the obtaining, by the audio recognition device, the similarity according to the model based on the feature matching degree and the preset similarity may include: the audio identification equipment acquires a preset weight database, and the preset weight database corresponds to the feature matching degree; and inputting the feature matching degree and the preset weight database into a preset similarity model, and outputting the similarity.
In some embodiments of the present invention, due to the existence of the environmental noise and the human pitch frequency, the preset weight corresponding to the feature matching degree is set in the preset weight database, and the preset weight corresponding to each dimension of the feature matching degree is different.
Illustratively, the preset weight database may be [w_1, w_2, w_3, w_4, w_5, w_6]^T.
Therefore, the audio recognition equipment can input the feature matching degree and the preset weight database into the preset similarity model and output the similarity.
In some embodiments of the present invention, the preset similarity model in the audio recognition device may be as shown in equation (9).
$$\text{Similarity} = \sum_{i=1}^{6} n_i w_i = [n_1, n_2, n_3, n_4, n_5, n_6]\,[w_1, w_2, w_3, w_4, w_5, w_6]^{T} \tag{9}$$
wherein Similarity is the similarity between the reference audio features and the audio features to be detected.
After the audio identification device obtains the similarity, whether the original audio data to be detected is consistent with the original reference audio data or not can be identified based on the result of the similarity, so that the application function which can be performed by utilizing the consistency of the audio identification is realized.
It can be understood that, due to the different weights among the features, the method can be suitable for the audio similarity detection of complex scenes, and has strong expansibility.
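Formula (9) reduces to a dot product between the per-dimension feature matching degrees [n_1..n_6] and the preset weights [w_1..w_6]^T; a trivial sketch (names are illustrative):

```python
import numpy as np

def similarity(match_degrees, weights):
    """Formula (9): Similarity = sum_i n_i * w_i, i.e. the per-dimension
    feature matching degrees weighted by the preset weight database."""
    n = np.asarray(match_degrees, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(n @ w)
```

Because the weights are per-dimension, scenarios with environmental noise or strong pitch components can down-weight the dimensions those factors corrupt, which is the extensibility claimed above.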
In the following, an exemplary application of the embodiments of the present invention in a practical application scenario will be described.
The description takes the voice recognition device as a server, and the application functions realized on a terminal through audio recognition (such as voice red envelope and karaoke scoring) as an example.
As shown in fig. 13, a terminal a receives original reference audio data 2 at a voice red packet function interface (e.g. group chat interface 1), records the voice red packet and sets a sum 3, then sends the voice red packet to a terminal B through a server C, receives original audio data 4 to be detected at the terminal B, sends the original audio data 4 to be detected to the server C, performs audio recognition on the original reference audio data 2 and the original audio data 4 to be detected at the server C, specifically realizes that the server C samples audio files of the original reference audio data 2 and the original audio data 4 to be detected, obtains an amplitude sequence of discrete time domain audio through processing such as framing and windowing, then performs validity detection, calculates root mean square energy of each frame of valid signals, sets a preset energy threshold (preset audio intensity threshold) for the root mean square energy, intercepting the parts before and after the sound signals start and end in an audio file, reserving effective sound fragments with root mean square energy larger than a preset energy threshold value, namely effective reference audio data and effective audio data to be detected, then carrying out Mel frequency cepstrum coefficient extraction and processing, carrying out short-time fast Fourier transform on each frame of signals to obtain a signal frequency spectrum, enabling the signal frequency spectrum to pass through a group of Mel filters to obtain a Mel frequency spectrum, carrying out logarithm on two sides to obtain a logarithmic spectrum, then carrying out inverse Fourier transform in a discrete cosine transform mode to obtain Mel frequency cepstrum coefficient characteristics, namely reference audio characteristics and audio characteristics to be detected, and finally carrying out similarity calculation when the relative time difference value between the reference audio characteristics and the audio 
characteristics to be detected is within a preset time difference threshold value range, wherein each frame of Mel frequency cepstrum coefficient characteristics corresponds to the front N-dimensional cepstrum coefficient with the largest signal energy as characteristics, and finally, according to a set similarity model and a preset weight, comparing the matching degree between the corresponding characteristics of the two end audios, determining a recognition result based on the similarity, and sending the recognition result to the terminal B, so that when the recognition result displayed in the terminal B is that the similarity is 98%, the voice red packet sent by the terminal A is successfully received.
It should be noted that, in the embodiment of the present invention, the server may further grant different levels of the red envelope amount according to different grades of the identification result. As shown in fig. 14, when the similarity is 80%, terminal B can successfully receive only 1.5 yuan even though terminal A sent a 5-yuan red envelope; as shown in fig. 15, when the similarity is 90%, terminal B can successfully receive only 3.5 yuan of the 5-yuan red envelope; and as shown in fig. 16, when the similarity is less than 80%, a prompt of "the voice similarity is too low and the red envelope cannot be successfully received" is given, prompting re-recording.
It should be noted that, when the voice recognition device is a terminal, what was described above as implemented in the server is simply performed on the terminal side.
Exemplarily, when karaoke (K song) is performed on a terminal, the original reference audio data is the original recording of a song and can be acquired over the network; at this time, the recording of song 1 can be performed on the K song interface of the terminal, that is, the original audio data to be detected is acquired, so that the terminal can obtain the similarity between the sung song and the original recording in the same processing manner as the server. As shown in fig. 17, if the real-time sound wave similarity is 80%, the real-time singing score is 75 points; as shown in fig. 18, if the real-time sound wave similarity is 98%, the real-time singing score is 95 points.
Embodiments of the present invention provide a storage medium having stored therein executable instructions that, when executed by a processor, cause the processor to perform an audio recognition method provided by embodiments of the present invention, for example, the audio recognition method as shown in fig. 4, 8, 9, 11 and 12.
In some embodiments, the storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a hypertext markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, according to the embodiments of the present invention, the audio identification device first performs validity detection on the audio data (including the reference audio data and the audio data to be detected) and retains only the portion of the audio data carrying the most information, so the data redundancy caused by retaining blank portions can be reduced. In the feature processing of the audio data, time-based matching is performed first, the audio data to be detected whose timing matches is selected, and only then is the similarity processing performed; that is, a first filtering step improves the accuracy of the feature processing before similarity processing and identification are carried out, thereby improving the accuracy of feature extraction and thus the robustness of the similarity calculation.
The above description is only an example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims (10)

1. An audio recognition method, comprising:
acquiring reference audio data and audio data to be detected;
carrying out effectiveness detection on the reference audio data and the audio data to be detected, and intercepting effective reference audio data and effective audio data to be detected, wherein effectiveness characterizes the portion of the audio data that carries the most information;
extracting Mel frequency cepstrum coefficient characteristics of the effective reference audio data and the effective audio data to be detected to obtain reference audio characteristics and audio characteristics to be detected;
performing time matching on the reference audio features and the audio features to be detected;
when the reference audio features and the audio features to be detected are matched in time, similarity comparison based on the reference audio features and the audio features to be detected is achieved through feature matching, and audio identification of the reference audio data and the audio data to be detected is achieved according to comparison results.
2. The method according to claim 1, wherein the obtaining of the reference audio data and the audio data to be detected comprises:
acquiring original reference audio data and original audio data to be detected;
and performing time domain processing on the original reference audio data and the original audio data to be detected to obtain the discrete reference audio data and the discrete audio data to be detected.
3. The method according to claim 1, wherein the performing validity detection on the reference audio data and the audio data to be detected and intercepting valid reference audio data and valid audio data to be detected comprises:
calculating the audio intensity of each frame of data in the reference audio data and the audio data to be detected respectively;
according to the audio intensity of each frame of data of the reference audio data, cutting off the audio data of which the audio intensity in a first specific frame is smaller than a preset audio intensity threshold value in the reference audio data to obtain the effective reference audio data;
according to the audio intensity of each frame of data of the audio data to be detected, cutting off the audio data of which the audio intensity in a second specific frame is smaller than a preset audio intensity threshold value in the audio data to be detected to obtain the effective audio data to be detected;
wherein the first specific frame and the second specific frame are each the first frame(s) and/or the last frame(s) in the respective audio data.
4. The method according to claim 1, wherein said extracting mel-frequency cepstrum coefficient features from said valid reference audio data and said valid audio data to be detected to obtain reference audio features and audio features to be detected comprises:
respectively extracting Mel frequency cepstrum coefficient characteristics of each frame of data of the effective reference audio data and the effective audio data to be detected to obtain each frame of effective reference audio characteristics and each frame of effective audio characteristics to be detected;
sorting each frame of effective reference audio features and each frame of effective audio features to be detected respectively according to audio intensity, to obtain, after sorting, the N-dimensional mel frequency cepstrum coefficient reference audio features with the highest audio intensity and the N-dimensional mel frequency cepstrum coefficient to-be-detected audio features with the highest audio intensity;
and taking the N-dimensional Mel frequency cepstrum coefficient reference audio features corresponding to each frame of data of the effective reference audio data as the reference audio features, and taking the N-dimensional Mel frequency cepstrum coefficient to be detected audio features corresponding to each frame of data of the effective audio data to be detected as the audio features to be detected.
5. The method according to any one of claims 1 to 4, wherein the time matching the reference audio feature and the audio feature to be detected comprises:
acquiring a first frame number of the reference audio features and a second frame number of the audio features to be detected from the reference audio features and the audio features to be detected;
comparing the first frame number with the second frame number to obtain a frame number difference value;
when the frame number difference value belongs to the time difference threshold value range, representing that the reference audio feature is matched with the audio feature to be detected in time;
and when the frame number difference does not belong to the time difference threshold range, representing that the reference audio feature is not matched with the audio feature to be detected in time.
6. The method according to claim 1, wherein the performing similarity comparison based on the reference audio feature and the audio feature to be detected through feature matching comprises:
according to the frame level, carrying out feature matching on the reference audio features and the audio features to be detected to obtain a feature matching degree;
and obtaining the similarity based on the feature matching degree and a preset similarity model.
7. The method according to claim 6, wherein said performing feature matching on the reference audio features and the audio features to be detected according to a frame level to obtain a feature matching degree comprises:
acquiring the i-M frame to-be-detected audio features to the i + L frame to-be-detected audio features of the to-be-detected audio features and the i frame reference audio features of the reference audio features; wherein M is an integer greater than or equal to 0, i-M is greater than or equal to 1, i-M is less than i + L, L is a positive number greater than or equal to 1, i is greater than or equal to 1, and is less than or equal to the number of frames of the audio features to be detected or the number of frames of the reference audio features;
searching whether target audio features to be detected matched with the reference audio features of the ith frame exist in the audio features to be detected from the ith-M frame to the ith + L frame;
recording the ith matching result, entering the matching of the (i + 1) th frame of audio features to be detected and the (i + 1) th frame of reference audio features until the audio features to be detected are matched, and recording the matching results of all frames to obtain the feature matching degree;
correspondingly, after searching whether the target audio feature to be detected matched with the reference audio feature of the ith frame exists in the audio features to be detected from the ith-M frame to the ith + L frame, and before recording the ith matching result, the method further includes:
when the target audio features to be detected exist, representing that the ith frame audio features to be detected are matched with the ith frame reference audio features, wherein the ith matching result is matching;
and when the target audio features to be detected do not exist, representing that the ith frame audio features to be detected are not matched with the ith frame reference audio features, wherein the ith matching result is mismatching.
8. The method according to claim 6 or 7, wherein the obtaining the similarity based on the feature matching degree and a preset similarity model comprises:
acquiring a preset weight database, wherein the preset weight database corresponds to the feature matching degree;
and inputting the feature matching degree and the preset weight database into the preset similarity model, and outputting the similarity.
9. An audio recognition device, comprising:
a processor, a memory, and a communication bus that the processor and the memory communicate;
the memory to store executable audio recognition instructions;
the processor, when executing executable audio recognition instructions stored in the memory, implementing the method of any of claims 1 to 8.
10. A computer-readable storage medium having stored thereon executable audio recognition instructions for causing a processor to perform the method of any one of claims 1 to 8 when executed.
CN201811038406.3A 2018-09-06 2018-09-06 Audio identification method and equipment and storage medium Active CN110880329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811038406.3A CN110880329B (en) 2018-09-06 2018-09-06 Audio identification method and equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110880329A true CN110880329A (en) 2020-03-13
CN110880329B CN110880329B (en) 2022-11-04

Family

ID=69727209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811038406.3A Active CN110880329B (en) 2018-09-06 2018-09-06 Audio identification method and equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110880329B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1560354A1 (en) * 2004-01-28 2005-08-03 Deutsche Thomson-Brandt Gmbh Method and apparatus for comparing received candidate sound or video items with multiple candidate reference sound or video items
CN1912993A (en) * 2005-08-08 2007-02-14 中国科学院声学研究所 Voice end detection method based on energy and harmonic
CN101221622A (en) * 2008-01-30 2008-07-16 中国科学院计算技术研究所 Advertisement detecting and recognizing method and system
US20090276216A1 (en) * 2008-05-02 2009-11-05 International Business Machines Corporation Method and system for robust pattern matching in continuous speech
JP2010123005A (en) * 2008-11-20 2010-06-03 Kddi Corp Document data retrieval device
CN103730110A (en) * 2012-10-10 2014-04-16 北京百度网讯科技有限公司 Method and device for detecting voice endpoint
CN106935248A (en) * 2017-02-14 2017-07-07 广州孩教圈信息科技股份有限公司 Voice similarity detection method and device

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113593607A (en) * 2020-04-30 2021-11-02 北京破壁者科技有限公司 Audio processing method and device and electronic equipment
CN111583963B (en) * 2020-05-18 2023-03-21 合肥讯飞数码科技有限公司 Repeated audio detection method, device, equipment and storage medium
CN111583963A (en) * 2020-05-18 2020-08-25 合肥讯飞数码科技有限公司 Method, device and equipment for detecting repeated audio and storage medium
CN111785294A (en) * 2020-06-12 2020-10-16 Oppo广东移动通信有限公司 Audio detection method and device, terminal and storage medium
CN111785294B (en) * 2020-06-12 2024-04-02 Oppo广东移动通信有限公司 Audio detection method and device, terminal and storage medium
CN112015925A (en) * 2020-08-27 2020-12-01 上海松鼠课堂人工智能科技有限公司 Method and system for generating teaching material package by combining multimedia files
CN112214635A (en) * 2020-10-23 2021-01-12 昆明理工大学 Fast audio retrieval method based on cepstrum analysis
CN112383379B (en) * 2020-10-29 2022-05-20 南昌大学 Method and device for calculating node time error of wireless information system
CN112383379A (en) * 2020-10-29 2021-02-19 南昌大学 Method and device for calculating node time error of wireless information system
CN112397093A (en) * 2020-12-04 2021-02-23 中国联合网络通信集团有限公司 Voice detection method and device
CN112397093B (en) * 2020-12-04 2024-02-27 中国联合网络通信集团有限公司 Voice detection method and device
WO2022199461A1 (en) * 2021-03-24 2022-09-29 华为技术有限公司 Method for testing speech interaction system, audio recognition method, and related devices
CN113409812A (en) * 2021-06-24 2021-09-17 展讯通信(上海)有限公司 Processing method and device of voice noise reduction training data and training method
CN113420679A (en) * 2021-06-26 2021-09-21 南京搜文信息技术有限公司 Artificial intelligent cross-camera multi-target tracking system and tracking algorithm
CN113420679B (en) * 2021-06-26 2024-04-26 南京搜文信息技术有限公司 Artificial intelligence cross-camera multi-target tracking system and tracking method

Also Published As

Publication number Publication date
CN110880329B (en) 2022-11-04

Similar Documents

Publication Publication Date Title
CN110880329B (en) Audio identification method and equipment and storage medium
CN106486131B (en) Speech denoising method and device
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN109599093B (en) Intelligent quality inspection keyword detection method, device and equipment and readable storage medium
CN109147796B (en) Speech recognition method, device, computer equipment and computer readable storage medium
CN110459241B (en) Method and system for extracting voice features
EP2695160A1 (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
CN101023469A (en) Digital filtering method, digital filtering equipment
CN110970036B (en) Voiceprint recognition method and device, computer storage medium and electronic equipment
CN109087670A (en) Mood analysis method, system, server and storage medium
WO2021042537A1 (en) Voice recognition authentication method and system
Pillos et al. A Real-Time Environmental Sound Recognition System for the Android OS.
WO2021179717A1 (en) Speech recognition front-end processing method and apparatus, and terminal device
CN109036437A (en) Accent recognition method and apparatus, computer device, and computer-readable storage medium
CN108847253B (en) Vehicle model identification method, device, computer equipment and storage medium
CN111696580A (en) Voice detection method and device, electronic equipment and storage medium
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
CN110931023A (en) Gender identification method, system, mobile terminal and storage medium
Hsu et al. Robust voice activity detection algorithm based on feature of frequency modulation of harmonics and its DSP implementation
Staudacher et al. Fast fundamental frequency determination via adaptive autocorrelation
CN112382302A (en) Baby cry identification method and terminal equipment
Li et al. A comparative study on physical and perceptual features for deepfake audio detection
CN113539243A (en) Training method of voice classification model, voice classification method and related device
CN113782032A (en) Voiceprint recognition method and related device
CN106970950B (en) Similar audio data searching method and device

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (country of ref document: HK; legal event code: DE; ref document number: 40021669)
SE01 Entry into force of request for substantive examination
GR01 Patent grant