CN114155884A - Audio highlight determination method and device, electronic equipment and storage medium


Info

Publication number
CN114155884A
Authority
CN
China
Prior art keywords: audio data, audio, original audio, original, data
Prior art date
Legal status: Pending (assumed; not a legal conclusion)
Application number
CN202111418361.4A
Other languages
Chinese (zh)
Inventor
林佳月
Current Assignee
Chengdu iQIYI Intelligent Innovation Technology Co Ltd
Original Assignee
Chengdu iQIYI Intelligent Innovation Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Chengdu iQIYI Intelligent Innovation Technology Co Ltd
Priority to CN202111418361.4A
Publication of CN114155884A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Coding or decoding using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0204 Coding or decoding using spectral analysis with subband decomposition
    • G10L19/022 Blocking, i.e. grouping of samples in time; Choice of analysis windows; Overlap factoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an audio highlight determination method and device, an electronic device, and a storage medium. The method includes: acquiring original audio data; slicing the original audio data to obtain slice audio, where the slice audio is in units of seconds; performing feature extraction on the slice audio with a Mel feature extraction algorithm to obtain Mel spectrum features; inputting the original audio data and the Mel spectrum features into a preset depth model for audio recognition, and determining an audio label corresponding to each second of audio data in the original audio data; and performing highlight generation processing on the audio labels corresponding to each second of audio data in the original audio data to obtain audio highlight segments. With the scheme provided by the invention, automatic detection and recognition of audio highlights can be completed in accordance with the requirements of the service scene.

Description

Audio highlight determination method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of audio detection, and in particular to an audio highlight determination method and device, an electronic device, and a storage medium.
Background
Audio tagging is an important direction in audio pattern recognition. Its basic task is to analyze an audio signal and recognize the type of sound it contains, such as speech, singing, or music. Existing audio tagging techniques mainly follow a signal-processing approach: features such as loudness and frequency are extracted from the audio and the audio type is analyzed according to rules. Alternatively, a model is trained on a data set, and a convolutional neural network borrowed from visual depth models is used to perform the audio tagging task and improve the tagging effect.
However, in the prior art there is no academic research on the detection of highlight sounds (such as laughter, applause, and cheering).
Disclosure of Invention
In order to solve the technical problem that audio highlights cannot be automatically detected and recognized in the prior art, embodiments of the present invention provide an audio highlight determination method, an audio highlight determination device, an electronic device, and a storage medium.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a method for determining audio highlights, which comprises the following steps:
acquiring original audio data;
slicing the original audio data to obtain slice audio, where the slice audio is in units of seconds;
performing feature extraction on the slice audio with a Mel feature extraction algorithm to obtain Mel spectrum features;
inputting the original audio data and the Mel spectrum features into a preset depth model for audio recognition, and determining an audio label corresponding to each second of audio data in the original audio data;
and performing highlight generation processing on the audio labels corresponding to each second of audio data in the original audio data to obtain audio highlight segments.
In the foregoing scheme, the slicing the original audio data to obtain sliced audio includes:
acquiring a set sliding window step length and a window size;
segmenting the original audio data into a plurality of one-dimensional arrays according to the sliding window step length and the window size;
acquiring the number of the plurality of one-dimensional arrays and the number of seconds of the original audio data;
copying a preset number of first arrays and/or second arrays to the head and/or tail of the plurality of one-dimensional arrays according to the number of the one-dimensional arrays and the number of seconds of the original audio data, so that the number of the one-dimensional arrays is the same as the number of seconds of the original audio data; wherein the first array is the first one-dimensional array obtained when the original audio data is segmented, and the second array is the last one-dimensional array obtained when the original audio data is segmented;
and taking the plurality of one-dimensional arrays with the head and tail thus supplemented as the slice audio of the original audio data.
In the foregoing solution, the segmenting the original audio data into a plurality of one-dimensional arrays according to the sliding window step size and the window size includes:
sliding the original audio data according to the sliding window step length, and segmenting the slid original audio data according to the window size once sliding to obtain a one-dimensional array;
and sliding the original audio data for multiple times according to the sliding window step length to obtain a plurality of one-dimensional arrays.
In the foregoing scheme, the performing feature extraction on the slice audio by using a mel-frequency feature extraction algorithm to obtain mel-frequency spectrum features includes:
pre-emphasis processing is carried out on the slice audio to obtain first audio data;
performing framing and windowing operation on the first audio data to obtain second audio data;
performing discrete Fourier transform on the second audio data to obtain frequency domain data;
and identifying the frequency domain data by using a Mel filter to obtain Mel frequency spectrum characteristics.
In the foregoing scheme, the inputting the original audio data and the mel-frequency spectrum feature into a preset depth model for audio recognition, and determining an audio tag corresponding to each second of audio data in the original audio data includes:
performing feature extraction on the original audio data by using a shallow feature extraction module in the depth model to obtain a first feature;
performing data enhancement on the Mel frequency spectrum characteristic by using the depth model to obtain a second characteristic;
and performing feature processing on the first feature and the second feature in the depth model, and determining an audio label corresponding to each second of audio data in the original audio data.
In the foregoing solution, the performing feature processing on the first feature and the second feature in the depth model, and determining an audio tag corresponding to each second of audio data in the original audio data includes:
respectively connecting the first feature and the second feature to obtain a three-dimensional feature map;
performing feature extraction on the three-dimensional feature map on a time domain dimension by using a one-dimensional convolutional layer to obtain three-dimensional time domain features on the time domain dimension;
performing feature extraction on the three-dimensional time domain features on time domain and space dimensions by using the two-dimensional convolution layer to obtain the three-dimensional time domain and space features on the time domain and space dimensions;
performing characteristic redundancy reduction and characteristic size unification processing on the three-dimensional time domain and space characteristics to obtain unified size characteristics;
and carrying out space mapping and label discrimination on the uniform size characteristics to obtain an audio label corresponding to each second of audio data in the original audio data.
In the foregoing solution, the performing spatial mapping and label discrimination on the uniform size features to obtain an audio label corresponding to each second of audio data in original audio data includes:
performing space mapping on the uniform size features to obtain label numerical values corresponding to the uniform size features;
acquiring a set recall threshold;
comparing the label value with the recall threshold to determine a label binarization value corresponding to the label value;
and judging, according to the label binarization value, whether the second of audio data corresponding to the label value has the audio label corresponding to that label value.
In the foregoing solution, the performing highlight generation processing on the audio tag corresponding to each second of audio data in the original audio data to obtain an audio highlight segment includes:
obtaining a smoothing step length and a smoothing window size;
sliding over the original audio data according to the smoothing step length, and, for each slide, performing value unification processing on the binarized values of the original audio data within the smoothing window to obtain unified binarized values;
sliding over the original audio data multiple times according to the smoothing step length to obtain a unified binarized value for each second of audio data in the original audio data;
and, according to the continuity of the unified binarized values of each second of audio data in the original audio data, determining the audio segment corresponding to a run of consecutive preset binarized values as an audio highlight segment.
The embodiment of the invention also provides an audio highlight determining device, which comprises:
the acquisition module is used for acquiring original audio data;
the slicing module is used for slicing the original audio data to obtain slice audio, where the slice audio is in units of seconds;
the feature extraction module is used for performing feature extraction on the slice audio with a Mel feature extraction algorithm to obtain Mel spectrum features;
the audio recognition module is used for inputting the original audio data and the Mel spectrum features into a preset depth model for audio recognition, and determining an audio label corresponding to each second of audio data in the original audio data;
and the highlight generation module is used for performing highlight generation processing on the audio labels corresponding to each second of audio data in the original audio data to obtain audio highlight segments.
An embodiment of the present invention further provides an electronic device, including: a processor and a memory for storing a computer program capable of running on the processor; wherein
the processor is configured to perform the steps of any of the methods described above when running the computer program.
The embodiment of the invention also provides a storage medium, wherein a computer program is stored in the storage medium, and when the computer program is executed by a processor, the steps of any one of the methods are realized.
With the audio highlight determination method and device, electronic device, and storage medium provided by the embodiments of the present invention, original audio data is acquired; the original audio data is sliced to obtain slice audio in units of seconds; feature extraction is performed on the slice audio with a Mel feature extraction algorithm to obtain Mel spectrum features; the original audio data and the Mel spectrum features are input into a preset depth model for audio recognition, and an audio label corresponding to each second of audio data in the original audio data is determined; and highlight generation processing is performed on the per-second audio labels to obtain audio highlight segments. With the scheme provided by the invention, automatic detection and recognition of audio highlights can be completed in accordance with the requirements of the service scene.
Drawings
FIG. 1 is a schematic flowchart of an audio highlight determination method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the system operation steps according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a Mel frequency spectrum extraction process according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a preprocessing module according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a depth model according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of a second-level tag identification module according to an embodiment of the present invention;
FIG. 7 is a flow chart of a post-processing module according to an embodiment of the present invention;
FIG. 8 is a system framework diagram of an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of an audio highlight determining apparatus according to an embodiment of the present invention;
fig. 10 is an internal structural view of a computer device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
An embodiment of the present invention provides an audio highlight determining method, as shown in fig. 1, the method includes:
step 101: acquiring original audio data;
step 102: slicing the original audio data to obtain slice audio, where the slice audio is in units of seconds;
step 103: performing feature extraction on the slice audio with a Mel feature extraction algorithm to obtain Mel spectrum features;
step 104: inputting the original audio data and the Mel spectrum features into a preset depth model for audio recognition, and determining an audio label corresponding to each second of audio data in the original audio data;
step 105: performing highlight generation processing on the audio labels corresponding to each second of audio data in the original audio data to obtain audio highlight segments.
Specifically, when the original audio data is acquired, audio data may be acquired directly, or audio/video data may be acquired first and the audio track separated from it by software to obtain the original audio data. Here, the original audio data may be separated from the audio/video data using the open-source tool ffmpeg.
In practical application, the original audio data can be encoded into a specific format for storage, and audio recognition is then performed on the original audio data in that format so that highlights in the original audio data can be recognized. A highlight in the present application refers to an audio point location of interest, such as laughter, applause, or cheering. In practice, the original audio data can be encoded into the wav format for storage, and audio recognition is then performed on the wav-format data to recognize the highlights it contains. For storage, the original audio data can be loaded with an open-source audio processing library, represented as a one-dimensional array, and kept in memory.
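The loading step described above can be illustrated with a short sketch. This is not part of the patent itself; ffmpeg and librosa are the tools named in this description, while the file names and the 16 kHz mono settings are illustrative assumptions.

    # Sketch of separating and loading the original audio data (assumed parameters).
    import subprocess
    import librosa

    def load_original_audio(video_path, wav_path="audio.wav", sr=16000):
        # Separate the audio track from the audio/video file and re-encode it to wav.
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", str(sr), wav_path],
            check=True,
        )
        # Load the wav file into memory as a one-dimensional array.
        audio, _ = librosa.load(wav_path, sr=sr, mono=True)
        return audio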
After the original audio data is acquired, slicing processing may be performed on the original audio data to obtain sliced audio, where the sliced audio is in units of seconds.
Specifically, to obtain highly reliable analysis of short-time audio and to enable second-level analysis, the original audio data can be sliced in units of seconds.
Further, in an embodiment, the slicing the original audio data to obtain slice audio includes:
acquiring a set sliding window step length and a window size;
segmenting the original audio data into a plurality of one-dimensional arrays according to the sliding window step length and the window size;
acquiring the number of the plurality of one-dimensional arrays and the number of seconds of the original audio data;
copying a preset number of first arrays and/or second arrays to the head and/or tail of the plurality of one-dimensional arrays according to the number of the one-dimensional arrays and the number of seconds of the original audio data, so that the number of the one-dimensional arrays is the same as the number of seconds of the original audio data; wherein the first array is the first one-dimensional array obtained when the original audio data is segmented, and the second array is the last one-dimensional array obtained when the original audio data is segmented;
and taking the plurality of one-dimensional arrays with the head and tail thus supplemented as the slice audio of the original audio data.
In practical application, slicing processing can be performed based on a service scene. Specifically, the step length of the sliding window may be set based on the detection granularity requirement of the service scene; setting the size of a window based on the characteristics of a depth model, detection confidence coefficient and/or sample length in a service scene; and segmenting the original audio data into one-dimensional audio data by using the set sliding window step length and the window size to obtain a plurality of one-dimensional arrays.
When segmenting the original audio data according to the sliding window step length and the window size, the window can be slid over the original audio data by the step length, and after each slide the original audio data covered by the window is cut out to obtain one one-dimensional array; sliding multiple times in this way produces a plurality of one-dimensional arrays. For example, the sliding window step length may be set to 1 second based on the detection granularity requirement of the service scene, and the window size may be set to 2 seconds based on the characteristics of the depth model, the detection confidence, and/or the sample length in the service scene. For 10 seconds of original audio, the window slides forward by 1 second each time and 2 seconds of audio data are cut out after each slide, each 2-second cut being one one-dimensional array. Slicing 10 seconds of audio data with a 1-second step and a 2-second window therefore yields 9 one-dimensional arrays.
Because the present application performs audio recognition on the original audio data in units of seconds, it must be ensured that every second of audio data in the original audio data has a corresponding recognition result, and the situation must be avoided where slicing reduces the number of one-dimensional arrays so that some seconds have no corresponding recognition result. For this purpose, a preset number of first arrays and/or second arrays can be copied to the head and/or tail of the plurality of one-dimensional arrays, so that the number of one-dimensional arrays equals the number of seconds of the original audio data and every second of audio data has a corresponding recognition result. For example, when 10 seconds of original audio data is sliced with a sliding window step length of 1 second and a window size of 2 seconds, 9 one-dimensional arrays are obtained; the first one-dimensional array can then be copied at the head of the 9 arrays, or the last one-dimensional array copied at the tail, so that the number of one-dimensional arrays matches the number of seconds of the original audio.
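A minimal sketch of this slicing and head/tail supplementing follows, assuming 16 kHz audio and the 1-second step and 2-second window used in the example above; it is an illustration rather than the patent's exact implementation.

    import numpy as np

    def slice_audio(audio, sr=16000, step_s=1, window_s=2):
        step, window = step_s * sr, window_s * sr
        num_seconds = len(audio) // sr  # assumes the audio is at least window_s long
        # Slide over the original audio and cut one window per step.
        slices = [audio[start:start + window]
                  for start in range(0, len(audio) - window + 1, step)]
        # Copy the last slice (or equivalently the first) until one slice exists per second.
        while len(slices) < num_seconds:
            slices.append(slices[-1].copy())
        return slices

For 10 seconds of audio this yields 9 windows plus one copied slice, so each second of the original audio has a corresponding slice.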
When selecting the sliding window step length and window size, suitable values can be chosen based on the application scene. For a second-level label scene, the sliding window step length can be set smaller than the window size so that the sliced one-dimensional arrays overlap one another; this suits the second-level label scene, where one second-level label must be output for every second of video, and the overlap improves recognition stability. For a segment-level label scene, the sliding window step length can be set equal to the window size so that the sliced one-dimensional arrays do not overlap; this suits the segment-level label scene, which only needs to determine whether a highlight exists within the segment, so no overlap between the one-dimensional arrays is required.
In addition, for a second-level label scene it should be noted that, when the sliced one-dimensional arrays are supplemented at the head and tail, each slice audio corresponding to a given second of audio data must still contain the audio data of that second in the original audio data, so that the audio label obtained for each second from its slice audio is accurate.
After the slice audio is obtained, feature extraction can be performed on the obtained slice audio to obtain mel frequency spectrum features.
Further, in an embodiment, the performing feature extraction on the slice audio by using a mel feature extraction algorithm to obtain mel spectrum features includes:
pre-emphasis processing is carried out on the slice audio to obtain first audio data;
performing framing and windowing operation on the first audio data to obtain second audio data;
performing discrete Fourier transform on the second audio data to obtain frequency domain data;
and identifying the frequency domain data by using a Mel filter to obtain Mel frequency spectrum characteristics.
Specifically, the Mel feature extraction algorithm, also called the Mel spectrum extraction technique, is based on the fact that the relation between the frequency perceived by the human ear and the actual frequency of a sound is not linear. A group of filters is constructed to filter the input signal, the signal energy output by each filter is used as a basic feature of the signal, and this feature is further processed to serve as the input feature for speech. The Mel feature extraction algorithm does not depend on the properties of the signal, makes no assumptions or restrictions on the input signal, and draws on research results from auditory models. It therefore has good robustness, conforms to the hearing characteristics of the human ear, and still performs well when the signal-to-noise ratio decreases.
By utilizing a Mel feature extraction algorithm, the slice audio can be converted into a two-dimensional feature matrix with time domain and frequency domain information.
Specifically, based on the Mel feature extraction algorithm, the slice audio may first be pre-emphasized; during pre-emphasis, a high-pass filter can be used to enhance the high-frequency part of the slice audio. Pre-emphasis also keeps the signal across the whole band from low frequency to high frequency, so that the spectrum of the slice audio can be obtained with the same signal-to-noise ratio. In practice, the enhancement of the high-frequency part of the slice audio can be strengthened by increasing the pre-emphasis coefficient.
In the present application, highlight sounds such as laughter, applause, and cheering generally have higher frequencies than other sounds, so enhancing the high-frequency part of the slice audio strengthens the highlight detection of this embodiment. For example, in a comedy variety show, to recognize laughter and applause in the audio data, whose frequencies are generally higher than those of other sounds, the 2000 to 3000 Hz band of the slice audio is enhanced after slicing and the pre-emphasis coefficient is set to 0.97, so as to enhance the high-frequency part of the slice audio and achieve a better recognition effect.
After the slice audio is subjected to pre-emphasis processing, the pre-emphasized first audio data can be subjected to framing operation, and framing in a time domain is performed. That is, the first audio data is divided into a plurality of one-dimensional first arrays according to a certain fixed time length. Here, each one-dimensional first array of the division may be referred to as a frame. Here, it should be noted that the framing operation in the present embodiment does not refer to the slicing operation in the above-described embodiment. In this embodiment, the original audio data may be sliced to obtain a sliced audio, and then the sliced audio may be framed by using a framing operation in the mel feature extraction algorithm. After the framing operation is carried out, windowing processing can be carried out on the plurality of one-dimensional first arrays after framing, so that the continuity of the left end and the right end of the frame is improved, and the frequency spectrum leakage is reduced. In practice, the windowing operation may be performed using a conventional window function, such as a Hamming window function.
After the windowing operation is performed, discrete fourier transform can be performed on the windowed second audio data, and a signal which is originally represented by a time domain is converted into a representation of energy in a frequency domain, wherein different energy distributions can represent different voice characteristics. Here, the data after the discrete fourier transform may be defined as frequency domain data.
After the frequency domain data are obtained, the frequency domain data can be identified by utilizing a Mel filter to obtain Mel frequency spectrum characteristics. Specifically, the frequency domain data can be transmitted into a Mel filter, linear frequency spectrums are mapped into Mel nonlinear frequency spectrums based on auditory perception, auditory perception is simulated, and two-dimensional Mel frequency spectrum characteristics are obtained. Since the mel-frequency spectrum feature recognition is performed on the slice audio with the unit of second, the recognized mel-frequency spectrum feature can also be called the second-level mel-frequency spectrum feature.
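The pre-emphasis, framing, windowing, transform, and Mel filtering steps above can be sketched as follows; the 0.97 pre-emphasis coefficient follows the example in this description, while the FFT size, hop length, and number of Mel bands are illustrative assumptions.

    import numpy as np
    import librosa

    def mel_features(slice_audio, sr=16000, n_fft=512, hop=160, n_mels=64):
        # Pre-emphasis: a first-order high-pass filter that boosts the high-frequency part.
        emphasized = np.append(slice_audio[0], slice_audio[1:] - 0.97 * slice_audio[:-1])
        # Framing, Hamming windowing, and DFT, performed here via the short-time Fourier transform.
        power = np.abs(librosa.stft(emphasized, n_fft=n_fft, hop_length=hop,
                                    window="hamming")) ** 2
        # Map the linear spectrum onto the Mel scale with a Mel filter bank.
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        return mel_fb @ power  # two-dimensional Mel spectrum feature (n_mels x frames)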
Further, in an embodiment, the inputting the original audio data and the mel-frequency spectrum feature into a preset depth model for audio recognition, and the determining an audio tag corresponding to each second of audio data in the original audio data includes:
performing feature extraction on the original audio data by using a shallow feature extraction module in the depth model to obtain a first feature;
performing data enhancement on the Mel frequency spectrum characteristic by using the depth model to obtain a second characteristic;
and performing feature processing on the first feature and the second feature in the depth model, and determining an audio label corresponding to each second of audio data in the original audio data.
Here, the depth model may be a convolutional neural network model. The depth model can dynamically adjust the convolution kernel weight according to the task target and the penalty function, performs feature extraction and classification mapping on the samples, generalizes the samples to data samples in a service scene, and finishes the label judgment of the samples in the service scene.
Specifically, after the depth model receives the original audio data and the Mel spectrum features, the original audio data is fed into a shallow feature extraction model composed of a convolutional layer and a pooling layer, and audio features that the Mel spectrum neglects, namely the first feature, are obtained through parameter learning. This feature retains more information for the deep feature maps computed later in the depth model. Meanwhile, the depth model also performs data enhancement on the Mel spectrum features to obtain the second feature; the data enhancement improves the generalization of the network and prevents overfitting. Here, data enhancement methods such as Mixup and SpecAugment may be used.
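As one example of such data enhancement, a minimal Mixup sketch is given below; how Mixup is actually configured in the depth model is not specified here, so the batch layout and the Beta(0.2, 0.2) mixing distribution are assumptions.

    import numpy as np

    def mixup(features, labels, alpha=0.2):
        # Convex combination of each sample with a randomly paired sample from the batch.
        lam = np.random.beta(alpha, alpha)
        perm = np.random.permutation(len(features))
        mixed_x = lam * features + (1 - lam) * features[perm]
        mixed_y = lam * labels + (1 - lam) * labels[perm]
        return mixed_x, mixed_y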
In this embodiment, the time point locations of highlight audio can be inferred by performing a second-level audio label inference operation on the original audio data and the Mel spectrum features with the depth model. That is, this embodiment uses the depth model to learn, from the original audio data and the Mel spectrum features, whether a highlight-related sound label is currently present.
Further, in an embodiment, the performing feature processing on the first feature and the second feature in the depth model, and determining an audio tag corresponding to each second of audio data in the original audio data includes:
respectively connecting the first feature and the second feature to obtain a three-dimensional feature map;
performing feature extraction on the three-dimensional feature map on a time domain dimension by using a one-dimensional convolutional layer to obtain three-dimensional time domain features on the time domain dimension;
performing feature extraction on the three-dimensional time domain features on time domain and space dimensions by using the two-dimensional convolution layer to obtain the three-dimensional time domain and space features on the time domain and space dimensions;
performing characteristic redundancy reduction and characteristic size unification processing on the three-dimensional time domain and space characteristics to obtain unified size characteristics;
and carrying out space mapping and label discrimination on the uniform size characteristics to obtain an audio label corresponding to each second of audio data in the original audio data.
Specifically, when the first feature and the second feature are connected respectively, the first feature and the second feature may be connected respectively through a connection function, so as to obtain a three-dimensional feature map. The connection of the first feature and the second feature may be understood as stacking two-dimensional feature maps, and the stacked two-dimensional feature maps are three-dimensional feature maps. Here, the join function may be a concat () function.
After the three-dimensional feature map is obtained, the three-dimensional feature map may be passed into a one-dimensional convolution layer in the depth model. And extracting different characteristics of the three-dimensional characteristic graph on a time domain dimension by using the layer to obtain the three-dimensional time domain characteristic. And then, the three-dimensional time domain characteristics can be sent into a two-dimensional convolution layer in the depth model, so that the receptive field is increased, and the characteristics of the fusion time domain and the fusion space are generated, namely the three-dimensional time domain and space characteristics on the time domain and the space dimensions. After the three-dimensional time domain and space characteristics are obtained, the three-dimensional time domain and space characteristics can be sent to a global pooling layer in the depth model, and redundancy of the characteristics is reduced; meanwhile, different sizes in the three-dimensional time domain and space characteristics are processed in a unified mode through characteristic sizes, and the unified size characteristics with fixed sizes are obtained. After the uniform size features are obtained, the uniform size features can be sent to a full connection layer in a depth model, and the learned distributed feature representation is mapped to a sample mark space; and outputs the result. After obtaining the output, the output can be fed into a Dropout layer in the depth model, where the Dropout layer will discard some of the feature map content at random to prevent over-fitting of the model. Finally, the output result of the Dropout layer can be predicted by using the same number of discriminators as the tags, and whether the sample has the audio content of a certain tag or not can be predicted. Here, sigmoid discriminators may be used for prediction.
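A simplified PyTorch sketch of this flow (raw-audio branch, Mel branch, connection, one-dimensional and two-dimensional convolution, global pooling, fully connected layer, Dropout, and per-label sigmoid) is shown below; the layer sizes and kernel shapes are illustrative assumptions, not the patent's exact architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AudioHighlightNet(nn.Module):
        def __init__(self, n_mels=64, num_labels=3):
            super().__init__()
            # Shallow feature extraction on the raw waveform (convolution + pooling).
            self.raw_conv = nn.Conv1d(1, n_mels, kernel_size=400, stride=160)
            self.time_conv = nn.Conv1d(2 * n_mels, 2 * n_mels, kernel_size=3, padding=1)
            self.space_conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)
            self.fc = nn.Linear(32, 128)
            self.dropout = nn.Dropout(0.5)
            self.head = nn.Linear(128, num_labels)  # one sigmoid discriminator per label

        def forward(self, raw, mel):
            # raw: (batch, samples); mel: (batch, n_mels, frames)
            first = F.relu(self.raw_conv(raw.unsqueeze(1)))       # first feature from raw audio
            first = F.adaptive_avg_pool1d(first, mel.shape[-1])   # align the frame count with mel
            fused = torch.cat([first, mel], dim=1)                # connected feature map
            fused = F.relu(self.time_conv(fused))                 # feature extraction in the time domain
            fused = F.relu(self.space_conv(fused.unsqueeze(1)))   # fuse time and space
            pooled = F.adaptive_avg_pool2d(fused, 1).flatten(1)   # global pooling, unified size
            hidden = self.dropout(F.relu(self.fc(pooled)))        # fully connected + Dropout
            return torch.sigmoid(self.head(hidden))               # per-label probabilities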
Further, in an embodiment, the inputting the original audio data and the mel-frequency spectrum feature into a preset depth model for audio recognition, and determining the audio tag corresponding to each second of audio data in the original audio data further includes:
acquiring training data and a real audio label corresponding to the training data;
training an initial depth model by using the training data and the real audio label to obtain a trained depth model;
and inputting the original audio data and the Mel frequency spectrum characteristics into a trained depth model for audio recognition, and determining an audio label corresponding to each second of audio data in the original audio data.
In practical application, before the depth model is used for identification and judgment, the depth model can be built firstly, and the depth model is trained, so that the depth model capable of performing highlight identification is obtained.
In this case, the building and training of the deep model can be realized through the steps of audio data organization, model building, pre-training weight loading, multi-epoch training, weight saving and the like. Specifically, training data, i.e., raw audio data and mel-frequency spectrum features for model training are obtained first. Here, the raw audio data and mel-frequency spectrum features used for model training may be organized into an HDF5 format file and corresponding real audio tags may be obtained. And then, building a depth model, and loading the open source AudioSet pre-training weight into the built depth model as an initialization weight for guiding subsequent training. After the weight is loaded into the depth model, the HDF5 format file can be input into the depth model for multi-round training, and after the training is finished, the trained network weight can be stored in a disk for calling of a subsequent reasoning process.
After the trained network weights are obtained, they can be used to recognize and judge audio highlights. The recognition and judgment of audio highlights can be realized through the steps of weight loading, preprocessed data reading, forward inference, and the like. Specifically, the trained network weights are loaded into the depth model, the original audio data to be detected is acquired and input into the loaded depth model, and forward inference is performed to obtain two-dimensional second-level labels.
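A hedged sketch of this training-then-inference workflow is given below; the pre-trained weight file, the data loader, and the hyper-parameters are assumptions for illustration only.

    import torch

    def train_and_save(model, loader, epochs=10, lr=1e-3,
                       pretrained_path="audioset_pretrain.pth",
                       save_path="highlight_model.pth"):
        # Load open-source pre-training weights as initialization, keeping only matching keys.
        state = torch.load(pretrained_path, map_location="cpu")
        own = model.state_dict()
        own.update({k: v for k, v in state.items() if k in own and v.shape == own[k].shape})
        model.load_state_dict(own)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = torch.nn.BCELoss()  # multi-label, per-second audio labels
        for _ in range(epochs):              # multi-round training
            for raw, mel, labels in loader:  # batches read from the organized training file
                optimizer.zero_grad()
                loss = criterion(model(raw, mel), labels.float())
                loss.backward()
                optimizer.step()
        torch.save(model.state_dict(), save_path)  # stored for the inference stage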
Further, in an embodiment, the performing spatial mapping and label discrimination on the uniform size feature to obtain an audio label corresponding to each second of audio data in the original audio data includes:
performing space mapping on the uniform size features to obtain label numerical values corresponding to the uniform size features;
acquiring a set recall threshold;
comparing the label numerical value with the recall threshold value, and judging a label binary numerical value corresponding to the label numerical value;
and judging whether the audio data per second corresponding to the label numerical value has an audio label corresponding to the label numerical value according to the label numerical value.
Specifically, after spatial mapping of the uniform-size features, a label value corresponding to the features is obtained. That is, with the depth model, each second of audio data yields label values for multiple label categories. For the label value of a given label category, a recall threshold can be used to judge whether that second of audio data carries the category label: when the label value is greater than or equal to the recall threshold, that second of audio data is judged to have the label; when the label value is less than the recall threshold, it is judged not to have the label. The recall threshold may be set based on the service scene, and the parameter can be adjusted flexibly for different service requirements so that the calculation result of the model adapts to different application scenes. In practical application, a label binarization value can represent whether a second of audio data has the label; for example, a binarization value of 1 indicates that the second of audio data has the label, and 0 indicates that it does not.
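A minimal sketch of this recall-threshold comparison follows; the 0.5 threshold is an illustrative assumption that would be tuned per service scene.

    import numpy as np

    def binarize_labels(label_values, recall_threshold=0.5):
        # label_values: array of shape (seconds, num_labels) output by the depth model.
        # 1 means that second of audio carries the label, 0 means it does not.
        return (label_values >= recall_threshold).astype(np.int8)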
Further, in an embodiment, the performing highlight generation processing on the audio tag corresponding to each second of audio data in the original audio data to obtain an audio highlight segment includes:
obtaining a smoothing step length and a smoothing window size;
sliding over the original audio data according to the smoothing step length, and, for each slide, performing value unification processing on the binarized values of the original audio data within the smoothing window to obtain unified binarized values;
sliding over the original audio data multiple times according to the smoothing step length to obtain a unified binarized value for each second of audio data in the original audio data;
and, according to the continuity of the unified binarized values of each second of audio data in the original audio data, determining the audio segment corresponding to a run of consecutive preset binarized values as an audio highlight segment.
Specifically, the audio tags of each second of audio data may be smoothed, and highlight segments with time point locations are generated from the smoothed result. In practical application, the smoothing window size and smoothing step length can be set based on the service scene and the continuity of the sound; the determined binarized values are connected into smoothed segment binarized values, and the audio highlight segments are determined from the continuity of the smoothed values. For example, when the per-second binarized values of 9 seconds of original audio data are 111011000, they may be smoothed into 111111000 using a suitable smoothing window size and step length, and the audio segment from the 1st to the 6th second of the original audio data is then an audio highlight segment.
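One assumed way to implement this smoothing and segment generation is sketched below; the gap-filling rule is only one possible reading of how the binarized values are unified, chosen so that the worked example (111011000 smoothed to 111111000, highlight in seconds 1 to 6) holds.

    import numpy as np

    def highlight_segments(binary, max_gap=1, min_len=2):
        smoothed = binary.copy()
        ones = np.flatnonzero(binary == 1)
        # Smoothing: fill short gaps of zeros between highlight seconds.
        for left, right in zip(ones[:-1], ones[1:]):
            if 1 < right - left <= max_gap + 1:
                smoothed[left:right] = 1
        # Collect runs of at least min_len consecutive 1s as highlight segments.
        segments, start = [], None
        for i, v in enumerate(np.append(smoothed, 0)):
            if v == 1 and start is None:
                start = i
            elif v == 0 and start is not None:
                if i - start >= min_len:
                    segments.append((start + 1, i))  # 1-based second range, inclusive
                start = None
        return segments

    # highlight_segments(np.array([1, 1, 1, 0, 1, 1, 0, 0, 0])) returns [(1, 6)].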
With the audio highlight determination method provided by the embodiment of the present invention, original audio data is acquired; the original audio data is sliced to obtain slice audio in units of seconds; feature extraction is performed on the slice audio with a Mel feature extraction algorithm to obtain Mel spectrum features; the original audio data and the Mel spectrum features are input into a preset depth model for audio recognition, and an audio label corresponding to each second of audio data in the original audio data is determined; and highlight generation processing is performed on the per-second audio labels to obtain audio highlight segments. With this scheme, automatic detection and recognition of audio highlights can be completed in accordance with the requirements of the service scene.
The present invention will be described in further detail with reference to the following application examples.
This embodiment can be applied to any scene in which audio exists. It processes and analyzes the audio signal, determines the time point locations of laughter, applause, and cheering in the audio, and applies post-processing, thereby completing automatic detection of audio/video highlight point locations, for example locating golden quotes in variety shows or marking viewing points in films and television. Combined with the service requirement of finding viewing points, this embodiment can analyze and locate sounds in the audio that are likely to be highlights even when samples are difficult to collect.
This embodiment mainly uses a conventional speech feature extraction technique to represent the digital audio signal as digital features; the digital features are fed into a depth network model, i.e. a supervised training model, to extract feature representations that can distinguish highlight audio (laughter/applause/cheering), and the extracted representations are sent to a sigmoid discriminator (which can be understood as a label discriminator) to predict, for every second, the probability that highlight audio is present; a post-processing strategy suited to the service scene then smooths the second-level labels into segment-level labels, which are output as a service for subsequent label use.
Before describing the specific scheme of the present embodiment, several algorithm principles used in the present embodiment are described:
First: the conventional Mel spectrum extraction technique.
The Mel spectrum extraction technique is based on the fact that the relation between the frequency perceived by the human ear and the actual frequency of a sound is not linear; the input signal is filtered by constructing a group of filters. The signal energy output by each filter is used as a basic feature of the signal, and after further processing this feature can serve as the input feature for speech. Since the feature does not depend on the properties of the signal, makes no assumptions or restrictions on the input signal, and draws on research results from auditory models, it has good robustness, better conforms to the hearing characteristics of the human ear, and still offers good recognition performance when the signal-to-noise ratio decreases.
Second: the depth model, which is mainly a convolutional neural network.
The principle of the method is mainly that a convolution kernel weight value is dynamically adjusted according to a task target and a penalty function, characteristic extraction and classification mapping are carried out on samples, the samples are generalized to data samples in a service scene, and label judgment of the samples in the service scene is completed.
Specifically, the scheme proposed in this embodiment is mainly divided into three modules: a data audio signal preprocessing module, a second-level label identification module, and a post-processing module based on the service scene. The specific operation steps of this embodiment are shown in fig. 2, namely:
S1: the audio is loaded and processed into mel-frequency spectrum characteristics using a data audio signal preprocessing module.
S2: and (4) sending the original signal array and the Mel frequency spectrum characteristics generated in the S1 into a depth model, performing second-level audio label reasoning operation, and reasoning out the time positioning with the viewpoint audio.
S3: and (5) connecting the second-level labels output by the S2 into a viewpoint segment with a time point through a post-processing module.
The above steps will be explained in further detail below.
Data audio signal preprocessing module
In this embodiment, the preprocessing of the digital signal mainly uses a conventional mel-frequency feature extraction algorithm to convert the digital signal into a two-dimensional feature matrix with time domain and frequency domain information.
Specifically, referring to fig. 3, the operation process of the mel-frequency spectrum extraction algorithm includes the following steps:
the Mel characteristic extraction algorithm firstly performs pre-emphasis operation on loaded voice data, the voice signal passes through a high-pass filter, the high-frequency part in the voice signal is enhanced, the voice signal is kept in the whole frequency band from low frequency to high frequency, and the same signal-to-noise ratio can be used for solving the frequency spectrum. In this embodiment, since the viewpoint audio is detected according to the service scene concerned, the pre-emphasis coefficient can be set to be large to highlight the high-frequency audio information. Then, using the one-dimensional audio data to perform framing operation on a time domain, and dividing a dimension into a plurality of one-dimensional arrays according to a certain fixed time length, wherein each one-dimensional array is called a frame; then, each frame is windowed, and the window function used in this embodiment may be a conventional Hamming window to increase the continuity of the left and right ends of the frame and reduce the spectral leakage. After windowing, a frame of audio is converted into a signal characterized in the time domain by using discrete fourier transform, and the signal is characterized in the frequency domain by energy, and different energy distributions can represent different speech features. And then, the frequency domain data is transmitted into a Mel filter, the linear frequency spectrum is mapped into a Mel nonlinear frequency spectrum based on auditory perception, the auditory perception is simulated, and finally, two-dimensional Mel frequency spectrum characteristics are obtained.
Based on the above operation process of the mel feature extraction algorithm, the specific working steps of the digital signal preprocessing module in this embodiment can be as shown in fig. 4. The method comprises the following steps:
S11: the target audio is separated from the audio/video sample using ffmpeg (which can be understood as an audio separation tool) and re-encoded into wav format; the audio signal is then loaded into memory with the open-source audio processing library librosa and stored as a one-dimensional array;
S12: the original audio data output by S11 is sliced according to the service scene;
S13: Mel spectrum features are extracted from the slice audio output by S12 and treated as the Mel features of the current second.
Here, the detailed process of S12 can be largely divided into the following two steps:
S121: the sliding window step length is set based on the detection granularity requirement of the service scene; the window size is set based on the characteristics of the subsequent depth model, the detection confidence, and the sample length in the scene; the original one-dimensional audio data is then segmented into a plurality of one-dimensional arrays using this step length and window size;
S122: the number of one-dimensional single-slice audio arrays output by the S121 strategy is supplemented by copy expansion so that it equals the number of seconds of the original audio, and the subsequent calculation result of each audio array then represents the calculation result of the corresponding second.
Here, the sliding window step length and window size may be chosen according to the needs of the service scene. For example, for a highlight such as applause, since applause sounds similar to rain, the window size may be set to a larger value, for example 3 seconds, so that applause can be recognized accurately and interference from sounds such as rain avoided: a larger window makes the slice audio longer, which improves the accuracy of applause recognition. For a highlight such as laughter, which differs greatly from other sounds and is therefore easy to recognize, a window size of 1 second can be chosen for laughter highlight detection.
In addition, because the present application targets a second-level label scene, the sliding window step length can be set to 1 second to ensure a better recognition result.
Specifically, the original audio data may be slid over according to the sliding-window step size; at each sliding position, a segment of the window size is cut out of the audio as a one-dimensional array, so sliding over the original audio data multiple times yields multiple one-dimensional arrays. For example, the step size may be set to 1 second based on the detection-granularity requirement of the service scenario, and the window size to 2 seconds based on the depth model, the detection confidence and/or the sample-length characteristics of the scenario. For 10 seconds of original audio, the window slides forward 1 second at a time and 2 seconds of audio are sliced out at each position, each forming one one-dimensional array; slicing 10 seconds of audio with a 1-second step and a 2-second window therefore yields 9 one-dimensional arrays.
After slicing, copy expansion is used to pad the head and tail so that the number of one-dimensional single-slice audio arrays output by S121 equals the number of seconds of the original audio. For example, after 10 seconds of original audio are sliced with a 1-second step and a 2-second window into 9 one-dimensional arrays, the first array may be copied once at the head, or the last array copied once at the tail, so that the number of arrays equals the number of seconds of the original audio.
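The slicing and copy-expansion strategy of S121/S122 can be sketched as follows; the sample rate and the tail-padding choice are assumptions for illustration.

import numpy as np

def slice_audio(audio, sr, step_s=1, window_s=2):
    # S121: cut the 1-D signal into window_s-second slices every step_s seconds.
    step, window = step_s * sr, window_s * sr
    slices = [audio[start:start + window]
              for start in range(0, len(audio) - window + 1, step)]
    # S122: copy-expand at the tail until the slice count equals the number of
    # whole seconds of audio (padding at the head would work equally well).
    n_seconds = len(audio) // sr
    while len(slices) < n_seconds:
        slices.append(slices[-1].copy())
    return slices

# 10 s of audio, 1 s step, 2 s window -> 9 raw slices, padded to 10.
print(len(slice_audio(np.zeros(10 * 16000, dtype=np.float32), sr=16000)))  # 10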
The above is the related content of the data audio signal preprocessing module in this embodiment.
(II) second-level label identification module
In this module, a depth model (AudioCNN) is built and trained in a supervised manner, so that the model can learn from the original audio features (Mel-spectrum features) whether sound labels related to audio viewpoints are currently present.
Here, the specific network model, as shown in fig. 5, may include the following processes:
As shown in fig. 5, the network model receives the original audio data and the Mel-spectrum features generated by the digital signal preprocessing module as the inputs of the AudioCNN model. The original audio data are fed into a shallow feature-extraction branch composed of convolutional and pooling layers, whose purpose is to capture, through parameter learning, audio features that the Mel spectrum neglects and to retain more information for the subsequent deep feature mapping. The other branch takes the Mel-spectrum features and applies data enhancement to them, namely Mixup and SpecAugment, in order to improve the generalization of the network and prevent overfitting. After this processing, the features output by the two input branches are concatenated into a three-dimensional feature map, which is passed into a one-dimensional convolutional layer that extracts features over different time ranges; a two-dimensional convolution then enlarges the receptive field and produces features that fuse the time and spatial dimensions; the result is passed into a global pooling layer, which reduces feature redundancy and fixes the feature maps of different sizes, produced by inputs of different lengths, to a specific dimension; the uniform-dimension features produced by the pooling layer are sent to fully connected layers, which map the learned distributed feature representation to the sample label space; the output of each fully connected layer is sent to a Dropout layer, which randomly discards part of the feature map to prevent overfitting of the model; finally, sigmoid activation functions, one per label, predict whether the sample contains audio content of each label.
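The paragraph above describes the AudioCNN structure only qualitatively; the PyTorch sketch below is one possible reading of it, in which the channel counts, kernel sizes and number of labels are assumptions, and the Mixup/SpecAugment enhancement applied to the Mel branch during training is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioCNN(nn.Module):
    def __init__(self, n_mels=64, n_labels=5):
        super().__init__()
        # Shallow branch: learn waveform features the Mel spectrum may neglect.
        self.raw_branch = nn.Sequential(
            nn.Conv1d(1, n_mels, kernel_size=1024, stride=256, padding=512),
            nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.conv1d = nn.Sequential(nn.Conv1d(2 * n_mels, 128, 3, padding=1), nn.ReLU())
        self.conv2d = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)   # fixes variable-length inputs to one size
        self.fc = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Dropout(0.5))
        self.head = nn.Linear(64, n_labels)   # one sigmoid output per label

    def forward(self, waveform, mel):
        # waveform: (batch, 1, samples); mel: (batch, n_mels, frames)
        raw = self.raw_branch(waveform)
        raw = F.adaptive_avg_pool1d(raw, mel.shape[-1])
        x = torch.cat([raw, mel], dim=1)      # concatenate the two input branches
        x = self.conv1d(x)                    # features over different time ranges
        x = self.conv2d(x.unsqueeze(1))       # fused time/spatial features
        x = self.pool(x).flatten(1)           # uniform dimension
        x = self.fc(x)                        # distributed representation + Dropout
        return torch.sigmoid(self.head(x))    # per-label presence probabilities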
Based on the depth model, the main process of this module may include:
As shown in fig. 6, the second-level label identification module can be divided into two processes: a training process and a use (inference) process.
(1) Training process. It mainly comprises audio data organization, model building, pre-trained weight loading, multi-epoch training and weight saving, specifically the following steps (a code sketch of these steps is given after the list):
S21: after the training data are processed by the data audio signal preprocessing module, organize the output original audio data and Mel-spectrum features into an HDF5-format file that records the data and the corresponding ground-truth audio labels.
S22: build the depth model.
S23: load the open-source AudioSet pre-trained weights into the depth model built in S22 and use them as initialization weights to guide the subsequent training.
S24: input the data prepared in S21 into the depth model initialized in S23 and perform multiple rounds (epochs) of training.
S25: after S24 finishes, save the network weights to disk for the subsequent inference process to call.
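A hedged sketch of S21–S25 follows, reusing the AudioCNN sketch above; the HDF5 key names, optimizer settings, epoch count and the pre-trained weight file are illustrative assumptions, and mini-batching is omitted for brevity.

import h5py
import torch
import torch.nn as nn

model = AudioCNN(n_labels=5)                                   # S22: build the model
pretrained = torch.load("audioset_pretrain.pth", map_location="cpu")
model.load_state_dict(pretrained, strict=False)                # S23: AudioSet init

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()                                         # multi-label targets

with h5py.File("train_data.h5", "r") as f:                     # S21: organized data
    waveforms = torch.tensor(f["waveform"][:]).float().unsqueeze(1)
    mels = torch.tensor(f["mel"][:]).float()
    labels = torch.tensor(f["label"][:]).float()

for epoch in range(30):                                        # S24: multi-epoch training
    optimizer.zero_grad()
    loss = loss_fn(model(waveforms, mels), labels)
    loss.backward()
    optimizer.step()

torch.save(model.state_dict(), "audiocnn_weights.pth")         # S25: save the weights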
(2) Use (inference) process. It mainly comprises weight loading, reading in preprocessed data and forward inference, specifically the following steps (a corresponding sketch is given after the list):
S26: load the weights saved in S25 into the model.
S27: read in the audio data to be detected after it has been processed by the data audio signal preprocessing module.
S28: feed the data read in S27 into the model prepared in S26 and perform forward inference.
S29: output the two-dimensional second-level labels for the subsequent post-processing module to use.
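The inference path (S26–S29) can be sketched as below, again reusing the preprocessing and AudioCNN sketches above; the file names remain hypothetical.

import torch

model = AudioCNN(n_labels=5)
model.load_state_dict(torch.load("audiocnn_weights.pth"))      # S26: load saved weights
model.eval()

audio, sr = separate_and_load("episode.mp4", "episode.wav")    # S27: preprocessed input
slices = slice_audio(audio, sr)
rows = []
with torch.no_grad():                                          # S28: forward inference
    for s in slices:
        wav = torch.tensor(s).float().view(1, 1, -1)
        mel = torch.tensor(mel_features(s, sr)).float().unsqueeze(0)
        rows.append(model(wav, mel).squeeze(0))
second_level_labels = torch.stack(rows)                        # S29: (seconds, labels)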
The above is the related content of the second-level tag identification module in this embodiment.
(III) post-processing module
The post-processing module in this embodiment is designed mainly according to the requirements of the service scenario. Its parameters can be flexibly adjusted for different service requirements, so that the computation result of the model is adapted to different application scenarios.
Specifically, referring to fig. 7, the overall design of the post-processing module in this embodiment is divided into two steps: first, the design of the recall mechanism for the second-level labels; second, the design of the sliding window that concatenates the second-level labels into segment-level labels. The two steps are as follows:
(1) Mark the viewpoint-label content on the original labels according to the service scenario, then determine the recall threshold of the viewpoint label from service-guided samples, and generate binarized second-level labels.
(2) Analyze the service scenario and the continuity of the sound, set the smoothing-window size, connect the binarized second-level labels into smoothed segment-level labels, and finally generate the audio viewpoint guidance segments.
Specifically, the depth model outputs, for each second of audio data in the original audio data, label values for a number of corresponding label categories. Each label value is compared with the determined recall threshold to decide whether that second of audio data carries the label, and the decision is represented by a binarized value: a value of 1 indicates that the second of audio data carries the class label, and a value of 0 indicates that it does not.
A smoothing step size and a smoothing-window size are then set according to the service scenario. The original audio data are slid over according to the smoothing step size, and at each sliding position the binarized values within one smoothing window are unified to obtain value-unified binarized values; sliding over the original audio data multiple times according to the smoothing step size yields a unified binarized value for each second of audio data. Finally, according to the continuity of these unified binarized values, the audio segment corresponding to a continuous run of the preset value is determined to be an audio highlight segment.
For example, when the binarized values of the seconds of a 9-second original audio signal are 111011000, they may be smoothed into 111111000 using the smoothing-window size and the smoothing step size, and the audio segment from the 1st to the 6th second of the original audio data is then an audio highlight segment.
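As an illustration of the recall-threshold and smoothing design, the sketch below binarizes per-second label scores, smooths them with a small majority-vote window (the embodiment only requires some value-unification rule, so the exact rule used here is an assumption), and groups consecutive 1s into highlight segments; run on the example above it reproduces the 1st-to-6th-second segment.

import numpy as np

def highlight_segments(scores, threshold=0.5, smooth_window=3):
    binary = (np.asarray(scores) >= threshold).astype(int)      # per-second 0/1 labels
    smoothed = binary.copy()
    for i in range(len(binary)):                                # fill gaps by majority vote
        lo = max(0, i - smooth_window // 2)
        hi = min(len(binary), i + smooth_window // 2 + 1)
        if binary[lo:hi].sum() * 2 > (hi - lo):
            smoothed[i] = 1
    segments, start = [], None                                  # group consecutive 1s
    for i, v in enumerate(smoothed):
        if v and start is None:
            start = i
        if start is not None and (not v or i == len(smoothed) - 1):
            segments.append((start + 1, i if not v else i + 1))  # 1-based seconds
            start = None
    return segments

print(highlight_segments([1, 1, 1, 0, 1, 1, 0, 0, 0]))           # [(1, 6)]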
By designing a recall mechanism and smoothing of the tags, audio highlight segments can be automatically generated in the original audio data.
The above is the relevant content of the post-processing module in this embodiment.
Having described the three modules of this embodiment in detail, the complete process of this embodiment based on these three modules is described below, and may include the following:
specifically, referring to fig. 8, as described above, the system of the present embodiment is mainly divided into three modules: the system comprises a data audio signal preprocessing module, a second-level label identification module and a service scene post-processing module. The audio signal preprocessing module executes the processes of loading an audio signal into a memory, acquiring one-dimensional audio data, extracting a Mel frequency spectrum and obtaining two-dimensional Mel characteristics; the second-level label identification module performs the processes of data enhancement, model training, model forward reasoning and second-level label identification; and the business scene post-processing module executes the processes of binarization and smoothing of the second-level label into a fragment label. Through the three modules, the viewpoint audio clip is finally obtained. The implementation sends the original audio frequency into the system, can realize automatic detection and write out the audio frequency viewpoint guide segment for logic superposition and use of subsequent services.
In addition, in summary, this embodiment has the following advantages over the prior art: (1) in the audio preprocessing, the spectrum of the high-frequency band is amplified according to the characteristics of the viewpoint audio, so that these characteristics are easier for the model to learn; (2) in the depth model, a shallow original-audio feature-extraction layer is embedded, so that the diversity of the audio features can be learned better, the model is not limited to the feature information provided by the Mel spectrum, and the generalization of the network is enhanced; (3) the post-processing is parameterized, and a reference parameter range can be given automatically without much manual intervention; (4) the audio labels are combined with the viewpoint requirements to complete the proposal of audio viewpoint labels.
In this embodiment, combining traditional feature processing with data enhancement compensates for the difficulty of effectively training a supervised model when data samples are insufficient; meanwhile, the existing model is improved for the task-positioning requirement, so that it is more reliable in short-duration audio scene analysis; finally, for the service scenario and the characteristics of the target sounds, a preprocessing method that expands a specific frequency band and a post-processing device that controls the smoothness of intervals are provided. This embodiment is applied to viewpoint clipping services and pure audio sharing services; it can automatically give viewpoint and highlight location information from the audio perspective, provides material for later clipping choices, and reduces labor cost.
In order to implement the method according to the embodiment of the present invention, an embodiment of the present invention further provides an audio highlight determining apparatus. As shown in fig. 9, the audio highlight determining apparatus 900 comprises an acquisition module 901, a slicing module 902, a feature extraction module 903, an audio recognition module 904 and a highlight generation module 905; wherein:
an obtaining module 901, configured to obtain original audio data;
a slicing module 902, configured to slice the original audio data to obtain a sliced audio, where the sliced audio is in units of seconds;
a feature extraction module 903, configured to perform feature extraction on the slice audio by using a mel feature extraction algorithm to obtain mel frequency spectrum features;
an audio identification module 904, configured to input the original audio data and the mel-frequency spectrum feature into a preset depth model for audio identification, and determine an audio tag corresponding to each second of audio data in the original audio data;
the highlight generation module 905 is configured to perform highlight generation processing on an audio tag corresponding to each second of audio data in the original audio data to obtain an audio highlight segment.
In practical applications, the obtaining module 901, the slicing module 902, the feature extracting module 903, the audio identifying module 904 and the highlight generating module 905 may be implemented by a processor in the audio highlight determining apparatus.
It should be noted that when the apparatus provided in the above embodiment performs its processing, the division into the above program modules is only an example; in practical applications, the processing may be distributed to different program modules as needed, that is, the internal structure of the terminal may be divided into different program modules to complete all or part of the processing described above. In addition, the apparatus provided by the above embodiment and the method embodiments belong to the same concept; its specific implementation process is described in the method embodiments and is not repeated here.
Based on the hardware implementation of the program module, in order to implement the method according to the embodiment of the present invention, an electronic device (computer device) is also provided in the embodiment of the present invention. Specifically, in one embodiment, the computer device may be a terminal, and its internal structure diagram may be as shown in fig. 10. The computer apparatus includes a processor a01, a network interface a02, a display screen a04, an input device a05, and a memory (not shown in the figure) connected through a system bus. Wherein processor a01 of the computer device is used to provide computing and control capabilities. The memory of the computer device comprises an internal memory a03 and a non-volatile storage medium a 06. The nonvolatile storage medium a06 stores an operating system B01 and a computer program B02. The internal memory a03 provides an environment for the operation of the operating system B01 and the computer program B02 in the nonvolatile storage medium a 06. The network interface a02 of the computer device is used for communication with an external terminal through a network connection. The computer program is executed by the processor a01 to implement the method of any of the above embodiments. The display screen a04 of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device a05 of the computer device may be a touch layer covered on the display screen, a button, a trackball or a touch pad arranged on a casing of the computer device, or an external keyboard, a touch pad or a mouse.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
The device provided by the embodiment of the present invention includes a processor, a memory, and a program stored in the memory and capable of running on the processor, and when the processor executes the program, the method according to any one of the embodiments described above is implemented.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media (transitory media) such as modulated data signals and carrier waves.
It will be appreciated that the memory of embodiments of the invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferromagnetic random access Memory (FRAM), a Flash Memory, a magnetic surface Memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface storage may be disk storage or tape storage. The volatile Memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory described for embodiments of the present invention is intended to comprise, without being limited to, these and any other suitable types of memory.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for audio highlight determination, the method comprising:
acquiring original audio data;
slicing the original audio data to obtain sliced audio, wherein the sliced audio takes seconds as a unit;
carrying out feature extraction on the slice audio by utilizing a Mel feature extraction algorithm to obtain Mel spectral features;
inputting the original audio data and the Mel frequency spectrum characteristics into a preset depth model for audio identification, and determining an audio label corresponding to each second of audio data in the original audio data;
and performing highlight generation processing on the audio label corresponding to the audio data per second in the original audio data to obtain an audio highlight segment.
2. The method of claim 1, wherein the slicing the original audio data to obtain sliced audio comprises:
acquiring a set sliding window step length and a window size;
segmenting the original audio data into a plurality of one-dimensional arrays according to the sliding window step length and the window size;
acquiring the number of the plurality of one-dimensional arrays and the number of seconds of the original audio data;
copying a preset number of the first array and/or the second array at the head and/or the tail of the plurality of one-dimensional arrays according to the number of the plurality of one-dimensional arrays and the number of seconds of the original audio data, so that the number of the one-dimensional arrays is the same as the number of seconds of the original audio data; wherein the first array is the first one-dimensional array obtained when the original audio data are segmented, and the second array is the last one-dimensional array obtained when the original audio data are segmented;
and taking the plurality of one-dimensional arrays after the head-and-tail copying as the sliced audio of the original audio data.
3. The method of claim 1, wherein the performing feature extraction on the slice audio by using a mel-frequency feature extraction algorithm to obtain mel-frequency spectral features comprises:
pre-emphasis processing is carried out on the slice audio to obtain first audio data;
performing framing and windowing operation on the first audio data to obtain second audio data;
performing discrete Fourier transform on the second audio data to obtain frequency domain data;
and identifying the frequency domain data by using a Mel filter to obtain Mel frequency spectrum characteristics.
4. The method of claim 1, wherein the inputting the original audio data and the mel-frequency spectrum features into a preset depth model for audio recognition, and the determining the audio label corresponding to each second of audio data in the original audio data comprises:
performing feature extraction on the original audio data by using a shallow feature extraction module in the depth model to obtain a first feature;
performing data enhancement on the Mel frequency spectrum characteristic by using the depth model to obtain a second characteristic;
and performing feature processing on the first feature and the second feature in the depth model, and determining an audio label corresponding to each second of audio data in the original audio data.
5. The method of claim 4, wherein the performing feature processing on the first feature and the second feature in the depth model and determining the audio tag corresponding to each second of audio data in the original audio data comprises:
respectively connecting the first feature and the second feature to obtain a three-dimensional feature map;
performing feature extraction on the three-dimensional feature map on a time domain dimension by using a one-dimensional convolutional layer to obtain three-dimensional time domain features on the time domain dimension;
performing feature extraction on the three-dimensional time domain features on time domain and space dimensions by using the two-dimensional convolution layer to obtain the three-dimensional time domain and space features on the time domain and space dimensions;
performing characteristic redundancy reduction and characteristic size unification processing on the three-dimensional time domain and space characteristics to obtain unified size characteristics;
and carrying out space mapping and label discrimination on the uniform size characteristics to obtain an audio label corresponding to each second of audio data in the original audio data.
6. The method of claim 5, wherein the performing spatial mapping and label discrimination on the uniform-size features to obtain an audio label corresponding to each second of audio data in original audio data comprises:
performing space mapping on the uniform size features to obtain label numerical values corresponding to the uniform size features;
acquiring a set recall threshold;
comparing the label numerical value with the recall threshold value, and judging a label binary numerical value corresponding to the label numerical value;
and judging whether the audio data per second corresponding to the label numerical value has an audio label corresponding to the label numerical value according to the label numerical value.
7. The method of claim 6, wherein the performing highlight generation processing on the audio tag corresponding to each second of audio data in the original audio data to obtain an audio highlight segment comprises:
obtaining a smooth step length and a smooth window size;
sliding the original audio data according to the smooth step length, and performing numerical value unification processing on the binarized numerical value of the original audio data with the size of the smooth window once every sliding to obtain the binarized numerical value with the unified numerical value;
sliding the original audio data for multiple times according to the smoothing step length to obtain a unified binary numerical value of the audio data per second in the original audio data;
and determining the audio frequency segment corresponding to the binarization numerical value of the continuous preset numerical value as an audio wonderful point segment according to the continuity of the binarization numerical value after the numerical value of the audio frequency data per second in the original audio frequency data is unified.
8. An audio highlight determination apparatus, characterized in that the audio highlight determination apparatus comprises:
the acquisition module is used for acquiring original audio data;
the slicing module is used for slicing the original audio data to obtain sliced audio, and the sliced audio takes seconds as a unit;
the characteristic extraction module is used for carrying out characteristic extraction on the slice audio by utilizing a Mel characteristic extraction algorithm to obtain Mel spectral characteristics;
the audio identification module is used for inputting the original audio data and the Mel frequency spectrum characteristics into a preset depth model for audio identification, and determining an audio label corresponding to each second of audio data in the original audio data;
and the highlight generation module is used for performing highlight generation processing on the audio label corresponding to the audio data per second in the original audio data to obtain an audio highlight segment.
9. An electronic device, comprising: a processor and a memory for storing a computer program capable of running on the processor; wherein
the processor is adapted to perform the steps of the method of any one of claims 1 to 7 when running the computer program.
10. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the method of any one of claims 1 to 7.
CN202111418361.4A 2021-11-25 2021-11-25 Audio highlight determination method and device, electronic equipment and storage medium Pending CN114155884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111418361.4A CN114155884A (en) 2021-11-25 2021-11-25 Audio highlight determination method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114155884A true CN114155884A (en) 2022-03-08

Family

ID=80458084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111418361.4A Pending CN114155884A (en) 2021-11-25 2021-11-25 Audio highlight determination method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114155884A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination