CN116092521A - Feature frequency point recognition model training and audio fingerprint recognition method, device and product - Google Patents

Info

Publication number
CN116092521A
Authority
CN
China
Prior art keywords: audio, song, characteristic frequency, signal, noise
Legal status: Pending (assumed; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number
CN202211094118.6A
Other languages
Chinese (zh)
Inventor
孔令城
胡诗超
谭志力
陈颖
Current Assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202211094118.6A
Publication of CN116092521A

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/51 — Speech or voice analysis techniques, not restricted to a single one of groups G10L15/00–G10L21/00, specially adapted for comparison or discrimination
    • G10L21/0232 — Speech enhancement; noise filtering with processing in the frequency domain
    • G10L25/18 — Speech or voice analysis characterised by the extracted parameters being spectral information of each sub-band
    • G10L25/30 — Speech or voice analysis characterised by the analysis technique using neural networks
    • Y02T90/00 — Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The application relates to the field of audio processing and provides a training method for a characteristic frequency point identification model, an audio fingerprint identification method, a computer device, and a computer program product, which can improve the accuracy of the characteristic frequency points and audio fingerprints obtained by identification. The method comprises the following steps: acquiring noisy song audio of original song audio, where the noisy song signal of the noisy song audio comprises a noise signal and the original song signal of the original song audio; determining reference characteristic frequency points in the frequency domain of the original song signal; inputting the noisy song signal into a neural network model to be trained, and obtaining, through the neural network model, predicted characteristic frequency points in the frequency domain associated with the original song signal within the noisy song signal; and adjusting model parameters of the neural network model based on the difference between the predicted characteristic frequency points and the reference characteristic frequency points until a training end condition is met, so as to obtain a trained characteristic frequency point identification model.

Description

Feature frequency point recognition model training and audio fingerprint recognition method, device and product
Technical Field
The present application relates to the field of audio processing, and in particular, to a training method for a feature frequency point recognition model, an audio fingerprint recognition method, a computer device, and a computer program product.
Background
With the continuous development of audio processing technology, audio matching is applied to more and more types of audio, and audio fingerprints are widely used in the field of audio matching.
In the related art, characteristic frequency points having specified characteristics can be identified in an audio signal, and an audio fingerprint of the audio signal can be obtained from those characteristic frequency points. However, when interference noise is present in the audio, the audio fingerprint obtained in this way is often inaccurate, and the identification accuracy is low.
Disclosure of Invention
Based on this, to solve the above technical problems, it is necessary to provide a training method for a characteristic frequency point identification model, an audio fingerprint identification method, a computer device, and a computer program product.
In a first aspect, the present application provides a training method for a feature frequency point identification model. The method comprises the following steps:
acquiring noisy song audio of original song audio, where the noisy song signal of the noisy song audio comprises a noise signal and the original song signal of the original song audio;
determining reference characteristic frequency points in the frequency domain of the original song signal;
inputting the noisy song signal into a neural network model to be trained, and obtaining, through the neural network model, predicted characteristic frequency points in the frequency domain associated with the original song signal within the noisy song signal; and
adjusting model parameters of the neural network model based on the difference between the predicted characteristic frequency points and the reference characteristic frequency points until a training end condition is met, so as to obtain a trained characteristic frequency point identification model.
In a second aspect, the present application further provides an audio fingerprint identification method. The method comprises the following steps:
acquiring target song audio whose audio fingerprint is to be identified;
inputting the audio signal of the target song audio into a trained characteristic frequency point identification model to obtain a plurality of characteristic frequency points, in the frequency domain, of the audio signal of the target song audio output by the model, where the characteristic frequency point identification model is obtained according to the above training method for the characteristic frequency point identification model; and
determining an audio fingerprint of the target song audio based on the plurality of characteristic frequency points.
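This summary does not specify how the fingerprint is constructed from the characteristic frequency points. A common construction (Shazam-style, not stated in the application) hashes pairs of nearby peaks; the sketch below uses an illustrative `fan_out` parameter and hash truncation, both of which are assumptions:

```python
import hashlib

def fingerprint_hashes(peaks, fan_out=3):
    """Hash pairs of characteristic frequency points into fingerprint codes.

    `peaks` is a list of (time_index, freq_bin) tuples. Each anchor peak is
    paired with the next `fan_out` peaks, and the pair (f1, f2, time delta)
    is hashed. The pairing scheme and hash length are illustrative choices,
    not details taken from the patent.
    """
    peaks = sorted(peaks)
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            code = hashlib.sha1(f"{f1}|{f2}|{t2 - t1}".encode()).hexdigest()[:10]
            hashes.append((code, t1))  # code plus anchor time for database lookup
    return hashes

demo = fingerprint_hashes([(0, 40), (3, 52), (5, 40), (9, 71)])
```

Matching can then be done by looking up the codes in an index of known songs and checking for a consistent time offset between anchor times.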
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring noisy song audio of original song audio, where the noisy song signal of the noisy song audio comprises a noise signal and the original song signal of the original song audio;
determining reference characteristic frequency points in the frequency domain of the original song signal;
inputting the noisy song signal into a neural network model to be trained, and obtaining, through the neural network model, predicted characteristic frequency points in the frequency domain associated with the original song signal within the noisy song signal; and
adjusting model parameters of the neural network model based on the difference between the predicted characteristic frequency points and the reference characteristic frequency points until a training end condition is met, so as to obtain a trained characteristic frequency point identification model.
In a fourth aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring target song audio whose audio fingerprint is to be identified;
inputting the audio signal of the target song audio into a trained characteristic frequency point identification model to obtain a plurality of characteristic frequency points, in the frequency domain, of the audio signal of the target song audio output by the model, where the characteristic frequency point identification model is obtained according to the above training method for the characteristic frequency point identification model; and
determining an audio fingerprint of the target song audio based on the plurality of characteristic frequency points.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring noisy song audio of original song audio, where the noisy song signal of the noisy song audio comprises a noise signal and the original song signal of the original song audio;
determining reference characteristic frequency points in the frequency domain of the original song signal;
inputting the noisy song signal into a neural network model to be trained, and obtaining, through the neural network model, predicted characteristic frequency points in the frequency domain associated with the original song signal within the noisy song signal; and
adjusting model parameters of the neural network model based on the difference between the predicted characteristic frequency points and the reference characteristic frequency points until a training end condition is met, so as to obtain a trained characteristic frequency point identification model.
In a sixth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring target song audio whose audio fingerprint is to be identified;
inputting the audio signal of the target song audio into a trained characteristic frequency point identification model to obtain a plurality of characteristic frequency points, in the frequency domain, of the audio signal of the target song audio output by the model, where the characteristic frequency point identification model is obtained according to the above training method for the characteristic frequency point identification model; and
determining an audio fingerprint of the target song audio based on the plurality of characteristic frequency points.
According to the training method for the characteristic frequency point identification model, the audio fingerprint identification method, the computer device, and the computer program product described above, noisy song audio of original song audio can be obtained, where the noisy song signal of the noisy song audio comprises a noise signal and the original song signal of the original song audio. Reference characteristic frequency points in the frequency domain of the original song signal can then be determined; the noisy song signal is input into a neural network model to be trained; predicted characteristic frequency points in the frequency domain associated with the original song signal within the noisy song signal are obtained through the neural network model; and the model parameters are adjusted based on the difference between the predicted and reference characteristic frequency points until a training end condition is met, yielding a trained characteristic frequency point identification model. In this scheme, the neural network model identifies predicted characteristic frequency points during training from a noisy song signal that contains both a noise signal and the original song signal, and learns from the reference characteristic frequency points of the original song signal. The model can therefore reject the characteristic frequency points associated with the noise signal in the input song signal and retain only those associated with the original song signal, effectively eliminating the interference frequency points produced by the noise signal. This improves the reliability and accuracy of the identified characteristic frequency points and, in turn, the identification accuracy of the audio fingerprint.
Drawings
FIG. 1 is a flow chart of a training method of a feature frequency point identification model in an embodiment;
FIG. 2 is a flowchart illustrating steps for obtaining audio of a song with noise in one embodiment;
FIG. 3 is a flowchart illustrating steps for obtaining spectrum information in one embodiment;
FIG. 4 is a schematic diagram of spectral information of noisy song audio in one embodiment;
FIG. 5 is a diagram of an application environment for an audio fingerprinting method according to an embodiment;
FIG. 6 is a flowchart of an audio fingerprint recognition method according to an embodiment;
fig. 7 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The present application provides a training method for a characteristic frequency point identification model, which can be executed by a computer device such as a terminal or a server. The terminal can be, but is not limited to, a personal computer, a notebook computer, or a tablet computer; the server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers. In one embodiment, as shown in Fig. 1, a training method for a characteristic frequency point identification model is provided, and the method may include the following steps:
S101, obtaining noisy song audio of original song audio; the noisy song signal of the noisy song audio includes the noise signal and the original song signal of the original song audio.
In practical applications, noisy song audio of the original song audio may be obtained.
The original song audio may be song audio obtained from a preset song library, whose audio signal does not include a noise signal of a preset type; the noisy song audio may be obtained by adding a noise signal of a preset type to the original song audio. In other words, the audio signal of the noisy song audio includes a noise signal in addition to the audio signal of the original song audio.
S102, determining a reference characteristic frequency point under the frequency domain of the original song signal.
As an example, a characteristic frequency point may be a frequency point having a preset frequency point characteristic, and the image formed by a plurality of characteristic frequency points in a two-dimensional (time-frequency) plane may also be called a constellation diagram. To distinguish characteristic frequency points extracted from the original song signal from those extracted from the noisy song signal, the former may be called reference characteristic frequency points.
The frequency point characteristic may be a property of the frequency point's energy in the frequency domain. Illustratively, the preset frequency point characteristic may be at least one of the following: the frequency-domain energy exceeds a preset energy threshold, or the frequency-domain energy variation of the frequency point exceeds a preset variation threshold. The frequency-domain energy variation may be the change of one frequency point's energy compared with that of one or more adjacent frequency points, expressed as an absolute value or a relative value (e.g., a percentage). In an example, if the frequency point characteristic used to screen characteristic frequency points is that the frequency-domain energy variation exceeds the preset variation threshold, the screened characteristic frequency points may also be called local peak points.
In practical application, after the original song signal of the original song audio is obtained, the characteristic frequency points of the original song signal in the frequency domain can be determined. For example, the variation of sound intensity over time in the original song audio may be obtained as the original song signal in the time domain, and a time-frequency conversion may be performed on it to obtain the original song signal in the frequency domain. The frequency-domain signal may be spectrum information (such as a spectrogram) of the original song audio, containing a plurality of frequency points at different frequencies; the frequency points having the preset frequency point characteristic may then be screened out and used as the characteristic frequency points of the original song signal in the frequency domain.
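As an illustrative sketch (not part of the application), the time-frequency conversion and peak screening described above can be implemented with a framed FFT and a simple local-maximum test. The frame length, hop size, and energy floor are assumed parameters:

```python
import numpy as np

def characteristic_frequency_points(signal, frame_len=256, hop=128,
                                    energy_floor=1e-3):
    """Screen characteristic frequency points (local peak points) from a
    magnitude spectrogram. Parameter names and thresholds are illustrative."""
    # Time-frequency conversion: framed FFT (a simple STFT with a Hann window).
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    spec = np.abs(np.stack([
        np.fft.rfft(window * signal[i * hop : i * hop + frame_len])
        for i in range(n_frames)
    ]))  # shape: (n_frames, frame_len // 2 + 1)
    peaks = []
    for t in range(1, n_frames - 1):
        for f in range(1, spec.shape[1] - 1):
            e = spec[t, f]
            # A frequency point qualifies if its energy exceeds the floor and
            # it is a local maximum over its frequency-bin neighbours.
            if e > energy_floor and e > spec[t, f - 1] and e > spec[t, f + 1]:
                peaks.append((t, f))
    return spec, peaks

# A pure 1 kHz tone at 8 kHz sampling should concentrate its peaks in one bin
# (1000 Hz * 256 / 8000 Hz = bin 32).
sr = 8000.0
t = np.arange(8000) / sr
spec, peaks = characteristic_frequency_points(np.sin(2 * np.pi * 1000 * t))
```

A production implementation would typically also enforce the time-axis variation test from the text and a minimum distance between retained peaks.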
S103, inputting the noisy song signal into a neural network model to be trained, and obtaining, through the neural network model, predicted characteristic frequency points in the frequency domain associated with the original song signal within the noisy song signal.
S104, adjusting model parameters of the neural network model based on the difference between the predicted characteristic frequency points and the reference characteristic frequency points until a training end condition is met, to obtain a trained characteristic frequency point identification model.
In a specific implementation, characteristic frequency points of the noisy song signal in the frequency domain can be obtained. In the related art, frequency points having the preset frequency point characteristic could be selected from the frequency-domain noisy song signal as its characteristic frequency points, but such extraction is indiscriminate: it does not consider whether an extracted frequency point comes from the original song signal or from the noise signal, and any frequency point with the preset characteristic is extracted. It can be understood that if the noise signal also contains frequency points with the preset characteristic, the related-art extraction method will wrongly extract them and introduce interference, so that the resulting characteristic frequency points include not only those of the original song signal but also those of the noise signal, making it difficult to accurately reflect the characteristics of the original song signal.
Based on the above, in this method the noisy song signal can be input into the neural network model to be trained, and the neural network model obtains the characteristic frequency points in the frequency domain associated with the original-song portion of the noisy song signal; these may be called predicted characteristic frequency points, and a plurality of them can form the constellation diagram of the noisy song signal. After the predicted characteristic frequency points are obtained, the model parameters of the neural network model can be adjusted based on the difference between the predicted and reference characteristic frequency points until the training end condition is met, yielding a trained characteristic frequency point identification model.
Specifically, this embodiment takes the reference characteristic frequency points of the original song signal as the learning target of the neural network model, introduces a noise signal into the original song signal as an interference factor, and inputs the noisy song signal containing both into the model. During training, the neural network model thus learns to reject information related to the noise signal in the noisy song signal and to screen characteristic frequency points from the retained audio signal, obtaining predicted characteristic frequency points only for the original-song portion of the noisy song signal.
In this embodiment, noisy song audio of original song audio may be obtained, where the noisy song signal comprises a noise signal and the original song signal of the original song audio. Reference characteristic frequency points in the frequency domain of the original song signal are then determined; the noisy song signal is input into the neural network model to be trained; predicted characteristic frequency points in the frequency domain associated with the original song signal within the noisy song signal are obtained through the neural network model; and the model parameters are adjusted based on the difference between the predicted and reference characteristic frequency points until the training end condition is met, yielding a trained characteristic frequency point identification model. Because the neural network model identifies predicted characteristic frequency points during training from a noisy song signal containing both a noise signal and the original song signal, and learns from the reference characteristic frequency points of the original song signal, the model can reject the characteristic frequency points associated with the noise signal and retain only those associated with the original song signal. Interference frequency points produced by the noise signal are thus effectively eliminated, improving the reliability and accuracy of the identified characteristic frequency points and, in turn, the identification accuracy of the audio fingerprint.
In one embodiment, in S104, adjusting the model parameters of the neural network model based on the difference between the predicted and reference characteristic frequency points until the training end condition is met, to obtain a trained characteristic frequency point identification model, may include the following steps:
determining a model loss value based on the difference between the predicted characteristic frequency points and the reference characteristic frequency points, where the model loss value is positively correlated with the difference; and adjusting the model parameters of the neural network model according to the model loss value until the training end condition is met, to obtain a trained characteristic frequency point identification model.
As an example, the difference may include at least one of: a difference in frequency point positions and a difference in the number of frequency points. The frequency point positions may include absolute positions and/or relative positions; for example, if predicted characteristic frequency points A and B are identified, the position difference between predicted point A and reference point A' may be obtained, as may the difference between the relative position of predicted points A and B and that of reference points A' and B'.
In practical application, the difference between the predicted and reference characteristic frequency points may be obtained, and the model loss value determined from it; the model loss value is positively correlated with this difference, so a larger difference yields a larger loss. In an example, the model loss value may be computed with a mean square error loss function (MSELoss).
After the model loss value is obtained, the model parameters of the neural network model can be adjusted according to it until the training end condition is met; for example, when the model loss value falls below a threshold or the number of iterations reaches a preset count, training can be considered complete and the current neural network model used as the trained characteristic frequency point identification model.
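A minimal numpy sketch of the loss and stopping condition described above, assuming the characteristic frequency points are represented as binary time-frequency maps (a representation the summary does not fix; the thresholds are also illustrative):

```python
import numpy as np

def mse_loss(predicted_map, reference_map):
    """Mean square loss between predicted and reference characteristic
    frequency points, represented here as binary time-frequency maps.
    The loss grows with the difference (positive correlation)."""
    diff = predicted_map.astype(float) - reference_map.astype(float)
    return float(np.mean(diff ** 2))

def training_finished(loss, step, loss_threshold=1e-3, max_steps=10000):
    """Training-end condition: loss below a threshold, or the iteration
    count reaching a preset number."""
    return loss < loss_threshold or step >= max_steps

# Tiny demonstration: a perfect prediction gives zero loss; moving one
# peak to a neighbouring bin gives a strictly larger loss.
ref = np.zeros((4, 8)); ref[1, 3] = ref[2, 5] = 1.0
perfect = mse_loss(ref, ref)
off = ref.copy(); off[1, 3] = 0.0; off[1, 4] = 1.0
worse = mse_loss(off, ref)
```

In an actual training loop, the gradient of this loss with respect to the model parameters would drive the parameter updates.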
In this embodiment, the smaller the difference between the predicted and reference characteristic frequency points, the smaller the model loss value. By determining the loss from this difference and adjusting the model parameters in the direction that reduces the loss, the predicted characteristic frequency points that the neural network model identifies from the input noisy song signal become, over the course of training, increasingly similar to the reference characteristic frequency points identified from the original song signal. When an audio signal contains both a song signal and an interfering noise signal, the model can therefore effectively remove the characteristic frequency points of the noise signal, which are irrelevant to the song, and correctly retain those of the song-signal portion.
In one embodiment, the characteristic frequency points are local peak points. As shown in Fig. 2, obtaining the noisy song audio of the original song audio in S101 may include the following steps:
S201, acquiring a noise signal of non-stationary noise, where the frequency points in the frequency domain of the noise signal include at least one local peak point.
In practical application, suppose the characteristic frequency points to be identified from the original song signal are local peak points. Because the frequency-domain energy of stationary noise is relatively stable, no interfering local peak points appear on the spectrogram; when an audio signal contains both a stationary-noise signal and a song signal, the stationary noise does not prevent a terminal or server from accurately identifying the local peak points associated with the song signal. In other words, against stationary noise, local peak points already have relatively good noise resistance as characteristic frequency points.
However, a non-stationary noise signal contains at least one local peak point among its frequency points in the frequency domain; that is, at least one frequency point of the non-stationary noise may have the same frequency point characteristic as the characteristic frequency points of the song signal. Some frequency points of the non-stationary noise may therefore be wrongly identified as characteristic frequency points of the song signal, degrading the identification accuracy of the characteristic frequency points and the audio fingerprint. For this reason, a noise signal of non-stationary noise is acquired in this embodiment.
S202, fusing the noise signal with the original song signal of original song audio in a song library, and obtaining the noisy song audio of the original song audio based on the fusion result.
After the non-stationary noise signal is obtained, the original song audio can be taken from the song library, the noise signal fused with the original song signal of the original song audio, and the noisy song audio then obtained from the signal fusion result.
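The summary does not specify how the fusion is performed. A common recipe, sketched below, scales the noise to a target signal-to-noise ratio before adding it; the SNR value and helper name are assumptions, not details from the application:

```python
import numpy as np

def mix_at_snr(song, noise, snr_db=10.0):
    """Fuse a non-stationary noise signal with an original song signal at a
    target signal-to-noise ratio (an illustrative fusion scheme)."""
    noise = np.resize(noise, song.shape)        # loop/trim noise to song length
    p_song = np.mean(song ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12       # guard against silent noise
    scale = np.sqrt(p_song / (p_noise * 10 ** (snr_db / 10.0)))
    return song + scale * noise

rng = np.random.default_rng(0)
song = np.sin(2 * np.pi * 440 * np.arange(4000) / 8000)
noisy = mix_at_snr(song, rng.standard_normal(1000), snr_db=10.0)
# Achieved SNR of the mixture, for verification.
achieved = 10 * np.log10(np.mean(song ** 2) / np.mean((noisy - song) ** 2))
```

Sampling the SNR randomly per training example is a common way to broaden the noise conditions the model sees.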
In this embodiment, the non-stationary noise signal is fused with the original song signal of the original song audio in the song library, and the noisy song audio is obtained from the fusion result. Training the neural network model on data augmented with non-stationary noise lets the model learn, during training, to identify and ignore the local peak points of the non-stationary noise, which effectively improves the accuracy of the identified local peak points when the audio signal contains a non-stationary noise signal.
In one embodiment, S201 acquires a noise signal of non-stationary noise, which may include the steps of:
acquiring preset non-stationary noise of various types; at least one type of non-stationary noise is randomly determined from among a plurality of types of non-stationary noise, and a noise signal of the at least one type of non-stationary noise is acquired.
In practical applications, a plurality of different types of non-stationary noise may be collected in advance. The plurality of types may include at least two of the following: speech, transient noise, and environmental noise, where environmental noise may include indoor environmental noise and/or outdoor environmental noise, such as environmental noise in public transportation scenarios (e.g., noise generated while vehicles, aircraft, or high-speed trains are running or boarding/alighting) or environmental noise inside a shopping mall.
When noisy song audio of the original song audio needs to be generated, at least one type of non-stationary noise can be randomly determined from a plurality of types of non-stationary noise, and a noise signal of the determined non-stationary noise is acquired. Specifically, for example, a plurality of types of non-stationary noise may be randomly combined to obtain noise signals of transient noise and outdoor environmental noise.
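The random selection described above can be sketched as follows. The noise catalogue, file names, and function name are hypothetical placeholders for illustration, not part of this disclosure: at least one noise type is drawn at random from a preset library, and one clip is then drawn per chosen type.

```python
import random

# Hypothetical catalogue of pre-collected non-stationary noise clips,
# keyed by type; the file names are illustrative placeholders.
NOISE_LIBRARY = {
    "speech":    ["speech_01.wav", "speech_02.wav"],
    "transient": ["door_slam.wav", "keyboard.wav"],
    "indoor":    ["mall_crowd.wav"],
    "outdoor":   ["subway.wav", "aircraft.wav"],
}

def pick_noise_clips(library, seed=None):
    """Randomly choose at least one noise type, then one clip per chosen type."""
    rng = random.Random(seed)
    k = rng.randint(1, len(library))       # how many distinct types to combine
    types = rng.sample(list(library), k)   # at least one type, no repeats
    return [(t, rng.choice(library[t])) for t in types]

clips = pick_noise_clips(NOISE_LIBRARY, seed=42)
```

Because both the number of types and the clips themselves are sampled randomly, repeated calls yield varied noise combinations, matching the randomness and diversity goal stated above.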
In this embodiment, a plurality of preset types of non-stationary noise can be obtained, at least one type of non-stationary noise is randomly determined from the plurality of types of non-stationary noise, and a noise signal of the at least one type of non-stationary noise is obtained, so that randomness and diversity of the noise signal of the non-stationary noise contained in the song signal with noise can be increased, the feature frequency point identification model obtained through final training can effectively remove feature frequency points of different types of non-stationary noise, and robustness of the feature frequency point identification model is improved.
In one embodiment, the step S202 of fusing the noise signal and the original song signal of the original song audio in the song library, and obtaining the noisy song audio of the original song audio based on the fusion result may be implemented by the following steps:
randomly adding a noise signal of non-stationary noise into an original song signal of original song audio to obtain a fused audio signal; and acquiring the noisy song audio of the original song audio based on the fused audio signal.
After the noise signal of the non-stationary noise is obtained, the noise signal of the non-stationary noise may be randomly added to the original song signal of the original song audio.
Specifically, when adding the noise signal of the non-stationary noise to the original song signal, the two signals may be processed in the time domain. During this processing, the noise signal may be randomly superimposed on the original song signal; for example, superimposing noise signal A of the non-stationary noise on original song signal B yields the superimposed audio signal C. Alternatively, the noise signal may be inserted into the original song signal, for example inserting noise signal A between the original song signal segments B1 and B2.
After randomly adding the noise signal to the original song signal, a fused audio signal can be obtained, and the fused audio signal can be an audio signal in a time domain, so that the noisy song audio of the original song audio can be obtained based on the fused audio signal.
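The random superposition and insertion described above can be sketched in the time domain as follows. This is a minimal illustrative example, not the claimed implementation; the function name, mode names, and random-offset policy are assumptions.

```python
import numpy as np

def fuse_noise(song, noise, mode="overlay", rng=None):
    """Fuse a non-stationary noise signal into a song signal in the time domain.

    mode="overlay": superimpose the noise at a random offset inside the song.
    mode="insert":  splice the noise into the song at a random cut point.
    """
    rng = rng or np.random.default_rng()
    if mode == "overlay":
        mixed = song.copy()
        start = int(rng.integers(0, max(1, len(song) - len(noise))))
        # Add as much of the noise as fits after the random offset.
        mixed[start:start + len(noise)] += noise[:len(song) - start]
        return mixed
    # "insert": split the song at a random point and splice the noise between
    cut = int(rng.integers(0, len(song)))
    return np.concatenate([song[:cut], noise, song[cut:]])
```

Note that overlay keeps the song's duration while insert lengthens it by the noise duration; a real pipeline would also normalise levels (e.g., to a target SNR), which is omitted here.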
In this embodiment, by randomly adding the noise signal of the non-stationary noise to the original song signal of the original song audio to obtain the fused audio signal, and obtaining the noisy song audio of the original song audio based on the fused audio signal, the randomness and diversity of the occurrence of the non-stationary noise in the noisy song audio can be increased, so that the characteristic frequency point identification model can more accurately identify and reject the characteristic frequency points of the noise signals at different positions in the audio signal.
In one embodiment, as shown in fig. 3, S103 inputting the noisy song signal into the neural network model to be trained may include the steps of:
s301, frame frequency information of a plurality of audio frames of the song audio with noise is obtained.
In practical application, after obtaining the song audio with noise, a plurality of audio frames of the song audio with noise can be obtained, and time-frequency conversion processing is performed on each audio frame, so that an audio signal in the time domain of each audio frame is converted into an audio signal in the frequency domain, and frame frequency information of each audio frame can be obtained.
S302, according to the respective frame sequence of the plurality of audio frames, frame frequency information of the plurality of audio frames is spliced, and frequency spectrum information of the song audio with noise is obtained.
As an example, the frame order may be the order of the framed audio frames in the noisy song audio.
After frame frequency information of a plurality of audio frames is obtained, respective frame orders of the plurality of audio frames can be obtained, and the frame frequency information of the plurality of audio frames can be spliced in sequence according to the frame orders, so that the splicing result can be used as frequency spectrum information of song audio with noise.
In a specific implementation, the plurality of audio frames are spliced with the frame order (or the time) of the audio frames on the horizontal axis and frequency on the vertical axis; fig. 4 shows the spectrum information obtained by splicing a plurality of audio frames in this way.
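Steps S301 and S302 can be sketched as follows. This assumes 512-sample frames, a one-sided FFT magnitude as the frame frequency information, and a Hann window; these concrete choices are illustrative assumptions, not requirements of the method.

```python
import numpy as np

def spectrogram(signal, frame_len=512):
    """Frame the time-domain signal, transform each frame to the frequency
    domain, and splice the per-frame spectra in frame order (frames on the
    horizontal axis, frequency bins on the vertical axis)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    # One-sided magnitude spectrum per frame (the frame frequency information).
    frame_ffts = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    return frame_ffts.T  # shape: (frame_len // 2 + 1, n_frames)

sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))  # 1 kHz tone, 1 s at 8 kHz
```

For the 1 kHz test tone at an 8 kHz sampling rate, the energy concentrates in bin 64 of each frame (1000 / (8000 / 512) = 64), so every spliced column peaks at the same frequency bin.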
S303, inputting the spectrum information of the song audio with noise into a neural network model to be trained.
As an example, the neural network model may be a residual network (ResNet), such as ResNet-18, ResNet-34, ResNet-50, or ResNet-152; other types of neural network models may of course be selected according to the actual implementation.
After the spectrum information of the noisy song audio is obtained, it may be input into the neural network model to be trained.
In this embodiment, frame frequency information of a plurality of audio frames of the noisy song audio may be obtained, the frame frequency information of the plurality of audio frames may be spliced according to the respective frame orders of the audio frames to obtain the spectrum information of the noisy song audio, and this spectrum information may be input into the neural network model to be trained. The neural network model can thus screen out characteristic frequency points using the complete spectrum information of the noisy song audio, which improves the screening accuracy of the characteristic frequency points.
The embodiment of the application also provides an audio fingerprint identification method, which can be applied to the application environment shown in fig. 5. The application environment comprises a terminal and a server, which can be connected through network communication. The server can be deployed with a characteristic frequency point identification model; the terminal can acquire the target song audio whose audio fingerprint is to be identified and send it to the server, and the server identifies the audio fingerprint corresponding to the target song audio.
It can be appreciated that the above application scenario is only an example, and does not constitute a limitation on the audio fingerprint identification method provided in the embodiments of the present application. For example, the feature frequency point recognition model may be deployed on a terminal, where after the terminal obtains the target audio provided by the user, the terminal may recognize an audio fingerprint of the target audio by using the audio fingerprint recognition method provided by the embodiment, and send the audio fingerprint to the server for related song retrieval.
The server in this embodiment may be an independent physical server, a server cluster formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud computing, cloud databases, cloud storage, and CDN. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart watch, and the like.
In one embodiment, as shown in fig. 6, an audio fingerprint recognition method is provided, which is illustrated by taking the application of the method to the server in fig. 5 as an example, and may include the following steps:
s601, target song audio of the audio fingerprint to be identified is acquired.
In a specific implementation, the target song audio of the audio fingerprint to be identified may be obtained, where the audio signal of the target song audio may include a noise signal in addition to the song signal, and of course, the target song audio may also include no noise signal.
The target song audio may be a humming or cover song recorded by the user, and the user may upload the target song audio through the terminal to trigger a search for songs that are associated with the target song audio and meet a preset condition.
S602, inputting the audio signal of the target song audio to the trained characteristic frequency point identification model to obtain a plurality of characteristic frequency points of the audio signal of the target song audio output by the characteristic frequency point identification model under the frequency domain.
After the target song audio is obtained, the audio signal of the target song audio can be input into a trained characteristic frequency point identification model, and a plurality of characteristic frequency points of the audio signal of the target song audio under the frequency domain are output by the characteristic frequency point identification model, wherein the characteristic frequency point identification model in the embodiment can be obtained by training according to the training method of the characteristic frequency point identification model in the embodiment.
In a specific implementation, after the target song audio is obtained, an audio signal of the target song audio in the frequency domain may be obtained. Specifically, if the audio format of the target song audio is an mp3 file, the target song audio can be converted into a file of a preset format with a sampling rate of 8 kHz, and the format-converted target song audio can be framed according to a preset byte count, for example, divided into audio frames of 512 bytes each. After the plurality of audio frames of the target song audio are acquired, the frame frequency information of each audio frame can be acquired and spliced in frame order, and the spliced frame frequency information can then be input into the characteristic frequency point identification model.
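A hedged sketch of this preprocessing follows, assuming a frame length of 512 samples and using naive linear interpolation as a stand-in for a real decoder/resampler (such as an ffmpeg-based pipeline); the function names are hypothetical.

```python
import numpy as np

def resample_to_8k(signal, orig_sr, target_sr=8000):
    """Naive linear-interpolation resampler (a stand-in for a proper
    decoder/resampler; assumes a 1-D float array)."""
    n_out = int(len(signal) * target_sr / orig_sr)
    x_old = np.linspace(0.0, 1.0, num=len(signal), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, signal)

def frame_signal(signal, frame_len=512):
    """Split the signal into fixed-length frames, dropping the remainder."""
    n = len(signal) // frame_len
    return signal[:n * frame_len].reshape(n, frame_len)

audio_44k = np.zeros(44100)                 # one second at 44.1 kHz
audio_8k = resample_to_8k(audio_44k, 44100) # resampled to 8 kHz
frames = frame_signal(audio_8k)             # 15 frames of 512 samples
```

The frame frequency information of each row in `frames` can then be computed and spliced in frame order, as in the training-side spectrogram step, before being fed to the model.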
S603, determining an audio fingerprint of the target song audio based on the characteristic frequency points.
After the plurality of characteristic frequency points of the target song audio are acquired, an audio fingerprint of the target song audio can be generated based on the plurality of characteristic frequency points.
In this embodiment, a target song audio of an audio fingerprint to be identified may be obtained, an audio signal of the target song audio is input to a trained feature frequency point identification model, and a plurality of feature frequency points of the audio signal of the target song audio output by the feature frequency point identification model in a frequency domain are obtained, where the feature frequency point identification model is obtained according to a training method of the feature frequency point identification model, and further the audio fingerprint of the target song audio may be determined based on the plurality of feature frequency points. In the scheme of the embodiment, the trained characteristic frequency point identification model can accurately remove the characteristic frequency point associated with the noise signal in the input audio signal, only the characteristic frequency point associated with the song signal in the audio signal is reserved, interference caused by the noise signal in the audio signal is effectively eliminated in the process of extracting the characteristic frequency point, the accuracy of the characteristic frequency point obtained by identification is improved, and the identification accuracy of the audio fingerprint is further improved.
In one embodiment, S603 determines an audio fingerprint of the target song audio based on the plurality of characteristic frequency points, which may include the steps of:
acquiring at least one group of adjacent characteristic frequency points in a plurality of characteristic frequency points; determining the audio fingerprint of each group of adjacent characteristic frequency points based on the frequency value and sampling time of each characteristic frequency point in each group of adjacent characteristic frequency points; an audio fingerprint sequence is generated based on the audio fingerprints of each set of adjacent feature frequency points as the audio fingerprint of the target song audio.
In practical application, after a plurality of feature frequency points are obtained, at least one group of adjacent feature frequency points can be obtained from the plurality of feature frequency points, namely, a group of adjacent feature frequency points can be obtained based on two adjacent feature frequency points. And further, for each set of adjacent feature frequency points, the audio fingerprint of each set of adjacent feature frequency points can be determined according to the frequency value and the sampling time of each feature frequency point in the set of adjacent feature frequency points.
Specifically, the relative position of a pair of adjacent characteristic frequency points can be determined based on the frequency value and sampling time of each characteristic frequency point in the pair, and the audio fingerprint can be determined from this relative position. For example, if the frequency values and sampling times of a group of adjacent characteristic frequency points M and N are (t1, f1) and (t2, f2) respectively, the time offset Δt between t1 and t2 may be obtained, and an audio fingerprint may be generated based on f1, f2, and Δt, recorded as (t1, HashCode), where HashCode = (f1, f2, Δt).
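The encoding above can be sketched as follows. Times are given as integer frame indices for clarity, and the function name is illustrative: each pair of adjacent characteristic frequency points (t, f) is encoded as (t1, HashCode) with HashCode = (f1, f2, Δt).

```python
def peak_pair_fingerprint(peaks):
    """Encode each pair of adjacent characteristic frequency points (t, f)
    as (t1, (f1, f2, dt)): the anchor time plus a hash code built from the
    two frequency values and their relative time offset."""
    peaks = sorted(peaks)                 # order by sampling time
    fingerprint = []
    for (t1, f1), (t2, f2) in zip(peaks, peaks[1:]):
        hash_code = (f1, f2, t2 - t1)     # HashCode = (f1, f2, Δt)
        fingerprint.append((t1, hash_code))
    return fingerprint

fp = peak_pair_fingerprint([(10, 300), (35, 520), (60, 410)])
```

Because the hash encodes only relative positions (frequency pair and time offset), the resulting sequence is stable against a global time shift of the recording, which is the stability property noted below.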
After the audio fingerprint of each group of adjacent characteristic frequency points is acquired, an audio fingerprint sequence can be generated based on the audio fingerprints of each group of adjacent characteristic frequency points, and this audio fingerprint sequence is used as the audio fingerprint of the target song audio.
In this embodiment, at least one group of adjacent characteristic frequency points can be obtained from the plurality of characteristic frequency points, the audio fingerprint of each group can be determined based on the frequency value and sampling time of each characteristic frequency point in the group, and an audio fingerprint sequence can be generated from these audio fingerprints to serve as the audio fingerprint of the target song audio. This ensures the accuracy of the characteristic frequency points and of the final audio fingerprint: characteristic frequency points are screened only for the song-signal portion of the audio signal, the influence of interfering noise in the target song audio is filtered out, and the generated audio fingerprint more accurately reflects the features of the song portion, so it can be better matched against the related original song. Meanwhile, because the encoding is based on the relative positions of adjacent characteristic frequency points, the audio fingerprint of the target song audio is highly stable.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used to store the spectral data of songs. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a training method of a characteristic frequency point recognition model and/or an audio fingerprint recognition method.
It will be appreciated by those skilled in the art that the structure shown in fig. 7 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
acquiring noisy song audio of original song audio; the noisy song signal of the noisy song audio comprises a noise signal and an original song signal of the original song audio;
determining a reference characteristic frequency point under the original song signal frequency domain;
inputting the song signals with noise into a neural network model to be trained, and acquiring predicted characteristic frequency points associated with original song signals in the song signals with noise in a frequency domain through the neural network model;
and adjusting model parameters of the neural network model based on the difference value between the predicted characteristic frequency point and the reference characteristic frequency point until the training ending condition is met, so as to obtain a trained characteristic frequency point identification model.
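The steps above can be illustrated with a toy numeric sketch that replaces the neural network with a single logistic weight per frequency bin. The data, labels, and loss function here are synthetic assumptions; only the pattern of adjusting model parameters based on the difference between predicted and reference characteristic frequency points is illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the neural network: one logistic weight per frequency
# bin, predicting whether that bin is a characteristic frequency point.
n_bins = 64
w = np.zeros(n_bins)

spectrum = rng.random(n_bins)               # noisy-song spectrum (synthetic)
reference = (spectrum > 0.8).astype(float)  # reference characteristic points

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse(w):
    return np.mean((sigmoid(w * spectrum) - reference) ** 2)

loss_before = mse(w)
lr = 0.5
for _ in range(200):
    pred = sigmoid(w * spectrum)            # predicted characteristic points
    diff = pred - reference                 # difference vs. reference points
    grad = 2.0 * diff * pred * (1.0 - pred) * spectrum
    w -= lr * grad                          # adjust model parameters
loss_after = mse(w)
```

The loss is positively correlated with the prediction/reference difference, so repeated parameter updates drive it down; in the real method the same loop runs over a deep model (e.g., a ResNet) until a training-end condition is met.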
In one embodiment, the steps of the other embodiments described above are also implemented when the processor executes a computer program.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
acquiring target song audio of an audio fingerprint to be identified;
inputting the audio signal of the target song audio to a trained characteristic frequency point identification model to obtain a plurality of characteristic frequency points of the audio signal of the target song audio output by the characteristic frequency point identification model under a frequency domain, wherein the characteristic frequency point identification model is obtained according to the training method of the characteristic frequency point identification model;
an audio fingerprint of the target song audio is determined based on the plurality of characteristic frequency points.
In one embodiment, the steps of the other embodiments described above are also implemented when the processor executes a computer program.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:
acquiring noisy song audio of original song audio; the noisy song signal of the noisy song audio comprises a noise signal and an original song signal of the original song audio;
Determining a reference characteristic frequency point under the original song signal frequency domain;
inputting the song signals with noise into a neural network model to be trained, and acquiring predicted characteristic frequency points associated with original song signals in the song signals with noise in a frequency domain through the neural network model;
and adjusting model parameters of the neural network model based on the difference value between the predicted characteristic frequency point and the reference characteristic frequency point until the training ending condition is met, so as to obtain a trained characteristic frequency point identification model.
In one embodiment, the computer program, when executed by a processor, also implements the steps of the other embodiments described above.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:
acquiring target song audio of an audio fingerprint to be identified;
inputting the audio signal of the target song audio to a trained characteristic frequency point identification model to obtain a plurality of characteristic frequency points of the audio signal of the target song audio output by the characteristic frequency point identification model under a frequency domain, wherein the characteristic frequency point identification model is obtained according to the training method of the characteristic frequency point identification model;
An audio fingerprint of the target song audio is determined based on the plurality of characteristic frequency points.
In one embodiment, the computer program, when executed by a processor, also implements the steps of the other embodiments described above.
It should be noted that, user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. The training method of the characteristic frequency point identification model is characterized by comprising the following steps of:
acquiring noisy song audio of original song audio; the noisy song signal of the noisy song audio comprises a noise signal and an original song signal of the original song audio;
determining a reference characteristic frequency point under the original song signal frequency domain;
inputting the song signals with noise into a neural network model to be trained, and acquiring predicted characteristic frequency points associated with original song signals in the song signals with noise in a frequency domain through the neural network model;
And adjusting model parameters of the neural network model based on the difference value between the predicted characteristic frequency point and the reference characteristic frequency point until the training ending condition is met, so as to obtain a trained characteristic frequency point identification model.
2. The method of claim 1, wherein the characteristic frequency points are local peak points, and the obtaining noisy song audio of the original song audio comprises:
acquiring a noise signal of non-stationary noise, wherein a plurality of frequency points in a noise signal frequency domain comprise at least one local peak point;
and carrying out fusion processing on the noise signal and an original song signal of the original song audio in the song library, and obtaining the noisy song audio of the original song audio based on a fusion result.
3. The method of claim 2, wherein the acquiring the noise signal of the non-stationary noise comprises:
acquiring preset non-stationary noise of a plurality of types, wherein the non-stationary noise of the plurality of types comprises at least two of the following: speaking sound, transient noise and environmental noise;
at least one type of non-stationary noise is randomly determined from among the plurality of types of non-stationary noise, and a noise signal of the at least one type of non-stationary noise is acquired.
4. The method according to claim 2, wherein the fusing the noise signal and the original song signal of the original song audio in the library, and obtaining the noisy song audio of the original song audio based on the fusion result, includes:
randomly adding the noise signals of the non-stationary noise to the original song signals of the original song audio to obtain fused audio signals;
and acquiring the noisy song audio of the original song audio based on the fused audio signal.
5. The method of claim 1, wherein the inputting the noisy song signal into the neural network model to be trained comprises:
acquiring frame frequency information of a plurality of audio frames of the song audio with noise;
splicing frame frequency information of the plurality of audio frames according to respective frame orders of the plurality of audio frames to obtain frequency spectrum information of the song audio with noise;
and inputting the spectrum information of the noisy song audio into a neural network model to be trained.
6. The method according to any one of claims 1-5, wherein the adjusting model parameters of the neural network model based on the difference value between the predicted characteristic frequency point and the reference characteristic frequency point until the training end condition is satisfied, to obtain a trained characteristic frequency point identification model, includes:
Determining a model loss value based on a difference value between the predicted characteristic frequency point and the reference characteristic frequency point; the model loss value and the difference value are positively correlated;
and adjusting model parameters of the neural network model according to the model loss value until the training ending condition is met, and obtaining a trained characteristic frequency point identification model.
7. An audio fingerprint identification method, the method comprising:
acquiring target song audio of an audio fingerprint to be identified;
inputting the audio signal of the target song audio to a trained characteristic frequency point identification model to obtain a plurality of characteristic frequency points of the audio signal of the target song audio output by the characteristic frequency point identification model under a frequency domain, wherein the characteristic frequency point identification model is obtained according to the training method of the characteristic frequency point identification model according to any one of claims 1-6;
an audio fingerprint of the target song audio is determined based on the plurality of characteristic frequency points.
8. The method of claim 7, wherein the determining the audio fingerprint of the target song audio based on the plurality of characteristic frequency points comprises:
acquiring at least one group of adjacent characteristic frequency points in the plurality of characteristic frequency points;
Determining the audio fingerprint of each group of adjacent characteristic frequency points based on the frequency value and sampling time of each characteristic frequency point in each group of adjacent characteristic frequency points;
and generating an audio fingerprint sequence based on the audio fingerprints of each group of adjacent characteristic frequency points, and taking the audio fingerprint sequence as the audio fingerprint of the target song audio.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.
CN202211094118.6A 2022-09-08 2022-09-08 Feature frequency point recognition model training and audio fingerprint recognition method, device and product Pending CN116092521A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211094118.6A CN116092521A (en) 2022-09-08 2022-09-08 Feature frequency point recognition model training and audio fingerprint recognition method, device and product


Publications (1)

Publication Number Publication Date
CN116092521A true CN116092521A (en) 2023-05-09

Family

ID=86208861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211094118.6A Pending CN116092521A (en) 2022-09-08 2022-09-08 Feature frequency point recognition model training and audio fingerprint recognition method, device and product

Country Status (1)

Country Link
CN (1) CN116092521A (en)

Similar Documents

Publication Publication Date Title
CN110990273B (en) Clone code detection method and device
CN104735468B Method and system for synthesizing images into a new video based on semantic analysis
CN108510982B (en) Audio event detection method and device and computer readable storage medium
US20140280304A1 (en) Matching versions of a known song to an unknown song
EP2657884B1 (en) Identifying multimedia objects based on multimedia fingerprint
CN111739539B (en) Method, device and storage medium for determining number of speakers
KR20170053525A (en) Apparatus and method for training neural network, apparatus and method for speech recognition
US11514925B2 (en) Using a predictive model to automatically enhance audio having various audio quality issues
Khan et al. A novel audio forensic data-set for digital multimedia forensics
CN113593606B (en) Audio recognition method and device, computer equipment and computer-readable storage medium
CN114363695B (en) Video processing method, device, computer equipment and storage medium
CN110442855A Speech analysis method and system
Yang et al. Approaching optimal embedding in audio steganography with GAN
CN114461943B (en) Deep learning-based multi-source POI semantic matching method and device and storage medium thereof
Imoto et al. Acoustic scene analysis from acoustic event sequence with intermittent missing event
CN116092521A (en) Feature frequency point recognition model training and audio fingerprint recognition method, device and product
CN114566160A (en) Voice processing method and device, computer equipment and storage medium
CN115116469A (en) Feature representation extraction method, feature representation extraction device, feature representation extraction apparatus, feature representation extraction medium, and program product
CN111310176B (en) Intrusion detection method and device based on feature selection
CN114155841A (en) Voice recognition method, device, equipment and storage medium
CN111078877B (en) Data processing method, training method of text classification model, and text classification method and device
CN113889081A (en) Speech recognition method, medium, device and computing equipment
CN109671440B (en) Method, device, server and storage medium for simulating audio distortion
US10861436B1 (en) Audio call classification and survey system
CN115440198B (en) Method, apparatus, computer device and storage medium for converting mixed audio signal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination