CN116781944A - Song detection method, device, equipment and readable storage medium - Google Patents

Info

Publication number
CN116781944A
Authority
CN
China
Prior art keywords
audio
song
audio file
training
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310932072.9A
Other languages
Chinese (zh)
Inventor
兰翔
曾锐鸿
马金龙
熊佳
焦南凯
盘子圣
王伟喆
黎子骏
黄祥康
吴文亮
邓其春
张政统
谢睿
徐志坚
陈光尧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Quyan Network Technology Co ltd
Original Assignee
Guangzhou Quyan Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 Live feed
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Databases & Information Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a song detection method, apparatus, device, and readable storage medium. The method acquires an audio file that is uploaded by an anchor and contains audio stream data, and judges whether the audio file also contains song information. If so, the song information is extracted from the audio file and a song identifier corresponding to the audio file is determined from it; the song identifier is used to determine whether the audio file contains a forbidden song. If not, the audio stream data in the audio file is input to a preset feature extraction model, which extracts a spectrogram of the song audio in the audio stream data; an audio fingerprint is extracted from the spectrogram, and the song identifier matching the audio file is determined from the audio fingerprint. The application thus provides a song detection flow that improves the efficiency of audio auditing while maintaining auditing accuracy.

Description

Song detection method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of information identification technologies, and in particular, to a song detection method, apparatus, device, and readable storage medium.
Background
As live streaming expands, its audience keeps growing. To keep the network environment in order, prevent the spread of harmful content, and protect the physical and mental health of the audience, auditing the content that an anchor plays in a live room is indispensable. Singing songs is common live content, but a song sung by an anchor may be a forbidden song, so auditing the songs an anchor sings is an important part of live auditing. It is therefore desirable to provide a song detection method for auditing songs sung by an anchor.
Disclosure of Invention
In view of the above, the present application provides a song detection method, apparatus, device, and readable storage medium for auditing songs by a host singing.
In order to achieve the above object, the following solutions have been proposed:
a song detection method, comprising:
acquiring an audio file that is uploaded by an anchor terminal and contains audio stream data;
judging whether the audio file also contains song information or not;
if yes, extracting the song information from the audio file, and determining a song identifier corresponding to the audio file according to the song information, wherein the song identifier is used for determining whether the audio file contains forbidden songs or not;
if not, inputting the audio stream data in the audio file to a preset feature extraction model, extracting a spectrogram of song audio in the audio stream data by using the feature extraction model, extracting audio fingerprints from the spectrogram, and determining song identifications matched with the audio file according to the audio fingerprints.
Optionally, the training process of the feature extraction model includes:
acquiring an initial feature extraction model and a training set, wherein the training set consists of two types of training audio, one type derived from audio clips of songs sung by different anchors in a live room and the other type derived from audio clips of songs played by different anchors in a live room, and each training audio is labeled with a corresponding training spectrogram;
inputting each training audio to the initial feature extraction model in sequence to obtain a predicted spectrogram output by the initial feature extraction model;
and adjusting parameters of the initial feature extraction model according to the predicted spectrogram and the training spectrogram of the input training audio until the initial feature extraction model meets preset conditions, and taking the initial feature extraction model obtained by final training as the feature extraction model.
Optionally, acquiring the training set includes:
acquiring live videos from the live room of each anchor;
intercepting, from each live video, audio clips of songs sung by the anchor in the live room and audio clips of songs played by the anchor in the live room;
and sequentially generating a training spectrogram corresponding to the music audio in each audio clip, taking the generated training spectrogram as the label of the audio clip to form a training audio, and forming the training set from the training audios.
Optionally, the determining whether the audio file further includes song information includes:
and judging whether the audio file contains song identification, original singing identification, album identification and/or producer identification.
Optionally, extracting an audio fingerprint from the spectrogram includes:
selecting all maximum points from the spectrogram, and determining the moment and the amplitude value corresponding to each maximum point;
and forming the audio fingerprint according to the corresponding moment and amplitude value of each maximum point.
Optionally, the forming an audio fingerprint according to the moment and the amplitude value corresponding to each maximum point includes:
generating a hash value corresponding to each amplitude value;
and sequencing the hash values according to the corresponding moments of the amplitude values to form the audio fingerprint.
Optionally, the determining, according to the audio fingerprint, a song identifier matched with the audio file includes:
matching the audio fingerprint with each song in a preset song library, and calculating the similarity between the fingerprint information of each song and the audio fingerprint;
selecting the maximum similarity from the similarities as a target similarity, and comparing the target similarity with a preset similarity threshold;
and when the target similarity exceeds a preset similarity threshold, determining the name of the song corresponding to the target similarity, and taking the name as a song identifier matched with the audio file.
A song detection apparatus comprising:
the acquisition module is used for acquiring the audio file which is uploaded by the anchor and contains the audio stream data;
the judging module is used for judging whether the audio file also contains song information;
the extraction module is used for extracting song information from the audio file if the judgment module determines that the audio file also contains the song information, and determining a song identifier corresponding to the audio file according to the song information, wherein the song identifier is used for determining whether the audio file contains forbidden songs or not;
and the determining module is used for inputting the audio stream data in the audio file into a preset feature extraction model if the judging module determines that the audio file does not contain song information, extracting a spectrogram of song audio in the audio stream data by using the feature extraction model, extracting audio fingerprints from the spectrogram, and determining song identifications matched with the audio file according to the audio fingerprints.
A song detection apparatus comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement each step of the song detection method described above.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a song detection method as described above.
According to the above technical solution, the song detection method provided by the application acquires an audio file that is uploaded by an anchor and contains audio stream data, and judges whether the audio file also contains song information. If so, the song information is extracted from the audio file and the song identifier corresponding to the audio file is determined from it, the song identifier being used to determine whether the audio file contains a forbidden song. When the audio file contains song information, the song identifier is therefore obtained directly from that information, which shortens the live auditing flow and improves its efficiency. If not, the audio stream data in the audio file is input to a preset feature extraction model, the feature extraction model extracts a spectrogram of the song audio in the audio stream data, an audio fingerprint is extracted from the spectrogram, and the song identifier matching the audio file is determined from the audio fingerprint. Because the feature extraction model extracts only the spectrogram corresponding to the song audio, even when the audio stream data also contains ambient sound, interaction between anchor and audience, or other interfering sound, the influence of interference in the live scene is avoided and the accuracy of audio auditing improves. The application thus provides a song detection flow that audits the songs in a live room in a targeted way according to what the audio file contains, determines whether a forbidden song is played or sung in the live room, and improves the efficiency of audio auditing while maintaining auditing accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a song detection method disclosed in an embodiment of the present application;
FIG. 2 is a block diagram of a song detection apparatus according to an embodiment of the present application;
FIG. 3 is a block diagram of a hardware structure of a song detection device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The song detection method provided by the application can be applied in numerous general-purpose or special-purpose computing device environments or configurations, for example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor devices, distributed computing environments that include any of the above systems or devices, and the like.
The song detection method of the present application will be described in detail with reference to fig. 1, and includes the following steps:
Step S1, acquiring an audio file that is uploaded by an anchor terminal and contains audio stream data.
Specifically, an audio file uploaded by each anchor through a terminal may be received, where the audio file includes audio stream data. The audio stream data may be an audio stream formed by an anchor singing songs in a live room, or an audio stream formed by an anchor playing songs in a live room. The audio file may also include song information such as a song identifier, singer identifier, album identifier, and original singer identifier.
Step S2, judging whether the audio file also contains song information; if so, executing step S3, and if not, executing step S4.
Specifically, an audio file processing program may be used to determine whether song information can be extracted from the audio file. If so, it is determined that the audio file contains song information, and step S3 is executed; if not, it is determined that the audio file does not contain song information, and step S4 is executed.
The song information may be used to determine a song identification, which may be used to determine whether the audio stream data in the audio file contains a contraband song.
The forbidden song can be a song which affects the physical and mental health development of the live audience group, and also can be a song containing sensitive words.
Step S3, extracting the song information from the audio file, and determining a song identifier corresponding to the audio file according to the song information.
Specifically, when it is determined that the audio file contains song information, the song information may be directly extracted from the audio file, and all song identifiers corresponding to the audio file may be determined based on the song information.
The song identifications may be used to determine whether the audio file contains a contraband song, wherein the determination of whether the audio file contains a contraband song may be made by determining the category to which each song identification belongs.
Step S4, inputting the audio stream data in the audio file into a preset feature extraction model, extracting a spectrogram of the song audio in the audio stream data by using the feature extraction model, extracting an audio fingerprint from the spectrogram, and determining the song identifier matching the audio file according to the audio fingerprint.
In particular, the audio stream data may be input to a trained feature extraction model, which may be utilized to extract a spectrogram based on song audio in the audio stream data.
The audio stream data may include only song audio, or may include data such as environmental audio in a live broadcasting room, communication interaction audio between a host and a viewer, song audio, etc., and the feature extraction model may extract only a spectrogram corresponding to the song audio in the audio stream data.
An audio fingerprint corresponding to the audio file may be extracted based on the spectrogram.
And determining all song identifications corresponding to the audio fingerprints, and taking each song identification as the song identification corresponding to the audio file.
The song identification may be used to determine whether the audio file uploaded by the host relates to a contraband song.
According to the above technical solution, the song detection method provided by this embodiment acquires an audio file that is uploaded by an anchor and contains audio stream data, and judges whether the audio file also contains song information. If so, the song information is extracted from the audio file and the song identifier corresponding to the audio file is determined from it, the song identifier being used to determine whether the audio file contains a forbidden song. When the audio file contains song information, the song identifier is therefore obtained directly from that information, which shortens the live auditing flow and improves its efficiency. If not, the audio stream data in the audio file is input to a preset feature extraction model, the feature extraction model extracts a spectrogram of the song audio in the audio stream data, an audio fingerprint is extracted from the spectrogram, and the song identifier matching the audio file is determined from the audio fingerprint. Because the feature extraction model extracts only the spectrogram corresponding to the song audio, even when the audio stream data also contains ambient sound, interaction between anchor and audience, or other interfering sound, the influence of interference in the live scene is avoided and the accuracy of audio auditing improves.
Therefore, the application provides a song detection process, which can pertinently audit the songs played in the live broadcasting room according to the content contained in the audio file, determine whether to play or sing the forbidden songs in the live broadcasting room, and improve the efficiency of audio audit while ensuring the accuracy of audit.
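The branching between steps S2, S3 and S4 can be illustrated with a minimal Python sketch. It is not the implementation described in the application: the dictionary keys `song_info` and `audio_stream` and the two callables passed in are hypothetical names chosen only for illustration, and the forbidden-song check is reduced to a set lookup.

```python
from typing import Callable, Iterable, List, Set


def detect_song_ids(
    audio_file: dict,
    ids_from_song_info: Callable[[dict], List[str]],
    ids_from_audio_stream: Callable[[bytes], List[str]],
) -> List[str]:
    """Return the song identifiers for an uploaded audio file.

    `audio_file` is a hypothetical dict with an optional "song_info" entry
    and an "audio_stream" entry; the real container format is not specified.
    """
    song_info = audio_file.get("song_info")
    if song_info:
        # S2 -> S3: song information is present, so use it directly.
        return ids_from_song_info(song_info)
    # S2 -> S4: no song information, fall back to model + fingerprint matching.
    return ids_from_audio_stream(audio_file["audio_stream"])


def contains_forbidden_song(song_ids: Iterable[str], forbidden_ids: Set[str]) -> bool:
    """Check the identifiers against a (hypothetical) set of forbidden song ids."""
    return any(song_id in forbidden_ids for song_id in song_ids)
```

Passing the two lookups in as callables keeps the sketch independent of how song information is parsed or how fingerprints are matched.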
In some embodiments of the present application, the feature extraction model may be trained in advance and stored, and then simply invoked when song detection is required, which further improves the efficiency of live auditing. On this basis, a training process for the feature extraction model may be added. The training process is described in detail below, with the following steps:
S5, acquiring an initial feature extraction model and a training set.
Specifically, an initial feature extraction model may be obtained in advance through unsupervised training; the initial feature extraction model may be a speech pre-training model.
A training set of two types of training audio may be obtained, and the training set may contain a plurality of training audio.
One type of training audio may be derived from audio clips formed by different anchors singing songs in a live room, and the other type from audio clips formed by different anchors playing songs in a live room. The label of each training audio is the training spectrogram corresponding to that training audio.
The training spectrogram may be a spectrogram of the musical composition audio in the corresponding training audio.
S6, sequentially inputting each training audio to the initial feature extraction model to obtain a predicted spectrogram output by the initial feature extraction model.
Specifically, training audio can be selected randomly from the training set in sequence and input into the initial feature extraction model, and a predictive spectrogram is output based on music audio in the training audio by using the initial feature extraction model.
And S7, adjusting parameters of the initial feature extraction model according to the predicted spectrogram and the training spectrogram of the input training audio until the initial feature extraction model meets preset conditions, and taking the initial feature extraction model obtained by final training as the feature extraction model.
Specifically, the current iteration number may be determined.
And calculating a loss value between the predicted spectrogram and the training spectrogram under the current iteration times, and adjusting the parameters of the initial feature extraction model according to the loss value until the current iteration times exceed a preset iteration threshold and/or the loss value is smaller than the preset loss threshold.
The initial feature extraction model finally obtained is a feature extraction model for extracting a spectrogram in audio stream data.
According to the technical scheme, an optional mode for obtaining the feature extraction model through training is added, and the live broadcast auditing efficiency can be improved and the auditing process can be accelerated through the mode.
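The stopping rule in steps S6 and S7 (stop when the iteration count reaches a preset threshold and/or the loss falls below a preset threshold) can be sketched on a toy problem. Everything about the model here is an assumption made for illustration: a single scalar weight `w` stands in for the model parameters, each "spectrogram" is a short list of numbers, and the loss is mean squared error. The application does not specify the model architecture, the loss function, or the optimizer.

```python
import random


def train(samples, lr=0.1, max_iters=1000, loss_threshold=1e-6, seed=0):
    """Toy training loop: the model predicts label = w * input.

    samples: list of (input_values, label_values) pairs of equal length,
             standing in for (training audio, training spectrogram).
    Returns the learned weight, the last loss value, and the iteration count.
    """
    rng = random.Random(seed)
    w = 0.0                      # "initial feature extraction model"
    loss = float("inf")
    iteration = 0
    # Stop when the iteration budget is used up or the loss is small enough.
    while iteration < max_iters and loss >= loss_threshold:
        x, y = rng.choice(samples)                     # S6: pick a training audio
        pred = [w * v for v in x]                      # predicted "spectrogram"
        loss = sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)
        grad = sum(2 * (p - t) * v for p, t, v in zip(pred, y, x)) / len(y)
        w -= lr * grad                                 # S7: adjust parameters
        iteration += 1
    return w, loss, iteration
```

With labels generated as twice the input, the loop converges to a weight near 2 well before the iteration cap; with a very small `max_iters`, the cap ends training first.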
In some embodiments of the present application, the process of acquiring the training set in step S5 is described in detail as follows:
S50, acquiring live videos from the live rooms of the anchors.
In particular, live video may be acquired from various different live rooms.
S51, intercepting, from each live video, audio clips of songs sung by the anchor in the live room and audio clips of songs played by the anchor in the live room.
Specifically, it may be determined in turn whether each live video contains an audio clip of a song being sung and/or an audio clip of a song being played.
If so, the audio clip of the song sung by the anchor in the live room and/or the audio clip of the song played by the anchor in the live room is intercepted from the live video.
If not, the live video is not intercepted.
Wherein the duration of different audio segments may be different.
A duration threshold for audio clips may be preset, and during interception the audio clips of songs sung by the anchor and/or of songs played by the anchor in the live room are cut according to that duration threshold.
S52, sequentially generating training spectrograms corresponding to music audios in each audio fragment, and taking the generated training spectrograms as labeling labels of the audio fragments to form training audios, wherein each training audio forms the training set.
Specifically, it is possible to sequentially extract musical composition audio related to a song in each audio piece and extract a spectrogram of the musical composition audio as a training spectrogram.
The training spectrogram is then attached to the audio clip as its label to form a training audio.
Each training audio constitutes a training set.
According to the technical scheme, the embodiment provides an optional training set acquisition mode, and the audio clips related to song audio and the training spectrogram can be further and better acquired through the mode, so that the accuracy of a feature extraction model obtained by utilizing the training set is improved.
In some embodiments of the present application, the process of determining whether the audio file further includes song information in step S2 is described in detail as follows:
S20, judging whether the audio file contains a song identifier, original singer identifier, album identifier, and/or producer identifier.
Specifically, the song information may include basic song-related information such as a song identifier, original singer identifier, album identifier, producer identifier, and singer identifier.
It may be checked whether such basic song-related information exists in the audio file.
The song identifier may be determined from the basic song-related information.
According to the technical scheme, the alternative mode for judging whether the audio file further contains song information is provided, and song detection efficiency can be further improved through the mode, so that live broadcast auditing efficiency is improved.
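A minimal Python sketch of the check in step S20 follows, assuming the audio file's metadata has already been parsed into a dictionary. The key names are hypothetical; the application names the kinds of identifier but not how they are stored.

```python
# Hypothetical metadata keys; the source names the kinds of identifier
# but not the field names used inside the audio file.
SONG_INFO_FIELDS = ("song_id", "original_singer_id", "album_id", "producer_id")


def has_song_info(metadata: dict) -> bool:
    """Return True when at least one song-related identifier is present and non-empty."""
    return any(metadata.get(field) for field in SONG_INFO_FIELDS)
```

Treating empty strings as missing reflects the "and/or" wording: one usable identifier is enough to take the step S3 branch.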
In some embodiments of the present application, the process of extracting the audio fingerprint from the spectrogram in step S4 is described in detail, and the steps are as follows:
S40, selecting all maximum points from the spectrogram, and determining the moment and amplitude value corresponding to each maximum point.
Specifically, a maximum point may be found from the spectrogram output by the feature extraction model.
The maximum value point may be a coordinate point whose corresponding amplitude value is larger than that of the adjacent coordinate point in both the time dimension and the frequency dimension.
The moment and amplitude value corresponding to each maximum point can be determined.
S41, forming the audio fingerprint according to the corresponding moment and amplitude value of each maximum value point.
Specifically, the amplitude values may be combined according to the time sequence to form an audio fingerprint.
According to the technical scheme, the embodiment provides an optional mode for extracting the audio fingerprint, the audio fingerprint can be better extracted from the spectrogram through the mode, the calculated amount of subsequent audio fingerprint identification is simplified, and the speed of live broadcast auditing is further improved.
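Steps S40 and S41 can be sketched in plain Python as follows. The sketch assumes the spectrogram is a list of time frames, each a list of amplitudes per frequency bin (`spec[t][f]`), and compares each point only with its immediate neighbours along the time axis and the frequency axis; the neighbourhood size is not specified in the application.

```python
def find_peaks(spec):
    """Return (time_index, amplitude) for every strict local maximum.

    A point counts as a maximum when its amplitude is larger than that of
    its adjacent points in both the time and the frequency dimension.
    """
    peaks = []
    n_t = len(spec)
    n_f = len(spec[0]) if spec else 0
    for t in range(n_t):
        for f in range(n_f):
            a = spec[t][f]
            neighbours = []
            if t > 0:
                neighbours.append(spec[t - 1][f])
            if t < n_t - 1:
                neighbours.append(spec[t + 1][f])
            if f > 0:
                neighbours.append(spec[t][f - 1])
            if f < n_f - 1:
                neighbours.append(spec[t][f + 1])
            if all(a > n for n in neighbours):
                peaks.append((t, a))
    return peaks


def fingerprint_from_peaks(peaks):
    """S41: combine the amplitude values in time order."""
    return [a for _, a in sorted(peaks, key=lambda p: p[0])]
```

For example, a 4x3 spectrogram with a peak of 5 in the second frame and a peak of 9 in the last frame yields the fingerprint `[5, 9]`.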
In some embodiments of the present application, the process of forming the audio fingerprint in step S41 according to the time and amplitude value corresponding to each maximum point is described in detail as follows:
S410, generating a hash value corresponding to each amplitude value.
Specifically, a hash algorithm may be employed to generate a hash value corresponding to each amplitude value.
S411, sequencing the hash values according to the corresponding moments of the amplitude values to form an audio fingerprint.
Specifically, the hash values may be combined to form the audio fingerprint based on the order of the moments.
As can be seen from the above technical solution, the present embodiment provides an optional manner of forming an audio fingerprint by using each amplitude value, by which frequency information can be hashed further, so that audio stream data is converted into a hash value to form an audio fingerprint, which is convenient for subsequent fingerprint matching.
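Steps S410 and S411 might look like the following sketch. The application does not name a hash algorithm, so SHA-1 from Python's standard `hashlib` is used here purely as a stand-in; rounding the amplitude to two decimals before hashing and truncating the digest to 10 hex characters are also assumptions made for illustration.

```python
import hashlib


def hash_amplitude(amplitude: float, precision: int = 2) -> str:
    """Hash one amplitude value (quantised so tiny numeric noise maps to the same hash)."""
    quantised = f"{amplitude:.{precision}f}"
    return hashlib.sha1(quantised.encode("utf-8")).hexdigest()[:10]


def build_fingerprint(peaks):
    """peaks: iterable of (time, amplitude). Returns the hashes ordered by time."""
    ordered = sorted(peaks, key=lambda p: p[0])      # S411: order by moment
    return [hash_amplitude(a) for _, a in ordered]   # S410: hash each amplitude
```

Quantising before hashing is a design choice of this sketch: a cryptographic hash changes completely for any change in input, so without quantisation two recordings of the same song would rarely share a hash.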
In some embodiments of the present application, the process of determining the song identifier matching the audio file according to the audio fingerprint in step S4 is described in detail, and the steps are as follows:
S43, matching the audio fingerprint with each song in a preset song library, and calculating the similarity between the fingerprint information of each song and the audio fingerprint.
Specifically, a song library may be established in advance, in which the hash values corresponding to each song are stored.
The song spectrogram of each song can be generated, and fingerprint information corresponding to the song is formed by utilizing the maximum value point in the song spectrogram.
The euclidean distance between the audio fingerprint and the fingerprint information of each song in the library can be calculated, and the euclidean distance is used as the similarity between the fingerprint information of the song and the audio fingerprint.
S44, selecting the maximum similarity from the similarities as a target similarity, and comparing the target similarity with a preset similarity threshold.
Specifically, the similarities may be sorted by value to obtain a sorting result, and the similarity with the largest value is selected from the sorting result as the target similarity.
It may then be judged whether the target similarity is smaller than the preset similarity threshold; if so, a prompt indicating that the audio file has no matching song is issued, and if not, step S45 is executed.
The similarity threshold may be set according to actual requirements, for example to 80%.
S45, determining the name of the song corresponding to the target similarity, and taking the name as a song identifier matched with the audio file.
Specifically, the name of the song corresponding to the target similarity may be determined and associated with the audio file as the song identifier corresponding to the audio file.
As can be seen from the above technical solution, this embodiment provides an optional way of determining the song identifier of an audio file from its audio fingerprint. In this way, the audio fingerprint is matched against each song in the song library, which completes the determination of the song identifier.
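Steps S43 to S45 can be sketched as below. The sketch assumes that fingerprints are equal-length numeric vectors and that the Euclidean distance d is mapped to a similarity 1 / (1 + d), so that a larger value means a closer match and the 80% threshold is meaningful; this mapping, the function names and the in-memory dictionary used as the song library are assumptions for illustration rather than details stated in the application.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Map the Euclidean distance between two fingerprints into (0, 1]."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)  # assumed conversion: identical -> 1.0


def match_song(
    fingerprint: Sequence[float],
    library: Dict[str, Sequence[float]],
    threshold: float = 0.8,
) -> Optional[Tuple[str, float]]:
    """Return (song name, similarity) for the best match, or None if no match."""
    if not library:
        return None
    # S43: similarity between the audio fingerprint and every song in the library.
    scores = {name: similarity(fingerprint, fp) for name, fp in library.items()}
    # S44: the largest similarity is the target similarity.
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None  # the audio file has no matching song
    # S45: the song name serves as the song identifier.
    return best, scores[best]
```

A caller would treat a `None` result as the "no matching song" prompt described in step S44.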
Next, a detailed description will be given of the song detection apparatus provided by the present application, and the song detection apparatus set forth below may be cross-referenced with the song detection method provided above.
As can be seen with reference to fig. 2, the song detection apparatus may include:
the acquisition module 1 is used for acquiring an audio file which is uploaded by an anchor end and contains audio stream data;
a judging module 2, configured to judge whether the audio file further contains song information;
the extracting module 3 is configured to extract song information from the audio file if the judging module determines that the audio file further includes song information, and determine a song identifier corresponding to the audio file according to the song information, where the song identifier is used to determine whether the audio file includes a forbidden song;
and the determining module 4 is used for inputting the audio stream data in the audio file into a preset feature extraction model if the judging module determines that the audio file does not contain song information, extracting a spectrogram of song audio in the audio stream data by using the feature extraction model, extracting an audio fingerprint from the spectrogram, and determining a song identifier matched with the audio file according to the audio fingerprint.
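The cooperation of the four modules can be sketched as a simple dispatch. The dictionary layout of the audio file (keys `song_info` and `audio_stream`) and the two callables standing in for the extraction module and the determining module are hypothetical and serve only to show the branching.

```python
from typing import Callable, Dict, Optional, Set


def detect_song(
    audio_file: Dict,
    identify_from_info: Callable[[Dict], Optional[str]],
    identify_from_audio: Callable[[bytes], Optional[str]],
) -> Optional[str]:
    """Return the song identifier for an uploaded audio file, if one is found."""
    # Judging module: does the file carry song information besides the audio stream?
    song_info = audio_file.get("song_info")
    if song_info:
        # Extraction module: derive the identifier directly from the song information.
        return identify_from_info(song_info)
    # Determining module: fall back to spectrogram and fingerprint matching.
    return identify_from_audio(audio_file["audio_stream"])


def contains_forbidden_song(song_id: Optional[str], forbidden: Set[str]) -> bool:
    """Check the identifier against a set of forbidden songs (assumed to exist)."""
    return song_id is not None and song_id in forbidden
```

Passing the two identification paths in as callables keeps the sketch independent of any particular feature extraction model or song library.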
Further, the song detection apparatus may further include:
the training set acquisition module is used for acquiring an initial feature extraction model and a training set, wherein the training set consists of two types of training audio, one type being derived from audio fragments of songs performed by different anchors in a live broadcast room and the other type being derived from audio fragments of songs played by different anchors in the live broadcast room, and each training audio is labeled with a corresponding training spectrogram;
the predictive spectrogram generation module is used for sequentially inputting each training audio into the initial feature extraction model to obtain a predictive spectrogram output by the initial feature extraction model;
and the parameter adjustment module is used for adjusting the parameters of the initial feature extraction model according to the predicted spectrogram and the training spectrogram of the input training audio until the initial feature extraction model meets preset conditions, and taking the initial feature extraction model finally obtained by training as the feature extraction model.
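The training procedure carried out by these modules can be illustrated with a deliberately tiny stand-in. The application does not specify the model architecture, loss or stopping rule, so the sketch below assumes a toy model with one weight per frequency bin, a mean squared error between the predicted and training spectrograms, plain gradient descent, and a loss tolerance as the preset condition; all names and hyperparameters are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], Sequence[float]]  # (input features, training spectrogram)


def train(
    samples: List[Sample],
    n_bins: int,
    lr: float = 0.1,
    tol: float = 1e-6,
    max_epochs: int = 1000,
) -> List[float]:
    """Fit one weight per bin so that weight * input approximates the label."""
    weights = [0.0] * n_bins  # parameters of the initial (untrained) toy model
    for _ in range(max_epochs):
        total_loss = 0.0
        for features, target in samples:
            # Predicted spectrogram output by the current model.
            predicted = [w * x for w, x in zip(weights, features)]
            errors = [p - t for p, t in zip(predicted, target)]
            total_loss += sum(e * e for e in errors) / n_bins
            # Adjust parameters along the negative gradient of the squared error.
            weights = [
                w - lr * 2 * e * x / n_bins
                for w, e, x in zip(weights, errors, features)
            ]
        # Assumed preset condition: average loss over the epoch below a tolerance.
        if total_loss / len(samples) < tol:
            break
    return weights
```

A real feature extraction model would be a neural network trained with a deep learning framework; the loop structure of predict, compare with the label, adjust, and check the stopping condition is the part this sketch is meant to show.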
Further, the training set acquisition module may include:
the live video acquisition unit is used for acquiring live videos from the live broadcast room of each anchor;
an audio segment intercepting unit, configured to intercept, from each of the live videos, audio fragments of songs performed by the anchor in the live broadcast room and audio fragments of songs played by the anchor in the live broadcast room;
the training audio generation unit is used for sequentially generating a training spectrogram corresponding to the music audio in each audio fragment, taking the generated training spectrogram as the annotation label of that audio fragment to form a training audio, and forming the training set from the training audios.
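How a training spectrogram might be generated for an audio fragment can be sketched with a short-time Fourier transform. The application does not state how the spectrogram is computed or how the music audio is isolated from the fragment, so the sketch assumes the fragment already contains only music samples and uses a naive DFT with a Hann window; the frame size, hop length and function names are illustrative assumptions.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def magnitude_spectrogram(
    samples: Sequence[float], frame_size: int = 8, hop: int = 4
) -> List[List[float]]:
    """Return spectrogram[t][k]: magnitude of bin k in frame t (naive STFT)."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Hann window to reduce spectral leakage at the frame edges.
        windowed = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        spectrum = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


def make_training_audio(samples: Sequence[float]) -> Tuple[Sequence[float], List[List[float]]]:
    """Pair an audio fragment with its spectrogram, which serves as the label."""
    return samples, magnitude_spectrogram(samples)
```

In practice a library FFT and much larger frames would be used; the naive DFT only keeps the sketch free of external dependencies.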
Further, the judging module may include:
and the identifier recognition unit is used for judging whether the audio file contains a song identifier, an original singer identifier, an album identifier and/or a producer identifier.
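The check performed by this unit can be sketched as a lookup over the file's metadata. The metadata is assumed to be available as a dictionary, and the key names in `SONG_INFO_FIELDS` are hypothetical; the application only lists the kinds of identifiers that are checked.

```python
from typing import Dict

# Hypothetical metadata keys for the four kinds of identifiers listed above.
SONG_INFO_FIELDS = ("song_id", "original_singer", "album", "producer")


def has_song_info(metadata: Dict) -> bool:
    """Return True if at least one identifier is present and non-empty."""
    for field in SONG_INFO_FIELDS:
        value = metadata.get(field)
        if value is not None and str(value).strip():
            return True
    return False
```

The "and/or" wording is read here as "any one identifier is sufficient", which is an interpretation rather than something the text states outright.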
Further, the determining module may include:
a maximum value point selecting unit, configured to select all maximum value points from the spectrogram, and determine a moment and an amplitude value corresponding to each maximum value point;
and the amplitude value utilization unit is used for forming an audio fingerprint according to the moment corresponding to each maximum value point and the amplitude value.
Further, the amplitude value utilization unit may include:
the hash value generation component is used for generating hash values corresponding to the amplitude values;
and the hash value ordering component is used for sorting the hash values according to the moments corresponding to the amplitude values to form the audio fingerprint.
Further, the determining module may further include:
the audio fingerprint matching unit is used for matching the audio fingerprint with each song in a preset song library and calculating the similarity between the fingerprint information of each song and the audio fingerprint;
the target similarity determining unit is used for selecting the maximum similarity from the similarities as target similarity and comparing the target similarity with a preset similarity threshold;
and the song identification determining unit is used for determining the name of the song corresponding to the target similarity when the target similarity determining unit determines that the target similarity exceeds a preset similarity threshold, and taking the name as the song identification matched with the audio file.
The song detection apparatus provided by the embodiment of the application can be applied to song detection equipment, such as a PC terminal, a cloud platform, a server cluster and the like. Optionally, fig. 3 shows a block diagram of the hardware structure of the song detection equipment; referring to fig. 3, the hardware structure of the song detection equipment may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete the communication with each other through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application;
the memory 3 may comprise a high-speed RAM, and may further comprise a non-volatile memory, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
acquiring an audio file which is uploaded by an anchor end and contains audio stream data;
judging whether the audio file also contains song information or not;
if yes, extracting the song information from the audio file, and determining a song identifier corresponding to the audio file according to the song information, wherein the song identifier is used for determining whether the audio file contains forbidden songs or not;
if not, inputting the audio stream data in the audio file to a preset feature extraction model, extracting a spectrogram of song audio in the audio stream data by using the feature extraction model, extracting audio fingerprints from the spectrogram, and determining song identifications matched with the audio file according to the audio fingerprints.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
The embodiment of the present application also provides a readable storage medium storing a program adapted to be executed by a processor, the program being configured to:
acquiring an audio file which is uploaded by an anchor end and contains audio stream data;
judging whether the audio file also contains song information or not;
if yes, extracting the song information from the audio file, and determining a song identifier corresponding to the audio file according to the song information, wherein the song identifier is used for determining whether the audio file contains forbidden songs or not;
if not, inputting the audio stream data in the audio file to a preset feature extraction model, extracting a spectrogram of song audio in the audio stream data by using the feature extraction model, extracting audio fingerprints from the spectrogram, and determining song identifications matched with the audio file according to the audio fingerprints.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for the parts that are identical or similar between embodiments, the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Various embodiments of the present application may be combined with each other. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A song detection method, comprising:
acquiring an audio file which is uploaded by an anchor end and contains audio stream data;
judging whether the audio file also contains song information or not;
if yes, extracting the song information from the audio file, and determining a song identifier corresponding to the audio file according to the song information, wherein the song identifier is used for determining whether the audio file contains forbidden songs or not;
if not, inputting the audio stream data in the audio file to a preset feature extraction model, extracting a spectrogram of song audio in the audio stream data by using the feature extraction model, extracting audio fingerprints from the spectrogram, and determining song identifications matched with the audio file according to the audio fingerprints.
2. The song detection method of claim 1, wherein the training process of the feature extraction model comprises:
acquiring an initial feature extraction model and a training set, wherein the training set consists of two types of training audio, one type being derived from audio fragments of songs performed by different anchors in a live broadcast room and the other type being derived from audio fragments of songs played by different anchors in the live broadcast room, and each training audio is labeled with a corresponding training spectrogram;
inputting each training audio to the initial feature extraction model in sequence to obtain a predicted spectrogram output by the initial feature extraction model;
and adjusting parameters of the initial feature extraction model according to the predicted spectrogram and the training spectrogram of the input training audio until the initial feature extraction model meets preset conditions, and taking the initial feature extraction model obtained by final training as the feature extraction model.
3. The song detection method of claim 2, wherein obtaining a training set comprises:
acquiring live videos from the live broadcast room of each anchor;
intercepting, from each live video, audio fragments of songs performed by the anchor in the live broadcast room and audio fragments of songs played by the anchor in the live broadcast room;
and sequentially generating a training spectrogram corresponding to the music audio in each audio fragment, taking the generated training spectrogram as the annotation label of that audio fragment to form a training audio, and forming the training set from the training audios.
4. The song detection method of claim 1, wherein said determining whether song information is also included in the audio file comprises:
and judging whether the audio file contains a song identifier, an original singer identifier, an album identifier and/or a producer identifier.
5. The song detection method of claim 1, wherein extracting audio fingerprints from the spectrogram comprises:
selecting all maximum points from the spectrogram, and determining the moment and the amplitude value corresponding to each maximum point;
and forming the audio fingerprint according to the corresponding moment and amplitude value of each maximum point.
6. The song detection method of claim 5, wherein forming an audio fingerprint based on the time and amplitude values corresponding to each maximum point comprises:
generating a hash value corresponding to each amplitude value;
and sorting the hash values according to the moments corresponding to the amplitude values to form the audio fingerprint.
7. The song detection method of claim 1, wherein the determining, from the audio fingerprint, a song identification that matches the audio file comprises:
matching the audio fingerprint with each song in a preset song library, and calculating the similarity between the fingerprint information of each song and the audio fingerprint;
selecting the maximum similarity from the similarities as a target similarity, and comparing the target similarity with a preset similarity threshold;
and when the target similarity exceeds a preset similarity threshold, determining the name of the song corresponding to the target similarity, and taking the name as a song identifier matched with the audio file.
8. A song detection apparatus, comprising:
the acquisition module is used for acquiring the audio file which is uploaded by the anchor and contains the audio stream data;
the judging module is used for judging whether the audio file also contains song information;
the extraction module is used for extracting song information from the audio file if the judgment module determines that the audio file also contains the song information, and determining a song identifier corresponding to the audio file according to the song information, wherein the song identifier is used for determining whether the audio file contains forbidden songs or not;
and the determining module is used for inputting the audio stream data in the audio file into a preset feature extraction model if the judging module determines that the audio file does not contain song information, extracting a spectrogram of song audio in the audio stream data by using the feature extraction model, extracting audio fingerprints from the spectrogram, and determining song identifications matched with the audio file according to the audio fingerprints.
9. A song detection apparatus comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the song detection method according to any one of claims 1-7.
10. A readable storage medium having stored thereon a computer program, which, when executed by a processor, implements the steps of the song detection method according to any one of claims 1-7.
CN202310932072.9A 2023-07-26 2023-07-26 Song detection method, device, equipment and readable storage medium Pending CN116781944A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310932072.9A CN116781944A (en) 2023-07-26 2023-07-26 Song detection method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310932072.9A CN116781944A (en) 2023-07-26 2023-07-26 Song detection method, device, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN116781944A true CN116781944A (en) 2023-09-19

Family

ID=88011646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310932072.9A Pending CN116781944A (en) 2023-07-26 2023-07-26 Song detection method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116781944A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination