CN117390215A - Method, apparatus and storage medium for retrieving audio - Google Patents



Publication number
CN117390215A
Authority
CN
China
Prior art keywords
audio
song
similarity
determining
version
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311253261.XA
Other languages
Chinese (zh)
Inventor
陈颖
龚韬
谭志力
苏斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202311253261.XA
Publication of CN117390215A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63: Querying
    • G06F16/632: Query formulation
    • G06F16/634: Query by example, e.g. query by humming
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63: Querying
    • G06F16/638: Presentation of query results
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/686: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a method, an apparatus, and a storage medium for retrieving audio, belonging to the technical field of audio recognition. During audio retrieval, if the retrieved audio set includes non-quality versions of song audio, the server adjusts the set into an audio set composed of quality-version audio, determines a retrieval result based on the adjusted audio set, and sends the retrieval result to the user's terminal. When the user listens to the song audio in the retrieval result, no time is wasted on non-quality versions, so retrieval efficiency is improved.

Description

Method, apparatus and storage medium for retrieving audio
Technical Field
The present disclosure relates to the field of audio recognition technologies, and in particular, to a method, an apparatus, and a storage medium for retrieving audio.
Background
Humming search is a music search method that searches for songs based on characteristics of the music, such as pitch, melody, and lyrics. To perform a humming search, the user hums a short piece of the melody or lyrics, and the corresponding song is retrieved.
In the related art, song retrieval may return non-quality versions of audio, such as pirated songs and low-quality cover songs. The user has to listen to each audio in the search result to find the desired one, and listening to non-quality versions inevitably wastes the user's time, so retrieval efficiency is low.
Disclosure of Invention
The present disclosure provides a method, apparatus, and storage medium for retrieving audio, which can solve the problems of the related art.
In a first aspect, there is provided a method of retrieving audio, the method comprising:
acquiring retrieval information of audio to be retrieved, which is sent by a terminal, wherein the retrieval information comprises characteristic information of the audio to be retrieved;
identifying similar audio corresponding to the audio to be retrieved to obtain an audio set;
adjusting the audio set according to preset information to obtain an adjusted audio set consisting of high-quality version audio, wherein the preset information is determined according to at least one of the playing amount, release time and labels of the audio;
determining a search result starting time corresponding to each audio in the adjusted audio set based on the characteristic information of the audio to be searched, wherein the search result starting time is used as a playing starting time when the terminal plays the corresponding audio;
and sending the audio related information corresponding to each audio in the adjusted audio set and the search result starting time to the terminal.
In one possible implementation manner, the preset information is a pre-established correspondence between the non-quality version audio and the quality version audio;
the step of adjusting the audio set according to preset information to obtain an adjusted audio set composed of high-quality version audio, including:
and replacing the non-quality version audio in the audio set with the corresponding quality version audio according to the pre-established corresponding relation between the non-quality version audio and the quality version audio, so as to obtain an adjusted audio set.
In one possible implementation, the method further includes:
based on the characteristic information of each audio in an audio library, determining the similarity between different audio in the audio library;
determining a plurality of audio groups based on the similarity between different audio in the audio library, wherein the similarity between any two audio in the same audio group is greater than a similarity threshold;
for each audio group, determining a premium version of audio from among a plurality of audio of the audio group, if the audio group includes other audio in addition to the premium version of audio, determining the other audio as a non-premium version of audio;
and establishing a corresponding relation between the non-quality version audio and the quality version audio based on the determined non-quality version audio and the corresponding quality version audio.
In a possible implementation manner, the identifying similar audio corresponding to the audio to be retrieved to obtain an audio set includes:
for each audio in an audio library, determining the similarity of the audio and the audio to be searched based on the similarity of the characteristic information of each segment of the audio and the characteristic information of the audio to be searched;
and determining the audio with the similarity meeting the specified condition as the similar audio corresponding to the audio to be searched, and forming an audio set.
In a possible implementation, the characteristic information comprises melody characteristics and/or lyrics characteristics.
In one possible implementation, the method further includes:
and for each audio in the adjusted audio set, determining the similarity between the characteristic information of each segment of the audio and the characteristic information of the audio to be searched, determining at least one target segment whose similarity is greater than a similarity threshold, and determining the search result starting time based on the target segment with the earliest playing time.
In one possible implementation manner, the determining the search result starting time based on the target segment with the earliest playing time includes:
and determining the starting time of the target segment with the earliest playing time as the search result starting time.
In one possible implementation manner, the determining the search result starting time based on the target segment with the earliest playing time includes:
and determining the starting time of the segment immediately preceding the target segment with the earliest playing time as the search result starting time.
In one possible implementation, the segments are sentences of the audio.
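The two start-time variants above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `Segment` tuple layout, the function name `result_start_time`, and the `use_previous_segment` flag are all hypothetical, and the sketch assumes that the target segment in question is the one with the earliest playing time among segments above the threshold.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: (start_time_seconds, similarity_to_query) per segment.
Segment = Tuple[float, float]


def result_start_time(
    segments: List[Segment],
    threshold: float,
    use_previous_segment: bool = False,
) -> Optional[float]:
    """Return the playback start time for one audio, or None if no segment matches."""
    ordered = sorted(segments, key=lambda s: s[0])
    # Target segments: similarity strictly greater than the threshold.
    targets = [i for i, (_, sim) in enumerate(ordered) if sim > threshold]
    if not targets:
        return None
    first = targets[0]  # target segment with the earliest playing time
    if use_previous_segment and first > 0:
        first -= 1  # second variant: start one segment earlier
    return ordered[first][0]
```

With `use_previous_segment=True`, playback begins one sentence before the first matching sentence, falling back to the matching sentence itself when it is the first segment of the audio.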
In a second aspect, there is provided an apparatus for retrieving audio, the apparatus comprising:
the acquisition module is used for acquiring retrieval information of the audio to be retrieved, which is sent by the terminal, wherein the retrieval information comprises characteristic information of the audio to be retrieved;
the identification module is used for identifying similar audio corresponding to the audio to be retrieved to obtain an audio set;
the adjusting module is used for adjusting the audio set according to preset information to obtain an adjusted audio set composed of high-quality version audio, wherein the preset information is determined according to at least one of the playing amount, the release time and the label of the audio;
the determining module is used for determining the search result starting time corresponding to each audio in the adjusted audio set based on the characteristic information of the audio to be searched, wherein the search result starting time is used as the playing starting time when the terminal plays the corresponding audio;
and the feedback module is used for sending the audio related information corresponding to each audio in the adjusted audio set and the search result starting time to the terminal.
In one possible implementation manner, the preset information is a pre-established correspondence between the non-quality version audio and the quality version audio;
the adjusting module is used for:
and replacing the non-quality version audio in the audio set with the corresponding quality version audio according to the pre-established corresponding relation between the non-quality version audio and the quality version audio, so as to obtain an adjusted audio set.
In one possible implementation, the adjusting module is further configured to:
based on the characteristic information of each audio in an audio library, determining the similarity between different audio in the audio library;
determining a plurality of audio groups based on the similarity between different audio in the audio library, wherein the similarity between any two audio in the same audio group is greater than a similarity threshold;
for each audio group, determining a premium version of audio from among a plurality of audio of the audio group, if the audio group includes other audio in addition to the premium version of audio, determining the other audio as a non-premium version of audio;
and establishing a corresponding relation between the non-quality version audio and the quality version audio based on the determined non-quality version audio and the corresponding quality version audio.
In one possible implementation manner, the identification module is configured to:
for each audio in an audio library, determining the similarity of the audio and the audio to be searched based on the similarity of the characteristic information of each segment of the audio and the characteristic information of the audio to be searched;
and determining the audio with the similarity meeting the specified condition as the similar audio corresponding to the audio to be searched, and forming an audio set.
In a possible implementation, the characteristic information comprises melody characteristics and/or lyrics characteristics.
In one possible implementation manner, the determining module is further configured to:
and for each audio in the adjusted audio set, determining the similarity between the characteristic information of each segment of the audio and the characteristic information of the audio to be searched, determining at least one target segment whose similarity is greater than a similarity threshold, and determining the search result starting time based on the target segment with the earliest playing time.
In one possible implementation manner, the determining module is configured to:
and determining the starting time of the target segment with the earliest playing time as the search result starting time.
In one possible implementation manner, the determining module is configured to:
and determining the starting time of the segment immediately preceding the target segment with the earliest playing time as the search result starting time.
In one possible implementation, the segments are sentences of the audio.
In a third aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory having stored therein at least one instruction that is loaded and executed by the processor to implement a method as described in the first aspect and possible implementations thereof.
In a fourth aspect, a computer readable storage medium is provided, in which at least one instruction is stored, which is loaded and executed by the processor to implement a method as described in the first aspect and possible implementations thereof.
In a fifth aspect, a computer program product is provided, the computer program product comprising computer program code for, when executed by a computer device, performing the method of the first aspect and possible implementations thereof.
The technical scheme provided by the embodiment of the disclosure has the beneficial effects that at least:
In the embodiment of the disclosure, during audio retrieval, if the retrieved audio set includes non-quality versions of song audio, the server adjusts the retrieved audio set into an audio set composed of quality-version audio, determines a retrieval result based on the adjusted audio set, and sends the retrieval result to the user's terminal. When the user listens to the song audio in the retrieval result, no time is wasted on non-quality versions, so retrieval efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a schematic diagram of a computer device according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of retrieving audio provided by an embodiment of the present disclosure;
FIG. 3 is a flow chart of a method for establishing a correspondence between non-premium version audio and premium version audio provided by an embodiment of the present disclosure;
FIG. 4 is a flow chart of a method of retrieving audio provided by an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an apparatus for retrieving audio according to an embodiment of the present disclosure.
Detailed Description
For the purposes of clarity, technical solutions and advantages of the present disclosure, the following further details the embodiments of the present disclosure with reference to the accompanying drawings.
The embodiment of the disclosure provides a method for retrieving audio. The execution subject of the method may be a computer device, and the computer device may be a server. The computer device may be a single server or a server group. If it is a single server, that server may be responsible for all the processing in the following scheme; if it is a server group, different servers in the group may each be responsible for different processing in the following scheme, and the specific allocation may be set by a technician according to actual requirements, which will not be described herein.
The server may be a background server of an application program with an audio recognition function, such as music player software. In the embodiment of the disclosure, retrieving audio with music player software is taken as an example for the detailed description of the scheme; other cases are similar and will not be repeated.
Fig. 1 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure, and from a hardware perspective, the structure of the computer device may be as shown in fig. 1, and includes a processor 110, a memory 120, and a communication unit 130.
The processor 110 may be a central processing unit (CPU), a system on chip (SoC), or the like, and may be configured to identify similar song audio corresponding to the audio to be retrieved, among other tasks.
The memory 120 may include various volatile or nonvolatile memories, such as a solid state disk (SSD), dynamic random access memory (DRAM), and the like. The memory 120 may be used to store initial data, intermediate data, and result data used in recording and retrieving audio, for example, the correspondence between non-premium version audio and premium version audio of a song.
The communication unit 130 may be a wired network connector, an ultra-wideband (UWB) module, a wireless fidelity (Wi-Fi) module, a Bluetooth module, a cellular network communication module, etc. The communication unit 130 may be used for data transmission with other devices, which may be servers, terminals, or the like; for example, it may receive retrieval information carrying the audio to be retrieved.
In daily life and work, users often need to search for audio. The embodiments of the present disclosure provide a method for retrieving audio that can retrieve various types of audio, such as song audio, speech audio, crosstalk (xiangsheng) audio, and so on. In the embodiment of the present disclosure, song audio is taken as an example for the detailed description of the scheme; other cases are similar and will not be repeated.
Users often need to retrieve songs when using a music player, typically by searching for song names or singer names. When the user does not know the song name or singer name of the song to be searched, a humming search button can be clicked in music player software installed on the terminal, then a small piece of song is hummed to serve as audio to be searched, the terminal sends the audio to be searched to a background server, the background server searches the song according to the audio to be searched and sends a search result to the terminal, and the terminal displays the search result. The embodiment of the disclosure provides a method for retrieving audio, and the corresponding processing flow can be shown in fig. 2.
201, the server acquires retrieval information of audio to be retrieved, which is sent by the terminal.
The retrieval information includes feature information of the audio to be retrieved. The feature information may be the audio data of the audio to be retrieved, or information obtained by extracting features from the audio data with a machine learning model.
When the user wants to retrieve a song without knowing its name, the user can click a 'humming retrieval' or 'listen and recognize song' button in the music player software. In response to the click, the terminal starts recording audio, takes the recorded audio as the audio to be retrieved, and generates retrieval information. The terminal can then send the retrieval information to the server, where the retrieval information instructs the server to retrieve the audio to be retrieved.
202, the server identifies similar song audio corresponding to the audio to be retrieved, and obtains a song audio set.
The server may first obtain feature information of the audio to be retrieved, where the feature information may include melody features, lyric features, and the like. Then, for a given song audio in the audio library, the server can use a similarity algorithm to calculate the similarity between the feature information of each segment of the song audio and the feature information of the audio to be retrieved, calculate the average of these per-segment similarities, and take the average as the similarity between the song audio and the audio to be retrieved. The similarity between every other song audio in the audio library and the audio to be retrieved can be determined in the same way. The song audio in the audio library is divided in advance into a plurality of segments according to its lyric text information, and each segment can correspond to one sentence of lyrics.
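The averaging step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `song_similarity` and the pluggable `segment_sim` callback are hypothetical, and the actual similarity algorithm is not specified by the source.

```python
from statistics import mean
from typing import Callable, Sequence, TypeVar

F = TypeVar("F")  # feature type (e.g. a pitch sequence or lyric text)


def song_similarity(
    segment_features: Sequence[F],
    query_feature: F,
    segment_sim: Callable[[F, F], float],
) -> float:
    """Average the per-segment similarities to get one score per song audio."""
    if not segment_features:
        return 0.0  # assumption: a song with no segments never matches
    return mean(segment_sim(seg, query_feature) for seg in segment_features)
```

Because `segment_sim` is passed in, the same helper works for melody features, lyric features, or any other per-segment comparison.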
The characteristic information may include melody characteristics and/or lyric characteristics. When the characteristic information includes only melody characteristics, the melody similarity between each segment of the song audio and the audio to be retrieved may be calculated and used as the similarity between the song audio and the audio to be retrieved. When the characteristic information includes only lyric characteristics, the lyric similarity between each segment of the song audio and the audio to be retrieved can be calculated and used as the similarity between the song audio and the audio to be retrieved. When the characteristic information includes both melody characteristics and lyric characteristics, the melody similarity and the lyric similarity can be combined by weighted summation, and the weighted sum is used as the similarity between the song audio and the audio to be retrieved.
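A minimal sketch of the three cases above. The function name `combined_similarity` and the default weight of 0.5 are assumptions for illustration; the source does not state the weights.

```python
from typing import Optional


def combined_similarity(
    melody_sim: Optional[float] = None,
    lyric_sim: Optional[float] = None,
    melody_weight: float = 0.5,  # hypothetical default; not given in the source
) -> float:
    """Combine melody and lyric similarity, depending on which features exist."""
    if melody_sim is None and lyric_sim is None:
        raise ValueError("at least one similarity is required")
    if lyric_sim is None:
        return melody_sim  # melody features only
    if melody_sim is None:
        return lyric_sim  # lyric features only
    # Both present: weighted sum, with weights adding up to 1.
    return melody_weight * melody_sim + (1.0 - melody_weight) * lyric_sim
```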
The specific process of obtaining melody similarity according to melody characteristics and lyric similarity according to lyric characteristics may be as follows:
(1) Melody features
The melody feature may specifically be a pitch sequence, that is, the sequence formed by the pitch of each audio frame. The server may match the pitch sequence of each segment of the song audio against the pitch sequence of the audio to be retrieved to determine the melody similarity. Specifically, the server inputs the audio to be retrieved into a pitch extraction algorithm to obtain its pitch sequence, then uses a distance measurement method to calculate the distance between that pitch sequence and the pitch sequence of each segment of the song audio in the music library, and performs similarity matching according to the calculated distance: the shorter the distance, the higher the similarity. The pitch sequence of each song audio in the music library can be obtained from a corresponding musical instrument digital interface (MIDI) file, and the distance measurement method can be Euclidean distance, dynamic time warping, and the like.
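As one possible distance measure, the sketch below computes a textbook dynamic time warping (DTW) distance between two pitch sequences and maps it to a similarity. The mapping `1 / (1 + distance)` is an assumption chosen only so that a shorter distance gives a higher similarity; the source does not specify it, and the function names are hypothetical.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def melody_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed mapping: shorter distance -> similarity closer to 1."""
    return 1.0 / (1.0 + dtw_distance(a, b))
```

DTW tolerates tempo differences, which suits hummed queries: a held note in the query can align with a single shorter note in the reference at no extra cost.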
(2) Lyrics feature
The lyric feature can be lyric text information recognized from the audio to be retrieved using automatic speech recognition technology, or lyric phoneme information. The server can match the lyric text information of each segment of the song audio against the lyric text information of the audio to be retrieved; specifically, the lyric similarity can be determined using a lyric search technique. The server may also match the lyric phoneme information of each segment of the song audio against the lyric phoneme information of the audio to be retrieved; specifically, the lyric similarity can be determined using a phoneme search technique.
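The source does not name a specific lyric or phoneme search technique. As a stand-in, the sketch below scores two lyric texts (or two phoneme lists) with Python's standard `difflib.SequenceMatcher`; this is a swapped-in general-purpose sequence matcher, not the technique used in the disclosure, and the function name is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Hashable, Sequence


def lyric_similarity(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Return a ratio in [0, 1]; works on strings or on lists of phonemes."""
    return SequenceMatcher(None, a, b).ratio()
```

The same function accepts plain text (`"hello world"`) or a phoneme list (`["n", "i", "h", "ao"]`), since `SequenceMatcher` compares any sequences of hashable items.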
After the server calculates the similarity between each song audio in the audio library and the audio to be retrieved using the characteristic information, the song audio whose similarity meets a specified condition can be determined as the similar song audio corresponding to the audio to be retrieved, and the similar song audio forms a song audio set. The specified condition may be that the similarity with the audio to be retrieved ranks near the top, that the similarity with the audio to be retrieved is greater than a similarity threshold, and the like.
When the specified condition is that the similarity ranks near the top, the song audio can be sorted in descending order of similarity, and the top N song audio (N is a positive integer, typically 5) are taken as the similar song audio corresponding to the audio to be retrieved, forming the song audio set.
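Both specified conditions (top-N ranking and similarity threshold) can be sketched in one helper. The function name `select_similar` and the dictionary input are hypothetical; the default of 5 follows the typical value of N mentioned above.

```python
import heapq
from typing import Dict, List, Optional


def select_similar(
    similarities: Dict[str, float],
    top_n: int = 5,
    threshold: Optional[float] = None,
) -> List[str]:
    """Return audio ids ordered by descending similarity."""
    candidates = list(similarities.items())
    if threshold is not None:
        # Optional condition: keep only audio above the similarity threshold.
        candidates = [(k, v) for k, v in candidates if v > threshold]
    best = heapq.nlargest(top_n, candidates, key=lambda kv: kv[1])
    return [audio_id for audio_id, _ in best]
```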
203, the server adjusts the audio set according to preset information to obtain an adjusted audio set composed of premium version audio.
The preset information can be determined according to at least one of the play count, the release time, and the label of the audio.
The preset information may be a pre-established correspondence between non-premium version audio and premium version audio. The corresponding processing of this step may be: the server replaces the non-premium version audio in the song audio set with the corresponding premium version audio based on the pre-established correspondence between the non-premium version audio and the premium version audio of the song, and an adjusted song audio set is obtained.
Since the song audio set in step 202 is determined only by similarity matching, the obtained set may contain non-premium version audio that is of lower quality but highly similar to the hummed audio. For each audio in the song audio set, the server may look the audio up among the non-premium version audio included in the pre-established correspondence between non-premium version audio and premium version audio of the song. If the audio is found, the server determines the corresponding premium version audio in the correspondence and replaces the audio in the set with it. For example, song audio a1 is a non-premium version of song A sung by an unknown singer. According to the correspondence between non-premium version audio and premium version audio of song A, song audio a1 is determined to be non-premium version audio, song audio a0 is further determined in the correspondence to be the premium version of song A, and song audio a1 is replaced with song audio a0. The establishment of the correspondence between non-premium version audio and premium version audio of a song is described in detail below.
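The replacement step can be sketched with a dictionary that maps each non-premium audio id to its premium counterpart. The names `adjust_audio_set` and `premium_of` are hypothetical, and the de-duplication of results (when the premium version is already in the set) is an added assumption that the source does not describe.

```python
from typing import Dict, List


def adjust_audio_set(audio_ids: List[str], premium_of: Dict[str, str]) -> List[str]:
    """Replace non-premium audio with its premium version, keeping order."""
    adjusted: List[str] = []
    seen = set()
    for audio_id in audio_ids:
        # Audio absent from the mapping is treated as already premium.
        target = premium_of.get(audio_id, audio_id)
        if target not in seen:  # assumption: avoid listing one song twice
            seen.add(target)
            adjusted.append(target)
    return adjusted
```

For the example in the text, `adjust_audio_set(["a1"], {"a1": "a0"})` yields a set containing only `a0`.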
204, the server feeds back the retrieval result based on the adjusted song audio set.
The server takes the song related information corresponding to each song audio in the adjusted song audio set as the retrieval result, and sends the retrieval result to the terminal that sent the retrieval information. After receiving the retrieval result, the terminal displays an option for each song audio in the retrieval result, and the song related information corresponding to the song audio can be displayed in the option. The user can click the option of a song audio; in response to the click, the terminal sends a request for the song audio to the server, receives the song audio sent by the server, and plays it. The song related information may be the song name, singer name, etc.
In the embodiment of the disclosure, when some non-quality version song audio is retrieved in the process of retrieving audio, the server replaces the non-quality version song audio with the corresponding quality version song audio, and then sends the adjusted retrieval result to the terminal of the user. When the user listens to each song audio in the search result, the user time is not wasted due to the listening to the song audio with the non-quality version, so that the search efficiency is improved.
The embodiment of the disclosure provides a method for establishing a correspondence between non-premium version audio and premium version audio of a song, and a corresponding processing flow may be shown in fig. 3.
301, the server determines the similarity between different song audios in the audio library based on the characteristic information of each song audio in the audio library.
The method of determining the similarity between the audio of different songs in the audio library is similar to the process in step 202 and will not be described again here.
302, the server determines a plurality of audio groups based on the similarity between the audio of different songs in the audio library.
Wherein the similarity between any two song audio in the same audio group is greater than a similarity threshold.
Multiple song audio belonging to the same audio group may be considered multiple different singing versions of the same song.
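The source requires that any two song audio in the same group exceed the similarity threshold, but does not say how groups are formed. The sketch below uses a simple greedy pass as one possible approach: an audio joins the first existing group in which it is similar enough to every member, otherwise it starts a new group. The function name `group_audio` and the `sim` callback are hypothetical.

```python
from typing import Callable, List, Sequence


def group_audio(
    audio_ids: Sequence[str],
    sim: Callable[[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Greedy grouping that preserves the pairwise-similarity property."""
    groups: List[List[str]] = []
    for audio_id in audio_ids:
        for group in groups:
            # Join only if similar to every member already in the group.
            if all(sim(audio_id, member) > threshold for member in group):
                group.append(audio_id)
                break
        else:
            groups.append([audio_id])  # no suitable group: start a new one
    return groups
```

The greedy result depends on input order, so it is a heuristic rather than an optimal partition; it does guarantee that every pair within a group exceeds the threshold.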
303, the server determines a premium version of audio from among the plurality of song audio of each audio group, and if other song audio is included in the audio group in addition to the premium version of audio, determines the other song audio as non-premium version audio.
There are a number of specific ways to determine premium versions of audio in a plurality of song audio of an audio group:
In a first mode, the song audio with the highest play count among the plurality of song audio in the audio group is determined as the premium version audio. Among a plurality of different singing versions of a song, the song audio with the highest play count may be regarded as the song audio of higher quality, and thus may be regarded as the premium version audio.
In a second mode, the song audio with the earliest release time among the plurality of song audio in the audio group is determined as the premium version audio. Among a plurality of different singing versions of a song, the song audio that was released earliest may be regarded as the song audio of higher quality, and thus may be regarded as the premium version audio.
In a third mode, the song audio having a preset premium label among the plurality of song audio in the audio group is determined as the premium version audio. The premium label may be, for example, "original version" or "premium cover", and may be added when the song is imported into the audio library. Among a plurality of different singing versions of a song, the song audio with a premium label may be regarded as the song audio of higher quality, and thus may be regarded as the premium version audio.
After determining the premium version of audio in each audio group, if other song audio is also included in the audio group, the other song audio may be determined to be a non-premium version of audio.
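The three selection modes of step 303 can be sketched as follows. The `SongAudio` record, its field names, the mode strings, and the label values in `PREMIUM_LABELS` are hypothetical placeholders; the source does not say what happens when no audio in a group carries a premium label, so returning `None` in that case is an assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Set

# Placeholder label strings; the real label vocabulary is not given in the source.
PREMIUM_LABELS = {"original", "premium cover"}


@dataclass
class SongAudio:
    audio_id: str
    play_count: int
    release_date: date
    labels: Set[str] = field(default_factory=set)


def pick_premium(group: List[SongAudio], mode: str) -> Optional[SongAudio]:
    """Choose the premium version of one audio group using one of the three modes."""
    if mode == "play_count":
        # Mode 1: highest play count.
        return max(group, key=lambda s: s.play_count)
    if mode == "release_date":
        # Mode 2: earliest release time.
        return min(group, key=lambda s: s.release_date)
    if mode == "label":
        # Mode 3: first audio carrying a preset premium label, if any.
        labelled = [s for s in group if s.labels & PREMIUM_LABELS]
        return labelled[0] if labelled else None
    raise ValueError(f"unknown mode: {mode}")


example_group = [
    SongAudio("orig", 500, date(2001, 5, 1), {"original"}),
    SongAudio("cover_1", 9000, date(2015, 3, 2)),
    SongAudio("cover_2", 40, date(2020, 1, 1)),
]
```

Every audio in the group other than the selected one would then be treated as non-premium version audio.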
304, the server establishes a corresponding relationship between the non-premium version audio and the premium version audio of the song based on the determined non-premium version audio and the corresponding premium version audio.
Based on the premium version audio and the non-premium version audio determined for each audio group in step 303, the server obtains the song-related information of both and stores them in association with each other as the correspondence between the non-premium version audio and the premium version audio of the song.
In the embodiment of the disclosure, the correspondence between the non-premium version audio and the premium version audio of a song is established in advance, which can improve the accuracy and quality of the premium version song audio that is determined, and thus the quality of the song audio in the search result obtained during retrieval.
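As a rough illustration of step 304, the correspondence can be held as a lookup table from each non-premium audio to its premium counterpart. The sketch below uses plain audio identifiers in place of the song-related information the source stores; the function name and the input layout (premium id mapped to all ids in its group) are assumptions.

```python
from typing import Dict, List


def build_correspondence(groups: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every non-premium audio id to the premium audio id of its group.

    `groups` maps a premium audio id to all audio ids in its group,
    including the premium audio itself.
    """
    correspondence: Dict[str, str] = {}
    for premium_id, members in groups.items():
        for member in members:
            if member != premium_id:
                correspondence[member] = premium_id
    return correspondence


correspondence = build_correspondence(
    {"orig": ["orig", "cover_1", "cover_2"], "solo": ["solo"]}
)
```

A group that contains only its premium audio contributes no entries, which matches the condition in step 303 that non-premium audio exists only when the group holds other audio.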
The embodiment of the disclosure provides a method for retrieving audio, and the corresponding processing flow can be shown in fig. 4.
401, the server acquires retrieval information carrying audio to be retrieved, which is sent by the terminal.
402, the server identifies similar song audio corresponding to the audio to be retrieved, and obtains a song audio set.
403, the server replaces the non-premium version audio in the song audio set with the corresponding premium version audio based on the pre-established correspondence between the non-premium version audio and the premium version audio of the song, and obtains an adjusted song audio set.
The specific processing of steps 401-403 is similar to that of steps 201-203 and will not be described again here.
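The replacement in step 403 can be sketched as a lookup against the pre-established correspondence. The function name and the use of audio identifiers are illustrative; removing duplicates when several non-premium versions map to the same premium version is an assumption, suggested by the source describing the result as a set.

```python
from typing import Dict, List


def replace_with_premium(
    result_ids: List[str], correspondence: Dict[str, str]
) -> List[str]:
    """Swap non-premium audio for its premium version, keeping the first occurrence."""
    adjusted: List[str] = []
    seen = set()
    for audio_id in result_ids:
        # Audio without an entry is already premium (or ungrouped) and is kept as-is.
        premium_id = correspondence.get(audio_id, audio_id)
        if premium_id not in seen:
            seen.add(premium_id)
            adjusted.append(premium_id)
    return adjusted


adjusted_set = replace_with_premium(
    ["cover_1", "other", "cover_2", "orig"],
    {"cover_1": "orig", "cover_2": "orig"},
)
```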
404, for each song audio in the adjusted song audio set, the server determines the similarity between the feature information of each segment of the song audio and the feature information of the audio to be retrieved, determines at least one target segment whose similarity is greater than a similarity threshold, and takes the start time of the target segment with the earliest playing time as the search result start time.
The search result start time is the start time of the part of the song audio serving as the search result that matches the audio to be retrieved.
Since a hummed query generally covers 2-3 sentences of lyrics, the server can combine the feature information of M adjacent lyric sentences of the song audio (M may be a preset positive integer) and treat those M adjacent sentences as one segment. Different segments may share sentences. For example, if a song audio has 9 lyric sentences A, B, C, D, E, F, G, H, I and 3 adjacent sentences are preset as one segment, the song audio can be divided into seven segments: ABC, BCD, CDE, DEF, EFG, FGH, GHI. The specific process of determining the similarity between the feature information of each segment of the song audio and the feature information of the audio to be retrieved may be as follows:
If M = 1, the similarity between the feature information of each segment of each song audio in the adjusted song audio set and the feature information of the audio to be retrieved, already calculated in step 402, may be used directly.
If M > 1, each song audio in the adjusted song audio set may be segmented so that each segment includes M adjacent lyric sentences, and the similarity between the feature information of each segment of each song audio and the feature information of the audio to be retrieved is then calculated.
After the similarity between the feature information of each segment of the song audio and the feature information of the audio to be retrieved is determined, at least one target segment whose similarity is greater than the similarity threshold is determined for each song audio, and the start time of the target segment with the earliest playing time is taken as the search result start time.
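The overlapping segmentation described above amounts to a sliding window of M adjacent lyric sentences with a step of one sentence. The sketch below reproduces the nine-sentence example; the function name is hypothetical, and returning the whole song as a single segment when it has fewer than M sentences is an assumption the source does not address.

```python
from typing import List


def sliding_segments(sentences: List[str], m: int) -> List[List[str]]:
    """Split lyric sentences into overlapping segments of m adjacent sentences."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    if not sentences:
        return []
    if len(sentences) < m:
        # Assumption: a song shorter than m sentences forms one segment.
        return [list(sentences)]
    return [sentences[i:i + m] for i in range(len(sentences) - m + 1)]


segments = ["".join(s) for s in sliding_segments(list("ABCDEFGHI"), 3)]
```

With M = 1 the function returns one segment per sentence, which corresponds to the case where the similarities from step 402 can be reused directly.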
And 405, the server sends the song related information corresponding to each song audio in the adjusted song audio set and the search result starting time to the terminal.
The search result starting time is used as the playing starting time when the terminal plays the corresponding song audio.
The server takes the song-related information and the search result start time corresponding to each song audio in the adjusted song audio set as the search result, and sends the search result to the terminal that sent the search information. After receiving the search result, the terminal displays an option for each song audio in the search result, and the song-related information and the corresponding search result start time can be displayed in the option. The user may click the option of a song audio; in response to the click operation, the terminal sends a request for that song audio to the server, receives the song audio sent by the server, and starts playing it from the search result start time.
Optionally, in step 404, the start time of the segment immediately preceding the target segment with the earliest playing time may instead be determined as the search result start time. This is because, in the segment where the audio to be retrieved begins (called the start segment), the portion covered by the audio to be retrieved may be short, so the similarity between the audio to be retrieved and the start segment may be low and the start segment may not be identified as a target segment. Starting one segment earlier therefore better ensures that the matched part is played in full.
In the embodiment of the disclosure, if the audio set retrieved during audio retrieval includes non-premium version song audio, the server adjusts it into an audio set composed of premium version audio, determines the search result based on the adjusted audio set, and sends the search result to the user's terminal. When the user listens to the song audio in the search result, no time is wasted listening to non-premium versions, so retrieval efficiency is improved.
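Both ways of choosing the search result start time (the earliest-playing target segment itself, or the segment just before it) can be sketched in one function. The input layout, a list of `(start_time_seconds, similarity)` pairs in playback order, and the `back_off` flag are assumptions made for illustration; returning `None` when no segment exceeds the threshold is also an assumption.

```python
from typing import List, Optional, Tuple


def result_start_time(
    segments: List[Tuple[float, float]],
    threshold: float,
    back_off: bool = False,
) -> Optional[float]:
    """Return the playback start time for one song audio.

    `segments` holds (start_time_seconds, similarity) pairs in playback order.
    With back_off=True, the start time of the segment preceding the earliest
    target segment is returned when such a segment exists.
    """
    for index, (start, similarity) in enumerate(segments):
        if similarity > threshold:
            # The first hit in playback order is the earliest target segment.
            if back_off and index > 0:
                return segments[index - 1][0]
            return start
    return None


example_segments = [(0.0, 0.1), (5.0, 0.4), (10.0, 0.9), (15.0, 0.95)]
```

In this toy data the segment starting at 5.0 s scores below the threshold even though the match may begin inside it, which is the situation the optional variant is meant to cover.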
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
Based on the same technical concept, the embodiment of the present application further provides an apparatus for retrieving audio, where the apparatus is applied to the server in the foregoing embodiment, as shown in fig. 5, and the apparatus includes:
the obtaining module 510 is configured to obtain search information of an audio to be searched sent by a terminal, where the search information includes feature information of the audio to be searched;
the identifying module 520 is configured to identify similar audio corresponding to the audio to be retrieved, so as to obtain an audio set;
the adjusting module 530 is configured to adjust the audio set according to preset information, to obtain an adjusted audio set composed of high-quality version audio, where the preset information is determined according to at least one of a playing amount, a release time, and a tag of the audio;
a determining module 540, configured to determine, based on the feature information of the audio to be retrieved, a search result start time corresponding to each audio in the adjusted audio set, where the search result start time is used as a play start time when the terminal plays the corresponding audio;
and the feedback module 550 is configured to send, to the terminal, the audio-related information and the search result start time corresponding to each audio in the adjusted audio set.
In one possible implementation, the preset information is a pre-established correspondence between non-premium version audio and premium version audio;
the adjusting module 530 is configured to:
replacing the non-premium version audio in the audio set with the corresponding premium version audio according to the pre-established correspondence between non-premium version audio and premium version audio, so as to obtain the adjusted audio set.
In one possible implementation, the adjusting module 530 is further configured to:
based on the characteristic information of each audio in an audio library, determining the similarity between different audio in the audio library;
determining a plurality of audio groups based on the similarity between different audio in the audio library, wherein the similarity between any two audio in the same audio group is greater than a similarity threshold;
for each audio group, determining a premium version of audio from among a plurality of audio of the audio group, if the audio group includes other audio in addition to the premium version of audio, determining the other audio as a non-premium version of audio;
and establishing a correspondence between the non-premium version audio and the premium version audio based on the determined non-premium version audio and the corresponding premium version audio.
In one possible implementation, the identifying module 520 is configured to:
for each audio in an audio library, determining the similarity of the audio and the audio to be searched based on the similarity of the characteristic information of each segment of the audio and the characteristic information of the audio to be searched;
and determining the audio with the similarity meeting the specified condition as the similar audio corresponding to the audio to be searched, and forming an audio set.
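The identification performed by the identifying module 520 can be sketched as below. The source does not say how per-segment similarities are combined into one score per audio, nor what the "specified condition" is; taking the maximum segment similarity and requiring it to exceed a threshold are both assumptions, and the function name and input layout are hypothetical.

```python
from typing import Dict, List


def similar_audio(
    segment_similarities: Dict[str, List[float]], threshold: float
) -> List[str]:
    """Return ids of audio whose best segment similarity exceeds the threshold.

    `segment_similarities` maps an audio id to the similarity of each of its
    segments to the audio to be retrieved.
    """
    # Assumption: the audio-level score is the best-matching segment's score.
    scores = {
        audio_id: max(sims)
        for audio_id, sims in segment_similarities.items()
        if sims
    }
    hits = [audio_id for audio_id, score in scores.items() if score > threshold]
    # Most similar audio first.
    return sorted(hits, key=lambda audio_id: scores[audio_id], reverse=True)


audio_set = similar_audio(
    {"s1": [0.1, 0.92, 0.3], "s2": [0.2, 0.4], "s3": [0.85]}, threshold=0.8
)
```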
In a possible implementation, the characteristic information comprises melody characteristics and/or lyrics characteristics.
In one possible implementation, the determining module 540 is further configured to:
for each audio in the adjusted audio set, determining the similarity between the feature information of each segment of the audio and the feature information of the audio to be retrieved, determining at least one target segment whose similarity is greater than a similarity threshold, and determining the search result start time based on the target segment with the earliest playing time.
In one possible implementation, the determining module 540 is configured to:
determining the start time of the target segment with the earliest playing time as the search result start time.
In one possible implementation, the determining module 540 is configured to:
determining the start time of the segment immediately preceding the target segment with the earliest playing time as the search result start time.
In one possible implementation, the segments are sentences of the audio.
The obtaining module 510, the identifying module 520, the adjusting module 530, the determining module 540, and the feedback module 550 may be implemented by a processor in the server, by the processor in cooperation with a memory, or by the processor executing program instructions stored in the memory.
In the embodiment of the disclosure, if the audio set retrieved during audio retrieval includes non-premium version song audio, the server adjusts it into an audio set composed of premium version audio, determines the search result based on the adjusted audio set, and sends the search result to the user's terminal. When the user listens to the song audio in the search result, no time is wasted listening to non-premium versions, so retrieval efficiency is improved.
The specific manner in which each module of the apparatus in the above embodiment performs its operations has been described in detail in the method embodiments and will not be repeated here.
It should be noted that the apparatus for retrieving audio provided in the above embodiment is illustrated only by the above division of functional modules. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the apparatus for retrieving audio and the method for retrieving audio provided in the above embodiments belong to the same concept; the specific implementation process is detailed in the method embodiments and is not repeated here.
In an exemplary embodiment, there is also provided a computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the method of retrieving audio in the above-described embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, comprising at least one instruction that is loaded and executed by a processor to implement the method of retrieving audio in the above-described embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing describes only preferred embodiments of the present application and is not intended to limit the present application; any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.
It should be noted that, the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals (including but not limited to signals transmitted between the user terminal and other devices, etc.) referred to in this application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions.

Claims (11)

1. A method of retrieving audio, the method comprising:
acquiring retrieval information of audio to be retrieved, which is sent by a terminal, wherein the retrieval information comprises characteristic information of the audio to be retrieved;
identifying similar audio corresponding to the audio to be retrieved to obtain an audio set;
adjusting the audio set according to preset information to obtain an adjusted audio set consisting of high-quality version audio, wherein the preset information is determined according to at least one of the playing amount, release time and labels of the audio;
determining a search result starting time corresponding to each audio in the adjusted audio set based on the characteristic information of the audio to be searched, wherein the search result starting time is used as a playing starting time when the terminal plays the corresponding audio;
and sending the audio related information corresponding to each audio in the adjusted audio set and the search result starting time to the terminal.
2. The method according to claim 1, wherein the preset information is a pre-established correspondence between non-premium version audio and premium version audio;
the step of adjusting the audio set according to preset information to obtain an adjusted audio set composed of high-quality version audio, including:
and replacing the non-premium version audio in the audio set with the corresponding premium version audio according to the pre-established correspondence between non-premium version audio and premium version audio, so as to obtain the adjusted audio set.
3. The method according to claim 2, wherein the method further comprises:
based on the characteristic information of each audio in an audio library, determining the similarity between different audio in the audio library;
determining a plurality of audio groups based on the similarity between different audio in the audio library, wherein the similarity between any two audio in the same audio group is greater than a similarity threshold;
for each audio group, determining a premium version of audio from among a plurality of audio of the audio group, if the audio group includes other audio in addition to the premium version of audio, determining the other audio as a non-premium version of audio;
and establishing a correspondence between the non-premium version audio and the premium version audio based on the determined non-premium version audio and the corresponding premium version audio.
4. A method according to any one of claims 1-3, wherein said identifying similar audio corresponding to said audio to be retrieved results in an audio collection comprising:
for each audio in an audio library, determining the similarity of the audio and the audio to be searched based on the similarity of the characteristic information of each segment of the audio and the characteristic information of the audio to be searched;
and determining the audio with the similarity meeting the specified condition as the similar audio corresponding to the audio to be searched, and forming an audio set.
5. The method of claim 4, wherein the characteristic information comprises melodic characteristics and/or lyric characteristics.
6. The method according to claim 4, wherein the method further comprises:
for each audio in the adjusted audio set, determining the similarity between the feature information of each segment of the audio and the feature information of the audio to be retrieved, determining at least one target segment whose similarity is greater than a similarity threshold, and determining the search result start time based on the target segment with the earliest playing time.
7. The method of claim 6, wherein determining the search result start time based on the target segment with the earliest playing time comprises:
determining the start time of the target segment with the earliest playing time as the search result start time.
8. The method of claim 6, wherein determining the search result start time based on the target segment with the earliest playing time comprises:
determining the start time of the segment immediately preceding the target segment with the earliest playing time as the search result start time.
9. The method of claim 4, wherein the segments are sentences of the audio.
10. A computer device comprising a memory and a processor, the memory for storing computer instructions; the processor executes the computer instructions stored in the memory to cause the computer device to perform the method of any one of the preceding claims 1-9.
11. A computer readable storage medium, characterized in that the computer readable storage medium stores computer program code which, in response to being executed by a computer device, performs the method of any of the preceding claims 1-9.
CN202311253261.XA 2023-09-26 2023-09-26 Method, apparatus and storage medium for retrieving audio Pending CN117390215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311253261.XA CN117390215A (en) 2023-09-26 2023-09-26 Method, apparatus and storage medium for retrieving audio

Publications (1)

Publication Number Publication Date
CN117390215A (en) 2024-01-12

Family

ID=89467473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311253261.XA Pending CN117390215A (en) 2023-09-26 2023-09-26 Method, apparatus and storage medium for retrieving audio

Country Status (1)

Country Link
CN (1) CN117390215A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination