CN113536026A - Audio searching method, device and equipment

Audio searching method, device and equipment

Info

Publication number
CN113536026A
Authority
CN
China
Prior art keywords
audio
searched
original multimedia
multimedia file
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010286315.2A
Other languages
Chinese (zh)
Other versions
CN113536026B (en)
Inventor
夏朱荣
张士伟
唐铭谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010286315.2A priority Critical patent/CN113536026B/en
Publication of CN113536026A publication Critical patent/CN113536026A/en
Application granted granted Critical
Publication of CN113536026B publication Critical patent/CN113536026B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 16/632 Query formulation (G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING; G06F 16/00 Information retrieval, database structures and file system structures therefor; G06F 16/60 of audio data; G06F 16/63 Querying)
    • G06N 3/04 Architecture, e.g. interconnection topology (G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks)
    • G06N 3/08 Learning methods (G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks)


Abstract

The application discloses an audio searching method, apparatus, and device. The method includes: inputting an original multimedia file into a neural network model and outputting audio features to be searched, where the neural network model is generated through machine learning training on multiple sets of data that include a plurality of different types of instrument audio samples and sound source separation results; and querying a preset retrieval area for audio segments similar to the audio features to be searched to obtain a query result, where the query result includes the audio segment and the time range corresponding to the audio segment. The method and device solve the technical problems of poor universality and low search efficiency in prior-art audio search methods.

Description

Audio searching method, device and equipment
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to an audio search method, apparatus and device.
Background
When video material is collected from video data generated by multimedia artificial intelligence technology, related material needs to be gathered not only along the structural dimensions of the video but also for the audio of specific musical instruments.
In the audio search method generally adopted in the prior art, sound source separation is performed on all audio in the search library: one neural network model per instrument extracts the corresponding instrument audio, and when an instrument search is performed, the audio source files whose signal strength for that instrument is not below a certain threshold are looked up directly.
However, this prior art has the following disadvantages: if the specific instrument to be retrieved has no corresponding separation network, or if an unknown audio signal is encountered, the retrieval cannot be carried out at all, so universality is poor; and multiple neural network models are needed to refresh the search library, so generalization capability is poor.
In view of the above problems, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the present application provide an audio searching method, apparatus, and device, aiming to at least solve the technical problems of poor universality and poor search efficiency in prior-art audio search methods.
According to an aspect of the embodiments of the present application, an audio search method is provided, including: inputting an original multimedia file into a neural network model and outputting audio features to be searched, where the neural network model is generated through machine learning training on multiple sets of data that include a plurality of different types of instrument audio samples and sound source separation results; and querying a preset retrieval area for an audio segment similar to the audio feature to be searched to obtain a query result, where the query result includes the audio segment and the time range corresponding to the audio segment.
According to another aspect of the embodiments of the present application, an audio search method is also provided, including: obtaining an audio search request message, where the information carried in the audio search request message at least includes the storage location information of an original multimedia file; obtaining an audio clip similar to the original multimedia file based on the audio search request message to obtain a search result; and feeding back an audio search response message, where the information carried in the audio search response message at least includes the search result.
According to another aspect of the embodiments of the present application, an audio search method is also provided, including: obtaining an audio call request message, where the call parameters carried in the audio call request message include application identification information, application authorization information, and the storage location information of an original multimedia file; obtaining an audio clip similar to the original multimedia file based on the audio call request message to obtain a search result; and feeding back an audio call response message, where the information carried in the audio call response message at least includes the search result.
According to another aspect of the embodiments of the present application, an audio search apparatus is also provided, including: a generating module, configured to input an original multimedia file into a neural network model and output audio features to be searched, where the neural network model is generated through machine learning training on multiple sets of data that include a plurality of different types of instrument audio samples and sound source separation results; and a search module, configured to query a preset retrieval area for an audio segment similar to the audio feature to be searched to obtain a query result, where the query result includes the audio segment and the time range corresponding to the audio segment.
According to another aspect of the embodiments of the present application, a storage medium is also provided, where the storage medium includes a stored program, and when the program runs, the device on which the storage medium is located is controlled to execute any one of the audio search methods described above.
According to another aspect of the embodiments of the present application, an audio search device is also provided, including: a processor; and a memory connected to the processor and configured to provide the processor with instructions for processing the following steps: inputting an original multimedia file into a neural network model and outputting audio features to be searched, where the neural network model is generated through machine learning training on multiple sets of data that include a plurality of different types of instrument audio samples and sound source separation results; and querying a preset retrieval area for an audio segment similar to the audio feature to be searched to obtain a query result, where the query result includes the audio segment and the time range corresponding to the audio segment.
In the embodiments of the present application, an original multimedia file is input into a neural network model and an audio feature to be searched is output, where the neural network model is generated through machine learning training on multiple sets of data that include a plurality of different types of instrument audio samples and sound source separation results; an audio segment similar to the audio feature to be searched is then queried from a preset retrieval area to obtain a query result, where the query result includes the audio segment and the time range corresponding to the audio segment.
It is easy to note that in the embodiments of the present application a neural network model with strong generalization capability is generated through machine learning training on multiple sets of data; that is, only one neural network model is configured, which effectively enhances the universality of the audio search method and avoids refreshing the search library with multiple audio separation models. Even when an unknown instrument audio sample is encountered, inputting the original multimedia file into the neural network model yields the corresponding audio feature to be searched, and an audio segment similar to that feature can be found by querying the preset retrieval area, giving a credible query result.
This achieves the purpose of enhancing the universality and search efficiency of the audio search method, attains the technical effect of improving the credibility of audio search results, and solves the technical problems of poor universality and poor search efficiency in prior-art audio search methods.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a block diagram of a hardware structure of a computer terminal (or mobile device) for implementing an audio search method according to an embodiment of the present application;
FIG. 2 is a flow chart of an audio search method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a scene of an audio search method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a scene of an audio search method according to an embodiment of the present application;
FIG. 5 is a flow chart of another audio search method according to an embodiment of the application;
FIG. 6 is a flow chart of yet another audio search method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an audio search apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an audio search apparatus according to an embodiment of the present invention;
fig. 9 is a block diagram of another computer terminal according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the terms or phrases appearing in the description of the embodiments of the present application are explained as follows:
Sound source separation: the process of separating one or more desired audio signals from a mixed music signal.
Audio search: the process of finding the audio in an audio library most similar to a given query audio by constructing a similarity measure between audio items.
Example 1
In accordance with an embodiment of the present application, an embodiment of an audio search method is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from that given here.
The method provided in embodiment 1 of the present application can be executed on a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a hardware block diagram of a computer terminal (or mobile device) for implementing the audio search method. As shown in fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n in the figure; the processors 102 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication functions. In addition, it may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and does not limit the structure of the electronic device. For example, the computer terminal 10 may include more or fewer components than shown in fig. 1, or have a different configuration from that shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the present application, the data processing circuitry acts as a kind of processor control (for example, selection of a variable resistance termination path connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the audio search method in the embodiment of the present application, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, so as to implement the audio search method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
In the foregoing operating environment, the present application provides an audio search method as shown in fig. 2, where fig. 2 is a flowchart of an audio search method according to an embodiment of the present application, and as shown in fig. 2, the audio search method includes the following method steps:
step S202, inputting an original multimedia file into a neural network model and outputting audio features to be searched, where the neural network model is generated through machine learning training on multiple sets of data that include a plurality of different types of instrument audio samples and sound source separation results;
step S204, querying a preset retrieval area for an audio segment similar to the audio feature to be searched to obtain a query result, where the query result includes the audio segment and the time range corresponding to the audio segment.
In the embodiments of the present application, an original multimedia file is input into a neural network model and an audio feature to be searched is output, where the neural network model is generated through machine learning training on multiple sets of data that include a plurality of different types of instrument audio samples and sound source separation results; an audio segment similar to the audio feature to be searched is then queried from a preset retrieval area to obtain a query result, where the query result includes the audio segment and the time range corresponding to the audio segment.
It is easy to note that in the embodiments of the present application a neural network model with strong generalization capability is generated through machine learning training on multiple sets of data; that is, only one neural network model is configured, which effectively enhances the universality of the audio search method and avoids refreshing the search library with multiple audio separation models. Even when an unknown instrument audio sample is encountered, inputting the original multimedia file into the neural network model yields the corresponding audio feature to be searched, and an audio segment similar to that feature can be found by querying the preset retrieval area, giving a credible query result.
This achieves the purpose of enhancing the universality and search efficiency of the audio search method, attains the technical effect of improving the credibility of audio search results, and solves the technical problems of poor universality and poor search efficiency in prior-art audio search methods.
As an alternative embodiment, the preset search area includes one of the following: audio feature library, video feature library. The original multimedia file includes at least one of: an original audio file, an original video file, or an original picture.
The instrument audio search method proposed in the present application may be applied to, but is not limited to, the following application scenarios: for example, searching for an audio or video clip within a movie or television show; or, while watching a live broadcast, searching for the products displayed by the streamer based on an original audio file, an original video file, or an original picture.
Optionally, the neural network model is a model generated through machine learning training on multiple sets of data, for example an audio feature model; the multiple sets of data include a plurality of different types of instrument audio samples and sound source separation results.
By inputting an original multimedia file into this neural network model, the audio feature to be searched is output; an audio segment similar to the audio feature to be searched is then queried from the preset retrieval area to obtain a query result, which includes the audio segment and the time range corresponding to the audio segment.
As an optional embodiment, based on the neural network model, the input original multimedia file may be segmented in time according to its playback progress to obtain a plurality of audio segments; the audio segments are input into the neural network model, a plurality of candidate features are taken from the last network layer adjacent to the output layer, and the candidate features are weighted and averaged to produce the audio feature to be searched. Similarity measurement is then performed between the audio feature to be searched and the audio features in the preset retrieval area: the Euclidean distance between the audio feature to be searched and each of those features is computed, the results are sorted by distance, and the Top-N audio segments together with their corresponding time ranges are returned as the final query result.
The embodiment of the application provides a universal instrument audio search method that can retrieve similar audio segments from an audio or video library based on a given original multimedia file of an instrument. The generalization capability of the audio feature model is improved through the audio data augmentation method and the multi-label learning described in the embodiments of the application. By training a feature extraction model with strong generalization capability, universality is markedly enhanced, and a credible query result can be given through vectorized recall even when an unknown instrument audio sample is encountered. Only one neural network model is configured in the method, which effectively enhances the universality of the audio search method and further avoids refreshing the search library with multiple audio separation models.
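As a minimal sketch of this query flow (all names are assumptions introduced for illustration: `model.embed` stands for the trained feature network, and `index_features`/`index_meta` for the vectors and metadata of the preset retrieval area; the segment length is not fixed by the patent):
```python
import numpy as np

SEGMENT_SECONDS = 5  # assumed segment length; the patent does not fix a value

def extract_query_feature(model, waveform, sample_rate):
    # Split the input audio by time progress, embed each segment with the
    # neural network model, and weight-average the candidate features.
    step = SEGMENT_SECONDS * sample_rate
    segments = [waveform[i:i + step] for i in range(0, len(waveform), step)]
    candidates = np.stack([model.embed(seg) for seg in segments])  # hypothetical embed()
    weights = np.full(len(candidates), 1.0 / len(candidates))     # uniform weights as a default
    return weights @ candidates

def search(query_feature, index_features, index_meta, top_n=10):
    # Rank library segments by Euclidean distance and return the Top-N
    # clips together with their corresponding time ranges.
    dists = np.linalg.norm(index_features - query_feature, axis=1)
    order = np.argsort(dists)[:top_n]
    return [(index_meta[i]["clip"], index_meta[i]["time_range"], float(dists[i]))
            for i in order]
```
The uniform weights here are only a default; any weighted average over the candidate features fits the description above.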
As an optional embodiment, the neural network model may be generated in two training steps: the first step trains the model with multi-classification as the target; the second step performs combination and crossing on the original training data and carries out multi-label learning on top of the model obtained in the first step, further improving the generalization capability of the neural network model.
In an optional embodiment, the method further includes:
step S302, obtaining an audio classification model by adopting the multiple groups of data through machine learning training;
step S304, carrying out combined cross processing on the multiple groups of data to obtain first mixed data;
step S306, performing multi-label training on the first mixed data to obtain the neural network model.
In a first step, fig. 3 is a schematic view of a scene of an audio search method according to an embodiment of the present application. As shown in fig. 3, the first step is the multi-classification training stage: an audio classification model is obtained through machine learning training on the multiple sets of data, that is, on the plurality of different types of instrument audio samples and the sound source separation results; the multiple sets of data are then combined and crossed (for example, by audio mixing) to obtain first mixed data; and multi-label training is performed on the first mixed data to obtain the neural network model, enabling audio search for a specified instrument 1 or instrument 2.
It should be noted that in the first step, the multi-label training of the first mixed data may follow a general audio classification model training procedure to obtain the neural network model, so the specific multi-label training process is not described in detail.
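A compact sketch of the two training steps, under stated assumptions (PyTorch-style code with hypothetical `backbone`, `clf_head`, and `multilabel_head` modules; the patent does not specify a network architecture or hyperparameters):
```python
import torch
import torch.nn as nn

def train_stage1(backbone, clf_head, loader, epochs=10, lr=1e-3):
    # Step 1: multi-class training on single-instrument samples.
    params = list(backbone.parameters()) + list(clf_head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for audio, label in loader:            # label: one instrument class id
            loss = ce(clf_head(backbone(audio)), label)
            opt.zero_grad()
            loss.backward()
            opt.step()

def train_stage2(backbone, multilabel_head, mixed_loader, epochs=10, lr=1e-4):
    # Step 2: multi-label learning on the combined/crossed (mixed) data.
    params = list(backbone.parameters()) + list(multilabel_head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for audio, labels in mixed_loader:     # labels: multi-hot over instruments
            loss = bce(multilabel_head(backbone(audio)), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
```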
In another optional embodiment, the method further includes:
step S402, combining and crossing the first mixed data with speech audio samples to obtain second mixed data;
and S404, performing multi-label training on the second mixed data, and adjusting the neural network model.
In the second step, as shown in fig. 3, the multi-label training stage combines and crosses the first mixed data with the speech audio samples to obtain second mixed data; multi-label training is then performed on the second mixed data and the neural network model is adjusted, so that audio search can be performed for a specified instrument 1 and/or instrument 2.
It should be noted that in the second step, because single-source audio signals rarely occur in real application scenes, the combination and crossing is performed on the first mixed data together with speech audio samples to obtain the second mixed data, where the number of crossings is generated randomly and the speech audio samples are added randomly (for example, the human voice shown in fig. 3) to match the synthesized soundtracks of movies and television shows; the training target is changed accordingly to the multi-label corresponding to the crossed audio tracks. The second mixed data is then trained in the multi-label manner, which may still follow a general audio classification model training procedure and is therefore not described again.
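A sketch of this augmentation under assumptions (the helper names, the crossing range of 1 to 3 tracks, and the 0.5 speech probability are illustrative choices, not taken from the patent):
```python
import random
import numpy as np

def cross_mix(instrument_tracks, speech_tracks, num_classes):
    # Randomly cross several instrument tracks, randomly add a speech
    # track, and emit the multi-hot label of the crossed audio tracks.
    k = random.randint(1, min(3, len(instrument_tracks)))       # random crossing count
    chosen = random.sample(list(instrument_tracks.items()), k)  # (class_id, waveform) pairs
    length = min(len(w) for _, w in chosen)
    mix = np.sum([w[:length] for _, w in chosen], axis=0)
    if speech_tracks and random.random() < 0.5:                 # random speech addition
        speech = np.resize(random.choice(speech_tracks), length)  # tile/trim to length
        mix = mix + speech
    labels = np.zeros(num_classes, dtype=np.float32)
    for class_id, _ in chosen:
        labels[class_id] = 1.0                                  # multi-label target
    return mix, labels
```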
In an optional embodiment, inputting the original multimedia file into the neural network model, and outputting the audio feature to be searched includes:
step S502, carrying out segmentation processing on the original multimedia file according to the time progress to obtain a plurality of audio segments;
step S504, inputting the plurality of audio segments into the neural network model, and obtaining a plurality of candidate features in the last network layer adjacent to the output layer of the neural network model;
step S506, performing weighted average processing on the multiple candidate features, and outputting the audio feature to be searched.
As an alternative embodiment, as shown in fig. 4, based on the neural network model, the input original multimedia file may be segmented according to its time progress to obtain a plurality of audio segments, for example, but not limited to, audio segment 1 and audio segment 2 shown in fig. 4; the audio segments are then input into the neural network model, a plurality of candidate features are taken from the last network layer adjacent to the output layer, and the candidate features are weighted and averaged to output the audio feature to be searched.
Through this embodiment of the application, even when an unknown instrument audio sample is encountered, inputting the original multimedia file into the neural network model yields the corresponding audio feature to be searched, and an audio segment similar to that feature can be found by querying the preset retrieval area, giving a credible query result.
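One way to obtain the candidate features from the last layer adjacent to the output layer is a forward hook; the sketch below assumes a PyTorch model whose penultimate module is addressable as `model.penultimate` (the name is hypothetical):
```python
import torch

def penultimate_features(model, segments):
    # Capture the activations of the layer just before the output layer
    # for each audio segment, then weight-average the candidates.
    captured = []
    hook = model.penultimate.register_forward_hook(
        lambda module, inputs, output: captured.append(output.detach()))
    try:
        with torch.no_grad():
            for seg in segments:
                model(seg.unsqueeze(0))       # forward pass fills `captured`
    finally:
        hook.remove()
    candidates = torch.cat(captured, dim=0)   # (num_segments, feature_dim)
    weights = torch.full((len(candidates),), 1.0 / len(candidates))
    return weights @ candidates               # the audio feature to be searched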
In another optional embodiment, querying the preset retrieval area for an audio segment similar to the audio feature to be searched to obtain the query result includes:
step S602, carrying out similarity measurement processing on the audio features to be searched and a plurality of audio features in the preset retrieval area to obtain a sequencing result;
step S604, determining the query result based on the sorting result.
In an optional embodiment, the performing similarity measurement processing on the audio feature to be searched and the plurality of audio features in the preset retrieval area to obtain the ranking result includes:
step S702, calculating the Euclidean distance between the audio feature to be searched and each audio feature in the plurality of audio features to obtain a plurality of calculation results;
step S704, sorting the plurality of calculation results according to the Euclidean distance to obtain the sorting result.
In the above optional embodiment, as shown in fig. 4, similarity measurement is performed between the audio feature to be searched and the audio features in the preset retrieval area: the Euclidean distance between the audio feature to be searched and each of those features is computed to obtain a plurality of calculation results; the results are sorted by Euclidean distance to obtain the sorting result, and based on that ranking the Top-N audio segments and their corresponding time ranges are taken as the final query result.
In another optional embodiment, in an application scenario of searching for products displayed by a streamer during a live broadcast, if a user wants to find, on an e-commerce platform, a product the streamer showed during the live broadcast but cannot provide detailed information such as the product name, the detailed information of the product can be determined from an original multimedia file captured during the live broadcast, following the diagram shown in fig. 4, thereby achieving the technical effect of locating the corresponding product on the e-commerce platform.
For example, a streamer shows a watercolor pen in the live room and explains in detail its product category, advantages, and instructions, and a user wants to search for that watercolor pen on an e-commerce platform. The user can capture (record) a clip or save the original multimedia file, which includes the original audio file explaining the watercolor pen, the original video file showing it, and original pictures of it. Following the processing flow shown in fig. 4, an audio segment similar to the audio feature to be searched can be queried from the preset retrieval area based on that original multimedia file, the detail information of the product can be determined, and the corresponding product can then be found on the e-commerce platform, effectively improving the user's product search efficiency and the revenue of the live platform.
In an embodiment of the present application, there is also an optional embodiment, in which the audio search method further includes:
step S710, extracting auxiliary verification information corresponding to the query result from the original multimedia file, wherein the auxiliary verification information includes at least one of the following: video information, image information;
step S712, the query result is verified by using the auxiliary verification information.
In the above optional embodiment, since the query result includes the audio segment and its corresponding time range, after the query result is determined based on the sorting result, auxiliary verification information corresponding to the query result may further be extracted from the original multimedia file. Optionally, the auxiliary verification information is extracted according to the time range of the audio segment; for example, if the time range corresponding to the audio segment is 11 to 13 s, the video information or image information within 11 to 13 s of the original multimedia file is extracted.
By comparing the extracted video or image information with the queried audio segment, the credibility of the query result can be checked: if the comparison indicates consistency, the query result is judged highly credible and can be output directly; if it indicates inconsistency, the query result is judged to have low credibility, audio segments similar to the audio feature to be searched are queried again from the preset retrieval area to obtain a new query result, and the new query result is verified with the auxiliary verification information.
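A sketch of this consistency check under assumptions (the frame accessor and the visual checker `looks_like_source` are hypothetical placeholders; the patent does not name a concrete verifier or decision rule):
```python
def verify_query_result(media, clip, looks_like_source):
    # Cross-check an audio hit against the video/image information that the
    # original multimedia file carries in the same time range.
    start, end = clip["time_range"]              # e.g. (11.0, 13.0) seconds
    frames = media.frames_between(start, end)    # assumed accessor on the file
    votes = sum(looks_like_source(frame, clip["sound_source"]) for frame in frames)
    return votes >= len(frames) / 2              # simple majority as the consistency rule
```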
In the foregoing operating environment, the present application further provides an alternative embodiment of the audio search method, shown in fig. 5. Compared with the audio search method shown in fig. 2, the method in fig. 5 does not need to generate a neural network model through machine learning training on multiple sets of data, yet still achieves the technical effect of effectively enhancing the universality of audio search.
Fig. 5 is a flowchart of another audio searching method according to an embodiment of the present application, and as shown in fig. 5, the alternative audio searching method includes the following method steps:
step S802, obtaining an audio search request message, where the information carried in the audio search request message at least includes: the storage location information of an original multimedia file;
step S804, based on the audio search request message, obtaining an audio clip similar to the original multimedia file, and obtaining a search result;
step S806, feeding back an audio search response message, where the information carried in the audio search response message at least includes: the search result.
In this embodiment of the present application, an audio search request message is obtained, where the information carried in the message at least includes the storage location information of an original multimedia file; an audio clip similar to the original multimedia file is obtained based on the audio search request message to produce a search result; and an audio search response message is fed back, where the information carried in the response message at least includes the search result.
Through this embodiment of the application, based on a given original multimedia file of an instrument, an audio clip similar to that file can be retrieved from an audio library or a video library according to the storage location information input by an individual user, and the clip is fed back to the user through the audio search response message, thereby achieving the aim of returning a credible audio search result to the individual user.
It should be noted that the information carried in the audio search request message at least includes the storage location information of an original multimedia file; the alternative method shown in fig. 5 is therefore better suited to individual users: based on the audio search request input by the user, and specifically relying on the storage location of the original multimedia file, an audio clip similar to the original multimedia file is obtained and the search result is fed back to the individual user.
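As an illustrative sketch of the message shapes used in figs. 5 and 6 (the field names are assumptions; the patent only constrains the minimum contents of each message):
```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AudioSearchRequest:
    # at minimum: the storage location of the original multimedia file
    storage_location: str

@dataclass
class AudioCallRequest:
    # call parameters of the enterprise-facing variant shown in fig. 6
    app_id: str            # application identification information
    app_auth: str          # application authorization information
    storage_location: str  # storage location of the original multimedia file

@dataclass
class AudioSearchResponse:
    # at minimum: the search result (clips plus their time ranges)
    search_result: List[Tuple[str, Tuple[float, float]]] = field(default_factory=list)
```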
As an alternative embodiment, obtaining an audio clip similar to the original multimedia file based on the audio search request message, and obtaining the search result includes:
step S902, classifying the original multimedia files, and extracting audio features to be searched, wherein the audio features to be searched are used for indicating the sound source types of the original multimedia files;
step S904, searching for an audio clip similar to the audio feature to be searched, and obtaining the search result.
In the above optional embodiment, after the audio search request message is obtained, the original multimedia file is located according to the storage location information carried in the message; the file is compared for similarity against the different types of instrument audio samples to obtain comparison results, and the classification to which it belongs is determined from those results to obtain a classification result. The original multimedia file can then be segmented based on the classification result, the audio features to be searched extracted, and the sound source type of the file determined.
In another optional embodiment, the classifying the original multimedia file, and the extracting the audio features to be searched includes:
step S1002, comparing the similarity of the original multimedia file with various different types of musical instrument audio samples respectively to obtain comparison results;
step S1004, determining the classification of the original multimedia file according to the comparison result to obtain a classification result;
step S1006, performing segmentation processing on the original multimedia file based on the classification result, and extracting the audio feature to be searched.
In the above optional embodiment, the original multimedia file may be compared for similarity against multiple different types of instrument audio samples. When a comparison result indicates that the similarity between the file and a certain type of instrument audio sample is greater than or equal to a preset similarity threshold, the file is judged to belong to the same class as that sample; the file can then be segmented according to that class to obtain a plurality of audio segments, and the audio features to be searched are extracted from those segments.
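A sketch of this threshold-based classification step (the cosine similarity measure and the 0.8 threshold are illustrative assumptions; the patent leaves both to the implementation):
```python
import numpy as np

SIMILARITY_THRESHOLD = 0.8  # illustrative value; the patent only says "preset"

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_by_similarity(file_feature, instrument_samples):
    # Compare the file against each type of instrument audio sample and
    # return the class whose similarity clears the preset threshold.
    best_class, best_sim = None, -1.0
    for class_name, sample_feature in instrument_samples.items():
        sim = cosine_similarity(file_feature, sample_feature)
        if sim >= SIMILARITY_THRESHOLD and sim > best_sim:
            best_class, best_sim = class_name, sim
    return best_class  # None means no instrument class matched
```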
In the foregoing operating environment, the present application further provides an embodiment of another alternative audio search method, shown in fig. 6. Compared with the audio search method shown in fig. 2, the method in fig. 6 does not need to generate a neural network model through machine learning training on multiple sets of data, yet still achieves the technical effect of effectively enhancing the universality of audio search.
Fig. 6 is a flowchart of another audio searching method according to an embodiment of the present application, and as shown in fig. 6, the alternative audio searching method includes the following steps:
step S1102, obtaining an audio call request message, where the call parameters carried in the audio call request message include: application identification information, application authorization information, and the storage location information of an original multimedia file;
step S1104, obtaining an audio clip similar to the original multimedia file based on the audio calling request message, and obtaining a search result;
step S1106, feeding back an audio call response message, where the information carried in the audio call response message at least includes: the search result.
In this embodiment of the present application, an audio call request message is obtained, where the call parameters carried in the message include application identification information, application authorization information, and the storage location information of an original multimedia file; an audio clip similar to the original multimedia file is obtained based on the audio call request message to produce a search result; and an audio call response message is fed back, where the information carried in the response message at least includes the search result.
Through the above embodiment of the present application, based on a given original multimedia file of an instrument, the call parameters carried in the audio call request message input by an enterprise user may include application identification information, application authorization information, and the storage location information of the original multimedia file; an audio clip similar to the original multimedia file is retrieved from an audio library or a video library, and the clip is fed back to the enterprise user through the audio call response message, thereby achieving the aim of returning a credible audio search result to the enterprise user.
It should be noted that the information carried in the audio call request message at least includes application identification information, application authorization information, and the storage location information of an original multimedia file; the alternative method shown in fig. 6 is therefore better suited to enterprise users: based on the audio call request input by the enterprise user, and specifically relying on the application identification information, the application authorization information, and the storage location of the original multimedia file, an audio clip similar to the original multimedia file is obtained and the search result is fed back to the enterprise user.
Optionally, the application identification information is used to identify the APP identifier of the enterprise user, for example Sogou Music, Google Music, and the like, and the application authorization information indicates whether the enterprise user has successfully registered. When both the application identification information and the application authorization information pass verification and meet the authorization condition, it is determined that the audio search can be carried out according to the enterprise user's audio call request message without relying on a neural network model.
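A sketch of this authorization gate, reusing the AudioCallRequest shape sketched earlier (the registry lookup is a hypothetical mechanism; the patent does not define how verification is performed):
```python
def authorized(call_request, app_registry):
    # Verify application identification and authorization information
    # before serving the audio call request.
    record = app_registry.get(call_request.app_id)   # assumed registry keyed by app id
    if record is None:
        return False                                 # unknown application
    return record["auth"] == call_request.app_auth   # registered and authorized
```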
In an optional embodiment, obtaining an audio clip similar to the original multimedia file based on the audio call request message, and obtaining the search result includes:
step S1202, classifying the original multimedia files, and extracting audio features to be searched, wherein the audio features to be searched are used for indicating the sound source types of the original multimedia files;
step S1204, searching for an audio segment similar to the audio feature to be searched, and obtaining the search result.
In the above optional embodiment, after the audio call request message is obtained, the original multimedia file is located according to the storage location information carried in the message; the file is compared for similarity against the different types of instrument audio samples to obtain comparison results, and the class to which it belongs is determined from those results to obtain a classification result. The original multimedia file can then be segmented based on the classification result and the audio features to be searched extracted.
In an optional embodiment, the classifying the original multimedia file, and the extracting the audio features to be searched includes:
step S1302, comparing the original multimedia file with a plurality of different types of musical instrument audio samples respectively to obtain comparison results;
step S1304, determining the classification of the original multimedia file according to the comparison result to obtain a classification result;
step S1306, performing segmentation processing on the original multimedia file based on the classification result, and extracting the audio feature to be searched.
In the above optional embodiment, the original multimedia file may be compared for similarity against multiple different types of instrument audio samples. When a comparison result indicates that the similarity between the file and a certain type of instrument audio sample is greater than or equal to a preset similarity threshold, the file is judged to belong to the same class as that sample; the file can then be segmented according to that class to obtain a plurality of audio segments, and the audio feature to be searched is extracted from those segments, that is, the sound source type of the original multimedia file is determined.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
Example 2
According to an embodiment of the present application, an apparatus embodiment for implementing the audio search method is further provided. Fig. 7 is a schematic structural diagram of an audio search apparatus according to an embodiment of the present invention; as shown in fig. 7, the apparatus includes a generating module 50 and a search module 52, wherein:
a generating module 50, configured to input an original multimedia file into a neural network model, and output an audio feature to be searched, where the neural network model is a model generated by using multiple sets of data through machine learning training, and the multiple sets of data include: a plurality of different types of instrument audio samples and sound source separation results; the search module 52 is configured to query, from a preset search area, an audio segment similar to the audio feature to be searched to obtain a query result, where the query result includes: the audio clip and the time range corresponding to the audio clip.
In the embodiments of the present application, an original multimedia file is input into a neural network model and an audio feature to be searched is output, where the neural network model is generated through machine learning training on multiple sets of data that include a plurality of different types of instrument audio samples and sound source separation results; an audio segment similar to the audio feature to be searched is then queried from a preset retrieval area to obtain a query result, where the query result includes the audio segment and the time range corresponding to the audio segment.
It is easy to note that in the embodiments of the present application a neural network model with strong generalization capability is generated through machine learning training on multiple sets of data; that is, only one neural network model is configured, which effectively enhances the universality of the audio search method and avoids refreshing the search library with multiple audio separation models. Even when an unknown instrument audio sample is encountered, inputting the original multimedia file into the neural network model yields the corresponding audio feature to be searched, and an audio segment similar to that feature can be found by querying the preset retrieval area, giving a credible query result.
This achieves the purpose of enhancing the universality and search efficiency of the audio search method, attains the technical effect of improving the credibility of audio search results, and solves the technical problems of poor universality and poor search efficiency in prior-art audio search methods.
It should be noted here that the generating module 50 and the searching module 52 correspond to steps S202 to S204 in embodiment 1, and the two modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure of embodiment 1. It should be noted that the above modules may be operated in the computer terminal 10 provided in embodiment 1 as a part of the apparatus.
According to an embodiment of the present application, there is provided another embodiment of an apparatus for implementing the audio search method, where the audio search apparatus includes: a first acquisition module 60, a second acquisition module 62, and a feedback module 64, wherein:
a first obtaining module 60, configured to obtain an audio search request message, where the information carried in the audio search request message at least includes the storage location information of an original multimedia file; a second obtaining module 62, configured to obtain an audio clip similar to the original multimedia file based on the audio search request message to obtain a search result; and a first feedback module 64, configured to feed back an audio search response message, where the information carried in the audio search response message at least includes the search result.
It should be noted here that the first obtaining module 60, the second obtaining module 62 and the feedback module 64 correspond to steps S802 to S806 in embodiment 1, and the three modules are the same as the corresponding steps in the implementation example and the application scenario, but are not limited to the disclosure in embodiment 1. It should be noted that the above modules may be operated in the computer terminal 10 provided in embodiment 1 as a part of the apparatus.
According to an embodiment of the present application, there is also provided another apparatus embodiment for implementing the audio search method, where the audio search apparatus includes: a third acquisition module 70, a fourth acquisition module 72, and a second feedback module 74, wherein:
the third obtaining module 70 obtains the audio invoking request message, where the invoking parameter carried in the audio invoking request message includes: the method comprises the steps of applying identification information, application authorization information and storage position information of an original multimedia file; a fourth obtaining module 72, configured to obtain an audio clip similar to the original multimedia file based on the audio call request message, and obtain a search result; the second feedback module 74 feeds back an audio invoking response message, where the information carried in the audio invoking response message at least includes: and (5) the search result is obtained.
It should be noted here that the third obtaining module 70, the fourth obtaining module 72 and the second feedback module 74 correspond to steps S1102 to S1106 in embodiment 1, and the three modules are the same as the corresponding steps in the implementation example and the application scenario, but are not limited to the disclosure in embodiment 1. It should be noted that the above modules may be operated in the computer terminal 10 provided in embodiment 1 as a part of the apparatus.
It should be noted that, reference may be made to the relevant description in embodiment 1 for a preferred implementation of this embodiment, and details are not described here again.
Example 3
According to an embodiment of the present application, there is further provided an embodiment of an audio search device, which may be any computing device in a computing device group. Fig. 8 is a schematic structural diagram of an audio search device according to an embodiment of the present application. As shown in fig. 8, the audio search device includes: a processor 600 and a memory 602, where the memory 602 is connected to the processor 600 and provides the processor with instructions for processing the following steps: inputting an original multimedia file into a neural network model, and outputting an audio feature to be searched, where the neural network model is generated through machine learning training using multiple sets of data, and the multiple sets of data include: a plurality of different types of instrument audio samples and sound source separation results; and querying an audio segment similar to the audio feature to be searched from a preset retrieval area to obtain a query result, where the query result includes the audio segment and the time range corresponding to the audio segment.
In this embodiment of the present application, an original multimedia file is input into a neural network model, and an audio feature to be searched is output, where the neural network model is generated through machine learning training using multiple sets of data, and the multiple sets of data include: a plurality of different types of instrument audio samples and sound source separation results; an audio segment similar to the audio feature to be searched is queried from a preset retrieval area to obtain a query result, where the query result includes the audio segment and the time range corresponding to the audio segment.
It is easy to note that, in the embodiment of the present application, a neural network model with strong generalization capability is generated through machine learning training using multiple sets of data; that is, only one neural network model needs to be configured, which effectively enhances the universality of the audio search method and avoids refreshing the search library through multiple audio separation models. Even when an unknown instrument audio sample is encountered, the original multimedia file can be input into the neural network model to obtain the corresponding audio feature to be searched, and an audio segment similar to that feature can be found by querying the preset retrieval area, yielding a credible query result.
Therefore, the universality and search efficiency of the audio search method are enhanced, the credibility of the audio search result is improved, and the technical problems of poor universality and poor search efficiency of audio search methods in the prior art are solved.
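For illustration only, the two processing steps above can be sketched in Python as follows; the stand-in feature extractor, the in-memory retrieval area, and the names `extract_search_feature` and `query_similar` are hypothetical placeholders, not part of the disclosure.

```python
import numpy as np

# Minimal sketch of the claimed pipeline, assuming a trained feature
# extractor and a retrieval area held as (segment_id, time_range, feature)
# triples in memory. All names are illustrative placeholders.

def extract_search_feature(waveform: np.ndarray) -> np.ndarray:
    # Stand-in for the neural network trained on instrument audio samples
    # and sound source separation results; it maps audio to a fixed vector.
    out = np.zeros(128)
    n = min(waveform.size, 128)
    out[:n] = waveform[:n]
    return out

def query_similar(feature: np.ndarray, retrieval_area: list, top_k: int = 5) -> list:
    # Rank indexed segments by Euclidean distance to the query feature;
    # a smaller distance means a more similar audio segment.
    ranked = sorted(retrieval_area, key=lambda e: np.linalg.norm(feature - e[2]))
    return [(seg_id, time_range) for seg_id, time_range, _ in ranked[:top_k]]

# Usage: one indexed segment covering seconds 12.0-17.0 of some file.
area = [("song_42#3", (12.0, 17.0), np.random.randn(128))]
print(query_similar(extract_search_feature(np.random.randn(16000)), area))
```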
It should be noted that, reference may be made to the relevant description in embodiment 1 for a preferred implementation of this embodiment, and details are not described here again.
Example 4
According to an embodiment of the present application, there is further provided an embodiment of a computer terminal, where the computer terminal may be any one computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.
In this embodiment, the computer terminal may execute the program code of the following steps in the audio search method: inputting an original multimedia file into a neural network model, and outputting an audio feature to be searched, where the neural network model is generated through machine learning training using multiple sets of data, and the multiple sets of data include: a plurality of different types of instrument audio samples and sound source separation results; and querying an audio segment similar to the audio feature to be searched from a preset retrieval area to obtain a query result, where the query result includes the audio segment and the time range corresponding to the audio segment.
In this embodiment, the computer terminal may execute the program code of the following steps in the audio search method: obtaining an audio classification model through machine learning training using the multiple sets of data; performing combined cross processing on the multiple sets of data to obtain first mixed data; and performing multi-label training on the first mixed data to obtain the neural network model.
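A minimal sketch of how the combined cross processing and multi-label training could look, assuming fixed-length instrument waveforms with one-hot labels: mixing two samples yields one waveform whose label has two positive bits, trained with a binary cross-entropy objective. The model shape, sample rate, and all names are assumptions; applying the same mixing to language audio samples would yield the second mixed data described next.

```python
import torch
import torch.nn as nn

def mix_pairs(waves: torch.Tensor, labels: torch.Tensor):
    """Combined cross processing: pair samples at random, add the waveforms,
    and take the union of their multi-hot label vectors."""
    idx = torch.randperm(waves.size(0))
    mixed_waves = 0.5 * (waves + waves[idx])
    mixed_labels = torch.clamp(labels + labels[idx], max=1.0)
    return mixed_waves, mixed_labels

model = nn.Sequential(nn.Linear(16000, 256), nn.ReLU(), nn.Linear(256, 10))
criterion = nn.BCEWithLogitsLoss()        # standard multi-label objective
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

waves = torch.randn(32, 16000)            # 32 one-second clips at 16 kHz
labels = torch.eye(10)[torch.randint(0, 10, (32,))]  # one instrument type each

x, y = mix_pairs(waves, labels)           # "first mixed data"
loss = criterion(model(x), y)             # one multi-label training step
loss.backward()
optimizer.step()
```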
In this embodiment, the computer terminal may execute the program code of the following steps in the audio search method: performing combined cross processing on the first mixed data and the language audio samples to obtain second mixed data; and performing multi-label training on the second mixed data to adjust the neural network model.
In this embodiment, the computer terminal may execute the program code of the following steps in the audio search method: segmenting the original multimedia file along the time axis to obtain a plurality of audio segments; inputting the audio segments into the neural network model, and acquiring a plurality of candidate features from the last network layer adjacent to the output layer of the neural network model; and performing weighted average processing on the plurality of candidate features to output the audio feature to be searched.
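This feature-extraction step is sketched below under the assumption that the penultimate-layer activation can be read out per segment (in a real network, for instance via a forward hook); the stand-in `penultimate` function and the uniform weighting are illustrative only.

```python
import numpy as np

def penultimate(segment: np.ndarray) -> np.ndarray:
    # Stand-in for the activation of the last layer before the output layer.
    return segment[:128]

def feature_to_search(waveform: np.ndarray, seg_len: int, weights=None) -> np.ndarray:
    # Segment the audio along the time axis into equal-length pieces.
    segments = [waveform[i:i + seg_len]
                for i in range(0, len(waveform) - seg_len + 1, seg_len)]
    # One candidate feature per segment.
    candidates = np.stack([penultimate(s) for s in segments])
    # The weighted average of the candidates is the audio feature to be searched.
    if weights is None:
        weights = np.full(len(candidates), 1.0 / len(candidates))
    return np.average(candidates, axis=0, weights=weights)

feat = feature_to_search(np.random.randn(16000 * 10), seg_len=16000)
print(feat.shape)  # (128,)
```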
In this embodiment, the computer terminal may execute the program code of the following steps in the audio search method: performing similarity measurement processing between the audio feature to be searched and the plurality of audio features in the preset retrieval area to obtain a ranking result; and determining the query result based on the ranking result.
In this embodiment, the computer terminal may execute the program code of the following steps in the audio search method: calculating the Euclidean distance between the audio feature to be searched and each of the plurality of audio features to obtain a plurality of calculation results; and sorting the plurality of calculation results by Euclidean distance to obtain the ranking result.
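As a vectorized illustration of this distance-and-sort step, assuming the preset retrieval area is held as an N×D feature matrix:

```python
import numpy as np

def rank_by_euclidean(query: np.ndarray, features: np.ndarray):
    # One Euclidean distance per indexed feature (the "calculation results"),
    # then an ascending sort: the most similar feature comes first.
    distances = np.linalg.norm(features - query, axis=1)
    order = np.argsort(distances)
    return order, distances[order]

features = np.random.randn(1000, 128)     # stand-in preset retrieval area
order, dists = rank_by_euclidean(features[7], features)
assert order[0] == 7                      # a feature is closest to itself
```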
In this embodiment, the computer terminal may execute the program code of the following steps in the audio search method: acquiring an audio search request message, where the information carried in the audio search request message at least includes: storage location information of the original multimedia file; acquiring an audio segment similar to the original multimedia file based on the audio search request message to obtain a search result; and feeding back an audio search response message, where the information carried in the audio search response message at least includes: the search result.
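The shape of this request/response exchange can be illustrated as follows; the field names and the placeholder handler are assumptions, chosen only to mirror the information the text says each message at least carries.

```python
from dataclasses import dataclass

@dataclass
class AudioSearchRequest:
    storage_location: str      # where the original multimedia file is stored

@dataclass
class AudioSearchResponse:
    search_result: list        # similar audio segments with their time ranges

def handle_search(req: AudioSearchRequest) -> AudioSearchResponse:
    # A real handler would load the file from req.storage_location, extract
    # the audio feature to be searched, and query the retrieval area.
    return AudioSearchResponse(search_result=[("segment_7", (30.0, 35.0))])

print(handle_search(AudioSearchRequest("oss://bucket/clip.mp4")))
```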
In this embodiment, the computer terminal may execute the program code of the following steps in the audio search method: classifying the original multimedia file and extracting the audio feature to be searched, where the audio feature to be searched is used for indicating the sound source type of the original multimedia file; and searching for audio segments similar to the audio feature to be searched to obtain the search result.
In this embodiment, the computer terminal may execute the program code of the following steps in the audio search method: comparing the similarity between the original multimedia file and each of a plurality of different types of instrument audio samples to obtain comparison results; determining the classification of the original multimedia file according to the comparison results to obtain a classification result; and segmenting the original multimedia file based on the classification result to extract the audio feature to be searched.
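A sketch of this classification step, assuming one reference feature per instrument type and reusing the Euclidean measure from above; the reference set is an illustrative assumption.

```python
import numpy as np

instrument_refs = {                        # one reference feature per type
    "piano": np.random.randn(128),
    "guitar": np.random.randn(128),
    "violin": np.random.randn(128),
}

def classify(file_feature: np.ndarray) -> str:
    # Compare the file against every instrument sample (the comparison
    # results) and pick the closest one as the classification result.
    distances = {name: float(np.linalg.norm(file_feature - ref))
                 for name, ref in instrument_refs.items()}
    return min(distances, key=distances.get)

print(classify(np.random.randn(128)))
```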
In this embodiment, the computer terminal may execute the program code of the following steps in the audio search method: acquiring an audio call request message, where the call parameters carried in the audio call request message include: application identification information, application authorization information, and storage location information of the original multimedia file; acquiring an audio segment similar to the original multimedia file based on the audio call request message to obtain a search result; and feeding back an audio call response message, where the information carried in the audio call response message at least includes: the search result.
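Compared with the plain search request, the call request additionally carries the caller's identity and authorization; a sketch under assumed field names and a toy authorization check follows.

```python
from dataclasses import dataclass

@dataclass
class AudioCallRequest:
    app_id: str                # application identification information
    app_token: str             # application authorization information
    storage_location: str      # where the original multimedia file is stored

def handle_call(req: AudioCallRequest, authorized: dict) -> dict:
    # Reject callers whose identity and authorization do not match.
    if authorized.get(req.app_id) != req.app_token:
        return {"error": "unauthorized"}
    # Otherwise search as before and feed the result back in the response.
    return {"search_result": [("segment_3", (5.0, 9.5))]}

req = AudioCallRequest("demo-app", "secret", "oss://bucket/a.mp4")
print(handle_call(req, authorized={"demo-app": "secret"}))
```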
In this embodiment, the computer terminal may execute the program code of the following steps in the audio search method: classifying the original multimedia file and extracting the audio feature to be searched, where the audio feature to be searched is used for indicating the sound source type of the original multimedia file; and searching for audio segments similar to the audio feature to be searched to obtain the search result.
In this embodiment, the computer terminal may execute the program code of the following steps in the audio search method: comparing the similarity between the original multimedia file and each of a plurality of different types of instrument audio samples to obtain comparison results; determining the classification of the original multimedia file according to the comparison results to obtain a classification result; and segmenting the original multimedia file based on the classification result to extract the audio feature to be searched.
Optionally, fig. 9 is a block diagram of another structure of a computer terminal according to an embodiment of the present application, and as shown in fig. 9, the computer terminal may include: one or more processors 702 (only one of which is shown), memory 704, and a peripheral interface 706.
The memory may be configured to store software programs and modules, such as the program instructions/modules corresponding to the audio search method and apparatus in the embodiments of the present application; the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the audio search method. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, and such remote memory may be connected to the computer terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor can call the information and application programs stored in the memory through the transmission device to execute the following steps: inputting an original multimedia file into a neural network model, and outputting an audio feature to be searched, where the neural network model is generated through machine learning training using multiple sets of data, and the multiple sets of data include: a plurality of different types of instrument audio samples and sound source separation results; and querying an audio segment similar to the audio feature to be searched from a preset retrieval area to obtain a query result, where the query result includes the audio segment and the time range corresponding to the audio segment.
Optionally, the processor may further execute the program code of the following steps: obtaining an audio classification model through machine learning training using the multiple sets of data; performing combined cross processing on the multiple sets of data to obtain first mixed data; and performing multi-label training on the first mixed data to obtain the neural network model.
Optionally, the processor may further execute the program code of the following steps: performing combined cross processing on the first mixed data and the language audio samples to obtain second mixed data; and performing multi-label training on the second mixed data to adjust the neural network model.
Optionally, the processor may further execute the program code of the following steps: segmenting the original multimedia file along the time axis to obtain a plurality of audio segments; inputting the audio segments into the neural network model, and acquiring a plurality of candidate features from the last network layer adjacent to the output layer of the neural network model; and performing weighted average processing on the plurality of candidate features to output the audio feature to be searched.
Optionally, the processor may further execute the program code of the following steps: performing similarity measurement processing between the audio feature to be searched and the plurality of audio features in the preset retrieval area to obtain a ranking result; and determining the query result based on the ranking result.
Optionally, the processor may further execute the program code of the following steps: calculating the Euclidean distance between the audio feature to be searched and each of the plurality of audio features to obtain a plurality of calculation results; and sorting the plurality of calculation results by Euclidean distance to obtain the ranking result.
In this embodiment of the present application, an original multimedia file is input into a neural network model, and an audio feature to be searched is output, where the neural network model is generated through machine learning training using multiple sets of data, and the multiple sets of data include: a plurality of different types of instrument audio samples and sound source separation results; an audio segment similar to the audio feature to be searched is queried from a preset retrieval area to obtain a query result, where the query result includes the audio segment and the time range corresponding to the audio segment.
It is easy to note that, in the embodiment of the present application, a neural network model with strong generalization capability is generated through machine learning training using multiple sets of data; that is, only one neural network model needs to be configured, which effectively enhances the universality of the audio search method and avoids refreshing the search library through multiple audio separation models. Even when an unknown instrument audio sample is encountered, the original multimedia file can be input into the neural network model to obtain the corresponding audio feature to be searched, and an audio segment similar to that feature can be found by querying the preset retrieval area, yielding a credible query result.
Therefore, the universality and search efficiency of the audio search method are enhanced, the credibility of the audio search result is improved, and the technical problems of poor universality and poor search efficiency of audio search methods in the prior art are solved.
Optionally, the processor may further execute the program code of the following steps: acquiring an audio search request message, where the information carried in the audio search request message at least includes: storage location information of the original multimedia file; acquiring an audio segment similar to the original multimedia file based on the audio search request message to obtain a search result; and feeding back an audio search response message, where the information carried in the audio search response message at least includes: the search result.
Optionally, the processor may further execute the program code of the following steps: classifying the original multimedia file and extracting the audio feature to be searched, where the audio feature to be searched is used for indicating the sound source type of the original multimedia file; and searching for audio segments similar to the audio feature to be searched to obtain the search result.
Optionally, the processor may further execute the program code of the following steps: comparing the similarity between the original multimedia file and each of a plurality of different types of instrument audio samples to obtain comparison results; determining the classification of the original multimedia file according to the comparison results to obtain a classification result; and segmenting the original multimedia file based on the classification result to extract the audio feature to be searched.
Optionally, the processor may further execute the program code of the following steps: acquiring an audio call request message, where the call parameters carried in the audio call request message include: application identification information, application authorization information, and storage location information of the original multimedia file; acquiring an audio segment similar to the original multimedia file based on the audio call request message to obtain a search result; and feeding back an audio call response message, where the information carried in the audio call response message at least includes: the search result.
Optionally, the processor may further execute the program code of the following steps: classifying the original multimedia file and extracting the audio feature to be searched, where the audio feature to be searched is used for indicating the sound source type of the original multimedia file; and searching for audio segments similar to the audio feature to be searched to obtain the search result.
Optionally, the processor may further execute the program code of the following steps: comparing the similarity between the original multimedia file and each of a plurality of different types of instrument audio samples to obtain comparison results; determining the classification of the original multimedia file according to the comparison results to obtain a classification result; and segmenting the original multimedia file based on the classification result to extract the audio feature to be searched.
It can be understood by those skilled in the art that the structure shown in fig. 9 is only illustrative, and the computer terminal may also be a terminal device such as a smartphone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. Fig. 9 does not limit the structure of the above electronic device. For example, the computer terminal may also include more or fewer components (e.g., a network interface, a display device, etc.) than shown in fig. 9, or have a different configuration from that shown in fig. 9.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
Example 5
According to an embodiment of the present application, there is also provided an embodiment of a storage medium. Optionally, in this embodiment, the storage medium may be configured to store program codes executed by the audio search method provided in embodiment 1.
Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: inputting an original multimedia file into a neural network model, and outputting an audio feature to be searched, where the neural network model is generated through machine learning training using multiple sets of data, and the multiple sets of data include: a plurality of different types of instrument audio samples and sound source separation results; and querying an audio segment similar to the audio feature to be searched from a preset retrieval area to obtain a query result, where the query result includes the audio segment and the time range corresponding to the audio segment.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining an audio classification model through machine learning training using the multiple sets of data; performing combined cross processing on the multiple sets of data to obtain first mixed data; and performing multi-label training on the first mixed data to obtain the neural network model.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: performing combined cross processing on the first mixed data and the language audio samples to obtain second mixed data; and performing multi-label training on the second mixed data to adjust the neural network model.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: segmenting the original multimedia file along the time axis to obtain a plurality of audio segments; inputting the audio segments into the neural network model, and acquiring a plurality of candidate features from the last network layer adjacent to the output layer of the neural network model; and performing weighted average processing on the plurality of candidate features to output the audio feature to be searched.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: performing similarity measurement processing between the audio feature to be searched and the plurality of audio features in the preset retrieval area to obtain a ranking result; and determining the query result based on the ranking result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: calculating the Euclidean distance between the audio feature to be searched and each of the plurality of audio features to obtain a plurality of calculation results; and sorting the plurality of calculation results by Euclidean distance to obtain the ranking result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring an audio search request message, where the information carried in the audio search request message at least includes: storage location information of the original multimedia file; acquiring an audio segment similar to the original multimedia file based on the audio search request message to obtain a search result; and feeding back an audio search response message, where the information carried in the audio search response message at least includes: the search result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: classifying the original multimedia file and extracting the audio feature to be searched, where the audio feature to be searched is used for indicating the sound source type of the original multimedia file; and searching for audio segments similar to the audio feature to be searched to obtain the search result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: comparing the similarity between the original multimedia file and each of a plurality of different types of instrument audio samples to obtain comparison results; determining the classification of the original multimedia file according to the comparison results to obtain a classification result; and segmenting the original multimedia file based on the classification result to extract the audio feature to be searched.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring an audio call request message, where the call parameters carried in the audio call request message include: application identification information, application authorization information, and storage location information of the original multimedia file; acquiring an audio segment similar to the original multimedia file based on the audio call request message to obtain a search result; and feeding back an audio call response message, where the information carried in the audio call response message at least includes: the search result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: classifying the original multimedia file and extracting the audio feature to be searched, where the audio feature to be searched is used for indicating the sound source type of the original multimedia file; and searching for audio segments similar to the audio feature to be searched to obtain the search result.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: comparing the similarity between the original multimedia file and each of a plurality of different types of instrument audio samples to obtain comparison results; determining the classification of the original multimedia file according to the comparison results to obtain a classification result; and segmenting the original multimedia file based on the classification result to extract the audio feature to be searched.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present application that essentially contributes beyond the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (17)

1. An audio search method, comprising:
inputting an original multimedia file into a neural network model, and outputting an audio feature to be searched, wherein the neural network model is generated by machine learning training using multiple groups of data, and the multiple groups of data comprise: a plurality of different types of instrument audio samples and sound source separation results;
querying an audio segment similar to the audio feature to be searched from a preset retrieval area to obtain a query result, wherein the query result comprises: the audio segment and the time range corresponding to the audio segment.
2. The method of claim 1, further comprising:
obtaining an audio classification model by machine learning training using the multiple groups of data;
performing combined cross processing on the multiple groups of data to obtain first mixed data;
and performing multi-label training on the first mixed data to obtain the neural network model.
3. The method of claim 2, further comprising:
performing combined cross processing on the first mixed data and the language audio samples to obtain second mixed data;
and performing multi-label training on the second mixed data, and adjusting the neural network model.
4. The method according to any one of claims 1 to 3, wherein inputting the original multimedia file into the neural network model and outputting the audio feature to be searched comprises:
segmenting the original multimedia file along the time axis to obtain a plurality of audio segments;
inputting the audio segments into the neural network model, and acquiring a plurality of candidate features from the last network layer adjacent to the output layer of the neural network model;
and performing weighted average processing on the plurality of candidate features to output the audio feature to be searched.
5. The method of claim 1, wherein querying the audio segment similar to the audio feature to be searched from the preset retrieval area to obtain the query result comprises:
performing similarity measurement processing between the audio feature to be searched and the plurality of audio features in the preset retrieval area to obtain a ranking result;
determining the query result based on the ranking result.
6. The method according to claim 5, wherein performing similarity measurement processing between the audio feature to be searched and the plurality of audio features in the preset retrieval area to obtain the ranking result comprises:
calculating the Euclidean distance between the audio feature to be searched and each of the plurality of audio features to obtain a plurality of calculation results;
and sorting the plurality of calculation results by Euclidean distance to obtain the ranking result.
7. The method of claim 1, wherein the preset search area comprises one of:
audio feature library, video feature library.
8. The method of claim 1, further comprising:
extracting auxiliary verification information corresponding to the query result from the original multimedia file, wherein the auxiliary verification information comprises at least one of the following: video information, image information;
and verifying the query result by adopting the auxiliary verification information.
9. An audio search method, comprising:
acquiring an audio search request message, wherein the information carried in the audio search request message at least comprises: storage location information of an original multimedia file;
acquiring an audio segment similar to the original multimedia file based on the audio search request message to obtain a search result;
feeding back an audio search response message, wherein the information carried in the audio search response message at least comprises: the search result.
10. The method of claim 9, wherein obtaining an audio segment similar to the original multimedia file based on the audio search request message to obtain the search result comprises:
classifying the original multimedia file and extracting an audio feature to be searched, wherein the audio feature to be searched is used for indicating the sound source type of the original multimedia file;
and searching for audio segments similar to the audio feature to be searched to obtain the search result.
11. The method of claim 10, wherein classifying the original multimedia file and extracting the audio feature to be searched comprises:
comparing the similarity between the original multimedia file and each of a plurality of different types of instrument audio samples to obtain comparison results;
determining the classification of the original multimedia file according to the comparison results to obtain a classification result;
and segmenting the original multimedia file based on the classification result to extract the audio feature to be searched.
12. An audio search method, comprising:
acquiring an audio call request message, wherein the call parameters carried in the audio call request message comprise: application identification information, application authorization information, and storage location information of an original multimedia file;
acquiring an audio segment similar to the original multimedia file based on the audio call request message to obtain a search result;
feeding back an audio call response message, wherein the information carried in the audio call response message at least comprises: the search result.
13. The method of claim 12, wherein obtaining an audio segment similar to the original multimedia file based on the audio call request message to obtain the search result comprises:
classifying the original multimedia file and extracting an audio feature to be searched, wherein the audio feature to be searched is used for indicating the sound source type of the original multimedia file;
and searching for audio segments similar to the audio feature to be searched to obtain the search result.
14. The method of claim 13, wherein classifying the original multimedia file and extracting the audio feature to be searched comprises:
comparing the similarity between the original multimedia file and each of a plurality of different types of instrument audio samples to obtain comparison results;
determining the classification of the original multimedia file according to the comparison results to obtain a classification result;
and segmenting the original multimedia file based on the classification result to extract the audio feature to be searched.
15. An audio search apparatus, comprising:
the generating module is configured to input an original multimedia file into a neural network model and output an audio feature to be searched, wherein the neural network model is generated by machine learning training using multiple groups of data, and the multiple groups of data comprise: a plurality of different types of instrument audio samples and sound source separation results;
the search module is configured to query an audio segment similar to the audio feature to be searched from a preset retrieval area to obtain a query result, wherein the query result comprises: the audio segment and the time range corresponding to the audio segment.
16. A storage medium, characterized in that the storage medium comprises a stored program, wherein when the program runs, the device where the storage medium is located is controlled to execute the audio search method according to any one of claims 1 to 14.
17. An audio search device, comprising:
a processor; and
a memory coupled to the processor for providing instructions to the processor for processing the following processing steps:
inputting an original multimedia file into a neural network model, and outputting an audio feature to be searched, wherein the neural network model is generated by machine learning training using multiple groups of data, and the multiple groups of data comprise: a plurality of different types of instrument audio samples and sound source separation results;
querying an audio segment similar to the audio feature to be searched from a preset retrieval area to obtain a query result, wherein the query result comprises: the audio segment and the time range corresponding to the audio segment.
CN202010286315.2A 2020-04-13 2020-04-13 Audio searching method, device and equipment Active CN113536026B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010286315.2A CN113536026B (en) 2020-04-13 2020-04-13 Audio searching method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010286315.2A CN113536026B (en) 2020-04-13 2020-04-13 Audio searching method, device and equipment

Publications (2)

Publication Number Publication Date
CN113536026A true CN113536026A (en) 2021-10-22
CN113536026B CN113536026B (en) 2024-01-23

Family

ID=78119885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010286315.2A Active CN113536026B (en) 2020-04-13 2020-04-13 Audio searching method, device and equipment

Country Status (1)

Country Link
CN (1) CN113536026B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060217966A1 (en) * 2005-03-24 2006-09-28 The Mitre Corporation System and method for audio hot spotting
CN101663708A (en) * 2007-04-17 2010-03-03 韩国电子通信研究院 The system and method that is used for searching audio fingerprint by index information
CN103247316A (en) * 2012-02-13 2013-08-14 深圳市北科瑞声科技有限公司 Method and system for constructing index in voice frequency retrieval
CN104615689A (en) * 2015-01-22 2015-05-13 百度在线网络技术(北京)有限公司 Searching method and device
CN107071542A (en) * 2017-04-18 2017-08-18 百度在线网络技术(北京)有限公司 Video segment player method and device
CN108416028A (en) * 2018-03-09 2018-08-17 北京百度网讯科技有限公司 A kind of method, apparatus and server of search content resource
US20200035224A1 (en) * 2018-07-27 2020-01-30 Deepgram, Inc. Deep learning internal state index-based search and classification
CN109493853A (en) * 2018-09-30 2019-03-19 福建星网视易信息系统有限公司 A kind of the determination method and terminal of audio similarity
CN110532866A (en) * 2019-07-22 2019-12-03 平安科技(深圳)有限公司 Video data detection method, device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIANG Dong; YANG Honghao; XU Changqiao: "A Cross-Media Retrieval Method Based on Correlation Analysis and Logarithmic Search Clustering", China Sciencepaper, no. 14 *
DUAN Liangtao: "Research on Multimedia Management Technology Based on Cloud Platform", China Masters' Theses Full-text Database *

Also Published As

Publication number Publication date
CN113536026B (en) 2024-01-23

Similar Documents

Publication Publication Date Title
US10642892B2 (en) Video search method and apparatus
CN106326391B (en) Multimedia resource recommendation method and device
CN111901626B (en) Background audio determining method, video editing method, device and computer equipment
CN107025275B (en) Video searching method and device
CN107590267B (en) Information-pushing method and device, terminal and readable storage medium storing program for executing based on picture
CN105653700A (en) Video search method and system
CN103548076A (en) Device and method for recognizing content using audio signals
US20190188329A1 (en) Method and device for generating briefing
CN104598502A (en) Method, device and system for obtaining background music information in played video
CN102222103A (en) Method and device for processing matching relationship of video content
CN105872717A (en) Video processing method and system, video player and cloud server
CN105190618A (en) Acquisition, recovery, and matching of unique information from file-based media for automated file detection
CN111368141B (en) Video tag expansion method, device, computer equipment and storage medium
CN101071616A (en) System, device, method, and program for segmenting radio broadcast audio data
US20230139539A1 (en) Identifying viewing characteristics of an audience of a content channel
CN109710801A (en) A kind of video searching method, terminal device and computer storage medium
CN107729491B (en) Method, device and equipment for improving accuracy rate of question answer search
CN112667936A (en) Video processing method, device, terminal, server and storage medium
CN112711963A (en) User terminal fault detection method, device, computer device and storage medium
CN104581224A (en) Play content switching method and device and terminal
CN110020150B (en) Information recommendation method and device
CN114845149B (en) Video clip method, video recommendation method, device, equipment and medium
CN110569447B (en) Network resource recommendation method and device and storage medium
CN111523028A (en) Data recommendation method, device, equipment and storage medium
CN113536026A (en) Audio searching method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant