CN115565508A - Song matching method and device, electronic equipment and storage medium - Google Patents

Song matching method and device, electronic equipment and storage medium

Info

Publication number
CN115565508A
CN115565508A (application number CN202211175487.8A)
Authority
CN
China
Prior art keywords
song
tone
information
category
timbre
Prior art date
Legal status
Pending
Application number
CN202211175487.8A
Other languages
Chinese (zh)
Inventor
魏耀都
许成林
郑羲光
张晨
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority claimed from application CN202211175487.8A
Publication of CN115565508A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
        • G10H 1/00 Details of electrophonic musical instruments
        • G10H 1/0008 Associated control or indicating means
        • G10H 2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
        • G10H 2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
        • G10H 2210/041 Musical analysis based on MFCC (mel-frequency cepstral coefficients)
        • G10H 2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
        • G10H 2250/005 Algorithms for electrophonic musical instruments or musical processing, e.g. for automatic composition or resource allocation
        • G10H 2250/311 Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
        • G10L 15/00 Speech recognition
        • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
        • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
        • G10L 15/063 Training
        • G10L 15/08 Speech classification or search
        • G10L 15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a song matching method and apparatus, an electronic device, and a storage medium, and relates to the field of the Internet. The method includes: in response to a song matching request from a singing object, determining first timbre information based on a voice signal of the singing object, the first timbre information representing at least one timbre category of the voice signal; determining at least one target song from a song library based on the first timbre information and a plurality of pieces of second timbre information, where the similarity between the timbre category of the target song and the timbre category of the singing object is greater than a similarity threshold; and returning the at least one target song to the singing object so as to match the singing object with the at least one target song. With the technical solution of the disclosure, the timbre category of the target song matched to the singing object is highly similar to the timbre category of the singing object, the target song is suitable for the singing object to sing, and the accuracy of song matching is improved.

Description

Song matching method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of internet, and in particular, to a song matching method and apparatus, an electronic device, and a storage medium.
Background
With the development of computer technology, singing with singing software has become a popular trend. However, with more than a million popular songs available, how to match suitable songs for a singing object from such a huge catalog is a problem to be solved.
At present, it is common to acquire the registration information and behavior records of a singing object, such as song listening records and singing records authorized by the singing object, and to label the songs in a song library, for example with release year and song style. Songs are then matched to the singing object according to the association between the singing object's registration information and behavior records and the song labels.
The problem with this technical solution is that many songs are associated with the registration information and behavior records of the singing object, most of which do not match the timbre of the singing object and are not suitable for the singing object to sing, so the matching accuracy is low.
Disclosure of Invention
The disclosure provides a song matching method and apparatus, an electronic device, and a storage medium. A target song suitable for a singing object to sing is determined based on the timbre features of the singing object and the timbre features of the songs in a song library, so that the timbre of the target song is highly similar to the timbre of the singing object, the target song is suitable for the singing object to sing, and the accuracy of song matching is improved. The technical solution of the disclosure is as follows:
according to an aspect of the embodiments of the present disclosure, there is provided a song matching method, including:
in response to a song matching request of a singing object, determining first tone color information based on a voice signal of the singing object, wherein the first tone color information is used for representing at least one tone color category of the voice signal;
determining at least one target song from a song library based on the first tone color information and a plurality of second tone color information, wherein the second tone color information is used for representing at least one tone color category of songs in the song library, and the similarity between the tone color category of the target song and the tone color category of the singing object is larger than a similarity threshold value;
returning the at least one target song to the singing object so that the singing object is matched with the at least one target song.
According to another aspect of the embodiments of the present disclosure, there is provided a song matching apparatus including:
a first determination unit configured to determine first timbre information based on a speech signal of a singing object in response to a song matching request of the singing object, the first timbre information representing at least one timbre category of the speech signal;
a second determination unit configured to determine at least one target song from a song library based on the first timbre information and a plurality of second timbre information, the second timbre information being used for representing at least one timbre category of songs in the song library, a similarity between the timbre category of the target song and the timbre category of the singing object being greater than a similarity threshold;
a matching unit configured to return the at least one target song to the singing object so that the singing object matches the at least one target song.
In some embodiments, the first determination unit comprises:
a dividing subunit configured to divide a voice signal of the singing object into a plurality of voice segments;
the extracting subunit is configured to perform feature extraction on any voice segment to obtain a tone characteristic of the voice segment, wherein the tone characteristic is used for representing a tone category of the voice segment;
and the clustering subunit is configured to cluster the tone features of the plurality of voice fragments to obtain at least one tone category and at least one category feature of the voice signals, wherein the category feature is used for representing a clustering center.
In some embodiments, the dividing subunit is configured to equally divide the voice signal into a plurality of voice segments according to a target duration; or, according to a plurality of sentences included in the voice signal, dividing the voice signal into a plurality of voice segments, wherein one voice segment includes one sentence.
In some embodiments, the extracting subunit is configured to perform spectrum feature extraction on the speech segment to obtain a mel cepstrum feature of the speech segment; determining a plurality of speech frame characteristics of the speech segment based on the Mel cepstral features; and determining the tone color characteristics of the voice segments based on the plurality of voice frame characteristics.
In some embodiments, the extracting the features of the voice segment to obtain the tone features of the voice segment is implemented based on an audio feature extractor;
the training step of the audio feature extractor comprises:
performing spectrum feature extraction on a sample audio signal based on a spectrum feature extraction layer in the audio feature extractor to obtain a sample Mel cepstrum feature of the sample audio signal;
processing the sample Mel cepstrum features based on a tone feature extraction layer in the audio feature extractor to obtain sample tone features of the sample audio signals;
processing the sample tone color characteristics based on an object discriminator, a pitch discriminator and an audio type discriminator in the audio characteristic extractor to obtain object loss, pitch loss and audio type loss;
training the audio feature extractor based on the object loss, the pitch loss, and the audio type loss.
In some embodiments, the clustering subunit is configured to cluster the tone features of the plurality of speech segments based on a plurality of clustering information, respectively, to obtain a clustering result of the plurality of clustering information, where the clustering information is used to indicate a category number when clustering is performed, and the clustering result is used to represent inter-class distances and intra-class distances; determining target clustering information based on clustering results of the plurality of clustering information, wherein the target clustering information is the clustering information with the largest ratio of the average inter-class distance to the average intra-class distance; determining the at least one timbre category and the at least one category feature of the speech signal based on the clustering result of the target clustering information.
In some embodiments, the apparatus further comprises:
a signal acquisition unit configured to acquire the voice signal input by the singing object within a historical time period, or to return prompt information to the singing object and acquire the voice signal input by the singing object based on the prompt information.
In some embodiments, the second determining unit is configured to obtain at least one third tone information from the plurality of second tone information, the number of tone categories of the at least one third tone information being not greater than the number of tone categories indicated by the first tone information; acquiring at least one tone information with the similarity greater than a similarity threshold value with the first tone information from the at least one third tone information; and determining the at least one target song corresponding to the at least one tone color information from the song library.
In some embodiments, the second determining unit is further configured to determine, for any third tone information, a first similarity between at least one first tone category in the first tone information and at least one second tone category in the third tone information; for any first timbre category in the first timbre information, determining a second similarity of the first timbre category, the second similarity being based on a minimum value of first similarities between the first timbre category and the at least one second timbre category; determining a sum of at least one second similarity of the at least one first timbre category as a similarity between the first timbre information and the third timbre information.
In some embodiments, the apparatus further comprises:
a song dividing unit configured to divide the song into a plurality of song segments for any one song in the song library;
the characteristic extraction unit is configured to perform characteristic extraction on any song fragment to obtain the tone characteristic of the song fragment, wherein the tone characteristic is used for expressing the tone category of the song fragment;
and the clustering unit is configured to cluster the tone color characteristics of the plurality of song segments to obtain at least one tone color category and at least one category characteristic of the songs, wherein the category characteristic is used for representing a clustering center.
According to another aspect of the embodiments of the present disclosure, there is provided an electronic device including:
one or more processors;
a memory for storing the processor executable program code;
wherein the processor is configured to execute the program code to implement the song matching method described above.
According to another aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium in which program code is provided, which, when executed by a processor of an electronic device, enables the electronic device to perform the above-described song matching method.
According to another aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the song matching method described above.
The embodiment of the disclosure provides a song matching method. At least one timbre category of a voice signal is determined based on the voice signal of a singing object, so that at least one target song whose timbre categories are similar to those of the voice signal can be determined from a song library, and the at least one target song is then returned to the singing object so that the singing object is matched with the at least one target song. Because the similarity between the timbre category of the matched target song and the timbre category of the singing object is high, the target song is suitable for the singing object to sing, and the accuracy of song matching is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a diagram illustrating an implementation environment for a song matching method according to an exemplary embodiment.
Fig. 2 is a flow diagram illustrating a song matching method according to an example embodiment.
Fig. 3 is a flow diagram illustrating another song matching method, according to an example embodiment.
FIG. 4 is a flow diagram illustrating training of an audio feature extractor, according to an example embodiment.
Fig. 5 is a flow chart illustrating a process of extracting timbre features of a singing object according to an exemplary embodiment.
FIG. 6 is a flowchart illustrating extraction of timbre features of songs in a song library according to an exemplary embodiment.
Fig. 7 is a block diagram illustrating a song matching method according to an example embodiment.
Fig. 8 is a block diagram illustrating a song matching apparatus according to an exemplary embodiment.
Fig. 9 is a block diagram illustrating another song matching apparatus according to an example embodiment.
FIG. 10 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Fig. 11 is a schematic diagram illustrating a configuration of a server according to an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.), and signals referred to in this application are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of the relevant data complies with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the speech signals referred to in this application are all acquired with sufficient authorization.
FIG. 1 is a diagram illustrating an implementation environment for a song matching method according to an exemplary embodiment. Taking the electronic device as an example provided as a server, referring to fig. 1, the implementation environment specifically includes: a terminal 101 and a server 102.
The terminal 101 may be at least one of a smartphone, a smart watch, a desktop computer, a laptop computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, and the like. An application program for song matching may be installed and run on the terminal 101. The terminal 101 may be connected to the server 102 through a wireless network or a wired network.
The terminal 101 may be generally referred to as one of a plurality of terminals, and the embodiment is only illustrated by the terminal 101. Those skilled in the art will appreciate that the number of terminals described above may be greater or fewer. For example, the number of the terminals may be only a few, or the number of the terminals may be several tens or hundreds, or more, and the number of the terminals and the type of the device are not limited in the embodiments of the present disclosure.
The server 102 may be at least one of a single server, a server cluster, a cloud computing platform, and a virtualization center. The server 102 may be connected to the terminal 101 and other terminals through a wireless network or a wired network. Optionally, the number of servers may be greater or fewer, which is not limited in the embodiments of the present disclosure. Of course, the server 102 may also include other functional servers in order to provide more comprehensive and diverse services.
Fig. 2 is a flow chart illustrating a song matching method, as shown in fig. 2, performed by a server, according to an exemplary embodiment, including the following steps.
In response to a song matching request of a singing object, the server determines first tone information based on a voice signal of the singing object, the first tone information being used to represent at least one tone category of the voice signal in step S201.
In the embodiment of the present disclosure, when a singing object uses an application program for singing, or any application program containing a singing-recording function, the singing object can send a song matching request to the server through the terminal by triggering a song-requesting operation in the application program. The server provides background services for the application program. In response to the song matching request, the server may acquire a voice signal of the singing object collected during a historical time period, or a voice signal of the singing object collected in real time. By extracting features from the voice signal of the singing object, the server can determine at least one timbre category of the voice signal according to the extracted features. The at least one timbre category reflects the timbre of the singing object.
In step S202, the server determines at least one target song from the song library based on the first timbre information and a plurality of second timbre information, the second timbre information being used for representing at least one timbre category of songs in the song library, a similarity between the timbre category of the target song and the timbre category of the singing object being greater than a similarity threshold.
In the disclosed embodiment, the song library is a collection of songs, and each song in the song library has corresponding timbre categories. Any song may correspond to one or more timbre categories. For example, a song may start in a low-pitched timbre, move to a high-pitched timbre in the middle, and end with a special vocal style, so that the song corresponds to three different timbre categories. Optionally, the server stores a plurality of pieces of second timbre information for the songs in the song library; if the song library is updated, the server may obtain and store the second timbre information of the newly added songs.
After the server acquires the first tone information, the server can determine a target song similar to the tone of the singing object from the plurality of songs based on the similarity degree between at least one tone category of the voice signal and at least one tone category of each song in the song library. Because the similarity between the tone categories of the voice signals of the target song and the singing object is larger than the similarity threshold value, the tone of the target song is not greatly different from that of the singing object, and the singing object is suitable for singing. The similarity threshold is used to determine the degree of similarity between the timbre category of the target song and the timbre category of the speech signal of the singing object.
In step S203, the server returns at least one target song to the singing object so that the singing object matches the at least one target song.
In the embodiment of the present disclosure, after the server determines at least one target song from a plurality of songs in the song library, the target song may be displayed on an application program interface, so that the singing object can select a song from the target songs returned by the server to sing.
The embodiment of the disclosure provides a song matching method. At least one timbre category of a voice signal is determined based on the voice signal of a singing object, so that at least one target song whose timbre categories are similar to those of the voice signal can be determined from a song library, and the at least one target song is then returned to the singing object so that the singing object is matched with the at least one target song. Because the similarity between the timbre category of the matched target song and the timbre category of the singing object is high, the target song is suitable for the singing object to sing, and the accuracy of song matching is improved.
In some embodiments, determining the first timbre information based on the speech signal of the singing object comprises:
dividing a voice signal of a singing object into a plurality of voice fragments;
for any voice segment, carrying out feature extraction on the voice segment to obtain a tone color feature of the voice segment, wherein the tone color feature is used for representing the tone color category of the voice segment;
and clustering the tone features of the plurality of voice segments to obtain at least one tone category and at least one category feature of the voice signals, wherein the category feature is used for representing a clustering center.
The tone of the singing object can be accurately reflected by clustering the tone characteristics of the singing object.
In some embodiments, dividing the speech signal of the singing object into a plurality of speech segments comprises any one of:
equally dividing the voice signal into a plurality of voice segments according to the target duration;
the speech signal is divided into a plurality of speech segments according to a plurality of sentences included in the speech signal, and one speech segment includes one sentence.
The voice signal of the singing object is divided into a plurality of voice fragments through two modes, and the voice signal of the singing object can be processed more accurately.
In some embodiments, performing feature extraction on the speech segment to obtain a timbre feature of the speech segment includes:
extracting the frequency spectrum characteristic of the voice segment to obtain the Mel cepstrum characteristic of the voice segment;
determining a plurality of speech frame characteristics of the speech segments based on the Mel cepstrum characteristics;
based on the plurality of speech frame characteristics, a timbre characteristic of the speech segment is determined.
Through extracting the tone characteristic of the voice fragment, the tone characteristic of the singing object can be obtained, and the accuracy of song matching is improved.
In some embodiments, feature extraction is performed on the voice segments to obtain timbre features of the voice segments, and the feature extraction is implemented based on an audio feature extractor;
the training step of the audio feature extractor comprises:
performing spectrum feature extraction on the sample audio signal based on a spectrum feature extraction layer in the audio feature extractor to obtain a sample Mel cepstrum feature of the sample audio signal;
processing the Mel cepstrum characteristics of the sample based on a tone characteristic extraction layer in the audio characteristic extractor to obtain sample tone characteristics of the sample audio signal;
processing the sample tone color characteristics based on an object discriminator, a pitch discriminator and an audio type discriminator in the audio characteristic extractor to obtain object loss, pitch loss and audio type loss;
the audio feature extractor is trained based on object loss, pitch loss, and audio type loss.
By training the audio characteristic extractor, the tone color characteristics of the singing object can be accurately extracted, and the accuracy of song matching is improved.
In some embodiments, clustering the timbre features of the plurality of speech segments to obtain at least one timbre category and at least one category feature of the speech signal comprises:
clustering the tone characteristics of the voice segments respectively based on the clustering information to obtain clustering results of the clustering information, wherein the clustering information is used for indicating the number of classes during clustering, and the clustering results are used for indicating the inter-class distance and the intra-class distance;
determining target clustering information based on clustering results of a plurality of clustering information, wherein the target clustering information is the clustering information with the largest ratio of the average inter-class distance to the average intra-class distance;
at least one timbre category and at least one category feature of the speech signal are determined based on the clustering result of the target clustering information.
By clustering the tone characteristics of the singing objects based on the plurality of clustering results, various different clustering results can be obtained, so that the target clustering information determined based on the different clustering results can accurately reflect the tone of the singing objects.
In some embodiments, the method further comprises:
acquiring a voice signal input by the singing object in a historical time period; or,
returning prompt information to the singing object to acquire a voice signal input by the singing object based on the prompt information.
By collecting the voice signal recorded by the singing object in real time, the tone color information determined based on the voice signal can reflect the current tone color of the singing object.
In some embodiments, determining at least one target song from the song library based on the first timbre information and the plurality of second timbre information comprises:
acquiring at least one piece of third tone information from the plurality of pieces of second tone information, wherein the tone category number of the at least one piece of third tone information is not more than the tone category number indicated by the first tone information;
acquiring at least one tone information with the similarity greater than the similarity threshold value with the first tone information from at least one third tone information;
and determining at least one target song corresponding to the at least one tone color information from the song library.
The third tone information is obtained based on the category number, so that the songs in the song library are screened, and the target song meeting the similarity condition is determined from the screened songs based on the similarity, so that the determined tone of the target song has higher similarity with the tone of the singing object, and the accuracy of matching the songs for the singing object is improved.
In some embodiments, the method further comprises:
for any third tone information, determining a first similarity between at least one first tone category in the first tone information and at least one second tone category in the third tone information;
for any first tone category in the first tone color information, determining a second similarity of the first tone color category, wherein the second similarity is the minimum value of first similarities between the first tone color category and at least one second tone color category;
the sum of at least one second similarity of at least one first tonal category is determined as a similarity between the first and third tonal information.
The target song is determined based on the similarity between the first tone information and the third tone information, so that the target song and the singing object have higher similarity, and the accuracy of song matching is improved.
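By way of illustration, the following Python sketch (not part of the disclosure) shows one way the similarity described above could be computed, assuming each timbre category is represented by its cluster-center (category feature) vector and that the first similarity is cosine similarity; the function names and the choice of cosine similarity are assumptions.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # First similarity between one first-timbre category and one second-timbre category.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def timbre_info_similarity(first_info: list[np.ndarray], third_info: list[np.ndarray]) -> float:
    # Similarity between the singing object's timbre information and one candidate song's timbre information.
    total = 0.0
    for first_category in first_info:
        # Second similarity of this first-timbre category: the minimum of its
        # first similarities to every second-timbre category of the candidate song.
        total += min(cosine_similarity(first_category, second_category)
                     for second_category in third_info)
    return total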
In some embodiments, the method further comprises:
for any song in the song library, dividing the song into a plurality of song segments;
for any song segment, carrying out feature extraction on the song segment to obtain the tone color feature of the song segment, wherein the tone color feature is used for expressing the tone color category of the song segment;
and clustering the tone color characteristics of the plurality of song segments to obtain at least one tone color category and at least one category characteristic of the songs, wherein the category characteristic is used for representing a clustering center.
Through feature extraction and clustering of the song segments, tone information of the songs in the song library can be obtained, so that the songs with tone similar to the tone of the singing object can be selected, and the accuracy of song matching is improved.
Fig. 3 is a flow diagram illustrating another song matching method, as shown in fig. 3, performed by a server, according to an example embodiment.
In step S301, the server acquires a voice signal of a singing object in response to a song matching request of the singing object.
In the embodiment of the present disclosure, when a singing object requests a song, a song matching request is sent to the server through the terminal. In response to the song matching request, the server can acquire the voice signal of the singing object in different ways, and then determines the timbre of the singing object based on the voice signal so as to match suitable songs for the singing object.
In some embodiments, the server may obtain the voice signal of the singing object in the following two ways.
In the first mode, the server acquires a voice signal previously stored by the singing object as the voice signal of the singing object. Accordingly, the server acquires the voice signal input by the singing object in a historical time period. The historical time period may be one day, three days, or seven days, which is not limited in the embodiments of the present disclosure. It should be noted that the voice signal input by the singing object in the historical time period is a voice signal with the silent parts removed. By acquiring the voice signal input in the historical time period, the server can match songs for the singing object without requiring the singing object to repeatedly input voice signals, which improves song matching efficiency.
In the second mode, the server collects a voice signal recorded by the singing object in real time. Correspondingly, after receiving the song matching request, the server returns prompt information to the singing object and acquires the voice signal input by the singing object based on the prompt information. The prompt information may prompt the singing object to sing a high pitch, a low pitch, or a vocal style the singing object wants to use; the content of the prompt information is not limited in the embodiments of the present disclosure. By collecting the voice signal recorded by the singing object in real time, the timbre information determined based on the voice signal can reflect the current timbre of the singing object.
For example, after receiving the song matching request of the singing object, the server transmits prompt information to the terminal of the singing object. The terminal displays the prompt information on the application program interface, and the singing object records audio according to the displayed prompt information. If the prompt information prompts the singing object to sing a high pitch, the singing object can record a high-pitched audio; if the prompt information prompts the singing object to sing a low pitch, the singing object can record a low-pitched audio; if the prompt information prompts the singing object to sing in a vocal style it wants to use, the singing object can record a segment of audio in that vocal style. The server acquires all the audio recorded by the singing object according to the prompt information and, after removing the silent parts, uses it as the voice signal of the singing object for the song-requesting process. The server then matches songs for the singing object according to this voice signal.
In step S302, the server divides a voice signal of a singing object into a plurality of voice fragments.
In the embodiment of the present disclosure, the server can divide the voice signal by adopting a plurality of dividing manners. For any one of the division modes, the server can divide the voice signal of the singing object into a plurality of voice fragments.
In some embodiments, the server may divide the voice signal by duration. Correspondingly, the server equally divides the voice signal into a plurality of voice segments according to the target duration. The target time period may be 1 second, 5 seconds, 10 seconds, or the like, which is not limited by the embodiment of the present disclosure.
For example, the total duration of the voice signal of the singing object is 30 seconds, and the server divides the voice signal once every 5 seconds to obtain 6 voice fragments; alternatively, the server divides every 10 seconds, resulting in 3 speech segments.
In some embodiments, the server may divide the speech signal by statements in the speech signal. Correspondingly, the voice signal is divided into a plurality of voice segments according to a plurality of sentences included in the voice signal, and one voice segment includes one sentence.
For example, the speech signal of the singing object includes three sentences, and the server divides the speech signal into three speech segments. Wherein each speech segment comprises a sentence.
In some embodiments, the server may first divide the speech signal according to the sentences included in the speech signal, and then equally divide the speech segment corresponding to each sentence according to the duration to obtain a plurality of speech segments.
For example, if the speech signal of the singing object includes three sentences, the server first divides it into three speech segments. For a speech segment with a duration of 10 seconds, the server then divides it every 5 seconds to obtain 2 speech segments; alternatively, the server divides it every 2 seconds to obtain 5 speech segments.
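As an illustrative sketch only, the two division modes above could be implemented as follows, assuming the voice signal is a mono numpy array sampled at sr Hz and that sentence boundaries are supplied by an upstream detector; the function names and parameter choices are not specified by the disclosure.

import numpy as np

def split_by_duration(signal: np.ndarray, sr: int, target_seconds: float) -> list[np.ndarray]:
    # Equally divide the voice signal into fixed-duration segments.
    step = int(target_seconds * sr)
    return [signal[i:i + step] for i in range(0, len(signal), step)]

def split_by_sentences(signal: np.ndarray, sr: int,
                       sentence_bounds: list[tuple[float, float]]) -> list[np.ndarray]:
    # One segment per sentence; sentence_bounds holds (start_s, end_s) pairs
    # produced by an upstream voice-activity / sentence detector.
    return [signal[int(s * sr):int(e * sr)] for s, e in sentence_bounds]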
In step S303, for any voice segment, the server performs feature extraction on the voice segment to obtain a tone feature of the voice segment, where the tone feature is used to represent a tone category of the voice segment.
In this embodiment, the server may extract the spectral feature of the speech segment first, and then determine the timbre feature of the speech segment based on the extracted spectral feature. Correspondingly, the server extracts the frequency spectrum feature of the voice segment to obtain the Mel cepstrum feature of the voice segment. The server then determines a plurality of speech frame characteristics of the speech segments based on the mel cepstral features. Finally, the server determines the tone color characteristic of the voice segment based on the plurality of voice frame characteristics. Wherein, the voice frame feature represents the tone feature of each frame of the voice segment. And determining the tone color characteristics of the voice segment based on the tone color characteristics of all the voice frames of the voice segment. Through extracting the tone characteristic of the voice fragment, the tone characteristic of the singing object can be obtained, and the accuracy of song matching is improved.
In some embodiments, the server can perform feature extraction on the voice segments based on an audio feature extractor, which may be trained by the server or directly obtained by the server. Taking the example that the audio feature extractor is trained by the server, the training step of the audio feature extractor includes the following four steps.
Step one, the server extracts the spectral feature of the sample audio signal based on a spectral feature extraction layer in the audio feature extractor to obtain a sample Mel cepstrum feature of the sample audio signal.
The sample audio signals are original singing recordings of a plurality of singers or speech recordings of a plurality of objects stored in a database.
In some embodiments, the server frames the sample audio signal using a sliding window, resulting in a plurality of sample audio frames. Then, the framed sample audio signal can be transferred to a time-frequency domain by the following formula (1), and the sample audio signal in the time-frequency domain is subjected to spectrum feature extraction by the following formula (2), so as to obtain a sample mel cepstrum feature of the sample audio signal.
S(n,f)=STFT(s(t)) (1)
where S(n, f) denotes the sample audio signal in the time-frequency domain; n denotes the n-th frame of the framed sample audio signal, the total number of frames is N, and 0 < n ≤ N; f denotes the center frequency, F is the maximum center frequency, and 0 < f ≤ F; t denotes time, the total duration of the sample audio signal is T, and 0 < t ≤ T; s(t) denotes the sample audio signal in the time domain; and STFT() denotes the short-time Fourier transform function.
M(n,k)=Mel(|S(n,f)|) (2)
Wherein M (n, k) represents a sample Mel cepstrum feature of the sample audio signal; k represents the dimension number of the Mel cepstrum characteristics; | S (n, f) | represents the amplitude value of the sample audio signal in the time-frequency domain, and Mel () represents an extraction function of Mel cepstrum features.
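As a rough illustration of equations (1) and (2), the following sketch uses librosa to frame the signal, apply the short-time Fourier transform, and extract Mel-cepstral features; the window sizes and the use of librosa are assumptions, not details fixed by the disclosure.

import numpy as np
import librosa

def mel_cepstrum_features(s_t: np.ndarray, sr: int, n_mfcc: int = 20) -> np.ndarray:
    # Eq. (1): frame the signal with a sliding window and transfer it to the
    # time-frequency domain via the short-time Fourier transform.
    S_nf = librosa.stft(s_t, n_fft=1024, hop_length=256)                 # S(n, f)
    # Eq. (2): take the magnitudes |S(n, f)| and extract Mel-cepstral features M(n, k).
    mel = librosa.feature.melspectrogram(S=np.abs(S_nf) ** 2, sr=sr)
    M_nk = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=n_mfcc)
    return M_nk.T    # one k-dimensional feature vector per frame n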
And step two, the server processes the Mel cepstrum features of the samples based on a tone feature extraction layer in the audio feature extractor to obtain the tone features of the samples of the audio signals.
The server can process the sample Mel cepstrum features into sample tone features at a frame level, and then perform statistical characteristic pooling on the sample tone features at the frame level to obtain the sample tone features. The statistical characteristic pooling process is to calculate the mean and variance of all sample audio frames included in a sentence based on the sample timbre features at the frame level, and determine the mean and variance as the sample timbre features, i.e., the sample timbre features at the sentence level.
In some embodiments, the server processes the above sample mel-frequency cepstrum features by the following formula (3), and can obtain the sample tone-color features at the frame level. The sample tone features at the frame level are pooled by the following formula (4) to obtain sample tone features.
R(n,l)=g(M(n,k)) (3)
Wherein, R (n, l) represents the feature of the audio frame, n represents the nth frame, l is the feature dimension of the nth frame, and g () represents the function for processing the mel-frequency cepstrum feature of the sample by using the deep neural network.
v=statistic pooling(R(n,l)) (4)
where v is the vector representation of the sentence-level sample timbre feature, statistic pooling() denotes the statistical characteristic pooling function, and R(n, l) denotes the features of the audio frames.
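The following numpy sketch illustrates equations (3) and (4) under stated assumptions: the deep neural network g() is replaced by a fixed random projection purely so the example runs, and statistics pooling is taken to be the concatenation of the per-dimension mean and variance over frames.

import numpy as np

def statistics_pooling(frame_features: np.ndarray) -> np.ndarray:
    # frame_features: array of shape (N frames, l dims), i.e. R(n, l).
    mean = frame_features.mean(axis=0)
    var = frame_features.var(axis=0)
    return np.concatenate([mean, var])    # v: sentence-level timbre feature

# Placeholder for g(): in the disclosure g() is a deep neural network; a random
# projection stands in here only so the sketch runs end to end.
rng = np.random.default_rng(0)
M_nk = rng.standard_normal((100, 20))     # 100 frames, 20 Mel-cepstral dimensions
W = rng.standard_normal((20, 64))
g = lambda m: m @ W                       # R(n, l) with l = 64
v = statistics_pooling(g(M_nk))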
And step three, the server processes the sample tone color characteristics based on an object discriminator, a pitch discriminator and an audio type discriminator in the audio characteristic extractor to obtain object loss, pitch loss and audio type loss.
The object discriminator is used to discriminate which singer or object the sample audio signal comes from; the pitch discriminator is used to discriminate whether the sample audio signal is bass, alto, or treble; and the audio type discriminator discriminates, based on the timbre feature, whether the sample audio signal is original singing audio or speech audio.
For any sample audio signal, the server may obtain a label of the sample audio signal. The label indicates the class of the sample audio signal in three different dimensions: the first dimension is which singer or object, the second dimension is bass, alto, or treble, and the third dimension is original singing audio or speech audio.
For example, for any sample audio signal, if the sample audio signal is a song sung by singer a, the pitch belongs to the bass, and the audio type belongs to the original audio, then the labels of the sample audio signal are singer a, bass, and original audio.
In some embodiments, the server may obtain the prediction probability of the discriminator based on an object discriminator, a pitch discriminator, and an audio type discriminator in the audio feature extractor. Then, the server obtains an object loss, a pitch loss, and an audio type loss based on the label of the sample audio signal and the prediction probability of the discriminator.
For example, the object discriminator outputs the predicted probability of which singer or object the sample audio signal belongs to, the pitch discriminator outputs the predicted probability of the sample audio signal being bass, alto, or treble, and the audio type discriminator outputs the predicted probability of the sample audio signal being original singing audio or speech audio. Taking the object discriminator as an example, for any sample audio signal, the object discriminator discriminates which singer or object the sample audio signal belongs to and obtains a prediction result. The prediction probability of the object discriminator is then obtained based on the prediction result and the label of the sample audio signal.
In some embodiments, the object loss of the object discriminator is calculated by the following formula (5). Similarly, the server can obtain the pitch loss of the pitch discriminator and the audio type loss of the audio type discriminator.
J₁ = -∑(c=1..C) P_c · log P(c|s) (5)
where J₁ denotes the cross-entropy loss of the object discriminator; C denotes the total number of singers and objects in the sample audio signals; s denotes the sample audio signal; P_c is 1 if the sample audio signal belongs to the singer or object c, and 0 otherwise; and P(c|s) denotes the probability that the object discriminator predicts the sample audio signal to be the singer or object c.
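As an illustrative reading of equation (5), the sketch below computes the object discriminator's cross-entropy loss for one sample, treating P_c as a one-hot ground-truth indicator over the C singers/objects; this interpretation and the numerical epsilon are assumptions.

import numpy as np

def object_cross_entropy(p_pred: np.ndarray, label_index: int) -> float:
    # p_pred: predicted distribution P(c|s) over the C classes, summing to 1.
    P_c = np.zeros_like(p_pred)
    P_c[label_index] = 1.0                                   # one-hot ground truth
    return float(-np.sum(P_c * np.log(p_pred + 1e-12)))      # J1 = -sum_c P_c log P(c|s)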
And step four, the server trains the audio feature extractor based on the object loss, the pitch loss and the audio type loss.
The training loss of the discriminator is calculated by the following formula (6), and the audio feature extractor is trained based on the training loss.
J = (1 - α - β) * J₁ + α * J₂ + β * J₃ (6)
where J denotes the total discriminator loss; α and β denote weights, which are not limited by the present disclosure; J₁ denotes the cross-entropy loss of the object discriminator; J₂ denotes the cross-entropy loss of the pitch discriminator; and J₃ denotes the cross-entropy loss of the audio type discriminator. The discriminator loss represents the error between the prediction probability and the label; the smaller the loss, the closer the prediction probability is to the label, i.e., the closer the prediction is to the true value.
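A minimal sketch of equation (6), with example weight values only (the disclosure does not fix α and β):

def total_loss(j1: float, j2: float, j3: float, alpha: float = 0.3, beta: float = 0.3) -> float:
    # J = (1 - alpha - beta) * J1 + alpha * J2 + beta * J3
    return (1.0 - alpha - beta) * j1 + alpha * j2 + beta * j3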
For example, FIG. 4 is a flow chart illustrating training of an audio feature extractor according to an example embodiment. Referring to fig. 4, the audio feature extractor includes a spectral feature extraction layer, a tone feature extraction layer, an object discriminator, a pitch discriminator, and an audio type discriminator. The server inputs the sample audio signals in the database into the audio feature extractor, and performs spectral feature extraction on the sample audio signals based on a spectral feature extraction layer of the audio feature extractor to obtain spectral features of the sample audio signals. Then, the server performs tone feature extraction on the spectral feature of the sample audio signal based on a tone feature extraction layer of the audio feature extractor to obtain a tone feature of the sample audio signal. Then, the server discriminates the timbre features of the sample audio signal based on the object discriminator, the pitch discriminator and the audio type discriminator to obtain discrimination results of the three discriminators. And finally, the server obtains object loss, pitch loss and audio type loss based on the discrimination result and the label of the sample audio signal, and adjusts the parameters of the audio feature extractor based on the loss.
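For illustration, a schematic PyTorch sketch of such an extractor is given below; the layer sizes, the simple feed-forward timbre layer, the linear discriminator heads, and the class counts are all assumptions rather than details specified by the disclosure, and the spectral feature extraction layer is assumed to be computed outside the network (for example, as in the earlier Mel-cepstrum sketch).

import torch
import torch.nn as nn

class AudioFeatureExtractor(nn.Module):
    def __init__(self, n_mfcc=20, timbre_dim=64, n_objects=1000, n_pitches=3, n_audio_types=2):
        super().__init__()
        # Timbre feature extraction layer g(): frame-level Mel-cepstra -> frame features R(n, l).
        self.timbre_layer = nn.Sequential(nn.Linear(n_mfcc, 128), nn.ReLU(),
                                          nn.Linear(128, timbre_dim))
        # The three discriminators operate on the pooled (sentence-level) timbre feature v.
        self.object_head = nn.Linear(2 * timbre_dim, n_objects)          # which singer / object
        self.pitch_head = nn.Linear(2 * timbre_dim, n_pitches)           # bass / alto / treble
        self.audio_type_head = nn.Linear(2 * timbre_dim, n_audio_types)  # original singing / speech

    def forward(self, mel_frames):                   # mel_frames: (batch, frames, n_mfcc)
        R = self.timbre_layer(mel_frames)            # (batch, frames, timbre_dim)
        v = torch.cat([R.mean(dim=1), R.var(dim=1)], dim=-1)   # statistics pooling
        return self.object_head(v), self.pitch_head(v), self.audio_type_head(v)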
In step S304, the server clusters the tone color features of the multiple voice segments to obtain first tone color information of the voice signal, where the first tone color information includes at least one tone color category and at least one category feature of the voice signal, and the category feature is used to represent a clustering center.
In the embodiment of the present disclosure, the server clusters the tone features of the plurality of voice segments based on the distance between the tone features of the voice segments. The distance may be an euclidean distance, a mahalanobis distance, or a cosine distance, which is not limited in this disclosure. The clustering process refers to a process of classifying the tone features of similar voice segments into one class, and classifying the tone features of dissimilar voice segments into different classes.
Any timbre feature can be represented by an array with dimension M × 1, i.e., an array of M rows and 1 column, where M has the same meaning as the feature dimension l. If the singing object has R timbre features, the timbre features of the voice segments can be represented by R arrays of dimension M × 1, where R is a positive integer.
In some embodiments, the server can perform clustering based on a plurality of clustering information, resulting in a plurality of different clustering results. The clustering information is used to indicate the number of categories when clustering is performed, i.e. the number of categories into which the tone features of the speech segments are classified. Correspondingly, the server clusters the tone features of the voice segments respectively based on the clustering information to obtain clustering results of the clustering information. Then, the server determines target clustering information based on a clustering result of the plurality of clustering information. Finally, the server determines at least one tone category and at least one category feature of the voice signal based on the clustering result of the target clustering information. The class features represent the average value of all the tone features in the class corresponding to the tone class, and one tone class corresponds to one class feature. The clustering result is used to represent inter-class distances and intra-class distances. The inter-class distance represents the distance between class features of two different classes. The intra-class distance represents the distance between the timbre features within each class and the class features within that class. The target clustering information is the clustering information with the largest ratio of the average inter-class distance to the average intra-class distance. By clustering the tone characteristics of the singing objects based on the plurality of clustering results, various different clustering results can be obtained, so that the target clustering information determined based on the different clustering results can accurately reflect the tone of the singing objects.
For example, the plurality of clustering information indicates that the number of tone color categories in the clustering process is increased from 2, and the maximum value of the tone color categories is the number of tone color features of the voice segments. The tone color class is represented by K _ user and the class features are represented by an array with dimension M x 1. Assume that the timbre features of a speech segment are 4 in total, represented by an array of 4 dimensions M x 1. The server may cluster the tone features based on the clustering information of tone category 2, the clustering information of tone category 3, and the clustering information of tone category 4. If the K _ user is 2, the server clusters the 4 tone features to obtain 2 tone categories and 2 category features, namely 2 arrays with the dimensionality of M x 1; if the K _ user is 3, the server clusters the 4 tone features to obtain 3 tone categories and 3 category features, namely 3 arrays with the dimensionality of M x 1; if K _ user is 4, the server clusters the 4 tone features to obtain 4 tone categories and 4 category features, that is, 4 arrays with dimensions M × 1. Taking the tone category as 3 as an example, the category feature represents an average value of the tone features included in each category, and is represented by an array of 3 dimensions M × 1. The inter-class distance is expressed as the distance between the class features corresponding to the 3 timbre classes. The intra-class distance represents the variance of the distance between the timbre features within each class and the class features of the class. The average inter-class distance represents the average of the inter-class distances between the 3 tone categories. The average intra-class distance represents an average value of intra-class distances corresponding to the 3 tone color classes. The server can obtain the ratio of the average inter-class distance to the average intra-class distance when the tone class is 3. Similarly, the server can obtain a clustering result with a tone category of 2 and a clustering result with a tone category of 4, and a ratio of the average inter-class distance to the average intra-class distance when the tone category is 2 and a ratio of the average inter-class distance to the average intra-class distance when the tone category is 4. And the server determines the maximum ratio from the 3 ratios and determines the clustering information corresponding to the maximum ratio as target clustering information.
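A hedged scikit-learn sketch of this selection procedure follows; Euclidean distance, the candidate range of category counts, and the use of KMeans are illustrative assumptions (the disclosure also allows Mahalanobis or cosine distance), and the average intra-class distance is simplified to a mean rather than a variance.

import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def select_timbre_clusters(features: np.ndarray, k_candidates=range(2, 5)):
    # features: array of shape (R, M), one row per timbre feature of a voice segment.
    best = None
    for k in k_candidates:
        if k > len(features):
            break
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
        centers, labels = km.cluster_centers_, km.labels_
        # Average inter-class distance: mean pairwise distance between cluster centers.
        inter = np.mean([np.linalg.norm(centers[i] - centers[j])
                         for i, j in combinations(range(k), 2)])
        # Average intra-class distance: mean distance from each feature to its own center.
        intra = np.mean([np.linalg.norm(f - centers[c]) for f, c in zip(features, labels)])
        ratio = inter / (intra + 1e-12)
        if best is None or ratio > best[0]:
            best = (ratio, centers)          # category features = cluster centers
    return best[1]                           # one category feature per timbre category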
In order to make the flow described in the above steps S301 to S304 easier to understand, fig. 5 is a flowchart illustrating a method of extracting the timbre features of a singing object according to an exemplary embodiment. Referring to fig. 5, the method comprises the following steps:
501. The singing object sends a song matching request to the server through the terminal.
502. The server determines whether an audio signal of the singing object is available; if so, step 503 is executed, and if not, step 504 is executed.
503. The server acquires the audio signal of the singing object stored in a historical time period.
504. The server sends prompt information to the terminal of the singing object to prompt the singing object to record.
505. If the singing object still needs to record further timbres, step 504 is executed again; otherwise, step 506 is executed.
506. The server removes the silent portion to obtain the voice signal of the singing object.
507. The server divides the voice signal of the singing object into a plurality of voice segments.
508. The server extracts the timbre features of the voice segments to obtain a plurality of timbre features.
509. The server clusters the plurality of timbre features to obtain at least one timbre category.
510. The server obtains the category feature of each timbre category.
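Read as one extraction pipeline, the steps of fig. 5 could be strung together as in the sketch below. The server accessors and the remove_silence helper are hypothetical stand-ins; split_segments and extract_timbre_feature are sketched later in the apparatus description, and select_timbre_clusters is the sketch given above.

```python
def build_singer_timbre_info(server, singer_id):
    """Illustrative pipeline for steps 501-510; the server accessors and the
    remove_silence / split_segments / extract_timbre_feature helpers are
    hypothetical stand-ins, not APIs defined by the disclosure."""
    audio = server.load_history_audio(singer_id)       # steps 502-503: use stored audio if it exists
    if audio is None:
        audio = server.prompt_and_record(singer_id)    # steps 504-505: prompt the singing object to record
    voice = remove_silence(audio)                       # step 506: keep only the voiced portion
    segments = split_segments(voice)                    # step 507: equal duration or one sentence per segment
    feats = [extract_timbre_feature(seg) for seg in segments]   # step 508
    k_user, class_features = select_timbre_clusters(feats)      # steps 509-510
    return k_user, class_features                       # first timbre information of the singing object
```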
In step S305, the server determines at least one target song from the song library based on the first timbre information and a plurality of pieces of second timbre information. The second timbre information is used to represent at least one timbre category of a song in the song library, and the similarity between the timbre category of the target song and the timbre category of the singing object is greater than a similarity threshold.
In the embodiment of the disclosure, after the server responds to the song matching request of the singing object and acquires the first timbre information of the singing object, it determines at least one target song based on the similarity between the plurality of pieces of second timbre information and the first timbre information. Because the similarity between the timbre of the target song and the timbre of the singing object's voice signal is greater than the similarity threshold, the two timbres are similar, and the target song is therefore suitable for the singing object to sing.
In some embodiments, the server can first screen at least one piece of third timbre information from the plurality of pieces of second timbre information based on the first timbre information, and then determine at least one target song from the at least one piece of third timbre information. Correspondingly, the server acquires at least one piece of third timbre information from the plurality of pieces of second timbre information. Then, the server acquires, from the at least one piece of third timbre information, at least one piece of timbre information whose similarity with the first timbre information is greater than the similarity threshold. Finally, the server determines, from the song library, at least one target song corresponding to the at least one piece of timbre information. Here, the number of timbre categories of the at least one piece of third timbre information is not greater than the number of timbre categories indicated by the first timbre information. It should be noted that the number of target songs may be the number requested by the singing object in the matching request, or a number randomly determined by the server for the singing object, which is not limited in the embodiment of the present disclosure. Screening the songs in the song library by the number of timbre categories yields the third timbre information, and determining, based on the similarity, the target songs that satisfy the similarity condition from the screened songs ensures that the timbre of the determined target song is highly similar to the timbre of the singing object, which improves the accuracy of matching songs for the singing object.
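The two-stage screening (category-count filter, then similarity condition) could look as follows. The sketch assumes each library entry stores its number of timbre categories together with its category features; timbre_similarity is sketched after the next paragraph, and all names are placeholders rather than the disclosure's own interfaces.

```python
def match_songs(first_info, library, threshold, top_n=1):
    """Two-stage screening sketch: keep songs whose number of timbre categories
    does not exceed the singer's, then keep and rank those that satisfy the
    similarity condition."""
    k_user, singer_feats = first_info
    # third timbre information: songs with no more timbre categories than the singer
    candidates = [(song_id, feats) for song_id, (k_song, feats) in library.items()
                  if k_song <= k_user]
    scored = [(timbre_similarity(singer_feats, song_feats), song_id)
              for song_id, song_feats in candidates]
    # In the worked example below the similarity is realized as a sum of distances,
    # so a smaller value means a closer timbre; the threshold comparison and the
    # ranking are therefore applied in the ascending sense here.
    scored = [(score, song_id) for score, song_id in scored if score <= threshold]
    scored.sort(key=lambda pair: pair[0])
    return [song_id for _, song_id in scored[:top_n]]
```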
In some embodiments, the server determines the similarity between the first timbre information and a piece of third timbre information based on the similarities between their respective timbre categories. Correspondingly, for any piece of third timbre information, the server determines the first similarities between at least one first timbre category in the first timbre information and at least one second timbre category in the third timbre information. Then, for any first timbre category in the first timbre information, the server determines a second similarity of that first timbre category, the second similarity being the minimum value of the first similarities between the first timbre category and the at least one second timbre category. Finally, the server determines the sum of the at least one second similarity of the at least one first timbre category as the similarity between the first timbre information and the third timbre information. Determining the target song based on the similarity between the first timbre information and the third timbre information ensures that the target song has a high timbre similarity with the singing object, which improves the accuracy of song matching.
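A minimal realization of this set-to-set comparison, assuming the category features are M-dimensional NumPy arrays and that, as in the worked example below, the first similarity is the Euclidean distance between category features:

```python
import numpy as np

def timbre_similarity(singer_feats, song_feats):
    """Sum, over the singer's timbre categories, of the minimum distance to any
    timbre category of the song (sketch of the comparison described above)."""
    total = 0.0
    for f_user in singer_feats:                      # each first timbre category
        dists = [float(np.linalg.norm(np.asarray(f_user) - np.asarray(f_song)))
                 for f_song in song_feats]           # first similarities
        total += min(dists)                          # second similarity of this first timbre category
    return total                                     # similarity between the first and third timbre information
```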
In some embodiments, for any song in the song library, the server obtains an audio signal of the song. If the server stores accompaniment-free audio of the song, the server acquires the accompaniment-free audio and removes the silent portion to obtain the audio signal of the song. If the server does not store accompaniment-free audio of the song, the server extracts the song's audio from the audio containing the accompaniment and removes the silent portion to obtain the audio signal of the song. Then, the server performs feature extraction on the audio signals of all songs in the song library based on the audio feature extractor obtained by training in step S303 or obtained directly, obtains a plurality of timbre features for the audio signal of each song, performs step S304 on these timbre features to obtain a plurality of pieces of second timbre information, and stores them. If the song library is updated, the server acquires and stores the second timbre information of the newly added songs. One song corresponds to one piece of second timbre information. For the audio signal of any song, the server divides the audio signal into a plurality of song segments and obtains the timbre features of the song segments. Accordingly, the server divides the audio signal of any song in the song library into a plurality of song segments. Then, for any song segment, the server performs feature extraction on the song segment to obtain the timbre feature of the song segment. Finally, the server clusters the timbre features of the plurality of song segments to obtain at least one timbre category and at least one category feature of the song. The timbre feature is used to represent the timbre category of a song segment, and the category feature is used to represent a clustering center. Through feature extraction and clustering of the song segments, the timbre information of the songs in the song library can be obtained, so that songs whose timbre is similar to that of the singing object can be selected, which improves the accuracy of song matching.
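A sketch of how the library-side second timbre information could be pre-computed and stored, reusing the earlier sketches; load_vocal_audio, remove_silence, split_segments and extract_timbre_feature are hypothetical stand-ins for the acquisition and extraction steps described above.

```python
def build_library_index(song_ids):
    """Pre-compute one piece of second timbre information per song (illustrative)."""
    library = {}
    for song_id in song_ids:
        audio = load_vocal_audio(song_id)            # accompaniment-free audio, or vocals taken from the mix
        signal = remove_silence(audio)               # audio signal of the song
        segments = split_segments(signal)            # song segments
        feats = [extract_timbre_feature(seg) for seg in segments]
        k_song, class_features = select_timbre_clusters(feats)
        library[song_id] = (k_song, class_features)  # number of timbre categories + category features
    return library
```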
For example, assume the first timbre information of the singing object consists of 4 arrays of dimension M × 1, i.e. there are 4 first timbre categories. The server determines the songs whose number of timbre categories is not greater than 4 from the song library as candidate songs. For the audio signal of any candidate song, assume that the number of timbre categories of its audio signal is 3, i.e. there are 3 second timbre categories, so the server obtains the third timbre information of the candidate song, namely 3 arrays of dimension M × 1. For the array corresponding to any first timbre category, the server calculates the distances between this array and the 3 arrays of the candidate song's audio signal, obtaining 3 distances, which are determined as the first similarities. The server takes the minimum of these 3 distances as the second similarity. Similarly, the server obtains the second similarities between the remaining 3 arrays in the first timbre information of the singing object and the 3 arrays of the candidate song's audio signal, yielding 3 further second similarities. The sum of the 4 second similarities is determined as the similarity between the timbre of the singing object and the timbre of the candidate song. Similarly, the server can obtain the similarity between the timbre of every song in the song library and the timbre of the singing object. The server sorts these similarity values from small to large and determines at least one target song accordingly; since each value is a sum of distances, a smaller value indicates a closer timbre match.
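Continuing this example (4 singer categories against a 3-category candidate song), a quick check with the sketch above might look like this; the vectors are random placeholders standing in for real category features, and the dimension is assumed.

```python
M = 128                                              # assumed feature dimension
rng = np.random.default_rng(0)
singer_feats = rng.normal(size=(4, M))               # first timbre information: 4 category features
song_feats = rng.normal(size=(3, M))                 # candidate song: 3 category features
print(timbre_similarity(singer_feats, song_feats))   # sum of the 4 minimum distances
```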
In order to make the process by which the server obtains the second timbre information of each song in the song library easier to understand, fig. 6 is a flowchart illustrating extracting the timbre features of songs in the song library according to an exemplary embodiment. Referring to fig. 6, the method comprises the following steps:
601. The server selects a song from the song library.
602. The server determines whether accompaniment-free audio of the song exists; if so, step 603 is executed, and if not, step 604 is executed.
603. The server acquires the accompaniment-free audio.
604. The server extracts the audio of the song from the audio with the accompaniment.
605. The server removes the silent portion to obtain the audio signal of the song.
606. The server divides the audio signal of the song into a plurality of song segments.
607. The server extracts the timbre features of the song segments to obtain a plurality of timbre features.
608. The server clusters the plurality of timbre features to obtain at least one timbre category.
609. The server obtains the category feature of each timbre category.
In step S306, the server returns at least one target song to the singing object so that the singing object matches the at least one target song.
In the embodiment of the disclosure, the server determines, from the plurality of songs in the song library, at least one target song whose timbre is similar to that of the singing object and which is therefore suitable for the singing object to sing. The target song is displayed in the interface of the application used by the singing object, so that the singing object can select a song to sing from the target songs returned by the server.
The foregoing steps S301 to S306 exemplarily show an implementation manner of the song matching method provided in the present application, and in order to make the song matching method easier to understand, fig. 7 is a block diagram illustrating a song matching method according to an exemplary embodiment. Referring to fig. 7, the server performs feature extraction on the audio signals of all songs in the song library to obtain the tone color features of all songs in the song library, and stores the tone color features in the server. The server responds to the song matching request of the singing object, and performs feature extraction on the voice signal of the singing object to obtain the tone color feature of the singing object. Based on the tone color characteristics of the singing object and the tone color characteristics of the songs in the song library, the server obtains the similarity between the singing object and the songs in the song library. Based on the similarity, the server matches songs suitable for singing for the singing object.
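Putting the pieces together, the overall flow of fig. 7 could be driven as follows, reusing the earlier sketches; the server accessors, the threshold value and the number of returned songs are placeholders, not values specified by the disclosure.

```python
library = build_library_index(server.all_song_ids())                  # offline: second timbre information per song
first_info = build_singer_timbre_info(server, singer_id)              # on request: first timbre information
targets = match_songs(first_info, library, threshold=10.0, top_n=5)   # placeholder threshold and count
server.return_songs(singer_id, targets)                               # step S306: return the target songs
```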
The embodiment of the disclosure provides a song matching method. At least one timbre category of a voice signal is determined based on the voice signal of a singing object, so that at least one target song similar to the timbre categories of the voice signal can be determined from the song library based on the at least one timbre category, and the at least one target song is then returned to the singing object so as to match the singing object with the at least one target song. The similarity between the timbre categories of the matched target song and the timbre categories of the singing object is therefore high, the target song suits the singing object's voice, and the accuracy of song matching is improved.
Fig. 8 is a block diagram illustrating a song matching apparatus according to an exemplary embodiment. Referring to fig. 8, the apparatus includes a first determination unit 801, a second determination unit 802, and a matching unit 803.
A first determination unit 801 configured to determine first timbre information based on a voice signal of a singing object in response to a song matching request of the singing object, the first timbre information being for representing at least one timbre category of the voice signal;
a second determining unit 802 configured to determine at least one target song from the song library based on the first timbre information and a plurality of second timbre information, the second timbre information being used for representing at least one timbre category of the song in the song library, a similarity between the timbre category of the target song and the timbre category of the singing object being greater than a similarity threshold;
a matching unit 803 configured to return the at least one target song to the singing object so as to match the singing object with the at least one target song.
In some embodiments, FIG. 9 is a block diagram illustrating another song matching apparatus according to an example embodiment. Referring to fig. 9, the first determination unit 801 includes:
a dividing subunit 8011 configured to divide the voice signal of the singing object into a plurality of voice segments;
an extracting sub-unit 8012 configured to, for any of the voice segments, perform feature extraction on the voice segment to obtain a tone color feature of the voice segment, where the tone color feature is used to represent a tone color category of the voice segment;
a clustering subunit 8013 configured to cluster the tone color features of the plurality of speech segments, resulting in at least one tone color class and at least one class feature of the speech signal, where the class feature is used to represent a clustering center.
In some embodiments, the dividing subunit 8011 is configured to divide the voice signal equally into a plurality of voice segments according to a target duration; or to divide the voice signal into a plurality of voice segments according to a plurality of sentences included in the voice signal, where one voice segment includes one sentence.
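For the equal-duration option, a minimal sketch over a 1-D waveform; the sample rate and target duration are placeholder values, and sentence-based splitting would instead rely on a sentence-boundary or voice-activity detector.

```python
def split_segments(signal, sr=16000, target_seconds=3.0):
    """Equal-duration splitting of a 1-D waveform (first dividing option);
    the final segment may be shorter than the target duration."""
    hop = int(sr * target_seconds)                   # samples per segment
    return [signal[i:i + hop] for i in range(0, len(signal), hop)]
```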
In some embodiments, the extracting sub-unit 8012 is configured to perform spectrum feature extraction on the speech segment to obtain a mel cepstrum feature of the speech segment; determining a plurality of speech frame characteristics of the speech segments based on the Mel cepstrum characteristics; based on the plurality of speech frame characteristics, a timbre characteristic of the speech segment is determined.
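As an illustration of this extraction chain (Mel cepstrum, then frame features, then a segment-level timbre feature), the following sketch uses librosa's MFCC implementation and mean-pools the frame features into one vector per segment; the library choice and the pooling strategy are assumptions, not the extractor specified by the disclosure.

```python
import numpy as np
import librosa

def extract_timbre_feature(segment, sr=16000, n_mfcc=20):
    """Mel-cepstrum-based timbre feature of one voice segment (illustrative):
    frame-level MFCCs are mean-pooled into a single M-dimensional vector."""
    mfcc = librosa.feature.mfcc(y=np.asarray(segment, dtype=float), sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    frame_feats = mfcc.T                             # one feature vector per speech frame
    return frame_feats.mean(axis=0)                  # timbre feature of the segment
```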
In some embodiments, the clustering subunit 8013 is configured to cluster the tone features of the multiple speech segments respectively based on multiple pieces of clustering information, to obtain a clustering result of the multiple pieces of clustering information, where the clustering information is used to indicate a category number during clustering, and the clustering result is used to indicate an inter-class distance and an intra-class distance; determining target clustering information based on clustering results of a plurality of clustering information, wherein the target clustering information is the clustering information with the largest ratio of the average inter-class distance to the average intra-class distance; at least one timbre category and at least one category feature of the speech signal are determined based on the clustering result of the target clustering information.
In some embodiments, referring to fig. 9, the apparatus further comprises:
a signal acquisition unit 804 configured to acquire a voice signal input by the singing object in a historical time period; or to return prompt information to the singing object and acquire the voice signal input by the singing object based on the prompt information.
In some embodiments, the second determining unit 802 is configured to obtain at least one third tone information from the plurality of second tone information, where the number of tone categories of the at least one third tone information is not greater than the number of tone categories indicated by the first tone information; acquiring at least one tone information with the similarity greater than the similarity threshold value with the first tone information from at least one third tone information; and determining at least one target song corresponding to the at least one tone color information from the song library.
In some embodiments, the second determining unit 802 is further configured to determine, for any of the third tone information, a first similarity between at least one first tone category in the first tone information and at least one second tone category in the third tone information; for any first tone color category in the first tone color information, determining a second similarity of the first tone color category, wherein the second similarity is the minimum value of first similarities between the first tone color category and at least one second tone color category; the sum of at least one second similarity of at least one first tonal category is determined as a similarity between the first and third tonal information.
In some embodiments, referring to fig. 9, the apparatus further comprises:
a song dividing unit 805 configured to divide a song into a plurality of song segments for any one song in the song library;
a feature extraction unit 806, configured to perform feature extraction on any song segment to obtain a tone feature of the song segment, where the tone feature is used to represent a tone category of the song segment;
a clustering unit 807 configured to cluster the tone-color characteristics of the plurality of song segments to obtain at least one tone-color category and at least one category characteristic of the songs, wherein the category characteristic is used for representing a clustering center.
The embodiment of the disclosure provides a song matching apparatus. At least one timbre category of a voice signal is determined based on the voice signal of a singing object, so that at least one target song similar to the timbre categories of the voice signal can be determined from the song library based on the at least one timbre category, and the at least one target song is then returned to the singing object so as to match the singing object with the at least one target song. The similarity between the timbre categories of the matched target song and the timbre categories of the singing object is therefore high, the target song suits the singing object's voice, and the accuracy of song matching is improved.
Fig. 10 is a block diagram illustrating an electronic device 1000 in accordance with an example embodiment. In general, the electronic device 1000 includes: a processor 1001 and a memory 1002.
The processor 1001 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 1001 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1001 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1001 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that needs to be displayed on the display screen. In some embodiments, the processor 1001 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1002 may include one or more computer-readable storage media, which may be non-transitory. The memory 1002 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1002 is used to store at least one program code for execution by the processor 1001 to implement the song matching method provided by the method embodiments of the present disclosure.
In some embodiments, the electronic device 1000 may further include: a peripheral interface 1003 and at least one peripheral. The processor 1001, memory 1002 and peripheral interface 1003 may be connected by a bus or signal line. Various peripheral devices may be connected to peripheral interface 1003 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 1004, a display screen 1005, a camera assembly 1006, an audio circuit 1007, and a power supply 1008.
The peripheral interface 1003 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 1001 and the memory 1002. In some embodiments, processor 1001, memory 1002, and peripheral interface 1003 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1001, the memory 1002, and the peripheral interface 1003 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The Radio Frequency circuit 1004 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 1004 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1004 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1004 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 1004 may communicate with other electronic devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1004 may also include NFC (Near Field Communication) related circuits, which are not limited by this disclosure.
A display screen 1005 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1005 is a touch display screen, the display screen 1005 also has the ability to capture touch signals on or over the surface of the display screen 1005. The touch signal may be input to the processor 1001 as a control signal for processing. At this point, the display screen 1005 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display screen 1005 may be one, providing a front panel of the electronic device 1000; in other embodiments, the display screens 1005 may be at least two, respectively disposed on different surfaces of the electronic device 1000 or in a folded design; in some embodiments, the display screen 1005 may be a flexible display screen, disposed on a curved surface or on a folded surface of the electronic device 1000. Even more, the display screen 1005 may be arranged in a non-rectangular irregular figure, i.e., a shaped screen. The Display screen 1005 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 1006 is used to capture images or video. Optionally, the camera assembly 1006 includes a front camera and a rear camera. Generally, a front camera is disposed on a front panel of an electronic apparatus, and a rear camera is disposed on a rear surface of the electronic apparatus. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, the main camera and the wide-angle camera are fused to realize panoramic shooting and a VR (Virtual Reality) shooting function or other fusion shooting functions. In some embodiments, camera assembly 1006 may also include a flash. The flash lamp can be a single-color temperature flash lamp or a double-color temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 1007 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals into the processor 1001 for processing or inputting the electric signals into the radio frequency circuit 1004 for realizing voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the electronic device 1000. The microphone may also be an array microphone or an omni-directional acquisition microphone. The speaker is used to convert electrical signals from the processor 1001 or the radio frequency circuit 1004 into sound waves. The loudspeaker can be a traditional film loudspeaker and can also be a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuit 1007 may also include a headphone jack.
The power supply 1008 is used to power the various components in the electronic device 1000. The power source 1008 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When the power source 1008 includes a rechargeable battery, the rechargeable battery may support wired charging or wireless charging. The rechargeable battery can also be used to support fast charge technology.
Those skilled in the art will appreciate that the configuration shown in fig. 10 is not limiting of the electronic device 1000 and may include more or fewer components than shown, or combine certain components, or employ a different arrangement of components.
When the computer device is configured as a server, fig. 11 is a schematic structural diagram of a server according to an exemplary embodiment. The server 1100 may vary greatly in configuration or performance and may include one or more processors (CPUs) 1101 and one or more memories 1102, where the memories 1102 store at least one computer program that is loaded and executed by the processors 1101 to implement the song matching method provided by the above-described method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server may also include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 1002 comprising instructions, executable by the processor 1001 of the electronic device 1000 to perform the song matching method described above is also provided. Alternatively, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program which, when executed by a processor, implements the above-described song matching method.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. A song matching method, comprising:
in response to a song matching request of a singing object, determining first tone color information based on a voice signal of the singing object, wherein the first tone color information is used for representing at least one tone color category of the voice signal;
determining at least one target song from a song library based on the first tone color information and a plurality of second tone color information, wherein the second tone color information is used for representing at least one tone color category of songs in the song library, and the similarity between the tone color category of the target song and the tone color category of the singing object is larger than a similarity threshold value;
and returning the at least one target song to the singing object so as to match the singing object with the at least one target song.
2. The song matching method of claim 1, wherein the determining first timbre information based on the speech signal of the singing object comprises:
dividing the voice signal of the singing object into a plurality of voice fragments;
for any voice segment, carrying out feature extraction on the voice segment to obtain a tone feature of the voice segment, wherein the tone feature is used for representing the tone category of the voice segment;
and clustering the tone features of the plurality of voice segments to obtain at least one tone category and at least one category feature of the voice signals, wherein the category feature is used for representing a clustering center.
3. The song matching method according to claim 2, wherein the dividing of the speech signal of the singing object into a plurality of speech segments comprises any one of:
equally dividing the voice signal into a plurality of voice segments according to a target time length;
dividing the voice signal into a plurality of voice segments according to a plurality of sentences included in the voice signal, wherein one voice segment includes one sentence.
4. The song matching method according to claim 2, wherein the extracting the features of the voice segment to obtain the tone features of the voice segment comprises:
extracting the frequency spectrum feature of the voice segment to obtain the Mel cepstrum feature of the voice segment;
determining a plurality of speech frame characteristics of the speech segment based on the mel cepstrum characteristics;
and determining the tone color characteristics of the voice segments based on the plurality of voice frame characteristics.
5. The song matching method according to claim 2, wherein the feature extraction performed on the voice segment to obtain the tone features of the voice segment is implemented based on an audio feature extractor;
the training step of the audio feature extractor comprises:
performing spectrum feature extraction on a sample audio signal based on a spectrum feature extraction layer in the audio feature extractor to obtain a sample Mel cepstrum feature of the sample audio signal;
processing the sample Mel cepstrum features based on a tone feature extraction layer in the audio feature extractor to obtain sample tone features of the sample audio signals;
processing the sample tone color characteristics based on an object discriminator, a pitch discriminator and an audio type discriminator in the audio characteristic extractor to obtain object loss, pitch loss and audio type loss;
training the audio feature extractor based on the object loss, the pitch loss, and the audio type loss.
6. The song matching method of claim 2, wherein the clustering the timbre features of the plurality of speech segments to obtain at least one timbre category and at least one category feature of the speech signal comprises:
clustering the tone features of the voice fragments respectively based on a plurality of clustering information to obtain clustering results of the clustering information, wherein the clustering information is used for indicating the category number during clustering, and the clustering results are used for representing the inter-class distance and the intra-class distance;
determining target clustering information based on clustering results of the plurality of clustering information, wherein the target clustering information is the clustering information with the largest ratio of the average inter-class distance to the average intra-class distance;
determining the at least one timbre category and the at least one category feature of the speech signal based on the clustering result of the target clustering information.
7. The song matching method of claim 1, further comprising:
acquiring the voice signal input by the singing object in a historical time period; or,
returning prompt information to the singing object, and acquiring the voice signal input by the singing object based on the prompt information.
8. The song matching method of claim 1, wherein determining at least one target song from a song library based on the first timbre information and a plurality of second timbre information comprises:
acquiring at least one piece of third tone information from the plurality of pieces of second tone information, wherein the tone category number of the at least one piece of third tone information is not more than the tone category number indicated by the first tone information;
acquiring at least one piece of tone color information with the similarity greater than a similarity threshold value with the first tone color information from the at least one piece of third tone color information;
and determining the at least one target song corresponding to the at least one tone color information from the song library.
9. The song matching method of claim 8, further comprising:
for any third tone information, determining a first similarity between at least one first tone category in the first tone information and at least one second tone category in the third tone information;
for any first timbre category in the first timbre information, determining a second similarity of the first timbre category, the second similarity being based on a minimum value of first similarities between the first timbre category and the at least one second timbre category;
determining a sum of at least one second similarity of the at least one first timbre category as a similarity between the first timbre information and the third timbre information.
10. The song matching method of claim 1, further comprising:
for any song in the song library, dividing the song into a plurality of song segments;
for any song segment, performing feature extraction on the song segment to obtain tone features of the song segment, wherein the tone features are used for representing tone categories of the song segment;
and clustering the tone color characteristics of the plurality of song segments to obtain at least one tone color category and at least one category characteristic of the songs, wherein the category characteristic is used for representing a clustering center.
11. A song matching apparatus, comprising:
a first determination unit configured to determine first timbre information based on a voice signal of a singing object in response to a song matching request of the singing object, the first timbre information being indicative of at least one timbre category of the voice signal;
a second determination unit configured to determine at least one target song from a song library based on the first timbre information and a plurality of second timbre information, the second timbre information being used for representing at least one timbre category of songs in the song library, a similarity between the timbre category of the target song and the timbre category of the singing object being greater than a similarity threshold;
a matching unit configured to return the at least one target song to the singing object so as to match the singing object with the at least one target song.
12. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory for storing the processor executable program code;
wherein the processor is configured to execute the program code to implement the song matching method of any of claims 1 to 10.
13. A computer-readable storage medium whose instructions, when executed by a processor of an electronic device, enable the electronic device to perform the song matching method of any of claims 1-10.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination