WO2022012164A1

WO2022012164A1 - Method and apparatus for converting voice into rap music, device, and storage medium

Info

Publication number: WO2022012164A1
Application number: PCT/CN2021/095236
Authority: WO
Inventors: 徐雯
Original assignee: 百果园技术(新加坡)有限公司; 徐雯
Priority date: 2020-07-16
Filing date: 2021-05-21
Publication date: 2022-01-20
Also published as: CN111862913A; CN111862913B

Abstract

A method and apparatus for converting voice into rap music, a computer device, and a computer readable storage medium. The method comprises: identifying a voice segment and processing background music to obtain the character attribute information of characters in the voice segment and the music rhythm information of the background music (S101); according to the character attribute information and the music rhythm information, determining at least one alignment period used for aligning the voice segment with the background music, and obtaining the alignment information table of each alignment period (S102); and according to the alignment information table of the at least one alignment period, controlling the characters in the voice segment to be aligned with rhythm points in the background music so as to obtain aligned audio, and forming rap audio after performing tone change adjustment and special effect processing on the aligned audio (S103).

Description

Method, apparatus, device and storage medium for converting speech into rap music

This application claims the priority of the Chinese patent application with application number 202010688502.3 filed with the China Patent Office on July 16, 2020, the entire contents of which are incorporated into this application by reference.

Technical field

The present application relates to the technical field of music production, for example, to a method, apparatus, device and storage medium for converting speech into rap music.

Background

With the popularization of karaoke software, voice-changing and speech-to-music algorithms have attracted widespread attention, and interest in automatic voice changing and speech-to-singing conversion is growing. Rap culture has also gradually entered the public eye. Rap music is characterized by the creator quickly and rhythmically speaking a series of rhythmic words over background music. Producing rap music is often a complicated process: most people who are not audio processing professionals need a long time to learn professional audio processing software and must perform complex manual operations in it.

In response to these problems, some voice conversion software aimed at non-professionals has appeared. However, each has defects when converting speech into rap. For example, one voice conversion software uses a technical solution with fixed lyrics and rhythm, which requires the user to read specific lyrics. Because the lyrics completely match the background music, the alignment position of each word and rhythm point is fixed; this solution cannot handle lyrics of unknown content and length, which reduces the user's creative space and thereby limits the application prospects of the software. As another example, in the speech-to-rap solution of another voice conversion software, the algorithms for audio segmentation and audio alignment are relatively complicated, which increases the difficulty of conversion, and the speech text can still be misaligned with the music rhythm. This conversion method is not conducive to effectively processing the music uploaded by the user.

Summary of the invention

The present application provides a method, apparatus, device and storage medium for converting speech into rap music, so as to solve the problems of restricted speech content and poor conversion quality when converting speech into rap music.

Provided is a method for converting speech into rap music, including:

Identifying speech segments and processing background music, and obtaining text attribute information of the text in the speech segment and music rhythm information of the background music;

According to the text attribute information and the music rhythm information, determine at least one alignment period for aligning the speech segment with the background music, and obtain an alignment information table for each alignment period;

Controlling the text in the speech segment to align with the rhythm points in the background music according to the alignment information table of the at least one alignment period to obtain aligned audio, and performing pitch adjustment and special effects processing on the aligned audio to form rap audio.

Also provided is a device for converting speech into rap music, comprising:

an information determination module, configured to recognize a speech segment and process background music, and obtain text attribute information of the text in the speech segment and music rhythm information of the background music;

an alignment information determination module, configured to determine at least one alignment period for aligning the speech segment with the background music according to the text attribute information and the music rhythm information, and obtain an alignment information table for each alignment period;

a conversion control module, configured to control the text in the speech segment to align with the rhythm points in the background music according to the alignment information table of the at least one alignment period to obtain aligned audio, and to perform pitch adjustment and special effects processing on the aligned audio to form rap audio.

Also provided is a computer device comprising:

one or more processors;

storage means arranged to store one or more programs;

The one or more programs are executed by the one or more processors, so that the one or more processors implement the above-described method of converting speech to rap music.

Also provided is a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the above-described method for converting speech into rap music.

Description of the drawings

FIG. 1 is a schematic flowchart of a method for converting speech into rap music provided by Embodiment 1 of the present application;

FIG. 2 is a schematic flowchart of a method for converting speech into rap music provided by Embodiment 2 of the present application;

FIG. 3 is a flowchart of determining the alignment period in the method for converting speech into rap music provided by Embodiment 2 of the present application;

FIG. 4 is a flowchart of determining the alignment units and alignment unit information within an alignment period in the method for converting speech into rap music provided by Embodiment 2 of the present application;

FIG. 5 is an expanded flowchart of determining the alignment units and alignment unit information within an alignment period, provided by Embodiment 2 of the present application;

FIG. 6 is a structural block diagram of an apparatus for converting speech into rap music provided by Embodiment 3 of the present application;

FIG. 7 is a schematic diagram of the hardware structure of a computer device according to Embodiment 4 of the present application.

Detailed description

The embodiments of the present application will be described below with reference to the accompanying drawings.

In the description of the present application, the terms "first", "second", "third", etc. are only used to distinguish similar objects; they do not necessarily describe a specific order or sequence, nor should they be construed as indicating or implying relative importance. The meanings of these terms in the present application can be understood from the context.

Embodiment 1

FIG. 1 is a schematic flowchart of a method for converting speech into rap music provided in Embodiment 1 of the present application. The method is suitable for converting a speech segment recorded by a user into rap music. The method can be performed by an apparatus for converting speech into rap music, where the apparatus may be implemented in software and/or hardware and may generally be integrated in a computer device.

In this application scenario, a background music selection interface can first be provided to the user to obtain the background music selected by the user. A speech content selection interface can then be provided to obtain either the speech segment recorded in real time when the user triggers the record button, or a pre-recorded speech segment uploaded when the user triggers the upload button. The method for converting speech into rap music provided by this embodiment can then be carried out to convert the obtained speech segment into a rap segment with background music.

As shown in FIG. 1 , a method for converting speech into rap music provided in Embodiment 1 of the present application includes the following operations:

S101. Recognize a speech segment and process background music, and obtain text attribute information of the text in the speech segment and music rhythm information of the background music.

In this embodiment, the speech segment can be understood as the real-time or pre-recorded speech segment obtained from the user before S101 is executed, and the background music can be understood as the music to be used, selected by the user from the background music set before S101 is executed.

In this embodiment, speech recognition can be performed on the speech segment to obtain the serial number of each character in the speech segment, the pronunciation duration of the character (its start and end times), and the starting position of the first vowel in the character. The beat of the background music can also be detected and processed to obtain related music rhythm information, such as the serial number and position of each rhythm point in the background music and the rhythm points included in each beat period formed by the division.

This embodiment does not limit the methods of speech recognition, text detection, and rhythm point detection, as long as required text attribute information and music rhythm information can be obtained.
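As an illustration only, the two kinds of information produced by S101 could be held in simple records such as the ones below. The class and field names (`CharAttr`, `RhythmPoint`, `BeatPeriod`) and the sample values are hypothetical; the application does not prescribe any particular data layout or recognition method.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CharAttr:
    """One recognized character of the speech segment (times in seconds)."""
    serial: int          # character serial number
    start: float         # start time of the character
    vowel_start: float   # start position of the first vowel
    end: float           # end time of the character


@dataclass
class RhythmPoint:
    """One detected rhythm point of the background music."""
    serial: int          # rhythm point serial number
    time: float          # position (start time) of the rhythm point


@dataclass
class BeatPeriod:
    """One beat period and the rhythm points it contains."""
    number: int
    points: List[RhythmPoint] = field(default_factory=list)

    @property
    def point_count(self) -> int:
        return len(self.points)


# Illustrative values only; they are not taken from the application.
chars = [CharAttr(1, 0.00, 0.05, 0.30), CharAttr(2, 0.30, 0.38, 0.62)]
period = BeatPeriod(1, [RhythmPoint(1, 0.5), RhythmPoint(2, 1.0)])
total_chars = len(chars)
```

Keeping the total amount of text and the per-period rhythm point count directly derivable from these records is convenient because both quantities are compared in S102.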

S102. Determine at least one alignment period for aligning the speech segment with the background music according to the text attribute information and the music rhythm information, and obtain an alignment information table for each alignment period.

When converting the user's speech into a music segment, the most important step besides the speech recognition and beat detection in S101 is aligning the characters in the speech segment with the rhythm points in the background music. Aligning the speech segment with the background music can be regarded as dividing the speech into individual characters and aligning each character with the strongly rhythmic, regularly occurring accent points, possibly repeating some leading, trailing, or middle characters to strengthen the sense of rhythm. Therefore, when converting speech into rap music in this embodiment, S102 must first determine an alignment period and a corresponding alignment information table for aligning the speech segment with the background music.

The alignment period can be understood as a minimum repeating unit containing rhythm points that can be aligned with all the characters in the speech segment; that is, starting from a time t, the rhythm of the background music repeats with a fixed period T that can align all the characters in the speech segment. The alignment information table can be understood as a table that records, for one alignment period, the correspondence between the required rhythm points and the characters to be aligned (such as the rhythm point serial number and the character serial number) together with the gear ratio (that is, the speed-change ratio) used when aligning each rhythm point with its character.

The realization of S102 can be expressed as:

First, the total amount of text included in the speech segment can be determined from the text attribute information, and the total number of rhythm points included in the background music, along with the period information of the beat periods formed by dividing these rhythm points, can be determined from the music rhythm information. A beat period can be understood as the minimum rhythm repetition unit found from the rhythm points; that is, starting from some time, the rhythm of the background music repeats with a fixed period Z.

Next, based on the total amount of text and the number of rhythm points included in one beat period, it can be determined whether the beat period satisfies the condition for serving as an alignment period. If it does, each beat period is regarded as an alignment period. If it does not, the period length of the beat period needs to be updated to obtain a beat period that can serve as an alignment period.

Then, because the rhythm repeats across the alignment periods, any one alignment period can be selected. The start and end times of the characters and the start positions of their first vowels in the text attribute information are combined with the rhythm point information of the rhythm points in that alignment period, extracted from the music rhythm information, to determine the rhythm point to which each character in the speech segment is to be aligned within the alignment period and the gear ratio required for that alignment. The information table containing the rhythm point serial numbers, the serial numbers of the associated characters, and the corresponding gear ratios is used as the alignment information table of the alignment period.

Finally, this alignment information table can be used as the alignment information table of every complete alignment period; for an incomplete alignment period, part of the alignment information can be extracted from the table to form the corresponding alignment information table. In this way, S102 yields at least one alignment period and an alignment information table for each alignment period.
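The reuse of one alignment information table across periods can be sketched as follows. This is a minimal sketch under stated assumptions: each table row is keyed by the rhythm point's index within its period (an assumption, since the application only says the table holds rhythm point serial numbers, character serial numbers, and gear ratios), and the function name `tables_for_periods` and the sample rows are hypothetical.

```python
from typing import Dict, List, Tuple

# One row: (rhythm point index within the period, character serial, gear ratio).
# Indexing rhythm points relative to their period is an assumption made here.
Row = Tuple[int, int, float]


def tables_for_periods(reference: List[Row],
                       points_per_period: List[int]) -> Dict[int, List[Row]]:
    """Derive a table for every alignment period from one reference table.

    A complete period has at least as many rhythm points as the reference
    period, so it receives the whole table. An incomplete period receives
    only the rows whose rhythm point actually exists in it.
    """
    tables: Dict[int, List[Row]] = {}
    for number, count in enumerate(points_per_period, start=1):
        tables[number] = [row for row in reference if row[0] < count]
    return tables


# Illustrative values only: two complete periods of 4 points, then one of 2.
reference = [(0, 1, 1.0), (1, 2, 0.8), (3, 3, 1.2)]
tables = tables_for_periods(reference, [4, 4, 2])
```

With these sample values, periods 1 and 2 receive all three rows, while period 3 keeps only the rows for its first two rhythm points.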

S103. Control the text in the speech segment to align with the rhythm points in the background music according to the alignment information table of the at least one alignment period to obtain aligned audio, and perform pitch adjustment and special effects processing on the aligned audio to form rap audio.

In this embodiment, the matching characters and rhythm points can be determined directly from the alignment periods formed by dividing the rhythm points of the background music and from the alignment information table, which records the alignment relationship between the characters of the speech segment and the rhythm points. The characters in the speech segment are then aligned with the rhythm points in the background music, and the speed of the aligned audio is changed according to the corresponding gear ratio. Afterwards, the pitch of the speed-changed audio can be adjusted according to the pitch of the background music, and special effects such as reverb can be added, to form the converted rap audio.

The method for converting speech into rap music provided in Embodiment 1 of the present application first recognizes the speech segment and processes the background music to obtain the text attribute information of the text in the speech segment and the music rhythm information of the background music; then determines, from the text attribute information and the music rhythm information, at least one alignment period for matching the speech segment with the background music and obtains the alignment information table of each alignment period; and finally, according to the alignment information table of the at least one alignment period, aligns the text in the speech segment with the rhythm points in the background music to obtain aligned audio, and forms rap audio after pitch adjustment and special effects processing of the aligned audio. This technical solution converts speech content freely recorded by the user into a rap segment matched with the background music, simplifies the tedious process of manual audio editing and production, and makes rap music production possible for people who are not audio processing professionals. It also places no restriction on the speech content to be converted, so that content can be recorded freely; it simplifies the speech conversion process, avoids misalignment between speech text and music rhythm points, and broadens the scope of application of speech-to-rap conversion.

As an optional embodiment of Embodiment 1 of the present application, before determining, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music and obtaining the alignment information table for each alignment period, the method further includes: if the total amount of text in the text attribute information is greater than the total amount of rhythm points in the music rhythm information, ending the process of converting the speech segment into rap music and outputting a prompt to reacquire the speech segment or the background music.

In the method for converting speech into rap music provided by this embodiment, S102 and S103 are executed by default on the condition that the total amount of text in the text attribute information obtained through S101 is less than or equal to the total amount of rhythm points in the music rhythm information; that is, the number of characters in the speech segment is less than or equal to the number of rhythm points in the background music. When this condition is not satisfied, speech cannot continue to be converted into rap music, and the operation of this optional embodiment may be performed: when the total amount of text is determined to be greater than the total amount of rhythm points, the subsequent steps of converting the speech segment into rap music are ended, and a prompt for re-recording the speech segment is output to ask the user to record again. Other operations are also possible; for example, a prompt for re-selecting the background music may be output to ask the user to choose different background music.

The operation of this optional embodiment ensures an effective match between the speech segment to be converted and the background music, thereby improving the user experience of converting speech into rap music.
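The precondition described in this optional embodiment amounts to a single comparison. The sketch below is illustrative only; the function name `check_convertible` and the wording of the prompt are hypothetical and not part of the application.

```python
from typing import Optional


def check_convertible(total_chars: int, total_rhythm_points: int) -> Optional[str]:
    """Return None when conversion may continue, otherwise a user prompt.

    Conversion continues only when the number of characters does not
    exceed the number of rhythm points in the background music.
    """
    if total_chars > total_rhythm_points:
        # Hypothetical prompt text; the application only requires that the
        # user be asked to reacquire the speech segment or background music.
        return "Please re-record the speech segment or choose other background music."
    return None
```

For example, `check_convertible(5, 8)` returns `None`, so S102 and S103 would proceed, whereas `check_convertible(9, 8)` returns the prompt and the conversion ends.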

Embodiment 2

FIG. 2 is a schematic flowchart of a method for converting speech into rap music provided by Embodiment 2 of the present application. Embodiment 2 builds on Embodiment 1. In this embodiment, recognizing the speech segment and processing the background music to obtain the text attribute information of the text in the speech segment and the music rhythm information of the background music includes: performing noise reduction and endpoint detection on the speech segment selected by the user, and obtaining the character serial number, start and end times, and first-vowel start position of each character in the speech segment, as well as the total amount of characters, to form the text attribute information of the speech segment; and performing rhythm point detection and beat period division on the background music selected by the user to determine the total amount of rhythm points, the rhythm point serial numbers, and the period information of each beat period in the background music, which constitute the music rhythm information of the background music. The period information includes the period number, the number of rhythm points in the beat period, and the serial number and start time of each rhythm point.

In this embodiment, determining at least one alignment period for aligning the speech segment with the background music according to the text attribute information and the music rhythm information, and obtaining an alignment information table for each alignment period, includes: determining at least one alignment period for aligning the speech segment with the background music according to the total amount of text in the text attribute information and the period information of each beat period in the music rhythm information; selecting a complete alignment period as the rhythm segment to be aligned, and determining at least one alignment unit and the corresponding alignment unit information according to the text attribute information and the rhythm point information of the rhythm points to be aligned in that rhythm segment; and summarizing the alignment unit information of the at least one alignment unit to form a current alignment information table of the rhythm segment to be aligned, and determining the alignment information table of each remaining alignment period according to the current alignment information table.

As shown in FIG. 2 , a method for converting speech into rap music provided by the second embodiment includes the following operations:

S201. Perform noise reduction and endpoint detection on the speech segment selected by the user, and, through speech recognition of the processed speech segment, obtain the character serial number, start and end times, and first-vowel start position of each character in the speech segment, as well as the total amount of characters, to constitute the text attribute information of the speech segment.

In this embodiment, a noise processing strategy from audio processing can be used to denoise the recorded speech segment, an endpoint detection strategy can be used to remove silent segments from the denoised speech segment, and a speech recognition strategy can then be used to recognize the processed speech segment and obtain the relevant information of each character in it.

The information obtained above may include the total amount of characters in the entire speech segment, the serial number of each character, the start and end times of each character in the speech segment, and the start position of the first vowel in each character's pronunciation. The start and end times of a character and the start position of its first vowel can be regarded as relative time points; that is, following the playback order of the whole speech segment, the start time of the first character can be taken as 0 seconds. In this embodiment, this information is recorded as the text attribute information of the speech segment.
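Treating the start of the first character as 0 seconds can be sketched as a simple offset subtraction. The row layout and the function name `to_relative` are assumptions made for illustration, and the sample times are invented.

```python
from typing import List, Tuple

# (character serial, start, first-vowel start, end), all in seconds.
CharRow = Tuple[int, float, float, float]


def to_relative(rows: List[CharRow]) -> List[CharRow]:
    """Shift all times so that the first character starts at 0 seconds."""
    if not rows:
        return []
    origin = rows[0][1]  # start time of the first character
    return [(n, s - origin, v - origin, e - origin) for n, s, v, e in rows]


# Illustrative values only.
absolute = [(1, 1.5, 1.625, 2.0), (2, 2.0, 2.25, 2.5)]
relative = to_relative(absolute)
```

Here the first character's start time becomes 0.0 and every other time keeps its distance from that origin.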

Exemplarily, Table 1 is a data table of text attribute information. Each column in Table 1 can be regarded as a text attribute item, and the columns include at least the character serial number, the start time of the character, the start time of the first vowel in the character, and the end time of the character. The number of rows in Table 1 can be regarded as the total number of characters in the speech segment.

Table 1 Text attribute information of text in speech segment

S202. Perform rhythm point detection and beat period division on the background music selected by the user, and determine the total amount of rhythm points, the rhythm point serial numbers, and the period information of each beat period in the background music, to constitute the music rhythm information of the background music.

In this embodiment, a rhythm point detection strategy from audio processing can first be used to detect the strongly rhythmic accent points (that is, rhythm points) in the background music, and a beat division strategy can then be used to find the pattern in which the detected rhythm points occur, thereby dividing out beat periods, each being the smallest rhythm repeating unit. For a piece of background music, the detected rhythm points have their own attribute information, such as the serial number of each rhythm point, the total amount of rhythm points, and the position of each rhythm point (that is, the relative time at which it occurs). After beat detection, period information is also formed for each beat period. Exemplarily, the period information may include the period number, the number of rhythm points in the beat period, and the serial number and start time of each rhythm point. This embodiment can aggregate these pieces of information to form the music rhythm information.

Exemplarily, this embodiment presents the music rhythm information as a data table. Table 2 is a data table of music rhythm information and is a cascaded table: the first column shows the beat periods identified by period number; the second column gives the rhythm point serial numbers and shows, in cascaded form, which rhythm points belong to each beat period (for example, the beat period with period number 1), together with the position of each rhythm point (its start time). The number of rhythm points cascaded under each period number can be used as the number of rhythm points in that beat period.

Table 2 Music rhythm information corresponding to background music
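A cascaded structure of this kind can be sketched as a mapping from period number to the rhythm points in that period. The sketch assumes that beat detection has already established how many rhythm points one complete beat period contains; how that number is found is not specified here. The function name `build_cascade` and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple

RhythmPoint = Tuple[int, float]  # (rhythm point serial number, start time in seconds)


def build_cascade(points: List[RhythmPoint],
                  points_per_period: int) -> Dict[int, List[RhythmPoint]]:
    """Group rhythm points, in order, under consecutive period numbers.

    The last period may be incomplete when the total number of rhythm
    points is not a multiple of points_per_period.
    """
    if points_per_period <= 0:
        raise ValueError("points_per_period must be positive")
    table: Dict[int, List[RhythmPoint]] = {}
    for i in range(0, len(points), points_per_period):
        table[i // points_per_period + 1] = points[i:i + points_per_period]
    return table


# Illustrative values only: ten rhythm points, half a second apart.
points = [(n, 0.5 * n) for n in range(1, 11)]
table = build_cascade(points, 4)
```

With these sample values, periods 1 and 2 each hold four rhythm points and period 3 is incomplete with two; `len(table[k])` gives the number of rhythm points of period `k`.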

The following S203 to S205 in this embodiment provide an implementation process of determining an alignment period and an alignment information table for aligning the speech segment with the background music by using the text attribute information and the music rhythm information.

S203. Determine at least one alignment period for aligning the speech segment with the background music according to the total amount of text in the text attribute information and the period information of each beat period in the music rhythm information.

In this embodiment, it can be determined how many alignment periods, each able to align all the characters in the speech segment, the entire background music contains. When the number of alignment periods is greater than 1, the last alignment period determined may be incomplete (that is, it does not accommodate all the characters). S203 is equivalent to first roughly dividing the background music into alignment periods. The division uses the total amount of text in the speech segment and the number of rhythm points in one complete beat period of the background music; comparing the two determines whether the beat period is used directly as the alignment period or the alignment period is obtained by merging beat periods.

FIG. 3 is a flowchart of determining an alignment period in the method for converting speech into rap music provided by Embodiment 2 of the present application. As shown in FIG. 3, determining at least one alignment period for aligning the speech segment with the background music according to the total amount of text in the text attribute information and the period information of each beat period in the music rhythm information includes:

S2031. Select a complete beat cycle, and acquire the number of rhythm points in the cycle information corresponding to the complete beat cycle.

At least one beat period can be detected in the entire background music. When only one beat period is detected, it is considered a complete period. When more than one beat period is detected, the last period formed by the division may be incomplete; that is, it does not contain all the rhythm points of a complete beat period. In this embodiment, one of the complete beat periods can be selected and the number of rhythm points in its period information acquired. The number of rhythm points is the same in every complete beat period.

S2032. Determine whether the number of rhythm points is greater than or equal to the total amount of characters. If so, execute S2033; if the number of rhythm points is less than the total amount of characters, execute S2034.

The purpose of S2032 is to determine whether one complete beat period obtained by the current detection can accommodate all the characters in the speech segment. If it can, S2033 is executed; if it cannot, S2034 is executed.

S2033. Regard each beat period as an alignment period.

When the number of rhythm points is greater than or equal to the total amount of characters, the beat period can be used directly as an alignment period. Once one complete beat period is determined to be usable as an alignment period, every other complete beat period is likewise regarded as a complete alignment period, and an incomplete beat period is regarded as an incomplete alignment period.

S2034. Determine whether the number of beat periods included in the background music is greater than 1. If so, execute S2035; otherwise, execute S2036.

When the number of rhythm points is less than the total amount of text, one complete beat period cannot accommodate all the characters in the speech segment. The beat periods then need to be merged, and merging requires that the background music contain at least two beat periods. S2034 determines whether the number of beat periods in the background music is greater than 1. If it is, the merging condition is satisfied and S2035 can be executed; if it is not, the background music does not match the speech segment and S2036 needs to be executed.

S2035. Merge the beat periods in pairs in the order of their period numbers to form at least one new beat period, and return to S2031.

In this embodiment, when the number of beat periods is greater than 1, the beat periods may be merged in pairs in the order of their period numbers to form new beat periods, and the period information of each newly formed beat period changes accordingly. Taking Table 2 as an example, if the two beat periods with period numbers 1 and 2 are merged, the number of rhythm points in the new beat period is the sum of the numbers of rhythm points in the two original beat periods. After the beat periods are merged in period-number order, the number of beat periods is half of the original number, or half of the original number (rounded down) plus 1 when one beat period is left unpaired. The flow then returns to S2031 to determine the alignment period from the period information of the newly formed beat periods, and so on, until a suitable alignment period is found or, if the search fails, the subsequent speech-to-rap conversion is ended.

S2036. End the process of converting the speech segment into rap music, and output a prompt for re-acquiring the speech segment or the background music.

In this embodiment, if there is only one beat period and the number of rhythm points in that beat period is less than the total number of characters, it can be considered that the speech segment does not match the selected background music. In that case, the speech segment needs to be re-uploaded or re-recorded, or the background music needs to be re-selected, following the operations of Embodiment 1.

S204: Select a complete alignment period as the to-be-aligned rhythm segment, and determine at least one alignment unit and corresponding alignment unit information according to the text attribute information and the rhythm point information of the to-be-aligned rhythm point in the to-be-aligned rhythm segment.

After the alignment periods are divided as described above, one alignment period may be used as a reference to determine how each character included in the speech segment matches the rhythm points within that alignment period. In this embodiment, the matching between the rhythm points included in a span of time and the characters in the speech segment is regarded as an alignment unit, and each piece of alignment unit information includes the rhythm point serial number of the rhythm point in the alignment unit, the character serial number of the character matched with that rhythm point, and the speed change ratio required to align the rhythm point with the matched character.

Each alignment unit has alignment unit information that includes at least the rhythm point serial number, the character serial number, and the speed change ratio. Moreover, since the multiple alignment periods include the same number of rhythm points and have the same musical tempo, the alignment units and the alignment unit information need only be determined for any one complete alignment period.

The process of determining the alignment units and the alignment unit information in S204 can be described as follows: first, the alignment period selected for information determination is recorded as the to-be-aligned rhythm segment, and the rhythm point information in that alignment period is used directly as the rhythm point information of the to-be-aligned rhythm points; next, an alignment matching value for aligning the characters with the rhythm points is determined according to the text attribute information and the rhythm point information; then, the alignment range to which the alignment matching value belongs in the preset rhythm point-text alignment rule table is determined; finally, based on the alignment rule corresponding to that alignment range, the alignment units in the to-be-aligned rhythm segment and the alignment unit information corresponding to each alignment unit are determined. The alignment ranges in the rhythm point-text alignment rule table and the corresponding alignment rules can be preset based on historical experience.

FIG. 4 is a flowchart of determining alignment units and alignment unit information in an alignment period in the method for converting speech into rap music provided by Embodiment 2 of the present application. As shown in FIG. 4, determining at least one alignment unit and the corresponding alignment unit information according to the text attribute information and the rhythm point information of the to-be-aligned rhythm points in the to-be-aligned rhythm segment may include:

S2041, select a complete alignment period as the to-be-aligned rhythm segment; based on the rhythm point information of the plurality of to-be-aligned rhythm points in the to-be-aligned rhythm segment, form a plurality of to-be-aligned rhythm blocks in one-to-one correspondence with the plurality of to-be-aligned rhythm points; and record the number of the to-be-aligned rhythm points as the initial number of remaining points.

In this embodiment, a complete alignment period may be selected from the alignment periods determined above as the to-be-aligned rhythm segment for which the alignment information table is determined. The to-be-aligned rhythm points in the to-be-aligned rhythm segment are all the rhythm points included in that complete alignment period, and the rhythm point information of those rhythm points is the rhythm point information of the to-be-aligned rhythm points.

In this embodiment, the interval formed between two adjacent to-be-aligned rhythm points may be recorded as a to-be-aligned rhythm block, so that to-be-aligned rhythm blocks equal in number to the to-be-aligned rhythm points included in the to-be-aligned rhythm segment can first be formed. The to-be-aligned rhythm blocks can be considered to correspond one-to-one to the to-be-aligned rhythm points, a block serial number can be set for each to-be-aligned rhythm block, and the number of to-be-aligned rhythm points can be recorded as the initial number of remaining points.

S2042. Determine the ratio of the number of remaining points to the total amount of characters in the character attribute information, and record the ratio as an alignment matching value.

In this embodiment, in order to match each to-be-aligned rhythm point in the to-be-aligned rhythm segment with the characters in the speech segment, the ratio of the number of to-be-aligned rhythm points not yet matched with characters to the total number of characters is determined first, and this ratio is recorded as the alignment matching value.

When no rhythm point in the to-be-aligned rhythm segment has been matched yet, the rhythm points to be matched are all of the to-be-aligned rhythm points. Therefore, the number of remaining points is initially the number of to-be-aligned rhythm points included in the to-be-aligned rhythm segment.

S2043. Search a preset rhythm point-text alignment rule table, and determine the length ratio range to which the alignment matching value belongs.

This embodiment presets a rhythm point-text alignment rule table. The rule table is a binary (two-column) association table whose two associated objects are the length ratio range and the alignment rule. A length ratio range is defined over the ratio of the number of unmatched rhythm points in one alignment period to the total number of characters included in the entire speech segment. In this embodiment, based on historical experience, six different length ratio ranges are formed, namely: (0, 0.2], (0.2, 0.8], (0.8, 1], (1, 1.1], (1.1, 1.3] and (1.3, ∞).

In this embodiment, the range of length ratios in which the above-obtained alignment matching value is located in the rhythm point-character alignment rule table can be determined.
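As an illustration of S2042 and S2043, the following Python sketch computes the alignment matching value and looks up the length ratio range it falls in. The six ranges are those listed above; the function names and the use of a sorted list of upper bounds are illustrative assumptions, not part of the described method.

```python
from bisect import bisect_left

# Upper bounds of the first five half-open ranges (lower, upper];
# anything above 1.3 falls in the last, unbounded range.
UPPER_BOUNDS = [0.2, 0.8, 1.0, 1.1, 1.3]
RANGE_LABELS = [
    "(0, 0.2]", "(0.2, 0.8]", "(0.8, 1]",
    "(1, 1.1]", "(1.1, 1.3]", "(1.3, inf)",
]


def alignment_matching_value(remaining_points, total_chars):
    """Ratio of unmatched rhythm points to the total number of characters."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return remaining_points / total_chars


def length_ratio_range(value):
    """Return the label of the length ratio range containing value."""
    if value <= 0:
        raise ValueError("alignment matching value must be positive")
    # bisect_left returns the first index whose bound is >= value, so a
    # value equal to an upper bound stays in the range it closes.
    return RANGE_LABELS[bisect_left(UPPER_BOUNDS, value)]
```

With 8 remaining points and 5 characters the value is 1.6 and the lookup returns the range (1.3, ∞); with 2 remaining points it is 0.4 and the lookup returns (0.2, 0.8].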

S2044: According to the alignment rule corresponding to the length ratio range, determine the to-be-aligned rhythm blocks that have matched characters, and record them as candidate alignment units.

After the length ratio range to which the alignment matching value belongs is determined, the alignment rule associated with that length ratio range can be obtained, and candidate alignment units are divided out of the to-be-aligned rhythm segment by applying the alignment rule.

In this embodiment, the matching of characters and rhythm points can be regarded as the matching of characters and to-be-aligned rhythm blocks. Based on the alignment rule corresponding to the length ratio range, the matched characters can be determined for each to-be-aligned rhythm block (the number of characters is not fixed, but is at least 1), and each matched to-be-aligned rhythm block can be used as a candidate alignment unit.

Corresponding to different length ratio ranges, the present embodiment sets corresponding alignment rules. Exemplarily, Table 3 provides a preset rhythm point-text alignment rule table. Character matching is performed for the remaining rhythm points (the remaining rhythm blocks to be aligned) according to the alignment rules corresponding to the multiple length ratio ranges in Table 3.

Table 3 Rhythm point-text alignment rule table

S2045. Count the number of remaining rhythm blocks to be aligned as the new number of remaining points.

After performing one alignment and matching using the above S2044, there may also be unmatched rhythm blocks to be aligned. The number of blocks of the remaining to-be-aligned rhythm blocks in the to-be-aligned rhythm segment is counted, and the number of blocks is used as the new number of remaining points.

S2046: Determine whether the number of remaining points is 0, if the number of remaining points is 0, execute S2047; if the number of remaining points is not 0, return to execute S2042.

In this embodiment, it can be determined whether the number of remaining points is 0. If it is 0, no to-be-aligned rhythm block in the to-be-aligned rhythm segment remains unmatched, that is, all to-be-aligned rhythm blocks have been matched with characters. If it is not 0, there are still unmatched to-be-aligned rhythm blocks in the to-be-aligned rhythm segment, and the process returns to S2042 to determine the alignment matching value again.

Based on the operation of S2046, when all the rhythm blocks to be aligned in a rhythm section to be aligned have been matched, the number of candidate alignment units formed by it is actually the same as the number of included rhythm points to be aligned. That is, it can be considered that a rhythm point to be aligned (rhythm block to be aligned) corresponds to a candidate alignment unit, and the unit serial numbers of the multiple candidate alignment units can be marked sequentially increasing from 0 according to their alignment sequence.

To facilitate understanding of the process of determining the candidate alignment units, this embodiment provides an exemplary description. Assume that the number of to-be-aligned rhythm points in a to-be-aligned rhythm segment is 8, so that the currently determined number of remaining points is 8, and that the total number of characters included in the speech segment obtained from the user is 5, such as the five-character phrase "light yellow long skirt". The process of matching "light yellow long skirt" with the 8 remaining rhythm points to determine each candidate alignment unit is described as follows:

1) The alignment matching value is: 8/5=1.6, which falls within the length ratio range of (1.3, ∞). By looking up Table 3 above, the corresponding alignment rule can be obtained.

2) According to the alignment rule associated with the length ratio range (1.3, ∞), the text and the rhythm point are matched.

The alignment rule is: "Starting from the first remaining rhythm point, select characters amounting to 10% of the total number of characters, beginning with the first character, for repeated matching; then match the next remaining rhythm points, amounting to 100% of the total number of characters, with the characters in character order; then, for the following remaining rhythm points, amounting to 20% of the total number of characters, select characters amounting to 20% of the total number of characters, beginning with the last character, for repeated matching". Based on this alignment rule, characters amounting to 10% of the total, that is, 0.5 characters, first need to be selected for repetition starting from the first character of "light yellow long skirt". When the number of characters to be repeated is less than 1, it is rounded down, so the current number of characters to be repeated is 0. After that, starting directly from the first remaining rhythm point, rhythm points amounting to 100% of the total number of characters are selected and matched with the 5 characters in order; at this point, the to-be-aligned rhythm blocks formed by to-be-aligned rhythm points 0-4 correspond respectively to the 5 characters "light", "yellow", "color", "long" and "skirt". Then, starting from the last character of "light yellow long skirt", characters amounting to 20% of the total are selected, that is, the last character "skirt"; at this point, the to-be-aligned rhythm block formed by rhythm point 5 corresponds to the character "skirt". This completes the matching of characters and rhythm points according to the alignment rule associated with the length ratio range (1.3, ∞). The unit serial numbers of the currently determined candidate alignment units are 0-5, and the characters corresponding to these six candidate alignment units are: "light", "yellow", "color", "long", "skirt" and "skirt".

3) After the above operation, 2 of the 8 to-be-aligned rhythm blocks are still unmatched. The number of remaining points is greater than 0, so the alignment matching value is determined again; the new alignment matching value is 2/5=0.4, which falls within the length ratio range (0.2, 0.8], and the corresponding alignment rule can be obtained by looking up Table 3 above.

4) The text and the rhythm point are matched according to the alignment rules associated with the length ratio range (0.2, 0.8].

The alignment rule is: "When L is less than or equal to 0.5, randomly select characters amounting to L * the total number of characters as the characters to be repeated, adjust the positions of the already matched rhythm point-character pairs, and place each repetition after the selected character; when L is greater than 0.5, randomly select characters amounting to 50% of the total number of characters as the characters to be repeated, adjust the positions of the already matched rhythm point-character pairs, place each repetition after the selected character, and fill the remaining rhythm points, amounting to (L-0.5) * the total number of characters, with silent segments, where L is the alignment matching value."

Since the alignment matching value 0.4 is less than 0.5, according to the alignment rule, 40% of the total number of characters (that is, 2 characters) can be randomly selected directly. Assuming that the character serial numbers randomly selected from 0-4 are 1 and 3, the corresponding characters are "yellow" and "long". The already matched "light yellow long skirt" then needs to be adjusted so that each character to be repeated is located after the selected character. According to the alignment rule, the characters matched by the remaining two to-be-aligned rhythm blocks are "yellow" and "long" respectively, which forms new candidate alignment units matching these two characters. As a result of the adjustment, the characters corresponding to the 8 candidate alignment units are: "light", "yellow", "yellow", "color", "long", "long", "skirt" and "skirt".

5) After the above operation, no to-be-aligned rhythm block remains unmatched, that is, the number of remaining points is 0. This satisfies the condition for ending the matching of candidate alignment units, and the above operation can be ended.

After step 5), 8 candidate alignment units with unit serial numbers 0-7 in sequence can be formed. In this way, the alignment and matching of the text in the speech segment to the rhythm segment to be aligned is completed.
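The worked example in steps 1) to 5) can be reproduced with the following Python sketch. It is only an illustration under stated assumptions: the two functions cover just the two alignment rules used in the example, the function names are hypothetical, the percentages are rounded down in all cases, and the characters selected for repetition are passed in explicitly (serial numbers 1 and 3) rather than drawn at random.

```python
def match_ratio_above_1_3(total_chars, remaining_points):
    """Rule for the range (1.3, inf): head repeat, full text, tail repeat.

    Returns one character serial number per matched rhythm block.
    Assumption: the 10% and 20% shares are rounded down.
    """
    head = total_chars * 10 // 100   # characters repeated from the start
    tail = total_chars * 20 // 100   # characters repeated from the end
    sequence = (
        list(range(head))
        + list(range(total_chars))
        + list(range(total_chars - tail, total_chars))
    )
    return sequence[:remaining_points]


def repeat_after_selected(sequence, selected):
    """Rule for L <= 0.5: repeat each selected character right after itself."""
    pending = set(selected)
    result = []
    for serial in sequence:
        result.append(serial)
        if serial in pending:
            result.append(serial)    # the repetition follows the original
            pending.remove(serial)
    return result


chars = ["light", "yellow", "color", "long", "skirt"]
total_points = 8

first_pass = match_ratio_above_1_3(len(chars), total_points)
remaining = total_points - len(first_pass)

# In the example the randomly chosen serial numbers are 1 and 3.
second_pass = repeat_after_selected(first_pass, [1, 3])
units = [chars[serial] for serial in second_pass]
print(units)
```

The first pass fills 6 rhythm blocks and leaves 2, and the second pass inserts the two repetitions, giving the same 8 candidate alignment units as in the description.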

S2047: Determine at least one alignment unit and obtain the corresponding speed change ratio according to the unit duration of each candidate alignment unit and the matching character attribute information of the characters matched by each candidate alignment unit.

According to the above description, the number of candidate alignment units determined from the to-be-aligned rhythm segment is the same as the number of to-be-aligned rhythm blocks included in the to-be-aligned rhythm segment. A to-be-aligned rhythm block is the interval from its corresponding rhythm point to the adjacent next rhythm point or to the rhythm end point (the latter case applies mainly to the last rhythm point); that is, the duration of a to-be-aligned rhythm block is the interval between two rhythm points (or between a rhythm point and the rhythm end point). In this embodiment, since one candidate alignment unit corresponds to one to-be-aligned rhythm block, the duration of the to-be-aligned rhythm block may be used as the unit duration of the corresponding candidate alignment unit.
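A minimal Python sketch of the unit duration computation is given below. The function name and the representation of rhythm points as a sorted list of times in seconds are assumptions made for illustration.

```python
def block_durations(point_times, rhythm_end):
    """Duration of each to-be-aligned rhythm block.

    point_times: start times of the rhythm points, in ascending order.
    rhythm_end:  end time of the rhythm segment, used for the last block.
    """
    # Each block ends at the next rhythm point; the last one ends at
    # the rhythm end point.
    block_ends = list(point_times[1:]) + [rhythm_end]
    return [end - start for start, end in zip(point_times, block_ends)]
```

For rhythm points at 0.0, 0.5, 1.0 and 2.0 seconds and a rhythm end point at 2.5 seconds, the unit durations are 0.5, 0.5, 1.0 and 0.5 seconds.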

After the characters matched by a candidate alignment unit are determined, the pronunciation of the matched characters needs to be aligned with the unit duration of the candidate alignment unit. In general, alignment can be achieved by mixing in the pronunciation of the matched characters while the audio signal of the candidate alignment unit is played. However, some characters have a short pronunciation time while the unit duration of the matched candidate alignment unit is long, and some characters have a long pronunciation time while the unit duration of the matched candidate alignment unit is short. To align the characters with the candidate alignment unit, the pronunciation rate of the characters therefore needs to be adjusted, for example by stretching the pronunciation time (slowing the pronunciation) or compressing it (speeding the pronunciation up) so that it equals the unit duration.

In this embodiment, the ratio by which the characters need to be stretched or compressed is recorded as the speed change ratio. The speed change ratio required to align the matched characters with the corresponding candidate alignment unit can be determined based on the unit duration of the candidate alignment unit and the matching character attribute information of the characters matched with the candidate alignment unit (such as the start and end times of the matched characters and the start position of the first vowel in each character).

When characters are aligned with a candidate alignment unit by stretching or compressing their pronunciation, the extent to which the pronunciation can be stretched or compressed is limited; beyond that extent, the audio formed after the alignment operation is at risk of distortion. Therefore, in this embodiment, an appropriate range needs to be set for the compression or stretching of the pronunciation, that is, the speed change ratio corresponding to the characters must stay within a normal range. This range of ratios can be regarded as the suitable condition for stretching or compression.

Therefore, the speed change ratio calculated above can be compared with the set suitable condition to determine whether the corresponding candidate alignment unit is suitable as an alignment unit. If it is suitable, the candidate alignment unit can be directly determined as an alignment unit, and its speed change ratio is determined as the speed change ratio of the alignment unit. If it is not suitable, the candidate alignment unit needs to be padded with silence, or two or more candidate alignment units need to be merged, so as to obtain an alignment unit that satisfies the suitable condition, and the speed change ratio found to satisfy the suitable condition is used as the speed change ratio of the alignment unit.

The candidate alignment units determined above, equal in number to the to-be-aligned rhythm points, eventually form at least one alignment unit. Each alignment unit may include at least one rhythm point and at least one matched character, and its speed change ratio can be regarded as the ratio by which the included characters need to be stretched or compressed when they are aligned with the included rhythm points.

S2048: Determine the unit serial number of each alignment unit, the rhythm point serial number of the starting rhythm point among the included rhythm points, the character serial numbers of the matched characters, and the speed change ratio as the corresponding alignment unit information.

When the alignment units and the corresponding speed change ratios are determined as described above, the unit serial number of each alignment unit and the rhythm point serial number of each rhythm point included in the alignment unit are obtained as well, together with the character serial number of each character matched in the alignment unit. This information may be summarized for each alignment unit to form the corresponding alignment unit information.

S205. Summarize the alignment unit information corresponding to the at least one alignment unit to form a current alignment information table of the to-be-aligned rhythm segment, and determine an alignment information table for each remaining alignment period according to the current alignment information table.

In this embodiment, at least one alignment unit included in the to-be-aligned rhythm segment and the corresponding alignment unit information can be determined through S204 above, and the alignment unit information can be arranged and summarized in the order of the unit serial numbers of the alignment units to form a current alignment information table. Afterwards, the alignment information table of each remaining alignment period determined in S203 above may be determined according to the current alignment information table.

For a remaining alignment period, if it is a complete alignment period, the current alignment information table can be copied directly as the corresponding alignment information table; if it is an incomplete alignment period, as many rows of alignment unit information as the number of rhythm points included in that alignment period can be taken from the current alignment information table to form the corresponding alignment information table.

Table 4 Alignment information table formed based on the information of the alignment unit in an alignment cycle

Exemplarily, Table 4 is an alignment information table formed based on the alignment unit information in one alignment period. As shown in Table 4, each column in the alignment information table corresponds to one attribute of the alignment unit and may include: the unit serial number, the rhythm point serial number of the starting rhythm point in the alignment unit, the character serial numbers of the matched characters, and the speed change ratio required for alignment. The number of rows in the alignment information table represents the number of alignment units in the alignment period.

Determining the alignment information table of each remaining alignment period according to the current alignment information table includes: for each remaining alignment period, if the alignment period is a complete period, using the current alignment information table as the alignment information table of the alignment period; if the alignment period is an incomplete period, with each row in the current alignment information table corresponding to one alignment unit, determining the number of rhythm points included in the alignment period, and selecting that number of rows of alignment unit information from the current alignment information table in reverse order to form the alignment information table of the alignment period.

The above description provides the process of determining the alignment information tables of the remaining alignment periods in the background music. For an incomplete alignment period that includes, for example, 2 rhythm points, two rows of alignment unit information can be selected directly from the current alignment information table, from bottom to top, to form the corresponding alignment information table.
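The derivation of the alignment information table for a remaining alignment period can be sketched in Python as follows. The row layout (unit serial number, starting rhythm point serial number, character serial numbers, speed change ratio) follows the columns described for Table 4, but the concrete values, the function name, and the choice to keep the selected rows in their original top-to-bottom order are assumptions made for illustration.

```python
def table_for_period(current_table, num_points, is_complete):
    """Build the alignment information table for one remaining period.

    current_table: rows of the current alignment information table,
                   one row per alignment unit.
    num_points:    number of rhythm points in the remaining period.
    is_complete:   True when the period is a complete alignment period.
    """
    if is_complete:
        # A complete period reuses a copy of the current table as is.
        return list(current_table)
    if num_points <= 0:
        return []
    # An incomplete period takes num_points rows counted from the bottom.
    return list(current_table[-num_points:])


# Hypothetical rows: (unit, start point, character serial numbers, ratio).
current = [
    (0, 0, [0], 1.0),
    (1, 1, [1], 0.9),
    (2, 2, [2], 1.2),
    (3, 3, [3, 4], 1.1),
]
partial = table_for_period(current, 2, is_complete=False)
```

Here `partial` contains the last two rows of the current table, matching the 2-rhythm-point case described above.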

S206. Control the characters in the speech segment to align with the rhythm points in the background music according to the alignment information table of the at least one alignment period to obtain aligned audio, and perform pitch adjustment and special effect processing on the aligned audio to form rap audio.

In this embodiment, the alignment information table formed for each alignment period includes at least one alignment unit and the corresponding alignment unit information, and each piece of alignment unit information includes the serial number of the rhythm point actually used for aligning characters with rhythm points, the serial numbers of the matched characters, the speed change ratio required for alignment, and so on. After the alignment information table of each alignment period is obtained through the above steps, the corresponding rhythm points and the matched characters can be aligned at the corresponding speed change ratio according to the alignment unit information included in each alignment information table, so that the characters in the speech segment are aligned and matched with the rhythm points in the background music.

When the characters in the speech segment are controlled to align with the matched rhythm points, the matching in each alignment period is in effect carried out as follows: first, the audio data in the speech segment that actually corresponds to each alignment unit is obtained according to the pronunciation duration occupied by the characters included in that alignment unit (the interval from the start of the first vowel of the alignment unit to the start of the first vowel of the next alignment unit); then, variable-speed adjustment is performed on that audio data according to the speed change ratio of the alignment unit; finally, operations such as pitch adjustment and special effect processing are performed on the speed-adjusted audio data to form the converted rap music.

The method for converting speech into rap music provided by Embodiment 2 of the present application specifies how the text attribute information and the music rhythm information are determined, and how the alignment periods for aligning the speech segment with the background music and the related alignment information tables are determined. With the method provided in this embodiment, after the user selects background music and records a speech segment of arbitrary content, a matching and variable-speed alignment strategy between characters and rhythm points can be determined from the obtained rhythm point positions, the start and end times of individual characters, and the vowel start times, so that rap music in which characters are aligned with rhythm points can be obtained in a short time. The technical solution simplifies the tedious process of manual audio editing and production, and makes it possible for non-professional audio processing personnel to make rap music. At the same time, the technical solution does not restrict the content of the speech to be converted, which allows the speech content to be recorded freely, simplifies the implementation of speech conversion, avoids misalignment between the spoken characters and the music rhythm points, and broadens the application scope of converting speech into rap music.

As an optional embodiment of Embodiment 2 of the present application, before the music rhythm information of the background music (the total number of rhythm points included in the background music, the rhythm point serial numbers, and the period information of each beat period) is determined in S202 above, the method further includes: acquiring a plurality of detected initial rhythm points, and determining the interval duration formed between each two adjacent initial rhythm points; and, based on the average character duration combined with the interval durations, determining the to-be-deleted rhythm points among the plurality of initial rhythm points and deleting them to obtain the effective rhythm points in the background music.

This optional embodiment provides an operation for processing the rhythm points detected from the background music (referred to as initial rhythm points in this optional embodiment). Through this operation, densely spaced rhythm points, for which the interval between two adjacent rhythm points is less than half of the average character duration, can be removed from the detected rhythm points.

The average character duration is the ratio of the time occupied by all characters to the total number of characters. Generally speaking, if the interval between two adjacent rhythm points is less than half of the average character duration, it is not conducive to aligning characters with rhythm points. Therefore, either one of the two adjacent rhythm points needs to be deleted, so that the undeleted rhythm point and the rhythm point before or after the deleted rhythm point form a new interval duration. The newly formed interval duration can be checked again in the manner of this optional embodiment, whereby invalid rhythm points are removed by cyclic updating and valid rhythm points are retained.
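The filtering of densely spaced rhythm points can be sketched in Python as follows. The description allows either of two adjacent rhythm points to be deleted; this sketch assumes that the later point of each too-close pair is the one deleted, and the function names and the representation of characters as (start, end) time pairs are likewise illustrative.

```python
def average_char_duration(char_spans):
    """Time occupied by all characters divided by the number of characters.

    char_spans: list of (start_time, end_time) pairs, one per character.
    """
    if not char_spans:
        raise ValueError("at least one character is required")
    return sum(end - start for start, end in char_spans) / len(char_spans)


def filter_rhythm_points(initial_points, avg_duration):
    """Drop rhythm points that follow the previous kept point too closely.

    A point is deleted when its interval to the last kept point is less
    than half of the average character duration. Because the comparison
    is always made against the last kept point, every newly formed
    interval is checked again, as in the cyclic update described above.
    Assumption: the later point of a too-close pair is the one deleted.
    """
    min_interval = avg_duration / 2
    kept = []
    for time in initial_points:
        if kept and time - kept[-1] < min_interval:
            continue
        kept.append(time)
    return kept
```

With an average character duration of 0.4 seconds, initial rhythm points at 0.0, 0.1, 0.5, 0.55 and 1.0 seconds reduce to the effective rhythm points 0.0, 0.5 and 1.0.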

As another optional embodiment of Embodiment 2 of the present application, the execution of S2047 above is described in detail. FIG. 5 is an expanded flowchart of determining the alignment units and the alignment unit information in the alignment period provided by Embodiment 2 of the present application. As shown in FIG. 5, determining at least one alignment unit and obtaining the corresponding speed change ratio according to the unit duration of each candidate alignment unit and the matching character attribute information of the characters matched by each candidate alignment unit can be described as follows:

This optional embodiment describes the execution process of S2047 above. Through the operations up to S2046 above, a certain number of candidate alignment units are obtained in the to-be-aligned rhythm segment. The following operations of this optional embodiment determine the alignment units among the candidate alignment units and the speed change ratio corresponding to each alignment unit.

S1. Select an unselected candidate alignment unit as the current processing unit according to the sequence of unit serial numbers.

In this embodiment, each candidate alignment unit in the to-be-aligned rhythm segment has a corresponding unit serial number, and a candidate alignment unit that has not been selected before can be selected, in the order of the unit serial numbers, as the current processing unit. "Unselected" can be understood as not yet having been selected as the current processing unit.

Exemplarily, the first candidate alignment unit is selected as the current processing unit.

S2. Determine the current speed change ratio of the current processing unit according to the unit duration of the current processing unit, in combination with the start and end times and the first-vowel start positions of the characters matched respectively by the current processing unit and the adjacent next candidate alignment unit.

According to the above description of this embodiment, aligning characters with a candidate alignment unit mainly means aligning the actual pronunciation duration of the characters with the unit duration of the candidate alignment unit. The two can be aligned by stretching or compressing the pronunciation duration of the characters, and the amount of stretching or compression is determined by a speed change ratio. The speed change ratio relates the pronunciation duration required for alignment to the actual pronunciation duration of the characters.

For a character, the actual pronunciation starts from the start position of its first vowel, and the end of the actual pronunciation can be regarded as the start position of the first vowel of the next adjacent character. If the text is considered together with the candidate alignment units, then in one candidate alignment unit the time occupied by the actual pronunciation of all the matched text runs from the first-vowel position of the first matched character in that candidate alignment unit to the first-vowel position of the first matched character in the adjacent next candidate alignment unit. Therefore, the actual pronunciation duration of all characters matched by the current processing unit can be determined from the start and end times and the first-vowel start positions of the characters matched respectively by the current processing unit and by the adjacent next candidate alignment unit, and the current gear ratio of the current processing unit is then obtained from the known unit duration and the determined actual pronunciation duration.

In this embodiment, determining the current gear ratio of the current processing unit according to the unit duration of the current processing unit, in combination with the start and end times and the first-vowel start positions of the characters matched respectively by the current processing unit and by the adjacent next candidate alignment unit, includes:

S21. According to the start and end time of each character in all characters matched by the current processing unit and the start position of the first vowel, determine the duration of pronunciation occupied by all the matched characters in the current processing unit.

In this step, the matching character attribute information of all characters matched in the current processing unit can be obtained. The matching character attribute information can be the start and end times of each character and the start position of the first vowel of each character. Based on this information, the pronunciation occupying duration of all the matched characters in the current processing unit can be determined.

Exemplarily, assuming that there is only one character in the current processing unit, the start and end times of the character are t1 and t2 respectively, and the start position of its first vowel is t3, where t1<t3<t2, then the pronunciation occupying duration of the character in the current processing unit is actually t2-t3.

Suppose there are two characters in the current processing unit: the start and end times of the first character are t1 and t2 respectively, and the start position of its first vowel is t3; the start and end times of the second character are t2 and t4 respectively, and the start position of its first vowel is t5, where t1<t3<t2<t5<t4. Then the pronunciation occupying duration shared by the two characters in the current processing unit is t4-t3 or (t2-t3)+(t4-t5). In the first form, the pronunciation occupying duration of all characters matched in the current processing unit is simply the sum of the start-to-end durations of all characters minus the interval from the start of the first character to its first vowel.
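The one-character and two-character cases of S21 can be sketched as follows. This is a minimal illustration that uses the first form given above (t4-t3); the `Char` type, the function name `occupied_duration_ms`, and the millisecond values are assumptions made for the example and do not come from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    """Hypothetical container for the per-character attribute information."""
    start_ms: int        # start time of the character
    end_ms: int          # end time of the character
    first_vowel_ms: int  # start position of the character's first vowel


def occupied_duration_ms(chars: List[Char]) -> int:
    """S21: time from the first vowel of the first matched character
    to the end of the last matched character."""
    if not chars:
        raise ValueError("a processing unit must match at least one character")
    return chars[-1].end_ms - chars[0].first_vowel_ms


# One character: t1=0, t2=400, t3=100, so the result is t2 - t3.
one = [Char(start_ms=0, end_ms=400, first_vowel_ms=100)]
# Two contiguous characters: t4=900, t5=500, so the result is t4 - t3.
two = one + [Char(start_ms=400, end_ms=900, first_vowel_ms=500)]

print(occupied_duration_ms(one))  # 300
print(occupied_duration_ms(two))  # 800
```

Integer milliseconds are used here only to keep the arithmetic exact; any consistent time unit would do.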

The pronunciation occupying duration here is not yet the actual pronunciation duration of all characters matched in the current processing unit. The actual pronunciation duration also includes the vowel interval duration of the first character matched by the next candidate alignment unit adjacent to the current processing unit, which can be obtained through the following S22. In this embodiment, the purpose of determining the pronunciation occupying duration in the above manner is to align the first-vowel position of each character in the speech segment with the rhythm point of the matched candidate alignment unit, so that the text and rhythm points aligned in this manner play back better than when the start of each character is aligned directly with the rhythm points.

S22. Determine the vowel interval duration of the first character according to the start and end times of the first character matched by the adjacent next candidate alignment unit of the current processing unit and the start position of the first vowel.

Taking a current processing unit that includes two characters as an example, assume that the start and end times of the first character matched by the next candidate alignment unit adjacent to the current processing unit are t4 and t6, and that the start position of its first vowel is t7, where t1<t3<t2<t5<t4<t7<t6. Then the vowel interval duration of that first character is t7-t4.

Through S21 and S22, it can be determined that the actual pronunciation duration of all characters matched in the current processing unit is the sum of the pronunciation occupying duration of all characters matched in the current processing unit and the vowel interval duration of the first character matched by the adjacent next candidate alignment unit. In the example above, the actual pronunciation duration of all characters matched in the current processing unit is (t4-t3)+(t7-t4), that is, t7-t3.

S23. Take the ratio of the unit duration of the current processing unit to the determined actual pronunciation duration as the current gear ratio of the current processing unit, wherein the actual pronunciation duration is the sum of the pronunciation occupying duration and the vowel interval duration.

Assuming that the unit duration of the current processing unit is t, the current gear ratio of the current processing unit can be expressed as: t/[(t4-t3)+(t7-t4)].
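As a worked illustration of S22 and S23, the sketch below computes the vowel interval duration and the resulting gear ratio. The function names and the numeric values (t3=100, t4=900, t7=1000, t=1350, all in milliseconds) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def vowel_interval_ms(next_start_ms: int, next_first_vowel_ms: int) -> int:
    """S22: time from the start of the next unit's first character
    to the start of that character's first vowel (t7 - t4)."""
    return next_first_vowel_ms - next_start_ms


def gear_ratio(unit_ms: int, occupied_ms: int, interval_ms: int) -> float:
    """S23: unit duration divided by the actual pronunciation duration,
    where the actual duration is occupied duration plus vowel interval."""
    actual_ms = occupied_ms + interval_ms
    if actual_ms <= 0:
        raise ValueError("actual pronunciation duration must be positive")
    return unit_ms / actual_ms


# Hypothetical timings: t3=100, t4=900, t7=1000, unit duration t=1350.
interval = vowel_interval_ms(next_start_ms=900, next_first_vowel_ms=1000)
ratio = gear_ratio(unit_ms=1350, occupied_ms=900 - 100, interval_ms=interval)
print(interval, ratio)  # 100 1.5, i.e. 1350 / [(900-100) + (1000-900)]
```

A ratio above 1 means the matched text has to be stretched to fill the unit, and a ratio below 1 means it has to be compressed.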

S3. Compare the current gear ratio with the set first gear ratio value and the set second gear ratio value, wherein the second gear ratio value is greater than the first gear ratio value.

After the current gear ratio of the current processing unit is determined through S2, the current gear ratio can be compared with the set first gear ratio value and second gear ratio value, so as to determine whether stretching or compressing the matched text by the current gear ratio meets the normal stretching/compression conditions.

This embodiment assumes that a gear ratio between the first gear ratio value and the second gear ratio value satisfies the stretching/compression conditions, that a gear ratio smaller than the first gear ratio value does not meet the compression condition, and that a gear ratio greater than the second gear ratio value does not meet the stretching condition.
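The three-way comparison of S3 can be sketched as below, with each outcome named after the step (S4, S5 or S6) that handles it. The application does not disclose concrete values for the two thresholds, so `FIRST = 0.5` and `SECOND = 2.0` are purely hypothetical, as are the `Decision` and `classify` names.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "S4"       # ratio within [first, second]: unit becomes an alignment unit
    PAD_SILENCE = "S5"  # ratio above the second value: stretching condition not met
    MERGE_NEXT = "S6"   # ratio below the first value: compression condition not met


def classify(ratio: float, first_value: float, second_value: float) -> Decision:
    """S3: compare the current gear ratio with the two set values."""
    if not first_value < second_value:
        raise ValueError("the second gear ratio value must exceed the first")
    if ratio > second_value:
        return Decision.PAD_SILENCE
    if ratio < first_value:
        return Decision.MERGE_NEXT
    return Decision.ACCEPT


FIRST, SECOND = 0.5, 2.0  # hypothetical thresholds, not taken from the application
for r in (0.3, 0.5, 1.5, 2.0, 2.5):
    print(r, classify(r, FIRST, SECOND).name)
```

Both boundaries are inclusive, matching the "greater than or equal to" and "less than or equal to" wording of S4.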

S4. If the current gear ratio is greater than or equal to the first gear ratio value and less than or equal to the second gear ratio value, determine the current processing unit as an alignment unit, record the current gear ratio as the gear ratio of the alignment unit, and then execute S7.

When the current gear ratio is greater than or equal to the first gear ratio value and less than or equal to the second gear ratio value, the current gear ratio of the current processing unit is considered to satisfy the normal stretching/compression conditions. In this case, the current processing unit can be directly regarded as an alignment unit, the current gear ratio is taken as the gear ratio of the alignment unit, and the process jumps to S7 to continue.

S5. If the current gear ratio is greater than the second gear ratio value, determine the silence duration used to fill the current processing unit, and determine a new current gear ratio according to the silence duration, and then execute S3.

When the current gear ratio is greater than the second gear ratio value, the current gear ratio of the current processing unit is considered not to meet the normal stretching condition. This is equivalent to the unit duration of the current processing unit being much longer than the actual pronunciation duration of all matched characters, so a silence duration needs to be added to the current processing unit to increase the actual pronunciation duration of the text.

In one embodiment, the added silence duration is the start-to-end duration of one character. A new duration is determined by combining the silence duration with the previously determined actual pronunciation duration, and the sum of the silence duration and the actual pronunciation duration is used as the denominator of the new current gear ratio. After that, the process returns to S3 to perform the gear ratio comparison again.
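The silence filling of S5 can be illustrated with the loop below, which keeps adding one silence duration to the denominator until the gear ratio no longer exceeds the second gear ratio value. It is a sketch under assumptions: the function name, the numeric values, and the choice to repeat the same silence duration on each pass are illustrative, and only the upper bound is checked here (the lower bound is still checked by S3 afterwards).

```python
from typing import Tuple


def pad_with_silence(unit_ms: int, actual_ms: int, silence_ms: int,
                     second_value: float) -> Tuple[float, int]:
    """S5 followed by the return to S3: add silence until the ratio is
    no greater than second_value. Returns (new ratio, total silence added)."""
    if silence_ms <= 0 or actual_ms <= 0:
        raise ValueError("durations must be positive")
    padding_ms = 0
    ratio = unit_ms / actual_ms
    while ratio > second_value:
        padding_ms += silence_ms
        # The silence joins the actual pronunciation duration in the denominator.
        ratio = unit_ms / (actual_ms + padding_ms)
    return ratio, padding_ms


# Hypothetical values: a 2000 ms unit holding 500 ms of actual pronunciation,
# padded with the 300 ms start-to-end duration of one character.
new_ratio, added_ms = pad_with_silence(2000, 500, 300, second_value=2.0)
print(round(new_ratio, 3), added_ms)  # 1.818 600
```

In this example the ratio goes from 4.0 to 2.5 after the first 300 ms of silence and to about 1.818 after the second, at which point it falls within the permitted range.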

S6. If the current gear ratio is smaller than the first gear ratio value, combine the current processing unit with the adjacent next candidate alignment unit to form a new current processing unit, and return to executing S2.

When the current gear ratio is smaller than the first gear ratio value, the current gear ratio of the current processing unit is considered not to meet the normal compression condition. This is equivalent to the unit duration of the current processing unit being much shorter than the actual pronunciation duration of all matched characters, so a candidate alignment unit needs to be merged into the current processing unit to form a new current processing unit and thereby increase the unit duration of the current processing unit.

In one embodiment, the candidate alignment unit to be merged is the next candidate alignment unit adjacent to the current processing unit. In this case, the unit duration of the newly formed current processing unit is the sum of the unit duration of the original unit and the unit duration of the next candidate alignment unit. After that, the process can return to S2 to recalculate the actual pronunciation duration of all characters matched in the newly formed current processing unit.

In this embodiment, once the next candidate alignment unit has been merged into the current processing unit, that next candidate alignment unit is also considered to have been selected. When S1 is performed subsequently, it is skipped during unit selection and is no longer selected on its own as the current processing unit.
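The merge of S6, including the bookkeeping that marks the merged unit as already selected, might look like the sketch below. The `Candidate` type and its fields are hypothetical, and keeping the serial number of the first unit for the merged unit is an assumption; the application does not say which serial number the merged unit carries.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Candidate:
    """Hypothetical record for a candidate alignment unit."""
    serial: int              # unit serial number
    duration_ms: int         # unit duration
    char_serials: List[int]  # serial numbers of the matched characters


def merge_with_next(current: Candidate, nxt: Candidate,
                    selected: Set[int]) -> Candidate:
    """S6: fold the adjacent next candidate unit into the current one."""
    selected.add(nxt.serial)  # S1 must not pick this unit again on its own
    return Candidate(
        serial=current.serial,  # assumption: the merged unit keeps the first serial
        duration_ms=current.duration_ms + nxt.duration_ms,
        char_serials=current.char_serials + nxt.char_serials,
    )


selected = {1}  # unit 1 is the current processing unit
merged = merge_with_next(Candidate(1, 150, [1]), Candidate(2, 500, [2]), selected)
print(merged.duration_ms, merged.char_serials, sorted(selected))  # 650 [1, 2] [1, 2]
```

After the merge, the gear ratio has to be recomputed from S2, because both the unit duration and the set of matched characters have changed.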

S7. Determine whether all candidate alignment units have been selected to participate in processing. If all candidate alignment units have been selected, execute S8; if any candidate alignment unit has not been selected, return to S1.

After an alignment unit is determined through the above steps, there may still be unselected candidate alignment units in the to-be-aligned rhythm segment, which can be checked in S7. If all the candidate alignment units have been selected to participate in the above processing, S8 can be executed. If any candidate alignment unit has not been selected, it is necessary to return to S1 and select another unselected candidate alignment unit, so that the above operations are performed in a loop.

S8. Summarize each of the determined alignment units and the corresponding gear ratios.

Each of the alignment units determined above and its corresponding gear ratio may be aggregated to obtain the at least one alignment unit included in the rhythm segment to be aligned and the gear ratio of each alignment unit.

This optional embodiment provides an implementation process for determining the alignment units in the rhythm segment to be aligned and the corresponding gear ratios. Through the execution of this optional embodiment, effective alignment between the rhythm points in the rhythm segment to be aligned and the text in the speech segment can be ensured, which avoids misalignment between the spoken text and the music rhythm points and thereby provides effective theoretical support for the conversion of speech to rap music in this embodiment.
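Putting S1 to S8 together, the whole procedure can be sketched as a single loop. This is an illustrative reading of the steps, not the applicant's implementation. The names, thresholds and timings are hypothetical, and several points the text leaves open are filled with labeled assumptions: the vowel interval is taken as 0 when there is no following unit, a unit with nothing left to merge is accepted as is, the silence added in S5 is the duration of the last matched character, and silence added before a merge is discarded when S2 recomputes the duration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Char:
    start_ms: int
    end_ms: int
    first_vowel_ms: int


@dataclass
class Candidate:
    duration_ms: int   # unit duration
    chars: List[Char]  # characters matched to this candidate unit


def actual_ms(chars: List[Char], next_chars: List[Char]) -> int:
    """S21 + S22: occupied duration plus the next unit's vowel interval."""
    occupied = chars[-1].end_ms - chars[0].first_vowel_ms
    # Assumption: with no following unit, the vowel interval is 0.
    interval = (next_chars[0].first_vowel_ms - next_chars[0].start_ms
                if next_chars else 0)
    return occupied + interval


def determine_units(cands: List[Candidate], low: float,
                    high: float) -> List[Tuple[List[int], float]]:
    """Return (indices of merged candidates, gear ratio) per alignment unit."""
    result = []
    i = 0
    while i < len(cands):                        # S1 and S7
        members = [i]
        duration = cands[i].duration_ms
        chars = list(cands[i].chars)
        while True:
            nxt = members[-1] + 1
            next_chars = cands[nxt].chars if nxt < len(cands) else []
            base = actual_ms(chars, next_chars)  # S2
            ratio = duration / base
            # S5 (assumption: silence equals the last character's duration).
            silence = max(1, chars[-1].end_ms - chars[-1].start_ms)
            padding = 0
            while ratio > high:
                padding += silence
                ratio = duration / (base + padding)
            # S4, or nothing left to merge (assumption: accept as is).
            if ratio >= low or nxt >= len(cands):
                break
            members.append(nxt)                  # S6
            duration += cands[nxt].duration_ms
            chars += cands[nxt].chars
        result.append((members, ratio))
        i = members[-1] + 1
    return result                                # S8


c0, c1, c2 = Char(0, 400, 100), Char(400, 900, 500), Char(900, 1300, 1000)
cands = [Candidate(150, [c0]), Candidate(500, [c1]), Candidate(1500, [c2])]
units = determine_units(cands, low=0.5, high=2.0)
for members, ratio in units:
    print(members, round(ratio, 3))
```

With these sample values the first candidate is too short (ratio 0.375) and is merged with the second, giving a unit with ratio 650/900; the third candidate is too long (ratio 5.0) and receives two silence paddings, giving 1500/1100.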

Embodiment 3

FIG. 6 is a structural block diagram of an apparatus for converting speech into rap music according to Embodiment 3 of the present application. The apparatus is suitable for converting voice recorded by a user into rap music. The apparatus can be implemented by software or hardware, and can generally be integrated on a computer device. As shown in FIG. 6, the apparatus includes: an information determination module 31, an alignment information determination module 32 and a conversion control module 33.

The information determination module 31 is configured to recognize the speech segment and process the background music, and to obtain the text attribute information of the text in the speech segment and the music rhythm information of the background music. The alignment information determination module 32 is configured to determine, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and to obtain an alignment information table for each alignment period. The conversion control module 33 is configured to control, according to the alignment information table of the at least one alignment period, the alignment of the text in the speech segment with the rhythm points in the background music to obtain aligned audio, and to form rap audio after performing pitch adjustment and special effect processing on the aligned audio.

The apparatus for converting voice into rap music provided by the third embodiment of the present application effectively converts voice content clips randomly recorded by users into voice clips matched with background music, simplifies the tedious process of manual audio editing, and makes rap music production possible for non-professional audio processing personnel. At the same time, the above technical solution does not need to restrict the voice content to be converted, which ensures free recording of the voice content to be converted, simplifies the implementation of voice conversion, avoids misalignment between the spoken text and the music rhythm points, and broadens the application scope of converting voice into rap music.

Embodiment 4

FIG. 7 is a schematic diagram of the hardware structure of a computer device according to Embodiment 4 of the present application, where the computer device includes: a processor and a storage device. At least one instruction is stored in the storage device, and the instruction is executed by the processor, so that the computer device executes the method for converting speech into rap music according to the above method embodiments.

As shown in FIG. 7, the computer device may include: a processor 40, a storage device 41, a display screen 42, an input device 43, an output device 44 and a communication device 45. The number of processors 40 in the computer device may be one or more, and one processor 40 is taken as an example in FIG. 7. The number of storage devices 41 in the computer device may be one or more, and one storage device 41 is taken as an example in FIG. 7. The processor 40, storage device 41, display screen 42, input device 43, output device 44 and communication device 45 of the computer device may be connected by a bus or in other ways. In FIG. 7, connection by a bus is taken as an example.

In this embodiment, when the processor 40 executes one or more programs stored in the storage device 41, the following operations are implemented: recognizing the speech segment and processing the background music, and obtaining the text attribute information of the text in the speech segment and the music rhythm information of the background music; according to the text attribute information and the music rhythm information, determining at least one alignment period for aligning the speech segment with the background music, and obtaining an alignment information table for each alignment period; and according to the alignment information table of the at least one alignment period, controlling the text in the speech segment to be aligned with the rhythm points in the background music to obtain aligned audio, and forming rap audio after performing pitch adjustment and special effect processing on the aligned audio.

Embodiments of the present application further provide a computer-readable storage medium, where a program in the storage medium, when executed by a processor of a computer device, enables the computer device to execute the method for converting speech into rap music described in the foregoing embodiments. Exemplarily, the method for converting speech into rap music described in the above embodiments includes: recognizing the speech segment and processing the background music, and obtaining the text attribute information of the text in the speech segment and the music rhythm information of the background music; according to the text attribute information and the music rhythm information, determining at least one alignment period for aligning the speech segment with the background music, and obtaining an alignment information table for each alignment period; and according to the alignment information table of the at least one alignment period, controlling the text in the speech segment to be aligned with the rhythm points in the background music to obtain aligned audio, and forming rap audio after performing pitch adjustment and special effect processing on the aligned audio.

For the embodiments of the apparatus, computer equipment, and storage medium, since they are basically similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for related parts.

Claims

A method of converting speech into rap music, comprising:

Identifying speech segments and processing background music, and obtaining text attribute information of the text in the speech segment and music rhythm information of the background music;

According to the text attribute information and the music rhythm information, determine at least one alignment period for aligning the speech segment with the background music, and obtain an alignment information table for each alignment period;

Control the text in the speech segment to align with the rhythm points in the background music according to the alignment information table of the at least one alignment period to obtain aligned audio, and form rap audio after performing pitch adjustment and special effect processing on the aligned audio.
The method according to claim 1, wherein the identifying the speech segment and processing the background music to obtain the text attribute information of the text in the speech segment and the music rhythm information of the background music comprises:

Perform noise reduction processing and endpoint detection processing on the voice segment selected by the user, and obtain, through speech recognition of the processed voice segment, the character serial number of each character among the multiple characters in the processed voice segment, the start and end times of each character, the start position of the first vowel of each character, and the total amount of characters of the multiple characters, which together constitute the text attribute information of the speech segment;

Perform rhythm point detection and beat period division on the background music selected by the user, and determine the total amount of rhythm points of the multiple rhythm points included in the background music, the rhythm point serial number of each rhythm point, and the period information of each beat period among the multiple beat periods included in the background music, which together constitute the music rhythm information of the background music;

Wherein, the period information includes: the period number of each beat period, the number of rhythm points included in each beat period, and the rhythm point serial number and rhythm point start time of each rhythm point included in each beat period.
The method according to claim 2, wherein, before the determining the total amount of rhythm points of the multiple rhythm points included in the background music, the rhythm point serial number of each rhythm point, and the period information of each beat period among the multiple beat periods included in the background music to constitute the music rhythm information of the background music, the method further comprises:

Obtain a plurality of initial rhythm points detected from the background music, and determine the interval duration formed between two adjacent initial rhythm points;

Determine the to-be-deleted rhythm points among the plurality of initial rhythm points and delete the to-be-deleted rhythm points to obtain effective rhythm points in the background music.
The method according to claim 2, wherein the determining, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and obtaining the alignment information table of each alignment period comprises:

According to the total amount of text in the text attribute information and the period information of each beat period in the music rhythm information, determine the at least one alignment period for aligning the speech segment with the background music;

Select a complete alignment period as the to-be-aligned rhythm segment, and determine at least one alignment unit and the alignment unit information corresponding to each alignment unit according to the text attribute information and the rhythm point information of the to-be-aligned rhythm points in the to-be-aligned rhythm segment;

Summarize the alignment unit information corresponding to the at least one alignment unit to form an alignment information table of the to-be-aligned rhythm segment, and determine, according to the alignment information table, the alignment information tables of the alignment periods other than the one complete alignment period in the at least one alignment period.
The method according to claim 4, wherein the determining, according to the total amount of text in the text attribute information and the period information of each beat period in the music rhythm information, the at least one alignment period for aligning the speech segment with the background music comprises:

Determine whether the number of rhythm points in the period information corresponding to one complete beat period in the background music is greater than or equal to the total amount of text;

In response to the number of rhythm points in the period information corresponding to the one complete beat period being greater than or equal to the total amount of text, each beat period is regarded as an alignment period;

In response to the number of rhythm points in the period information corresponding to the one complete beat period being less than the total amount of text, in the case that the number of beat periods included in the background music is greater than 1, merge the beat periods pairwise in the order of the period numbers to form at least one new beat period, and return to determining whether the number of rhythm points in the period information corresponding to one complete new beat period in the music rhythm information is greater than or equal to the total amount of text.
The method according to claim 4, wherein the determining, according to the alignment information table, the alignment information table of the alignment periods other than the one complete alignment period in the at least one alignment period comprises:

For each alignment period except the one complete alignment period in the at least one alignment period, in the case that each alignment period is a complete period, the alignment information table of the to-be-aligned rhythm segment is used as the alignment information table of each alignment cycle;

In the case that each alignment period is an incomplete period, and one row in the alignment information table of the to-be-aligned rhythm segment corresponds to one alignment unit, determine the number of rhythm points included in each alignment period; from the alignment information table of the to-be-aligned rhythm segment, select in reverse order the rows of alignment unit information whose number equals the number of rhythm points of each alignment period, and use the selected rows of alignment unit information as the alignment information table of each alignment period.
The method according to claim 4, wherein the determining at least one alignment unit and the alignment unit information corresponding to each alignment unit according to the text attribute information and the rhythm point information of the to-be-aligned rhythm points in the to-be-aligned rhythm segment comprises:

Based on the rhythm point information of a plurality of to-be-aligned rhythm points in the to-be-aligned rhythm segment, form a plurality of to-be-aligned rhythm blocks corresponding to the plurality of to-be-aligned rhythm points, and record the number of the plurality of to-be-aligned rhythm points as the initial number of remaining points;

Determine the ratio of the number of remaining points to the total amount of text in the text attribute information, and record the ratio as an alignment matching value;

Find a preset rhythm point-text alignment rule table, and determine the length ratio range where the alignment matching value is located in the rhythm point-text alignment rule table;

According to the alignment rule corresponding to the length ratio range, determine the to-be-aligned rhythm blocks that have matching text, and record each to-be-aligned rhythm block that has matching text as a candidate alignment unit;

Count the number of to-be-aligned rhythm blocks that are not recorded as candidate alignment units, take this number of blocks as the new number of remaining points, and return to the operation of determining the ratio of the number of remaining points to the total amount of text in the text attribute information and recording the ratio as the alignment matching value, until the final number of remaining points is 0;

According to the unit duration of each candidate aligning unit and the matching text attribute information of the text matched by each candidate aligning unit, determine the at least one aligning unit and obtain the gear ratio corresponding to each aligning unit;

The unit serial number of each aligning unit, the starting rhythm point serial number in the included rhythm points, the character serial number of the matched characters and the gear ratio are determined as the aligning unit information corresponding to each aligning unit.
The method according to claim 7, wherein the determining the at least one alignment unit and obtaining the gear ratio corresponding to each alignment unit according to the unit duration of each candidate alignment unit and the matching text attribute information of the text matched by each candidate alignment unit comprises:

Select an unselected candidate alignment unit as the current processing unit in the order of unit serial numbers;

According to the unit duration of the current processing unit, in combination with the start and end times and the first-vowel start positions of the characters matched respectively by the current processing unit and by the adjacent next candidate alignment unit, determine the current gear ratio of the current processing unit;

Compare the current gear ratio with a set first gear ratio value and a set second gear ratio value, wherein the second gear ratio value is greater than the first gear ratio value;

In the case that the current gear ratio is greater than or equal to the first gear ratio value and less than or equal to the second gear ratio value, determine the current processing unit as an alignment unit, and record the current gear ratio as the gear ratio of the alignment unit; determine whether all candidate alignment units have been selected to participate in processing; in response to all candidate alignment units having been selected to participate in processing, summarize each determined alignment unit and the gear ratio corresponding to each alignment unit; in response to the existence, among all the candidate alignment units, of a candidate alignment unit that has not been selected to participate in processing, return to the operation of selecting an unselected candidate alignment unit as the current processing unit in the order of unit serial numbers;

In the case that the current gear ratio is greater than the second gear ratio value, determine the silence duration for filling the current processing unit, determine a new current gear ratio according to the silence duration, and return to the operation of comparing the current gear ratio with the set first gear ratio value and the set second gear ratio value;

In the case that the current gear ratio is smaller than the first gear ratio value, merge the current processing unit with the adjacent next candidate alignment unit to form a new current processing unit, and return to the operation of determining the current gear ratio of the current processing unit according to the unit duration of the current processing unit, in combination with the start and end times and the first-vowel start positions of the characters matched respectively by the current processing unit and by the adjacent next candidate alignment unit.
The method according to claim 8, wherein the determining the current gear ratio of the current processing unit according to the unit duration of the current processing unit, in combination with the start and end times and the first-vowel start positions of the characters matched respectively by the current processing unit and by the adjacent next candidate alignment unit, comprises:

According to the start and end times of each character among all the characters matched by the current processing unit and the start position of the first vowel of each character, determine the pronunciation occupying duration of all the matched characters in the current processing unit;

According to the start and end times of the first character matched by the adjacent next candidate alignment unit of the current processing unit and the start position of the first vowel of that character, determine the vowel interval duration of the first character;

Take the ratio of the unit duration of the current processing unit to the determined actual pronunciation duration as the current gear ratio of the current processing unit, wherein the actual pronunciation duration is the sum of the pronunciation occupying duration and the vowel interval duration.
The method according to any one of claims 1-9, wherein, before the determining, according to the text attribute information and the music rhythm information, at least one alignment period for aligning the speech segment with the background music, and obtaining the alignment information table of each alignment period, the method further comprises:

In the case where the total amount of text in the text attribute information is greater than the total amount of rhythm points in the music rhythm information, end the process of converting the speech segment into rap music, and output a prompt to re-acquire the speech segment or the background music.
A device for converting speech into rap music, comprising:

an information determination module, configured to recognize a speech segment and process background music, and obtain text attribute information of the text in the speech segment and music rhythm information of the background music;

an alignment information determination module, configured to determine at least one alignment period for aligning the speech segment with the background music according to the text attribute information and the music rhythm information, and obtain an alignment information table for each alignment period ;

a conversion control module, configured to control the text in the speech segment to align with the rhythm points in the background music according to the alignment information table of the at least one alignment period to obtain aligned audio, and to form rap audio after performing pitch adjustment and special effect processing on the aligned audio.
A computer device comprising:

at least one processor;

a storage device configured to store at least one program;

The at least one program is executed by the at least one processor such that the at least one processor implements the method of converting speech into rap music as claimed in any one of claims 1-10.
A computer-readable storage medium storing a computer program, wherein when the program is executed by a processor, the method for converting speech into rap music according to any one of claims 1-10 is implemented.